Community Engagement Insights/Sampling strategy

This page is currently a draft. More information pertaining to this may be available on the talk page.

This sampling strategy document sets out how we will reach out to audiences for CE Insights. It should describe very clearly who falls into each audience and how we will reach them, and it should highlight any sampling bias or other limitations we are aware of, so that they can be stated when reporting results.

Each audience section below should include the following elements:

  • Very specific description of the audience, including behavior or attributes that describe the audience.
  • Sampling frame (from where/how will we identify these people)
  • Sampling method
  • Estimated size of the population
  • Target sample size/sampling rate
  • Limitation of sampling method


(needs text)

Very Active Editors

Audience & sampling frame

A very active editor, as defined in MediaWiki[1], is a user who makes at least 100 edits per month on any single Wikimedia project. For this project, we adjust the definition so that a query can identify a sample of users: a very active editor is a Wikimedia contributor who has made at least 600 edits across all projects during the last 12 months (Dec 2015–Nov 2016), while having at least some activity in the past 6 months.

This corresponds to being very active (100+ edits per month) for 6 months, although we are not making this a strict requirement: a user could have made all 600 edits during a single month and still fall into this bucket. We chose to look across 12 months because an editor’s activity normally varies over time; for example, of one month’s highly active editors who are not new users, only 50% will be highly active the next month. We chose 6 months because it covers half of the year and will help to control for seasonality.

We estimate that this population is about 25,000 editors. We aim to invite x editors to take the survey, drawn as a random sample from the list of editors who meet the requirements stated above. The list will be generated with the following SQL query:

    select
        user_name,
        # wiki relies on the DB system taking the first value in each group
        wiki as home_wiki
    from (
        select
            user_name,
            wiki,
            # special case to avoid confusing choices of Wikidata as home wiki
            if(wiki = 'wikidatawiki', sum(edits) * 0.1, sum(edits)) as edits
        from staging.editor_month
        where
            month >= '2015-12-01' and month < '2016-12-01' and
            local_user_id != 0 and
            bot = 0
        group by user_name, wiki
        order by user_name asc, edits desc
    ) user_wiki
    group by user_name
    having sum(edits) >= 600;
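Once the query has produced the list of eligible editors, the random draw itself could be sketched as follows. This is a minimal illustration in Python, not the script used for the survey: the function name `draw_sample`, the example rows, the sample size, and the fixed seed are all assumptions made for the example.

```python
import random


def draw_sample(eligible_users, sample_size, seed=2016):
    """Return a simple random sample (without replacement) of eligible users.

    The seed value is a hypothetical choice; fixing it makes the draw
    reproducible, so the invited list can be regenerated later.
    """
    if sample_size > len(eligible_users):
        raise ValueError("sample_size exceeds the number of eligible users")
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order of the input rows.
    return rng.sample(sorted(eligible_users), sample_size)


# Hypothetical (user_name, home_wiki) rows standing in for the query output.
eligible = [
    ("UserA", "enwiki"),
    ("UserB", "dewiki"),
    ("UserC", "frwiki"),
    ("UserD", "commonswiki"),
]
invited = draw_sample(eligible, sample_size=2)
```

Keeping the seed alongside the query output means the same invited list can be rebuilt when responses are later checked against the original sample.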

Invitation distribution

To reach very active editors, we will use mass messaging to post messages on their talk pages. Because we aggregate contributions across all wikis, we must determine which wiki to message each user on (their home wiki). A user’s home wiki is the one where they made the greatest number of edits over the 12-month period, with the caveat that each Wikidata edit is treated as one tenth of an edit. This accounts for the fact that it is very easy to make large numbers of Wikidata edits, which would otherwise lead to some users being messaged on Wikidata when it is not the wiki where they spend the most time. (This would not be a serious problem, since cross-wiki notifications mean that these users would still see the message, but the adjustment helps to avoid confusion and makes the message more immediate.)
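The home-wiki rule can be illustrated with a short Python sketch. It assumes per-user, per-wiki edit counts are already available as tuples; the function name `home_wikis`, the example data, and the alphabetical tie-break are assumptions of this sketch rather than part of the survey tooling.

```python
from collections import defaultdict

WIKIDATA_WEIGHT = 0.1  # each Wikidata edit counts as one tenth of an edit


def home_wikis(rows):
    """Map each user to the wiki where they made the most weighted edits.

    rows: iterable of (user_name, wiki, edits) tuples covering the
    12-month window.
    """
    weighted = defaultdict(dict)
    for user, wiki, edits in rows:
        weight = WIKIDATA_WEIGHT if wiki == "wikidatawiki" else 1.0
        weighted[user][wiki] = weighted[user].get(wiki, 0) + edits * weight
    # Ties go to the alphabetically first wiki (an arbitrary choice made here).
    return {
        user: max(sorted(per_wiki), key=per_wiki.get)
        for user, per_wiki in weighted.items()
    }


# Hypothetical edit counts.
rows = [
    ("UserA", "wikidatawiki", 5000),  # weighted: 500
    ("UserA", "enwiki", 700),         # weighted: 700
    ("UserB", "wikidatawiki", 900),   # weighted: 90
    ("UserB", "dewiki", 50),          # weighted: 50
]
result = home_wikis(rows)
```

In this example UserA is messaged on enwiki despite having far more raw Wikidata edits, while UserB is still messaged on Wikidata because the weighted Wikidata count remains the largest.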

This method was determined based on feedback from audience advisors, individuals at the Foundation who are experts in these audience types. The survey responses for this group will be anonymized.


A limitation of this method is that mass messaging puts an open link on talk pages that anyone can click, regardless of whether they are part of the sample. There are a few ways we might mitigate this:

  • At the beginning or at the end of the survey, we add a separate survey collector in which respondents are asked: "What is your username? We will not be linking your username to your responses; we simply want to keep a record of who is taking this survey." We can use this to help us remove respondents who were not in the original sample.
  • A less accurate and less aggressive approach would be to ask a screening question that helps us identify respondents who were not originally asked to take the survey. Something like the following: "We'd like to know how people found their way to this survey. Where did you find this survey?" with the answer options "On my talk page", "On another user's talk page", and "Somewhere else (please specify)".
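The username-based mitigation could be applied roughly as in the Python sketch below. It assumes that each record from the collector carries the username the respondent typed, which the text above leaves open; the record layout and the function names `normalize` and `screen_responses` are hypothetical.

```python
def normalize(username):
    """Normalize a typed username before comparing it with the invited list.

    Assumed rules: surrounding whitespace is ignored, underscores are
    equivalent to spaces, and the first letter is uppercase.
    """
    name = username.strip().replace("_", " ")
    return name[:1].upper() + name[1:]


def screen_responses(responses, invited_usernames):
    """Split records into those from invited users and all others."""
    invited = {normalize(name) for name in invited_usernames}
    kept, removed = [], []
    for response in responses:
        target = kept if normalize(response["username"]) in invited else removed
        target.append(response)
    return kept, removed


# Hypothetical collector records and invited list.
responses = [
    {"id": 1, "username": "userA "},
    {"id": 2, "username": "Somebody_else"},
    {"id": 3, "username": "User B"},
]
kept, removed = screen_responses(responses, ["UserA", "User_B"])
```

Here records 1 and 3 are kept and record 2 is set aside. Records that are set aside could be reviewed by hand before being dropped, since respondents may mistype their usernames.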

Active Editors

Audience & sampling frame

Active editors are users who make between 5 and 100 edits per month on any individual Wikimedia project. There are a total of about X active editors. We aim to invite x editors to take the survey. We will identify active editors using the CentralNotice banner system (see details in the next section), which randomly assigns users to the study by displaying banners.

Sending survey invitations

Active editors were sampled with a method similar to the one used for very active editors, using mass messaging. The main reason we did not use CentralNotice is that the tool does not support a sound sampling methodology. First, it is not possible to message a specified list of users. Second, it is not possible to accurately target recently active editors. CentralNotice sampling currently uses the following equation to calculate how active an editor is:

Total number of lifetime edits / Total number of months since account creation = edits per month

This equation does not take into account changes in editing behavior over time and does not present an accurate way of targeting users based on their activity in the previous 12 months. This type of data is essential for this survey in order to have a representative sample for longitudinal analyses (e.g. year over year).
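A small worked example in Python shows how the lifetime average can misclassify editors. The two monthly edit histories are invented for illustration, and the function names are hypothetical.

```python
def lifetime_edits_per_month(total_edits, months_since_creation):
    """Lifetime average: total edits divided by account age in months."""
    return total_edits / max(months_since_creation, 1)


def recent_edits_per_month(monthly_edits, window=12):
    """Average over only the most recent `window` months."""
    recent = monthly_edits[-window:]
    return sum(recent) / len(recent) if recent else 0.0


# Hypothetical editor: 200 edits a month for 4 years, then inactive for a year.
lapsed = [200] * 48 + [0] * 12
# Hypothetical editor: inactive for 4 years, then 150 edits a month for a year.
returning = [0] * 48 + [150] * 12

lapsed_lifetime = lifetime_edits_per_month(sum(lapsed), len(lapsed))           # 160.0
lapsed_recent = recent_edits_per_month(lapsed)                                 # 0.0
returning_lifetime = lifetime_edits_per_month(sum(returning), len(returning))  # 30.0
returning_recent = recent_edits_per_month(returning)                           # 150.0
```

By the lifetime average, the lapsed editor looks very active (160 edits per month) and the returning editor looks only moderately active (30 edits per month), although their activity over the previous 12 months is the reverse.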


The largest limitation of using this method is the introduction of sampling bias. The CentralNotice system does not randomly choose a sample of users from a pool and then follow up with those same users. Instead, the banner displays the survey once every 5 pageviews for a logged-in user.
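The direction of this bias can be illustrated with a simplified model in Python. The model assumes that each pageview by a logged-in user independently shows the banner with probability 1/5; this is an approximation of "once every 5 pageviews" chosen for the example, not a description of how CentralNotice schedules banners.

```python
def exposure_probability(pageviews, display_rate=0.2):
    """Probability of seeing the banner at least once in `pageviews` views.

    Assumes independent pageviews, each showing the banner with
    probability `display_rate` (a simplifying assumption).
    """
    return 1 - (1 - display_rate) ** pageviews


# Compare users who view few pages with users who view many.
for views in (1, 5, 20, 100):
    print(views, round(exposure_probability(views), 3))
```

Under this model a user with a single pageview has a 20% chance of being invited, while a user with 20 pageviews is almost certain to be invited, so frequent visitors are over-represented among respondents.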

We are also unable to use exact edit counts. We will only be able to approximate which users make fewer than 100 edits per month.

Technical contributors/Volunteer Developers

Audience & sampling frame

Technical contributors and volunteer developers are Wikimedia users who contribute specifically to the improvement of MediaWiki software. They might test software, submit bug reports about software issues, or even write code that becomes part of the MediaWiki code base. Volunteer developers cannot be randomly sampled because there is no single space from which we can draw a random sample of the group. Instead, CE Insights will employ convenience and snowball sampling methods, using known people and spaces to promote the survey. Since this audience is relatively new, we must gather essential demographic information, such as how many years they have been part of the movement, age, and

Sending survey invitations

Invitations to take the survey will be sent in the following ways:

  • Mailing lists: Wikitech-l, Labs-l (not yet ready), AI-l
  • Emails: Tool labs users, Others
  • Wikimedia projects: MediaWiki banner, Other?
  • Phabricator: Message on message board?
  • Social media contacts: Aaron Halfaker, Manuel Manske



The main limitation of this method is that it uses snowball sampling methods, which have heavy sampling bias. To control for this, we are making sure to include questions that describe the audience we want to capture, specifically what type of technical contributions they provide and how many years they have contributed to the projects.

Affiliates & Program Leaders

Audience & sampling frame

Affiliates and program leaders are grouped together because their work consists largely of outreach to volunteers and the general public about Wikimedia. In this way, they frequently share similar goals. Affiliates are people who sit on an organization's board or are hired as staff. This does not include ordinary members of an affiliate, for example someone who only pays monthly dues. Program leaders are people who run programs. They carry out projects like conferences, edit-a-thons, and other activities.

We do not have exact numbers of program leaders. Based on the most recent evaluation report, we were able to find about (XX) program leaders in the movement. We aim to reach at least X number of program leaders.

There are approximately 100 Wikimedia affiliates, including 40 Chapters, 1 Thematic Organization, and 64 User Groups. We hope to hear from all affiliates in the movement for this survey.

We will identify program leaders and affiliates using our existing databases and resources. You can find this list here.

Sending survey invitations

We will send survey invitations primarily through emails. We will also use social media, mailing lists, and other venues to reach out to program leaders especially.

  • Mailing lists: Wikimedia-ped, Education-l, Collab, GLAM-l, TWL-l?, Others?
  • Emails
  • Wikimedia projects: Outreach wiki banner?
  • Facebook: Program Evaluation & Design, Education, TWL, GLAM
  • Twitter: WikiEval, TWL?, Jake?, Others?

Non-specific Audiences

Audience & sampling frame description

Non-specific audiences will be a subset of the above four audiences, using the invitations stated. The data gathered through this set of questions will be used to help make decisions across the 5 groups. Thus, for the non-specific audience questions, we must specify a sampling target for each audience (or for all audiences?)

Sending survey invitations

From each audience, we will automatically route respondents to subsections of the non-specific audience section, based on the audience they come from. We are doing this to help reduce survey burden on individual users. We will aim to fill the following quotas from each audience for the non-specific audience:

All-Audience Demographics

The following data elements will be collected from all respondents. These data will help us learn more about the questions we are asking, as well as understand more about our sampling methods.

Indicator, followed by how we will collect it:

  • Gender: Question item in the survey
  • Geography: Question item in the survey
  • Wikimedia Project: Question item in the survey
  • Years active: Question item in the survey
  • Survey source (e.g. projects, social media, mailing list): Embedded data from Qualtrics settings
  • Audience: Embedded data from Qualtrics link