Research:Editor-data policy proposal
As part of the WMF research policy, we will ask researchers who receive significant support from the Foundation to publicly release the data they produce in the course of their research under an open license. This applies in particular to projects that receive the following kinds of support from the Foundation:
- technical support for data collection
- special API permissions
- hosting or direct financial support
- institutional endorsement
- subject recruitment (with some restrictions)
This implies that datasets referring to individual usernames (e.g. lists of editors ranked by a given criterion) fall into the category of derivative content extracted from publicly logged user contributions. Data obtained by processing the XML dumps or by querying the Wikimedia API with standard user access privileges (i.e. excluding nonpublic data) fall under the above definition. As subsets of the log of contributor activity that Wikimedia hosts in publicly accessible databases, these data can be freely copied, quoted, reused and adapted by third parties.
One solution to meet these concerns would be to publish research data in aggregate form only, but this defeats the very purpose of making data openly available to promote further research. Publishing the data in raw but anonymized form, on the other hand, is pointless, because public data cannot be effectively anonymized: the original logs remain publicly accessible and can be matched against the anonymized records.
We therefore recommend that any raw research dataset derived from publicly available logs of editor activity (collected via tools such as the toolserver, wikilytics, or any third-party script that queries the API with standard privileges or processes the XML dumps) be published without anonymization. We would like this recommendation to also cover excerpts of discussions from user talk pages, article talk pages or policy-related discussions that researchers may quote verbatim as part of datasets for qualitative research.
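The point about anonymization can be illustrated with a minimal sketch. It assumes a research dataset in which usernames have been replaced by pseudonyms while page titles and edit timestamps were kept, and it uses entirely hypothetical toy records; the function name `reidentify` and the record layout are illustrative assumptions, not part of any Wikimedia tool. Under those assumptions, joining the "anonymized" rows against the public log on page and timestamp is enough to recover the usernames.

```python
from collections import defaultdict


def reidentify(public_log, anonymized_rows):
    """Return, for each pseudonym, the usernames consistent with all its edits.

    Both inputs are iterables of (name, page, timestamp) tuples. This layout
    is an assumption made for the sketch, not a real dump or API format.
    """
    # Index the public log by (page, timestamp).
    index = defaultdict(set)
    for username, page, timestamp in public_log:
        index[(page, timestamp)].add(username)

    # Intersect the candidate usernames across every edit of a pseudonym.
    candidates = {}
    for pseudonym, page, timestamp in anonymized_rows:
        matches = index.get((page, timestamp), set())
        if pseudonym in candidates:
            candidates[pseudonym] &= matches
        else:
            candidates[pseudonym] = set(matches)
    return candidates


# Hypothetical toy data: a public edit log and an "anonymized" research dataset.
PUBLIC_LOG = [
    ("ExampleUserA", "Page one", "2011-03-01T10:15:00Z"),
    ("ExampleUserB", "Page one", "2011-03-01T11:40:00Z"),
    ("ExampleUserA", "Page two", "2011-03-02T09:05:00Z"),
]
ANONYMIZED = [
    ("editor_17", "Page one", "2011-03-01T10:15:00Z"),
    ("editor_17", "Page two", "2011-03-02T09:05:00Z"),
    ("editor_42", "Page one", "2011-03-01T11:40:00Z"),
]

result = reidentify(PUBLIC_LOG, ANONYMIZED)
```

In this toy case every pseudonym resolves to exactly one username. Coarsening timestamps or dropping page titles would make the join harder, but it would also remove much of what makes the raw data useful for further research.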