WSoR datasets
This page lists datasets created during the Wikimedia Foundation Summer of Research (WSoR) 2011 that are likely to be of use in the future.
Please note: we are still working out the most viable long-term solution for publishing these datasets, in collaboration with Dario Taraborelli and others. If you have questions or comments about any of the data, please feel free to ask on the Talk page.
Add a dataset
To create a new dataset page:
- Use this form to create the dataset page. Make sure to give your dataset a name.
- Add your dataset to the list below, e.g.:
{{WSoR_dataset|Dataset name|Short description}}
Datasets
- policy_counts
Yearly contribution counts to selected pages in project namespaces.
- bot
A curated table of bot user_ids, useful for flagging and removing bots during analysis.
- user_year_month_namespace
An aggregation of user activity by namespace and month, useful for visualizing months of editor activity.
- user_cohort
A list of users who made at least one edit, with the dates of their first and last edits included; useful for grouping editors into cohorts.
- rev_len_changed
An approximate diff size for each revision, useful for approximating the amount of content an editor has added.
- user_first_msg
The first edit to a user's talk page, with metadata including message type and automated tool used.
- user_activity_first_msg
A summary of editor activity before and after they receive their first message.
- user_approx_registration
An approximate registration date for editors who started editing before registration dates were recorded.
- revert
Reverting revisions, with a field for whether the revert was for vandalism (a guess based on a RegExp match on the comment).
- reverted
Reverted revisions, with information about the reverted edit and whether the revert was for vandalism (a guess based on a RegExp match on the comment).
- revision_diff
The optimal diff information for all revisions from the April 2011 XML database dump.
- trending_articles
A list of revisions of trending articles within the time period when they were trending, for 5 months.
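The revert and reverted datasets above flag vandalism by matching a regular expression against the revert's edit comment. The exact expression is not given on this page, so the following is only a minimal sketch of the idea: the pattern and the function name `is_vandalism_revert` are hypothetical and chosen for illustration.

```python
import re

# Hypothetical pattern: the expression actually used to build the
# revert and reverted datasets is not documented on this page.
VANDALISM_RE = re.compile(r"\b(vandal\w*|rvv)\b", re.IGNORECASE)


def is_vandalism_revert(comment):
    """Guess from the edit comment whether a revert was for vandalism."""
    # Missing or empty comments carry no signal, so they count as "not vandalism".
    return bool(comment) and VANDALISM_RE.search(comment) is not None


if __name__ == "__main__":
    for text in ["Reverted vandalism by 192.0.2.1", "rvv", "rv unsourced claim", ""]:
        print(repr(text), is_vandalism_revert(text))
```

A comment-based guess like this is cheap to compute over a full revision history, but it misses vandalism reverts with uninformative comments, which is why the field is described as a guess.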
WikiProject datasets:
- categorylinks
Categorylinks related to WikiProjects.
- categorylinks_wp
Links category names with their respective WikiProjects.
- wikiproject_pages
List of WikiProjects and their main page id.
- wikiprojects_member_pages
List of WikiProject member pages.
- wp_claimed_pages
List of pages claimed by each WikiProject and the template(s) used to claim them.
- wp_template_links
List of templates that link to the WikiProjects in the table wikiproject_pages.