User:Neil Shah-Quinn (WMF)/Data portal draft
There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available. If you have any questions, you might find the answer in the Frequently Asked Questions about Data.
If you wish to browse pre-computed metrics and dashboards, see statistics.
If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.
If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.
Data Dumps (details)
Dumps of all WMF projects for backup, offline use, research, etc.
The API provides direct, high-level access to the data contained in MediaWiki databases through HTTP requests to the web service.
Toolforge allows you to connect to shared server resources and query a copy of the database (with some lag).
Recent changes stream (details)
Wikimedia broadcasts every change to every Wikimedia wiki using the Socket.IO protocol.
Analytics Dumps (details)
Raw pageview, unique device estimates, mediacounts, etc.
Reports in 25+ languages based on data dumps and server log files.
DBpedia extracts structured data from Wikipedia, allows users to run complex queries and link Wikipedia data to other data sets.
A collection of various Wikimedia-related datasets.
Editing metadata includes information such as the user, the time, and the revision comment, but does not include the content of the revision itself.
This data is available from:
- the action API
- the XML data dumps
- the replicas of the MediaWiki databases available on Wikimedia's Toolforge
- Recent changes stream
Raw content data
Data that includes the raw content of page revisions is available from:
Structured content data
- Wikidata Query Service
In addition to the raw data described above, there is a great deal of helpful infrastructure for research and analysis provided for people contributing to Wikimedia's mission.
The web service API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.
- Meta information about the wiki and the logged-in user
- Properties of pages, including page revisions and content, external links, categories, templates, etc.
- Lists of pages that match certain criteria
To query the database, send an HTTP GET request to the desired endpoint (for example, http://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to "query" and defining the query details in the URL.
How to and examples
Here's a simple example:
This means: fetch (action=query) the content (rvprop=content) of the most recent revision of Main Page (titles=Main%20Page) of English Wikipedia (http://en.wikipedia.org/w/api.php?) in XML format (format=xml). You can paste the URL in a browser to see the output.
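The request described above can be assembled programmatically. The following is a minimal sketch in Python that only builds the URL and does not send it. The function name `build_query_url` is hypothetical, and the `prop=revisions` parameter is an assumption not stated in the description above (it is the property that `rvprop` refines).

```python
from urllib.parse import quote, urlencode

# Endpoint for English Wikipedia, as named in the text.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_query_url(title: str, fmt: str = "xml") -> str:
    """Build a URL asking for the content of the latest revision of `title`."""
    params = {
        "action": "query",    # fetch data
        "prop": "revisions",  # assumption: the property that rvprop applies to
        "rvprop": "content",  # return the revision text
        "titles": title,      # page to look up
        "format": fmt,        # output format
    }
    # quote_via=quote encodes spaces as %20, matching titles=Main%20Page.
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(build_query_url("Main Page"))
```

Pasting the printed URL into a browser should show the same kind of output as the hand-written URL.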
Further (and more complex) examples can be found here.
See also:
To try out the API interactively, use the Api Sandbox.
To use the API, your application or client might need to log in.
Before you start, learn about the API etiquette.
Researchers may be given special access rights on a case-by-case basis.
All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).
Mailing list: mediawiki-api
NOTE: In 2014 Toolforge replaced the "Toolserver" server cluster managed by WMDE.
Toolforge hosts command line or web-based tools, which can query copies of the database. Copies are generally real-time but sometimes replication lag occurs.
Toolforge hosts copies of the databases of all Wikimedia projects, including Commons. You are allowed to use the contents of the database as long as you don't violate the rules.
Learn more about the current database schema.
Using Toolforge requires familiarity with the Unix/Linux command line, SSH keys, SQL/databases, and some programming.
To start using Toolforge, see this handy guide: R:Labs2/Getting started with Toolforge. The steps are summarized as follows:
- Register a Wikitech account (register)
- Add your public SSH key under preferences
- Request access to Toolforge (request)
- SSH to login.tools.wmflabs.org using your private SSH key
- On IRC: #wikimedia-clouds on Freenode, a great place to ask questions, get help, and meet other Toolforge developers. See Help:IRC for more information.
- Via mailing list: labs-l, a list for announcements and discussion related to the Wikimedia Cloud VPS project. You can find the archives here: http://lists.wikimedia.org/pipermail/labs-l/
- Found a bug?: Bugs can be posted to Bugzilla: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Wikimedia%20Labs&component=tools
Projects using Toolforge / Toolserver data
On the old toolserver:
- "Circadian patterns of Wikipedia editorial activity: A demographic analysis" analyzed "34 Wikipedias in different languages [trying] to characterize and find the universalities and differences in temporal activity patterns of editors", with the underlying data provided by the German Wikimedia chapter from the toolserver.
- "Feeling the Pulse of a Wiki: Visualization of Recent Changes in Wikipedia" describes a tool hosted on Toolserver providing Recent Changes visualization to aid admins
Recent changes stream
See wikitech:EventStreams to subscribe to Recent changes to all Wikimedia wikis. This broadcasts edits and other changes as they happen; confirmation that an edit has completed is typically faster over this than through the browser.
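As a rough illustration of consuming such a stream, the sketch below decodes events from an iterable of text lines. It assumes the stream is delivered as Server-Sent Events whose `data:` field carries one JSON object per change. The function name `iter_sse_events` is hypothetical, and the sample event is hand-written for illustration, not real output; consult wikitech:EventStreams for the actual stream URL and event schema.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per Server-Sent Event found in `lines`."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        elif line == "" and data_parts:
            # A blank line ends the event; multi-line data is joined by "\n".
            yield json.loads("\n".join(data_parts))
            data_parts = []
        # Other fields ("event:", "id:") and comment lines are ignored here.


if __name__ == "__main__":
    # Hand-written sample; the field names are illustrative assumptions.
    sample = [
        "event: message\n",
        'data: {"wiki": "enwiki", "title": "Main Page", "type": "edit"}\n',
        "\n",
    ]
    for change in iter_sse_events(sample):
        print(change["wiki"], change["title"], change["type"])
```

In practice the `lines` argument would come from a streaming HTTP response rather than a list, but the parsing logic stays the same.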
Old IRC recent changes feed
- Changes shown automatically as they happen.
- Feeds for each wiki in a separate channel.
- Filtered feeds available with cloak.
Data and format
Each wiki edit is reflected in the wiki's IRC channel. Displayed URLs give the cumulative differences produced by the edit concerned and any subsequent edits. The time is not listed, but timestamping may be provided by your IRC client.
The format of each edit summary is:
[page_title] [URL_of_the_revision] * [user] * [size_of_the_edit] [edit_summary]
You can see some examples below:
<rc-pmtpa> Talk:Duke of York's Picture House, Brighton http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 *Fortdj33* (-14) Updated classification
<rc-pmtpa> Bloody Sunday (1887) http://en.wikipedia.org/w/index.php?diff=542604908&oldid=542604828 *0322.214.171.124* (-2371) /* Aftermath */
IRC feeds are hosted on the irc.wikimedia.org server.
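The line format shown above can be split into its fields with a regular expression. This is a minimal sketch that handles the plain-text form of the two example lines; the names `RC_LINE` and `parse_rc_line` are hypothetical, and lines as actually sent over IRC may contain extra formatting codes that would need to be stripped first.

```python
import re
from typing import Optional

RC_LINE = re.compile(
    r"^(?:<[^>]+>\s+)?"            # optional "<nick>" prefix shown by the IRC client
    r"(?P<title>.+?)\s+"           # page title (may contain spaces)
    r"(?P<url>https?://\S+)\s+"    # URL of the revision diff
    r"\*\s*(?P<user>.+?)\s*\*\s+"  # user name between asterisks
    r"\((?P<size>[+-]?\d+)\)"      # signed size of the edit, in parentheses
    r"\s*(?P<summary>.*)$"         # the rest of the line is the edit summary
)


def parse_rc_line(line: str) -> Optional[dict]:
    """Return the fields of one feed line, or None if it does not match."""
    match = RC_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["size"] = int(fields["size"])
    return fields


if __name__ == "__main__":
    example = (
        "<rc-pmtpa> Talk:Duke of York's Picture House, Brighton "
        "http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 "
        "*Fortdj33* (-14) Updated classification"
    )
    print(parse_rc_line(example))
```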
- wm-bot lets you get IRC feeds filtered according to your needs. You can define a list of pages and get notifications of revisions on those pages only.
- WikiStream uses IRC feeds to illustrate the amount of activity happening on Wikimedia projects.
- wikimon is a WebSocket-oriented monitor for the IRC feeds
Anyone can access IRC feeds. However, you need wm-bot to get feeds filtered to specific pages.
Each dump is described on the home page, https://dumps.wikimedia.org/other/analytics/, and additional detail about these dumps and other Analytics data is available at https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews
http://dumps.wikimedia.org/other/pagecounts-ez/merged/: highly compacted monthly aggregates, without loss of hourly resolution
Each request for a page reaches one of Wikimedia's Varnish caching hosts. The project name and the title of the page requested are logged and aggregated hourly. Newer, higher-quality data with spider traffic filtered out has been available since May 2015. The deprecated pagecounts-raw has English statistics since 2007 and non-English statistics since 2008, and pagecounts-ez stitches together the best data available for any given time.
Files starting with "project" contain total hits per project per hour statistics. A separate set with repaired counts is maintained as well (several cases of multi-month underreporting could be fixed from secondary sources).
Note: These are not unique hits and changed titles/moves are counted separately.
Delimited format:
[Project] [Article_name] [Number_of_requests]
where Project is in the form
language.project using abbreviations described here.
fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1
means that the French Wikibooks page with title "Special:Recherche/Achille_Baraguey_d%5C%27Hilliers" was viewed 1 time in the last hour. Old pagecounts-raw files also have the size of the content returned as a number, for example: 624.
en Main_Page 242332
shows that the main page of the English-language Wikipedia was requested over 240 thousand times during that hour.
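The delimited format above is straightforward to parse. The sketch below is a minimal Python example, assuming fields are separated by single spaces and that titles contain no spaces (as in the two sample lines); the names `PageCount` and `parse_pagecount_line` are hypothetical. The optional fourth field covers old pagecounts-raw files, which also include the size of the content returned.

```python
from typing import NamedTuple, Optional


class PageCount(NamedTuple):
    language: str                  # e.g. "fr" or "en"
    project_code: str              # e.g. "b"; empty when there is no suffix, as in "en"
    title: str                     # page title exactly as logged (still percent-encoded)
    requests: int                  # number of requests in that hour
    bytes_returned: Optional[int]  # only present in old pagecounts-raw files


def parse_pagecount_line(line: str) -> PageCount:
    """Parse one '[Project] [Article_name] [Number_of_requests]' line."""
    fields = line.split()
    if len(fields) not in (3, 4):
        raise ValueError(f"unexpected number of fields: {line!r}")
    project, title, requests = fields[:3]
    # Split "language.project" at the first dot; "en" has no project suffix.
    language, _, project_code = project.partition(".")
    size = int(fields[3]) if len(fields) == 4 else None
    return PageCount(language, project_code, title, int(requests), size)


if __name__ == "__main__":
    print(parse_pagecount_line("en Main_Page 242332"))
    print(parse_pagecount_line("fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1"))
```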
You can interactively browse the page view statistics and get data in JSON format at http://stats.grok.se/.
The following tools also use pageview statistics:
- Article traffic statistics
- GLAMourous - Commons image usage on Wikimedia projects
- Top 100 articles for 2012 for each project
Research projects using data from this source
- Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data combines the page view statistics for articles on movies from the raw page view dumps with the editorial activity data from the toolserver database to predict the financial success of movies.
- Wikipedia-Zugriffszahlen bestätigen Second-Screen-Trend (in German) studies how the TV schedule influences Wikipedia pageviews
- More examples
Also see: mw:Analytics/Wikistats
Wikistats is an informal but widely recognized name for a set of reports developed by Erik Zachte since 2003, which provide monthly trend information for all Wikimedia projects and wikis based on XML data dumps and Squid server traffic.
Thousands of monthly reports in 25+ languages about:
- unique visitors
- editor activity
- page views (overall and mobile only)
- article creation
- browser usage
Special reports (some are one time, some regular) about:
- growth per project and language
- pageview and edits per project and language
- server requests and traffic surges
- edits & reverts
- user feedback
- bot activity
- mailing lists
Final reports are presented in table and chart form. Intermediate files are available in CSV format.
The scripts used to generate the CSV files (WikiCounts.pl + WikiCounts*.pm) and reports (WikiReports.pl + WikiReports*.pm) are available for download here.
Maintainer: Erik Zachte
DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.
English version of the DBpedia knowledge base
- describes 3.77 million things
- 2.35 million are classified in a consistent ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.)
Localized versions of DBpedia in 111 languages
- together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia
The data set also features:
- about 2 billion pieces of information (RDF triples)
- labels and abstracts for >10 million unique things in up to 111 different languages
- millions of links to images, links to external web pages, data links into external RDF data sets, links to Wikipedia categories, and YAGO categories
- SPARQL endpoint
http://wiki.dbpedia.org/Downloads38 has download links for all the data sets, different formats and languages.
http://dbpedia.org/sparql - DBpedia's SPARQL endpoint
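A query can be sent to the SPARQL endpoint over HTTP. The following is a minimal sketch using only the Python standard library. It assumes the endpoint accepts the standard SPARQL protocol `query` parameter and returns JSON results when asked via the `Accept` header. The function names and the example query are hypothetical and are not taken from the DBpedia documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint listed in the text.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: the English label of one resource.
EXAMPLE_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Wikipedia> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_request(query: str) -> Request:
    """Prepare (but do not send) a GET request carrying the SPARQL query."""
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query: str) -> list:
    """Send the query and return the list of result bindings."""
    with urlopen(build_request(query), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query(EXAMPLE_QUERY)` contacts the live endpoint, so it needs network access; each returned binding is a dictionary keyed by variable name, for example `row["label"]["value"]`.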
How to and examples
- Use cases shows the different ways you can use DBpedia data (such as improving Wikipedia search or adding Wikipedia content to your webpage)
- Applications (broken link!) shows the various applications of DBpedia including faceted browsers, visualization, URI lookup, NLP and others.
- DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.
- RelFinder is a tool for interactive relationship discovery in RDF data
DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.
Mailing list: DBpedia Discuss
Research projects using data from this source
- "Biographical Social Networks on Wikipedia - A cross-cultural study of links that made history" uses data extracted from DBpedia to study how biographies on Wikipedia vary depending on language/culture.
- See more DBpedia related publications, blog posts and projects here.
The DataHub repository is meant to become the place where all Wikimedia-related data sources are documented. The collection is open to contributions and researchers are encouraged to donate relevant datasets.
The Wikimedia group on DataHub points to some additional data sources not listed on this page. Some examples are:
- dbpedia lite, which uses the API to extract structured data from Wikipedia (not affiliated with DBpedia)
- EPIC/Oxford quality assessment of Wikipedia by experts
- Wikipedia Banner Challenge data
- Wikipedia Editor Engagement Experiments: Timestamp position modification
- Hotels/restaurants/attractions data as CSV/OSM/OBF
- Tourism guide for offline use