Learning patterns/Working with the Wikipedia data dump for research
What problem does this solve?
Scientific research has increasingly come to use Wikipedia as a data source. Given the large amount of information produced over the past twenty years and the fact that it is publicly available, Wikipedia data can be a gold mine for researchers who need text data or are interested in successful online peer production. While access to large amounts of data is generally an advantage, downloading and working with the files containing all this information can become a computational problem. Downloading and working with all revisions of all pages of the English Wikipedia means handling multiple terabytes of text: too much for the casual researcher without access to specialised infrastructure. In this learning pattern, I want to highlight how the computational load can be reduced and what steps can be taken to conduct research with Wikipedia data dumps.
What is the solution?
Step 1: Ask yourself which data you actually need
When working with Wikipedia data, the first and most important step is to answer the question of which data you really need. Do you need the text of all Wikipedia articles in every state they have ever been in? Most likely not. Wikimedia provides a number of different data dumps with different levels of detail; see, for example, the latest data dumps of the English Wikipedia.
You should ask yourself the following questions:
- Do you need metadata or content?
- Stubs do not contain any page or revision content, but include metadata such as which user edited which page and to what extent. Content files like pages-articles also contain the raw revision content and are much larger. If you are not interested in the actual text but rather in contribution behaviour, you may only need the stubs.
- Do you need information on articles or everything?
- Are you interested only in the encyclopedic content of Wikipedia, or do you need data on discussions and users? Articles files contain everything in the main namespace, but do not contain user pages and talk pages.
- Do you need a current snapshot of Wikipedia or do you care about its history?
- There are data dumps which include only the current revision of each page, while the history files go back in time.
Depending on what data you need, there is a different file to download. For example, if you are not interested in textual content but in the number of contributions per user across different namespaces and across time, you need to work with the meta stubs history files. If you are interested in the most current version of articles, you need to work with the current articles files.
Related to this: also check whether you really need to work with the data dumps at all. Depending on your interest, the information might be available somewhere else. For example, if you are trying to find the longest Wikipedia pages, there is a special page listing them. In other cases, a single API call might suffice.
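As an illustration, a single request to the MediaWiki Action API can return basic page metadata without touching any dump. The sketch below only builds the request URL with Python's standard library; the article title and the chosen parameters are illustrative:

```python
from urllib.parse import urlencode

# One API request for basic page metadata (length, last edit, ...)
# can answer some questions without downloading any dump at all.
API = "https://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "titles": "Python (programming language)",  # illustrative title
    "prop": "info",     # page metadata: length, last revision, ...
    "format": "json",
}
url = API + "?" + urlencode(params)
print(url)

# Fetching the result is one more line, e.g.:
#   import urllib.request, json
#   data = json.load(urllib.request.urlopen(url))
```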
Step 2: Work with a toy example
Whichever data file contains the information you need, it will probably still be very large once downloaded. Instead of pre-processing, working with, and analysing this huge file directly, start with a toy example. A small toy example lets you explore the data structure and try out solutions and potential approaches without long waiting times; it allows you to fail fast and fail often. Once your approach and your code work well on the toy example, you can apply them to the data you are actually interested in.
Data dumps from small language versions can work as useful toy examples.
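A toy example does not even have to come from a real dump: a hand-written XML snippet that mimics the page/revision nesting of a dump is enough to develop parsing code against. A minimal sketch in Python (real dumps add an XML namespace and many more fields, but the shape is the same):

```python
import xml.etree.ElementTree as ET

# A hand-made toy "dump": one page with two revisions, mimicking
# the <page>/<revision> nesting of a real dump file.
TOY_DUMP = """
<mediawiki>
  <page>
    <title>Toy article</title>
    <ns>0</ns>
    <revision>
      <id>1</id>
      <timestamp>2020-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2020-01-02T00:00:00Z</timestamp>
      <contributor><username>Bob</username></contributor>
    </revision>
  </page>
</mediawiki>
"""

root = ET.fromstring(TOY_DUMP)
revisions = [
    (page.findtext("title"), rev.findtext("contributor/username"))
    for page in root.iter("page")
    for rev in page.iter("revision")
]
print(revisions)  # [('Toy article', 'Alice'), ('Toy article', 'Bob')]
```

Developing against a few hand-made pages like this makes every test run instantaneous; only once the logic is right do you point the same code at a real file.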
Step 3: Pre-process and filter the data
You have identified the data file you need and want to start working with it. The XML file you downloaded is still very large, as XML is a verbose format. The data becomes easier to work with once transformed into a tabular format such as CSV. The next step is thus to parse the Wikipedia dump, for example with the Python wiki dump parser. The parser can and should be adjusted to output exactly the data you need. Again, ask yourself if there is any information that could be skipped: Do you need the edit summaries? Do you need edits made by IPs? Are you only interested in edits made on Mondays? Filter the data now to end up with workable files.
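As a sketch of what such a filtering step might look like, the snippet below streams through a dump-shaped XML file with `iterparse` (so the whole file never sits in memory), skips anonymous (IP) edits, and writes one CSV row per revision. It assumes a namespace-free page/revision structure; real dumps wrap every tag in an XML namespace you would need to strip or match, and a dedicated parser handles many more edge cases:

```python
import csv
import io
import xml.etree.ElementTree as ET

def dump_to_csv(xml_file, csv_file):
    """Stream a dump-shaped XML file into a CSV of revision metadata."""
    writer = csv.writer(csv_file)
    writer.writerow(["page", "revision_id", "user", "timestamp"])
    title = None
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "revision":
            user = elem.findtext("contributor/username")
            if user:  # IP edits have <ip> instead of <username>: skip them
                writer.writerow(
                    [title, elem.findtext("id"), user,
                     elem.findtext("timestamp")]
                )
            elem.clear()  # free the finished revision subtree immediately

# Demo on a tiny in-memory "dump" with one named edit and one IP edit:
toy = b"""<mediawiki><page><title>T</title>
<revision><id>1</id><timestamp>2020-01-01T00:00:00Z</timestamp>
<contributor><username>Alice</username></contributor></revision>
<revision><id>2</id><timestamp>2020-01-02T00:00:00Z</timestamp>
<contributor><ip>127.0.0.1</ip></contributor></revision>
</page></mediawiki>"""
out = io.StringIO()
dump_to_csv(io.BytesIO(toy), out)
print(out.getvalue())
```

Because `iterparse` clears each revision after writing it, memory use stays flat no matter how large the input file is.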
Step 4: Work with a toy example
See step 2. Always work with a toy example, even if you think it is not necessary. It will save time in the long run.
Step 5: Use a database
If the data file you want to analyse still spans multiple gigabytes, it can make sense to store your data in a database. Most programming languages can work directly with database files, and aggregation and filtering via SQL commands is often quicker and computationally cheaper.
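A minimal sketch of this approach with Python's built-in sqlite3 module, assuming an illustrative `revisions` table (the schema and column names are made up for the example, not a standard one):

```python
import sqlite3

# In-memory database for the sketch; pass a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revisions (page TEXT, user TEXT, timestamp TEXT)"
)

# Rows as they might come out of the CSV produced in step 3.
rows = [
    ("Toy article", "Alice", "2020-01-01T00:00:00Z"),
    ("Toy article", "Bob", "2020-01-02T00:00:00Z"),
    ("Other article", "Alice", "2020-01-03T00:00:00Z"),
]
conn.executemany("INSERT INTO revisions VALUES (?, ?, ?)", rows)

# Aggregate inside the database instead of in Python:
edits_per_user = conn.execute(
    "SELECT user, COUNT(*) FROM revisions GROUP BY user ORDER BY user"
).fetchall()
print(edits_per_user)  # [('Alice', 2), ('Bob', 1)]
```

Once the data is in the database, counts, joins, and filters run over an index-backed file rather than repeatedly re-reading a multi-gigabyte CSV.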
Step 6: Work with a toy example
See step 2. Always work with a toy example, really. Load your toy example into the database.
When to use
- Use this pattern when you want to work with Wikipedia data but do not have the computational power, facilities, or time to work with very large amounts of data.
- I have used this approach in my project.
- Information on the data download
- General information on data dumps
- What the page meta data includes
- A tool which allows you to create customised subsets of dumps