Working with data in Wikimedia and MediaWiki

This course has ended. The Wikimedia Labs instances mentioned here and in the slides no longer exist.

Working with data in Wikimedia and MediaWiki is a course taught by Niklas Laxström and Susanna Ånäs.

Information

Place
Language Technology, Department of Modern Languages, University of Helsinki, Helsinki, Finland
Time
September-December 2016
Course info and sign-up
WebOodi

5.9. Wiki

Slides

Reading

Home assignment

Submission: send your written replies to niklas.laxstrom AT helsinki.fi using subject wmw-01 before Monday 12.9.

Content organization

  1. Go to Special:AllPages
  2. Go over all the non-talk namespaces from the namespace dropdown
  3. Open some pages from each namespace to see what kind of content there is

What do you think each namespace is used for? Do you see other patterns in the way pages are named besides the namespace? Is there anything special about the name of the page Special:AllPages itself? Write down your observations and thoughts.
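One way to make the naming pattern concrete is to split a page title at its first colon. The following is a minimal sketch, not MediaWiki's own title parser: the function name split_title and the short tuple of namespace names are assumptions chosen for illustration, and every wiki defines its own set of namespaces.

```python
# Hypothetical, incomplete list of namespace names used only for this sketch.
KNOWN_NAMESPACES = ("Special", "Template", "Help", "Category", "File", "User")


def split_title(title, known_namespaces=KNOWN_NAMESPACES):
    """Split 'Namespace:Page' into (namespace, page name).

    Titles without a known namespace prefix are treated as belonging to
    the main namespace, represented here by an empty string.
    """
    prefix, separator, rest = title.partition(":")
    if separator and prefix in known_namespaces:
        return prefix, rest
    return "", title


if __name__ == "__main__":
    for example in ("Special:AllPages", "Template:A", "Helsinki"):
        print(example, "->", split_title(example))
```

A title that merely contains a colon, such as 2001: A Space Odyssey, stays in the main namespace in this sketch because 2001 is not in the list of known namespace names.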

Basic wiki

I have created an uncustomized wiki installation. Compare it to this wiki and document what differences you see in the appearance and functionality. For example, try editing pages (but don't save anything). You can compare Special:Version on this wiki with Special:Version on the uncustomized wiki to see which extensions are installed; this can help you find more differences.

12.9. MediaWiki

Slides

Reading

Home assignment

  1. Pick two unique names, hereafter called A and B. You can use Special:Random as an inspiration.
  2. Go to the previously empty wiki of last week's assignment.
  3. It is not necessary to register in this wiki to create pages. Create the page Template:A with the contents This is the _ of the page _, replacing the underscores with appropriate wikicode: the first should output the content of the first unnamed parameter, and the second should output the name of the current page. See the help links in the reading section or the slides.
  4. Create page B with any creative content, such as "Hello world!". Edit the page B again and use the template A twice. Place {{A|beginning}} in the beginning and {{A|end}} at the end of the page and save your edits.
  5. Document how to use the Template:A using <noinclude> tags.

Make sure the page looks okay, for example that the text does not run together. Send the link to page B by email per instructions above.

Submission: send your answers to niklas.laxstrom AT helsinki.fi using subject wmw-02 before Monday 19.9.

19.9. MediaWiki extensions

Slides

Reading

Home assignment

If you are sure that you are going to install MediaWiki on your own, you can skip these steps, but do send an email to inform me that you are doing it.

  1. Familiarize yourself with Wikimedia Labs terms of use
  2. Create a Wikimedia Labs account (also known as Wikitech account)
  3. Create an SSH key if necessary and set it up for Wikimedia Labs

Submission: send your account name to niklas.laxstrom AT helsinki.fi using subject wmw-03 before Monday 3.10.

26.9. Wikimedia projects

Slides

Reading

Home assignment

  1. Define a list, map or a timeline for a topic. Choose a topic that could illustrate a Wikipedia article.
  2. Make a Wikidata query that returns all necessary information.
  3. Include dates, locations (points or areas) and images in the query.

Submission: send a link to your query to niklas.laxstrom AT helsinki.fi using subject wmw-04 before Monday 3.10.
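As an illustration of the kind of query the steps above describe, the sketch below holds a SPARQL query in a Python string and turns it into a shareable link to the Wikidata Query Service. The topic (Finnish municipalities), the item ID Q856076 (assumed here to be "municipality of Finland"; verify it on Wikidata before use) and the helper name query_link are choices made for this example, not part of the course material. The properties used are P31 (instance of), P571 (inception), P625 (coordinate location) and P18 (image).

```python
from urllib.parse import quote

# Example query: items with an optional date, location and image.
# Q856076 is assumed to be "municipality of Finland"; check before relying on it.
QUERY = """
SELECT ?item ?itemLabel ?inception ?coord ?image WHERE {
  ?item wdt:P31 wd:Q856076 .                # instance of
  OPTIONAL { ?item wdt:P571 ?inception . }  # date
  OPTIONAL { ?item wdt:P625 ?coord . }      # location (point)
  OPTIONAL { ?item wdt:P18 ?image . }       # image
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fi,en" . }
}
LIMIT 50
"""


def query_link(sparql):
    """Return a link that opens the query in the Wikidata Query Service UI."""
    return "https://query.wikidata.org/#" + quote(sparql.strip(), safe="")


if __name__ == "__main__":
    print(query_link(QUERY))
```

The OPTIONAL blocks keep items in the result even when a date, location or image is missing, which makes gaps in the data easier to spot.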

3.10. Lists, maps and timelines

Slides

Reading

Home assignment

  • Create or polish your map, timeline or list
  • Fix the data, if needed

Listeria list

  1. Create a Listeria list in your preferred wiki
  2. Add parameters
    • Use only one SELECT parameter: ?item
    • Listeria will take care of language, multiple values, grouping etc.
  3. Insert the list to your preferred wiki

Histropedia timeline

  1. Create a Histropedia timeline of your SPARQL query
  2. Add parameters
    • Add parameter name without the question mark as title, URL, dates, image etc. Use the textual representation for texts, not the ID.
    • You can group items based on one parameter.
  3. Insert a link to your preferred wiki

Kartographer <maplink> or <mapframe> map with SPARQL query and geoshapes from OpenStreetMap

Examples

  1. Where is Finland?
  2. Helsinki neighbourhoods
  3. Municipalities of Finland

Home assignment option

  • Create a map based on Finnish municipalities or neighbourhoods of Helsinki
  • Use parameters ?img, ?title, ?description, ?link and ?fill in your SPARQL query
  • All municipality geoshapes are needed for this to work, therefore, take part in talkoot!
  • It may take up to 2 days for the geoshapes to appear in the map

Additional talkoot for everyone!

  • Create an account in OpenStreetMap
  • In OSM, add 15 Wikidata IDs to OpenStreetMap features for Finnish municipalities. See this blog post for help.
    • Log in
    • Go to edit mode
    • Search for your municipality
    • Select the administrative unit (an area) from the list if there are several options
    • Add field: "Wikipedia". Use any language and select the name of the municipality. The Wikidata ID follows automatically.
    • If a Wikipedia article exists but there is no Wikidata ID, select the Wikipedia article again, and the Wikidata ID appears.
    • Remember to save
  • For those who have already completed their first 15 and those who have not started, select your set from the sets below, approx. 15 items :)
Cities                          Reserved by   Completed!
Akaa–Evijärvi                   taken         done
Finström–Hattula                Kim           done
Hausjärvi–Iisalmi               taken         done
Iitti–Joroinen                  Virpi         done
Joutsa–Kangasniemi              taken         done
Kankaanpää–Keminmaa             Julia         done
Kemiönsaari–Konnevesi
Kontiolahti–Kyyjärvi
Kärkölä–Leppävirta              Anna          done
Lestijärvi–Maalahti             Ville         done
Maarianhamina–Mänttä-Vilppula   Niklas        done
Mäntyharju–Padasjoki
Paimio–Pirkkala                 Sabine        done
Polvijärvi–Pyhäranta            Kim           done
Pälkäne–Ruokolahti              Susanna       done
Ruovesi–Siikajoki               Susanna       done
Siikalatva–Sysmä                Susanna       done
Säkylä–Tuusula                  Susanna       done
Tyrnävä–Vesanto                 Susanna       done
Vesilahti–Äänekoski             Susanna       done

Submission: send a link to your list, timeline or map to niklas.laxstrom AT helsinki.fi using subject wmw-05 before Monday 10.10.

Comments

  • Data may be modeled differently in different cases
  • Many ways to deal with duplicates
  • Good for educational purposes, especially for visual learners and history teaching, also at high school level.

10.10. Extracting content

Slides

Reading

Home assignment

Produce a plain text dump of Finnish Wikipedia articles having a name that starts with Abe. Do not include redirects.

Place the extracted text of each article to a separate file named after the article. Remove characters such as (, ), or & that can be problematic in file names. Use UTF-8 encoding.
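The file naming requirement could be handled along the following lines. This is a sketch under stated assumptions: the helper names safe_filename and write_article, the .txt suffix and the choice to replace spaces with underscores are illustrative, not prescribed by the assignment.

```python
import re
from pathlib import Path


def safe_filename(title):
    """Turn an article title into a file name without problematic characters.

    Keeps Unicode letters and digits, underscores, hyphens, dots and spaces,
    drops everything else (such as parentheses or &), then replaces runs of
    whitespace with a single underscore.
    """
    cleaned = re.sub(r"[^\w\-. ]", "", title)
    cleaned = re.sub(r"\s+", "_", cleaned.strip())
    return cleaned + ".txt"


def write_article(directory, title, text):
    """Write the extracted text to its own UTF-8 encoded file."""
    path = Path(directory) / safe_filename(title)
    path.write_text(text, encoding="utf-8")
    return path


if __name__ == "__main__":
    print(safe_filename("Abe (elokuva)"))
```

Two different titles can map to the same file name once characters are removed, so a real script may need to detect such collisions.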

You can use the database dumps or the API. Use the latest version of the article available with your source.

You can use mediawiki-utilities and/or other libraries (for example BeautifulSoup) or programming languages. The goal is to extract sentences from the articles. All wikitext mark-up or HTML mark-up should be removed as much as possible, as well as headings, infoboxes, citations, tables, etc.

If you decide to use the dumps, you can do this exercise on prugna.wmwcourse.eqiad.wmflabs (how to access), where the dump file is under /data and mediawiki-utilities and BeautifulSoup4 are already installed. You need to use the python3 command to run your script. Since just iterating over the dump takes over 5 minutes, consider splitting your script into two parts: first extract the relevant pages with their content, then clean up the output.
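The two-part split suggested above might be sketched as follows. The sketch deliberately avoids any dump-reading library: it assumes the pages have already been read into (title, is_redirect, wikitext) tuples, which is a hypothetical record shape, and the function names select_pages and strip_markup are likewise invented for this example. The clean-up uses only simple regular expressions, so nested templates, tables and category links are not handled.

```python
import re


def select_pages(pages, prefix="Abe"):
    """Stage 1: keep non-redirect pages whose title starts with the prefix."""
    return [(title, text) for title, is_redirect, text in pages
            if title.startswith(prefix) and not is_redirect]


def strip_markup(wikitext):
    """Stage 2: crude clean-up of wikitext with regular expressions."""
    text = re.sub(r"<ref[^>]*?/>", "", wikitext)                   # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)    # citations
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                     # non-nested templates
    text = re.sub(r"^=+.*?=+\s*$", "", text, flags=re.M)           # headings
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # bold and italic quotes
    return re.sub(r"\n{2,}", "\n", text).strip()


if __name__ == "__main__":
    # Made-up sample records, not real article content.
    sample = [
        ("Abel", False, "'''Abel''' on [[Raamattu|Raamatun]] henkilö.<ref>Lähde</ref>"),
        ("Abe", True, "#OHJAUS [[Abel]]"),
        ("Helsinki", False, "Helsinki on kaupunki."),
    ]
    for title, text in select_pages(sample):
        print(title, "->", strip_markup(text))
```

Keeping the stages separate means the slow pass over the dump runs once, while the clean-up rules can be adjusted and rerun quickly on the saved intermediate output.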

Write down notes about problematic cases that you encounter. Finally, give an estimate of how long it would take to do this kind of dump from all of Finnish Wikipedia.

Submission: send your notes, script and text files in an archive to niklas.laxstrom AT helsinki.fi using subject wmw-06 before Monday 17.10.

17.10. Wikimania

There is no lecture on 17.10.

Wikimania is the largest annual conference of the Wikimedia movement. It has presentations on both technical and social topics and it provides a window to what is happening in the movement.

Home assignment

Watch 2 or 3 presentations from Wikimania 2016 based on your interest. Summarize each presentation and what you learned in a few paragraphs. Be prepared to share highlights with others at the next lecture.

Submission: send your summaries to niklas.laxstrom AT helsinki.fi using subject wmw-07 before Monday 31.10.

24.10. Period break

There is no lecture on 24.10.

31.10. Semantic MediaWiki

Slides

Reading

Home assignment

You have received the name of your Vagrant wiki at the lecture or via email. If you have not, contact Niklas.

  1. Check that http://wmwcourse-name.wmflabs.org has a working wiki.
  2. Connect to your server name.wmwcourse.eqiad.wmflabs with ssh. See wikitech:Help:Getting_Started#Project_Instances for how to do this, if you haven't already.
  3. Enable the semanticmediawiki role with cd /srv/mediawiki-vagrant; vagrant roles enable semanticmediawiki && vagrant provision.
  4. Log in to your wiki using the admin account and change the password.
  5. Add some pages with semantic annotations to your wiki using the template approach. For example countries and capitals, but feel free to use your imagination.
  6. Create a page with a semantic query ({{#ask:}}) that displays some data from those pages. For example countries with their capitals and population in descending order.
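To show what the query in the last step could look like, the sketch below assembles the wikitext of an inline {{#ask:}} query from its parts. The category name Country and the property names Capital and Population are hypothetical and depend on how your own templates annotate the pages; the helper name ask_query is also invented for this example. The sort and order parameters are those of Semantic MediaWiki inline queries.

```python
def ask_query(condition, printouts, sort=None, order=None):
    """Build the wikitext for a Semantic MediaWiki inline query.

    condition: the selection, for example "[[Category:Country]]"
    printouts: property names to display, without the leading question mark
    sort, order: optional sorting property and direction ("asc" or "desc")
    """
    parts = ["{{#ask: " + condition]
    parts += ["?" + name for name in printouts]
    if sort:
        parts.append("sort=" + sort)
    if order:
        parts.append("order=" + order)
    return "\n |".join(parts) + "\n}}"


if __name__ == "__main__":
    # Hypothetical category and property names.
    print(ask_query("[[Category:Country]]", ["Capital", "Population"],
                    sort="Population", order="desc"))
```

The printed wikitext can be pasted into a wiki page as is; the query only returns rows for pages where the template has actually set the named properties.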

Submission: send a link to your query page to niklas.laxstrom AT helsinki.fi using subject wmw-08 before Monday 7.11.

7.11. Forms

Slides

Reading

Home assignment

Use the same Vagrant wiki as you did last week.

  1. Check that http://wmwcourse-name.wmflabs.org has a working wiki.
  2. Connect to your server name.wmwcourse.eqiad.wmflabs with ssh.
  3. Install Page Forms. It does not have a role yet, so we are going to install it manually.
    1. Go inside your Vagrant virtual machine: cd /srv/mediawiki-vagrant; vagrant ssh
    2. Download the PageForms extension: cd /vagrant/mediawiki/extensions; git clone https://gerrit.wikimedia.org/r/p/mediawiki/extensions/PageForms
    3. Exit the virtual machine: exit
    4. Register the extension (you can use your favorite editor) by creating a new settings file: nano /srv/mediawiki-vagrant/settings.d/20-pageforms.php with contents:
      <?php
      
      wfLoadExtension( 'PageForms' );
      
    5. Check Special:Version of your wiki to confirm it is installed properly.
  4. Use Special:CreateForm to create a new form
  5. Edit your form page to better suit your input by selecting input types, possible values etc.

Submission: send a link to your form page to niklas.laxstrom AT helsinki.fi using subject wmw-09 before Monday 14.11.

14.11. Translate extension

Slides

Reading

Home assignment

Use the same Vagrant wiki as you did last week.

  1. Connect to your server name.wmwcourse.eqiad.wmflabs with ssh.
  2. Install MediaWiki Language Extension Bundle. It does have a vagrant role.
    1. Enable the role and provision the virtual machine: cd /srv/mediawiki-vagrant; vagrant roles enable mleb; vagrant provision
    2. Add some basic configuration: nano /srv/mediawiki-vagrant/LocalSettings.php with contents:
      $wgGroupPermissions['user']['translate'] = true;
      $wgGroupPermissions['user']['translate-messagereview'] = true;
      $wgGroupPermissions['user']['translate-groupreview'] = true;
      $wgGroupPermissions['user']['pagetranslation'] = true;
      $wgTranslateDocumentationLanguageCode = 'qqq';
      $wgExtraLanguageNames['qqq'] = 'Message documentation';
      
    3. Check Special:Version of your wiki to confirm it is installed properly.
  3. Make your query results page and form translatable. You can use either page translation or unstructured element translation. Remember that some form labels do not support {{int}}, so it is okay to leave those untranslated.
  4. Translate your query results page and form to one language other than English.

Submission: send a link to your pages to niklas.laxstrom AT helsinki.fi using subject wmw-10 before Monday 21.11.

21.11. Content Translation & Project work

Slides

Reading

Home assignment

Try Content Translation

  1. Log in to Wikipedia
  2. Go to beta features tab in your preferences and enable content translation
  3. Go to Special:ContentTranslation and do a translation (you don't need to publish)
  4. Write comments answering the following questions:
    1. Did you encounter any bugs or issues during translation?
    2. Compare the actual source article and what you see in the translation tool's source column. What differences are there?
    3. Now that you have tried different kinds of translation tools, what are the benefits and downsides of each tool?
    4. How would you decide which tool to use for different types of content?
    5. If you decide to publish your translation, include a link to the published page

Choose a data set

If you want to do project work, choose a data set. Refer to the slides for what is available.

Submission: send your answers to niklas.laxstrom AT helsinki.fi using subject wmw-11 before Monday 28.11.

28.11. Pywikibot and tips for running MediaWiki

Slides

Reading

Home assignment

  • No home assignment this week.

5.12. Examples on subobjects and custom parser functions

Slides

Reading

Home assignment

  • No home assignment this week either.

12.12. Guest presentation, course summary and life after the course

Slides