Research:Investigate frequency of section titles in 5 large Wikipedias

Created
13:07, 20 October 2016 (UTC)
Duration:  2016-10 – 2016-12


This page documents a completed research project.
Tracked in Phabricator:
Task T148260

Wikipedia article pages are separated by section headings to organize the content of each page. This project aims to investigate the 100 most frequent section headings in 5 large Wikipedias and produce general insights around section headings.

Methods

This research uses the “Articles, templates, media/file descriptions, and primary meta-pages” data dump to generate, on PAWS, a dataset of all page headings (article section titles) for the English, French, German, Italian and Spanish Wikipedias. For each article page in a language edition, the page id, page title, page namespace, heading level and heading text are obtained using Aaron Halfaker's method of extracting headings, which relies on the mwparserfromhell library. The data dumps contain pages from all namespaces, but this project explores only the main/article namespace (0 in all languages), so only pages with namespace 0 are added to the dataset.
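The extraction step can be sketched roughly as follows, assuming page records have already been read from the dump. The original work parses wikitext with mwparserfromhell following Aaron Halfaker's method; the regular expression here is a simplified stand-in for that parser, not the project's code, and the names `HEADING_RE` and `extract_headings` are hypothetical.

```python
import re

# Simplified stand-in for a real wikitext parser: matches heading lines
# such as "== See also ==" or "===Early life===" (levels 1 to 6).
HEADING_RE = re.compile(r"^(={1,6})(.*?)\1[ \t]*$", re.MULTILINE)


def extract_headings(page_id, page_title, page_ns, wikitext):
    """Yield (page_id, page_title, page_ns, heading_level, heading_text) rows.

    Pages outside the main/article namespace (0) are skipped.
    Heading text is returned as written, including surrounding whitespace.
    """
    if page_ns != 0:
        return
    for match in HEADING_RE.finditer(wikitext):
        level = len(match.group(1))  # number of "=" characters on each side
        yield (page_id, page_title, page_ns, level, match.group(2))


sample = "Intro\n== History ==\ntext\n===Early life===\n==See also==\n"
rows = list(extract_headings(1, "Example", 0, sample))
```

A regular expression will misread headings inside comments or `nowiki` blocks, which is one reason a full parser is preferable for real dumps.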

To ensure data integrity, a variety of data quality checks are run on the page headings dataset. These checks confirm that the dataset contains all article pages with headings from the data dump, that it contains all section headings of any given article page, and that the section headings are in the order in which they appear in the article. Completeness was checked by pulling out the last 500 lines of the data dump and confirming that all article pages with headings in these 500 lines are in the headers dataset. It is important to note that the code to extract headings does not always parse the entire data dump, and many revisions of the headers dataset were generated before data integrity was ensured. The random article generator is used to spot check articles, ensuring that they are present in the derived dataset with their correct heading sequence. The total article count, which is used to calculate the percentage of articles in which each heading appears, includes all pages from the data dump with namespace 0, excluding redirects.
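The percentage calculation described above can be illustrated with a minimal sketch. It assumes the input is a list of (page id, heading text) pairs and that the total article count is supplied separately; the function name `heading_percentages` is hypothetical. A heading that occurs several times in one article is counted once, since the measure is the share of articles in which the heading appears.

```python
from collections import Counter


def heading_percentages(rows, total_articles):
    """Return {heading_text: percentage of articles containing it}.

    rows: iterable of (page_id, heading_text) pairs.
    total_articles: number of namespace-0, non-redirect pages in the dump.
    """
    # Deduplicate so each heading counts at most once per article.
    distinct = set(rows)
    counts = Counter(heading for _, heading in distinct)
    return {h: 100.0 * n / total_articles for h, n in counts.items()}


example_rows = [
    (1, "References"),
    (1, "References"),  # repeated heading in the same article
    (2, "References"),
    (3, "History"),
]
percentages = heading_percentages(example_rows, total_articles=4)
```

Articles without any heading still count toward `total_articles`, so the percentages across headings need not sum to any particular value.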

In some cases, the wikitext for a heading is formatted as “== See also ==”, while in other cases it appears as “==See also==”. The former produces the heading text “ See also ”, while the latter produces “See also”. For this analysis, these are regarded as the same heading, so all leading and trailing whitespace in the heading text is removed to avoid duplicate titles.
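A small illustration of why this normalization matters, using made-up heading strings rather than project data:

```python
from collections import Counter

# Heading text as it might come out of the extraction step.
raw = [" See also ", "See also", "References", " References "]

before = Counter(raw)                    # whitespace variants counted separately
after = Counter(h.strip() for h in raw)  # variants merged after stripping
```

Without stripping, the four strings are treated as four different headings; after stripping, they collapse to two headings with two occurrences each.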

Due to the large datasets and the memory limit in PAWS (1 GB), the Python code for this analysis contains many workarounds. For example, the section headers TSV files are read in chunks of 100,000 rows, which are then concatenated into one pandas dataframe, to avoid reading the entire file into memory at once. In addition, three columns of the dataframe are converted from the default np.int64 to the np.int32, np.int16 or np.int8 data types to conserve memory. Even so, the English, French, and German Wikipedias produce header files that are too large to analyze in PAWS, so the analysis for these was run on a personal laptop. The PAWS notebooks for these languages contain commented-out code which one can run if and when the PAWS memory limits are increased.
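The chunked reading and downcasting could look roughly like the sketch below. It is not the notebook code: the function name `load_headings` is hypothetical, and the assignment of np.int32, np.int16 and np.int8 to specific columns is an assumption, since the text above does not say which column received which type.

```python
import csv
import io

import numpy as np
import pandas as pd

# Assumed mapping of columns to smaller integer types (not stated in the source).
SMALL_DTYPES = {
    "page_id": np.int32,
    "page_ns": np.int16,
    "heading_level": np.int8,
}


def load_headings(path_or_buffer, chunksize=100_000):
    """Read a headers TSV in chunks and return one concatenated dataframe."""
    chunks = pd.read_csv(
        path_or_buffer,
        sep="\t",
        dtype=SMALL_DTYPES,      # parse directly into the smaller types
        chunksize=chunksize,     # yields dataframes of at most this many rows
        quoting=csv.QUOTE_NONE,  # titles and headings may contain quote characters
    )
    return pd.concat(chunks, ignore_index=True)


tsv = (
    "page_id\tpage_title\tpage_ns\theading_level\theading_text\n"
    "12\tAnarchism\t0\t2\tHistory\n"
    "12\tAnarchism\t0\t2\tSee also\n"
    "25\tAutism\t0\t2\tReferences\n"
)
df = load_headings(io.StringIO(tsv), chunksize=2)
```

The final dataframe still has to fit in memory, so this approach mainly lowers the peak usage during parsing and the per-row cost of the integer columns.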

Results

Top 10 Section Headings
| Rank | en | en % | fr | fr % | de | de % | it | it % | es | es % |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | References | 78.19 | Notes and references (Notes et références) | 43.85 | Web links (Weblinks) | 63.55 | External links (Collegamenti esterni) | 52.42 | References (Referencias) | 66.58 |
| 2 | External links | 44.33 | External links (Liens externes) | 40.48 | Individual citations (Einzelnachweise) | 46.61 | Notes (Note) | 44.73 | External links (Enlaces externos) | 56.81 |
| 3 | See also | 21.51 | See also / External links (Voir aussi) | 23.63 | Literature (Literatur) | 28.19 | Other projects (Altri progetti) | 31.43 | See also (Véase también) | 21.89 |
| 4 | History | 10.11 | References (Références) | 18.6 | History (Geschichte) | 14.92 | See also (Voci correlate) | 18.8 | Bibliography (Bibliografía) | 11.49 |
| 5 | Notes | 5.37 | See also (Articles connexes) | 18.17 | Life (Leben) | 13.1 | Bibliography (Bibliografia) | 16.86 | History (Historia) | 10.68 |
| 6 | Career | 3.34 | Biography (Biographie) | 16.0 | See also (Siehe auch) | 9.16 | Biography (Biografia) | 10.63 | Demography (Demografía) | 9.13 |
| 7 | Biography | 2.89 | History (Histoire) | 12.82 | Sources (Quellen) | 3.92 | Plot (Trama) | 8.91 | Biography (Biografía) | 7.62 |
| 8 | Further reading | 2.81 | Bibliography (Bibliographie) | 12.75 | Awards (Auszeichnungen) | 2.99 | History (Storia) | 5.95 | Geography (Geografía) | 6.62 |
| 9 | Track listing | 2.75 | External link (Lien externe) | 8.47 | Career (Karriere) | 2.93 | Career (Carriera) | 5.17 | Geographic Distribution (Distribución geográfica) | 4.32 |
| 10 | Bibliography | 2.32 | Geography (Géographie) | 6.6 | Geography (Geographie) | 2.87 | Achievements (Palmarès) | 4.64 | Notes (Notas) | 3.39 |

A few headings that apply to any subject, such as “References”, “External links”, and “See also”, appear frequently in all languages analyzed. Beyond those, the headings are more specific to the topic of the article; for example, articles about a person tend to use “Career”, “Biography”, or “Life”.

All analyzed language editions have some version of “References” and “External links” as their two most frequent headings. Italian and German do not use an exact equivalent of the “References” section heading. In Italian, the “Notes” (“Note”) and “Bibliography” (“Bibliografia”) sections serve the same purpose as “References” (but “Bibliografia” also contains what is called “Further reading” on English Wikipedia, and “Note” also contains non-reference footnotes and remarks). In German, “Individual citations” (“Einzelnachweise”) is used for references, but references also appear in “Literature” (“Literatur”) sections alongside further reading material. Both of these languages use “External links” (“Collegamenti esterni” / “Weblinks”) more often than those counterparts of “References”. In French, the “Notes and references” (“Notes et références”) heading is more popular than the similar “References” (“Références”) heading.

In the French language, both "Articles connexes" and "Voir aussi" serve as "See also" headings, but "Voir aussi" can be used for a section encompassing both "External links" and "See also" (example article).

Comparable but different headings are favored in different language editions. For example, in German, “Life” (“Leben”) appears in 13% of articles (presumably most of these are articles about people) and is the 5th most frequent heading. In English Wikipedia, however, “Life” appears in less than 1% of articles and is only the 27th most frequent heading. English articles use “Biography” more often instead: it appears in about 3% of articles and is the 7th most frequent heading. In German, “Biography” (“Biografie”) is the 32nd most frequent heading and appears in under 1% of articles.

The headers datasets below contain the following columns:

  • page_id: identifier of the page
  • page_title: title of the page
  • page_ns: namespace of page
  • heading_level: level of heading on page
  • heading_text: text of heading

The headings of each article page appear in the dataset in the order in which they occur on the page, but the rows of one page are not always contiguous: rows from other article pages may be interleaved before a page's headings are complete.
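Because rows of different pages can be interleaved, anyone reconstructing per-article heading sequences should group by page rather than rely on adjacency. A minimal sketch, assuming rows are (page_id, heading_text) pairs in file order; the function name `headings_by_page` is hypothetical:

```python
from collections import defaultdict


def headings_by_page(rows):
    """Group heading text by page_id, keeping each page's headings in file order."""
    pages = defaultdict(list)
    for page_id, heading_text in rows:
        pages[page_id].append(heading_text)
    return dict(pages)


# Rows for pages 1 and 2 are interleaved, as can happen in the dataset.
interleaved = [
    (1, "History"),
    (2, "Plot"),
    (1, "References"),
    (2, "Cast"),
]
grouped = headings_by_page(interleaved)
```

Since the relative order within a page is preserved in the file, appending in file order is enough to recover each article's heading sequence.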

English

French

German

Italian

Spanish

See also