Wikipedia Diversity Observatory/Cultural Context Content

This page explains what the Cultural Context Content (CCC) is for Wikipedia language editions, and the method employed in order to obtain it.

The following sections describe all the steps employed to map Wikipedia articles to their cultural context(s) in order to construct a dataset. Previous versions of this method have been published since 2011[1][2][3][4][5].

Introduction

Cultural Context Content is the group of articles in a Wikipedia language edition that relates to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.).

Cultural Context Content is collected as a dataset, which is made available on a monthly basis and allows the Wikipedia Diversity Observatory project to compute and display statistics on the state of knowledge equality and cross-cultural coverage.

Language-Territories Mapping

Obtaining this collection of articles requires associating each language with the territories where it is spoken officially or as a native (indigenous) language, and then collecting the articles that relate to each territory to create a single dataset for each language edition. To keep the mapping simple, we consider first- and second-level political divisions (that is, countries and recognized regions), which correspond to the ISO 3166 and ISO 3166-2 codes. These codes are widely used on the Internet.

 
In dark blue the territories where Italian is spoken natively. In light blue where it is used as a secondary language.

For instance, for the Italian Wikipedia, the CCC comprises all the topics related to the following territories (dark blue in the map): Italy (IT), Vatican City (VA), San Marino (SM), Canton Ticino (CH-TI), Istria county (HR-18), Pirano (SI-090) and Isola (SI-040), whereas for the Czech language it only contains the Czech Republic (CZ). For a widespread language like English, the mapping comprises 90 territories, counting all the countries where it is native and the former colonies where it remains an official language, which implies that its CCC is composed of several different contexts.

Once established, the language-territory mappings make up a database (see github) with the ground-truth information that is later used in the different article retrieval strategies. For each language edition, we obtain the native words that denominate each territory, their Wikidata Qitems, their inhabitants’ demonyms and the language names (e.g., eswiki españa mexico … español castellano).
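As a rough illustration of what such a mapping holds, the sketch below stores the examples mentioned above in plain Python dictionaries. The variable and function names are hypothetical and the real database schema may differ. The keyword list for eswiki is only the partial example quoted in the text.

```python
# Hypothetical in-memory form of the language-territories mapping.
# Territory codes are the Italian and Czech examples given in the text.
TERRITORIES = {
    "it": ["IT", "VA", "SM", "CH-TI", "HR-18", "SI-090", "SI-040"],
    "cs": ["CZ"],
}

# Partial keyword list, taken from the eswiki example in the text.
KEYWORDS = {
    "es": ["españa", "mexico", "español", "castellano"],
}


def territories_for(language_code):
    """Return the ISO 3166 / ISO 3166-2 codes mapped to a language."""
    return TERRITORIES[language_code]


def languages_for(territory_code):
    """Return every language whose mapping includes the given territory."""
    return sorted(
        language
        for language, codes in TERRITORIES.items()
        if territory_code in codes
    )
```

A reverse lookup such as languages_for("CH-TI") is what later lets a geolocated article be attributed to a language context.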

This word list was initially generated automatically by cross-referencing language ISO codes with the Wikidata, Unicode and Ethnologue databases, which contain the territories where a language is spoken and their names in the corresponding language. This is especially relevant for languages that are only spoken or official in a specific region of a country.

The generated list is subsequently revised and extended manually (using information from the specific articles in the corresponding Wikipedia language edition). Wikimedians are invited to suggest changes by e-mailing tools.wcdo@tools.wmflabs.org or in the Language-Territories Mapping talk page.

Method

Once the language-territory mapping database is obtained, a computational implementation of the method is developed (ccc_selection.py script).

The method runs monthly as a cron job that connects to the databases of each Wikipedia language edition (MySQL replicas), which are updated in real time, and to the latest JSON Wikidata dump.

The method integrates several strategies described below in order to qualify all the Wikipedia language edition articles as (1) reliably CCC, (2) potentially CCC, (3) unlikely CCC and (4) not-CCC. These features are fed into a machine learning classifier (Random Forest) in order to compose the final dataset.

Reliable CCC and Potential CCC Features

Feature 1: Geolocated articles (reliable CCC)

This first Feature consists of examining the geocoordinates and the ISO code found in Wikidata and in the MySQL geotags table. The use of geocoordinates and ISO codes is uneven across the different language editions and may contain errors. Therefore, articles with coordinates are verified using a reverse geocoder tool in Python and a database of equivalences between territory names and their ISO 3166-2 codes (see world_subdivisions.csv). This tool returns an ISO code, which is then checked against the database to see whether or not the article is located in a territory associated with the language.
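The final check can be sketched as follows, assuming the reverse geocoder has already returned a country code and, where available, a subdivision code. The function names are hypothetical. The 1 / -1 values follow the "ccc geolocated" field described in the Datasets section.

```python
# Territories mapped to Italian, as listed in the text.
ITALIAN_TERRITORIES = {"IT", "VA", "SM", "CH-TI", "HR-18", "SI-090", "SI-040"}


def is_in_language_territories(iso3166, iso3166_2, territories):
    """Check the codes returned by a reverse geocoder against the mapping.

    A whole-country match (e.g. "IT") is enough; otherwise the
    subdivision code (e.g. "CH-TI") must be mapped explicitly.
    """
    if iso3166 in territories:
        return True
    return iso3166_2 is not None and iso3166_2 in territories


def geolocated_feature(iso3166, iso3166_2, territories):
    """Return 1 for a reliable CCC article, -1 for one located elsewhere."""
    if is_in_language_territories(iso3166, iso3166_2, territories):
        return 1
    return -1
```

For example, an article located in Ticino (CH, CH-TI) is kept for Italian, while one located in Zürich (CH, CH-ZH) is not, because only the canton and not the whole country is mapped.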

Feature 2: Keywords on title (reliable CCC)

The second Feature consists of finding the articles whose titles contain keywords related to the language or to the corresponding territories (e.g., “England National football team”, “English law”, etc.). Each language edition has a list of keywords based on the demonyms and language names, which retrieves a short list of generalist articles that are reliably part of the final CCC.
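A minimal sketch of this title check is shown below. It assumes case-insensitive, whole-word matching, which is a simplification chosen for the example and not necessarily how ccc_selection.py matches titles. The keyword subset is illustrative.

```python
import re

# Illustrative subset of the English keyword list.
ENGLISH_KEYWORDS = ["England", "English"]


def title_matches_keyword(title, keywords):
    """Return the first keyword found as a whole word in the title, or None."""
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, title, flags=re.IGNORECASE):
            return keyword
    return None
```

Whole-word matching avoids hits on unrelated words that merely share a prefix, but it cannot tell apart different uses of the same word, so the keyword lists need to be curated.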

Feature 3: Category crawling (potential CCC)

 
Crawling down the category graph with keywords for English language.

The third Feature aims at qualifying articles according to how their position in the category graph relates to the same keywords. In a similar way to the second Feature, we start from the list of keywords associated to a language and the corresponding territories, and retrieve all the categories that include such keywords in their titles; for example: “Performing Arts in England” or “Disputes in English Grammar” in the English Wikipedia. These categories contain articles and other categories which contain in turn more specific articles (see Figure), until at a certain level the process of crawling and gathering articles finishes. The precise point where the process ends depends on how the category structures have been constructed (smaller Wikipedia language editions in number of articles also tend to have a less developed category graph).

The main advantage of this Feature is that it makes it possible to obtain articles related to the keywords. However, the distance to the top also matters: while the category “Films directed by Charlie Chaplin” is part of the category “Performing Arts in England”, its content will be considerably more specific. The further from the top category containing the keyword, the more specific and the less related to the original top category topic the articles will be. The drawback of category crawling is that the categorization sometimes includes circular references or does not follow a specialization path (e.g., occasionally a more general category appears under a more specific one; other times a category is related to the immediately preceding one but totally unrelated to the ones before it). Such phenomena may introduce noise in the collection (e.g., the category “Wars involving the United States” includes the category “World War II”, which in turn leads to articles about the German army and makes them appear as related to the English Wikipedia cultural contexts).

For each article encountered with this Feature we qualify it with the level in which it is found and the number of paths in which it was found. Although most of the content relates to the cultural context, not all of it does. Therefore, we consider that articles with these two qualifiers are potentially part of the final CCC.
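The crawl described above can be sketched as a breadth-first traversal that records, for each article, the shallowest level at which it appears and how many crawled categories contain it. This is a simplified illustration: the function name, the max_level cut-off and the toy graph are assumptions, and "number of paths" is approximated here by the number of crawled categories that list the article.

```python
from collections import deque


def crawl_categories(top_categories, subcategories, members, max_level=3):
    """Breadth-first crawl starting from the keyword categories.

    Returns {article: (shallowest_level, number_of_paths)}. Each category
    is expanded only once, so circular references cannot loop forever.
    """
    found = {}
    visited = set(top_categories)
    queue = deque((category, 0) for category in top_categories)
    while queue:
        category, level = queue.popleft()
        for article in members.get(category, []):
            first_level, paths = found.get(article, (level, 0))
            found[article] = (min(first_level, level), paths + 1)
        if level < max_level:
            for child in subcategories.get(category, []):
                if child not in visited:
                    visited.add(child)
                    queue.append((child, level + 1))
    return found


# Toy graph with a circular reference between the two categories.
subcategories = {
    "Performing Arts in England": ["Films directed by Charlie Chaplin"],
    "Films directed by Charlie Chaplin": ["Performing Arts in England"],
}
members = {
    "Performing Arts in England": ["Article A"],
    "Films directed by Charlie Chaplin": ["Article A", "Article B"],
}
result = crawl_categories(["Performing Arts in England"], subcategories, members)
```

Here "Article A" is found at level 0 through two categories, and "Article B" at level 1 through one, which are the two qualifiers attached to each article.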

Feature 4: Wikidata properties (reliable and potential CCC)

The fourth Feature aims at qualifying articles according to their Wikidata Qitem, its properties and the Qitems those properties point to. We created groups of properties according to how reliably the related Qitems can ascertain that a Qitem is part of CCC.

Those that qualify as reliable CCC:

  • country properties: P17 (country), P27 (country of citizenship), P495 (country of origin), P1532 (country for sport).

Qitems employing these properties with qitems of countries mapped to the language are directly qualified as reliable part of CCC.

  • location properties: P276 (location), P131 (located in the administrative territorial entity), P1376 (capital of), P669 (located on street), P2825 (via), P609 (terminus location), P1001 (applies to jurisdiction), P3842 (located in present-day administrative territorial entity), P3018 (located in protected area), P115 (home venue), P485 (archives at), P291 (place of publication), P840 (narrative location), P1444 (destination point), P1071 (location of final assembly), P740 (location of formation), P159 (headquarters location), P2541 (operating area).

Qitems employing these properties with qitems of countries mapped to the language are directly qualified as reliable part of CCC. Most commonly, these properties are employed with cities or other more specific places. Hence, the method first uses the territories from the Language-Territories Mapping in order to obtain a first group of items, and then iterates several times to crawl down to more specific entities (regions, subregions, cities, towns, etc.). As a result, all articles are finally qualified as located in a territory or in any of the places it contains. It is worth noting that not all of the location properties imply the same relationship strength.

  • language properties (strong): P37 (official language), P364 (original language of work), P103 (native language).

Qitems employing these properties with the qitem of the language (or one of its dialects) are directly qualified as reliable part of CCC. These properties are used both for characterizing works (from theatre plays to institutions) and people.

  • created by properties: P19 (place of birth), P112 (founded by), P170 (creator), P84 (architect), P50 (author), P178 (developer), P943 (programmer), P676 (lyrics by), P86 (composer).

Qitems employing these properties with one of the qitems already qualified as reliable CCC are also directly qualified as reliable part of CCC. Although some of these relationships can be fortuitous, we consider them important enough to qualify an article including them as CCC, in a wide interpretation of which entities are involved in a cultural context. These properties are usually used for characterizing people and works.

  • part of properties: P361 (part of).

Qitems employing these properties with one of the qitems already qualified as reliable CCC are also directly qualified as part of CCC. This property is used mainly for characterizing groups, places and work collections.

Those that qualify as potential CCC:

  • language properties (weak): P407 (language of work or name), P1412 (language spoken), P2936 (language used).

These properties are related to language but present a weaker relationship towards it. Therefore, Qitems employing them with the language (or one of its dialects) may be related to it in a more complementary way and not be reliably part of CCC. Hence, they are qualified as potential CCC.

  • affiliation properties: P463 (member of), P102 (member of political party), P54 (member of sports team), P69 (educated at), P108 (employer), P39 (position held), P937 (work location), P1027 (conferred by), P166 (award received), P118 (league), P611 (religious order), P1416 (affiliation), P551 (residence).

Qitems employing these properties with one of the qitems already qualified as reliable CCC are potential part of CCC. Affiliation properties present a weaker relationship from one item towards another than the created by properties. It is not possible to assess how central such a property is for the Qitems having it, hence these are qualified as potential CCC.

  • has part properties: P527 (has part), P150 (contains administrative territorial entity).

Qitems employing these properties with one of the qitems already qualified as reliable CCC are potential part of CCC, as they could be bigger instances of the territory that might include other territories outside the language context.
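The property groups above can be sketched as a lookup from property id to qualification. This is a simplified illustration with hypothetical names: it treats all context Qitems (territories, the language and items already qualified as reliable CCC) as a single set, whereas the method described above distinguishes which kind of Qitem each group must point to. The location group is only a subset of the full list.

```python
RELIABLE_GROUPS = {
    "country": {"P17", "P27", "P495", "P1532"},
    "location": {"P276", "P131", "P1376", "P159"},  # subset of the list above
    "language_strong": {"P37", "P364", "P103"},
    "created_by": {"P19", "P112", "P170", "P84", "P50",
                   "P178", "P943", "P676", "P86"},
    "part_of": {"P361"},
}
POTENTIAL_GROUPS = {
    "language_weak": {"P407", "P1412", "P2936"},
    "affiliation": {"P463", "P102", "P54", "P69", "P108", "P39", "P937",
                    "P1027", "P166", "P118", "P611", "P1416", "P551"},
    "has_part": {"P527", "P150"},
}


def _in_groups(prop, groups):
    """Return True when the property belongs to any group in the mapping."""
    return any(prop in properties for properties in groups.values())


def qualify_item(claims, ccc_qitems):
    """Return 'reliable', 'potential' or None for one item's claims.

    claims: iterable of (property_id, target_qitem) pairs.
    ccc_qitems: Qitems already tied to the language context.
    """
    qualification = None
    for prop, target in claims:
        if target not in ccc_qitems:
            continue
        if _in_groups(prop, RELIABLE_GROUPS):
            return "reliable"
        if _in_groups(prop, POTENTIAL_GROUPS):
            qualification = "potential"
    return qualification
```

For instance, with Italy (Q38) in the context set, an item whose country (P17) is Q38 is qualified as reliable, while an item that only has residence (P551) in Q38 is qualified as potential.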

Feature 5: Inlinks / Outlinks to CCC (potential CCC)

The fifth Feature aims at qualifying articles according to where their incoming links come from and where their outgoing links point to. Links are valuable qualifiers, closely related to the article's actual text and to its value for other articles. Hence, for each article we count the number of links coming from other articles already qualified as reliable CCC (inlinks from CCC), and compute the percentage in relation to all the incoming links (percent of inlinks from CCC), in order to know how likely it is that this article is only used to extend the content of CCC articles. Likewise, for each article we count the number of links going to other articles already qualified as reliable CCC (outlinks to CCC) and the equivalent percentage of the total number of outlinks in its text (percent of outlinks to CCC). A high percentage of outlinks to CCC implies that an article is very likely to be part of CCC.
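The four link measures can be computed as in the sketch below, where links are given as lists of page identifiers. The function and key names are hypothetical, and percentages are expressed as fractions between 0 and 1.

```python
def link_features(inlinks, outlinks, reliable_ccc):
    """Count links from/to reliable CCC articles and their share of all links."""

    def count_and_share(links):
        in_ccc = sum(1 for page in links if page in reliable_ccc)
        share = in_ccc / len(links) if links else 0.0
        return in_ccc, share

    num_in, pct_in = count_and_share(inlinks)
    num_out, pct_out = count_and_share(outlinks)
    return {
        "num_inlinks_from_ccc": num_in,
        "percent_inlinks_from_ccc": pct_in,
        "num_outlinks_to_ccc": num_out,
        "percent_outlinks_to_ccc": pct_out,
    }


features = link_features(
    inlinks=["a", "b", "x", "y"],
    outlinks=["a", "x"],
    reliable_ccc={"a", "b"},
)
```

The same computation applies to links towards articles geolocated outside the language territories by passing that set instead of the reliable CCC set.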

Other Languages CCC Features

Feature 6: Geolocated articles not in CCC (reliable non CCC)

Articles that are geolocated in territories belonging to another language's mapping are directly excluded from this language's CCC. Even though there might be some exceptions, articles geolocated outside the territories specified in the language-territory mapping for a language are reliably part of some other language's CCC.

Feature 7: Wikidata properties (reliable non CCC)

For the Wikidata property groups country_wd and location_wd presented previously, we check whether an article uses them while none of their values links to CCC. Hence, similarly to the previous feature, such articles are reliably related to some other language's CCC.
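The condition can be written as: the item has at least one claim in the checked groups, and none of those claims points to a Qitem of the language context. The sketch below uses only the country properties by default. The function name is hypothetical.

```python
COUNTRY_PROPERTIES = {"P17", "P27", "P495", "P1532"}


def relates_only_to_other_contexts(claims, ccc_qitems,
                                   properties=COUNTRY_PROPERTIES):
    """True when the checked properties are used but never point to CCC.

    claims: iterable of (property_id, target_qitem) pairs.
    """
    targets = [target for prop, target in claims if prop in properties]
    return bool(targets) and not any(t in ccc_qitems for t in targets)
```

An item with no country or location claims at all is left unqualified by this check, since nothing can be concluded from missing data.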

Feature 8: Inlinks / Outlinks to geolocated articles not in CCC (potential non CCC)

The eighth and last Feature aims at qualifying articles according to how much their links relate to the territories which are not mapped to the language. Similarly to Feature 5, the number of inlinks and outlinks to geolocated articles not mapped to the language are counted along with their percentage equivalents (i.e. inlinks from geolocated not in CCC, percent inlinks from geolocated not in CCC, outlinks to geolocated not in CCC, percent outlinks to geolocated not in CCC). Articles qualified by this feature are potentially part of other languages' CCC.

These features aim at qualifying articles as potentially part of other languages' CCC, so that it is easier to discard them as not relevant to the language's own CCC.

  • Geolocated articles not in CCC

Articles that are geolocated to territories that belong to other languages mapping are directly excluded from being part of this language CCC. Hence, they are reliable part of some other language CCC.

  • Other CCC Wikidata properties

For each of the Wikidata property groups presented previously, we count the number of times a property is used but does not relate to CCC. Hence, the article is potentially related to some other language's CCC.

  • Other CCC Category Crawling

Articles which appear in the category crawling of other language editions are likely to belong to their CCC. Therefore, articles that appear in them are qualified as potentially part of another CCC, together with the level at which they were found.


Machine Learning (Random Forest) and Manual Assessment

The features described above are used to qualify all the articles from each Wikipedia language edition, and are provided to a classifier in order to expand the reliable CCC collected up to this point into the final CCC dataset.

To do so, the scikit-learn implementation of the Random Forest machine learning classifier is used. In order to train the classifier, all the articles collected as reliable part of CCC are first introduced into it as the positive group (class 1).

Since the only articles that are reliably part of other CCC are the geolocated ones (which tends to be a small group of articles), they cannot be used on their own as the negative group (class 0). Instead, a negative sampling process is employed, in which all the articles which are not class 1 are introduced up to 10 times as class 0, even though they include unqualified articles, articles qualified as potential CCC, articles qualified as potential other CCC and articles qualified as a reliable part of other CCC.

Finally, the classifier is given the data that needs to be categorized as class 1 or class 0, namely all the potential CCC articles. Random Forest combines many decision trees in order to weigh each feature when determining whether an article is class 1 or 0.
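The construction of the training data can be sketched as below. This is one possible reading of the negative sampling step, stated here as an assumption: negatives are drawn at random from all articles that are not class 1, up to 10 times the number of positives. The function name, the ratio parameter and the fixed seed are hypothetical.

```python
import random


def build_training_set(features, reliable_ccc, ratio=10, seed=0):
    """Build (X, y) with reliable CCC as class 1 and sampled others as class 0.

    features: {article: feature_vector}
    reliable_ccc: set of articles qualified as reliable CCC
    ratio: assumed cap on negatives per positive (one reading of "up to 10 times")
    """
    positives = [article for article in features if article in reliable_ccc]
    candidates = [article for article in features if article not in reliable_ccc]
    rng = random.Random(seed)
    n_negatives = min(len(candidates), ratio * len(positives))
    negatives = rng.sample(candidates, n_negatives)
    X = [features[article] for article in positives + negatives]
    y = [1] * len(positives) + [0] * len(negatives)
    return X, y
```

The resulting X and y can then be passed to the fit method of scikit-learn's RandomForestClassifier, and the feature vectors of the potential CCC articles to its predict method.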

The accuracy estimated by the classifier is in the order of .999, and some particular features, like the percentage of outlinks to CCC, the percentage of outlinks to other CCC and the category crawling level, are particularly relevant.

In order to test the accuracy of the classifier, a manual assessment process is normally employed. In the case of this project, the average proportions of false positives and false negatives were approximately 5-5%, which are positive results.
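For reference, the two error proportions can be computed from a manually assessed sample as in the sketch below. It assumes each rate is expressed as a percentage of the whole assessed sample. The text does not state which denominator was used, so this is only one possible definition, and the function name is hypothetical.

```python
def error_rates(assessed):
    """Return (false_positive_pct, false_negative_pct) for a sample.

    assessed: list of (predicted_class, manual_class) pairs, classes 0 or 1.
    """
    total = len(assessed)
    if total == 0:
        return 0.0, 0.0
    false_pos = sum(1 for predicted, manual in assessed
                    if predicted == 1 and manual == 0)
    false_neg = sum(1 for predicted, manual in assessed
                    if predicted == 0 and manual == 1)
    return 100.0 * false_pos / total, 100.0 * false_neg / total
```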

Datasets

Description

The CCC datasets are generated as CSV files in order to facilitate further processing. They are compressed as bzip2 files.
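A compressed dataset can be read directly with the Python standard library, without decompressing it first. The sketch below assumes a UTF-8 file with a header row and comma delimiter. The function name is hypothetical and the actual column names should be taken from the downloaded file.

```python
import bz2
import csv


def read_ccc_dataset(path):
    """Yield one dict per article from a bzip2-compressed CSV file."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

Because the function is a generator, large datasets can be processed row by row without loading the whole file into memory.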

Content

There is a dataset for each language edition including all the articles from the CCC collection. In case you want to obtain the full SQLite3 database with all the datasets, and both CCC and non-CCC articles, you can contact us at tools.wcdo@tools.wmflabs.org.

In each dataset, you may find one line per article with different groups of features:

  • general data:

qitem (from Wikidata),

pageid (in the local Wikipedia),

page title,

date created (creation date timestamp),

geocoordinates,

ISO 3166,

ISO 3166-2.

  • ccc:

ccc binary (1 when the article is CCC, 0 when it is not),

main territory (qitem of the territory the article relates to),

number of retrieval strategies (number of different types of relationships to CCC, reliable or potential, whether they are geocoordinates, category crawling, etc.).

  • reliable ccc features:

ccc geolocated (1 when it is part of CCC, -1 when it belongs to another language's CCC),

country wdproperties (property:qitem of the country it relates to),

location wdproperties (property id and qitem),

language strong wdproperties (property id and qitem),

created by wdproperties (property id and qitem),

part of wdproperties (property id and qitem),

keyword on title (qitem associated to the territory or language name).

  • potential ccc features:

category crawling territories (territory qitem which was used to run the category crawling and found this article),

category crawling level (category graph level where the article has been found),

language weak wdproperties (property id and qitem),

affiliation wdproperties (property id and qitem),

has part wdproperties (property id and qitem),

number of inlinks from CCC (number of incoming links to the article from those articles with the reliable CCC features such as geolocated articles),

number of outlinks to CCC (number of outgoing links from the article to those articles with the reliable CCC features such as geolocated articles),

percent inlinks from CCC (number of such incoming links divided by all the incoming links),

percent outlinks to CCC (number of such outgoing links divided by all the outgoing links).

  • potential negative CCC features:

other ccc country wdproperties (property id and qitem),

other ccc location wdproperties (property id and qitem),

other ccc language strong wdproperties (property id and qitem),

other ccc created by wdproperties (property id and qitem),

other ccc part of wdproperties (property id and qitem),

other ccc language weak wdproperties (property id and qitem),

other ccc affiliation wdproperties (property id and qitem),

other ccc has part wdproperties (property id and qitem),

number of inlinks from geolocated abroad (number of incoming links to the article from articles that are reliably associated with another language's CCC, such as articles geolocated in other CCC),

number of outlinks to geolocated abroad (number of outgoing links from the article to articles that are reliably associated with another language's CCC, such as articles geolocated in other CCC),

percent inlinks from geolocated abroad (number of such incoming links divided by all the incoming links),

percent outlinks to geolocated abroad (number of such outgoing links divided by all the outgoing links).

  • relevance features:

number of inlinks,

number of outlinks,

number of bytes,

number of references,

number of edits,

number of editors,

number of edits in discussions,

number of pageviews (during the last month),

number of wdproperties,

number of interwiki links,

featured article (1 or 0 whether it is or not).


Download

You can download the latest CCC datasets (on a monthly basis) here.

The latest databases generated by the previous scripts are also available at: databases. The latest CCC dataset as a database is named ccc_old.db (SQLite3 file).

Purposes

We keep these datasets:

  • for archival/backup purposes
  • for offline use
  • for academic research
  • for bot use
  • for republishing (don't forget to follow the license terms)
  • for fun!

Monthly Results

The most important results that can be obtained from this dataset are the extent (percentage) of CCC in each Wikipedia language edition and the culture gap (i.e. the extent of one language's CCC in other language editions, and the coverage of other languages' CCC by a specific language). More practical applications, like generating lists of top CCC articles to be translated across languages, can also be derived from this dataset.
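Both measures reduce to simple ratios, sketched below with hypothetical function names. The coverage function assumes that articles are matched across language editions through their Wikidata Qitem, which is the identifier the datasets share.

```python
def extent(ccc_articles, total_articles):
    """Percentage of a language edition's articles that belong to its CCC."""
    if total_articles == 0:
        return 0.0
    return 100.0 * ccc_articles / total_articles


def coverage(source_ccc_qitems, target_edition_qitems):
    """Percentage of one language's CCC that also exists in another edition."""
    source = set(source_ccc_qitems)
    if not source:
        return 0.0
    shared = source & set(target_edition_qitems)
    return 100.0 * len(shared) / len(source)
```

For example, if two of the four Qitems in one language's CCC have an article in another edition, that edition covers 50% of that CCC.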

References

  1. Miquel-Ribé, M., & Rodríguez, H. (2011). Cultural configuration of Wikipedia: measuring Autoreferentiality in different languages. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011 (pp. 316-322).
  2. Miquel-Ribé, M., & Laniado, D. (2016, July). Cultural identities in wikipedias. In Proceedings of the 7th 2016 International Conference on Social Media & Society (p. 24). ACM.
  3. Miquel-Ribé, M. (2017). Identity-based motivation in digital engagement: the influence of community and cultural identity on participation in Wikipedia (Doctoral dissertation, Universitat Pompeu Fabra).
  4. Miquel-Ribé, M., & Laniado, D. (2018). Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions. Frontiers in Physics.
  5. Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM. 2334-0770