Research:Mapping Citation Quality

Created

19:34, 21 May 2018 (UTC)

Contact

Miriam Redi

Wikimedia Foundation

Collaborators

Dario Taraborelli

Wikimedia Foundation

Duration: 2019-January – 2019-?

citations, references, accessibility, machine learning

Research:Projects

This page documents a completed research project.

Idea

This project is a follow-up to the "Citation Needed" research project, where we developed multilingual models to automatically detect sentences needing citations in Wikipedia articles. Here, we apply these models at scale to hundreds of thousands of articles from English, French, and Italian Wikipedias, and quantify the proportion of unsourced content in these spaces and its distribution over different topics.

Data

We want to check for unsourced content across different types of articles.

Article Content

We sample 5% of articles in English, French and Italian Wikipedia, and then randomly sample sentences across these subsets. Below is a summary of the data used for these experiments.

Language	Number of Sentences	Number of Articles	Average Sentences / Article
English	4120432	421101	9.8
Italian	483460	72862	6.6
French	738477	109310	6.7
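The two-stage sampling described above (first articles, then sentences) could be sketched roughly as follows. This is only an illustration, not the project's pipeline: the function name, the `articles` mapping, the seed, and in particular the `max_sentences` cap are assumptions, since the page does not say how many sentences are drawn per article.

```python
import random


def sample_articles_and_sentences(articles, article_fraction=0.05,
                                  max_sentences=10, seed=0):
    """Two-stage random sample: first articles, then sentences within each.

    `articles` maps a title to its list of sentences. `max_sentences` is a
    hypothetical cap; the source does not state how many sentences are drawn.
    """
    if not articles:
        return {}
    rng = random.Random(seed)
    titles = sorted(articles)  # sorted so the sample is reproducible
    n_articles = max(1, round(len(titles) * article_fraction))
    chosen = rng.sample(titles, n_articles)
    sample = {}
    for title in chosen:
        sentences = articles[title]
        k = min(max_sentences, len(sentences))
        sample[title] = rng.sample(sentences, k)  # without replacement
    return sample


# Toy corpus: 100 articles with 20 sentences each.
toy = {f"Article {i}": [f"Sentence {j}" for j in range(20)] for i in range(100)}
sample = sample_articles_and_sentences(toy)
```

With the toy corpus, the 5% fraction selects 5 of the 100 articles, and each selected article contributes at most `max_sentences` sentences.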

Article Characteristics

Using ORES and the Pageviews API, we tag articles along three dimensions, when possible:

Topic: using ORES' topic model, we label all English Wikipedia articles with one or more predicted topics. For non-English articles, we propagate the topic assigned to their corresponding article in English Wikipedia (when it exists), found through Wikidata.
Quality: for English and French Wikipedias, we retain the article quality scores assigned by ORES.
Popularity: for English Wikipedia, we also use the Pageviews API to get the number of page views up to May 2019.
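The topic propagation step can be illustrated with a small sketch. It assumes the cross-language links have already been extracted from Wikidata into a plain dictionary; the function name, the data layout, and the placeholder item IDs and titles are all hypothetical and not taken from the project.

```python
def propagate_topics(english_topics, sitelinks, language):
    """Copy English topic labels to articles in another language.

    english_topics: {English title: [topic, ...]}
    sitelinks: {item ID: {language code: title}} (assumed layout)
    Returns {title in `language`: [topic, ...]} for the articles that have
    an English counterpart with at least one topic.
    """
    propagated = {}
    for links in sitelinks.values():
        local_title = links.get(language)
        english_title = links.get("en")
        if local_title is None or english_title is None:
            continue  # no counterpart in one of the two languages
        topics = english_topics.get(english_title)
        if topics:
            propagated[local_title] = list(topics)
    return propagated


# Placeholder data, made up for the example.
english_topics = {"Example A": ["Medicine"], "Example B": ["Physics", "Mathematics"]}
sitelinks = {
    "item-1": {"en": "Example A", "it": "Esempio A"},
    "item-2": {"en": "Example B", "fr": "Exemple B"},
    "item-3": {"it": "Esempio C"},  # no English article: stays untagged
}
italian_topics = propagate_topics(english_topics, sitelinks, "it")
```

Articles without an English counterpart simply receive no topic, which matches the "when possible" caveat above.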

Lists of Articles Missing Citations

To compare the results of our algorithms with real annotated data, we download the lists of articles that have been marked by editors as "Missing Sources". For example, for English Wikipedia, we use Quarry to get the list of all articles in the category All_articles_needing_additional_references (353070 articles); for Italian, all articles in Categoria:Senza Fonti (167863); and for French, all articles in Catégorie:Article_manquant_de_références (84911).

Methods

After tagging each article according to topic, quality and popularity, we compute how "well sourced" it is. To do so, for each article, we calculate the citation quality, namely the proportion of "well-sourced" sentences (sentences that need a citation and have one) among the sampled sentences that need a citation. To label each sentence according to its citation need, we run the citation need models and annotate each sentence with a binary "citation need" label $y$ according to the model output: $y=[{\hat {y}}]$, where $[\cdot ]$ is the rounding function and $\hat{y}$ is the predicted continuous label.
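As a minimal sketch of the rounding step, the continuous model output can be turned into a binary label with an explicit threshold. The function name and the handling of an exact 0.5 are assumptions; the page only says the output is rounded. An explicit comparison is used because Python's built-in `round` rounds halves to the nearest even integer, so `round(0.5)` is `0`.

```python
def citation_need_label(y_hat, threshold=0.5):
    """Return 1 if the predicted score says the sentence needs a citation."""
    # Ties at the threshold are labelled 1 here; this choice is an assumption.
    return 1 if y_hat >= threshold else 0


scores = [0.91, 0.12, 0.5, 0.49]
labels = [citation_need_label(s) for s in scores]
```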

Next, we define:

$p$: the number of sentences predicted as "needing citations"
$c_i$: the real "citation label" of sentence $i$
- $c_i=0$ if the sentence doesn't have an inline citation in the original text
- $c_i=1$ if the sentence has an inline citation in the original text
$P$: the set of the $p$ sentences needing citations according to the algorithm, namely the ones for which $y=1$.

The citation quality $Q$ for an article is then calculated as:

$$Q={\frac {1}{p}}\sum _{i\in P}c_{i}$$

When $Q=0$, the citation quality is very low: none of the sentences classified by the model as needing citations actually has an inline citation in the original text.

We consider articles for which $n\geq 5$
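The definition of $Q$ can be written as a short function. This is a sketch under stated assumptions: the input format (a list of `(y, c)` pairs per article) and the choice to return `None` when no sentence is predicted as needing a citation are not specified on this page.

```python
def citation_quality(sentences):
    """Compute the citation quality Q for one article.

    `sentences` is a list of (y, c) pairs: y is the model's binary
    citation-need label, c is 1 if the sentence has an inline citation
    in the original text and 0 otherwise.
    Returns None when p = 0 (assumed behaviour; Q is undefined there).
    """
    cited = [c for y, c in sentences if y == 1]  # the labels c_i for i in P
    p = len(cited)
    if p == 0:
        return None
    return sum(cited) / p


# Six sampled sentences; four are predicted as needing a citation,
# and two of those four actually have one.
article = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 0)]
q = citation_quality(article)
```

In this toy article, sentences with $y=0$ do not affect the score, whether or not they carry a citation.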

Results: Quantitative

We present here some results correlating overall article quality and article citation quality.

Citation Quality vs. Article Quality (and Popularity)

We correlate, for each article, the citation quality score $Q$ with the article quality score output by ORES, $AQ$, using the Pearson correlation coefficient. For the two languages where ORES' quality scores are available, we observe a statistically significant positive correlation between these two quantities. While this is somewhat expected, these results provide a "sanity check" for the statistical soundness and accuracy of our article citation quality score.

Language	$\rho (Q,AQ)$
English	0.34
French	0.26
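For reference, the Pearson coefficient between two lists of per-article scores can be computed as in the sketch below. The toy numbers are made up, and the page does not say how ORES quality predictions were turned into numeric $AQ$ values, so the numeric scale here is an assumption. Significance testing is left out; a library routine such as `scipy.stats.pearsonr` also returns a p-value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five articles.
citation_q = [0.1, 0.4, 0.5, 0.9, 0.3]
article_q = [1.0, 2.0, 2.0, 4.0, 3.0]  # hypothetical numeric quality scale
rho = pearson(citation_q, article_q)
```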

For English, we also computed the correlation between citation quality and article popularity. We found here a correlation of $\rho =0.09$ , a significant value, though weaker than the correlation between citation quality and article quality. This correlation is probably due to the fact that very popular articles tend also to be of high quality (there is a significant correlation of $\rho =0.14$ between article quality and popularity).

Breakdown of Citation Quality by Topic

We cross the topic information we have about each article with the overall citation quality. We find that, across languages, the best-sourced articles are the Medicine and Biology articles. "Language and Literature", the topic category hosting most biographies, also ranks among the best-sourced topics. We find that articles in Mathematics and Physics tend to be marked as poorly sourced. This is probably because these articles don't report many inline citations: the proof of the scientific claims is in the formulas and equations that follow, and these articles tend to list a few general references instead. We give more insight into these corner cases in the qualitative analysis section.
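A per-topic breakdown of this kind amounts to averaging $Q$ over the articles tagged with each topic. The sketch below assumes that an article with several predicted topics counts once in each of them; the page does not say how multi-topic articles were handled, and the scores are invented for illustration.

```python
from collections import defaultdict


def mean_quality_by_topic(articles):
    """Average citation quality per topic.

    `articles` is an iterable of (Q, [topic, ...]) pairs. An article with
    several topics contributes to each of them (an assumption).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for q, topics in articles:
        for topic in topics:
            totals[topic] += q
            counts[topic] += 1
    return {topic: totals[topic] / counts[topic] for topic in totals}


# Invented scores, not results from the project.
toy_articles = [
    (0.9, ["Medicine"]),
    (0.7, ["Medicine", "Biology"]),
    (0.2, ["Mathematics"]),
    (0.4, ["Mathematics", "Physics"]),
]
by_topic = mean_quality_by_topic(toy_articles)
```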

Breakdown of citation quality by article topic for articles in French Wikipedia

Breakdown of citation quality by article topic for articles in English Wikipedia

Breakdown of citation quality by article topic for articles in Italian Wikipedia

Citation Quality of Articles Marked as "Missing Sources"

To get an aggregated view of the citation quality scores for articles marked by the community as "missing sources", and compare it with the articles not marked as "missing sources", we compute the average scores assigned by our models on the two groups of articles. We see that, on average, articles marked as "missing sources" receive a much lower citation quality score.

Language	$Q$ for Marked Articles	$Q$ for non-marked Articles	Average $Q$
English	0.49	0.66	0.64
French	0.28	0.49	0.47
Italian	0.23	0.45	0.41
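The three columns of the table correspond to three averages over the same set of scored articles. A minimal sketch, with hypothetical names and invented scores, and assuming the community list is available as a set of titles:

```python
def mean_quality_by_flag(scores, marked_titles):
    """Average Q for marked articles, non-marked articles, and all articles.

    `scores` maps article title -> Q; `marked_titles` is the set of titles
    flagged by editors as missing sources.
    """
    def mean(values):
        return sum(values) / len(values) if values else None

    marked = [q for title, q in scores.items() if title in marked_titles]
    non_marked = [q for title, q in scores.items() if title not in marked_titles]
    return {
        "marked": mean(marked),
        "non_marked": mean(non_marked),
        "all": mean(list(scores.values())),
    }


# Invented scores, not results from the project.
toy_scores = {"A": 0.2, "B": 0.4, "C": 0.8, "D": 0.6}
averages = mean_quality_by_flag(toy_scores, marked_titles={"A", "B"})
```

The overall average is weighted by group size, so it sits closer to the non-marked average when few articles are marked.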

To get a closer look at the behavior of individual articles, we plot below the citation quality scores of English Wikipedia articles, both marked and not marked as "missing sources". Each vertical line in the plot represents one article. The color indicates whether the article is marked as "missing sources" (magenta) or not (green). The height of the line is the citation quality score assigned to the article. As we can see, most articles marked as "missing sources" have low citation quality scores. However, we also see cases where articles with low citation quality scores are not marked as "missing sources", and cases where articles with high citation quality scores are marked as "missing sources". We give an in-depth view of these cases in the Qualitative Analysis section. We get similar results for French and Italian Wikipedias.

Citation Quality Scores for Articles marked as "Missing Sources"
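The two kinds of disagreement between the model and the community list can be extracted with a simple filter. The `low` and `high` cut-offs below are arbitrary values chosen for illustration, as are the function name and the toy data; the page does not define thresholds for "low" or "high" citation quality.

```python
def find_disagreements(scores, marked_titles, low=0.2, high=0.8):
    """List articles where the score and the community flag disagree.

    Returns (low_unmarked, high_marked):
      low_unmarked: titles with Q <= low that are not flagged by editors
      high_marked:  titles with Q >= high that are flagged by editors
    """
    low_unmarked = sorted(
        title for title, q in scores.items()
        if q <= low and title not in marked_titles
    )
    high_marked = sorted(
        title for title, q in scores.items()
        if q >= high and title in marked_titles
    )
    return low_unmarked, high_marked


# Invented scores, not results from the project.
toy_scores = {"A": 0.0, "B": 0.1, "C": 0.9, "D": 1.0, "E": 0.5}
low_unmarked, high_marked = find_disagreements(toy_scores, marked_titles={"A", "D"})
```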

Results: Qualitative Examples

We show here some specific examples of articles with high or low citation quality scores. We limit this qualitative analysis to English Wikipedia, for ease of understanding.

Low Citation Quality Examples

Some articles with low citation quality scores have already been marked as "missing sources" by the Wikimedia community, for example:

Places: en:Riverview School District (Pennsylvania)
People: en:Kenges Rakishev

In other cases, articles detected as "low citation quality" by our models have not been recognized as missing citations. Some of them are also biographies:

Literature: en:Norwegian literature
Biographies: en:Bihari brothers

Some articles about scientific topics are detected as low citation quality. Should we consider those articles as missing sources?

Chemistry: en:Stoichiometry has a 0 citation quality score due to unsourced sentences like

"A stoichiometric reactant is a reactant that is consumed in a reaction, as opposed to a catalytic reactant, which is not consumed in the overall reaction because it reacts in one step and is regenerated in another step."

Computing: en:Analogical modeling has entire paragraphs left unsourced, and the model flags as missing citations sentences like

"In bitwise logical operations (e.g., logical AND, logical OR), the operand fragments may be processed in any arbitrary order because each partial depends only on the corresponding operand fragments (the stored carry bit from the previous ALU operation is ignored)."

In some cases the model makes mistakes. One of the most common involves lists of fictional characters:

Books: en:List_of_Wild_Cards_characters
TV Series: en:List_of_Marvel_Comics_characters

High Citation Quality Examples

Some articles detected by our model as "high citation quality" are clear examples of very well sourced pages:

For example, Stephen Hawking's article is a former featured article.
Important articles for knowledge dissemination, such as the one on Vaccine controversies, are examples of very well sourced articles.
The model recognizes that sentences in the "plot" section of an article about movies or books do not need citations, and therefore considers B-class articles like The Mummy: Tomb of the Dragon Emperor as high citation quality articles.
When Physics articles contain well sourced sections about historical facts, as well as technical sections, the model detects them as "high citation quality", see for example the article on Catenary.

The citation quality score is computed on a sample of sentences from the article. In some cases, sentences outside the sample are marked as "citation needed", so the article falls in the "missing sources" category even though the model assigns it a high score.

For example, generally well sourced articles, such as the Cavaquinho article, have a few sentences marked as "citation needed", but the model outputs a citation quality score of 1.0.
Similarly, in Joe Dolan's biography, there is one sentence marked as "citation needed". However, this is a generally well sourced article, and thus our model gives a high citation quality score.
This article about the 1974 Greek Referendum has a section marked as "missing references". However, the model is not sampling from that section, thus giving a very high citation quality score to that article.