Research:Prioritization of Wikipedia Articles/Importance/WikiProjects
Past research has examined how WikiProjects on English Wikipedia assess article importance and how "predictable" these assessments are from basic details about the article, such as how many other articles link to it or the pageviews it receives. This page has two goals:
- Present a simple approach for data collection around importance assessments
- Provide some basic insights into how "contextual" an importance assessment is -- i.e. how much does it depend on the particular WikiProject
Code: https://github.com/geohci/wiki-prioritization/tree/master/wikiproject_importance
Gathering data
Past research gives an excellent overview of the particulars of the article importance scale and notes some of the challenges of gathering this data. Luckily, this importance data is now collated in a much more structured format in the page_assessments table for English Wikipedia (and a few other languages). As such, it is a much more straightforward query to gather importance assessments for all articles in a given wiki:
Get Wikipedia articles tagged with WikiProject templates |
---|
SELECT
  pa.pa_page_id AS article_pid,
  pap.pap_project_title AS wp_template,
  p.page_latest AS article_revid,
  p.page_title AS title,
  ptalk.page_id AS talk_pid,
  ptalk.page_latest AS talk_revid,
  pa.pa_importance AS importance
FROM page_assessments pa
-- Resolve each assessment's project ID to the WikiProject name
INNER JOIN page_assessments_projects pap
  ON (pa.pa_project_id = pap.pap_project_id)
-- Keep only non-redirect pages in the article namespace (0)
INNER JOIN page p
  ON (pa.pa_page_id = p.page_id
    AND p.page_namespace = 0
    AND p.page_is_redirect = 0)
-- Attach the corresponding talk page (namespace 1) by matching title
INNER JOIN page ptalk
  ON (p.page_title = ptalk.page_title
    AND ptalk.page_namespace = 1)
|
The raw assessments then have to be standardized, because the field is not strict about which values it accepts. Specifically, I do the following with the data:
- I drop the following values:
- Unknown, NA, na
- And map the rest to four categories:
- Low: Low, low, Bottom, Related
- Mid: Mid, mid
- High: High, high
- Top: Top, top
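
The standardization above can be sketched as a simple lookup table. This is a minimal illustration, not the code from the linked repository: the names `IMPORTANCE_MAP`, `DROPPED_VALUES`, and `standardize_importance` are hypothetical, and treating any value that appears in neither list as dropped is an assumption.

```python
# Mapping copied from the list above: raw value -> standardized level.
IMPORTANCE_MAP = {
    "Low": "Low", "low": "Low", "Bottom": "Low", "Related": "Low",
    "Mid": "Mid", "mid": "Mid",
    "High": "High", "high": "High",
    "Top": "Top", "top": "Top",
}

# Values that are explicitly dropped.
DROPPED_VALUES = {"Unknown", "NA", "na"}


def standardize_importance(raw):
    """Return the standardized level, or None if the value is dropped.

    Assumption: values found in neither list are also dropped (None).
    """
    if raw is None or raw in DROPPED_VALUES:
        return None
    return IMPORTANCE_MAP.get(raw)


raw_values = ["Low", "Bottom", "NA", "top", "Mid", "Unknown"]
standardized = [standardize_importance(v) for v in raw_values]
kept = [v for v in standardized if v is not None]
```

Here `kept` holds only the four standardized levels, so it can be counted or joined back to articles directly.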
Results
For English Wikipedia, we get the following counts of how many times each assessment level appeared (articles can have multiple assessments):
- Low: 3,687,536 assessments (79%)
- Mid: 751,592 assessments (16%)
- High: 174,025 assessments (4%)
- Top: 40,411 assessments (1%)
As one can see, the large majority of assessments are Low importance, and Top importance is quite rare.
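
As a quick check, the percentages can be recomputed from the raw counts listed above. This is only a sketch; the variable names are illustrative.

```python
# Assessment counts copied from the list above.
counts = {"Low": 3_687_536, "Mid": 751_592, "High": 174_025, "Top": 40_411}

total = sum(counts.values())

# Share of each level as a whole-number percentage of all assessments.
shares = {level: round(100 * n / total) for level, n in counts.items()}

for level, pct in shares.items():
    print(f"{level}: {counts[level]:,} assessments ({pct}%)")
```

The total here counts assessments, not articles, since a single article can carry assessments from several WikiProjects.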
Importance by Topic
Below is data on the distribution of importance templates by article for articles on English Wikipedia, split into the different ORES taxonomy topics. Note, the topic assessments are based on what WikiProjects have tagged an article, not topics as predicted by a classification model. Many articles fall under multiple topics and are counted for each. A single WikiProject can only contribute a single assessment per article, but multiple WikiProjects might tag an article and provide assessments (example). The columns are:
- # articles: number of articles that are part of the topic on English Wikipedia (the rest of the values are proportions of this number)
- no assess.: proportion of articles that don't have a single importance assessment
- single assess.: proportion of articles with a single importance assessment
- mult assess.: proportion of articles with more than one importance assessment
- agreed: multiple assessments but they all were the same level -- e.g., Low and Low
- adjacent: multiple assessments but they were adjacent levels -- e.g., Low and Mid
- two steps: multiple assessments that were two steps away -- e.g., Low and High
- full: multiple assessments that were the full range -- i.e. Low and Top
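
The column definitions above can be sketched as a small classifier over the standardized levels for one article. This is a hedged illustration rather than the repository's code: `classify_article` is a hypothetical name, and using the gap between the lowest and highest level when an article has three or more assessments is an assumption.

```python
# Ordinal position of each standardized importance level.
LEVEL_RANK = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}

# Gap between lowest and highest level -> column name used in the table.
SPREAD_LABEL = {0: "agreed", 1: "adjacent", 2: "two steps", 3: "full"}


def classify_article(assessments):
    """Map one article's list of standardized assessments to a table column."""
    if not assessments:
        return "no assess."
    if len(assessments) == 1:
        return "single assess."
    ranks = [LEVEL_RANK[a] for a in assessments]
    # Assumption: with 3+ assessments, only the overall range matters.
    return SPREAD_LABEL[max(ranks) - min(ranks)]


examples = {
    "no tags": [],
    "one project": ["Mid"],
    "two projects agree": ["Low", "Low"],
    "neighbouring levels": ["Low", "Mid"],
    "two apart": ["Low", "High"],
    "full range": ["Low", "Mid", "Top"],
}
labels = {name: classify_article(a) for name, a in examples.items()}
```

Under this reading, the four spread categories partition the articles counted under "mult assess.".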
From the table below, we can see that topics like Society, Philosophy/Religion, and History are most likely to show a wider range of importance assessments, but this seems to arise from articles in those topics being more likely to be tagged by multiple WikiProjects. It is also very rare for articles tagged by multiple WikiProjects to have consistent importance assessments -- the assessments are most likely to be adjacent, such as Mid and High. Finally, it's clear that articles tagged as Low importance (79% of assessments) are very unlikely to be tagged by multiple WikiProjects -- i.e. they tend to have a narrow scope -- because otherwise we would expect much greater agreement between WikiProjects tagging the same article.
Topic | # articles | no assess. | single assess. | mult assess. | agreed | adjacent | two steps | full |
---|---|---|---|---|---|---|---|---|
History and Society.Society | 64148 | 0.185758 | 0.565302 | 0.24894 | 0.000452 | 0.170434 | 0.058724 | 0.01933 |
Culture.Philosophy and religion | 158108 | 0.09546 | 0.696429 | 0.208111 | 0.000829 | 0.150935 | 0.042578 | 0.013769 |
Geography.Regions.Africa.Central Africa | 11458 | 0.266539 | 0.533165 | 0.200297 | 0.000175 | 0.155263 | 0.034561 | 0.010298 |
History and Society.History | 173010 | 0.170921 | 0.629796 | 0.199283 | 0.000329 | 0.148043 | 0.040298 | 0.010612 |
STEM.Medicine & Health | 76955 | 0.107738 | 0.69846 | 0.193802 | 0.000195 | 0.148554 | 0.038022 | 0.00703 |
Culture.Media.Software | 23391 | 0.278098 | 0.530033 | 0.191869 | 0.001069 | 0.141422 | 0.043265 | 0.006113 |
Geography.Regions.Africa.Eastern Africa | 35776 | 0.266547 | 0.545114 | 0.188339 | -- | 0.136432 | 0.041564 | 0.010342 |
Culture.Visual arts.Architecture | 167657 | 0.076853 | 0.735901 | 0.187245 | 0.004754 | 0.148446 | 0.030556 | 0.003489 |
STEM.Physics | 19164 | 0.013671 | 0.802233 | 0.184095 | 0.015028 | 0.126905 | 0.035796 | 0.006366 |
STEM.Mathematics | 23688 | 0.067671 | 0.74886 | 0.183468 | 0.078141 | 0.07717 | 0.023092 | 0.005066 |
STEM.Libraries & Information | 9965 | 0.080983 | 0.738083 | 0.180933 | 0.000803 | 0.138083 | 0.033718 | 0.008329 |
Culture.Visual arts.Visual arts* | 280611 | 0.158746 | 0.694171 | 0.147083 | 0.007387 | 0.112248 | 0.02409 | 0.003357 |
History and Society.Education | 98776 | 0.151282 | 0.701982 | 0.146736 | 0.001002 | 0.107395 | 0.032467 | 0.005872 |
Geography.Regions.Africa.Africa* | 147027 | 0.278507 | 0.575037 | 0.146456 | 0.000109 | 0.111102 | 0.027886 | 0.007359 |
Geography.Regions.Africa.Southern Africa | 25755 | 0.226286 | 0.627412 | 0.146302 | 0.000078 | 0.123199 | 0.01821 | 0.004815 |
STEM.Earth and environment | 80403 | 0.077049 | 0.78063 | 0.142321 | 0.000211 | 0.107222 | 0.029912 | 0.004975 |
Geography.Geographical | 313358 | 0.073446 | 0.785226 | 0.141327 | 0.000109 | 0.116123 | 0.021123 | 0.003973 |
Geography.Regions.Africa.Northern Africa | 30895 | 0.210714 | 0.650688 | 0.138598 | 0.000356 | 0.103447 | 0.0268 | 0.007995 |
Culture.Literature | 203361 | 0.183934 | 0.678193 | 0.137873 | 0.006402 | 0.107449 | 0.019812 | 0.004209 |
Geography.Regions.Asia.Central Asia | 11776 | 0.115489 | 0.746858 | 0.137653 | -- | 0.096637 | 0.033033 | 0.007982 |
Geography.Regions.Asia.South Asia | 241528 | 0.087133 | 0.77827 | 0.134597 | 0.000248 | 0.106712 | 0.024001 | 0.003635 |
Culture.Visual arts.Comics and Anime | 37093 | 0.018818 | 0.847222 | 0.133961 | 0.03405 | 0.081093 | 0.015879 | 0.002939 |
Culture.Biography.Women | 263244 | 0.211568 | 0.660893 | 0.127539 | 0.000874 | 0.107334 | 0.016578 | 0.002754 |
Culture.Visual arts.Fashion | 12653 | 0.257251 | 0.61685 | 0.125899 | 0.000079 | 0.096183 | 0.025844 | 0.003794 |
Geography.Regions.Asia.North Asia | 85780 | 0.059827 | 0.818163 | 0.12201 | 0.000455 | 0.096258 | 0.021042 | 0.004255 |
Culture.Media.Books | 60291 | 0.226137 | 0.656682 | 0.117182 | 0.001028 | 0.097295 | 0.016553 | 0.002305 |
Geography.Regions.Oceania | 241103 | 0.040215 | 0.84276 | 0.117025 | 0.000174 | 0.099119 | 0.015782 | 0.001949 |
Geography.Regions.Europe.Northern Europe | 429847 | 0.136248 | 0.748145 | 0.115606 | 0.000384 | 0.095529 | 0.016859 | 0.002834 |
Geography.Regions.Africa.Western Africa | 36403 | 0.358899 | 0.526495 | 0.114606 | 0.000027 | 0.082933 | 0.024861 | 0.006785 |
STEM.Space | 33298 | 0.043997 | 0.843474 | 0.112529 | 0.003814 | 0.083458 | 0.021293 | 0.003964 |
History and Society.Military and warfare | 213470 | 0.365293 | 0.525994 | 0.108713 | 0.000309 | 0.082316 | 0.021802 | 0.004286 |
Culture.Media.Television | 116103 | 0.2425 | 0.649303 | 0.108197 | 0.003988 | 0.078473 | 0.022015 | 0.003721 |
Geography.Regions.Asia.Asia* | 785879 | 0.13687 | 0.761822 | 0.101308 | 0.000328 | 0.081384 | 0.01684 | 0.002756 |
Culture.Internet culture | 48879 | 0.055648 | 0.843859 | 0.100493 | 0.007181 | 0.069969 | 0.020029 | 0.003314 |
STEM.Chemistry | 29487 | 0.131109 | 0.769661 | 0.09923 | 0.000441 | 0.074677 | 0.020958 | 0.003154 |
Geography.Regions.Asia.Southeast Asia | 92044 | 0.140009 | 0.762385 | 0.097605 | 0.000261 | 0.080364 | 0.014776 | 0.002205 |
Geography.Regions.Americas.South America | 102947 | 0.142996 | 0.759478 | 0.097526 | 0.000204 | 0.079439 | 0.01526 | 0.002623 |
Geography.Regions.Asia.East Asia | 180427 | 0.146369 | 0.757276 | 0.096355 | 0.000571 | 0.078985 | 0.014787 | 0.002012 |
Culture.Performing arts | 41368 | 0.294068 | 0.610689 | 0.095243 | 0.000266 | 0.073414 | 0.017356 | 0.004206 |
Geography.Regions.Europe.Europe* | 1197367 | 0.174059 | 0.731506 | 0.094435 | 0.000408 | 0.078784 | 0.012933 | 0.00231 |
STEM.Engineering | 84612 | 0.410261 | 0.496797 | 0.092942 | 0.000461 | 0.073595 | 0.016109 | 0.002777 |
Geography.Regions.Europe.Eastern Europe | 267187 | 0.134206 | 0.775464 | 0.09033 | 0.000389 | 0.077137 | 0.010816 | 0.001987 |
Culture.Sports | 933564 | 0.18111 | 0.729999 | 0.088891 | 0.001055 | 0.074605 | 0.011495 | 0.001735 |
History and Society.Transportation | 223236 | 0.315558 | 0.595921 | 0.088521 | 0.000287 | 0.074316 | 0.012435 | 0.001483 |
Geography.Regions.Europe.Western Europe | 304552 | 0.209015 | 0.706014 | 0.084971 | 0.000581 | 0.0712 | 0.011253 | 0.001937 |
Geography.Regions.Europe.Southern Europe | 209630 | 0.24216 | 0.673978 | 0.083862 | 0.000234 | 0.066641 | 0.013748 | 0.003239 |
Culture.Biography.Biography* | 1838864 | 0.339359 | 0.579076 | 0.081565 | 0.001282 | 0.065869 | 0.012143 | 0.00227 |
STEM.STEM* | 880243 | 0.107787 | 0.811999 | 0.080214 | 0.002273 | 0.061126 | 0.014377 | 0.002438 |
All Articles | 5890819 | 0.292323 | 0.631763 | 0.075915 | 0.000895 | 0.060812 | 0.012164 | 0.002044 |
Culture.Media.Video games | 37133 | 0.004632 | 0.927288 | 0.06808 | 0.009183 | 0.045539 | 0.011607 | 0.00175 |
Culture.Media.Media* | 924129 | 0.366187 | 0.567583 | 0.06623 | 0.001308 | 0.05088 | 0.011902 | 0.002139 |
Geography.Regions.Asia.West Asia | 182848 | 0.22406 | 0.710175 | 0.065765 | 0.000175 | 0.051261 | 0.011583 | 0.002745 |
Culture.Media.Radio | 29169 | 0.219685 | 0.726696 | 0.053619 | 0.000137 | 0.041962 | 0.009634 | 0.001886 |
Culture.Media.Music | 389662 | 0.431751 | 0.518726 | 0.049522 | 0.000252 | 0.038626 | 0.00879 | 0.001855 |
Culture.Media.Films | 242294 | 0.429751 | 0.523731 | 0.046518 | 0.001284 | 0.035527 | 0.008052 | 0.001655 |
Culture.Linguistics | 98242 | 0.745934 | 0.214043 | 0.040024 | 0.000234 | 0.028002 | 0.010047 | 0.001741 |
STEM.Biology | 462863 | 0.013555 | 0.94666 | 0.039785 | 0.000065 | 0.032489 | 0.00619 | 0.001041 |
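
One way to sanity-check how the columns relate is to confirm, for a few rows copied from the table, that "no assess.", "single assess.", and "mult assess." sum to 1 and that the four spread columns sum to "mult assess.". This is a sketch under that reading of the columns; the tolerance is an assumption to allow for rounding in the published values.

```python
import math

# (no, single, mult, agreed, adjacent, two steps, full), copied from the table.
ROWS = {
    "History and Society.Society":
        (0.185758, 0.565302, 0.24894, 0.000452, 0.170434, 0.058724, 0.01933),
    "STEM.Mathematics":
        (0.067671, 0.74886, 0.183468, 0.078141, 0.07717, 0.023092, 0.005066),
    "All Articles":
        (0.292323, 0.631763, 0.075915, 0.000895, 0.060812, 0.012164, 0.002044),
}


def row_is_consistent(row, tol=1e-5):
    """Check both partitions for a single table row, up to rounding error."""
    no, single, mult, agreed, adjacent, two_steps, full = row
    top_level_ok = math.isclose(no + single + mult, 1.0, abs_tol=tol)
    spread_ok = math.isclose(agreed + adjacent + two_steps + full, mult, abs_tol=tol)
    return top_level_ok and spread_ok


results = {topic: row_is_consistent(row) for topic, row in ROWS.items()}
```

Rows with a missing "agreed" value (shown as "--") would need that entry treated as zero or skipped before applying the same check.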