Research talk:Automated classification of article importance/Work log/2017-06-30
Friday, June 30, 2017
Today I'll wrap up some documentation work, tie up loose ends, and do a bit of additional gap analysis.
WikiProjects Quality/Importance Analysis
Just as we created confusion matrices of predicted versus actual importance, we can build matrices of predicted article quality against predicted or actual importance for each of the WikiProjects we have studied. We use the Objective Revision Evaluation Service (ORES) to predict article quality, scoring the revision of each article that was current when that WikiProject's dataset was gathered.
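The cross-tabulation behind the tables in this section can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names `cross_tabulate` and `format_table` are hypothetical, and the sample pairs are made up rather than drawn from any WikiProject dataset.

```python
from collections import Counter

QUALITY = ["Stub", "Start", "C", "B", "GA", "FA"]
IMPORTANCE = ["Low", "Mid", "High", "Top"]


def cross_tabulate(pairs):
    """Count articles per (quality, importance) cell.

    `pairs` is an iterable of (predicted_quality, importance) labels,
    one pair per article. Returns one row per quality class in QUALITY
    order, each row holding counts in IMPORTANCE order.
    """
    counts = Counter(pairs)
    return [[counts[(q, i)] for i in IMPORTANCE] for q in QUALITY]


def format_table(matrix):
    """Render the matrix in the pipe-table layout used in this log."""
    lines = [
        "Predicted quality | " + " | ".join(IMPORTANCE) + " |",
        "---|" * (len(IMPORTANCE) + 1),
    ]
    for quality, row in zip(QUALITY, matrix):
        lines.append(quality + " | " + " | ".join(f"{n:,}" for n in row) + " |")
    return "\n".join(lines)


# Made-up labels for four hypothetical articles (an assumption for illustration).
sample = [("Stub", "Low"), ("Stub", "Low"), ("C", "Top"), ("FA", "Mid")]
matrix = cross_tabulate(sample)
print(format_table(matrix))
```

The same function serves both table variants: pass true importance labels for the first table of each project and predicted importance labels for the second.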
WikiProject Africa
In the first table below, columns are true importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 14,697 | 1,140 | 214 | 136 |
Start | 7,198 | 1,412 | 419 | 436 |
C | 2,913 | 970 | 362 | 947 |
B | 473 | 240 | 123 | 394 |
GA | 675 | 250 | 93 | 232 |
FA | 195 | 92 | 53 | 120 |
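One way to read this table for gaps is to ask what share of the articles in an importance class are predicted to be low quality. The sketch below assumes a "gap" means a Stub or Start prediction, which is an illustrative choice and not a definition taken from this log; the helper name `low_quality_share` is hypothetical. The counts are copied from the table above.

```python
QUALITY = ["Stub", "Start", "C", "B", "GA", "FA"]
IMPORTANCE = ["Low", "Mid", "High", "Top"]

# WikiProject Africa: rows are predicted quality, columns are true importance.
AFRICA = [
    [14697, 1140, 214, 136],  # Stub
    [7198, 1412, 419, 436],   # Start
    [2913, 970, 362, 947],    # C
    [473, 240, 123, 394],     # B
    [675, 250, 93, 232],      # GA
    [195, 92, 53, 120],       # FA
]


def low_quality_share(matrix, importance, low_classes=("Stub", "Start")):
    """Fraction of articles in one importance column whose predicted
    quality falls in `low_classes` (an assumed definition of a gap)."""
    col = IMPORTANCE.index(importance)
    total = sum(row[col] for row in matrix)
    low = sum(matrix[QUALITY.index(q)][col] for q in low_classes)
    return low / total


top_share = low_quality_share(AFRICA, "Top")
high_share = low_quality_share(AFRICA, "High")
print(f"Top: {top_share:.1%}, High: {high_share:.1%}")
```

For Top-importance articles this gives 572 of 2,265, or about 25.3%; for High-importance articles it gives 633 of 1,264, or about 50.1%.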
In this second table, columns are predicted importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 10,757 | 3,574 | 1,552 | 304 |
Start | 3,659 | 2,800 | 2,204 | 802 |
C | 931 | 1,347 | 1,448 | 1,466 |
B | 156 | 258 | 337 | 479 |
GA | 186 | 305 | 347 | 412 |
FA | 32 | 85 | 134 | 209 |
WikiProject China
In the first table below, columns are true importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 5,176 | 2,864 | 160 | 1 |
Start | 4,805 | 3,146 | 421 | 19 |
C | 2,576 | 2,409 | 633 | 191 |
B | 599 | 684 | 266 | 108 |
GA | 386 | 368 | 102 | 45 |
FA | 111 | 156 | 104 | 48 |
In this second table, columns are predicted importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 4,446 | 3,117 | 634 | 4 |
Start | 3,690 | 2,993 | 1,624 | 84 |
C | 1,590 | 1,633 | 2,074 | 512 |
B | 394 | 469 | 609 | 185 |
GA | 236 | 216 | 338 | 111 |
FA | 62 | 72 | 198 | 87 |
WikiProject Judaism
In the first table below, columns are true importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 795 | 172 | 6 | 0 |
Start | 1,587 | 516 | 68 | 17 |
C | 1,099 | 523 | 224 | 96 |
B | 243 | 218 | 142 | 69 |
GA | 232 | 88 | 30 | 34 |
FA | 40 | 81 | 27 | 18 |
In this second table, columns are predicted importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 657 | 272 | 38 | 6 |
Start | 1,269 | 658 | 217 | 44 |
C | 706 | 575 | 473 | 188 |
B | 162 | 178 | 223 | 109 |
GA | 141 | 105 | 80 | 58 |
FA | 32 | 63 | 38 | 33 |
WikiProject Medicine
In the first table below, columns are true importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 5,480 | 1,880 | 14 | 0 |
Start | 6,718 | 1,967 | 68 | 0 |
C | 4,849 | 2,721 | 313 | 2 |
B | 1,175 | 1,016 | 246 | 22 |
GA | 1,237 | 1,004 | 192 | 32 |
FA | 261 | 285 | 149 | 36 |
In this second table, columns are predicted importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 5,154 | 2,094 | 126 | 0 |
Start | 6,090 | 2,143 | 519 | 1 |
C | 4,156 | 2,214 | 1,486 | 29 |
B | 993 | 650 | 733 | 83 |
GA | 987 | 731 | 692 | 55 |
FA | 220 | 136 | 300 | 75 |
WikiProject National Football League
In the first table below, columns are true importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 1,584 | 350 | 14 | 0 |
Start | 1,504 | 1,311 | 98 | 41 |
C | 1,032 | 881 | 248 | 170 |
B | 83 | 214 | 59 | 42 |
GA | 255 | 232 | 86 | 90 |
FA | 22 | 47 | 16 | 16 |
In this second table, columns are predicted importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 1,581 | 343 | 24 | 0 |
Start | 1,383 | 1,349 | 179 | 43 |
C | 821 | 829 | 492 | 189 |
B | 74 | 159 | 114 | 51 |
GA | 189 | 188 | 184 | 102 |
FA | 26 | 29 | 28 | 18 |
WikiProject Politics
In the first table below, columns are true importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 7,208 | 202 | 80 | 1 |
Start | 5,942 | 761 | 177 | 9 |
C | 3,156 | 1,465 | 398 | 32 |
B | 980 | 794 | 256 | 42 |
GA | 1,114 | 498 | 108 | 11 |
FA | 518 | 381 | 99 | 16 |
In this second table, columns are predicted importance ratings and rows are predicted article quality.
Predicted quality | Low | Mid | High | Top |
---|---|---|---|---|
Stub | 6,368 | 565 | 551 | 7 |
Start | 4,772 | 1,365 | 691 | 61 |
C | 1,531 | 2,065 | 1,284 | 171 |
B | 432 | 798 | 710 | 132 |
GA | 483 | 707 | 474 | 67 |
FA | 151 | 471 | 315 | 77 |
Correlations between quality and importance
Both ORES and our importance prediction model provide per-class probabilities, which we can use to understand the correlation between quality and importance. We apply an approach similar to that used by Aaron Halfaker for studying quality dynamics in Wikipedia, and by Sage Ross in FixMeBot and in the Wiki Education dashboard. For simplicity, we adopt Halfaker's approach and calculate an importance score as:
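A sketch of what that score could look like, assuming it is the probability-weighted sum of ordinal class indices used in Halfaker's quality measure. The symbols $s_{\text{importance}}$, $k$, and $c_k$, and the weights 0 through 3, are assumptions and are not stated in this log:

$$
s_{\text{importance}} = \sum_{k=0}^{3} k \cdot P(c_k), \qquad (c_0, c_1, c_2, c_3) = (\text{Low}, \text{Mid}, \text{High}, \text{Top})
$$

Under the same assumption, an article predicted Top with probability 1 scores 3, and one split evenly between Low and Mid scores 0.5.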
We then calculate the correlation coefficient between quality and importance for each of the WikiProjects, with the following results:
Project name | Correlation |
---|---|
Africa | 0.561 |
China | 0.456 |
Judaism | 0.410 |
Medicine | 0.395 |
National Football League | 0.487 |
Politics | 0.519 |
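The score-and-correlate procedure described in this section can be sketched as follows. Several parts are assumptions: the ordinal weights (class index 0, 1, 2, ...), the use of Pearson's coefficient (the log does not say which coefficient was computed), the helper names `weighted_score` and `pearson`, and the per-class probabilities, which are made up for three hypothetical articles.

```python
import math

QUALITY = ["Stub", "Start", "C", "B", "GA", "FA"]
IMPORTANCE = ["Low", "Mid", "High", "Top"]


def weighted_score(probabilities, classes):
    """Probability-weighted sum of ordinal class indices.

    The index weights are an assumption for illustration.
    """
    return sum(idx * probabilities[c] for idx, c in enumerate(classes))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (quality, importance) probability pairs, one per hypothetical article.
articles = [
    ({"Stub": 0.7, "Start": 0.2, "C": 0.1, "B": 0.0, "GA": 0.0, "FA": 0.0},
     {"Low": 0.8, "Mid": 0.2, "High": 0.0, "Top": 0.0}),
    ({"Stub": 0.1, "Start": 0.3, "C": 0.4, "B": 0.1, "GA": 0.1, "FA": 0.0},
     {"Low": 0.3, "Mid": 0.4, "High": 0.2, "Top": 0.1}),
    ({"Stub": 0.0, "Start": 0.0, "C": 0.1, "B": 0.2, "GA": 0.3, "FA": 0.4},
     {"Low": 0.1, "Mid": 0.2, "High": 0.3, "Top": 0.4}),
]

quality_scores = [weighted_score(q, QUALITY) for q, _ in articles]
importance_scores = [weighted_score(i, IMPORTANCE) for _, i in articles]
r = pearson(quality_scores, importance_scores)
print(f"correlation: {r:.3f}")
```

Collapsing each probability vector to a single number keeps the information a hard class label would discard, for example that an article sits between C and B, which is why the correlation is computed on scores rather than on predicted classes.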