Research talk:Automated classification of article importance/Work log/2017-06-30

Friday, June 30, 2017

Today I'll wrap up some documentation work, tie up loose ends, and do a bit of additional gap analysis.

WikiProjects Quality/Importance Analysis

Just as we created confusion matrices of predicted versus actual importance, we can build matrices of predicted article quality against true and predicted importance for each of the WikiProjects we have studied. We use the Objective Revision Evaluation Service (ORES) to predict article quality, scoring the revision of each article that was current when the WikiProject's dataset was gathered.
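As a rough illustration of how such a matrix can be assembled, the sketch below counts articles per (quality, importance) pair. It assumes the quality and importance labels have already been obtained for each article; the function name `cross_tabulate` and the toy data are hypothetical and not taken from the project's code.

```python
from collections import Counter

# Class orderings used for the rows (quality) and columns (importance).
QUALITY_CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]
IMPORTANCE_CLASSES = ["Low", "Mid", "High", "Top"]


def cross_tabulate(pairs):
    """Count articles per (quality, importance) pair.

    Returns a list of rows, one per quality class, each holding one
    count per importance class.
    """
    counts = Counter(pairs)
    return [
        [counts[(quality, importance)] for importance in IMPORTANCE_CLASSES]
        for quality in QUALITY_CLASSES
    ]


# Hypothetical toy data: one (predicted quality, importance) pair per article.
articles = [
    ("Stub", "Low"),
    ("Stub", "Low"),
    ("Start", "Mid"),
    ("C", "Low"),
    ("FA", "Top"),
]

table = cross_tabulate(articles)

# Print the matrix with quality classes as rows and importance as columns.
print(f"{'Quality':<8}" + "".join(f"{name:>8}" for name in IMPORTANCE_CLASSES))
for quality, row in zip(QUALITY_CLASSES, table):
    print(f"{quality:<8}" + "".join(f"{count:>8,}" for count in row))
```

The same function can be run once with true importance ratings and once with predicted ratings to produce the two tables shown for each WikiProject.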

WikiProject Africa

In the first table below, columns are true importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub      14,697   1,140     214     136
Start      7,198   1,412     419     436
C          2,913     970     362     947
B            473     240     123     394
GA           675     250      93     232
FA           195      92      53     120

In this second table, columns are predicted importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub      10,757   3,574   1,552     304
Start      3,659   2,800   2,204     802
C            931   1,347   1,448   1,466
B            156     258     337     479
GA           186     305     347     412
FA            32      85     134     209

WikiProject China

In the first table below, columns are true importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       5,176   2,864     160       1
Start      4,805   3,146     421      19
C          2,576   2,409     633     191
B            599     684     266     108
GA           386     368     102      45
FA           111     156     104      48

In this second table, columns are predicted importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       4,446   3,117     634       4
Start      3,690   2,993   1,624      84
C          1,590   1,633   2,074     512
B            394     469     609     185
GA           236     216     338     111
FA            62      72     198      87

WikiProject Judaism

In the first table below, columns are true importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub         795     172       6       0
Start      1,587     516      68      17
C          1,099     523     224      96
B            243     218     142      69
GA           232      88      30      34
FA            40      81      27      18

In this second table, columns are predicted importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub         657     272      38       6
Start      1,269     658     217      44
C            706     575     473     188
B            162     178     223     109
GA           141     105      80      58
FA            32      63      38      33

WikiProject Medicine

In the first table below, columns are true importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       5,480   1,880      14       0
Start      6,718   1,967      68       0
C          4,849   2,721     313       2
B          1,175   1,016     246      22
GA         1,237   1,004     192      32
FA           261     285     149      36

In this second table, columns are predicted importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       5,154   2,094     126       0
Start      6,090   2,143     519       1
C          4,156   2,214   1,486      29
B            993     650     733      83
GA           987     731     692      55
FA           220     136     300      75

WikiProject National Football League

In the first table below, columns are true importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       1,584     350      14       0
Start      1,504   1,311      98      41
C          1,032     881     248     170
B             83     214      59      42
GA           255     232      86      90
FA            22      47      16      16

In this second table, columns are predicted importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       1,581     343      24       0
Start      1,383   1,349     179      43
C            821     829     492     189
B             74     159     114      51
GA           189     188     184     102
FA            26      29      28      18

WikiProject Politics

In the first table below, columns are true importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       7,208     202      80       1
Start      5,942     761     177       9
C          3,156   1,465     398      32
B            980     794     256      42
GA         1,114     498     108      11
FA           518     381      99      16

In this second table, columns are predicted importance ratings and rows are predicted article quality classes.

Quality      Low     Mid    High     Top
Stub       6,368     565     551       7
Start      4,772   1,365     691      61
C          1,531   2,065   1,284     171
B            432     798     710     132
GA           483     707     474      67
FA           151     471     315      77

Correlations between quality and importance

Both ORES and our importance prediction model provide per-class probabilities, which we can use to understand the correlation between quality and importance. We apply an approach similar to the one Aaron Halfaker used for studying quality dynamics in Wikipedia, and that Sage Ross used in FixMeBot and in the Wiki Education dashboard. For simplicity, we adopt Halfaker's approach and calculate an importance score as:
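Assuming the score mirrors the weighted sum that Halfaker uses for quality, with the four importance classes mapped to consecutive integer weights in rank order (the weights below are an assumption, since they are not restated in this log), it would take the form:

```latex
\[
  \text{score} = \sum_{c \,\in\, \{\text{Low},\, \text{Mid},\, \text{High},\, \text{Top}\}} w_c \, P(c),
  \qquad
  w_{\text{Low}} = 0,\; w_{\text{Mid}} = 1,\; w_{\text{High}} = 2,\; w_{\text{Top}} = 3
\]
```

Here $P(c)$ is the model's predicted probability for class $c$. Shifting every weight by the same constant (for example, 1 to 4 instead of 0 to 3) would leave the correlations unchanged.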

 

We then calculate the correlation coefficient between the quality and importance scores for each of the WikiProjects, with the following results:

Project name              Correlation
Africa                    0.561
China                     0.456
Judaism                   0.410
Medicine                  0.395
National Football League  0.487
Politics                  0.519
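The calculation behind a table like this can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the results above: the class weights are consecutive integers in rank order, the coefficient is taken to be Pearson's r (the log does not name the coefficient), and the `predictions` data and both function names are hypothetical.

```python
import math

# Assumed class weights: consecutive integers in rank order (not stated in the log).
QUALITY_WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}
IMPORTANCE_WEIGHTS = {"Low": 0, "Mid": 1, "High": 2, "Top": 3}


def weighted_score(probabilities, weights):
    """Collapse per-class probabilities into a single weighted-sum score."""
    return sum(weights[label] * p for label, p in probabilities.items())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-article predictions: (quality probabilities, importance probabilities).
predictions = [
    ({"Stub": 0.8, "Start": 0.2, "C": 0.0, "B": 0.0, "GA": 0.0, "FA": 0.0},
     {"Low": 0.9, "Mid": 0.1, "High": 0.0, "Top": 0.0}),
    ({"Stub": 0.1, "Start": 0.3, "C": 0.4, "B": 0.2, "GA": 0.0, "FA": 0.0},
     {"Low": 0.3, "Mid": 0.5, "High": 0.2, "Top": 0.0}),
    ({"Stub": 0.0, "Start": 0.0, "C": 0.1, "B": 0.2, "GA": 0.3, "FA": 0.4},
     {"Low": 0.1, "Mid": 0.2, "High": 0.3, "Top": 0.4}),
]

quality_scores = [weighted_score(q, QUALITY_WEIGHTS) for q, _ in predictions]
importance_scores = [weighted_score(i, IMPORTANCE_WEIGHTS) for _, i in predictions]

r = pearson(quality_scores, importance_scores)
print(f"correlation: {r:.3f}")
```

Running this once per WikiProject, over that project's articles, would give one coefficient per row of the table.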
