Research talk:Automated classification of article quality/Work log/2016-03-29

Tuesday, March 29, 2016

New data day! Actually, I've been sitting on this for a few days while I did ORES infra stuff.

$ wc enwiki.observations.first_labelings.20160204.json
  4944744  49153126 547278651 enwiki.observations.first_labelings.20160204.json
$ cat enwiki.observations.first_labelings.20160204.json | grep '"stub"' | wc
3052675 29931255 336187196
$ cat enwiki.observations.first_labelings.20160204.json | grep '"start"' | wc
1473066 14938507 165572041
$ cat enwiki.observations.first_labelings.20160204.json | grep '"c"' | wc
 223962 2287919 24303410
$ cat enwiki.observations.first_labelings.20160204.json | grep '"b"' | wc
 151493 1544712 16445691
$ cat enwiki.observations.first_labelings.20160204.json | grep '"ga"' | wc
  31046  322381 3408305
$ cat enwiki.observations.first_labelings.20160204.json | grep '"fa"' | wc
   7019   72801  770140
$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | wc
   5623   56939  606087
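
The grep counts above match the quoted string anywhere on the line, so a line whose project or page title happened to be exactly "a" would be counted too. As a cross-check, here is a minimal sketch that counts on the parsed "label" field instead. It assumes the file holds one JSON object per line, as the head output in this log suggests; count_labels is a hypothetical helper name, not part of any existing tool.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count observations per "label" value in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)["label"]] += 1
    return counts


# Three observation lines copied from this log.
sample = [
    '{"timestamp": "20070723212122", "page_title": "Caligula", "label": "a", "project": "lgbtproject"}',
    '{"timestamp": "20060809132422", "page_title": "Noam Chomsky", "label": "a", "project": "biography"}',
    '{"project": "wikiproject", "timestamp": "20070508203411", "label": "b", "page_title": "Noam Chomsky"}',
]
print(count_labels(sample))  # Counter({'a': 2, 'b': 1})
```

To run it over the full dataset, pass an open file handle for enwiki.observations.first_labelings.20160204.json in place of sample.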

Cool! This looks more reasonable than the last time. Let's look at some examples to make sure they are not crazy. I'm most skeptical of the "a" label, so let's look at that first.

$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | head
{"timestamp": "20070723212122", "page_title": "Caligula", "label": "a", "project": "lgbtproject"}
{"timestamp": "20060809132422", "page_title": "Noam Chomsky", "label": "a", "project": "biography"}
{"timestamp": "20061006122548", "page_title": "Pythagorean theorem", "label": "a", "project": "mathematics"}
{"timestamp": "20060609204743", "page_title": "Poetry", "label": "a", "project": "v0.5"}
{"timestamp": "20060809153945", "page_title": "Albinism", "label": "a", "project": "medgen"}
{"timestamp": "20070601160648", "page_title": "Albinism", "label": "a", "project": "bannershell"}
{"timestamp": "20060621071831", "page_title": "Moon", "label": "a", "project": "core topic"}
{"timestamp": "20070321171727", "page_title": "Moon", "label": "a", "project": "ss"}
{"timestamp": "20070221233233", "page_title": "Jazz", "label": "a", "project": "new orleans"}
{"timestamp": "20060518044326", "page_title": "Archaeology", "label": "a", "project": "core topic"}
It looks like we're seeing the transition between these two template blocks, which results in a new "A" label:
{{WikiProjectBanners
|1={{WP1.0|class=A|importance=Mid|category=Natsci|v0.5=pass|WPCD=yes}}
|2={{MedGen|class=A|importance=Mid}}
|3={{WPMED|class=B|importance=low}}
}}
{{WikiProjectBannerShell|1=
  {{BannerShell|class=A|topic=Version 1.0 Editorial Team|1={{WP1.0|class=A|importance=Mid|category=Natsci|v0.5=pass|WPCD=yes}} }}
 {{BannerShell|class=A|topic=Medical Genetics|1={{MedGen|class=A|importance=Mid}} }}
 {{BannerShell|class=B|topic=Medicine|1={{WPMED|class=B|importance=Low}} }}
}}

So, I think that what is happening here is something that looks like a re-assessment by WikiProject "BannerShell". It's not a re-assessment, though. I think we'll need to adjust our strategy for detecting re-assessments so that it identifies new classes being added to the set already present on the page. I think I'll need to go back to the code to address this. --Halfak (WMF) (talk) 14:44, 29 March 2016 (UTC)


OK. I've implemented two changes. First, we don't use the WikiProject as part of the key anymore, so only changes to the quality class will be noticed. Second, I've set a lower bound of 48 hours on how long a quality class must remain current before it is considered. This seems to work well in practice. Let's do some more spot-checking.
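
A minimal sketch of the two changes, not the actual extraction code. It assumes each talk-page revision has already been reduced to a timestamp and the set of quality classes present in any banner, and that "current for 48 hours" means the class is not removed again within 48 hours of first appearing. The names new_labelings, parse_ts and MIN_CURRENT are hypothetical, and the timestamps in the example are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed reading of the 48 hour bound described above.
MIN_CURRENT = timedelta(hours=48)


def parse_ts(ts):
    """Parse a MediaWiki-style YYYYMMDDHHMMSS timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def new_labelings(revisions, min_current=MIN_CURRENT):
    """Yield (timestamp, label) for each class that newly appears.

    `revisions` is a list of (timestamp, set_of_classes), oldest first.
    The key is the class alone (no WikiProject), so re-wrapping banners
    without changing the set of classes produces nothing.
    """
    previous = set()
    for i, (ts, classes) in enumerate(revisions):
        added = classes - previous  # classes not present one revision ago
        previous = classes
        for label in sorted(added):
            # First later revision in which this class is gone, if any.
            removed_at = next(
                (t for t, c in revisions[i + 1:] if label not in c), None
            )
            if removed_at is None or parse_ts(removed_at) - parse_ts(ts) >= min_current:
                yield ts, label


# Hypothetical revision history for one page.
revs = [
    ("20160101000000", {"stub"}),
    ("20160110000000", {"stub", "a"}),      # reverted one hour later
    ("20160110010000", {"stub"}),
    ("20160201000000", {"stub", "start"}),
    ("20160301000000", {"stub", "start"}),  # banners re-wrapped, same classes
]
print(list(new_labelings(revs)))
# [('20160101000000', 'stub'), ('20160201000000', 'start')]
```

The short-lived "a" is dropped by the 48 hour bound, and the final revision yields nothing because the set of classes did not change.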

$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | head
{"project": "wikiproject", "timestamp": "20070723212122", "label": "a", "page_title": "Caligula"}
{"project": "wikiproject", "timestamp": "20060809132422", "label": "a", "page_title": "Noam Chomsky"}
{"project": "wikiproject", "timestamp": "20061006122548", "label": "a", "page_title": "Pythagorean theorem"}
{"project": "wikiproject", "timestamp": "20060609204743", "label": "a", "page_title": "Poetry"}
{"project": "wikiproject", "timestamp": "20060621071831", "label": "a", "page_title": "Moon"}
{"project": "wikiproject", "timestamp": "20070909033444", "label": "a", "page_title": "Robot"}

No repeats in this list!

$ cat enwiki.observations.first_labelings.20160204.json | grep '"ga"' | head
{"project": "wikiproject", "timestamp": "20071205084657", "label": "ga", "page_title": "Caligula"}
{"project": "wikiproject", "timestamp": "20131012212120", "label": "ga", "page_title": "Fat Man"}
{"project": "wikiproject", "timestamp": "20060928160937", "label": "ga", "page_title": "Sanskrit"}
{"project": "wikiproject", "timestamp": "20060903095953", "label": "ga", "page_title": "Pythagorean theorem"}
{"project": "wikiproject", "timestamp": "20061109074006", "label": "ga", "page_title": "Nutrition"}
{"project": "wikiproject", "timestamp": "20060801033223", "label": "ga", "page_title": "Algeria"}
{"project": "wikiproject", "timestamp": "20080515104134", "label": "ga", "page_title": "Group (mathematics)"}
{"project": "wikiproject", "timestamp": "20070113233412", "label": "ga", "page_title": "Boeing B-17 Flying Fortress"}
{"project": "wikiproject", "timestamp": "20120823105617", "label": "ga", "page_title": "Roman Empire"}
{"project": "wikiproject", "timestamp": "20060612161218", "label": "ga", "page_title": "Moon"}
$ cat enwiki.observations.first_labelings.20160204.json | grep '"stub"' | head
{"project": "wikiproject", "timestamp": "20150523194215", "label": "stub", "page_title": "Acantharea"}
{"project": "wikiproject", "timestamp": "20090614133509", "label": "stub", "page_title": "Mutagenesis"}
{"project": "wikiproject", "timestamp": "20070303174700", "label": "stub", "page_title": "Heinz"}
{"project": "wikiproject", "timestamp": "20070223050341", "label": "stub", "page_title": "ARY Group"}
{"project": "wikiproject", "timestamp": "20101226213843", "label": "stub", "page_title": "Mass media"}
{"project": "wikiproject", "timestamp": "20070411142310", "label": "stub", "page_title": "Aldona of Lithuania"}
{"project": "wikiproject", "timestamp": "20070407151157", "label": "stub", "page_title": "Born again (Christianity)"}
{"project": "wikiproject", "timestamp": "20070313135550", "label": "stub", "page_title": "List of science fiction awards"}
{"project": "wikiproject", "timestamp": "20061015195549", "label": "stub", "page_title": "Balts"}
{"project": "wikiproject", "timestamp": "20070203195447", "label": "stub", "page_title": "Burgess Shale"}

I think we're doing pretty well. :) It'd be good to find an article that had a downgrade in rating. It looks like Noam Chomsky fits the bill.

$ cat enwiki.observations.first_labelings.20160204.json | grep '"Noam Chomsky"' | head
{"project": "wikiproject", "timestamp": "20060809132422", "label": "a", "page_title": "Noam Chomsky"}
{"project": "wikiproject", "timestamp": "20070508203411", "label": "b", "page_title": "Noam Chomsky"}
{"project": "wikiproject", "timestamp": "20090324103311", "label": "start", "page_title": "Noam Chomsky"}

Well... that looks backwards. Let's figure out what happened.

Oh!!!! We don't see the "b" class re-appear because the "start" downgrade never changed the WP 1.0 assessment class. So, there is never a set difference in the classes present! This doesn't look wrong from an algorithmic point of view, but the result does look wrong. --Halfak (WMF) (talk) 16:40, 29 March 2016 (UTC)
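
The blind spot can be reproduced with a few lines. This is a sketch under assumptions: the sequence of class sets below is a simplified reconstruction of the Noam Chomsky talk page based on the output above, not data pulled from the revision history, and added_classes is a hypothetical helper name.

```python
def added_classes(history):
    """For each revision, list the classes that were absent one revision earlier."""
    previous = set()
    out = []
    for classes in history:
        out.append(sorted(classes - previous))
        previous = classes
    return out


# Assumed sequence: rated "a", re-rated "b", one banner downgraded to
# "start" while the WP 1.0 banner keeps "b", then upgraded back to "b".
history = [{"a"}, {"b"}, {"b", "start"}, {"b"}]
print(added_classes(history))  # [['a'], ['b'], ['start'], []]
```

The last step returns an empty list: "b" never left the page, so going back to "b" adds no new class and is not recorded.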

@Halfak (WMF): Looking at the diffs, I noticed that en:Special:Diff/279336384 downgrades to "Start", but there's still {{WP1.0|v0.5=pass|class=B|category=Arts}}. Is that captured as a rating and the reason why en:Special:Diff/308564816 isn't registered as a rating upgrade? Cheers, Nettrom (talk) 15:57, 30 March 2016 (UTC)