Research talk:Automated classification of article quality/Work log/2016-03-29
Tuesday, March 29, 2016
New data day! Actually, I've been sitting on this for a few days while I did ORES infra stuff.
$ wc enwiki.observations.first_labelings.20160204.json
4944744 49153126 547278651 enwiki.observations.first_labelings.20160204.json
$ cat enwiki.observations.first_labelings.20160204.json | grep '"stub"' | wc
3052675 29931255 336187196
$ cat enwiki.observations.first_labelings.20160204.json | grep '"start"' | wc
1473066 14938507 165572041
$ cat enwiki.observations.first_labelings.20160204.json | grep '"c"' | wc
223962 2287919 24303410
$ cat enwiki.observations.first_labelings.20160204.json | grep '"b"' | wc
151493 1544712 16445691
$ cat enwiki.observations.first_labelings.20160204.json | grep '"ga"' | wc
31046 322381 3408305
$ cat enwiki.observations.first_labelings.20160204.json | grep '"fa"' | wc
7019 72801 770140
$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | wc
5623 56939 606087
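The grep counts above match the quoted string anywhere on the line, so a value like `"a"` in another field would also be counted. A minimal sketch of tallying on the `label` field instead, assuming the file holds one JSON object per line as the listings in this log suggest. The function name `count_labels` and the third sample row are hypothetical and only there for illustration.

```python
import json
from collections import Counter


def count_labels(lines):
    """Tally observations by their "label" field, one JSON object per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        counts[json.loads(line)["label"]] += 1
    return counts


sample = [
    '{"timestamp": "20070723212122", "page_title": "Caligula", "label": "a", "project": "lgbtproject"}',
    '{"timestamp": "20060621071831", "page_title": "Moon", "label": "a", "project": "core topic"}',
    # Hypothetical row: grep '"a"' would match it through the project field.
    '{"timestamp": "20070101000000", "page_title": "Example", "label": "stub", "project": "a"}',
]

print(count_labels(sample))  # Counter({'a': 2, 'stub': 1})
```

To run it against the real dump, pass the open file object, e.g. `count_labels(open("enwiki.observations.first_labelings.20160204.json"))`.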
Cool! This looks more reasonable than the last time. Let's look at some examples to make sure they are not crazy. I'm most skeptical of the "a" label, so let's look at that first.
$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | head
{"timestamp": "20070723212122", "page_title": "Caligula", "label": "a", "project": "lgbtproject"}
{"timestamp": "20060809132422", "page_title": "Noam Chomsky", "label": "a", "project": "biography"}
{"timestamp": "20061006122548", "page_title": "Pythagorean theorem", "label": "a", "project": "mathematics"}
{"timestamp": "20060609204743", "page_title": "Poetry", "label": "a", "project": "v0.5"}
{"timestamp": "20060809153945", "page_title": "Albinism", "label": "a", "project": "medgen"}
{"timestamp": "20070601160648", "page_title": "Albinism", "label": "a", "project": "bannershell"}
{"timestamp": "20060621071831", "page_title": "Moon", "label": "a", "project": "core topic"}
{"timestamp": "20070321171727", "page_title": "Moon", "label": "a", "project": "ss"}
{"timestamp": "20070221233233", "page_title": "Jazz", "label": "a", "project": "new orleans"}
{"timestamp": "20060518044326", "page_title": "Archaeology", "label": "a", "project": "core topic"}
- en:Special:Diff/146614638 to en:Talk:Caligula looks right
- en:Special:Diff/68604322 to en:Talk:Noam Chomsky looks right
- en:Special:Diff/68627981 to en:Talk:Albinism looks right
- But en:Special:Diff/135114861 to en:Talk:Albinism looks maybe wrong.
- It looks like en:Special:Diff/130859496 to en:Talk:Albinism added "class=B", but that got reverted.
- It looks like the transition between these two template blocks is what produces the new "A" label:
{{WikiProjectBanners
|1={{WP1.0|class=A|importance=Mid|category=Natsci|v0.5=pass|WPCD=yes}}
|2={{MedGen|class=A|importance=Mid}}
|3={{WPMED|class=B|importance=low}}
}}
{{WikiProjectBannerShell|1=
{{BannerShell|class=A|topic=Version 1.0 Editorial Team|1={{WP1.0|class=A|importance=Mid|category=Natsci|v0.5=pass|WPCD=yes}} }}
{{BannerShell|class=A|topic=Medical Genetics|1={{MedGen|class=A|importance=Mid}} }}
{{BannerShell|class=B|topic=Medicine|1={{WPMED|class=B|importance=Low}} }}
}}
So, I think what is happening here is something that looks like a re-assessment by WikiProject "BannerShell". It's not a re-assessment, though. I think we'll need to adjust our strategy for detecting re-assessments so that it identifies new classes being added to the set already present on the page. I think I'll need to go back to the code to address this. --Halfak (WMF) (talk) 14:44, 29 March 2016 (UTC)
OK. I've implemented two changes. First, we don't use the WikiProject as part of the key anymore, so only changes to the quality class will be noticed. Second, a quality class must now stay current for at least 48 hours before it is considered. This seems to work well in practice. Let's do some more spot-checking.
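A rough sketch of how those two changes could fit together, not the actual extractor code. It assumes each revision has already been reduced to the set of quality classes present on the talk page, that a "new" class is one missing from the previous revision's set, and that "current for 48 hours" means the class is not removed again within that window. The names `stable_new_classes` and `parse_ts` and all timestamps below are hypothetical.

```python
from datetime import datetime, timedelta

MIN_DURATION = timedelta(hours=48)  # the 48-hour bound


def parse_ts(ts):
    """Parse a MediaWiki-style timestamp such as 20070723212122."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def stable_new_classes(revisions, end_ts, min_duration=MIN_DURATION):
    """
    revisions: chronological list of (timestamp, set of classes on the page).
    end_ts: timestamp at which the history ends (e.g. the dump date).
    Returns (timestamp, label) for every class that is newly present relative
    to the previous revision and stays present for at least min_duration.
    """
    observations = []
    previous = set()
    for i, (ts, classes) in enumerate(revisions):
        for label in sorted(classes - previous):
            # The class stops being current at the first later revision
            # that no longer contains it, or at end_ts if it never leaves.
            removed_ts = next(
                (later_ts for later_ts, later in revisions[i + 1:]
                 if label not in later),
                end_ts,
            )
            if parse_ts(removed_ts) - parse_ts(ts) >= min_duration:
                observations.append((ts, label))
        previous = classes
    return observations


# Hypothetical history: "b" is added and reverted within an hour, and a
# later template reshuffle leaves the set of classes unchanged.
history = [
    ("20070101000000", {"a"}),
    ("20070110000000", {"a", "b"}),
    ("20070110010000", {"a"}),
    ("20070120000000", {"a"}),
    ("20070201000000", {"ga"}),
]

print(stable_new_classes(history, "20160204000000"))
```

With this history only the "a" and "ga" observations survive: the short-lived "b" is filtered by the 48-hour bound, and the reshuffle adds nothing because no WikiProject is part of the key.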
$ cat enwiki.observations.first_labelings.20160204.json | grep '"a"' | head
{"project": "wikiproject", "timestamp": "20070723212122", "label": "a", "page_title": "Caligula"}
{"project": "wikiproject", "timestamp": "20060809132422", "label": "a", "page_title": "Noam Chomsky"}
{"project": "wikiproject", "timestamp": "20061006122548", "label": "a", "page_title": "Pythagorean theorem"}
{"project": "wikiproject", "timestamp": "20060609204743", "label": "a", "page_title": "Poetry"}
{"project": "wikiproject", "timestamp": "20060621071831", "label": "a", "page_title": "Moon"}
{"project": "wikiproject", "timestamp": "20070909033444", "label": "a", "page_title": "Robot"}
No repeats in this list!
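Eyeballing `head` only covers a handful of rows. A small sketch of the same repeat check over any number of lines, assuming one JSON object per line. The helper name `repeated_labelings` is hypothetical, and the sample rows are copied from the earlier "a" listing.

```python
import json
from collections import Counter


def repeated_labelings(lines):
    """Return the (page_title, label) pairs that occur more than once."""
    seen = Counter()
    for line in lines:
        if line.strip():
            obs = json.loads(line)
            seen[(obs["page_title"], obs["label"])] += 1
    return sorted(pair for pair, n in seen.items() if n > 1)


# Rows from the listing produced before the fix.
old_listing = [
    '{"timestamp": "20060809153945", "page_title": "Albinism", "label": "a", "project": "medgen"}',
    '{"timestamp": "20070601160648", "page_title": "Albinism", "label": "a", "project": "bannershell"}',
    '{"timestamp": "20060621071831", "page_title": "Moon", "label": "a", "project": "core topic"}',
    '{"timestamp": "20070321171727", "page_title": "Moon", "label": "a", "project": "ss"}',
    '{"timestamp": "20070221233233", "page_title": "Jazz", "label": "a", "project": "new orleans"}',
]

print(repeated_labelings(old_listing))  # [('Albinism', 'a'), ('Moon', 'a')]
```

A repeat is not necessarily an error, since a page could legitimately return to a class it held before, so this only flags candidates for spot-checking.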
$ cat enwiki.observations.first_labelings.20160204.json | grep '"ga"' | head
{"project": "wikiproject", "timestamp": "20071205084657", "label": "ga", "page_title": "Caligula"}
{"project": "wikiproject", "timestamp": "20131012212120", "label": "ga", "page_title": "Fat Man"}
{"project": "wikiproject", "timestamp": "20060928160937", "label": "ga", "page_title": "Sanskrit"}
{"project": "wikiproject", "timestamp": "20060903095953", "label": "ga", "page_title": "Pythagorean theorem"}
{"project": "wikiproject", "timestamp": "20061109074006", "label": "ga", "page_title": "Nutrition"}
{"project": "wikiproject", "timestamp": "20060801033223", "label": "ga", "page_title": "Algeria"}
{"project": "wikiproject", "timestamp": "20080515104134", "label": "ga", "page_title": "Group (mathematics)"}
{"project": "wikiproject", "timestamp": "20070113233412", "label": "ga", "page_title": "Boeing B-17 Flying Fortress"}
{"project": "wikiproject", "timestamp": "20120823105617", "label": "ga", "page_title": "Roman Empire"}
{"project": "wikiproject", "timestamp": "20060612161218", "label": "ga", "page_title": "Moon"}
$ cat enwiki.observations.first_labelings.20160204.json | grep '"stub"' | head
{"project": "wikiproject", "timestamp": "20150523194215", "label": "stub", "page_title": "Acantharea"}
{"project": "wikiproject", "timestamp": "20090614133509", "label": "stub", "page_title": "Mutagenesis"}
{"project": "wikiproject", "timestamp": "20070303174700", "label": "stub", "page_title": "Heinz"}
{"project": "wikiproject", "timestamp": "20070223050341", "label": "stub", "page_title": "ARY Group"}
{"project": "wikiproject", "timestamp": "20101226213843", "label": "stub", "page_title": "Mass media"}
{"project": "wikiproject", "timestamp": "20070411142310", "label": "stub", "page_title": "Aldona of Lithuania"}
{"project": "wikiproject", "timestamp": "20070407151157", "label": "stub", "page_title": "Born again (Christianity)"}
{"project": "wikiproject", "timestamp": "20070313135550", "label": "stub", "page_title": "List of science fiction awards"}
{"project": "wikiproject", "timestamp": "20061015195549", "label": "stub", "page_title": "Balts"}
{"project": "wikiproject", "timestamp": "20070203195447", "label": "stub", "page_title": "Burgess Shale"}
I think we're doing pretty good. :) It would be good to find an article whose rating was downgraded. It looks like Noam Chomsky fits the bill.
$ cat enwiki.observations.first_labelings.20160204.json | grep '"Noam Chomsky"' | head
{"project": "wikiproject", "timestamp": "20060809132422", "label": "a", "page_title": "Noam Chomsky"}
{"project": "wikiproject", "timestamp": "20070508203411", "label": "b", "page_title": "Noam Chomsky"}
{"project": "wikiproject", "timestamp": "20090324103311", "label": "start", "page_title": "Noam Chomsky"}
Well... that looks backwards. Let's figure out what happened.
- en:Special:Diff/68604322 seems to have added the "a" class and that stuck for a while
- en:Special:Diff/129340045 switched to "b" class and that stuck
- en:Special:Diff/279336384 downgrades to "start" and sure enough, it sticks. This is kind of silly. It was apparently never better than start :/
Oh!!!! We don't see the "b" class re-appear because the "start" downgrade never changed the WP 1.0 assessment class. So, there is never a set difference in the classes present! This doesn't look wrong from an algorithmic point of view, but it does look wrong. --Halfak (WMF) (talk) 16:40, 29 March 2016 (UTC)
- @Halfak (WMF): Looking at the diffs, I noticed that en:Special:Diff/279336384 downgrades to "Start", but there's still {{WP1.0|v0.5=pass|class=B|category=Arts}}. Is that captured as a rating and the reason why en:Special:Diff/308564816 isn't registered as a rating upgrade? Cheers, Nettrom (talk) 15:57, 30 March 2016 (UTC)
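One way to picture the situation discussed in this thread is to trace the set of classes across the four diffs mentioned. This is only a sketch: the class sets are assumptions read off the descriptions above (in particular, that the leftover WP1.0 `class=B` counts as a rating and that en:Special:Diff/308564816 replaces "start" with "b"), and `new_classes` is a hypothetical helper, not the extractor's real code.

```python
def new_classes(before, after):
    """Classes present after an edit that were not present before it."""
    return sorted(after - before)


# (diff id, assumed set of classes on the talk page after that diff)
history = [
    ("68604322", {"a"}),
    ("129340045", {"b"}),
    ("279336384", {"b", "start"}),  # WP1.0 template still carries class=B
    ("308564816", {"b"}),           # assumed: "start" goes back to "b"
]

previous = set()
emitted = []
for diff_id, classes in history:
    added = new_classes(previous, classes)
    emitted.append((diff_id, added))
    print(diff_id, added)
    previous = classes
```

Under these assumptions the trace emits "a", "b" and "start" and then nothing for the last diff, because "b" never left the set. That matches the three observations in the Noam Chomsky listing.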