Research talk:Scholarly article citations in Wikipedia/Work log/2015-02-09
Monday, February 9, 2015
I did some work this morning with v0.0.5 of https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia.
Extract a random sample of DOI citations
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | shuf -n 1000 > sample_doi.1k.tsv
Using crossref to check DOIs
$ cat sample_doi.1k.tsv | awk -F"\t" '{print "http://api.crossref.org/works/"$6"/agency"}' | xargs -I {} bash -c "wget --quiet -O- '{}' | sed -r 's/(.*)/\1\n/'" > doi_agencies.1k.json
Convert dois to sorted sets and diff
$ cat sample_doi.1k.tsv | tail -n+2 | cut -f6 | sort | uniq | tr '[:upper:]' '[:lower:]' > sample_doi.1k.set.tsv
$ cat doi_agencies.1k.json | mwstream json2tsv message.DOI | sort | uniq | tr '[:upper:]' '[:lower:]' > doi_agencies.1k.set.tsv
$ diff sample_doi.1k.set.tsv doi_agencies.1k.set.tsv | grep "<" | sed -r "s/>\s(.+)/\1/" > missing_doi.1k.tsv
$ wc missing_doi.1k.tsv
103
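The same set difference can also be sketched with `comm`, which compares two sorted files directly, so there is no need to filter `diff` output on the `<` character (which also occurs inside some DOIs). This is a minimal sketch and not the command that was run for this log. The two input files are small made-up stand-ins for the sample and agency DOI sets, and the `normalize` helper is hypothetical. It lowercases before sorting so that case variants collapse into a single entry.

```shell
#!/bin/sh
set -eu
export LC_ALL=C  # byte-order collation so sort and comm agree

workdir=$(mktemp -d)

# Made-up stand-ins for the two DOI lists (one DOI per line).
printf '%s\n' '10.1000/B' '10.1000/a' '10.1000/c<1::aid>2.0.co;2-3' > "$workdir/sample.tsv"
printf '%s\n' '10.1000/A' > "$workdir/agencies.tsv"

# Hypothetical helper: lowercase, then sort and de-duplicate.
normalize() { tr '[:upper:]' '[:lower:]' < "$1" | sort -u; }

normalize "$workdir/sample.tsv" > "$workdir/sample.set.tsv"
normalize "$workdir/agencies.tsv" > "$workdir/agencies.set.tsv"

# comm -23 prints only the lines unique to the first file,
# i.e. sampled DOIs with no matching agency record.
missing=$(comm -23 "$workdir/sample.set.tsv" "$workdir/agencies.set.tsv")
printf '%s\n' "$missing"

rm -r "$workdir"
```

Because `comm` prints the lines themselves, with no `<` or `>` prefixes, the result can be written straight to a file like `missing_doi.1k.tsv` without a `sed` cleanup step.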
Spot-checked missing dois
shuf -n10 missing_doi.1k.tsv
- http://dx.doi.org/10.1007/pl00005669 -- resolves
- http://dx.doi.org/10.1016/j.intell.2006.03.005 -- resolves
- http://dx.doi.org/10.1080/0963749032000045837 -- resolves
- http://dx.doi.org/10.1525/auk.2009.03409.2 -- not found
  - Added in http://enwp.org/?oldid=630201543&diff=prev
  - Extracted as expected from "<ref name=Auk>{{cite doi|10.1525/auk.2009.03409.2}}</ref>"
- http://dx.doi.org/10.1666/0022-3360(2005)079[0981:arodly]2.0.co;2 -- resolves
- http://dx.doi.org/10.1162/jinh.2008.38.3.499 -- resolves
- http://dx.doi.org/10.1109/iadcc.2014.6779425 -- resolves
- http://dx.doi.org/10.4202/app.2011.0120 -- resolves
- http://dx.doi.org/10.1007/s10482-011-9605-y -- resolves
- http://dx.doi.org/10.1016/0030-4220(76)90098-0 -- resolves
shuf -n10 missing_doi.1k.tsv
- http://dx.doi.org/10.1002/(sici)1521-3773(19990215)38:4<428::aid-anie428>3.0.co;2-3 -- resolves
- http://dx.doi.org/10.1084/jem.20012024 -- resolves
- http://dx.doi.org/10.1017/s0266462309090035 -- resolves
- http://dx.doi.org/10.1093/hmg/6.2.317 -- resolves
- http://dx.doi.org/10.1101/gad.989402 -- resolves
- http://dx.doi.org/10.1021/ed007p2875 -- resolves
- http://dx.doi.org/10.2307/27570652 -- not found
  - Added in http://enwp.org/?oldid=557094374&diff=prev
  - Extracted as expected from "| url = http://www.jstor.org/stable/10.2307/27570652"
- http://dx.doi.org/10.1093/oxfordjournals.tropej.a057419 -- resolves
- http://dx.doi.org/10.1093/oi/authority.20110803100400841 -- not found
  - Added in http://enwp.org/?oldid=596786148&diff=prev
  - Extracted as expected from "|url=http://www.oxfordreference.com/view/10.1093/oi/authority.20110803100400841"
  - Might not be a DOI, but it does look like one.
- http://dx.doi.org/10.1086/302282 -- resolves
Well... that looks good. Almost all the IDs that aren't resolving with crossref resolve just fine with dx.doi.org. And the ones that don't resolve seem to be fine extractions.
Counts
- DOI/Page pairs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | wc
  - 742565 5269445 63756121
- PubMed ID/Page pairs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | wc
  - 437484 3011320 30215760
- Unique DOIs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f6 | sort | uniq | wc
  - 524357 524357 13332518
- Unique pages with DOIs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f1 | sort | uniq | wc
  - 172644 172644 1438573
- Unique pages with PubMed IDs
  - $ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | cut -f1 | sort | uniq | wc
  - 68648 68648 575015
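The DOI counts above each re-read the full file. As a rough sketch, the same three numbers (DOI/page pairs, unique DOIs, unique pages with DOIs) could be gathered in a single `awk` pass. This assumes the identifier type sits in column 5. The log only shows that column 1 is used for pages and column 6 for the identifier, so the column-5 test and the made-up rows below are assumptions, and an exact match on that column is stricter than the `grep doi` used above.

```shell
#!/bin/sh
set -eu

tsv=$(mktemp)

# Made-up rows in an assumed layout: page_id, title, rev_id, timestamp, type, id.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  1 A 100 20150101 doi 10.1000/x \
  1 A 100 20150101 doi 10.1000/y \
  2 B 200 20150102 doi 10.1000/x \
  2 B 200 20150102 pmid 12345 \
  3 C 300 20150103 pmcid 67890 > "$tsv"

# One pass: count DOI rows, and track first sightings of each DOI and page.
counts=$(awk -F'\t' '
  $5 == "doi" {
    pairs++
    if (!($6 in ids))   { ids[$6] = 1;   uniq_ids++ }
    if (!($1 in pages)) { pages[$1] = 1; uniq_pages++ }
  }
  END { printf "%d %d %d\n", pairs, uniq_ids, uniq_pages }
' "$tsv")

# Prints: pairs, unique DOIs, unique pages with DOIs.
echo "$counts"

rm "$tsv"
```

This holds every distinct DOI and page ID in memory, which should be manageable at the scale reported here (roughly half a million unique DOIs), whereas the `sort | uniq` pipelines trade memory for extra passes over the file.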
--22:52, 9 February 2015 (UTC)