Research talk:Scholarly article citations in Wikipedia/Work log/2015-02-09

Monday, February 9, 2015

I did some work this morning with v0.0.5 of https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia.

Extract a random sample of DOI citations

$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | shuf -n 1000 > sample_doi.1k.tsv
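Note that `grep doi` matches the string anywhere in a row, including page titles. A minimal sketch of a stricter sample, assuming the citation type sits in its own tab-separated column (column 5 here is an assumption, and `sample_by_type` is a hypothetical helper, not part of the extractor):

```shell
# Sample N rows whose type column equals the given value exactly.
# $1 = file, $2 = type column (assumed), $3 = type value, $4 = sample size
sample_by_type() {
    awk -F'\t' -v col="$2" -v t="$3" '$col == t' "$1" | shuf -n "$4"
}
```

Something like `sample_by_type doi_and_pubmed_citations.enwiki-20150112.tsv 5 doi 1000 > sample_doi.1k.tsv` would then stand in for the pipeline above, if the column assumption holds.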

Using crossref to check DOIs

$ cat sample_doi.1k.tsv | awk -F"\t" '{print "http://api.crossref.org/works/"$6"/agency"}' | xargs -I {} bash -c "wget --quiet -O- '{}' | sed -r 's/(.*)/\1\n/'" > doi_agencies.1k.json
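The same lookup can be sketched as a plain read loop, which may be more forgiving than `xargs` for DOIs that contain quotes or backslashes (`xargs` treats those specially in its input unless given a delimiter option). The helper names `agency_url` and `fetch_agencies` are hypothetical; the endpoint and the column number are taken from the command above.

```shell
# Build the crossref agency-lookup URL for one DOI.
agency_url() {
    printf 'http://api.crossref.org/works/%s/agency\n' "$1"
}

# Read a TSV on stdin, take the DOI from column 6, and print one JSON
# response per line. A failed lookup prints nothing, so that DOI shows
# up as missing when the sets are compared later.
fetch_agencies() {
    cut -f6 | while IFS= read -r doi; do
        wget --quiet -O- "$(agency_url "$doi")" && echo
    done
}
```

Usage would look like `fetch_agencies < sample_doi.1k.tsv > doi_agencies.1k.json`. DOIs are passed through without URL encoding, as in the original command.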

Convert dois to sorted sets and diff

$ cat sample_doi.1k.tsv | tail -n+2 | cut -f6 | tr '[:upper:]' '[:lower:]' | sort | uniq > sample_doi.1k.set.tsv
$ cat doi_agencies.1k.json | mwstream json2tsv message.DOI | tr '[:upper:]' '[:lower:]' | sort | uniq > doi_agencies.1k.set.tsv
$ diff sample_doi.1k.set.tsv doi_agencies.1k.set.tsv | grep "^<" | sed -r "s/^<\s(.+)/\1/" > missing_doi.1k.tsv
$ wc missing_doi.1k.tsv 
103
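As an alternative to parsing `diff` output, the set difference can be sketched with `comm`. This is a minimal sketch, not what was run above: `missing_ids` is a hypothetical helper, and it re-sorts both inputs in the C locale because `comm` expects its inputs sorted in the collation order it runs under.

```shell
# Print lines that appear in the first file but not in the second.
# Both inputs are lowercased before sorting so that case differences
# do not produce spurious mismatches or leftover duplicates.
missing_ids() {
    _a=$(mktemp) && _b=$(mktemp) || return 1
    tr '[:upper:]' '[:lower:]' < "$1" | LC_ALL=C sort -u > "$_a"
    tr '[:upper:]' '[:lower:]' < "$2" | LC_ALL=C sort -u > "$_b"
    LC_ALL=C comm -23 "$_a" "$_b"
    rm -f "$_a" "$_b"
}
```

For example, `missing_ids sample_doi.1k.set.tsv doi_agencies.1k.set.tsv | wc -l` would count the DOIs with no crossref response.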

Spot-checked missing dois

$ shuf -n10 missing_doi.1k.tsv

Well... that looks good. Almost all the IDs that don't resolve with crossref resolve just fine with dx.doi.org. And the ones that don't resolve there either still look like correct extractions.
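The resolver spot check could be scripted roughly as follows. This is a sketch under assumptions: it uses https://doi.org/ (the current name for the dx.doi.org resolver), it assumes the resolver answers a HEAD request with a 3xx redirect for a known DOI and something else for an unknown one, and both helper names are hypothetical.

```shell
# Interpret an HTTP status code from the resolver: any 3xx redirect is
# taken to mean the DOI is registered (an assumption about the resolver).
status_means_resolved() {
    case "$1" in
        3??) return 0 ;;
        *)   return 1 ;;
    esac
}

# Ask the resolver about one DOI without following the redirect.
doi_resolves() {
    _status=$(curl -s -o /dev/null -I -w '%{http_code}' "https://doi.org/$1")
    status_means_resolved "$_status"
}
```

A loop such as `while IFS= read -r doi; do doi_resolves "$doi" || echo "$doi"; done < missing_doi.1k.tsv` would then list the IDs that neither service recognizes.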

Counts

DOI/Page pairs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | wc
742565 5269445 63756121
PubMed ID/Page pairs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | wc
437484 3011320 30215760
Unique DOIs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f6 | sort | uniq | wc
524357 524357 13332518
Unique pages with DOIs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep doi | cut -f1 | sort | uniq | wc
172644 172644 1438573
Unique pages with PubMed IDs
$ cat doi_and_pubmed_citations.enwiki-20150112.tsv | grep -v "doi" | grep -E "pmcid|pmid" | cut -f1 | sort | uniq | wc
68648 68648 575015
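The unique counts above all follow one pattern: filter rows by citation type, pick a field, and count distinct values. A minimal sketch of that pattern in a single `awk` pass, assuming the type is stored in its own column (column 5 in the usage line is an assumption, and `count_unique` is a hypothetical helper):

```shell
# Count distinct values of one tab-separated field among rows whose
# type column equals the given citation type exactly.
# $1 = file, $2 = type column (assumed), $3 = type value, $4 = field to count
count_unique() {
    awk -F'\t' -v tc="$2" -v tv="$3" -v f="$4" \
        '$tc == tv && !seen[$f]++ { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_unique doi_and_pubmed_citations.enwiki-20150112.tsv 5 doi 6` would count unique DOIs. Because it matches the type column exactly instead of grepping the whole row, its results may differ slightly from the figures above.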

--22:52, 9 February 2015 (UTC)