Research talk:Understanding Wikidata's Value/Work log/2017-05-02

Tuesday, May 2, 2017

As of today, I have code to check a given wiki for its Wikidata entity usage. It iterates through an SQL dump of the wbc_entity_usage table, looking for INSERT statements into the given wiki's database, then extracts each usage and aggregates by entity, returning the usage counts in a list. To recognize the lines containing INSERTs, I'm using a regex, which seems to be the most efficient approach. I initially tried using a library to do this matching, but by my rough estimates it would take ~4 hours to process English Wikipedia alone.
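A minimal sketch of this approach (not the actual module): it assumes a gzipped dump, the wbc_entity_usage column order (eu_row_id, eu_entity_id, eu_aspect, eu_page_id), and an illustrative function name.

<syntaxhighlight lang="python">
import gzip
import re
from collections import Counter

# One value tuple inside an INSERT statement: (row_id, 'entity_id', 'aspect', page_id)
ROW_RE = re.compile(r"\((\d+),'([A-Z]\d+)','([^']*)',(\d+)\)")

def count_entity_usage(dump_path):
    """Aggregate entity usage counts from a wbc_entity_usage SQL dump (illustrative)."""
    counts = Counter()
    with gzip.open(dump_path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            # Only INSERT lines carry row data; everything else is schema/comments.
            if not line.startswith("INSERT INTO `wbc_entity_usage`"):
                continue
            for _row_id, entity_id, _aspect, _page_id in ROW_RE.findall(line):
                counts[entity_id] += 1
    return counts
</syntaxhighlight>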

I also have code to get a list of wikis. There are 906 wikis, although some (wiktionaries in particular) do not appear to use Wikidata.
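One way such a list could be built is via the SiteMatrix API; the sketch below is an assumption about the approach (the actual module may read a static list instead), and the filtering of closed/private wikis is illustrative.

<syntaxhighlight lang="python">
import requests

def list_wikis():
    """Return database names of public, open wikis from the SiteMatrix API (sketch)."""
    resp = requests.get(
        "https://meta.wikimedia.org/w/api.php",
        params={"action": "sitematrix", "format": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    matrix = resp.json()["sitematrix"]
    dbnames = []
    for key, group in matrix.items():
        if key == "count":
            continue
        # Language groups nest sites under "site"; "specials" is already a list of sites.
        sites = group if key == "specials" else group.get("site", [])
        for site in sites:
            if "closed" not in site and "private" not in site:
                dbnames.append(site["dbname"])
    return dbnames
</syntaxhighlight>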

Today, I'm working on a Python script (building on the two modules mentioned above) that iterates through each wiki and calculates its Wikidata usage. The script will create a CSV for each wiki that uses Wikidata, plus a CSV of aggregated Wikidata usage across all wikis. I've been running this script on my personal machine while developing it, but I'm going to run the code on a server instead, and I'll work on setting that up today (probably setting up a virtualenv).
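A rough sketch of how such a driver script might tie together the two hypothetical helpers above; the dump path template, output file names, and CSV layout are all assumptions, not the real script.

<syntaxhighlight lang="python">
import csv
from collections import Counter

DUMP_TEMPLATE = "/dumps/{wiki}-wbc_entity_usage.sql.gz"  # illustrative path

def run(output_dir="."):
    """Write one usage CSV per wiki plus an aggregate CSV across all wikis (sketch)."""
    total = Counter()
    for wiki in list_wikis():
        try:
            counts = count_entity_usage(DUMP_TEMPLATE.format(wiki=wiki))
        except FileNotFoundError:
            continue  # no wbc_entity_usage dump for this wiki
        if not counts:
            continue  # wiki does not use Wikidata
        total.update(counts)
        with open(f"{output_dir}/{wiki}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["entity_id", "usages"])
            writer.writerows(counts.most_common())
    with open(f"{output_dir}/all_wikis.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["entity_id", "usages"])
        writer.writerows(total.most_common())
</syntaxhighlight>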

I'll also work on refactoring code I've produced in all three of the Python files I've mentioned above.

End of day

I deployed the code to a server and it took ~3 hours to calculate usages across all dumps. zhwiki was a particularly time-consuming dump to process. https://www.wikidata.org/wiki/Q54919 was the most used entity with over 2 million usages.

I've also been refactoring the code in the files mentioned above.
