Research talk:Citoid support for Wikimedia references


I'm not sure how to list my work related to Citoid. I'd feel a bit silly to list sandboxes and Phab tasks. --Elitre (WMF) (talk) 10:12, 23 April 2015 (UTC)

Elitre (WMF): Feel free to add sandboxes and Phabricator tasks in the "See also" section; we can organize them later. In the meantime, it'll be useful to have an inventory as complete as possible :) Guillaume (WMF) (talk) 19:04, 24 April 2015 (UTC)
I have a few discussions at it:Wikipedia:VisualEditor/Commenti#Mi_aiutate_a_testare_Citoid.3F and it:Discussioni_progetto:Coordinamento/Bibliografia_e_fonti#Mi_aiutate_a_testare_Citoid.3F. Dunno whether it makes sense to add them to the main page though. --Elitre (WMF) (talk) 20:11, 27 April 2015 (UTC)

Support needed

Per the conversation at https://phabricator.wikimedia.org/P691#7452, while the research project is over, the generation of data as outlined at Research:Citoid_support_for_Wikimedia_references#Methods_and_results needs to be reviewed and probably fixed, so any help with that would be appreciated. --Elitre (WMF) (talk) 09:30, 31 August 2015 (UTC)

Hi Elitre (WMF), I'm interested :-) What's the plan? --Atlasowa (talk) 14:27, 1 September 2015 (UTC)
Hey Atlasowa, thanks a lot for replying. In the conversation I linked above, I think next steps are described by User:Halfak (WMF) at https://phabricator.wikimedia.org/P691#7504 . I'm pinging him here to see if he manages to give a bit more guidance on the topic. Best, --Elitre (WMF) (talk) 14:49, 1 September 2015 (UTC)
Hey folks. I'm dropping a ping to User:Guillaume (WMF) since he did some work beyond my own here. So, in order to extract all <ref> tags, I advised Guillaume to run mwrefs' extract utility on the XML dump files. This produces a dataset containing all citations in the most recent version of articles as of the date of the XML dump. I assume that Guillaume is doing some post-processing to identify and extract the URLs from this dataset and that is how he arrives at his counts. I've been working to re-engineer the basic XML processing systems beneath this utility, so I'd be happy to do another run sometime soon as a test of that new infra. --Halfak (WMF) (talk) 15:12, 1 September 2015 (UTC)
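For anyone who wants to reproduce the post-processing step described above, here is a minimal sketch of how URLs could be pulled out of extracted <ref> text and tallied by domain. It is only an illustration under stated assumptions: it does not call mwrefs and is not necessarily how Guillaume arrived at his counts. The regular expression and the helper names `domains_in_ref` and `count_domains` are hypothetical, and the input is assumed to be a plain list of <ref> strings.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: a URL ends at whitespace or at common wikitext delimiters.
URL_RE = re.compile(r'https?://[^\s\[\]|<>"{}]+')


def domains_in_ref(ref_text):
    """Yield the lowercased host of every URL found in one <ref> string."""
    for url in URL_RE.findall(ref_text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip URLs the parser rejects as malformed
        if host:
            # Treat "www.example.org" and "example.org" as the same domain.
            yield host[4:] if host.startswith("www.") else host


def count_domains(ref_texts):
    """Count, per domain, how many refs contain at least one link to it."""
    counts = Counter()
    for text in ref_texts:
        counts.update(set(domains_in_ref(text)))
    return counts


if __name__ == "__main__":
    sample_refs = [
        "<ref>{{cite news|url=http://www.example.org/a|title=A}}</ref>",
        "<ref>[https://example.org/b B] and https://news.example.com/c</ref>",
    ]
    for domain, n in count_domains(sample_refs).most_common():
        print(domain, n)
```

Counting each domain once per ref (rather than once per URL) is a design choice of this sketch; either convention could be defended, but the two give different rankings when a single ref links the same site several times.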
Hi Elitre (WMF), I had a look at the ref stats generated via Quarry: de:Benutzer:Atlasowa/ref_stats#Top_external_links_in_Wikipedia_articles_via_quarry. I think the "most popular domains / external links from Wikipedia articles" are biased towards domain links added by bots, added via templates or as infobox fields, and added via semi-automatic tools (VIAF links, for instance). No wonder: those are mass-added and consequently "most popular". But these domains are not the primary use case for citoid, which is inexperienced human editors adding URL refs to free-form article text. Maybe looking at "most frequent domains in template:cite news" would be better, or even template:cite web. Do you know https://tools.wmflabs.org/templatetiger/ ? Unfortunately templatetiger doesn't really work as it did before (toolserver/wmflabs? [1][2]). See https://tools.wmflabs.org/templatetiger/template-parameter.php?template=cite%20news&lang=enwiki and https://tools.wmflabs.org/templatetiger/tt-table4.php?template=cite%20news&lang=enwiki&order=layurl&where=&is= . But ultimately the approach of "find most popular ref-domains and make devs handcode them for citoid" doesn't scale. Instead, citoid should be 1) easier to improve by community members (my list at de:user:Atlasowa/ref_citation_tools shows that many editors care about making it easy to insert good refs!)
We'd love more community contributions. Creating new translators for Zotero is probably easier than contributing to citoid directly, because Zotero translators don't require learning an entire codebase. I agree it doesn't scale well, but we're also working on improving citoid's ability to read metadata. Ultimately though a Zotero translator will give the best results given a particular domain and if this follows the 80-20 rule (which it might) translators for the top few domains might improve our coverage considerably. I'm also looking into using another scraper which uses these JSON definitions instead (https://github.com/ContentMine/journal-scrapers) which may be easier to improve but that idea is definitely in its infancy... do you have any other ideas about how to get community contributions? Mvolz (talk) 14:24, 5 September 2015 (UTC)
and 2) citoid should learn from every user-corrected ref inserted via citoid-applications.
One way to sort of do this is to use Wikidata to insert citations; that way, when a user changes a reference, Wikidata is updated and the improved metadata propagates to every place that source appears. But that is a semi-long way off for en wiki, as it requires using citation templates that use Wikidata (and en wiki doesn't have arbitrary access to Wikidata yet; it's coming Sept 16th, last I heard!) and a lot of other stuff has to be written to make that work. See this IdeaLab page if you're curious about specifics: https://meta.wikimedia.org/wiki/Grants:IdeaLab/Tools_for_using_wikidata_items_as_citations Mvolz (talk) 14:24, 5 September 2015 (UTC)
The current VisualEditor UI application of Citoid is really bad, because it makes users add automatically formatted, poor-quality or mangled refs and makes it really burdensome to post-edit your citoid ref. --Atlasowa (talk) 11:58, 2 September 2015 (UTC)
There has been some disagreement about this before. I'll look into it again and see what can be done. Mvolz (talk) 14:24, 5 September 2015 (UTC)

Tech Talk about Zotero/Citoid

You're invited to "Automated citations in Wikipedia: Citoid and the technology behind it", on February 29th. See you there, --Elitre (WMF) (talk) 11:48, 18 February 2016 (UTC)
