Citation Detective

Citation Detective is a tool and public dataset designed to periodically run the Citation Need model, a machine learning-based classifier published at WWW'19 by WMF researchers and collaborators, on a large number of English Wikipedia articles. It releases a public, usable database containing the sentences identified as needing a citation, along with their associated metadata.

Schema summary

DESCRIBE sentences;

+-----------+------------------+------+-----+---------+----------------+
| Field     | Type             | Null | Key | Default | Extra          |
+-----------+------------------+------+-----+---------+----------------+
| id        | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| sentence  | varchar(5000)    | YES  |     | NULL    |                |
| paragraph | varchar(5000)    | YES  |     | NULL    |                |
| section   | varchar(768)     | YES  |     | NULL    |                |
| rev_id    | int(8) unsigned  | YES  |     | NULL    |                |
| score     | float            | YES  |     | NULL    |                |
+-----------+------------------+------+-----+---------+----------------+
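As a rough illustration of how the `sentences` table might be queried, the sketch below builds an in-memory SQLite stand-in with the same column names and selects the highest-scoring sentences. Several things here are assumptions rather than documented facts: the real database is not SQLite, the rows are made up, the helper `top_sentences` is hypothetical, the 0.5 threshold is arbitrary, and the query assumes that a higher `score` means the model is more confident that a citation is needed.

```python
import sqlite3

# In-memory SQLite stand-in for the real database. Column names follow the
# schema above; the types are simplified to SQLite equivalents.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sentences (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        sentence  TEXT,
        paragraph TEXT,
        section   TEXT,
        rev_id    INTEGER,
        score     REAL
    )
    """
)

# Made-up rows for illustration only; not real Citation Detective output.
rows = [
    ("Sentence A.", "Paragraph containing sentence A.", "History", 1001, 0.91),
    ("Sentence B.", "Paragraph containing sentence B.", "Reception", 1002, 0.42),
    ("Sentence C.", "Paragraph containing sentence C.", "History", 1003, 0.77),
]
conn.executemany(
    "INSERT INTO sentences (sentence, paragraph, section, rev_id, score) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)


def top_sentences(connection, min_score, limit=10):
    """Return (rev_id, section, sentence, score) rows with score >= min_score,
    ordered from highest to lowest score."""
    cursor = connection.execute(
        "SELECT rev_id, section, sentence, score FROM sentences "
        "WHERE score >= ? ORDER BY score DESC LIMIT ?",
        (min_score, limit),
    )
    return cursor.fetchall()


# 0.5 is an arbitrary illustrative threshold, not a documented cutoff.
results = top_sentences(conn, 0.5)
for rev_id, section, sentence, score in results:
    print(rev_id, section, sentence, score)
```

Because each row carries a `rev_id`, a consumer of the data can tie every flagged sentence back to the exact article revision it was extracted from, while `section` and `paragraph` supply the surrounding context.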

Applications

 
A screenshot of the prototype Citation Hunt that imports sentences lacking citations from Citation Detective

The Citation Detective dataset can be used to develop tools, bots, and other systems for improving the encyclopedia's reliability. As an example use case for this data, a proof of concept integrating Citation Detective with Citation Hunt was created: the prototype Citation Hunt uses Citation Detective to import sentences that would not normally be featured in Citation Hunt. The repository for the prototype is on GitHub. More use cases for this type of data were identified in a design research project conducted by WMF researchers.

See also