Research:Identification of Unsourced Statements/Feasibility Analysis

To check that positive examples (sentences with citations) and negative examples (sentences without citations) are separable, we run some preliminary feasibility tests.

Feasibility analysis: Automatically labeled data

We first test the feasibility of the framework, i.e. the separability of sentences with and without citations in the feature space, using the raw automatically labeled data. The training data consists of sentences from featured biographies. We treat sentences with an inline citation as positives and sentences without a citation as negatives.
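The automatic labeling rule can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes sentences are available as raw wikitext and that inline citations appear as `<ref>` tags or `{{sfn}}` templates. The names `CITATION_PATTERN`, `has_inline_citation`, and `label_sentences` are hypothetical.

```python
import re

# Assumption: inline citations show up in wikitext as <ref ...>...</ref>,
# self-closing <ref ... />, or {{sfn|...}} templates. Other citation
# styles would need additional patterns.
CITATION_PATTERN = re.compile(r"<ref[\s>/]|\{\{\s*sfn\b", re.IGNORECASE)


def has_inline_citation(sentence: str) -> bool:
    """Return True if the wikitext sentence contains an inline citation marker."""
    return CITATION_PATTERN.search(sentence) is not None


def label_sentences(sentences):
    """Pair each sentence with 1 (positive: cited) or 0 (negative: uncited)."""
    return [(s, 1 if has_inline_citation(s) else 0) for s in sentences]
```

Labels produced this way are noisy by construction: an uncited sentence is counted as a negative whether or not it actually needs a citation.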

Featured Biography Article Data

We created a training set with all 7,692 negatives and an equal number of positives. Below are the cross-validation results on existing sentences in the data.
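One way to build such a balanced set is to keep every example of the smaller class and draw an equally sized sample from the larger one. The sketch below assumes the positives are sampled uniformly at random, which the text does not state. The function name `balanced_training_set` and the `seed` parameter are hypothetical.

```python
import random


def balanced_training_set(positives, negatives, seed=0):
    """Keep all examples of the smaller class and downsample the larger one.

    Assumption: the larger class is downsampled uniformly at random.
    Returns a shuffled list of (sentence, label) pairs, with label 1 for
    positives and 0 for negatives.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos_sample = rng.sample(positives, n)
    neg_sample = rng.sample(negatives, n)
    data = [(s, 1) for s in pos_sample] + [(s, 0) for s in neg_sample]
    rng.shuffle(data)
    return data
```

Balancing the classes makes accuracy easier to interpret, since a classifier that always predicts one class scores 50%.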

All Featured Article Data

To assess the generalisability of the previous methodology, we also test with data from all featured articles (73,280 negatives and an equal number of positives).

Results show that a system using word vectors + random forests is able to detect sentences needing citations with around 75% accuracy.