Research:Task recommendations/Qualitative evaluation of morelike

The purpose of this study is to determine whether the "morelike" feature in mw:Extension:CirrusSearch returns results of sufficient quality to be useful for identifying articles in a similar topic area.

Methods

Terminology

Base article
The article on which similar article recommendations are based
Similar article
An article returned by morelike for a base article
Morelike
A feature of mw:Extension:CirrusSearch that finds "more articles" that are "like" a base article


Sampling base articles

In order to get a sense of how morelike would work for recommending articles to newcomers, we wanted a representative sample of articles that newcomers are likely to land on -- and potentially edit -- after registering their account. To do this, we used the returnTo field in Schema:ServerSideAccountCreation. We gathered a random sample of returnTo articles in the article namespace for users who registered during the 7-day period between 2014-07-16 and 2014-07-23.
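
A sketch of how such a sample might be pulled from the EventLogging database follows. The table's revision suffix, the connection details, and the column naming convention are assumptions; only the schema's returnTo field and the date range come from the description above.

 # Sketch of the sampling query, assuming EventLogging stores
 # Schema:ServerSideAccountCreation in a MySQL table with a revision suffix
 # (left as a placeholder) and the returnTo field in column event_returnTo.
 # Restriction to the article namespace is assumed to happen downstream.
 import pymysql

 TABLE = "ServerSideAccountCreation_<revision>"  # placeholder: fill in the schema revision

 conn = pymysql.connect(read_default_file="~/.my.cnf", db="log")
 with conn.cursor() as cur:
     cur.execute("""
         SELECT event_returnTo
         FROM {table}
         WHERE timestamp BETWEEN '20140716000000' AND '20140723000000'
           AND event_returnTo IS NOT NULL
         ORDER BY RAND()
     """.format(table=TABLE))
     return_to_titles = [row[0] for row in cur.fetchall()]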

Gathering similar articles

In order to replicate the proposed behavior of mw:Task_recommendations, we implemented the following filters (a sketch follows the list):

  • article_length > 0: Sanity check that it's not blank
  • Filter en:Category:Living people: no biographies of living people -- too difficult for newbies to edit without being reverted
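
A minimal sketch of these two filters, assuming hypothetical helpers article_length() and categories() that wrap the MediaWiki API:

 # article_length(title) and categories(title) are hypothetical helpers
 # assumed to return a page's byte length and its category list.
 def passes_filters(title):
     if article_length(title) <= 0:  # sanity check that it's not blank
         return False
     if "Category:Living people" in categories(title):  # no BLPs
         return False
     return True

 base_articles = [t for t in return_to_titles if passes_filters(t)]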

After filtering, we ended up with 195 base articles. We then used the search API to request the top 50 similar articles from mw:Extension:CirrusSearch.
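
The morelike query can be issued through the standard list=search module of the MediaWiki API. A sketch, with error handling omitted:

 # Request the top 50 "morelike" results (with snippets) for a base article.
 import requests

 def morelike(title, limit=50):
     resp = requests.get("https://en.wikipedia.org/w/api.php", params={
         "action": "query",
         "list": "search",
         "srsearch": "morelike:" + title,
         "srlimit": limit,
         "srprop": "snippet",
         "format": "json",
     })
     return resp.json()["query"]["search"]

 similar = {title: morelike(title) for title in base_articles}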

Subsamples and snippets

Hand-coding all of the top 50 articles for all 195 base articles (9,750 pairs) would be a lot to ask of our hand-coders, so we subsampled at this point to a random set of 5 similar articles per base article, for a total of 975 (base article, similar article) pairs.
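
A sketch of the subsampling step, assuming the similar dict from the sketch above; each result's original rank is recorded before sampling so it can be analyzed later:

 # Draw 5 of the 50 similar articles per base article, without replacement.
 import random

 pairs = []
 for base, results in similar.items():
     ranked = list(enumerate(results, start=1))  # (rank, result) tuples
     for rank, result in random.sample(ranked, 5):
         pairs.append((base, result["title"], rank))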

In order to aid in hand-coding, we took a second pass over this dataset to generate snippets of relevant content. For similar articles, morelike returned a snippet. For base articles, we gathered the wiki markup of the first section, filtered out templates using mwparserfromhell, and saved the first 200 characters.
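
One reasonable reading of that base-article snippet generation, as a sketch; fetch_first_section() is a hypothetical helper that returns the first section's wiki markup:

 # Strip templates from the first section's wiki markup and keep the first
 # 200 characters of the remaining plain text.
 import mwparserfromhell

 def base_snippet(title):
     code = mwparserfromhell.parse(fetch_first_section(title))
     for template in code.filter_templates(recursive=False):
         code.remove(template)
     return code.strip_code()[:200]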

Hand-coding

These (base article, similar article) pairs with snippets were split evenly between hand-coders, with a 54-item overlap to test for inter-rater reliability. Items were loaded into Google Spreadsheets, and hand-coders were asked to rate each item as similar or not (1) based only on the titles and (2) after reviewing the text snippets. Hand-coders were volunteers from the Wikimedia Foundation staff: Halfak (WMF), Steven (WMF) and Maryana (WMF).

We performed a Fleiss' kappa test on the 54-item overlap to look for evidence of reliability and found very little agreement between assessments based on titles (kappa = 0.257) or text (kappa = 0.156), so the assessments of individual coders were examined separately.

Fleiss' kappa test outputs

 > kappam.fleiss(overlap[,list(title.aaron>0, title.maryana>0, title.steven>0)])
  Fleiss' Kappa for m Raters

  Subjects = 54
    Raters = 3
     Kappa = 0.257

         z = 3.28
   p-value = 0.00105

 > kappam.fleiss(overlap[,list(text.aaron>0, text.maryana>0, text.steven>0)])
  Fleiss' Kappa for m Raters

  Subjects = 54
    Raters = 3
     Kappa = 0.156

         z = 1.98
   p-value = 0.0477
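
For reference, the same statistic can be reproduced in Python with statsmodels; here title_aaron, title_maryana, and title_steven are assumed to be arrays holding each coder's 54 binary title judgments:

 # aggregate_raters() turns a (subjects x raters) array of labels into a
 # (subjects x categories) count table, which fleiss_kappa() consumes.
 import numpy as np
 from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

 ratings = np.column_stack([title_aaron, title_maryana, title_steven])
 table, _ = aggregate_raters(ratings)
 print(fleiss_kappa(table))  # ~0.257 for the title assessments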

Results

Since our inter-rater reliability scores were low, we could not combine the codings of our three coders into a single dataset, so we analyze each coder separately. Since there were so few observations per rank, we combined ranks into buckets of 5, labeled by the lowest rank they contain: ranks 1-5 appear in bucket 1, ranks 6-10 in bucket 6, and so on (see the sketch below).
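
A sketch of the bucketing rule:

 # Label each rank by the lowest rank in its bucket of 5:
 # ranks 1-5 -> 1, ranks 6-10 -> 6, ..., ranks 46-50 -> 46.
 def bucket(rank, size=5):
     return ((rank - 1) // size) * size + 1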

 
Figure: Similarity by rank. Proportions of hand-coded similar titles, plotted by rank bucket (bucket size = 5) for each coder.

#Similarity by rank suggests that there's a clear, high rate of similarity once hand-coders reviewed the text of the base article and similar article. Even as rank approaches 50, about 75% of articles returned by morelike are considered similar by the hand-coders. That means, if 3 articles recommended by morelike fall between rank 45 and 50, there is a 25% * 25% * 25% = 1.6% chance that none of them will actually be similar.

For the first 5 similar articles, raters found about 94% to be similar. That means there is a 6% * 6% * 6% = 0.02% chance that none of the top five recommended articles will be similar.
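
The arithmetic behind both estimates, assuming each recommendation is independently similar with probability p:

 # Probability that none of three independent recommendations is similar: (1-p)^3.
 for p in (0.75, 0.94):
     print(p, round((1 - p) ** 3, 4))  # 0.75 -> 0.0156 (~1.6%); 0.94 -> 0.0002 (~0.02%)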

Conclusion

These results suggest that mw:Extension:CirrusSearch's morelike feature will be an acceptable means of recommending topically related articles based on newly registered users' first edits.
