Research talk:Automated classification of article importance/Work log/2017-04-24

Monday, April 24, 2017

My goals for today are to wrap up the analysis and selection of WikiProjects, and to follow up on communication with WPMED.

WPMED communications

I wrote a follow-up message to our discussion thread, pointing to some of the challenges we're facing. I am also following up on this with Aaron.

WikiProject candidates

Last Friday I gathered a dataset of the number of edits to WikiProject articles within the past 180 days. I'll combine this dataset with the ones we already have and analyze them further to see whether there are significant differences between any interesting candidates in the number of edits their articles get.
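A minimal sketch of how the combination step could look, assuming both datasets are keyed by project name. The field names (`num_articles`, `non_bot_page_edits`, `article_edits`) and the example projects are hypothetical, not the actual schema or data we use.

```python
def combine_project_stats(project_stats, article_edits):
    """Join per-project stats with per-project article edit counts.

    project_stats: dict mapping project name -> dict of existing stats
                   (hypothetical keys: "num_articles", "non_bot_page_edits").
    article_edits: dict mapping project name -> number of edits to the
                   project's articles in the past 180 days.
    Projects missing from either dataset are skipped (an inner join).
    """
    combined = []
    for name, stats in project_stats.items():
        if name not in article_edits:
            continue
        row = dict(stats)  # copy so the input is not modified
        row["project"] = name
        row["article_edits"] = article_edits[name]
        n_articles = stats["num_articles"]
        # Edits per article makes projects of different sizes comparable.
        row["edits_per_article"] = (
            article_edits[name] / n_articles if n_articles else 0.0
        )
        combined.append(row)
    return combined


# Made-up example values, for illustration only.
example_stats = {
    "Project A": {"num_articles": 200, "non_bot_page_edits": 50},
    "Project B": {"num_articles": 1000, "non_bot_page_edits": 300},
    "Project C": {"num_articles": 10, "non_bot_page_edits": 5},
}
example_article_edits = {"Project A": 400, "Project B": 500}

example_combined = combine_project_stats(example_stats, example_article_edits)
```

Normalizing to edits per article is one way to spot outliers that raw edit counts would hide; other normalizations would work as well.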

 
Scatterplot of edits to project pages versus project articles.

The answer is quite categorically "no", as the scatterplot above shows. There are a few outliers in the graph. One example is WikiProject Donald Trump, which has a much higher number of edits per article than the other projects. This might be due to vandalism, which we do not catch. Another example is WikiProject Biography, which sits below all others, partly because the project organizes its articles into subcategories, so we fail to pick up most of them. More interesting to us is perhaps the graph below:

 
Scatterplot of non-bot edits vs. number of articles, with a color gradient for percentage of articles that are of "unknown" importance

We have again plotted the number of non-bot edits to project pages on the X-axis, replaced the Y-axis with the number of articles within the scope of the project, and added a color gradient showing the percentage of the project's articles that are rated as "unknown" importance. The thought here is that we want to approach projects that are reasonably active, have a fair number of articles, and where our predictions can help them rate articles more efficiently. Based on the graph, it looks like the first ones to approach have more than 100 non-bot edits to project pages, more than 1,000 articles, and at least 25% of their articles rated as "unknown" importance, possibly more. I'll make a list of those and discuss the approach with my fellows.
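The selection criteria above could be applied roughly as follows. This is a sketch under assumptions: the field names and the example rows are hypothetical, and the thresholds are the tentative ones read off the graph, not final values.

```python
# Tentative thresholds read off the scatterplot; subject to discussion.
MIN_PAGE_EDITS = 100       # non-bot edits to project pages (strictly more than)
MIN_ARTICLES = 1000        # articles within project scope (strictly more than)
MIN_UNKNOWN_SHARE = 0.25   # share of articles with "unknown" importance (at least)


def select_candidates(projects):
    """Return projects meeting all three criteria, highest unknown share first.

    projects: list of dicts with hypothetical keys "project",
              "non_bot_page_edits", "num_articles", "unknown_share".
    """
    candidates = [
        p for p in projects
        if p["non_bot_page_edits"] > MIN_PAGE_EDITS
        and p["num_articles"] > MIN_ARTICLES
        and p["unknown_share"] >= MIN_UNKNOWN_SHARE
    ]
    return sorted(candidates, key=lambda p: p["unknown_share"], reverse=True)


# Made-up example rows, for illustration only.
example_projects = [
    {"project": "Project A", "non_bot_page_edits": 150,
     "num_articles": 5000, "unknown_share": 0.40},
    {"project": "Project B", "non_bot_page_edits": 80,
     "num_articles": 9000, "unknown_share": 0.60},
    {"project": "Project C", "non_bot_page_edits": 400,
     "num_articles": 1200, "unknown_share": 0.75},
    {"project": "Project D", "non_bot_page_edits": 500,
     "num_articles": 3000, "unknown_share": 0.10},
]

example_candidates = [p["project"] for p in select_candidates(example_projects)]
```

Sorting by the share of "unknown" ratings puts the projects where predictions would fill the largest gap at the top of the list; ranking by activity instead would be an equally defensible choice.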
