Research talk:Automated classification of article importance/Work log/2017-04-17

Monday, April 17, 2017

Today I'll follow up with WPMED on candidates for reassessment, finish looking into candidate WikiProjects, update our code so that it handles inlinks with redirects properly, and start gathering global data.

Candidate WikiProjects

I altered my code slightly to better handle capitalization, and now have only 207 projects where a project page is not found. Those I can handle manually. I have also decided to use the category structure to find candidate projects, mainly because it makes it very easy for us to identify exactly which articles are within the scope of a project, but also because it restricts our data gathering to projects that we know use importance ratings. If we instead went through Wikidata, we would get all projects and would then have to skip those that do not have the necessary categories. Lastly, the current approach also picks up task forces, and it might be easier to go from a task force to its parent WikiProject (e.g. through redirects and page title parsing) than the other way around.
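As a rough illustration of the page title parsing idea, here is a minimal Python sketch. It assumes that a task force page is a subpage of its parent project (e.g. "Wikipedia:WikiProject Military history/World War II task force"), which will not hold for every task force; the function name `parent_project_title` is hypothetical and not part of our actual code. Task forces that do not follow this convention would still need redirect resolution or manual handling.

```python
def parent_project_title(page_title: str) -> str:
    """Return the presumed parent WikiProject title for a task force page.

    Assumption: task forces are subpages of the parent project, so the
    parent title is everything before the first "/". Titles without a
    "/" are returned unchanged.
    """
    return page_title.split("/", 1)[0].strip()


if __name__ == "__main__":
    title = "Wikipedia:WikiProject Military history/World War II task force"
    print(parent_project_title(title))
```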

Some observations after manually inspecting categories that did not have a parent WikiProject in the dataset:

I've cleaned the dataset by removing all categories that had "-class" or "unassessed" in the name. Using R, I also removed all categories that contained only unassessed pages.
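The name-based part of this cleaning step could look roughly like the Python sketch below. It is only an illustration, not the code actually used: the function names and example category names are hypothetical, and matching is assumed to be case-insensitive.

```python
def is_candidate_category(name: str) -> bool:
    """Keep a category unless its name marks it as a quality-class
    ("-class") or unassessed category. Matching is case-insensitive
    (an assumption)."""
    lowered = name.lower()
    return "-class" not in lowered and "unassessed" not in lowered


def clean_categories(names):
    """Return the category names that survive the name-based filter,
    preserving their original order."""
    return [name for name in names if is_candidate_category(name)]


if __name__ == "__main__":
    # Hypothetical example names, for illustration only.
    example = [
        "Top-importance medicine articles",
        "FA-Class medicine articles",
        "Unassessed medicine articles",
        "Low-importance medicine articles",
    ]
    print(clean_categories(example))
```

The second step, dropping categories that contain only unassessed pages, depends on per-category page data and is not covered by this sketch.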
