Research talk:Automated classification of article importance/Work log/2017-04-13

Thursday, April 13, 2017

Today I'll follow up on communications with WPMED, look into getting data across enwiki to test our classifier on global data, and study candidate WikiProjects in more detail.

Candidate WikiProjects

We earlier gathered data on the number of articles in a couple of thousand categories related to WikiProjects. The question is whether we use these to find candidates or take another approach. For example, the following WDQS query finds 2,670 WikiProjects and task forces with English names, along with their associated enwiki pages:

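PREFIX schema: <http://schema.org/>

SELECT ?item ?itemLabel ?sitelink
WHERE
{
    # Items that are an instance of (P31) Q21025364, the class used to find WikiProjects here
    ?item wdt:P31 wd:Q21025364 .
    # Sitelinks (wiki pages) that are about the item
    ?sitelink schema:about ?item .
    # Keep only sitelinks whose URL points to English Wikipedia
    FILTER REGEX(STR(?sitelink), "en.wikipedia.org") .
    # Fetch the English label for ?item as ?itemLabel
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}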

That approach does not give us the project's associated category, which we'd like to use to identify projects where activity is still happening, for example through edits to the project's articles. I'll think more about this.
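If we want to feed the query results into our own scripts, something like the following Python sketch could work. It is only an illustration, not code from this project: the function names and the User-Agent string are placeholders I made up, and it assumes the public WDQS endpoint at https://query.wikidata.org/sparql returning JSON. Nothing is fetched until fetch_projects() is called.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The same query as in the log entry above.
QUERY = """
PREFIX schema: <http://schema.org/>

SELECT ?item ?itemLabel ?sitelink
WHERE
{
    ?item wdt:P31 wd:Q21025364 .
    ?sitelink schema:about ?item .
    FILTER REGEX(STR(?sitelink), "en.wikipedia.org") .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
"""


def build_request(query):
    """Build a GET request asking WDQS for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    # Placeholder User-Agent; replace it with one that identifies your tool.
    headers = {"User-Agent": "importance-worklog-example/0.1 (placeholder)"}
    return urllib.request.Request(WDQS_ENDPOINT + "?" + params, headers=headers)


def sitelink_to_title(sitelink):
    """Turn an enwiki sitelink URL into a page title with spaces."""
    path = urllib.parse.urlparse(sitelink).path
    title = path[len("/wiki/"):] if path.startswith("/wiki/") else path
    return urllib.parse.unquote(title).replace("_", " ")


def parse_bindings(result):
    """Extract (item URI, label, enwiki page title) from a SPARQL JSON result."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append((
            binding["item"]["value"],
            binding.get("itemLabel", {}).get("value"),
            sitelink_to_title(binding["sitelink"]["value"]),
        ))
    return rows


def fetch_projects(query=QUERY):
    """Run the query against WDQS (network access required)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return parse_bindings(json.load(response))
```

The page titles returned this way could be a starting point for looking up each project's category, which the query itself does not provide.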

Counting Redirected Inlinks

Our code so far has not counted links that reach an article through redirects. For example, en:Anarchist is a redirect to en:Anarchism, and 1,243 articles link to the former. We should account for these links in our datasets as well. I wrote a SQL query that handles this and will update our code to get improved inlink counts.
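To illustrate the idea, here is a minimal sketch in Python with an in-memory SQLite database. It is not the query mentioned above. The three tables are a simplified toy version loosely modeled on MediaWiki's page, pagelinks, and redirect tables, and the data is made up. The sketch assumes we count distinct source articles in the main namespace, skip sources that are themselves redirects, and follow only one level of redirect.

```python
import sqlite3

SCHEMA = """
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER,
                   page_title TEXT, page_is_redirect INTEGER);
CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
CREATE TABLE redirect (rd_from INTEGER, rd_namespace INTEGER, rd_title TEXT);
"""

# Links that point directly at the target title.
DIRECT_LINKS = """
    SELECT pl.pl_from AS source_id
    FROM pagelinks AS pl
    JOIN page AS src ON src.page_id = pl.pl_from
    WHERE pl.pl_namespace = 0 AND pl.pl_title = :title
      AND src.page_namespace = 0 AND src.page_is_redirect = 0
"""

# Links that point at a redirect page whose target is the given title.
VIA_REDIRECTS = """
    SELECT pl.pl_from AS source_id
    FROM redirect AS rd
    JOIN page AS rp ON rp.page_id = rd.rd_from
    JOIN pagelinks AS pl
      ON pl.pl_namespace = rp.page_namespace AND pl.pl_title = rp.page_title
    JOIN page AS src ON src.page_id = pl.pl_from
    WHERE rd.rd_namespace = 0 AND rd.rd_title = :title
      AND rp.page_namespace = 0
      AND src.page_namespace = 0 AND src.page_is_redirect = 0
"""


def build_toy_db():
    """Create a tiny made-up link graph around Anarchism and Anarchist."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", [
        (1, 0, "Anarchism", 0),
        (2, 0, "Anarchist", 1),   # redirect to Anarchism
        (3, 0, "Article_A", 0),
        (4, 0, "Article_B", 0),
        (5, 0, "Article_C", 0),
    ])
    conn.execute("INSERT INTO redirect VALUES (2, 0, 'Anarchism')")
    conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", [
        (2, 0, "Anarchism"),      # the redirect page's own link, not counted
        (3, 0, "Anarchism"),      # direct link
        (4, 0, "Anarchist"),      # link through the redirect only
        (5, 0, "Anarchism"),      # links to both variants, counted once
        (5, 0, "Anarchist"),
    ])
    return conn


def count_inlinks(conn, title, include_redirects=True):
    """Count distinct articles linking to title, optionally via redirects."""
    parts = [DIRECT_LINKS]
    if include_redirects:
        parts.append(VIA_REDIRECTS)
    sql = "SELECT COUNT(DISTINCT source_id) FROM (" + " UNION ".join(parts) + ")"
    return conn.execute(sql, {"title": title}).fetchone()[0]


if __name__ == "__main__":
    db = build_toy_db()
    print(count_inlinks(db, "Anarchism", include_redirects=False))
    print(count_inlinks(db, "Anarchism", include_redirects=True))
```

In the toy data, Article_B links only to the redirect, so it is missed by the direct count and picked up once redirects are included. Whether an article that links to both variants should count once or twice is a design choice; this sketch counts it once.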
