Research talk:Autoconfirmed article creation trial/Work log/2018-03-15

Thursday, March 15, 2018

Today I'll look into the statistic we reported on the publication rate of drafts, which has generated a lot of discussion on the research report's talk page.

Publication of pages created in the Draft namespace

What specifically did we claim in the report?

"While not the primary focus of our analysis, during our work we discovered that the publication rate of pages created in the Draft namespace is incredibly low (about 1.2%)..."

How did I get that 1.2% number?

I gathered a dataset of pages created in the Draft namespace and limited it to those created after 2014-07-01 00:00:00 UTC and before 2017-12-01 00:00:00 UTC. That dataset contains 126,957 pages. There's a graph of the number of pages created per day in the January 31 work log. There are 1,249 days between those two dates, so a rate of slightly above 100 pages per day (101.65) makes sense based on the graph.
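The day count and daily rate above can be reproduced with a few lines of Python. This is only an arithmetic check on the figures quoted in this log, not part of the original analysis code, and the variable names are illustrative.

```python
from datetime import datetime

# Boundaries of the dataset as stated above (UTC).
start = datetime(2014, 7, 1)
end = datetime(2017, 12, 1)

# Number of Draft namespace creations reported for that window.
pages_created = 126_957

days = (end - start).days
pages_per_day = pages_created / days

print(f"{days} days, {pages_per_day:.2f} pages per day")
```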

I checked the edit history of all 126,957 pages to see whether an edit comment identified a move and, if so, whether it was a valid move into the Main (article) namespace (relevant code is afc_draft_predictions.py, lines 379–485). This found that 1,550 pages had been moved. That's 1.22%.
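To illustrate the kind of check described above, here is a minimal sketch of matching a move in an edit comment. It is not the code in afc_draft_predictions.py. It assumes the comment follows the common "moved page [[Source]] to [[Target]]" wording, and both the `is_move_to_main` helper and the short list of namespace prefixes are hypothetical and incomplete.

```python
import re

# Assumed shape of a move comment: "... moved page [[Source]] to [[Target]] ..."
MOVE_RE = re.compile(
    r"moved page \[\[(?P<source>[^\]]+)\]\] to \[\[(?P<target>[^\]]+)\]\]"
)

# Hypothetical, incomplete list of non-Main namespace prefixes.
# A bare colon test would misclassify article titles that contain a colon.
NON_MAIN_PREFIXES = (
    "Draft:", "Draft talk:", "User:", "User talk:",
    "Wikipedia:", "Wikipedia talk:", "Talk:", "Template:", "Category:",
)


def is_move_to_main(comment: str) -> bool:
    """Return True if the comment looks like a Draft-to-Main move."""
    match = MOVE_RE.search(comment)
    if match is None:
        return False
    source, target = match["source"], match["target"]
    return source.startswith("Draft:") and not target.startswith(NON_MAIN_PREFIXES)


print(is_move_to_main("Example moved page [[Draft:Foo]] to [[Foo]]"))
print(is_move_to_main("Example moved page [[Draft:Foo]] to [[User:Example/Foo]]"))
```

A real check would also need to handle older comment wordings and moves that were later reverted, which this sketch ignores.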

In the R file for the AfC analysis, specifically lines 67–70, I mention that I believed that number to be wrong when I first encountered it. As the code comments mention, I therefore ran a sanity check by counting how many of those Draft pages were live on English Wikipedia in the Main namespace (as of Feb 1, 2018). That sanity check found only 964 pages. Since we would expect some moved pages to later fail AfD or be moved back, a live count below 1,550 indicates that the reported total is likely correct.
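The two counts can be put side by side as rates over the same 126,957 pages. The sketch below only redoes the arithmetic from the figures in this log. It assumes the move count and the live count are measured over the same set of pages, and the variable names are illustrative.

```python
# Figures quoted in this work log.
total_drafts = 126_957   # Draft pages created 2014-07-01 to 2017-12-01
moved_to_main = 1_550    # pages with a detected move into Main
live_in_main = 964       # pages live in Main as of Feb 1, 2018

moved_rate = moved_to_main / total_drafts
live_rate = live_in_main / total_drafts

print(f"moved to Main: {moved_rate:.2%}")
print(f"live in Main:  {live_rate:.2%}")

# The sanity check: the live count should not exceed the move count
# if moves are the route by which drafts reach Main.
consistent = live_in_main <= moved_to_main
print("live count at or below move count:", consistent)
```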

Things to note

  1. This is not the AfC publication rate. The way drafts go through AfC is different as pages can originate from other namespaces (e.g. Legacypac mentions User namespace drafts being moved to Draft when submitted to AfC). They can also be submitted multiple times to AfC, which means we would in that case have to decide whether we calculate publication rate based on number of pages, or number of submissions.
  2. Why do we end up with 1,550 when there are more than 80,000 accepted AfCs (the number of pages in Category:Accepted AfC submissions)?
    1. AfCs can originate from multiple namespaces.
    2. We do not know what timespan those pages cover, whereas we only looked at page creations from July 1, 2014 to December 1, 2017.
    3. The number of accepted AfCs also contains a large number of redirects, proposed through Wikipedia:Articles for creation/Redirects. Per this query, the number of accepted AfCs that are not redirects appears to be just shy of 50,000 on March 14, 2018.

What's a more recent estimate? I gathered a dataset of Draft namespace creations between 2017-09-15 and 2018-02-15, the first five months of ACTRIAL. I ended it on February 15 to leave time for reviews and moves after that date. The process and results are documented in this gist. Out of 34,115 Draft pages created, 3,771 (11.1%) were live as of March 14, 2018.
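As a quick check on the more recent estimate, the sketch below recomputes the live rate and the number of days between the end of the creation window and the snapshot date. It uses only the figures quoted above. The variable names are illustrative, and the gist remains the authoritative record of the process.

```python
from datetime import date

# Figures quoted in this work log.
drafts_created = 34_115   # Draft pages created 2017-09-15 to 2018-02-15
live_in_main = 3_771      # of those, live as of March 14, 2018

live_rate = live_in_main / drafts_created

# Time left for reviews and moves after the creation window closed.
window_end = date(2018, 2, 15)
snapshot = date(2018, 3, 14)
grace_days = (snapshot - window_end).days

print(f"live rate: {live_rate:.1%}")
print(f"days between window end and snapshot: {grace_days}")
```

Drafts created near the end of the window had only about four weeks to be reviewed, so the rate may still rise for that cohort.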

It's also worth noting that when analyzing the quality of AfC submissions from the Draft namespace (see the February 24 work log), we see an increase over time in the proportion of submissions labelled "OK". This could be associated with a higher publication rate, but we would have to investigate further to determine whether there's a relationship.
