Research talk:Social media traffic report pilot

Please provide feedback on the English Wikipedia social media traffic report here.

General feedback goes here

Questions, comments, concerns, suggestions that don't fit into any of the sections below

  • I was surprised to see the report published on the English Wikipedia rather than Meta. Is this just for the pilot, or do you think that can scale to all projects? Thanks, Nemo 19:46, 23 March 2020 (UTC)
  • Nemo_bis it won't scale without dedicated resources (human and technological), but if the EnWiki pilot gets enough traction, we'll lobby for those resources. Thanks, Jmorgan (WMF) (talk) 19:56, 23 March 2020 (UTC)
  • Is the YouTube traffic coming via their official use of Wikipedia (as context beneath videos) or through user comments? (I had heard about the former but I can't recall the last time I saw such usage on a video.) If these can be distinguished, I think it would make an analytic difference. czar 00:49, 30 March 2020 (UTC)
  • Czar, unfortunately we can't distinguish whether the traffic is coming from video descriptions, creator pages, comments, or banners added to videos by YouTube itself. A lot of the perennial top traffic hits from YouTube are news organizations; this is likely because YouTube has a policy of linking to the Wikipedia article for news providers on high-traffic videos from that provider. But that's just an assumption on my part. Cheers, Jmorgan (WMF) (talk) 23:53, 30 March 2020 (UTC)
  • This is a great idea! Quick thought: Are you planning on running this as a randomized control trial? I think that could make your analysis more powerful. The idea would be to not put on the list all articles that received at least 500 views from social media, but only a randomly sampled fraction (e.g., 50%) of those. Then you could compare those to the held-out sample of articles that would usually have made it onto the list but didn't because of randomization, and quantifying the impact of the list would be a breeze. If you don't randomize, but publish the complete list, you'll have to deal with all sorts of unobserved confounds, which will challenge the validity of the trial. Ciao, Bob West. --Cervisiarius (talk) 07:41, 31 March 2020 (UTC)
  • Ciao Cervisiarius yes, we're considering this. Right now we're working out the kinks and assessing user acceptance, but I agree that this would be powerful. Let me know if anyone in your lab is interested in participating in the study. Jmorgan (WMF) (talk) 16:46, 31 March 2020 (UTC)
  • Just to add another note -- we have a strict threshold of 500 pageviews for security reasons but retain data privately for pages with fewer than 500 pageviews, so there is also the opportunity to do a regression discontinuity analysis -- e.g., compare pages with 500-600 pageviews with pages with 400-500 pageviews. This obviously is not as powerful as a randomized control trial but also means we don't have to withhold information that we could be sharing. --Isaac (WMF) (talk) 00:29, 1 April 2020 (UTC)
  • That's a great idea, Isaac! On top of comparing [400,500[ to [500,600[, you could also try checking for a "dose-response" relationship: create more fine-grained buckets (as much as the data allows; I'm using a width of 10 in the example to follow) and then see if the jump from [490,500[ to [500,510[ is significantly larger than that between buckets that don't cross the discontinuity (e.g., [480,490[ and [490,500[). That said, if I were you, I'd still go for the randomized control trial if at all possible... ;) --Cervisiarius (talk) 09:58, 4 April 2020 (UTC)
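
The bucket comparison suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `pages` data, the outcome measure (edits in the following day), and the helper names `bucket_means` and `adjacent_jumps` are hypothetical and not part of the pilot. It shows only how the jump across the 500-pageview cutoff can be set against jumps between buckets that do not cross it; a real analysis would also need a significance test.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 500  # publication cutoff mentioned in the discussion
WIDTH = 10       # bucket width used in the example above


def bucket_means(pages, width=WIDTH):
    """Group (pageviews, outcome) pairs into [k*width, (k+1)*width) buckets
    and return the mean outcome per bucket, keyed by the bucket's lower bound."""
    buckets = defaultdict(list)
    for pageviews, outcome in pages:
        lower = (pageviews // width) * width
        buckets[lower].append(outcome)
    return {lower: mean(values) for lower, values in sorted(buckets.items())}


def adjacent_jumps(means, width=WIDTH):
    """Return mean(right bucket) - mean(left bucket) for each pair of
    neighbouring buckets, keyed by the lower bound of the right bucket."""
    return {
        lower: means[lower] - means[lower - width]
        for lower in means
        if lower - width in means
    }


# Hypothetical data: (social media pageviews, edits in the following day)
pages = [
    (478, 2), (483, 3), (487, 3), (492, 3), (498, 4),
    (503, 8), (507, 9), (512, 9), (518, 10),
]

means = bucket_means(pages)
jumps = adjacent_jumps(means)

# Jump from [490,500[ to [500,510[ versus jumps that do not cross the cutoff
jump_at_threshold = jumps[THRESHOLD]
other_jumps = [jump for lower, jump in jumps.items() if lower != THRESHOLD]

print("jump at threshold:", jump_at_threshold)
print("other jumps:", other_jumps)
```

If the jump at the threshold is much larger than the others, that would be consistent with an effect of appearing on the list, though buckets this narrow may contain too few pages to say much.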

The page has a notice that "This report will no longer be regularly maintained as of 31 May 2020." but still seems to be regularly maintained. Is the template outdated, or is it still going away? — Rhododendrites talk \\ 14:27, 1 November 2020 (UTC)

(@Isaac (WMF):) @Rhododendrites: thanks for the heads-up. I believe Isaac has started the reports up again for a limited time. Cheers, Jtmorgan (talk) 21:08, 1 November 2020 (UTC)
Thanks for checking -- to add to what Jtmorgan said, we had a request to restart the pilot for another few weeks around the US election. It will likely be taken offline again in the next few weeks though (hence the not regularly maintained aspect). --Isaac (WMF) (talk) 13:14, 2 November 2020 (UTC)

New column suggestions go here

Suggestions of new columns to include in the report (e.g. ORES quality scores, historical traffic averages per article)

Platform traffic as a percentage of all traffic

  • Would it be useful to have a calculation of which percentage of all traffic is coming from a specific platform? (I.e. (Platform traffic / All traffic) * 100%). And should the report possibly be filtered on that instead of just on ">500 views from that platform"? Rchard2scout (talk) 13:08, 24 March 2020 (UTC)
  • Rchard2scout this is a good suggestion. Thanks! I can definitely add a platform_percent_of_total_current_day column. That would make it easier for people to sort the table by the articles that are receiving the highest, or lowest, percent of their traffic from a particular platform. We can do that and still keep the "> 500" values for the previous day (the reason we have that is just that we aren't able to report previous day counts that are less than 500, for privacy reasons). Let me know if you have additional thoughts on that. Cheers, Jmorgan (WMF) (talk) 23:22, 25 March 2020 (UTC)
  • From Stuart A. Yates on wikiresearch-l on 2020/03/23: "My immediate thought is how to connect this to the wiki projects for each article, because wiki projects are the primary sources of expert knowledge and have the resources to deal with many issues." Jmorgan (WMF) (talk) 19:54, 24 March 2020 (UTC)
Some WikiProjects already post about an influx of traffic potentially coming from an event. It would be different for each project. At the very least, it could be interesting to have a bot note current traffic spikes on the article's talk page, so that regular stewards of the page have some idea where the traffic is coming from and can start a discussion on how long it might be sustained. czar 00:52, 30 March 2020 (UTC)
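
For illustration, the percentage column discussed at the top of this section, (Platform traffic / All traffic) * 100%, could be computed along these lines. This is a minimal sketch, not the report's actual pipeline: the sample rows and the helper names are hypothetical, and it assumes that rows with fewer than 500 platform views are withheld, as described in the general feedback section. Only the column name `platform_percent_of_total_current_day` comes from the discussion.

```python
PRIVACY_THRESHOLD = 500  # assumption: rows below this platform count are not published


def platform_percent_of_total(platform_views, total_views):
    """Return platform traffic as a percentage of all traffic, rounded to one decimal."""
    if total_views <= 0:
        return None  # percentage is undefined without any traffic
    return round(platform_views / total_views * 100, 1)


def build_report(rows):
    """Keep rows at or above the privacy threshold and add the percentage column."""
    report = []
    for title, platform_views, total_views in rows:
        if platform_views < PRIVACY_THRESHOLD:
            continue  # withheld for privacy reasons
        report.append({
            "title": title,
            "platform_views": platform_views,
            "platform_percent_of_total_current_day":
                platform_percent_of_total(platform_views, total_views),
        })
    # Sort so that the articles most dependent on the platform come first
    report.sort(key=lambda r: r["platform_percent_of_total_current_day"], reverse=True)
    return report


# Hypothetical rows: (article title, views from one platform, all views)
rows = [
    ("Article A", 1200, 4800),
    ("Article B", 650, 1000),
    ("Article C", 300, 400),
]

report = build_report(rows)
for row in report:
    print(row)
```

Filtering on the percentage instead of on the raw count, as suggested above, would only work for rows that already clear the privacy threshold.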

Number of edits

  • As I understand it, a major use of this will be to see whether there are disruptive edits associated with the additional traffic. If possible, could you include the number of edits for that day, or even the number of IP edits and the number of edits by registered users? Smallbones (talk) 22:50, 24 March 2020 (UTC)
  • Smallbones Thank you! This is a great suggestion. I'm considering adding a "number of edits in the past 24 hours to this article that ORES#Advanced_support predicts are likely damaging". These are the same filters available in the Recent Changes feed. I think that this would serve the same purpose, but without putting good-faith edits by IPs or new editors under unfair scrutiny. Do you think that would address the basic need you're articulating here? Cheers, Jmorgan (WMF) (talk) 23:14, 25 March 2020 (UTC)
    • It might just come down to whichever is easiest to get. I don't think people will normally assume all IP edits are made in bad faith. OTOH if there are 50 IP edits made to an article that normally gets 1 edit per month, that would indicate a possible problem whether it's good faith or not. If the damaging edit prediction is working well and easy to get, then it should probably work as well. I doubt that the type of edits we'd be looking for can be subtly indicated by the referrer to beat the system. BTW I may put a paragraph or 2 in The Signpost about this, unless you object. If you want to email me a short sentence or three about the pilot, I won't just have to paraphrase what's on these pages. Or I may contact you in a couple of days. Smallbones (talk) 23:29, 25 March 2020 (UTC)
      • Smallbones I plan on implementing some version of this over the next week or two, and I'll keep you posted. Re: the next Signpost (whenever the next one comes out; I noticed one went out today): Here's a blurb/summary: "The social media traffic report is intended to help editors identify articles that are either going viral, or are being used by social media platforms to "fact check" misinformation posted by their users. In both of these scenarios, previously quiet Wikipedia articles may receive a huge influx of traffic all at once. Until now, editors had no easy way of monitoring these spikes in near-real time unless the social media spike also corresponded to an overall traffic spike that would be visible in the public page traffic reports. A sudden surge may result in bad faith and/or otherwise damaging edits. In some cases, a spike in traffic from a particular social media platform may even reflect a coordinated attempt to insert disinformation into Wikipedia. The WMF Research team thought that these four platforms in particular would be good initial candidates for this data release, but we're eager to hear additional suggestions. In the near future, we'll be rolling out a reporting form so that editors can flag suspicious diffs that they encounter while browsing the pages on the traffic report. Specific examples help the research team understand what disinformation campaigns on Wikipedia might look like, which in turn will help us develop machine learning models that can detect this kind of activity automatically and dashboards or other tools where these edits can be flagged for editor review. If the social media traffic report proves useful, we're considering making it available long-term, and on multiple Wikipedias." Cheers, Jmorgan (WMF) (talk) 23:48, 30 March 2020 (UTC)
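
As a rough illustration of the per-day edit counts requested at the top of this section, the split between IP edits and edits by registered users over the past 24 hours could be tallied as below. This is a sketch under assumptions: the edit records and the `count_recent_edits` helper are hypothetical, and it does not use ORES or any real Wikimedia API.

```python
from datetime import datetime, timedelta, timezone


def count_recent_edits(edits, now, window=timedelta(hours=24)):
    """Count edits inside the window ending at `now`, split by editor type.

    `edits` is an iterable of (timestamp, is_anonymous) pairs.
    """
    cutoff = now - window
    counts = {"anonymous": 0, "registered": 0}
    for timestamp, is_anonymous in edits:
        if cutoff <= timestamp <= now:
            counts["anonymous" if is_anonymous else "registered"] += 1
    return counts


# Hypothetical edit history for one article
now = datetime(2020, 3, 25, 23, 0, tzinfo=timezone.utc)
edits = [
    (datetime(2020, 3, 25, 22, 10, tzinfo=timezone.utc), True),
    (datetime(2020, 3, 25, 9, 30, tzinfo=timezone.utc), True),
    (datetime(2020, 3, 25, 1, 5, tzinfo=timezone.utc), False),
    (datetime(2020, 3, 24, 12, 0, tzinfo=timezone.utc), True),  # older than 24 hours
]

counts = count_recent_edits(edits, now)
print(counts)
```

Comparing these counts with an article's usual edit rate would surface cases like the "50 IP edits on an article that normally gets 1 edit per month" example mentioned above.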

Top referral link

Are you able to pull the URL of the top referring link? E.g., if a page has gone viral via Reddit, can you link the post that is trending? czar 00:46, 30 March 2020 (UTC)

Czar, no, because we enforce HTTPS and our webrequest logs don't provide any granularity beyond the referring platform. Cheers, Jmorgan (WMF) (talk) 23:50, 30 March 2020 (UTC)

New editor conversion

Is there any way to determine the number of editors who registered for an account after visiting the listed page? I would guess not given your statement about enforcing HTTPS (which is a good thing and should not be changed), but it would be interesting to know the effect viral posts like these have on editor recruitment. Wugapodes (talk) 18:41, 1 April 2020 (UTC)

Wugapodes I agree this would be interesting and potentially useful to know. It's possible we could make this determination using the raw webrequest and event logs (which are not public), as long as the person who visited the page created their account within the same browsing session (i.e., they didn't close the tab/window between viewing the article and clicking "sign up"). This data could not be published at the individual-editor level, but it is possible we could publish the aggregated results of such an analysis. Thanks for the suggestion! Jmorgan (WMF) (talk) 17:27, 2 April 2020 (UTC)

Current Protection Level

Would it be possible to include the current protection level in the report? For our articles that are linked from reputable sources (CNN, BBC, WHO, etc) I wouldn't imagine we would have too much to worry about, but the further down we go on the social media list (Facebook, Reddit, "4chan", etc) the greater the risk would be that the edits are being added to upset or disrupt an article. Having a quick column to see what the current protection level is and perhaps when it was added and/or when it will expire could help make this a useful tool for admins to get out in front of efforts by less benevolent social sites to change content here. TomStar81 (talk) 19:15, 1 April 2020 (UTC)

TomStar81 this is an excellent suggestion. I'll look into how it might be implemented. Thank you! Jmorgan (WMF) (talk) 17:22, 2 April 2020 (UTC)

Social media platform suggestions go here

Suggestions of social media platforms to include in the report (e.g. MySpace, Friendster, Vine)


Design and formatting suggestions go here

Suggestions about how to make the report more usable (e.g. highlight some cells, make it more mobile-friendly)
