Research talk:Wikipedia clickstream

Comments or feedback about this project are welcome on this page --Dario (WMF) (talk) 19:30, 11 February 2015 (UTC)Reply

When will the other data become available?

Hoi, this is English only right ? Thanks, GerardM (talk) 19:28, 17 February 2015 (UTC)Reply

This was a one-off project and has not been productionized or generalized to other language Wikipedias. If you have a request for a set of languages, please list them and we will take that into account during quarterly planning. Ewulczyn (WMF) (talk) 15:54, 24 February 2015 (UTC)

Not found

Hoi, does this include the articles people looked for but could not find ? Thanks, GerardM (talk) 20:02, 17 February 2015 (UTC)Reply

Do you mean clicks on redlinks? That would be good to include. Actually, for many of the stated purposes, the dataset is of questionable value if it doesn't include clicks on redlinks. John Vandenberg (talk) 21:01, 17 February 2015 (UTC)
The current release only includes requests for pages that were in the production table enwiki.page. The next release will include redlinks. Ewulczyn (WMF) (talk) 15:54, 24 February 2015 (UTC)

Clarification on other-wikipedia

Thanks so much for putting all of this together! Just to clarify -- am I correct that entries with a prev_title of 'other-wikipedia' could be referrals from either: 1) any page on any namespace in any *.wikipedia.org project other than enwiki, or 2) any page on enwiki outside the main namespace? Thanks! Staeiou (talk) 22:39, 17 February 2015 (UTC)Reply

'other-wikipedia' includes referrers from the non-main namespaces of English Wikipedia and from all other language Wikipedias. Ewulczyn (WMF) (talk) 15:54, 24 February 2015 (UTC)

Top referrer stats

I ran some simple descriptive stats on referrers, which are up at Research:Wikipedia_clickstream_top_referrers. Staeiou (talk) 23:38, 17 February 2015 (UTC)Reply

This is great

I saw it on Twitter.

Is it possible for a general reader/editor like me to generate an image like this for en:Parkinson's disease, or do I need arcane technical skills? (I'm very old and un-techy) --Anthonyhcole (talk) 01:27, 29 April 2016 (UTC)Reply

I was asking myself the same thing... Doc James (talk · contribs · email) 02:36, 17 December 2017 (UTC)Reply
 
Parkinson's disease – Dec 2017 clickstream
@Anthonyhcole and Doc James: you may have seen the recent announcement of the productized clickstream dataset, which is now available as a monthly dump for each of Wikipedia's 10 largest language editions. User:MPopov (WMF) wrote a nifty visualization app in R that allows you to explore this data. See more examples here.--Dario (WMF) (talk) 22:36, 10 February 2018 (UTC)Reply

More details on other-internal

Hi, more details for "other-internal" would be very useful - for example, show language id + article name for the source wikimedia project. It would help Wikipedia contributors to understand when people switch the language in Wikipedia article - generally, it would mean that existing article is not good enough and needs to be improved. Is it possible to do it? --Andy pit (talk) 14:43, 25 August 2020 (UTC)Reply

update frequency?

When does this typically get updated? Right now it's already December 14 and the November data is still not there. Should we be worried? :) --Joy (talk) 12:26, 14 December 2023 (UTC)Reply

Looks like the December run went through on the 19th, while the January run already went through on the 3rd. It would be nice to be able to correlate this with some further information. --Joy (talk) 13:14, 10 January 2024 (UTC)
WikiNav is still stuck at October, though. --Joy (talk) 13:14, 10 January 2024 (UTC)Reply
@Joy I just saw this now, so sorry for the late reply. Thanks for flagging this.
  • The clickstream dumps get updated at the beginning of each month. Typically, the latest monthly snapshot is available on the 3rd of the next month. The November-snapshot was an exception as there seemed to have been some problem so that publication was delayed until December 19. The December-snapshot was published as expected on January 3rd.
  • The WikiNav tool checks on the 12th of each month for the latest snapshot to update the underlying data. Due to the delay with the November-snapshot, there was no update in December. With the availability of the December-snapshot, the tool got updated on January 12 (using the December data). So we are back to normal.
I hope this answers your questions. Don't hesitate to reach out if you have follow-up questions. MGerlach (WMF) (talk) 09:45, 16 January 2024 (UTC)
Thanks! For the future, it would be great if the status of this monthly scheduled job would be transparent. Perhaps if it was published on some URL and linked somewhere from https://toolhub.wikimedia.org/tools/toolforge-wikinav or similar? --Joy (talk) 12:17, 16 January 2024 (UTC)Reply
I just happened to see this in passing, and I have sent a pull request to add a note about this to the WikiNav readme doc: https://github.com/mnzpk/WikiNav/pull/13 TBurmeister (WMF) (talk) 21:05, 17 January 2024 (UTC)Reply
@Joy There is now a note in the readme of the github-repository under Data update frequency. Thanks @TBurmeister (WMF) for sending the pull request - I was planning to add something along these lines too. MGerlach (WMF) (talk) 08:37, 18 January 2024 (UTC)Reply
Thanks guys, but that doesn't actually do what I asked about :) is the log from the scheduled job somehow private? --Joy (talk) 10:32, 18 January 2024 (UTC)Reply
I see that I missed your original request, sorry for that. The WikiNav backend is hosted on an instance on cloud-vps, and this is where the scripts are regularly run to import the latest clickstream data. We currently don't have a pipeline to make the logs of those scripts publicly available. At the moment, there is no ongoing further development of the tool (as a reminder, the tool was the outcome of an Outreachy internship which finished in 2021). You could create an issue in the github-repo with this request, describing your use-case. That way, it might be picked up by someone in the future (e.g. at a hackathon or so); unfortunately, I don't currently have the capacity to work on this. Sorry for not being of more help. Thanks again for reaching out. MGerlach (WMF) (talk) 14:48, 18 January 2024 (UTC)
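The schedule described earlier in this thread (dump typically available on the 3rd of the following month, WikiNav checking for a new snapshot on the 12th) can be sketched as simple date arithmetic. This is only an illustration of the dates quoted above, not code from the actual pipeline or from WikiNav; the function names `expected_dates` and `picked_up_by_wikinav` are hypothetical.

```python
from datetime import date


def expected_dates(snapshot_year, snapshot_month):
    """Return (expected dump publication date, WikiNav check date).

    Assumes the schedule quoted in this thread: publication on the 3rd
    and the WikiNav check on the 12th of the month after the snapshot.
    """
    if snapshot_month == 12:
        year, month = snapshot_year + 1, 1
    else:
        year, month = snapshot_year, snapshot_month + 1
    return date(year, month, 3), date(year, month, 12)


def picked_up_by_wikinav(actual_publication, wikinav_check):
    """A snapshot is only picked up if it exists when the check runs."""
    return actual_publication <= wikinav_check


# December 2023 snapshot: published January 3, picked up on January 12.
published, check = expected_dates(2023, 12)
print(published, check, picked_up_by_wikinav(published, check))

# November 2023 snapshot: actually published December 19, after the check.
_, nov_check = expected_dates(2023, 11)
print(picked_up_by_wikinav(date(2023, 12, 19), nov_check))
```

Under these assumptions, the delayed November snapshot misses the December check, which matches the behaviour reported above.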

Missing data or errors in 2024-04 dumps?

I noticed that compared to previous clickstream dumps, the data from 2024-04 does not have a "link" type (3rd field). Why? (ping @Ewulczyn_(WMF), @MGerlach (WMF), @DarTar) Prof.DataScience (talk) 09:39, 22 May 2024 (UTC)Reply

@Prof.DataScience Could you describe which files exactly you were looking at and what output you were expecting that was not there? Looking at clickstream-ptwiki-2024-04.tsv.gz the data seems normal to me; e.g., the first line reads:
other-empty    ChatGPT    external    433981
where the 3rd field is "external" which corresponds to the "link type". Sorry if I misunderstand the question or am missing something. MGerlach (WMF) (talk) 10:38, 22 May 2024 (UTC)Reply
I meant 'link' as a value in the 3rd field. For example, we can find it in the 2nd, 3rd and 4th lines of clickstream-ptwiki-2024-03.tsv.gz:
SBT_Podnight    Operação_Mesquita    link    20
Lista_dos_100_melhores_filmes_de_animação_brasileiros_segundo_a_ABRACCINE    Até_que_a_Sbórnia_Nos_Separe    link    16
SBT_News_na_TV    Operação_Mesquita    link    10
But in clickstream-ptwiki-2024-04.tsv.gz there is no line with the 'link' value in the 3rd field. This applies to all languages available in the clickstream dumps for 2024-04. Prof.DataScience (talk) 15:36, 27 May 2024 (UTC)Reply
Got it, thanks for the clarification. This seems to be a bug, indeed. I pinged folks who are working on the data-pipeline (ping @JAllemandou (WMF)). They are looking into this and are working on a fix (see T366042). Thanks again for flagging this issue and bringing this to our attention. MGerlach (WMF) (talk) 07:17, 28 May 2024 (UTC)Reply
@Prof.DataScience Update: @JAllemandou_(WMF) fixed the issue in the data pipeline and reran the jobs. The datasets should no longer contain the error: 2024-04 was corrected, and new datasets will be generated with the link type next month. Feel free to reach out if you have any further questions. Thanks again. MGerlach (WMF) (talk) 12:29, 28 May 2024 (UTC)
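For anyone who wants to check a dump for this kind of problem, a minimal sketch follows. It counts the values of the 3rd field, assuming the four fields are tab-separated as the .tsv extension suggests. The function name `count_link_types` is hypothetical, and the sample mixes rows quoted in this thread from the 2024-03 and 2024-04 ptwiki files purely for illustration.

```python
import gzip
from collections import Counter


def count_link_types(lines):
    """Count values of the 3rd field (link type) across clickstream rows."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not have exactly four fields
        counts[fields[2]] += 1
    return counts


# Rows quoted in this thread, with tabs restored between fields.
SAMPLE_ROWS = [
    "other-empty\tChatGPT\texternal\t433981\n",
    "SBT_Podnight\tOperação_Mesquita\tlink\t20\n",
    "SBT_News_na_TV\tOperação_Mesquita\tlink\t10\n",
]

print(dict(count_link_types(SAMPLE_ROWS)))

# For a downloaded dump (local path is an assumption):
# with gzip.open("clickstream-ptwiki-2024-04.tsv.gz", "rt", encoding="utf-8") as f:
#     print(count_link_types(f))
```

If 'link' is absent from the resulting counts for a full monthly file, that would reproduce the symptom reported above.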

2024-06 WikiNav delay

Looks like we have another delay, as it's the 17th of July but https://wikinav.toolforge.org/?language=en&title=Cell is showing May. The clickstream archives were available early, the timestamps say July 4 [1]. --Joy (talk) 08:17, 17 July 2024 (UTC)Reply

@Joy Thanks for flagging this. There was an issue with restarting the webservice after some changes. It is fixed now, WikiNav is using data from the June-snapshot. MGerlach (WMF) (talk) 09:06, 17 July 2024 (UTC)Reply
Thanks for the quick response! --Joy (talk) 09:53, 17 July 2024 (UTC)Reply

2024-08 504 time-outs

I noticed over the last few days that WikiNav is often taking too long to respond. Could someone have a look? --Joy (talk) 08:02, 24 August 2024 (UTC)Reply

@Joy it seems to be pretty responsive to me now. Are you still having issues? If so, which articles and approximately how long is it taking to respond? Isaac (WMF) (talk) 12:23, 26 August 2024 (UTC)Reply
It seems to have recovered for me, too. It was taking long enough for the proxy in front of it to serve the 504 page. Its logs should elucidate the scope of the incident. --Joy (talk) 16:17, 27 August 2024 (UTC)Reply

question about anonymization thresholds

At en:Wikipedia talk:Disambiguation we've been having a few discussions about how to use the clickstreams to help organize Wikipedia navigation. One area of concern has been the part that says:

any `(referrer, resource)` pair with 10 or fewer observations was removed from the dataset

Where can we find more information about the rationale for this? It seems like something obviously meant to avoid risking leaking individual user browsing history, but it would be nice to hear how the specific algorithm of filtering was decided on and how it could be modified.

The reason is that the current implementation seems to often impede usability of these statistics. We observe hundreds and thousands of these filtered requests on various pages, with examples of up to a third of all traffic at a page being filtered.

It would be nice if there was a way to get aggregate information: even if individual source{1,2,3,...}-destination pairs do not amount to 10, if a multitude of sources leads to a single destination more than 9 times, that would probably help see a more accurate picture of reader behavior.

TIA --Joy (talk) 12:18, 29 August 2024 (UTC)Reply

I'm the other main participant in that discussion. In particular, I can see that there is often a percentage for "Filtered" displayed under Incoming Views -- how is this calculated if the rows below the threshold are dropped from the data set? Or is the calculation done before the rows are dropped? In either case, would it be possible to have a similar sort of percentage for "Filtered" outgoing views. This would provide at least some indication of how large the tail is in discussions about primary topics. Bkonrad (talk) 15:24, 29 August 2024 (UTC)Reply
@Joy I wanted to acknowledge your question though I don't have a full answer for you yet. Regarding the filtering rationale, the clickstream pre-dates my time at Wikimedia so I don't know where the official rationale is for this but I'll share if I manage to dig it up. It would have been around reader privacy as you suggested though. Since then, we've switched to a more standardized set of guidelines for data publication. As part of a discussion on how to expand the clickstream to cover even more languages (task T289532), we might revisit this soon. I'll pass along this feedback if the choice of 10 is revisited.
@Bkonrad the "Filtered" percentage if I remember is calculated by comparing the pageviews accounted for in the clickstream dataset to the total pageviews to that page (you can see the code on Github). So it's done purely via access to the clickstream dataset + public pageview APIs. An outgoing "filtered" estimate would be much harder because we don't have datasets/APIs that track total counts of outgoing clicks from a page and would have to modify the clickstream dataset itself to calculate this. I'll raise this too as a possibility for at least greater information even if the threshold of 10 doesn't end up changing but I'll admit that I don't have a great sense of its technical feasibility.
Hope that helps! Isaac (WMF) (talk) 20:06, 29 August 2024 (UTC)Reply
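As a rough illustration of the "Filtered" calculation described above (clickstream-accounted views compared with total pageviews), here is a minimal sketch. It is an assumption about the arithmetic, not the actual WikiNav code on Github; the function name `filtered_share` and the numbers are hypothetical.

```python
def filtered_share(total_pageviews, clickstream_incoming_counts):
    """Share of a page's views not accounted for by clickstream rows.

    total_pageviews: total views of the page for the month (e.g. from a
        public pageview API).
    clickstream_incoming_counts: counts of all clickstream rows whose
        destination is this page.
    """
    if total_pageviews <= 0:
        raise ValueError("total_pageviews must be positive")
    accounted = sum(clickstream_incoming_counts)
    # Clamp at zero in case the two sources disagree slightly.
    missing = max(total_pageviews - accounted, 0)
    return missing / total_pageviews


# Hypothetical page: 1000 views in total, 700 of them visible in the dump.
print(filtered_share(1000, [400, 250, 50]))
```

An outgoing equivalent would need a total count of outgoing clicks per page, which, as noted above, is not available from existing datasets or APIs.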
Makes sense to follow a common guideline, though after reading it, I can't make sense of that either :D because this sounds like we're in the category of granular analysis of reading data, hence Tier 2 'medium' risk, so are pageviews and clickstreams already failing to be compliant by reporting any numbers of <250 views? There is a note of a heuristic that would explain it - monthly is the least risky temporal data type.
Regardless, we're not necessarily looking to see data about <10 views per source-destination pair anyway. Rather, having a different, aggregate view over it, where it's >10 views per destination regardless of source, would be an improvement for understanding if our navigation is working well or not, while not reducing privacy (AFAICT). --Joy (talk) 07:47, 30 August 2024 (UTC)Reply
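To make the aggregate idea concrete, here is a minimal sketch of one possible reading of it: pairs at or below the threshold are still dropped individually, but their counts are summed per destination and published under a single placeholder source when that sum exceeds the threshold. This is not how the clickstream pipeline works today; the label 'other-filtered' and the function name `publishable_rows` are hypothetical, and whether such an aggregate satisfies the data publication guidelines is not established here.

```python
from collections import defaultdict

THRESHOLD = 10  # pairs with 10 or fewer observations are removed


def publishable_rows(pair_counts, threshold=THRESHOLD):
    """Keep pairs above the threshold and aggregate the rest by destination.

    pair_counts maps (source, destination) to an observation count.
    """
    kept = {}
    tail = defaultdict(int)
    for (source, destination), count in pair_counts.items():
        if count > threshold:
            kept[(source, destination)] = count
        else:
            tail[destination] += count
    # Publish the aggregated tail only if it clears the same threshold.
    for destination, count in tail.items():
        if count > threshold:
            kept[("other-filtered", destination)] = count
    return kept


# Hypothetical raw counts before any filtering.
raw = {
    ("A", "X"): 50,
    ("B", "X"): 6,
    ("C", "X"): 7,
    ("D", "Y"): 4,
}
print(publishable_rows(raw))
```

In this example the two small pairs leading to X are reported together as 13 views, while the single small pair leading to Y is still withheld.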

redirects

Right now, all redirect traffic is squashed together with destination page traffic. While this generally seems to make sense, it does present a problem trying to figure out if a primary topic redirect is useful or not because we can't compare the clickstreams that just went through the redirect - it's indistinguishable from the rest of the traffic, and the rest of the traffic is typically huge.

Would it be possible to have another analysis of traffic to produce redirect-exclusive clickstreams? --Joy (talk) 08:39, 30 August 2024 (UTC)Reply

@Joy I would actually just use the basic pageview data for this (as opposed to clickstream). For redirects, you know where the traffic is going so the clickstream wouldn't provide you with anything new. Here's the Gdańsk example given in the primary redirect documentation (data). Isaac (WMF) (talk) 12:14, 30 August 2024 (UTC)Reply
The thing is, it would be nice to know that the traffic came to X via a redirect Y and then if they clicked the hatnote or then if they reached for the Search box, then this can be a hint for us that the redirect might be misplaced. --Joy (talk) 14:05, 30 August 2024 (UTC)Reply

internal search engine

How is the traffic coming from the internal Wikipedia search engine classified in the clickstreams?

en:Special:Search URLs are 'ugly', but rather precise, so it would be nice if we could see them in the statistics. Even when the search box generates a dropdown, clicking on these links goes through specific 'ugly' links, so it should be possible to find cases where people are using that to navigate.

TIA --Joy (talk) 08:42, 30 August 2024 (UTC)Reply

For example, someone lands at en:t bone, but then doesn't click the hatnote, rather they go (back) to the search box and type in 't bone' and then click on the second, bolded item en:T Bone Burnett. Would be great if we could find out how often such a thing happens - is it statistically relevant. --Joy (talk) 08:44, 30 August 2024 (UTC)Reply
I recently found what I think is such a case with 'other' clickstreams, described at en:Talk:Nacho#Requested move 1 August 2024. --Joy (talk) 08:47, 30 August 2024 (UTC)Reply
We did look into this (task T292435) but haven't incorporated it. My conclusion at the time was that Special:Search would end up as other-internal, but I'm not sure what other traffic might get included under that grouping. If you want some more (general) background on Search behavior, you can see this report too. Isaac (WMF) (talk) 12:34, 30 August 2024 (UTC)