Research:Effectiveness of the new participant pipeline for Wiki Loves campaigns/Appendix
From this filtered data, we created five anonymized tables (split by day) using Pandas:
- Banner impressions, with one row per banner impression (4.75B rows):
  - Date
  - Access method (mobile web, mobile app, desktop)
  - Logged_in (binary indicator of whether the request was made while logged in)
  - Referer class (internal, external)
  - Project and project family (e.g. Wikimedia Commons, English Wikipedia)
  - Banner campaign and banner observed in that impression
  - Banners seen (the number of WLM banners seen up to that point, capped at 10)
  - Landing seen (binary indicator of whether the reader had seen a landing page)
- Landing page views, with one row per visit by an approximate reader (5.4M rows):
  - Date
  - Access method
  - Logged_in
  - Referer class
  - Project and project family
  - Page title (of the landing page)
  - Banners seen
  - Landing seen
- Create account, with one row per visit to the account creation page by a wlm-reader (0.53M rows):
  - Date
  - Access method
  - Logged_in
  - Referer class
  - Project and project family
  - Banners seen
  - Landing seen
- Uploads, with one row per visit to an upload page by a wlm-reader (90k rows):
  - Date
  - Access method
  - Logged_in
  - Referer class
  - Project and project family
  - Createaccount (whether the reader had visited the create account page)
  - Banners seen
  - Landing seen
  - List pages seen (for US and DE only: the number of monument list pages visited, capped at 10)
- Aggregated data from readers, with one row per reader that had seen a WLM banner (1.18B rows):
  - Whether they had been logged in, logged out, or both
  - Which banners and banner campaigns they had seen
  - Whether they had visited the account creation page
  - Whether they had visited an upload page
  - Whether they had visited a landing page
  - The number of visits to a US monument list page (capped at 10)
  - The number of visits to a DE monument list page (capped at 10)
The tables were shuffled and the index was removed to anonymize them. These tables, five per day, could then be stored beyond the 3-month deletion deadline of the webrequest logs and read back with Spark for further grouping and analysis.
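As a rough illustration of this step, here is a minimal Pandas sketch of the shuffle-and-drop-index anonymization; the column names, values, and file path are hypothetical, not those of the actual pipeline:

```python
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Shuffle rows and drop the index, so row order no longer
    reflects the temporal order of the underlying requests."""
    return df.sample(frac=1.0).reset_index(drop=True)

# Hypothetical miniature of one daily banner-impressions table.
impressions = pd.DataFrame({
    "date": ["2021-09-15"] * 3,          # placeholder date
    "access_method": ["mobile web", "desktop", "mobile app"],
    "logged_in": [0, 1, 0],
    "banners_seen": [1, 3, 10],          # capped at 10
    "landing_seen": [0, 0, 1],
})

# One file per table per day; the stored tables can later be read
# back with Spark, e.g. spark.read.parquet("impressions_*.parquet").
anonymize(impressions).to_parquet("impressions_2021-09-15.parquet")
```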
More detailed caveats:
- In total, we recorded more than 4 billion banner impressions over the course of the data collection. This is an incomplete number, because we don't have data for the first part of September. Of the countries analyzed, 11 had their banners live in September, 11 in October, and 1 overlapped both months. Extrapolating from the recorded data, we estimate the true total at roughly 10 billion banner impressions.
- When we compare the impression counts day by day with data from Turnilo, we observe that Turnilo shows only ~55-60% of the numbers we find.
- There is reported data loss for beacons going through some American servers, which may have resulted in a separate underestimate of 15-21%. At this point there is no reason to assume this loss is non-random.
- We record a surprisingly high percentage of logged-out readers. Our data pipeline filtered out, at an early stage, almost all readers who were exclusively logged in during their session, which results in an undercount of logged-in readers. While this matters little for the banner impression totals, it is unclear whether it also had downstream effects (e.g. readers not being aggregated if they never saw a banner while logged out), which may result in more significant undercounts for e.g. uploads and landing page views.
- We defined a reader as a daily hash of the combination of IP address and user agent within a UTC day (a minimal sketch of such a hash follows after this list). We can analyze different actions (e.g. seeing a banner and then arriving at a landing page) only within the context of such a reader. A reader gets 'split' when the person switches device or browser, updates their browser, or passes midnight UTC (which is a different time of day depending on the location).
- We can only measure landing page arrivals for countries where the landing page was located on a Wikimedia project.
- Actual account creations have been found to be about 25% higher than the webrequest logs suggest. It is unclear how this ratio holds up over time.
- Actual uploads can be as low as 2% or as high as 166% of the number of visits to the upload page. This is because people may drop out after arriving at the upload page, or may upload multiple files.
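To make the reader definition above concrete, the sketch below shows one way such a daily hash could be computed. The hash function, salt handling, and all example values are assumptions for illustration, not the pipeline's actual implementation:

```python
import hashlib

def reader_id(ip: str, user_agent: str, utc_date: str, salt: str) -> str:
    """Derive a daily pseudonymous reader id from IP + user agent.

    The id changes whenever the IP, browser, or user-agent string
    changes, or when the UTC day rolls over, which is exactly what
    'splits' a reader across devices, updates, and midnight UTC."""
    key = f"{salt}|{utc_date}|{ip}|{user_agent}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

UA = "Mozilla/5.0 (X11; Linux x86_64)"  # example user agent

# Two requests on the same UTC day map to the same reader...
same_day = reader_id("203.0.113.7", UA, "2021-09-15", "daily-secret")
again = reader_id("203.0.113.7", UA, "2021-09-15", "daily-secret")
# ...but after midnight UTC the same person becomes a new reader.
next_day = reader_id("203.0.113.7", UA, "2021-09-16", "daily-secret")

assert same_day == again and same_day != next_day
```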
Some more detailed findings:
- For two countries, Germany and the United States, we also recorded whether landing page visitors also visited a page with monument lists of that country. Visiting such a list page is a sign that the reader is looking for information about what to photograph, or wants to find the monument ID needed to complete an upload. In both campaigns, the likelihood of visiting a list page was roughly three times higher for readers who had first visited a landing page (DE: 0.3% vs <0.1%; US: 0.11% vs 0.03%). A ratio of this kind is expected, and is likely an underestimate due to the data collection methods.