Research talk:Daily unique media creators

Latest comment: 9 years ago by Halfak (WMF) in topic Wednesday, September 10th

Work log edit

archive

Wednesday, September 10th edit

Today, I'm reverse engineering "media creators". To me, this means people who originally uploaded a file.

> select distinct log_action from logging where log_type = "upload";
+------------+
| log_action |
+------------+
| upload     |
| overwrite  |
+------------+
2 rows in set (16.71 sec)

Looks like we have two actions. I'm guessing that "upload" is the original and "overwrite" is a re-upload for the same filename. --Halfak (WMF) (talk) 16:01, 10 September 2014 (UTC)Reply


> SELECT 
    ->     log_namespace, 
    ->     log_title, 
    ->     SUM(log_action = "upload") AS upload,
    ->     SUM(log_action = "overwrite") AS overwrite
    -> FROM logging 
    -> INNER JOIN (
    ->     SELECT 
    ->         page_namespace AS log_namespace,
    ->         page_title AS log_title
    ->     FROM page
    ->     WHERE page_namespace = 6
    ->     ORDER BY RAND()
    ->     LIMIT 100
    -> ) AS sampled_media USING (log_namespace, log_title) 
    -> WHERE 
    ->     log_type = "upload"
    -> GROUP BY 1,2;
+---------------+-----------------------------------------------------------------+--------+-----------+
| log_namespace | log_title                                                       | upload | overwrite |
+---------------+-----------------------------------------------------------------+--------+-----------+
|             6 | "Jodi,_1999".jpg                                                |      1 |         0 |
|             6 | A1_in_London_1923.jpg                                           |      1 |         0 |
|             6 | AWMphotoA02730.jpg                                              |      1 |         0 |
|             6 | Annarth_poster.jpg                                              |      1 |         0 |
|             6 | Arcoftriiamph.jpg                                               |      1 |         0 |
|             6 | Artemis_Fowl_-_The_Atlantis_Complex.jpg                         |      1 |         1 |
|             6 | B_McCulloch_CSA_ACW.jpg                                         |      1 |         0 |
|             6 | Backyard_Football_'09_Coverart.png                              |      1 |         0 |
|             6 | Batman_Contagion_TPB_cover.jpg                                  |      1 |         4 |
|             6 | Bbbbb.jpg                                                       |      2 |         0 |
|             6 | Benningsen_Beavers.jpg                                          |      1 |         0 |
|             6 | Berliner_Zeitung_Logo.svg                                       |      1 |         0 |
|             6 | BestWhorehouse.jpg                                              |      1 |         0 |
|             6 | Bin_Quraya_Logo_(2013).svg                                      |      1 |         0 |
|             6 | Black_cobra.gif                                                 |      1 |         0 |
|             6 | Bleriot_monoplane.jpg                                           |      1 |         0 |
|             6 | BoyleFarmIsland.jpg                                             |      1 |         0 |
|             6 | Brigitte_Mira.jpg                                               |      1 |         0 |
|             6 | Cheap_Trick_It's_Only_Love_Single_1986.png                      |      1 |         0 |
|             6 | Chrisye_-_Lilin-Lilin_Kecil_(1977).ogg                          |      1 |         0 |
|             6 | CleanAirCampaignLogo.jpg                                        |      1 |         0 |
|             6 | Community_Emergency_Response_Team_(US)_Logo.jpg                 |      1 |         0 |
|             6 | CraigInteview09_01.jpg                                          |      1 |         1 |
|             6 | Cut_image_of_Rarden_Image-Warrior_Infantry_Fighting_Vehicle.jpg |      1 |         0 |
|             6 | Cyclonscreenshot001.jpg                                         |      1 |         0 |
|             6 | Department_of_Stomatology_at_Martin_University_Hospital.jpg     |      1 |         0 |
|             6 | Devon_Cornwall_Police_logo.jpg                                  |      1 |         0 |
|             6 | Dickey-Deliverance.jpg                                          |      1 |         0 |
|             6 | Distribution_wealth_population_europe_2000.gif                  |      1 |         0 |
|             6 | Efaktura_logo.png                                               |      1 |         0 |
|             6 | Elin_Lanto_Rockbjornen.jpg                                      |      1 |         0 |
|             6 | Eliseu_Visconti_-_Maternity_-_Google_Art_Project.jpg            |      1 |         0 |
|             6 | Elvis_Presley-The_Sun_Sessions_(album_cover).jpg                |      1 |         0 |
|             6 | Embroidered_foliage.jpg                                         |      1 |         0 |
|             6 | Empress_Catherine_I_-c.1724_-3.jpg                              |      1 |         0 |
|             6 | Factor_X.png                                                    |      1 |         1 |
|             6 | FortunateFall.jpg                                               |      1 |         0 |
|             6 | GCRiptide.PNG                                                   |      1 |         0 |
|             6 | GED_GameEditor.jpg                                              |      1 |         0 |
|             6 | Gatecrasher4.jpg                                                |      1 |         1 |
|             6 | Georgia-Tbilisi.png                                             |      1 |         0 |
|             6 | Geroge_Benson_Big_Boss_Band_cover.jpg                           |      1 |         0 |
|             6 | Girls_Action_Foundation_logo.png                                |      1 |         0 |
|             6 | Grease_Super_Macro.JPG                                          |      2 |         0 |
|             6 | Hapinas_logo.jpg                                                |      1 |         1 |
|             6 | Imperial_College_Healthcare_logo.jpg                            |      1 |         0 |
|             6 | Inkscape-Arc.png                                                |      1 |         0 |
|             6 | JohnSteinbeck_ThePearl_title.jpg                                |      1 |         0 |
|             6 | KDAF_AntennaTV.png                                              |      1 |         0 |
|             6 | KSNA_(FM)_logo.jpg                                              |      1 |         0 |
|             6 | Kenny_Rogers_-_Dont_Fall_single.jpg                             |      1 |         0 |
|             6 | Krtologosmall.jpg                                               |      1 |         5 |
|             6 | Large_Arch_by_Henry_Moore_(1971)_in_April_2014.jpg              |      1 |         0 |
|             6 | Lastpost.jpg                                                    |      1 |         0 |
|             6 | Laurent_Nivalle_portrait.jpg                                    |      1 |         0 |
|             6 | Len_Harvey_1.jpg                                                |      1 |         0 |
|             6 | MANXYZ_permease_Step_7.jpg                                      |      1 |         0 |
|             6 | MDB_drum_majors.jpg                                             |      1 |         0 |
|             6 | Maa_Music_logo.jpg                                              |      1 |         0 |
|             6 | Mahout2_crop.jpg                                                |      1 |         0 |
|             6 | Marvel-super-hero-squad-cover.jpg                               |      1 |         0 |
|             6 | Mosaic_Stadium_Logo.png                                         |      1 |         0 |
|             6 | My-Cousin-Vinny-Poster.jpg                                      |      1 |         0 |
|             6 | Nama_Mia.png                                                    |      1 |         0 |
|             6 | Nasserist_Socialist_Vanguard_Party_symbol.jpg                   |      1 |         0 |
|             6 | New_Borinque_Pro_Wrestling_Logo.jpg                             |      1 |         0 |
|             6 | OK-CLEET-logo.jpg                                               |      1 |         0 |
|             6 | Once-upon-a-time-in-the-north.jpg                               |      1 |         2 |
|             6 | One.Way.Ticket.to.Hell...And.Back_The.Darkness.back-front.jpg   |      1 |         0 |
|             6 | Owlz.PNG                                                        |      1 |         1 |
|             6 | Peyronie's-disease-illustration.jpg                             |      1 |         0 |
|             6 | Pockbot.gif                                                     |      1 |         0 |
|             6 | Prof_Pyg.jpg                                                    |      1 |         1 |
|             6 | Robertlang.JPG                                                  |      1 |         0 |
|             6 | Sailorneptune.jpg                                               |      7 |         0 |
|             6 | Scream-In_promo_card_(WXIX)_circa_1970.jpg                      |      1 |         0 |
|             6 | Software-Freedom-Conservancy-Logo.svg                           |      1 |         1 |
|             6 | Soroban_6_50px.png                                              |      1 |         0 |
|             6 | St._lourdes_church.jpg                                          |      1 |         0 |
|             6 | Suburban_Mayhem_poster.jpg                                      |      1 |         0 |
|             6 | Sugarloafoverlook.jpg                                           |      1 |         0 |
|             6 | Swarm_(comics).png                                              |      1 |         0 |
|             6 | Thanksgiving_(2004_film)_(movie_poster).jpg                     |      1 |         0 |
|             6 | The_Sign_next_to_the_Rochdal_Canal.JPG                          |      1 |         0 |
|             6 | Theodore-rex-DVD.jpg                                            |      1 |         1 |
|             6 | Timothyvanlaar.jpg                                              |      1 |         0 |
|             6 | Tru-blu-company-logo.png                                        |      1 |         0 |
|             6 | University_of_Denver_campus_pics_043.jpg                        |      1 |         0 |
|             6 | Usvls-seal.png                                                  |      1 |         0 |
|             6 | Virginia_392.png                                                |      1 |         0 |
|             6 | WDIC-AM_FM.png                                                  |      1 |         0 |
|             6 | WNTP_logo.gif                                                   |      1 |         0 |
|             6 | Ängby_IF.png                                                    |      1 |         0 |
|             6 | Το_τελευταίο_ψέμα,.jpg                                          |      1 |         0 |
+---------------+-----------------------------------------------------------------+--------+-----------+
94 rows in set (6.77 sec)
<pre>

It looks like this mostly works.  There are some files that have more than one "upload" event.  Let's see what's up with them. 

<pre>
> select log_id, log_timestamp, log_type, log_action, LEFT(log_comment, 50) from logging where log_namespace = 6 and log_title = "Sailorneptune.jpg";
+----------+----------------+----------+------------+----------------------------------------------------+
| log_id   | log_timestamp  | log_type | log_action | LEFT(log_comment, 50)                              |
+----------+----------------+----------+------------+----------------------------------------------------+
|  2699690 | 20060510035725 | upload   | upload     | Image of cartoon character "Sailor Neptune"<br>
Ap |
|  3218202 | 20060620174530 | upload   | upload     | http://moonprincess.com/galleries/details.php?imag |
|  3233294 | 20060621192957 | upload   | upload     | Reverted to earlier revision                       |
|  3233318 | 20060621193122 | upload   | upload     | Reverted to earlier revision                       |
|  3233329 | 20060621193153 | upload   | upload     | Reverted to earlier revision                       |
|  6879519 | 20070206084304 | upload   | upload     | Reverted to earlier revision                       |
|  8904429 | 20070522020803 | upload   | upload     |                                                    |
| 38925761 | 20111021004413 | delete   | delete     | Deleted old revision 20060620174530!Sailorneptune. |
| 38925762 | 20111021004416 | delete   | delete     | Deleted old revision 20070522020803!Sailorneptune. |
| 38925765 | 20111021004418 | delete   | delete     | Deleted old revision 20070206084304!Sailorneptune. |
| 38925768 | 20111021004420 | delete   | delete     | Deleted old revision 20060621193152!Sailorneptune. |
| 38925770 | 20111021004422 | delete   | delete     | Deleted old revision 20060621193122!Sailorneptune. |
| 38925772 | 20111021004424 | delete   | delete     | Deleted old revision 20060621192956!Sailorneptune. |
+----------+----------------+----------+------------+----------------------------------------------------+
13 rows in set (0.02 sec)

Well... that's weird. It looks like there was a set of uploads in 2006,2007, but then a set of deletes in one day in 2011. Those "reverted to earlier versions" have me worried that log_action = "upload" used to be used for overwrites. Let's look at another example.

> select log_id, log_timestamp, log_type, log_action, LEFT(log_comment, 50) from logging where log_namespace = 6 and log_title = "Grease_Super_Macro.JPG";
+---------+----------------+----------+------------+------------------------------------------------+
| log_id  | log_timestamp  | log_type | log_action | LEFT(log_comment, 50)                          |
+---------+----------------+----------+------------+------------------------------------------------+
| 9884320 | 20070718174156 | upload   | upload     | Picture of grease taken with Super Macro Mode. |
| 9884440 | 20070718175047 | upload   | upload     |                                                |
+---------+----------------+----------+------------+------------------------------------------------+
2 rows in set (0.06 sec)

Yeah. That one is weird too. It's also in 2007. Let's try the last query with the first event date included.

> SELECT 
    ->     log_namespace, 
    ->     log_title, 
    ->     SUM(log_action = "upload") AS uploads,
    ->     SUM(log_action = "overwrite") AS overwrites,
    ->     MIN(log_timestamp) AS first_upload
    -> FROM logging 
    -> INNER JOIN (
    ->     SELECT 
    ->         page_namespace AS log_namespace,
    ->         page_title AS log_title
    ->     FROM page
    ->     WHERE page_namespace = 6
    ->     ORDER BY RAND()
    ->     LIMIT 1000
    -> ) AS sampled_media USING (log_namespace, log_title) 
    -> WHERE 
    ->     log_type = "upload"
    -> GROUP BY 1,2
    -> HAVING SUM(log_action = "upload") > 1
    -> ;
+---------------+------------------------------------------------+---------+------------+----------------+
| log_namespace | log_title                                      | uploads | overwrites | first_upload   |
+---------------+------------------------------------------------+---------+------------+----------------+
|             6 | Bags.jpg                                       |       2 |          0 | 20061018113801 |
|             6 | CAC_logo.jpg                                   |       4 |          0 | 20070413081748 |
|             6 | CHR_WA_A01.jpg                                 |       3 |          3 | 20090605112708 |
|             6 | Campbellwalker.jpg                             |       3 |          0 | 20060731191012 |
|             6 | CannibalHolocaustimpale.jpg                    |       5 |          2 | 20070112005709 |
|             6 | Charlielost.PNG                                |       3 |          3 | 20070219042047 |
|             6 | Dr_pepper_ballpark_right_field_toward_home.jpg |       2 |          0 | 20060805095416 |
|             6 | Ed_Chynoweth_Cup.JPG                           |       3 |          0 | 20070530002001 |
|             6 | EveOfTheWarEP.jpg                              |       3 |          0 | 20070521012020 |
|             6 | GREATWESTstates.PNG                            |       2 |          3 | 20061129185017 |
|             6 | Gupt.jpg                                       |       2 |          0 | 20060226025609 |
|             6 | Head_over_heels_amstrad_1.png                  |       2 |          1 | 20070316185350 |
|             6 | IndianEducationSystem.JPG                      |       2 |          0 | 20070101091319 |
|             6 | Jonathan_Rhys_Meyers.jpg                       |       2 |          1 | 20071002191953 |
|             6 | KazanRailTerminalWest.jpg                      |       2 |          0 | 20060614100347 |
|             6 | Mardi_Gras_logo_box_white.png                  |       2 |          0 | 20070403040509 |
|             6 | Monty.GIF                                      |       6 |          0 | 20050723222006 |
|             6 | News.jpg                                       |       7 |          0 | 20070417204330 |
|             6 | NoSirNihilismIsNotPractical.jpg                |       2 |          0 | 20060126182642 |
|             6 | North_Street_1957-1.jpg                        |       2 |          0 | 20090513061928 |
|             6 | Rabbanim_Supporting_the_Manhattan.jpg          |       3 |          0 | 20070712212608 |
|             6 | Rasathanthram.jpg                              |       2 |          0 | 20070514103051 |
|             6 | Right_Place,_Wrong_Time_DVD.jpg                |       2 |          1 | 20080808025356 |
|             6 | RiserDiagram.png                               |       2 |          0 | 20070605215455 |
|             6 | SHU_main_gate.jpg                              |       2 |          0 | 20080624101518 |
|             6 | Sequence.jpg                                   |       2 |          3 | 20080312002548 |
|             6 | Sourceradio.jpg                                |       2 |          0 | 20051208192739 |
|             6 | Spitfire2xs.jpg                                |       2 |          1 | 20060506014856 |
|             6 | Star_Ocean_SFC.jpg                             |       2 |          0 | 20060707224941 |
|             6 | Teaser2.jpg                                    |       2 |          0 | 20060917172304 |
|             6 | The_Changeling.jpg                             |       2 |          1 | 20060913164513 |
|             6 | Tom_and_jerry_mgm_parody.jpg                   |       2 |          0 | 20060614232540 |
|             6 | Tyranos.jpg                                    |       4 |          0 | 20060603033435 |
|             6 | What-Has-Befallen-Us,-Barbad.jpg               |       2 |          0 | 20070629101709 |
|             6 | Working_memory_model.PNG                       |       2 |          0 | 20070312124212 |
+---------------+------------------------------------------------+---------+------------+----------------+
35 rows in set (47.13 sec)

Look at all those pre-2008 entries. I wonder if there's a clear time cutoff when the log starts to be sane.

> SELECT
    ->     LEFT(first_upload, 6) AS month,
    ->     SUM(uploads > 1)/COUNT(*) AS multi_upload_prop
    -> FROM (
    ->     SELECT
    ->         log_namespace,
    ->         log_title,
    ->         COUNT(*) AS uploads,
    ->         MIN(log_timestamp) AS first_upload
    ->     FROM logging
    ->     WHERE
    ->         log_namespace = 6 AND
    ->         log_action = "upload" AND
    ->         log_type = "upload"
    ->     GROUP BY 1,2
    -> ) AS file_stats
    -> GROUP BY 1;
+--------+-------------------+
| month  | multi_upload_prop |
+--------+-------------------+
| 200412 |            0.1723 |
| 200501 |            0.1655 |

<...snip...>

| 200704 |            0.1072 |
| 200705 |            0.1114 |
| 200706 |            0.1405 |
| 200707 |            0.1360 |
| 200708 |            0.1174 | <--,
| 200709 |            0.0379 | <--'
| 200710 |            0.0350 |
| 200711 |            0.0281 |
| 200712 |            0.0252 |
| 200801 |            0.0257 |

<...snip...>

| 201406 |            0.0105 |
| 201407 |            0.0139 |
| 201408 |            0.0047 |
| 201409 |            0.0069 |
+--------+-------------------+
118 rows in set (2 min 59.80 sec)

Looks like somewhere in Sept. 2007, the upload log gets sane.

> SELECT
    ->     LEFT(log_timestamp, 6) AS month,
    ->     SUM(log_action = "upload") AS overwrites,
    ->     SUM(log_action = "overwrite") AS overwrites
    -> FROM logging
    -> WHERE
    ->     log_namespace = 6 AND
    ->     log_type = "upload"
    -> GROUP BY 1
    -> ;
+--------+------------+------------+
| month  | overwrites | overwrites |
+--------+------------+------------+
| 200412 |       3706 |          0 |
| 200501 |      13547 |          0 |
| 200502 |      13089 |          0 |

<...snip...>

| 200705 |      78580 |          0 |
| 200706 |      76389 |          0 |
| 200707 |      78138 |          0 |
| 200708 |      71218 |       2914 |
| 200709 |      54373 |      11625 |
| 200710 |      55511 |      13791 |

<...snip...>

| 201407 |       8005 |       2205 |
| 201408 |       9073 |       2357 |
| 201409 |       2677 |        738 |
+--------+------------+------------+
118 rows in set (1 min 15.14 sec)

So... is there something else I can look for in these events to know which ones are the creation?

> select * from logging where log_namespace = 6 and log_title = "GREATWESTstates.PNG" and log_action = "upload";
+---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+
| log_id  | log_type | log_action | log_timestamp  | log_user | log_namespace | log_title           | log_comment                                                                    | log_params | log_deleted | log_user_text | log_page |
+---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+
| 5711684 | upload   | upload     | 20061129185017 |   455767 |             6 | GREATWESTstates.PNG | Derived from [http://en.wikipedia.org/wiki/Image:BlankMap-USA-states.PNG here] |            |           0 |               |     NULL |
| 7814423 | upload   | upload     | 20070325024149 |  1279118 |             6 | GREATWESTstates.PNG | official conference colors                                                     |            |           0 |               |     NULL |
+---------+----------+------------+----------------+----------+---------------+---------------------+--------------------------------------------------------------------------------+------------+-------------+---------------+----------+
2 rows in set (0.00 sec)

Nope. Hmm... I wonder if I could try a different strategy. I'll look for the first upload with an exclusive outer join.


Got it!


SELECT
    COUNT(DISTINCT upload.log_user)
FROM (
    SELECT *
    FROM logging
    WHERE
        log_type = "upload" AND
        log_action = "upload" AND
        log_timestamp BETWEEN "20140101" AND "20140102"
) AS upload
LEFT JOIN logging old_upload ON
    old_upload.log_type = "upload" AND
    old_upload.log_action = "upload" AND
    old_upload.log_timestamp < "20140101" AND
    upload.log_namespace = old_upload.log_namespace AND
    upload.log_title = old_upload.log_title
WHERE old_upload.log_id IS NULL;

It's pretty quick for a day's worth of uploads too. Woot! --Halfak (WMF) (talk) 20:44, 10 September 2014 (UTC)Reply

Discussion edit

Return to "Daily unique media creators" page.