Research talk:Reading time/Work log/2018-11-01

Thursday, November 1, 2018 edit

Joining readingdepth data with mediawiki history edit

I had some trouble merging reading depth with the history of revisions. The initial problem is that the reading depth schema does not have revision_id. It only has page_title. This meant that to find the version of the page that a reader viewed we have to join on page_title and then filter to find the nearest previous revision. My initial attempt was a hive query that used a window function and a join on page. This approach proved intractable. I got some help from jaol who started an approach using a window function with a range join in SparkSQL. His query wasn't perfect and I re-wrote it pyspark and tweaked it until it was correct.

My version of the query edit

dt = r.table("nathante.readingDataModel_Stage1")
      .select(['*', f.expr("translate(event.pagetitle,' ','_') AS page_title")]) 
      .filter(f.col("year") > 0)

dt = df.filter(dt.event['namespaceid'] == 0)

hist = r.table("wmf.mediawiki_history")
        .select([f.expr('wiki_db AS hist_wiki'),
                 f.expr('page_title_historical AS hist_page_title'),
                 f.expr("unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts"),
                 'revision_text_bytes',
                 'revision_text_bytes_diff',
                 'page_id'])
        .filter((f.col("snapshot") == "2018-09") & 
                (f.col("event_entity")=="revision") & 
                (f.col("event_type")=="create") & 
                (f.col("page_id") > 0) & 
                (f.length(f.col("page_title_historical")) > 0))

title_win = Window.partitionBy(["hist_page_title",'hist_wiki']).orderBy("hist_current_ts")

# 1520881200 is the beginning of the reading depth data
# 4102444800 is the year 2100
hist = hist.withColumn("hist_next_ts",f.lead(f.col("hist_current_ts"),1,4102444800).over(title_win))
           .filter(f.col("hist_next_ts") >  1520881200)

join_cond = [dt.wiki == hist.hist_wiki, 
             dt.page_title == hist.hist_page_title, 
             dt.unix_timestamp_2 >= hist.hist_current_ts, 
             dt.unix_timestamp_2 < hist.hist_next_ts]

dt2 = dt.join(hist,on=join_cond,how='left_outer')

dt2.write.partitionBy(['year','month']).mode("overwrite").saveAsTable("nathante.tableReadingDataModel_Stage2")

Views where we can't find a page edit

There are still about 0.5% of reading events where we can't find a corresponding page in wmf.history. From spot checks these appear to be red links. However, I'm not totally sure that some are not red links. If we have page_id or if we add a bit indicating if the view is on a red link to the reading depth schema then we could verify this more easily. I created a [https://phabricator.wikimedia.org/T208478 bug for this.

Return to "Reading time/Work log/2018-11-01" page.