Research talk:Reading time/Work log/2018-11-01
Thursday, November 1, 2018
editJoining readingdepth data with mediawiki history
editI had some trouble merging reading depth with the history of revisions. The initial problem is that the reading depth schema does not have revision_id. It only has page_title. This meant that to find the version of the page that a reader viewed we have to join on page_title and then filter to find the nearest previous revision. My initial attempt was a hive query that used a window function and a join on page. This approach proved intractable. I got some help from jaol who started an approach using a window function with a range join in SparkSQL. His query wasn't perfect and I re-wrote it pyspark and tweaked it until it was correct.
My version of the query
editdt = r.table("nathante.readingDataModel_Stage1")
.select(['*', f.expr("translate(event.pagetitle,' ','_') AS page_title")])
.filter(f.col("year") > 0)
dt = df.filter(dt.event['namespaceid'] == 0)
hist = r.table("wmf.mediawiki_history")
.select([f.expr('wiki_db AS hist_wiki'),
f.expr('page_title_historical AS hist_page_title'),
f.expr("unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts"),
'revision_text_bytes',
'revision_text_bytes_diff',
'page_id'])
.filter((f.col("snapshot") == "2018-09") &
(f.col("event_entity")=="revision") &
(f.col("event_type")=="create") &
(f.col("page_id") > 0) &
(f.length(f.col("page_title_historical")) > 0))
title_win = Window.partitionBy(["hist_page_title",'hist_wiki']).orderBy("hist_current_ts")
# 1520881200 is the beginning of the reading depth data
# 4102444800 is the year 2100
hist = hist.withColumn("hist_next_ts",f.lead(f.col("hist_current_ts"),1,4102444800).over(title_win))
.filter(f.col("hist_next_ts") > 1520881200)
join_cond = [dt.wiki == hist.hist_wiki,
dt.page_title == hist.hist_page_title,
dt.unix_timestamp_2 >= hist.hist_current_ts,
dt.unix_timestamp_2 < hist.hist_next_ts]
dt2 = dt.join(hist,on=join_cond,how='left_outer')
dt2.write.partitionBy(['year','month']).mode("overwrite").saveAsTable("nathante.tableReadingDataModel_Stage2")
Views where we can't find a page
editThere are still about 0.5% of reading events where we can't find a corresponding page in wmf.history
. From spot checks these appear to be red links. However, I'm not totally sure that some are not red links. If we have page_id
or if we add a bit indicating if the view is on a red link to the reading depth schema then we could verify this more easily. I created a [https://phabricator.wikimedia.org/T208478 bug for this.