Research talk:Reading time/Work log/2018-10-28
Sunday, October 28, 2018
Checking off some minor tasks
- Regenerated cleanReadingDepth tables with updated data and filters
- Updated Schema_talk:ReadingDepth#Events_where_unloaded_event_is_logged_before_the_loaded_event with a plot showing periodic spikes.
- Gave up on using Spark for very complex queries. It's just too brittle. I'm rewriting the parts that don't work in Hive.
Troubleshooting Slow Hive Query
I built nathante.readingDataModel_Stage1 using Spark. This was a pretty simple process without any joins.
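(The Stage1 build itself isn't shown in this log. Purely as a hypothetical sketch of what a no-join build could look like: the source table, timestamp parsing, and column list below are all guesses inferred from the columns the Stage2 queries reference, not the actual code.)

-- Hypothetical sketch only; not the real Stage1 build.
-- Source table, dt format, and column list are assumptions.
CREATE TABLE nathante.readingDataModel_Stage1 STORED AS PARQUET AS
SELECT event,                  -- ReadingDepth event struct (event.pagetitle, etc.)
       event.pagetoken AS pagetoken,
       wiki,
       unix_timestamp(dt, "yyyy-MM-dd'T'HH:mm:ss'Z'") AS unix_timestamp_2,  -- assumed ISO 8601 dt
       year
FROM event.readingdepth
WHERE year > 0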
I wrote the following complex query based on Spark code that was running pretty fast, but was also failing when I ran it on the full dataset (I think due to memory issues). In Hive, however, it seems pretty slow! Reducers are running for > 1.5 hours.
@Tbayer (WMF): Any ideas?
I know the LEFT OUTER JOIN isn't necessary (INNER would be fine), but I want it for now so I can inspect any reading events that don't match a revision.
CREATE TABLE nathante.readingDataModel_Stage2 AS
SELECT * FROM (
    SELECT *,
           -- For each reading event, find the smallest gap to any revision
           MIN(timedelta) OVER (PARTITION BY pagetoken) AS min_timedelta
    FROM (
        SELECT *,
               -- Time between the reading event and the revision-create event
               dt.unix_timestamp_2 - unix_timestamp(hist.event_timestamp, "yyyy-MM-dd HH:mm:ss.S") AS timedelta
        FROM (
            SELECT *, translate(event.pagetitle, " ", "_") AS page_title
            FROM nathante.readingDataModel_Stage1 dt
            WHERE year > 0
        ) AS dt
        LEFT OUTER JOIN (
            SELECT wiki_db, event_entity, event_timestamp,
                   page_title AS hist_page_title, revision_text_bytes
            FROM wmf.mediawiki_history
            WHERE snapshot == "2018-09"
              AND event_entity == "revision"
              AND event_type == "create"
        ) AS hist
        ON dt.page_title == hist.hist_page_title
           AND dt.wiki == hist.wiki_db
    ) AS merged
) merged_2
-- Keep only the revision closest in time to each reading event
WHERE timedelta == min_timedelta
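(As a generic first pass on a slow Hive query, not advice specific to this one: prefixing the statement with EXPLAIN shows how the join and the window function are split into stages, and the standard reducer settings below, shown with illustrative values, control how much parallelism the long-running reduce phase gets.)

-- Standard Hive settings; the values here are illustrative, not tuned.
-- Lower the bytes-per-reducer target so Hive allocates more reducers:
SET hive.exec.reducers.bytes.per.reducer=67108864;
-- Or pin the reducer count for the session directly:
SET mapreduce.job.reduces=200;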
- Not sure how much it will improve performance, but one probably needs to filter by event entity and event type, cf. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Important_Fields . Regards, Tbayer (WMF) (talk) 12:43, 29 October 2018 (UTC)
User:Joalpe came up with this Spark query:
CREATE TABLE nathante.readingDataModel_Stage2 STORED AS PARQUET AS
WITH rdm1 AS (
    SELECT *, translate(event.pagetitle, ' ', '_') AS page_title
    FROM nathante.readingDataModel_Stage1
    WHERE year > 0
),
hist AS (
    SELECT wiki_db AS hist_wiki,
           page_title_historical AS hist_page_title,
           unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts,
           -- Get the next event timestamp (or a fake big one if none)
           LEAD(event_timestamp, 1, '2100-01-01 00:00:00')
               OVER (PARTITION BY page_id ORDER BY event_timestamp) AS hist_next_ts_str,
           revision_text_bytes,
           revision_text_bytes_diff
    FROM wmf.mediawiki_history
    WHERE snapshot == "2018-09"
      -- Consider revision events
      AND event_entity == "revision"
      AND event_type == "create"
      -- Don't consider pages having no id nor page_title_historical
      AND page_id > 0
      AND LENGTH(page_title_historical) > 0
)
SELECT *
FROM rdm1
LEFT OUTER JOIN (
    -- Only consider events whose next event is after the experiment min ts
    SELECT *,
           unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') AS hist_next_ts
    FROM hist
    -- Consider only rows whose next timestamp is after the beginning of the experiment
    WHERE unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') > 1520881200
) h
ON rdm1.wiki = h.hist_wiki
   AND rdm1.page_title = h.hist_page_title
   AND rdm1.unix_timestamp_2 >= h.hist_current_ts
   AND rdm1.unix_timestamp_2 < h.hist_next_ts
Spark supports range joins (the >= / < conditions in the ON clause above), which make it possible to create a 1:1 mapping between reading events and revisions.
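(To make the pattern explicit, here is a stripped-down sketch of the same range-join idea with hypothetical table and column names: each revision carries a validity interval, and a reading event matches exactly the one revision whose interval contains its timestamp.)

-- Minimal range-join sketch (hypothetical tables and columns, not the
-- actual schemas). Each revision is valid over [valid_from, valid_to),
-- so each reading event joins exactly one revision row.
SELECT r.event_id,
       v.revision_id
FROM reading_events r
JOIN revision_intervals v
  ON r.page_id = v.page_id
 AND r.event_ts >= v.valid_from
 AND r.event_ts <  v.valid_to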