Research talk:Reading time/Work log/2018-10-28

Active discussions

Sunday, October 28, 2018Edit

Checking off some minor tasksEdit

Modeling stuffEdit

  • Gave up on using spark for doing very complex queries. It's just too brittle. I'm rewriting the parts that don't work in Hive.


Troubleshooting Slow Hive QueryEdit

I made nathante.readingDataModel_Stage1 using spark. This was a pretty simple process without any joins.

I wrote the following complex query based on spark code that was running pretty fast, but was also failing when I ran it on the full dataset (I think due memory issues). In Hive however, it seems pretty slow! Reducers are running for > 1.5 hours.

@Tbayer (WMF) Any ideas?

I know the LEFT OUTER JOIN isn't necessary (INNER would be fine), but I want it for now so I can inspect any nulls.

CREATE TABLE nathante.readingDataModel_Stage2 AS 
SELECT * FROM (
SELECT *, MIN(timedelta) OVER (PARTITION BY pagetoken) AS min_timedelta FROM (
SELECT *, dt.unix_timestamp_2 - unix_timestamp(hist.event_timestamp, "yyyy-MM-dd HH:mm:ss.S") AS timedelta
       FROM
       (SELECT *, translate(event.pagetitle," ","_") AS page_title FROM nathante.readingDataModel_Stage1 dt WHERE year > 0) AS dt
       LEFT OUTER JOIN
       (SELECT wiki_db, event_entity, event_timestamp, page_title AS hist_page_title, revision_text_bytes FROM wmf.mediawiki_history WHERE snapshot == "2018-09" AND event_entity == "revision" AND event_type=="create") AS  hist
       ON dt.page_title == hist.hist_page_title AND dt.wiki==hist.wiki_db) AS merged
        ) merged_2
       WHERE timedelta == min_timedelta

—The preceding unsigned comment was added by groceryheist (talk) 06:38, 29 October 2018

Not sure how much it will improve performance, but one probably needs to filter by event entity and event type, cf. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history#Important_Fields . Regards, Tbayer (WMF) (talk) 12:43, 29 October 2018 (UTC)
Modified the query to do the filtering. I'll try again in nice mode. Groceryheist (talk) 16:24, 29 October 2018 (UTC)

User:Joalpe came up with this Spark query:

CREATE TABLE nathante.readingDataModel_Stage2 STORED AS PARQUET AS
WITH

rdm1 AS (
    SELECT
        *,
        translate(event.pagetitle,' ','_') AS page_title
    FROM nathante.readingDataModel_Stage1
    WHERE year > 0
),

hist AS (
    SELECT
        wiki_db AS hist_wiki,
        page_title_historical AS hist_page_title,
        unix_timestamp(event_timestamp, 'yyyy-MM-dd HH:mm:ss.S') AS hist_current_ts,
        -- Get the next event timestamp (or a fake big one if none)
        LEAD(event_timestamp, 1, '2100-01-01 00:00:00')
          OVER (PARTITION BY page_id ORDER BY event_timestamp) AS hist_next_ts_str,
        revision_text_bytes,
        revision_text_bytes_diff
    FROM wmf.mediawiki_history
    WHERE snapshot == "2018-09"
        -- Consider revision events
        AND event_entity == "revision"
        AND event_type=="create"
        -- Don't consider page having no id nor page_title_historical
        AND page_id > 0
        AND LENGTH(page_title_historical) > 0
)

SELECT
  *
FROM rdm1
LEFT OUTER JOIN (
    -- Only consider events whose next event is after the experiment min ts
    SELECT
        *,
        unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') AS hist_next_ts

    FROM hist
    -- Consider only rows whose next timestamp is after the beginning of the experiment
    WHERE unix_timestamp(hist_next_ts_str, 'yyyy-MM-dd HH:mm:ss.S') > 1520881200
) h
    ON rdm1.wiki = h.hist_wiki
      AND rdm1.page_title = h.hist_page_title
      AND rdm1.unix_timestamp_2 >= h.hist_current_ts
      AND rdm1.unix_timestamp_2 < h.hist_next_ts


Spark supports range joins which make it possible to create a 1:1 mapping between reading events and revisions.

Return to "Reading time/Work log/2018-10-28" page.