Research:Teahouse group dynamics/Dataset
Schema for Teahouse replies dataset
Replies by Wikipedians to questions at the Teahouse. One reply per row.
- id: sequential record ID. Records are sorted by author and then (usually) chronologically by timestamp.
- threadid: sequential thread ID. Threads are counted from the first thread in the first Teahouse archive page, so threadid 1 corresponds to the thread "en:Wikipedia:Teahouse/Questions/Archive_1#References/Sources_for_Television_Shows_and_Books".
- archive: sequential archive ID, so archive 1 corresponds to Wikipedia:Teahouse/Questions/Archive_1.
- title: canonical title of the thread, as listed in the archive. May differ from original thread title, if that title was edited before the thread was archived.
- replyno: position of the reply within the thread.
- isreply: whether the post is a reply or not (non-replies are often the question that initiated the thread).
- timestamp: timestamp in UTC.
- welcome: whether the reply starts with a welcome (1) or not (0).
- selfreply: whether the reply is to a thread started by the same author.
- isfirstreply: whether the post is the first reply in the thread.
- policies: list of the policy pages linked in the post, as tuples of the form (PAGENAME_IN_CAPS, page_type).
- policycount: number of policies linked in the post.
- postlength: number of characters in the post.
- welcome_prev_page: proportion of first replies in threads on the previous archive page that started with a welcome.
- welcome_2prev_page: proportion of first replies in threads on the archive page preceding the previous archive page that started with a welcome.
- policyprop_prev_page: proportion of policy links to total words in first replies on the previous archive page.
- policyprop_2prev_page: proportion of policy links to total words in first replies on the archive page preceding the previous archive page.
- welcomenotfirst_prev_page: proportion of subsequent replies in threads on the previous archive page that started with a welcome.
- welcomenotfirst_2prev_page: proportion of subsequent replies in threads on the archive page preceding the previous archive page that started with a welcome.
- duplicated: whether this post is a duplicate of another post in the archive. Accidental duplication of threads is rare in the dataset.
- uname_parsed: username parsed from the archived thread. May differ from current user_name. Should match author.
- user_name: current username of the post author. If it differs from uname_parsed, the author changed their username between the time the thread was archived and the time the dataset was generated.
- user_id: user ID of the author.
- rev_id: if the author is a host, the revision ID of their host profile creation.
- rev_timestamp: if the author is a host, the UTC timestamp of their host profile creation.
- is_host: whether the author is a host ("Host") or not ("Non Host").
- age: number of days between the date of the author's first Teahouse edit and the date of the current post.
- poster_age:
- helpdesk_count:
- helpdesk_start:
- is_helpdesk:
- teahouse_count:
- wikipedia_count: