Research:MediaWiki events: a generalized public event datasource


This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Conceptual diagram. An event processing system for MediaWiki.

Wiki-tool builders and researchers rely on various sources of information about what has happened and what is currently happening in Wikipedia. These data sources tend to be structured differently from one another and contain incomplete or poorly structured information. Some datasources are queryable but make it complicated to "listen" to ongoing events, while others are intended only for "listening" to current events. In this project, we'll describe a common structure for public events in MediaWiki that mimics recentchanges but also contains historical information. We'll also explore means of implementing this functionality on top of existing datasources and propose infrastructure changes that would improve the efficiency and completeness of the data.

Events

Available datasources

  • API
    • list=recentchanges -- Returns a joined set of revision and logging events and does some parsing of event metadata.
  • MySQL db
    • recentchanges -- Sequences both revision and logging events.
    • revision -- Revision and page creation events.
    • logging -- All events other than revisions and page creations.
  • RCStream -- see https://wikitech.wikimedia.org/wiki/RCStream
  • IRC Stream -- see Research:Data#IRC_Feeds
  • EventLogging -- see mw:Extension:EventLogging
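
As a rough illustration of the first datasource, a list=recentchanges request could be assembled as follows. The function name `recentchanges_url` and the choice of English Wikipedia as the endpoint are assumptions made for this sketch; the `rc*` parameters are the ones api.php documents for this list module. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Example endpoint (assumption); any wiki's api.php should work the same way.
API_URL = "https://en.wikipedia.org/w/api.php"


def recentchanges_url(start, limit=50):
    """Build a list=recentchanges URL that reads forward in time from `start`."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": start,   # MediaWiki timestamp, e.g. "20140729000000"
        "rcdir": "newer",   # oldest first, so events arrive in order
        "rcprop": "ids|timestamp|user|userid|comment|sizes|sha1|flags|loginfo|title",
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


url = recentchanges_url("20140729000000")
print(url)
```

Reading with rcdir=newer keeps the result in chronological order, which is what a "listening" client would want when resuming from a known timestamp.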

Relevant events

  • RevisionSaved
fields
  • timestamp -- revision.rev_timestamp
  • user
    • id -- revision.rev_user
    • text -- revision.rev_user_text
  • comment -- revision.rev_comment
  • revision
    • rev_id -- revision.rev_id
    • parent_id -- revision.rev_parent_id
    • bytes -- revision.rev_len
    • sha1 -- revision.rev_sha1
    • page_id -- revision.rev_page
    • minor -- revision.rev_minor_edit
    • text -- ...
  • RevisionsDeleted
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • revision
    • rev_ids -- parse logging.log_params
  • PageCreated
fields
  • timestamp -- revision.rev_timestamp
  • user
    • id -- revision.rev_user
    • text -- revision.rev_user_text
  • comment -- revision.rev_comment
  • page
    • id -- page.page_id
    • namespace -- page.page_namespace
    • title -- page.page_title
  • PageMoved
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • action -- logging.log_action ("move", "move_redir")
  • old
    • id -- logging.log_page (currently set to the wrong page_id, see bug 57084)
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • new
    • id -- logging.log_page (currently set to the wrong page_id, see bug 57084)
    • namespace -- parse logging.log_params
    • title -- parse logging.log_params
  • PageDeleted
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • page
    • id -- logging.log_page (currently always set to zero. see bug 26122)
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • PageRestored
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • old_page_id -- ???
  • page
    • id -- logging.log_page
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • PageProtectionModified
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • page
    • id -- logging.log_page
    • namespace -- logging.log_namespace
    • title -- logging.log_title
  • action -- logging.log_action ("protect", "modify", "unprotect")
  • protection
    • action -- parse logging.log_params
    • group -- parse logging.log_params
    • expiration -- parse logging.log_params
  • UserRegistered
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • action -- logging.log_action ("newusers", "create", "create2", "byemail", "autocreate")
  • newuser
    • id -- parse logging.log_params
    • text -- parse logging.log_title
  • UserRenamed
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • old
    • id -- not available in log
    • text -- parse logging.log_params
  • new
    • id -- not available in log
    • text -- parse logging.log_params
  • UserRightsModified
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • modified
    • id -- not available in log
    • text -- logging.log_title
  • old -- parse logging.log_params
  • new -- parse logging.log_params
  • UserBlocked
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • block
    • flags -- parse logging.log_params
    • duration -- parse logging.log_params
    • expiration -- parse logging.log_params and infer from current timestamp (how does the API do it?)
  • UserUnblocked
fields
  • timestamp -- logging.log_timestamp
  • user
    • id -- logging.log_user
    • text -- logging.log_user_text
  • comment -- logging.log_comment
  • unblocked
    • id -- not available in log
    • name -- parse logging.log_title
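
To make the field mappings above more concrete, here is a minimal sketch of how a RevisionSaved event might be built from a row of the revision table. The dataclass layout, the `from_row` helper and the sample row are all assumptions for illustration, not a settled design. The revision `text` field is left out because its source is not specified above.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class User:
    id: int
    text: str


@dataclass
class Revision:
    rev_id: int
    parent_id: int
    bytes: int
    sha1: str
    page_id: int
    minor: bool


@dataclass
class RevisionSaved:
    timestamp: str
    user: User
    comment: str
    revision: Revision

    @classmethod
    def from_row(cls, row):
        """Build the event from a dict keyed by revision-table column names."""
        return cls(
            timestamp=row["rev_timestamp"],
            user=User(id=row["rev_user"], text=row["rev_user_text"]),
            comment=row["rev_comment"],
            revision=Revision(
                rev_id=row["rev_id"],
                parent_id=row["rev_parent_id"],
                bytes=row["rev_len"],
                sha1=row["rev_sha1"],
                page_id=row["rev_page"],
                # The schema stores the minor flag as 0/1 in rev_minor_edit.
                minor=bool(row["rev_minor_edit"]),
            ),
        )


# Made-up row, for illustration only.
row = {
    "rev_timestamp": "20140729000000", "rev_user": 42,
    "rev_user_text": "ExampleUser", "rev_comment": "fix typo",
    "rev_id": 12345, "rev_parent_id": 12344, "rev_len": 2048,
    "rev_sha1": "examplesha1", "rev_page": 678, "rev_minor_edit": 0,
}
event = RevisionSaved.from_row(row)
print(json.dumps(asdict(event), indent=2))
```

The log-based events would need an extra parsing step for logging.log_params, which this sketch does not attempt.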

Desired functionality

Listening

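import mw_events  # proposed module
from mw_events import RevisionSaved, RevisionsDeleted  # assumed location of the event classes

# Iterate over events from `start` onward.
for event in mw_events.listen(start="20140729000000"):
    # do thing with event
    if isinstance(event, RevisionSaved):
        revision_saved = event
        # do thing with revision_saved
    elif isinstance(event, RevisionsDeleted):
        revisions_deleted = event
        # do thing with revisions_deleted
    else:
        pass  # ignore other event types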
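
One way such a listener could be layered on a queryable datasource is by polling. The sketch below is only an assumption about how that might work: `fetch` is a hypothetical callable that returns events (dicts with "timestamp" and "id" keys) at or after a given timestamp, oldest first, and `max_polls` exists only so the example terminates.

```python
import time


def listen(fetch, start, poll_interval=1.0, max_polls=None):
    """Yield events in order by repeatedly polling `fetch(after)`.

    `fetch` (hypothetical) must return a list of dicts with "timestamp"
    and "id" keys, sorted oldest first, all with timestamp >= after.
    """
    last = start            # newest timestamp yielded so far
    seen_at_last = set()    # ids already yielded at exactly that timestamp
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        fresh = False
        for event in fetch(last):
            ts, event_id = event["timestamp"], event["id"]
            if ts == last and event_id in seen_at_last:
                continue    # already delivered on an earlier poll
            if ts > last:
                last = ts
                seen_at_last = set()
            seen_at_last.add(event_id)
            fresh = True
            yield event
        if not fresh:
            time.sleep(poll_interval)


# Made-up in-memory history standing in for a real datasource.
history = [
    {"id": 1, "timestamp": "20140729000001"},
    {"id": 2, "timestamp": "20140729000002"},
    {"id": 3, "timestamp": "20140729000002"},
]


def fake_fetch(after):
    return [e for e in history if e["timestamp"] >= after]


received = list(listen(fake_fetch, "20140729000000", poll_interval=0, max_polls=2))
print([e["id"] for e in received])
```

Tracking the ids seen at the newest timestamp avoids both dropping and repeating events that share a second, since 14-digit MediaWiki timestamps only have one-second resolution.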

Querying

events = mw_events.query(start="20140729000000", end="20140731000000", types={RevisionSaved})
for revision_saved in events:
    pass  # do thing with revision_saved
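
The filtering semantics of such a query can be sketched over an in-memory list. Everything here is an assumption for illustration: the stand-in event classes, the standalone `query` function, and in particular the choice to treat the range as half-open (start included, end excluded).

```python
class Event:
    """Stand-in base class; real events would carry many more fields."""

    def __init__(self, timestamp):
        self.timestamp = timestamp


class RevisionSaved(Event):
    pass


class UserRegistered(Event):
    pass


def query(events, start=None, end=None, types=None):
    """Yield events with start <= timestamp < end whose class is in `types`."""
    wanted = tuple(types) if types else None
    for event in events:
        if start is not None and event.timestamp < start:
            continue
        if end is not None and event.timestamp >= end:
            continue
        if wanted is not None and not isinstance(event, wanted):
            continue
        yield event


# Made-up history, for illustration only.
history = [
    RevisionSaved("20140728235959"),
    RevisionSaved("20140729120000"),
    UserRegistered("20140730000000"),
    RevisionSaved("20140731000000"),
]
matches = list(query(history, start="20140729000000",
                     end="20140731000000", types={RevisionSaved}))
print([m.timestamp for m in matches])
```

Plain string comparison is enough here because fixed-width MediaWiki timestamps sort lexicographically in chronological order.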

Dumps

mw_event_reader = MWEventReader("event_dump.enwiki.1.json.7z")
for user_registered in mw_event_reader.filter(types={UserRegistered}):
    pass  # do thing with user_registered
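
A dump reader along these lines might look like the sketch below. It assumes the dump has already been decompressed and holds one JSON object per line with a "type" key naming the event. Neither the file layout nor the `JSONLinesEventReader` class is specified by this project; both are hypothetical.

```python
import io
import json


class UserRegistered:
    """Stand-in event class; only its name is used for matching here."""


class RevisionSaved:
    """Stand-in event class; only its name is used for matching here."""


class JSONLinesEventReader:
    """Iterate over events stored as one JSON object per line (assumed layout)."""

    def __init__(self, lines):
        self.lines = lines  # any iterable of text lines, e.g. an open file

    def __iter__(self):
        for line in self.lines:
            line = line.strip()
            if line:
                yield json.loads(line)

    def filter(self, types):
        """Yield only the events whose "type" matches one of the given classes."""
        names = {t.__name__ for t in types}
        return (event for event in self if event.get("type") in names)


# Made-up dump content, for illustration only.
dump = io.StringIO(
    '{"type": "UserRegistered", "timestamp": "20140729000001"}\n'
    '{"type": "RevisionSaved", "timestamp": "20140729000002"}\n'
)
reader = JSONLinesEventReader(dump)
registered = list(reader.filter(types={UserRegistered}))
print(registered)
```

This version yields plain dicts; a fuller reader would presumably turn each line back into an event object.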

Relevant bugs

  • T28122 No way to get the ID of a deleted page from deletion logs
  • T59084 Store the page_id of the moved page in log_page
  • T71005 Add a list=recentchanges result property for title without namespace

Standardization

MediaWiki events
  • consolidates domain knowledge and wiki archaeology
  • hides complexity -- produces standardized data structures
  • reads from MySQL database and api.php. Extendable to new formats.
  • produces JSON
  • provides a special Unavailable datatype to flag critical data that is not currently available
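
The Unavailable datatype could be sketched as a singleton sentinel with a JSON encoding hook. The class name `UnavailableType`, the `encode` helper and the `{"unavailable": true}` representation are assumptions for illustration. The example uses the page id of a PageDeleted event, which the deletion log currently does not record.

```python
import json


class UnavailableType:
    """Singleton marking a value that should exist but cannot be recovered."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "Unavailable"

    def __bool__(self):
        return False


Unavailable = UnavailableType()


def encode(obj):
    """`default` hook for json.dumps: tag Unavailable instead of failing."""
    if obj is Unavailable:
        return {"unavailable": True}  # assumed wire format
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


# Made-up PageDeleted payload; log_page is zero, so the id is flagged.
page_deleted = {"page": {"id": Unavailable, "namespace": 0, "title": "Example"}}
encoded = json.dumps(page_deleted, default=encode)
print(encoded)
```

A distinct sentinel lets consumers tell "not recorded in the source data" apart from a real null or zero.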

Support needed

  • DBAs at the Wikimedia Foundation to explore means of publishing EventLogging infrastructure
  • Developers working in languages other than Python to discuss cross-language API similarities




See also

References