MGerlach (WMF)
About me
Hi. My name is Martin. I joined the Wikimedia Foundation in September 2019 as a Research Scientist in the Research team. My background is in Physics, where I worked on understanding the dynamics of complex social systems. I currently live and work (mostly) in Berlin, Germany.
My work
My work focuses on Knowledge Gaps in order to understand and address structural inequalities in Wikimedia projects and the online ecosystem more generally. I have been contributing to this program in three main areas: i) understanding readers and how they navigate Wikipedia; ii) developing models for structured tasks to make it easier for newcomer editors to contribute; and iii) developing models to reliably assess the readability of content in Wikipedia. For more details about ongoing and past projects see below.
Contact me
- email: mgerlach@wikimedia.org
- office hours: book a session
- irc: mgerlach (#wikimedia-researchconnect)
- Personal website: martingerlach.github.io/
de-N: This user speaks German as a native language.
en-5: This user has professional knowledge of English.
Projects
A collection of things that I have been working on.
Tools
Some of the tools that I have developed or helped develop.
- List-building models: This tool builds a list of articles related to a "seed" article based on various models.
- WikiNav: This tool provides insights into how readers of Wikipedia explore content when learning about a given topic, using the clickstream dataset. See also our post on the Wikimedia Tech Blog.
- Readability tool: This tool provides scores for an article's readability in different languages (under development).
- Wiki-Visibility tool: This tool provides recommendations to increase the visibility of orphan articles.
- DP Pageviews Visualizer: This tool provides an example of how we might visualize a differentially private pageview dataset (i.e. views to a page, split up by country and day). See also the blogpost on the dataset.
Some Python packages that make it easier to work with Wikimedia data:
- mwparserfromhtml: parsing Wikipedia HTML (Parsoid output). See also our post on the Wikimedia Tech Blog.
- mwtokenizer: word / sentence tokenization for text in (almost) all languages in Wikipedia
Other resources:
- Wikimedia Data Tutorial: Using public data from Wikipedia and its sister projects for academic research. A tutorial on how to get started in working with Wikimedia data for research.
Ongoing projects
- Research:Improving multilingual support for link recommendation model for add-a-link task
- Research:Multilingual Readability Research
- Research:Develop a model for text simplification to improve readability of Wikipedia articles
- Research:Recommending links to increase visibility of articles
- Research:Understanding Search Engine To Wikipedia
Completed projects
- Research:Understanding Curious and Critical Readers
- Research:Characterizing Readers Navigation
- Copyediting as a Structured Task
- Link recommendation
- Language-agnostic list-building for ad-hoc topic modeling
- Developing metrics for content gaps in the taxonomy of knowledge gaps
- Metrics for quantifying gender content gaps
- List of covid-related articles based on reader sessions (reader interest)
- New user reading patterns
- Usage of talk pages
Communication
Papers:
- Dale Zhou, Shubhankar Prashant Patankar, David Martin Lydon-Staley, Perry Zurn, Martin Gerlach, and Danielle S Bassett. 2024. Architectural styles of curiosity in global Wikipedia mobile app readership. Science Advances. https://doi.org/10.1126/sciadv.adn3268
- Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West. Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia. EMNLP 2024. https://arxiv.org/abs/2410.04254
- Mykola Trokhymovych, Indira Sen, Martin Gerlach. An Open Multilingual System for Scoring Readability of Wikipedia. ACL 2024. https://arxiv.org/abs/2406.01835v1
- Akhil Arora, Robert West, Martin Gerlach. Orphan Articles: The Dark Matter of Wikipedia. ICWSM 2024. https://arxiv.org/abs/2306.03940
- Tiziano Piccardi, Martin Gerlach, Robert West. Curious Rhythms: Temporal Regularities of Wikipedia Consumption. ICWSM 2024. https://arxiv.org/abs/2305.09497
- Tiziano Piccardi, Martin Gerlach, Akhil Arora, and Robert West. 2023. A Large-Scale Characterization of How Readers Browse Wikipedia. ACM Transactions on the Web. https://arxiv.org/abs/2112.11848
- Tiziano Piccardi, Martin Gerlach, Robert West. 2022. Going Down the Rabbit Hole: Characterizing the Long Tail of Wikipedia Reading Sessions. WikiWorkshop 2022: In Companion Proceedings of The Web Conference 2022 (WWW '22). https://arxiv.org/abs/2203.06932
- Akhil Arora, Martin Gerlach, Tiziano Piccardi, Alberto García-Durán, Robert West. 2022. Wikipedia Reader Navigation: When Synthetic Data Is Enough. Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM '22). https://arxiv.org/abs/2201.00812
- Martin Gerlach, Marshall Miller, Rita Ho, Kosta Harlan, Djellel Difallah. 2021. A Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. 30th ACM International Conference on Information and Knowledge Management (CIKM '21). https://arxiv.org/abs/2105.15110
- Isaac Johnson, Martin Gerlach, Diego Sáez-Trumper. 2021. Language-agnostic Topic Classification for Wikipedia. WikiWorkshop 2021: In Companion Proceedings of The Web Conference 2021 (WWW '21). https://arxiv.org/abs/2103.00068
- Miriam Redi, Martin Gerlach, Isaac Johnson, Jonathan Morgan, Leila Zia. 2021. A Taxonomy of Knowledge Gaps for Wikimedia Projects (Second Draft). https://arxiv.org/abs/2008.12314
Blogposts
- 2023-02: Blogpost titled From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps on the Wikimedia Tech Blog with Isaac Johnson and Nazia Tasnim
- 2021-09: Blogpost titled Analyzing the Wikipedia clickstream just got easier with WikiNav on the Wikimedia Tech blog with Muniza A. and Isaac Johnson (WMF)
- 2021-09: Blogpost titled World Suicide Prevention Day and the opportunity to increase access to mental health information on Wikimedia projects on the Diff blog with Cristina Butoiu (WMF) and Leighanna Mixter (WMF)
Talks
- 2024-06: I participated in a panel discussing the future of Wikipedia at the Wikipedia Zukunftskongress 2024.
- 2024-03: Presentation at the Wikimedia Foundation March 2024 Staff meeting on 5 new learnings from our research on readers
- 2024-02: I gave a talk at EPFL (hosted by the Data Science Lab) on Research at the Wikimedia Foundation to advance knowledge equity.
- 2023-06: Invited talk at the Computational Social Science Seminar at Centre Marc Bloch on Going down the rabbit hole: Understanding information seeking in Wikipedia
- 2023-02: Presentation at FOSDEM on Building open tools to support research on Wikimedia projects
- 2022-04: Presentation at the Wikimedia Foundation April 2022 Staff meeting on 5 Learning from Research on Reader Navigation
- 2021-08: Lecture on Editing with machine learning: a case study on link recommendations at Wikimania 2021 with Kosta Harlan (WMF), Marshall Miller (WMF), Rita Ho (WMF), and Morten Warncke-Wang (WMF)
- 2021-08: Workshop on Indicators for the Wikimedia Projects at Wikimania 2021 with Marc Miquel, Pablo Aragón (WMF), Miriam Redi (WMF), David Laniado, and Cristian Consonni.
- 2021-06: Keynote on The science of knowledge equity at the Wikimedia Foundation at PCNet21 satellite workshop on political communication networks as part of Networks 21: A Joint Sunbelt and NetSci Conference.
Organization
- 2024-06: Co-chair of the Research Track for Wiki Workshop 2024: the annual workshop for all research on Wikimedia projects.
- 2024-06: I co-organized a tutorial on How to use Wikimedia Data for academic research, held at ICWSM 2024.
- 2023-04: Co-chair of the Research Track for Wiki Workshop 2023: the annual workshop for all research on Wikimedia projects.
Mentoring
- 2022: Mentor for Outreachy internship (Round 24): Build Python library to work with html-dumps (phab:T302237)
- 2021: Mentor for Outreachy internship (Round 22): Build a tool for analyzing and visualizing reader navigation on wikipedia (phab:T275608)