Grants:Programs/Wikimedia Research Fund/Wiki-JoCER: Joint Coreference and Entity Resolution with WikiData

status: not funded
Wiki-JoCER: Joint Coreference and Entity Resolution with WikiData
start and end dates: August 15 2022 to August 14 2023
budget (USD): 40,000-50,000
applicant(s): Anders Søgaard

Overview

Username

Applicant's Wikimedia username. If one is not provided, then the applicant's name will be provided for community review.

Anders Søgaard

Project title

Wiki-JoCER: Joint Coreference and Entity Resolution with WikiData

Entity Receiving Funds

Provide the name of the individual or organization that would receive the funds.

University of Copenhagen

Research proposal

Description

Description of the proposed project, including aims and approach. Be sure to clearly state the problem, why it is important, why previous approaches (if any) have been insufficient, and your methods to address it.

In recent years, I have built coreference and entity resolution systems based on WikiData [0, 4, 5], and have used WikiData to fine-tune language models [1] and semantic parsers [2]. WikiData is incredibly useful for NLP research, not just because of its size and richness, but because of its multilingual support. In the proposed project, I wish to merge our protocol for using WikiData to annotate for coreference [4] with our recent work on producing joint resources for coreference and entity resolution [5]. The key deliverable is a multilingual corpus annotated for coreference and entities that can be used to train and evaluate (possibly joint) coreference and entity resolvers across a range of languages. We specifically consider languages for which we have already trained language models and have some experience in recruiting annotators: Bosnian, Icelandic, Jamaican Creole, Quechua, and Ukrainian. We will annotate using WikiData-generated entity lists [4] in multiple languages, and use human translation to obtain resources for even more languages.
  • [0] Aralikatte, Rahul; Lent, Heather; Gonzalez, Ana Valeria; Hershcovich, Daniel; Qiu, Chen; Sandholm, Anders; Ringgaard, Michael; Søgaard, Anders. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. Conference on Empirical Methods in Natural Language Processing (EMNLP). Hong Kong, China.
  • [1] Garneau, Nicolas; Hartmann, Mareike; Sandholm, Anders; Ruder, Sebastian; Vulic, Ivan; Søgaard, Anders. 2021. Analogy Training Multilingual Encoders. The 35th AAAI Conference on Artificial Intelligence (AAAI). Virtual conference.
  • [2] Abdou, Mostafa; Sas, Cezar; Aralikatte, Rahul; Augenstein, Isabelle; Søgaard, Anders. 2019. X-WikiRE: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension. Workshop on Deep Learning Approaches for Low-Resource NLP, EMNLP 2019.
  • [4] Aralikatte, Rahul; Søgaard, Anders. 2020. Model-based annotation of coreference. The 12th International Conference on Language Resources and Evaluation (LREC). Marseille, France.
  • [5] Barrett, Maria; Lam, Hieu Trong; Wu, Martin; Lacroix, Ophélie; Plank, Barbara; Søgaard, Anders. 2021. Resources and Evaluations for Danish Entity Resolution. EMNLP Workshop on Computational Models of Reference, Anaphora and Coreference. Punta Cana, Dominican Republic.

Budget

Approximate amount requested in USD.

40,000-50,000

Budget Description

Briefly describe what you expect to spend money on (specific budgets and details are not necessary at this time).

Research assistant salary (6 months): 30,000 USD. Travel: 2,000 USD. AWS compute: 2,000 USD. Other direct costs: 6,000 USD. Human translation costs: 3,000 USD. Crowdsourcing annotation costs: 7,000 USD.

Impact

Address the impact and relevance to the Wikimedia projects, including the degree to which the research will address the 2030 Wikimedia Strategic Direction and/or support the work of Wikimedia user groups, affiliates, and developer communities. If your work relates to knowledge gaps, please directly relate it to the knowledge gaps taxonomy.

Multilingual coreference and entity resolution will enable intelligent text processing in multiple languages, bridging the digital language divide and improving downstream NLU techniques, and at the same time enable automated Wikipedia link insertion in newswire, social media, and on websites. Coreference and entity resolution are also crucially important for knowledge and information extraction.

Dissemination

Plans for dissemination.

We will present our work at EMNLP 2023, as well as at any relevant scientific workshops during the project lifetime. Our data and models will be made publicly available, including on GitHub and Hugging Face. Finally, we will contact Explosion.ai to offer our help with including models in the spaCy pipeline.

Past Contributions

Prior contributions to related academic and/or research projects and/or the Wikimedia and free culture communities. If you do not have prior experience, please explain your planned contributions.

Please see the references above. Other contributions to NLP more broadly are listed at https://anderssoegaard.github.io/.

I agree to license the information I entered in this form, excluding the pronouns, countries of residence, and email addresses, under the terms of Creative Commons Attribution-ShareAlike 4.0. I understand that the decision to fund this Research Fund application, the application itself, and all the information entered by me in this form, excluding the pronouns, countries of residence, and email addresses of the personnel, will be published on Wikimedia Foundation Funds pages on Meta-Wiki and will be made available to the public in perpetuity. To make the results of your research actionable and reusable by the Wikimedia volunteer communities, affiliates and Foundation, I agree that any output of my research will comply with the WMF Open Access Policy. I also confirm that I have read the privacy statement and agree to abide by the WMF Friendly Space Policy and Universal Code of Conduct.

Yes