Grants:IEG/A graphical and interactive etymology dictionary based on Wiktionary

This project is funded by an Individual Engagement Grant

status: selected

A graphical and interactive etymology dictionary based on Wiktionary

summary: The aim of the project is to develop a tool to extract etymological relationships from Wiktionary, to build a database of etymological relationships, and to develop an interactive and open source tool to visualize the etymological tree of words in Wiktionary.

target: English Wiktionary, eventually any language version of Wiktionary.

strategic priority: improving quality

amount: 30000 USD (preferably in EUR)

grantee: Epantaleo

advisor: Tommaso.dinoia

contact: esterpantaleo (at) gmail.com

volunteers: AncientGebts, Xferguson


created on: 15:19, 11 April 2016 (UTC)

round 1 2016


Project idea

What is the problem you're trying to solve?

Etymological definitions in the English version of Wiktionary are particularly well compiled and contain a very rich set of information that also includes etymological definitions of foreign words.

This rich resource could be presented in a simpler and more intuitive way. First, for readers: etymologies often consist of fairly long and complex sentences that describe etymological relationships between words defined on different Wiktionary pages. Second, for editors, who have to navigate a chain of interconnected Wiktionary pages when they edit etymologically related words and have to choose among a large number of elaborate templates. Third, for researchers: Wiktionary etymological data is difficult to export into a database because, while etymological entries, cognates and ancestors are encoded using templates, etymological relationships are not. They are expressed by relatively arbitrary words chosen by editors (like "coined from", "abbreviation of", "calque of", "back-formation of") and sometimes by complex sentences.

What is your solution?

The etymological tree of the English word 'butter' as visualized by etytree

Etymologies could be represented using an intuitive and multilingual graphical representation where words in different languages that derive from the same ancestor are connected with each other to form a tree structure (similar to a family tree) and where etymological relationships are expressed by properties attached to links between words. A first demo of the proposed visualization tool etytree - based on a manually constructed minimal database of etymological relationships - is available here (use a desktop for a better experience).
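The representation described above (words as nodes, etymological relationships as properties on the links between them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Word` and `EtymologyGraph`, the relation labels, and the sample words are hypothetical and are not taken from the etytree code or from extracted data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Word:
    """A node: one word form in one language (hypothetical structure)."""
    language: str  # language code, e.g. "en"
    form: str


@dataclass
class EtymologyGraph:
    """Links from a word to its ancestors, each carrying a relation label."""
    links: dict = field(default_factory=dict)

    def add_link(self, descendant, ancestor, relation):
        # The relation label is the "property attached to the link".
        self.links.setdefault(descendant, []).append((ancestor, relation))

    def ancestors(self, word):
        """Return every ancestor reachable from `word`, nearest first."""
        found, queue = [], [word]
        while queue:
            current = queue.pop(0)
            for ancestor, _relation in self.links.get(current, []):
                if ancestor not in found:
                    found.append(ancestor)
                    queue.append(ancestor)
        return found


# Illustrative sample only, not data extracted from Wiktionary.
butter = Word("en", "butter")
butere = Word("ang", "butere")
butyrum = Word("la", "butyrum")

graph = EtymologyGraph()
graph.add_link(butter, butere, "inherited from")
graph.add_link(butere, butyrum, "borrowed from")
print(graph.ancestors(butter))
```

Keeping the relation on the link rather than on the word lets the same word take part in several relationships of different kinds.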

A graphical interface will improve the experience of both users and editors as well as generate a database of etymological relationships.

When searching for etymologies, users will discover new words that derive from the same ancestral word, both in their own language and in other languages, and will interact with the tool by clicking on nodes (to collapse/expand the tree), mousing over nodes and links to see their properties, zooming/panning, navigating the dictionary.

Editors will find it easier to check the consistency of etymological information across multiple Wiktionary pages because they will be able to see the whole etymological tree. Eventually, they will have a graphical interface to edit nodes or connections or to choose templates, which will facilitate the extraction of data into a database.

The research community will be able to explore a database of etymological relationships and extract interesting information: for example how pronunciations or semantics (if the etymological network is integrated with the semantic network) evolved through time across etymological trees and across languages.

As a framework, DBnary will be used. DBnary is already capable of extracting other information from Wiktionary (Definition, Part of Speech, Synonyms, etc) and its sister project (Blexisma) can also handle semantics. As a result this project will allow integration of Wiktionary data into Wikimedia and eventually into Wikidata.

Project goals

A primary goal of the project is to extend the DBnary Java framework (which can already extract lexical entries, lexical forms, lexical senses, POS, nyms, and more [1]) to extract the etymological relationships between lexemes that are contained in the Etymology sections of the English Wiktionary. This information will be extracted into an RDF database in Wikibase (using the Turtle syntax, following the W3C standards), and the database will be kept synchronized with Wiktionary.
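To make the RDF and Turtle part of this goal concrete, here is a minimal sketch of how extracted relationships could be written out as Turtle triples. The namespace `http://example.org/ety#`, the property name `etymologicallyDerivesFrom`, and the resource names are assumptions made for illustration; the actual vocabulary would come out of the ontology discussion with the community, and a real implementation would more likely use an RDF library than string formatting.

```python
PREFIX = "ety"
NAMESPACE = "http://example.org/ety#"  # hypothetical namespace


def to_turtle(triples):
    """Serialize (subject, predicate, object) local names as Turtle text.

    All three parts are assumed to be simple local names that are valid
    after the prefix (letters, digits, underscores).
    """
    lines = [f"@prefix {PREFIX}: <{NAMESPACE}> .", ""]
    for subject, predicate, obj in triples:
        lines.append(f"{PREFIX}:{subject} {PREFIX}:{predicate} {PREFIX}:{obj} .")
    return "\n".join(lines)


# Illustrative triples only; the property name is an assumption.
triples = [
    ("eng_butter", "etymologicallyDerivesFrom", "ang_butere"),
    ("ang_butere", "etymologicallyDerivesFrom", "lat_butyrum"),
]
print(to_turtle(triples))
```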

The framework will be developed with future integration of the extracted database into Wikidata in mind, which will become possible when the Wikidata-for-Wiktionary proposal (see the current proposal) goes into production. For this purpose, a document will be drafted describing mappings between the RDF data structure (e.g. resources, properties, statements) and the latest proposed Wikidata data structure (e.g. entities, qualifiers, statements), using the Wikidata Toolkit as a reference (see wdtk-datamodel). Note that the Wikidata Toolkit can extract Wikidata data into RDF.

A second goal of the project is to build an interactive visualization tool to represent the extracted etymological relationships as well as the associated lexical information using trees.

Project plan

Activities

The project will iterate over the following five steps:

#	Title	Objective	Month	Effort
1	database extraction	Develop `ety2data`, a Java extraction tool that extracts etymologies from the English Wiktionary and converts them into an `RDF` database of resources/nodes (words) and properties/links (etymological relationships). The tool will build on DBnary, a Java library to extract data from Wiktionary, and will use Wiktionary templates like `{{etyl}},{{term}},{{m}},{{back-form}},{{compound}},{{blend}},{{rfe}},{{etystub}},{{derived}},{{inherited}},{{cognate}},{{suffix}},{{prefix}},{{calque}},{{borrowing}},{{learned borrowing}},{{rfv-etymology}}` and regular expressions to extract etymological relationships. `Ety2data` will use all Wiktionary languages as well as special languages; possibly, for consistency, this set of languages will be reduced to a smaller set. To facilitate integration into Wikidata, the Java framework will be developed using the Wikidata-Toolkit Java framework as a reference (e.g. languages will be treated as in WikimediaLanguageCodes.java, etc.).	M1-M3	50%
2	visualization development	Develop `etytree` as a Cartesian Tree Graph Extension to visualize an etymological tree from a database query. A demo of `etytree` is available here and the associated code is available in the GitHub repository. The demo uses d3.js, a JavaScript library for manipulating documents based on data. While d3.js could be used in Wikimedia, it would require developing a new Wikimedia Extension; for this reason a Cartesian Graph Extension will be used instead. `Etytree` will also infer the tree structure from the `RDF` database on the fly through specific SPARQL queries using Apache Jena Fuseki.	M3	5%
3	test	Iteratively test both `etytree` and `ety2data` on increasingly large samples of 1000, 10000, 100000, etc. lexemes from the English Wiktionary; as the sample size increases, the number of extraction rules in `ety2data` and of visualization options in `etytree` will increase (which will require going back to steps 1 and 2).	M4-M6	30%
4	community dissemination	Start a wiki to discuss with the Wiktionary etymology community how to re-format etymological definitions whose content cannot be exported by `ety2data` (e.g. alternative etymologies, typos, etc.). Start a wiki to discuss the visualization tool user experience. Interact with the community on chats and forums.	M4-M6	10%
5	integration with Wikidata	Draft a document that maps `ety2data` data structures to the latest Wikidata-for-Wiktionary proposed data structures, using the Wikidata-toolkit data structures as a reference.	M6	5%
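Step 2 in the table above mentions inferring the tree structure from the database on the fly. The sketch below shows that idea in plain Python, as a stand-in for the SPARQL queries: given flat (descendant, ancestor) pairs, it walks up to the oldest ancestor and then nests all descendants under it. The function names, the `name`/`children` layout, and the sample pairs are assumptions for illustration; the sketch also assumes each word has a single recorded ancestor and that the data contains no cycles below the root.

```python
def find_root(word, parent_of):
    """Follow ancestor links upwards until a word with no ancestor is found."""
    seen = {word}
    while word in parent_of:
        word = parent_of[word]
        if word in seen:
            raise ValueError("cycle detected in etymological links")
        seen.add(word)
    return word


def build_tree(root, edges):
    """Nest (descendant, ancestor) pairs into name/children dictionaries."""
    children = {}
    for descendant, ancestor in edges:
        children.setdefault(ancestor, []).append(descendant)

    def expand(node):
        return {
            "name": node,
            "children": [expand(child) for child in sorted(children.get(node, []))],
        }

    return expand(root)


# Illustrative pairs only, written as "language:word".
edges = [
    ("en:butter", "ang:butere"),
    ("ang:butere", "la:butyrum"),
    ("fr:beurre", "la:butyrum"),
]
parent_of = dict(edges)
root = find_root("en:butter", parent_of)
tree = build_tree(root, edges)
print(tree)
```

Starting from any word, the same two calls return the whole family of related words, which is what lets a reader discover cognates in other languages.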

The most complex part of the project will be developing ety2data (step 1), the extraction tool that will populate the database using Wiktionary etymology entries. This is because the etymology section of words in Wiktionary is textual and because reconstructing the full etymological tree of a word means combining the etymology sections of multiple Wiktionary pages. Before this project was proposed, a preliminary version of ety2data was tested with encouraging results: etymologies seem to be formatted in a regular way, and the use of etymology templates facilitates extracting the tree structure from the English Wiktionary.
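As a rough illustration of why templates make extraction easier, the following sketch pulls relationships out of a fragment of wikitext with a regular expression. It is not the ety2data implementation (which is written in Java on top of DBnary). It assumes, for simplicity, that the relevant templates follow the layout `{{name|language|source language|term}}`; real templates vary in their parameters, and nested templates are not handled.

```python
import re

# Matches a non-nested template: {{name|arg1|arg2|...}}
TEMPLATE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")

# Subset of template names treated as etymological relations (assumption).
RELATION_TEMPLATES = {"inherited", "derived", "borrowing", "calque"}


def extract_relations(wikitext):
    """Return (relation, language, source_language, term) tuples."""
    relations = []
    for match in TEMPLATE.finditer(wikitext):
        name = match.group(1).strip()
        if name not in RELATION_TEMPLATES:
            continue
        # Keep positional arguments only; named ones look like key=value.
        positional = [
            arg.strip() for arg in match.group(2).split("|") if "=" not in arg
        ]
        if len(positional) >= 3:
            language, source_language, term = positional[:3]
            relations.append((name, language, source_language, term))
    return relations


# Illustrative wikitext, not copied from a Wiktionary page.
sample = (
    "From {{inherited|en|enm|buter}}, from {{inherited|en|ang|butere}}, "
    "from {{derived|en|la|butyrum}}. Compare {{m|la|butyrum}}."
)
for relation in extract_relations(sample):
    print(relation)
```

Free-text phrases such as "coined from" would need separate rules, which is where the regular expressions mentioned in step 1 come in.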

Additional notes

When choosing the appropriate framework for the extraction of etymologies, we selected DBnary because it seemed to be very well written, documented, and maintained.

A similar etymology extraction tool is the Etymological Wordnet. We were hoping to use it and integrate it with etytree, but unfortunately the Etymological Wordnet tool is not publicly available. Moreover, from a first inspection of the data extracted with it (which is available at the link above), we believe the ety2data extraction procedure can extract more etymological relationships.

Budget

Total amount requested

$30,000.00 USD (in EUR)

Budget breakdown

Number	Category	Item description	Unit	Number of units	Cost per unit	Total cost	Currency	Notes
1	Project Manager and Programmer	Compensation for Epantaleo, committing to 40 hrs per week over the six-month period at USD $30/hr as project leader and as JavaScript and Java developer for the visualization and the extraction tool.	Contract	1	28,800.00	28,800.00	USD
2	Dissemination	Participation (travel, board & lodging) in relevant community conferences	One-off	1	1,200.00	1,200.00	USD

Community engagement

Input will be gathered throughout the development of the project using mailing lists (cf. community notification), IRC channels, and project wikis. The community will be involved in decisions such as the definition of an ontology for etymologies, the definition of the Wikidata-for-Wiktionary data structure, and the implementation of the visual interface.

Sustainability

Detailed documentation of the Java tool ety2data will be produced, as well as a detailed mapping between the ety2data data structure and the Wikidata-for-Wiktionary data structure. For this purpose the Java Wikidata-toolkit will be used as a reference.

Out of Scope: the Vision

Editors don't seem to follow a clear pattern or standard rule when formatting conflicting etymologies (this is the main problem with text-only etymologies: setting and following rules or standards when editing). Hopefully this project will help set standards through discussions on wiki pages as well as on the project wiki page; otherwise an important part of the information contained in Wiktionary etymologies will not be machine readable and exportable into a database. In a first version, for simplicity, conflicting etymologies will either be left out or be represented by only one of the conflicting versions. The plan is to have multiple (linked) visualizations when there are conflicting etymologies, and notes on branches of the tree (or elsewhere) when etymological relationships are controversial. Eventually conflicting etymologies could be assigned some kind of confidence, starting from a default "null prior" probability (0.5 for each alternative if there are only two alternatives), with the etymological tree that editors believe more likely being displayed first (users could navigate through a stack of multiple etymological trees when conflicting etymologies are available). In a database this would mean attaching a probability (a number from 0 to 1) to the etymological relationship.
Because the structure of the tree is language independent, the textual part of the tree (definition of words, language, etc) could be translated into different languages. The project could be extended to use more language versions of Wiktionary - although etymologies seem rather incomplete/informal in other languages.
As some users have suggested, semantic information could be attached to the tree by integrating this tool with Blexisma.
Eventually ety2data could turn into a tool to extract data from Wiktionary to Wikidata in sync.
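
The confidence idea in the first item above can be illustrated with a short sketch: every alternative etymology starts with the same "null prior" probability, and alternatives are then displayed in order of confidence once editors adjust the values. The function names and the labels of the alternatives are hypothetical, and nothing here reflects a decided design.

```python
def null_prior(alternatives):
    """Give every conflicting etymology the same starting probability."""
    if not alternatives:
        raise ValueError("at least one alternative is required")
    probability = 1.0 / len(alternatives)
    return {alternative: probability for alternative in alternatives}


def display_order(confidences):
    """Most likely etymology first; ties keep their original order."""
    return sorted(confidences, key=lambda alt: confidences[alt], reverse=True)


# Two hypothetical alternatives for the same word.
confidences = null_prior(["etymology A", "etymology B"])
print(confidences)

# Editors later judge B to be more likely (values are made up).
confidences["etymology A"] = 0.3
confidences["etymology B"] = 0.7
print(display_order(confidences))
```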

Measures of success

Success will be measured based on achievement of the following targets:

Implementation of a Java framework based on DBnary to extract etymological relationships as well as other relevant information from the English Wiktionary (including foreign words);
Creation of an RDF database of etymological relationships as well as other relevant information from the English Wiktionary (including foreign words);
Implementation of a visualization tool to visualize the etymological tree of any word in the English Wiktionary. At least 50 users should be testing the beta version;
Drafting of a detailed mapping between the ety2data data structures resulting in the RDF database and Wikidata claims, statements, entities, and qualifiers, using Wikidata-toolkit classes (including Wikimedia Language Codes classes) as a reference;
Discussion in any of the Wiktionary discussion rooms about how to format etymological definitions whose format is not exportable by the etymology extraction tool (e.g. alternative etymologies of lexemes) or that cannot be parsed and would need editing - showing proof of interaction with at least 30 editors.
Defining an ontology for etymological relationships after discussion with the Wiktionary community, both on the project wiki as well as on other wikis - showing proof of interaction with at least 30 editors.
Discussing the visualization tool user experience on the project wiki and on other wikis - showing proof of interaction with at least 30 editors.
Producing a list of 1000 visualizations that have been tested and are working correctly; this should correspond to more than 50000 lexemes in Wiktionary as each tree represents 50 lexemes on average.

Get involved

Participants

Ester Pantaleo is a PhD in Physics and a freelance data scientist. Her CV is available here. She has done research on different types of data including Finance and Genomics data and has always been interested in etymology and open source collaborative projects. She recently became interested in data visualizations and the semantic web.

Advisor: I can support the applicant in everything related to semantic data management as well as in the transition from a triple-based model (the one behind turtle) to the one adopted by Wikidata. I think my experience in Linked Data technologies could be a plus in the development of the project. Tommaso.dinoia (talk) 11:28, 1 June 2016 (UTC)
Volunteer: Help make connections of associations between words. AncientGebts (talk) 00:47, 15 January 2017 (UTC)
Volunteer: I'd like to do web development on this project. Xferguson (talk) 12:03, 6 August 2017 (UTC)

Community Notification

The community has been notified of this proposal through:

the Wiktionary Etymology Scriptorium, the Wiktionary Beer Parlour (click on the links to see a discussion of the etymology community about etytree)
the Wikimedia Research mailing list wiki-research-l@lists.wikimedia.org
the Wiktionary mailing list wiktionary-l@lists.wikimedia.org
the Multimedia team in the Editing department at the Wikimedia Foundation
the DBpedia developers mailing list dbpedia-developers@lists.sourceforge.net and the DBpedia ontology mailing list dbpedia-ontology@lists.sourceforge.net
the Open Knowledge Foundation's Working Group on "Open Data in Linguistics" mailing list open-linguistics@lists.okfn.org (two people on this mailing list have offered to volunteer to test the extraction tool and to help integrate this tool with WordNet)
Wiktionary Italia through the direttivo@wikimedia.it and the Spaghetti Open Data mailing list

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

d:User:Jura1. Hopefully the extraction method developed in this project will eventual facilitate including Wiktionary in a Wikibase environment.
Aryamanarora I saw the original proposal on Wiktionary and think it's very feasible and cool.
+1 for moving towards wikibase/any-structured-data-store. It looks like this project is a step in the right direction and it's execution will produce a technological probe^[2] that will inspire future work structuring wiktionary data. --EpochFail (talk) 14:15, 22 April 2016 (UTC)
I think that this will be an great way for users to visualize the history of changes of their language. If this tool becomes available, maybe it would be easier to identify patterns, which could advance research into the processes of human communication. Hardifesses (talk) 16:08, 22 April 2016 (UTC)
Wiktionary is an immensely useful tool for etymological data. This software would facilitate using it for cognate corpora and much much more. --Newbiepedian (talk) 16:44, 22 April 2016 (UTC)
As a lexicographer, I 100% support this excellent proposal -- really helps people understand relationships between words and across languages. --esperluette
This is a cool project, and I sincerely hope it will be funded and realized.
I endorse this interesting project. Very cool! VittorioPalmisano (talk) 12:40, 27 April 2016 (UTC)
I thinks this could be a very useful and interesting project, because in this way users which are aware of Semantic Web and Natural Language Processing, could easily retrive information about words etymology, but also query and visualize data in a very simple and clear way. Marcyborg (talk) 13:18, 27 April 2016 (UTC)
I Strongly endorse this project, I think is very useful and cool! Daniela Munoz
Ester is one of the most committed scientists and advocates for clarity/quality in research I know. Let's get this off the ground! 108.179.35.130 21:26, 27 April 2016 (UTC)
Wikimedia Italia endorses this project proposal. We think this project is valuable and we have invited other researcher to provide feedback for it. CristianCantoro on behalf of Wikimedia Italia --CristianCantoro (talk) 13:39, 3 May 2016 (UTC)
Heartily endorse. Model could be utilized in other structured metadata scenarios as well, although from the Talk page it sounds like this will be a silo'd project. Irregardless, look forward to the outcome. Seems very important to have framework such as this for possible future integration with other Wikimedia projects! -- BrillLyle (talk) 15:20, 3 May 2016 (UTC)
Pamputt (talk) 12:48, 4 May 2016 (UTC)
Very good idea, very valuable project for the wiktionary ! Lyokoï (talk) 16:04, 4 May 2016 (UTC)
Absolutely :) --Millosh (talk) 17:55, 4 May 2016 (UTC)
Great idea, please try to integrate Wikidata / Wikibase technology. -- T.seppelt (talk) 12:33, 5 May 2016 (UTC)
I think the project is great, not only (as stressed in Talk page) for the database/linked data backend, but also for the UI, that may as well integrate other ways to navigate the data (use of semantic relatedness was mentioned in the proposal). Dodecaplex (talk)
I think the idea behind this proposal is a huge leap towards the integration of Wiktionary into Wikidata. I also see it as a long-term opportunity for the inclusion of other lexical databases. Hjfocs (talk) 15:45, 9 May 2016 (UTC)
I think the proposal is really well structured and presented. Furthermore, the proposed idea looks very interesting and useful for exploratory scenarios in wiktionary data and entries. Finally, I see the results could be easily integrated in the data model of Wikidata. Tommaso.dinoia (talk) 11:23, 1 June 2016 (UTC)
I am endorsing this for the Wikidata dev team. The proposal looks good from my side. We are going to work on making it possible to store Wiktionary's data in Wikidata based on the proposal that has been linked here. As always we will not be working on migrating the actual content as that is a clear task for the editing community. This IEG proposal looks like it will be very useful for the migration of the data once it is technically possible to store it in Wikidata. --Lydia Pintscher (WMDE) (talk) 17:05, 3 June 2016 (UTC)
That would be indeed be a great project to integrate. On French Wiktionary, there article on frequent prefixes and suffixes, like télé-, would that be included ? I also have a research project on lexems "related to structures in French" which contains more than 1600 rows in misc. tables, which should be easy to extract and maybe useful for such a project. Please contact me if I can help in any way. --Psychoslave (talk) 17:52, 6 November 2016 (UTC)
This has my full support! It would be an extremely interesting, and very cool project, and I think it would make Wiktionary's etymologies far more accessible to the average reader, who may find it useful to visualize the information rather than interpret a text-only etymology. Andrew Sheedy (talk) 21:13, 2 December 2016 (UTC)
Good Idea!!! Ferdi2005 _(Posta) 15:39, 28 March 2017 (UTC)

I'm surprised it doesn't already exist. It would help me personally. 24.212.191.236 16:06, 23 January 2022 (UTC)

References

[1] http://www.lrec-conf.org/proceedings/lrec2012/pdf/387_Paper.pdf

[2] https://www.lri.fr/~mackay/pdffiles/CHI03.probes.pdf
