Grants:IEG/A graphical and interactive etymology dictionary based on Wiktionary/Midpoint

This project is funded by a Project Grant

Report accepted

This midpoint report for a Project Grant approved in FY 2017-18 has been reviewed and accepted by the Wikimedia Foundation.

To read the approved grant submission describing the plan for this project, please visit Grants:IEG/A graphical and interactive etymology dictionary based on Wiktionary.
You may still review or add to the discussion about this report on its talk page.
You are welcome to email projectgrants@wikimedia.org at any time if you have questions or concerns about this report.

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first months.

Summary I

In a few short sentences or bullet points, give the main highlights of what happened with your project so far.

Creation of a Java application to extract a database of etymological relationships from the English Wiktionary XML dump file: dbnary_etymology, based on Dbnary;
Generation of a 2.5 GB RDF database of etymological relationships, definitions, parts of speech and pronunciations of both English and non-English words, with 39 million triples covering a total of 3173 languages;
Testing of a Virtuoso Database Management System using appropriate SPARQL queries: Virtuoso;
Setting up a Tool Labs and a Wikimedia Labs project;
Testing using a D3 graph visualization: coffee, dolor.

Methods and activities I

So far the project has completed three main steps: extracting the database, querying it for the etymological tree of a specific word, and testing.

First step: Extracting the Database

First, a Java application called dbnary_etymology has been developed and is available on Bitbucket. This application is an extension to Dbnary that implements a new method to process the Etymology, Derived terms and Descendants sections of the English Wiktionary.

This part of the work builds upon the work of Prof. Gilles Sérasset (who helped merge the code with Dbnary). Both Prof. Sérasset and his PhD student Andon Tchechmedjiev have given valuable insights on how to improve the code. From time to time I still go back to the code and try to improve it.

I will qualitatively illustrate the extraction process with an example. The image below shows the current Etymology section of the English word coffee:

A screenshot of the Etymology section of the English word "coffee" in the English Wiktionary

The source that generates that text in Wiktionary is the following:

===Etymology===
From 1582 via {{der|en|nl|koffie||coffee}}, from {{der|en|it|caffè||coffee}}, from {{der|en|ota|قهوه|tr=kahve||coffee}}, from {{der|en|ar|قَهْوَة||coffee, a brew}}. Cognates include {{cog|tr|kahve}}. The Arabic word has been said to originally have referred to wine, although some sources instead claim it traces back to the name of the {{w|Kingdom of Kaffa|Kaffa region of Ethiopia}}, which is an {{der|en|omv}} word.

By applying a set of recursive rewriting rules, this section is rewritten into a pattern of strings that looks like this (see context-free grammar for something equivalent):

FROM LEMMA1, FROM LEMMA2, FROM LEMMA3, FROM LEMMA4. COGNATE_WITH LEMMA5. LEMMA6, LANGUAGE.

These relationships are then translated into the following triples:

LEMMA etymologicallyDerivesFrom LEMMA1
LEMMA1 etymologicallyDerivesFrom LEMMA2
LEMMA2 etymologicallyDerivesFrom LEMMA3
LEMMA3 etymologicallyDerivesFrom LEMMA4

which, with the appropriate substitutions, become:

eng:coffee etymologicallyDerivesFrom nld:koffie
nld:koffie etymologicallyDerivesFrom ita:caffè
ita:caffè etymologicallyDerivesFrom ota:قهوه
ota:قهوه etymologicallyDerivesFrom ara:قَهْوَة

Anything after the "." is ignored. From this etymology section we have extracted 4 etymological relationships.
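The chain-building step can be sketched in JavaScript as follows. This is only an illustration under simplifying assumptions, not the dbnary_etymology implementation (which is written in Java and relies on rewriting rules): it matches der templates with a regular expression, cuts the text at the first full stop, and uses a hypothetical ISO3 table to map Wiktionary language codes to the three-letter codes used in the triples. The names extractChain and ISO3 are invented for this example.

```javascript
// Hypothetical mapping from Wiktionary language codes to the codes used in the triples.
const ISO3 = { en: "eng", nl: "nld", it: "ita", ota: "ota", ar: "ara" };

// Builds a chain of "etymologicallyDerivesFrom" triples from the first sentence
// of an Etymology section. Simplification: only {{der|...}} templates are read.
function extractChain(startLemma, wikitext) {
  // Anything after the first "." is ignored.
  const firstSentence = wikitext.split(".")[0];
  // {{der|<target lang>|<source lang>|<word>|...}}
  const derTemplate = /\{\{der\|[^|}]*\|([^|}]*)\|([^|}]*)[^}]*\}\}/g;
  const lemmas = [startLemma];
  for (const match of firstSentence.matchAll(derTemplate)) {
    const language = ISO3[match[1]] || match[1];
    lemmas.push(language + ":" + match[2]);
  }
  // Link each lemma to the one that follows it in the chain.
  return lemmas
    .slice(0, -1)
    .map((lemma, i) => lemma + " etymologicallyDerivesFrom " + lemmas[i + 1]);
}

const wikitext =
  "From 1582 via {{der|en|nl|koffie||coffee}}, from {{der|en|it|caffè||coffee}}, " +
  "from {{der|en|ota|قهوه|tr=kahve||coffee}}, from {{der|en|ar|قَهْوَة||coffee, a brew}}. " +
  "Cognates include {{cog|tr|kahve}}.";

const triples = extractChain("eng:coffee", wikitext);
triples.forEach((triple) => console.log(triple));
```

Run on the coffee example, the sketch yields four triples, one per der template in the first sentence; the cognate in the second sentence is skipped.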

The application takes as input the XML file containing the dump of the English Wiktionary and processes all Etymology, Derived terms and Descendants sections to build a large database of etymological relationships (triples) between all words contained in the English Wiktionary, both English and non-English.

This database can be built with the following two commands:

java -cp $path_to_the_jar -Xmx1792m -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug org.getalp.dbnary.cli.ExtractWiktionary -l eng -m lemon -o $myout $path_to_dump > english.ttl
java -cp $path_to_the_jar -Xmx1792m -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug org.getalp.dbnary.cli.ExtractWiktionary -x -l eng -m lemon -o $myout $path_to_dump > foreign.ttl

where $path_to_dump is the path to the English Wiktionary dump. The first command processes English words, and the second processes non-English words.

After merging the two files with rdfcat we have obtained a database of triples containing etymological relationships of words as extracted from the English Wiktionary that can be updated every time a new dump is released.

Second step: Extracting the tree of etymologies for a specific word from the Database

The database described above contains all etymological relationships and is of interest in its own right. Our purpose, though, goes beyond the creation of such a database: we want to query it to find all words that are etymologically related to a specific word, and we want to visualize the etymological tree of that word.

In order to query the database we needed to set up a server. We used a Virtuoso server, which had already been set up for the Dbnary project, and asked to put our data in it. The database can be queried using SPARQL here. Unfortunately, the database that can be queried there is our second release and as such contains a lot of incorrect data. Many improvements have been made since then.

Currently we are setting up the same Virtuoso server on Wikimedia Labs with the help of the WMF developers. This way we will be able to update the data on the server as often as we like.

Third step: Testing

In order to test the database extraction procedure and the visualization, we decided to use a simpler visualization than the one we showed in the demo. The visualization we chose for testing uses graphs instead of trees. This is because, as of now, etymological relationships extracted from Wiktionary are not as linear as those described by a tree: they also contain loops due to errors, inconsistencies or incompleteness.

For example, sometimes etymologies in Wiktionary are incomplete, e.g., one Wiktionary page says that WORD1 derivesFrom WORD2 and that WORD2 derivesFrom WORD3, while another Wiktionary page says that WORD1 derivesFrom WORD3. As a result, a visualization that puts this information together will have a loop, because links can go from WORD1 to WORD3 in two different ways: by passing through WORD2 or by going directly to WORD3. This is easy to fix, under the assumption that the longest path is the true path.
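One way to apply the "longest path is the true path" assumption is to drop every direct link whose target can also be reached through intermediate words. The JavaScript sketch below illustrates that idea; it is not the pruning rule actually used in the project, and the function name pruneShortcuts and the edge format (pairs of word labels) are assumptions made for the example.

```javascript
// Removes "shortcut" edges: an edge from -> to is dropped when `to` can also be
// reached from `from` through at least one intermediate node.
// Edges are [from, to] pairs meaning "from derivesFrom to".
function pruneShortcuts(edges) {
  const adjacency = new Map();
  for (const [from, to] of edges) {
    if (!adjacency.has(from)) adjacency.set(from, new Set());
    adjacency.get(from).add(to);
  }

  // Depth-first search that starts from the other neighbours of `start`,
  // so the direct edge start -> target is never used.
  const reachableIndirectly = (start, target) => {
    const stack = [...(adjacency.get(start) || [])].filter((n) => n !== target);
    const seen = new Set(stack);
    while (stack.length > 0) {
      const node = stack.pop();
      if (node === target) return true;
      for (const next of adjacency.get(node) || []) {
        if (!seen.has(next)) {
          seen.add(next);
          stack.push(next);
        }
      }
    }
    return false;
  };

  return edges.filter(([from, to]) => !reachableIndirectly(from, to));
}

const pruned = pruneShortcuts([
  ["WORD1", "WORD2"],
  ["WORD2", "WORD3"],
  ["WORD1", "WORD3"], // shortcut: WORD3 is already reachable via WORD2
]);
console.log(pruned);
```

In the example, the direct link from WORD1 to WORD3 is removed and the two-step path through WORD2 is kept. The set of visited nodes keeps the search finite even when the extracted data contains cycles.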

With testing we will derive some rules to "prune" the graph and get a tree.

Testing consists of the following steps, in order:

Writing queries to interrogate the server about a specific word;
Getting the output graph of etymological relationships;
Using the output to produce an interactive visualization in the form of a graph;
Developing rules to prune the graph;
Fixing bugs in the extraction code or correcting errors in sections in Wiktionary;
Possibly proposing the use of a new template to facilitate data extraction;
Re-extracting the database.

To see an interactive visualization developed during testing after querying the database for the word dolor, go here, or here for the word coffee. Screenshots of those visualizations, in static (non-interactive) form, are shown below.

At this link you will find my project in the tool labs.

A screenshot of the interactive visualization for word dolor obtained using data from a query to the server

A screenshot of the interactive visualization for word coffee obtained using data from a query to the server

Midpoint outcomes I

What are the results of your project or any experiments you’ve worked on so far?

Please discuss anything you have created or changed (organized, built, grown, etc) as a result of your project to date.

I will summarize here what I have described in detail above.

The main outcome of the project is the implementation of a Java application that automatically creates a database of etymological relationships starting from a dump of the English Wiktionary. As a result, a database of etymological relationships can be created at any new dump release of the English Wiktionary.

A second outcome is the availability of one of those databases on a server online that can be queried to produce visualizations.

A third outcome is the implementation of a testing framework to improve on the visualization output as well as on data extraction.

Finances I

I plan on giving talks about the tool once a server has been set up on Wikimedia Labs, which might require one month. I might ask for an extension of the project by one month for the purpose of dissemination. Therefore I plan on spending the $1000 for dissemination in January.

I am optimistic, though, that the project itself can be completed by the end of December.

Learning I

During these first months of the project I have learnt how to interact with people in the WMF and with volunteers. If you want help you have to ask and be present in IRC channels or mailing lists. Participating in Wikimania has been valuable, as I have had a chance to meet a lot of people.

What are the challenges

What challenges or obstacles have you encountered? What will you do differently going forward? Please list these as short bullet points.

Setting up the server has been a big challenge because of the resources needed and administrator privileges.
As this is an online community and you have to build your network, it is not straightforward to interact with people unless you make specific efforts to contact them. However, once you find the right contacts things become much easier and enjoyable.
Learning the Wikimedia infrastructure without interacting with people was somewhat challenging: I would have liked to have a specific technical mentor assigned to me who could point me to the specific resources (in terms of people, infrastructure or links).

What is working well

What have you found works best so far? To help spread successful strategies so that they can be of use to others in the movement, rather than writing lots of text here, we'd like you to share your finding in the form of a link to a learning pattern.

Once you ask the right people (volunteers/WMF), you get a lot of help in a very efficient way (e.g., IRC channels #wikimedia-dev #wikimedia-labs #wiktionary #wikimedia-devrel #wikidata);
Meeting people with very specific interests, from whom you can learn a lot (e.g., Tremendous Wiktionary group);
Wikimania;
My advisor Prof. Tommaso Di Noia gives me good advice every time I meet him (~once per month).

Next steps and opportunities I

What are the next steps and opportunities you’ll be focusing on for the second half of your project? Please list these as short bullet points. If you're considering applying for a 6-month renewal of this IEG at the end of your project, please also mention this here.

Set up a server (Virtuoso if possible);
Refine SPARQL queries;
Refine pruning rules for the graphs;
Set up a web page for the visualizations;
Give talks about the project;
Write emails and blogs for dissemination;
Create a page where I post ideas for new templates in Etymology sections - I have already contributed to the improvement of two templates and proposed a new template;
This project can potentially be renewed to:
- Create data to set up a timeline on the tree with dates associated to branches in the tree;
- Push data to Wikidata once the infrastructure there is ready;
- Clean data in collaboration with etymology editors;
- Organize hackathons (for example in linguistics departments) where users check on etymologies in Wiktionary and try to improve on them - e.g., make them parsable.

Grantee reflection I

We’d love to hear any thoughts you have on how the experience of being an IEGrantee has been so far. What is one thing that surprised you, or that you particularly enjoyed from the past 3 months?

I thought I would be contacted more often to check on the project progress.

I particularly enjoyed keeping in touch with volunteers and WMF staff that I met during Wikimania and that supported my project.

Summary II

At this point we have reached the second midpoint of this project. So far we have accomplished the following:

VISUALIZATION - We tested different visualizations, and finally chose dagre-d3, which arranges nodes from left to right according to the direction of the connecting arrow. We also color the ancestors of the searched node differently. Finally, we improved queries to the database, and we are using RxJS, which speeds up retrieval of query results from the server.

DISSEMINATION - We presented etytree at two meetings in Bari, Italy, at the Polyglot Gathering in Bratislava, Slovakia, at the Wikimedia France offices in Paris, and at Wikimania 2017 in Montreal, Canada.

DATA EXTRACTION - We fixed a number of bugs, from the extraction of compounds, to the extraction of Ido and Esperanto entries.

Methods and Activities II

As the basic structure of etytree has already been laid out, our main goals for this project renewal are to disseminate the project and to improve some of its features. The two things go hand in hand as, while we do dissemination, we also get feedback and useful tips.

In the following Sections we will describe the main areas we have been working on these past 4 months (visualization and data extraction improvement, dissemination) and the methods used in each of these areas.

Visualization Improvement

At the end of the first etytree grant we delivered a working tool that was showing the etymological tree of words using clouds of words, with no particular order (see screenshots above). One of the main aims of the renewal is to improve the visualization, in particular to display nodes in the directed graph exploiting directionality.

At Wikimania we interacted with the Wikidata team, which suggested that we test the JavaScript library vis.js to visualize directed graphs. After we set vis.js up, we realized that the vis.js graph layout would not always arrange nodes from left to right following the direction of the arrows. The screenshot below shows what we mean.

This is a screenshot of the directed graph representation for word "door" using the javascript graph visualization library vis.js

Even though vis.js has been set to use a "left to right" representation of directed graphs, the final graph doesn't always have a left to right orientation. This happens, for example, in the directed graph of the word "door" shown in the screenshot: not all arrows go from left to right. The performance of vis.js can be tested using the visjs-feature branch of etytree on GitHub.

We then decided to stick to dagre-d3 as it displays the correct hierarchy between nodes.

A second improvement we have been working on is the use of RxJS to manage queries to the Virtuoso SPARQL endpoint. Thanks to this improvement, we can now retrieve complex etymology trees by splitting a large query into multiple small queries and merging the responses.
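The split-and-merge idea can be sketched without RxJS as follows. This is plain JavaScript written for illustration, not the etytree code: the helper names (chunk, buildBatchQuery, mergeBindings), the batch size and the use of a SPARQL VALUES clause to group IRIs are all assumptions. The only details taken from this report are the dbetym prefix, the etymologicallyRelatedTo property and the results.bindings layout of the JSON responses.

```javascript
const PREFIX =
  "PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org/dbnaryetymology#> ";

// Splits a list into batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Builds one small query for a batch of IRIs (hypothetical query shape).
function buildBatchQuery(iris) {
  const values = iris.map((iri) => "<" + iri + ">").join(" ");
  return (
    PREFIX +
    "SELECT DISTINCT ?word ?related { " +
    "VALUES ?word { " + values + " } " +
    "?word dbetym:etymologicallyRelatedTo ?related . }"
  );
}

// Merges the bindings of several JSON response bodies into one array.
function mergeBindings(responseBodies) {
  return responseBodies.reduce(
    (all, body) => all.concat(JSON.parse(body).results.bindings),
    []
  );
}

// Example with made-up IRIs and made-up responses.
const iris = ["http://example.org/a", "http://example.org/b", "http://example.org/c"];
const queries = chunk(iris, 2).map(buildBatchQuery);
const merged = mergeBindings([
  '{"results":{"bindings":[{"related":{"value":"x"}}]}}',
  '{"results":{"bindings":[{"related":{"value":"y"}},{"related":{"value":"z"}}]}}',
]);
console.log(queries.length, merged.length);
```

In a real client each query would be sent to the endpoint and the response bodies collected before merging; a library such as RxJS can coordinate those asynchronous requests.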

A third improvement involves filtering of the output graph: many nodes have been filtered out so that the user is left with a less cluttered graph. For example, in the case of compound words, only the first ancestors of the compound word are shown, and not ancestors of ancestors. This has been achieved as follows.

We replaced the query

    // Builds the SPARQL query that retrieves the ancestors of the word identified by iri.
    var ancestorQuery = function(iri, queryDepth) {
        var query = "PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org/dbnaryetymology#> ";
        if (queryDepth === 1) {
            // Ancestors up to five links away, plus ancestors of their equivalents.
            query +=
                "SELECT DISTINCT ?ancestor1 ?ancestor2 " +
                "{ " +
                "   <" + iri + "> dbetym:etymologicallyRelatedTo{0,5} ?ancestor1 . " +
                "   OPTIONAL { " +
                "       ?eq dbetym:etymologicallyEquivalentTo ?ancestor1 . " +
                "       ?eq dbetym:etymologicallyRelatedTo* ?ancestor2 . " +
                "   } " +
                "} ";
        } else if (queryDepth === 2) {
            // Ancestors up to five links away only.
            query +=
                "SELECT DISTINCT ?ancestor1 " +
                "{ " +
                "   <" + iri + "> dbetym:etymologicallyRelatedTo{0,5} ?ancestor1 . " +
                "} ";
        }
        return query;
    };

with a query that returns an ordered set of ancestors, whose response is parsed as follows:

    // Turns the JSON responses into a list of unique ancestor IRIs.
    var parseAncestors = function(response) {
        var ancestorArray = response
            // Concatenate the bindings of all responses.
            .reduce((all, a) => {
                return all.concat(JSON.parse(a).results.bindings);
            }, [])
            // For each binding, follow the chain ancestor1 -> ancestor5. The next
            // ancestor is added only while the der flag is "0", the next ancestor
            // exists and the current lemma does not start or end with a dash.
            .reduce((ancestors, a) => {
                ancestors.push(a.ancestor1.value);
                if (a.der1.value === "0" &&
                    undefined !== a.ancestor2 &&
                    lemmaNotStartsOrEndsWithDash(a.ancestor1.value)) {
                    ancestors.push(a.ancestor2.value);
                    if (a.der2.value === "0" &&
                        undefined !== a.ancestor3 &&
                        lemmaNotStartsOrEndsWithDash(a.ancestor2.value)) {
                        ancestors.push(a.ancestor3.value);
                        if (a.der3.value === "0" &&
                            undefined !== a.ancestor4 &&
                            lemmaNotStartsOrEndsWithDash(a.ancestor3.value)) {
                            ancestors.push(a.ancestor4.value);
                            if (a.der4.value === "0" &&
                                undefined !== a.ancestor5 &&
                                lemmaNotStartsOrEndsWithDash(a.ancestor4.value)) {
                                ancestors.push(a.ancestor5.value);
                            }
                        }
                    }
                }
                return ancestors;
            }, [])
            // Remove duplicates.
            .filter(etyBase.helpers.onlyUnique);
        return ancestorArray;
    };

With this new query ancestors are ordered (ancestor1 is the direct ancestor of the searched word, ancestor2 is the direct ancestor of ancestor1, and so on), so they can be filtered by their order. For example, if the searched word is a compound (say "doorbell"), we only want to visualize its first ancestors ("door" and "bell"), not the ancestors of "door" and of "bell", which would just clutter the visualization. Because the query returns an ordered set of ancestors, we can stop at the desired depth (at ancestor1, at ancestor2, and so on) depending on the word of interest.
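Stopping at a chosen depth can be sketched as a loop over the ordered ancestorN fields. The JavaScript below is an illustrative rewrite of the per-binding checks in parseAncestors with an added maxDepth parameter, not code from etytree. The function collectAncestors and its maxDepth parameter are assumptions, and the stand-in for lemmaNotStartsOrEndsWithDash assumes that the lemma is whatever follows the last "/" of the IRI, which may not match the real helper.

```javascript
// Hypothetical stand-in for the helper used in etytree: assumes the lemma is the
// part of the IRI after the last "/".
function lemmaNotStartsOrEndsWithDash(iri) {
  const lemma = iri.split("/").pop();
  return !lemma.startsWith("-") && !lemma.endsWith("-");
}

// Collects ancestor1..ancestorN from one binding, stopping at maxDepth or as soon
// as one of the checks used in parseAncestors fails.
function collectAncestors(binding, maxDepth) {
  const ancestors = [binding.ancestor1.value];
  for (let i = 1; i < maxDepth; i++) {
    const der = binding["der" + i];
    const next = binding["ancestor" + (i + 1)];
    const current = binding["ancestor" + i].value;
    if (!der || der.value !== "0" || next === undefined ||
        !lemmaNotStartsOrEndsWithDash(current)) {
      break;
    }
    ancestors.push(next.value);
  }
  return ancestors;
}

// Made-up binding: three ordered ancestors, all with der flags equal to "0".
const binding = {
  ancestor1: { value: "http://example.org/eng/door" },
  der1: { value: "0" },
  ancestor2: { value: "http://example.org/enm/dore" },
  der2: { value: "0" },
  ancestor3: { value: "http://example.org/ang/duru" },
};

console.log(collectAncestors(binding, 1)); // first ancestor only
console.log(collectAncestors(binding, 5)); // the whole available chain
```

With maxDepth set to 1 only the direct ancestor is kept, which corresponds to the "doorbell" case described above; larger values follow the chain further.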

Further improvements include using Node.js modules, highlighting ancestor nodes in a different color, and visualizing special labels correctly (e.g. wiktionary:Reconstruction:Proto-Indo-European/dʰwer- *dʷwer- was visualized as *dwwer- beforehand).

Data extraction improvement

While the improvements described above focus on the visualization, the improvements we describe here involve the database, which is extracted from the latest English Wiktionary XML dump.

We fixed a few bugs that occurred when parsing templates or regular expressions of compound words. These bugs caused a large number of nodes in the database to have many connections. When an etymology graph included one of those nodes, the SPARQL endpoint could not return the graph and would fail with a timeout error.

In particular we improved parsing of the blend, compound, etycomp, infix, affix, prefix, confix, suffix, circumfix, -or, -er templates and of regular expressions for compound words using "+", "and". This improvement considerably reduced the interconnectedness of nodes in the database.

We are also now extracting Reconstructed words, thanks to an improvement in Dbnary. In addition, the faceted browser is up and running, so the database can be easily explored (for example, explore words etymologically related to pistachio).

Dissemination

Polyglot Gathering 2017

Last June I presented my work at an international meeting, the Polyglot Gathering, which was held in Bratislava from the 31st of May to the 4th of June. Five hundred people speaking more than 125 languages attended the meeting. I gave a 45-minute talk (see the program) titled "Etytree: a graphical multilingual etymology dictionary using data extracted from the English Wiktionary". At the meeting I received a lot of interesting feedback and started a collaboration with one of the participants, who is now actively contributing to the development of the software. I also collected email addresses from interested participants, whom I will contact during the final testing phase of etytree.

Wikimedia France

I spent almost a week working from the Wikimedia France office in Paris, where I had the opportunity to interact with the local Wikimedia community. I also had a chance to present my work to some Wiktionarians and Wikidata experts (see tweet).

Wikimania 2017

At Wikimania I met a number of Wikidata developers and I managed to arrange a visit to the Wikidata offices in Berlin, which would be very helpful for the possible integration of this project into Wikidata. After talking with some of the people in the Wikidata team I got two main suggestions: first that I try vis.js as a library to visualize graphs and second that I use "POST" instead of "GET" when a request (the query string) to the server is too long.
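The second suggestion can be sketched as follows: send the query with GET while the resulting URL is short, and switch to a URL-encoded POST otherwise. This is a minimal illustration, not the etytree implementation. The function name buildSparqlRequest, the 2000-character threshold and the example endpoint URL are assumptions; the query parameter name and the form-encoded POST body follow the SPARQL 1.1 Protocol.

```javascript
// Assumed threshold; the real limit depends on the server and any proxies in between.
const MAX_URL_LENGTH = 2000;

// Describes the HTTP request to send for a SPARQL query: GET for short queries,
// URL-encoded POST when the GET URL would be too long.
function buildSparqlRequest(endpoint, query) {
  const encoded = new URLSearchParams({ query: query }).toString();
  const getUrl = endpoint + "?" + encoded;
  const headers = { Accept: "application/sparql-results+json" };

  if (getUrl.length <= MAX_URL_LENGTH) {
    return { method: "GET", url: getUrl, headers: headers };
  }
  return {
    method: "POST",
    url: endpoint,
    headers: Object.assign(
      { "Content-Type": "application/x-www-form-urlencoded" },
      headers
    ),
    body: encoded,
  };
}

// Hypothetical endpoint URL, used only for the example.
const endpoint = "http://example.org/sparql";
const shortRequest = buildSparqlRequest(endpoint, "SELECT * { ?s ?p ?o } LIMIT 1");
const longRequest = buildSparqlRequest(
  endpoint,
  "SELECT * { VALUES ?s { " + "<http://example.org/word> ".repeat(200) + "} ?s ?p ?o }"
);
console.log(shortRequest.method, longRequest.method);
```

The returned object can be passed to whichever HTTP client the application uses; the sketch only decides the method and where the query string goes.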

The submission is available at Etytree: A Graphical and Interactive Etymology Dictionary Based on Wiktionary. The slides of the 5-minute talk are shown in the thumbnail.

Midpoint Outcomes II

The main outcomes of these four months of work are:

The implementation of a working visualization that exploits directionality of the graph and that reduces clutter: see the etytree tool.
The availability of a working faceted browser to explore the database directly: explore words etymologically related to the word pistachio.
Dissemination in 4 different countries so far.

Finances II

Next steps and opportunities II

With respect to the initial goals set by the renewal proposal I am left with three main changes that I need to perform: process diacritical marks, add transliteration, improve software documentation; and one change that I have partially implemented: tailor parsing of etymology sections based on the specific language.

Among the three tasks just mentioned, we believe the most important one is adding documentation, while transliteration is probably not as important. Among the optional improvements, we have already accomplished one: we improved the Virtuoso faceted browser, which is now working correctly.

This month we are planning to visit the Wikidata office in Berlin for 10 days, after presenting etytree at the WikidataCon. With the Wikidata group we would like to explore the feasibility of exporting the database into Wikidata, the inclusion of the visualization in Wiktionary pages through Wikidata, opportunities for further dissemination, and ways to attract more users to the tool page.

In the next months we would like to present at the Italian WikCon 2017 and in Switzerland, England and Spain, and we would like to work more on the dissemination of the project online or through the media.