TJones (WMF)

Welcome to Meta!

Hello, TJones (WMF). Welcome to the Wikimedia Meta-Wiki! This website is for coordinating and discussing all Wikimedia projects. You may find it useful to read our policy page. If you are interested in doing translations, visit Meta:Babylon. You can also leave a note on Meta:Babel or Wikimedia Forum if you need help with something (please read the instructions at the top of the page before posting there). Happy editing!

-- Meta-Wiki Welcome (talk) 20:15, 6 July 2015 (UTC)

Fallback language

Hi Trey, you just asked whether the search tools on li: needed to fall back to Dutch. Well, I opened the discussion at the local Village Pump, but honestly, I don't know. I would like to know in more detail what these tools do. Dutch and Limburgish are in fact fairly similar, and some of the text processing used for Dutch is probably also applicable to Limburgish. If the tool is built to ignore articles, for instance, it would hardly be an improvement if the fallback function were removed. In fact, it would make sense to amend the existing tool for Limburgish, to help it cope with dialectal differences and common spelling errors. (As you must know, Limburgish has no commonly accepted standard and it is mainly a spoken language.) Steinbach (formerly Caesarion) 15:23, 4 October 2017 (UTC)

@Steinbach: Thanks for the quick reply. The Dutch analyzer removes stop words: usually articles, prepositions, conjunctions and other small words that generally don't carry meaning, and also certain other words, like forms of be, do, make, etc. The exact list and scope vary by language. On our wikis, the stop words are still present, as we index everything twice, once with analysis and once without, so that we can favor exact matches, and so that we can match queries made up entirely of stop words, like "Take That", "The The" or "to be or not to be". The analyzer also does stemming, which tries to reduce different forms of words to a common base form. In English, this would change hope, hoped, hopes, hoping all to hope, so searching for one would find the others.
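As a rough illustration of the paragraph above, here is a minimal Python sketch of stop word removal, stemming, and indexing the same text twice. It is not the actual Dutch or English analyzer: the stop word list, the `toy_stem` rules, and the field names `plain` and `text` are all made up for this example, and the toy stemmer reduces the hope family to "hop" rather than "hope".

```python
import re

# Hypothetical, tiny stop word list; real lists are longer and language-specific.
STOP_WORDS = {"the", "to", "be", "or", "not", "that", "a", "an", "of"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def toy_stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        # Only strip if at least three letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Drop stop words, then stem whatever is left."""
    return [toy_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def index_document(text):
    """Index the text twice: once without analysis, once with it."""
    return {"plain": tokenize(text), "text": analyze(text)}

doc = index_document("To be or not to be")
print(doc["text"])   # analyzed field: every token was a stop word
print(doc["plain"])  # unanalyzed field still holds all six words
print({toy_stem(w) for w in ["hope", "hoped", "hopes", "hoping"]})
```

Because the unanalyzed field keeps every token, a query made up entirely of stop words can still match there even though the analyzed field is empty.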

There are several potential problems, even if there are also some benefits. Coverage can be incomplete: some articles are spelled the same in both languages and are treated as stop words, while others are spelled differently and are not, so they are handled inconsistently. There can be words that look like stop words to the analyzer but are not. If there are spelling differences for inflections on words, some may be stemmed properly and others not. Parts of words that look like inflections but are not can be stripped by the stemmer, linking words in the index that are not actually related. The analyzer knows certain exceptions (irregular spellings or inflections), but if the words are spelled differently in another language, they will not be treated properly.

Another problem we've run into is that changes that do the right thing in one language are wrong for another. I originally discovered the entire fallback situation when adjusting the analyzer for Russian to treat two letters as identical; this is good for Russian, but not for any other language using the Cyrillic alphabet, like Ukrainian. The added complexity makes mistakes like this more likely to happen and harder to catch. Breaking changes can also sit for months without any effect, because not only does the code have to be changed, the wiki also has to be re-indexed for the changes to take effect. This can make tracking down problems much harder.
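A minimal Python sketch of how a fallback chain can leak a language-specific change into another language. Everything here is an assumption made for illustration: the `FALLBACK` and `ANALYZER_CONFIG` tables are hypothetical, and the letter pair ё/е is only a placeholder, since the comment above does not say which two letters were involved.

```python
# Hypothetical fallback table: a language without its own analyzer
# borrows the configuration of the language it falls back to.
FALLBACK = {"uk": "ru", "li": "nl"}

# Hypothetical per-language settings. The ё -> е mapping is a placeholder
# for "treat two letters as identical"; it is not taken from the discussion.
ANALYZER_CONFIG = {
    "ru": {"char_map": {"ё": "е"}},
    "nl": {"char_map": {}},
}
DEFAULT_CONFIG = {"char_map": {}}

def resolve_config(lang, use_fallback=True):
    """Return the config for lang, optionally walking the fallback chain."""
    seen = set()
    while lang is not None and lang not in seen:
        if lang in ANALYZER_CONFIG:
            return ANALYZER_CONFIG[lang]
        if not use_fallback:
            break
        seen.add(lang)  # guard against cycles in the fallback table
        lang = FALLBACK.get(lang)
    return DEFAULT_CONFIG

def normalize(text, lang, use_fallback=True):
    """Apply the resolved character mapping to the text."""
    char_map = resolve_config(lang, use_fallback)["char_map"]
    return text.translate(str.maketrans(char_map))

sample = "ёе"
print(normalize(sample, "ru"))                      # mapping applied, as intended
print(normalize(sample, "uk"))                      # mapping inherited via fallback
print(normalize(sample, "uk", use_fallback=False))  # text left unchanged
```

With fallback enabled, the second language silently inherits a mapping that was only ever meant for the first, which is the kind of mistake described above.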

Unfortunately, amending the Dutch analyzer for Limburgish is out of the scope of the current project. Some elements of some analyzers are available for tweaking, but significant changes require forking the code and making a new analyzer. That's certainly possible from a technical standpoint because the analyzer is open-source, but it's not something I have the time or expertise to work on for all the analyzers affected by this fallback situation. If someone else did the work, though, I would love to test a new analyzer and work on getting it deployed for Limburgish-language projects.

Do you think we should copy any of this conversation over to the Village Pump? I appreciate you reaching out to me here and commenting there. Any assistance you can give explaining the situation to other Limburgish speakers and relaying their concerns to me is very welcome! TJones (WMF) (talk) 19:37, 4 October 2017 (UTC)

Thanks for your reply. As you may understand, my first thought is: go ahead and remove the fallback function. I will take a look at the analyser code later. I would be more than happy to tell you what a tool for analysing Limburgish text would need, and I might consider adapting the software myself. While my long-time work with Wikipedia has made me more tech-savvy than I was in my early twenties, I'm still not an expert, let alone a professional. Do you think a non-expert would be able to write such software? Steinbach (formerly Caesarion) 20:11, 4 October 2017 (UTC) (Even if I can, this will take a lot of my time, so for the time being your proposal is fine.)

If you are reasonably comfortable with Java programming, I think it would be plausible. If you can work out what the Dutch analyzer is doing to process Dutch text, then do the same thing, just with the proper Limburgish twist, it would be doable. I haven't looked at the Dutch analyzer closely, and there are several common methods for language analysis, so I'm not sure what's going on. But two common approaches are rules and dictionaries. In English, a rule might say "remove -ing from the end of the word", and for a closely related language you might just need to change that to "remove -in' from the end of the word". Though it's often more complicated: in English, after removing the -ing, you have to decide whether to add a final "e" (hoping -> hop -> hope) or deduplicate a final letter (hopping -> hopp -> hop). Rules can also have exceptions (made -> make; came -> come; could -> can; went -> go). Dictionaries treat everything as an exception and list them all out. Translating a dictionary approach would be a lot of work, but would probably work well. With rules, you have to make sure you catch everything, and of course Limburgish will almost certainly have some exceptions that are different from those in Dutch. An added complication might come from it being a primarily spoken language, since there could be more variation, especially in spelling, but also variations in grammar in varieties spoken in different places. Writing tends to help speakers converge on a standard. All that said, it would be an interesting project if you are familiar with Dutch and Limburgish, and if it turned into too much work, no harm in having tried! TJones (WMF) (talk) 19:56, 5 October 2017 (UTC) (I forgot to sign my comment earlier!)
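The rule-plus-exceptions approach described in the comment above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `KNOWN_STEMS` lexicon used to choose between candidate stems is a device invented for this sketch, and real stemmers decide such cases in other ways.

```python
# Irregular forms are looked up before any rule is tried.
EXCEPTIONS = {"made": "make", "came": "come", "could": "can"}

# Hypothetical mini-lexicon of base forms, used only to pick a candidate stem.
KNOWN_STEMS = {"hope", "hop", "walk"}

def stem(word, suffix="ing"):
    """Strip one suffix, then repair the stem (toy rule-based stemmer)."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if not word.endswith(suffix):
        return word
    base = word[: -len(suffix)]
    # Deduplicate a doubled final letter: hopping -> hopp -> hop
    if len(base) >= 2 and base[-1] == base[-2] and base[:-1] in KNOWN_STEMS:
        return base[:-1]
    # Restore a dropped final "e": hoping -> hop -> hope
    if base + "e" in KNOWN_STEMS:
        return base + "e"
    return base

print(stem("hoping"), stem("hopping"), stem("walking"), stem("made"))
# Adapting the rule for a closely related variety can be a one-line change:
print(stem("hopin'", suffix="in'"))
```

The exception table is what a pure dictionary approach would grow into: every form listed explicitly, with no rules at all.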

I will permalink this discussion at the Limburgish Pub. --OosWesThoesBes (talk) 13:43, 5 October 2017 (UTC)

These changes were completed the week of October 9th and deployed the following week. The re-indexing of the affected wikis was completed a few hours ago, and the changes should be live everywhere now. The list of affected languages is on the Phab ticket T177871, and a list by wiki is in a comment on that ticket. For more details, see the write-up on MediaWiki. TJones (WMF) (talk) 14:56, 24 October 2017 (UTC)

Search and language support

It's always good to read reminders that English is special and other languages need more care, as you posted in mw:Wikimedia_Developer_Summit/2018/Participants! Thanks. Do you think WMF should do more about improving analyzers upstream, a bit like Niklas suggests about Apertium and Michael Holloway says about free software in general? (A few others also made comments about reuse by third parties, upstream collaboration and long-term reuse.) --Nemo 15:12, 3 December 2017 (UTC)

@Nemo: I do think we can do more with analyzers, which is why one of my goals this quarter, and probably again next quarter, is to find morphological analysis software packages that can be converted into useful analyzers. Of course, there are always questions of overall prioritization which may change the relative importance of any task, so this may go on for a long time, or finish next quarter. For language analysis, the big difficulty I see is finding morphological analysis software that strikes a balance between usefulness of features and the maintenance burden it creates. I'm still in the middle of my first foray into this way of doing things with a Serbian stemmer. It has many more positives than negatives: it seems to give good results, it's already in Java, the developer has been responsive and helpful, and the bugs I found are critical but easily fixed. Getting it into working order and then wrapping it into an Elasticsearch analyzer seems relatively straightforward. Whether we can adequately maintain one, two, or a dozen such projects is still unclear to me, but I hope so. TJones (WMF) (talk) 15:03, 4 December 2017 (UTC)

Serbian is a good starter. I'm happy we'll find out about viability! It's easier to ask help from other orgs then. Nemo 21:55, 4 December 2017 (UTC)

Working on Javanese Wikipedia transliterator

Hi Trey, I'm Benny from Indonesian Wikipedia. I've just read your blog and I'd like to keep the request phab:T47779 going. Can you give me pointers on how to start moving this forward? ✒ Bennylin 20:22, 14 June 2018 (UTC)

Hi Benny! Let's continue this discussion over on Phabricator. TJones (WMF) (talk) 21:48, 14 June 2018 (UTC)

The Community Wishlist Survey

Hi,

You get this message because you’ve previously participated in the Community Wishlist Survey. I just wanted to let you know that this year’s survey is now open for proposals. You can suggest technical changes until 11 November: Community Wishlist Survey 2019.

You can vote from November 16 to November 30. To keep the number of messages at a reasonable level, I won’t send out a separate reminder to you about that. /Johan (WMF) 11:25, 30 October 2018 (UTC)
