Lingua Libre/2022 Review
Status: stalled draft (irrelevant as long as the tech side is not fixed). The page below is mainly a strategic review of existing assets (human resources, knowledge base, data), competition, comparative advantages and SWOT.
On LinguaLibre, we are starting to design a broad, 6-month PR campaign. I wonder whether such an overall PR campaign, based on mailings to popular language and tech blogs/newsrooms, the creation of base PR materials followed by translations, and the writing of ~8 needed downstream grant requests, could itself be funded by a grant of some 50k USD?
- Current state
- We have a lot to do: there are thousands of languages, most of them in decline. We do not want to stay solely with major languages; we have to reach out to very small languages early, so that marginalized languages and their conservation become part of our DNA and brand.
Current state
Aspects and associated human resources
- Estimates as of spring 2022.
Assets | Human resources | Reactive/regular user on this front | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Type | Aspects or Lingualibre:Roles | Final score | Importance | People | Hrs/mths | Bus factor | A | Y | P | P | W | L |
Technology | Core technology: Recording Studio. Backlog of: | 5/10–fragile | Critical | n.a. | 10h - tests | n.a | No | Yes | Yes | Yes | No | No |
Technology | Lists management system | 7/10–correct | Medium | n.a. | 10h | n.a | No | Yes | No | Yes | Yes | No |
Documented code | GitHub repositories documentation | 9/10–good | High | 3 | 1h | Pass | No | Lead | Yes | Yes | Yes | No |
Tickets | Phabricator tickets | 8/10–good | Medium | 2 | 3h | Pass | No | Yes | Lead | Lead | Yes | No |
Human Resources | Developers back — MediaWiki, server | 6/10–stable | Critical | 1+ | 6h | Fragile | No | No | Yes | No | No | No |
Human Resources | Developers bots | 6/10–stable | Critical | 1+ | 6h | Fragile | No | No | Yes | Lead | No | No |
Human Resources | Developers front — CSS, HTML, JS | 2/10–insufficient | High | 2 | 3h | Fragile | No | Yes | Yes | No | Yes | No |
Wikipages | Editors on existing Help:* pages | 7/10–correct | High | n.a. | 10h | n.a | No | Lead | No | No | Yes | No |
Human Resources | Editors on other project pages (not discussions) | 8/10–correct | Medium | 3 | 40h | Pass | No | Yes | No | No | Yes | No |
Human Resources | Editors on welcoming, onboarding | 5/10–basic | Medium | 3 | 5h | Fragile | No | No | No | Yes | Yes | No |
Human Resources | Editors/ on strategy, grants requests writing | 4/10–basic | High | 2 | 10h | Fragile | Yes | Lead | Yes | Yes | Yes | No |
Human Resources | Editors on communication, outreach, PR | 8/10–good | High | 1 (Mel) | 28h | Good | Yes | Lead | No | No | Yes | No |
Institutional resources | Support staff: Wikimedia France, WIR. | 8/10–correct | High | 2 | ? | Fragile | Lead | Lead | No | No | No | No |
Funding flow | Looking for development funds. | 4/10–basic | High | n.a. | n.a. | Fragile | Lead | Yes | No | No | No | 2016 |
TOTAL | ~5 | Fragile | ||||||||||
The status score integrates importance, existing resources, fragility, and opportunity cost. The critical server side is well managed by volunteers and staff, but the bus factor stays too weak, resulting in a modest 6/10 score.
Risk/opportunity: We must use the current calm to double the community and ensure the long-term sustainability of human resources and know-how.
Distribution of recordings per languages
- See also Languages gallery.
LinguaLibre linguistic coverage is expanding, first with major Western languages, then toward other large languages and minority languages in Western countries, and last toward non-Western marginalized communities. This is a starting point.
World languages | Lili languages | ||||||
---|---|---|---|---|---|---|---|
Demographic[1] | Number | Ratio | Number | Coverage | Supported language's profile | Examples | Community's presence |
Major (>30M) | 30 | 0.5% | 20 | 66% | Mostly major Western or Indian languages. | FRA, SPA, BEN | Solid: Several productive speakers. Sustained or periodic. |
Large (1~30M) | 350 | 5% | 150 | 40% | Mostly Western languages, other notable languages | NLD, AFR, CAT | Emerging: One productive speaker, few not-retained speakers. Fragile. |
Marginalized (<1M) | 6500 | 94% | 80 | 1% | Mostly larger minorities in Western countries. | ATJ, BRE, EUS | Contact point: No productive speaker, one not-retained speaker. Below fragile. |
Risk/opportunity: We must raise awareness of Lingualibre, then seed, demo and train it among truly marginalized communities across the 7 Wikimedia regions, so that know-how moves away from Western tech enthusiasts and into local, linguistically at-risk communities.
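The coverage column in the table above is the number of languages present on Lingua Libre divided by the number of world languages in the same demographic tier. A minimal Python sketch of that arithmetic, using the rounded estimates from the table (the function and variable names are illustrative assumptions, not part of any Lingua Libre tool):

```python
# Rounded estimates copied from the table above:
# tier -> (world languages, languages present on Lingua Libre).
TIERS = {
    "Major (>30M)": (30, 20),
    "Large (1~30M)": (350, 150),
    "Marginalized (<1M)": (6500, 80),
}


def coverage(world: int, supported: int) -> float:
    """Share of a tier's languages present on Lingua Libre, in percent."""
    return 100 * supported / world


def tier_report(tiers: dict) -> dict:
    """Map each tier to (share of world languages, coverage), both in percent."""
    total_world = sum(world for world, _ in tiers.values())
    return {
        name: (100 * world / total_world, coverage(world, supported))
        for name, (world, supported) in tiers.items()
    }


if __name__ == "__main__":
    for name, (ratio, cov) in tier_report(TIERS).items():
        print(f"{name}: {ratio:.1f}% of world languages, {cov:.1f}% covered")
```

Because the inputs are rounded estimates, the results only approximate the percentages in the table; the order of magnitude per tier is what matters here.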
Distribution of recordings publications per beneficiaries
After 6~8 years, beneficiary users are by and large Western wiktionarists. French users account for 54.8% of Wiktionary recording publications, through the French and Occitan wiktionaries.
Local wiki | Bot name | First edit | Edit count | % | Groups | Region of most beneficiaries |
---|---|---|---|---|---|---|
Wiktionaries | ||||||
fr.wiktionary.org | Lingua Libre Bot | 12 June 2018 | 477,000+ | 51.9% | bot | Europe/France |
pl.wiktionary.org | Olafbot (g) | 4 March 2020 | 240,000+ | 26.1% | bot | Europe/Poland |
ku.wiktionary.org | Lingua Libre Bot | 30 November 2021 | 64,000+ | 7.0% | bot | Asia |
or.wiktionary.org | Lingua Libre Bot | 10 January 2023 | 33,000+ | 3.5% | ― | Asia/India |
oc.wiktionary.org | Lingua Libre Bot | 16 December 2018 | 25,000+ | 2.7% | bot | Europe/France |
shy.wiktionary.org | Lingua Libre Bot | 8 September 2021 | 2,500 | 0.3% | bot | Africa |
All other projects | — | 0 | 0.0% | ― | World | |
Wikidata | ||||||
www.wikidata.org | Lingua Libre Bot | 10 June 2018 | 78,000+ | 8.5% | bot | Unclear |
Technical projects | ||||||
lingualibre.org | Olafbot | 26 February 2021 | 5,208 | bot | Unclear | |
meta.wikimedia.org | 10 June 2018 | 6 | ― | |||
References: Lingua Libre Bot (g), Olafbot (g).
See also :
- quarry:query/72976 – pl:wikt : Olafbot Lili edits, 2022.
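The % column in the table above appears to be each project's edit count divided by the sum over the wiktionaries and Wikidata, leaving out the technical projects. That reading is an assumption inferred from the figures, not something the source states. A minimal Python sketch under that assumption (names are illustrative):

```python
# Edit counts copied from the table above. Most are given as "N+",
# so these are lower bounds and the shares are approximate.
EDITS = {
    "fr.wiktionary.org": 477_000,
    "pl.wiktionary.org": 240_000,
    "ku.wiktionary.org": 64_000,
    "or.wiktionary.org": 33_000,
    "oc.wiktionary.org": 25_000,
    "shy.wiktionary.org": 2_500,
    "www.wikidata.org": 78_000,
}


def shares(edits: dict) -> dict:
    """Return each project's share of all listed bot edits, in percent."""
    total = sum(edits.values())
    return {wiki: 100 * count / total for wiki, count in edits.items()}


if __name__ == "__main__":
    # Print projects from largest to smallest share.
    for wiki, pct in sorted(shares(EDITS).items(), key=lambda kv: -kv[1]):
        print(f"{wiki:20} {pct:5.1f}%")
```

Keeping the counts in one mapping makes it easy to rerun the shares when the bots' edit counts are refreshed.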
Wiktionaries overall pageviews
Wiktionary, which is the main point of reuse, is only properly documented and visited in a few privileged languages.
Number of files uses
Number of times listened
- Lingualibre : ~1,000,000,000+
Other data
Baglama2 has rich data which would shed new light on usage.
PR outreach
On the PR side, we observe the following fronts and opportunities:
Type | Description | M.L. | Workload | Budget | Comment |
---|---|---|---|---|---|
Communication | Overall outreach campaign seeding the idea of LinguaLibre into a diverse, geographically dispersed community. Brings a new influx of advocates, speakers, event organizers, devs. Ensures the bus factor is no longer a risk, which secures the project's sustainability. | n.a. | 7 months | 26k USD | |
Sub-projects | |||||
Outreach/External | Email campaign to dozens of language blogs and newsrooms. Email exchanges, coordination, interviews, copyediting with their authors. | n.a. | 4 months | 16k USD | |
Outreach/Toolkit | Improve base PR materials: base emails; base presentations; base flyers. | n.a. | 1 month | 4k USD | |
Outreach/Medias | Coordinating and hiring to create testimony video material (example) | n.a. | 1/2 month | 2k USD – coordinator; 2k USD – filming day | Could existing short online videos on language diversity be reused? |
Outreach/Crowdsourcing | Create a KissKissBankBank crowdfunding campaign on the model of WikiCheese. | n.a. | 1 month | 4k USD | The objective is equally PR, by increasing awareness, and raising some public money. Helps assess this avenue of autonomous future funding via crowdfunding. |
Outreach/Translations | Coordinating translation of refined base documents from EN to major languages. Languages: ES, FR, AR, RU, HI, ZH, Indonesian, Swahili (language spheres with the largest language diversity). | n.a. | 1/2 month | 0€ | Wikimedians lead translations. |
Grants to request
There are also opportunities in writing funding requests and initiating the following:
Type | Description | M.L. | Duration (est.) | Budget (est.) | Comment |
---|---|---|---|---|---|
Coordination | Overall strategy, coordination, planning and multiple fund requests. Get things on rails and rolling at the required speed. Strategies sub-project: technologies, proof of concepts, communication. | n.a. | 8 months | 32k USD | This position is leading the writing of the fund requests below. |
Technical improvements. | |||||
WMF Technical Fund 1 | LinguaLibre backlog features requests and improvements | n.a. | 3 months | 15k EUR | API: Sparql. Front: CSS, JS, Vuejs, MediaWiki, PHP. |
WMF Technical Fund 2 | Anki e-learning plugin generator | n.a. | 1 month | 5k EUR | API: Sparql. Front: CSS, HTML, JS. |
WMF Technical Fund 3 | Integrated e-learning webpage for words | n.a. | 1~2 months | 5~10k USD | API: Sparql. Front: HTML, CSS, JS, VueJS. |
WMF Technical Fund 4 | Unilex 1000 languages lexical database update | n.a. | 1~2 months | 5~10k USD | Python. |
WMF Technical Fund 5 | Dashboard for linguistic coverage. | n.a. | 1~2 months | 5~10k USD | API: Sparql. Front: HTML, CSS, JS, D3js. Will help redefine LinguaLibre. |
Total: | 35-50k USD | ||||
Field outreach : training local community, recording 5000 words. | |||||
WMFR Micro-fi 1 | Field outreach marginalized language(s) – Region W. Europe "France's Whistled Gascon (1)" | 1 | 4 days | 500 EUR | Volunteer in that region, implies minimal travel, hosting costs. |
WMF Community Fund 1 | Field outreach marginalized language(s) – Region Lat. America "Peru Amerindians languages (3)" | 3 | 2 months | 6~8k EUR | Partnership with Aquaverde. |
WMF Community Fund 2 | Field outreach marginalized language(s) – Region US/Canada "Canada Amerindians languages (3)" | 3 | 2 months | 6~8k EUR | Partnership with Wikimedia Canada and Atikamekw's wiki. |
WMF Community Fund 3 | Field outreach marginalized language(s) – Region Africa | 1+ | 2 months | 6~8k EUR | No contact at the moment. |
WMF Community Fund 3 | Field outreach marginalized language(s) – Region CEE / Russia. | 1+ | 2 months | 6~8k EUR | No contact at the moment. |
WMF Community Fund 3 | Field outreach marginalized language(s) – Region ESEAP: E./S.E. Asia, the Pacific region | 1+ | 2 months | 6~8k EUR | No contact at the moment. |
Others | |||||
WMF Community Fund 4 | 2022 Contribuling Conference | n.a. | 2 days | 4k EUR | Partnership with INALCO. |
WMF Alliances Fund 1 | Taiwan aboriginal languages (16) Wikimedian in residence | 16 | 9 months | 50k EUR | Partnership with |
TOTAL | 24 | 21-23 months | 95-120k | ||
M.L.: marginalized languages supported, creating a solid range of demonstrations of in-community usage of LinguaLibre. Field outreach: proof-of-concept and seeding expeditions, implying contact, informed consent by the minority, travel, possible linguistic research, training of locals, supervision of recording sessions, hosting fees.
Hires to onboard
The wave of newcomers should be welcomed and onboarded properly.
Type | Description | M.L. | Duration (est.) | Budget (est.) | Comment |
---|---|---|---|---|---|
Community engagement | Guiding / onboarding wave of newcomers per LinguaLibre:Roles: | n.a. | 8 months (part-time) |
- Campaign solidity
I do not believe the full effort drafted above can be achieved by volunteers with occasional involvement. We can expect such a team to achieve 1/4 to 1/3 of that plan (that is, to go 3-4 times slower). I wonder whether this overall coordination plan (two coordinators for 6 months, favored, or one for one year, acceptable, to initiate multiple projects via grants) could itself benefit from a Wikimedia Fund we previously discussed? Such ~50k€ central 2022 LinguaLibre Campaign coordinator(s) would make it more likely that most of this wishlist actually happens in 2022.
- @Yug: Thanks for sharing these ideas with me to support LinguaLibre. Because this is a complex set of related but separate proposals, my suggestion for next steps would be to have the applying individuals or organizations who are proposing these projects to contact the Regional Program officer over e-mail depending on where they are physically based. You can find more information about each funding region here, and a listing of our team is provided here. I JethroBT (WMF) (talk) 15:03, 27 November 2021 (UTC)
- As a follow-up anecdote, the volunteer-based LL PR campaign designed and led by Marreromarco, which my overall PR+tech funding requests plan aimed to secure and solidify, has just been called off. Marreromarco has assessed the software side to be too basic to satisfy a non-Wikimedian public, and therefore not worth his ambitious volunteer-powered PR campaign. In our case, tech and PR go together. I'm still drafting a proposal for Marti. Yug (talk) 11:48, 29 November 2021 (UTC)
Competition
Metrics
Project | Licence | Languages | Members | Recorded words | Recorded sentences | Written sentences | Comment |
---|---|---|---|---|---|---|---|
Tatoeba.org | Open | 410[2] | 56,406[3] | n.a. | 929,389[4] | 10,192,845[5] | UI: excellent and lively UI, to learn from |
commonvoice.mozilla.org | Open | 90+ | 200,000[6][7] | 30,000,000[6][8] | n.a. | n.a. | UI: site has clean and dynamic UI to learn from. |
LinguaLibre.org | Open | 250+[9] | 2200+[10] | 1,300,000+ | n.a. | n.a. | UI: « best opportunities for progress » |
Forvo.com | NC | UI: excellent and futurist UI, to learn from | |||||
(c) | 1000+ | ||||||
(c) | 1000+ | ||||||
Note: Common Voice has no aggregated count available, only per-language counts from which a sum can be made. One hour of recordings is estimated to be equivalent to 2000 words.
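The conversion in the note above can be written out explicitly. This is a minimal Python sketch, assuming the stated rate of 2000 words per hour and the 15,000 hours given as a raw personal estimate in the references; the function name is illustrative:

```python
WORDS_PER_HOUR = 2000  # conversion rate stated in the note above (an estimate)


def hours_to_words(hours: float, words_per_hour: int = WORDS_PER_HOUR) -> int:
    """Convert hours of recorded voice into an estimated word count."""
    return round(hours * words_per_hour)


if __name__ == "__main__":
    # 15,000 hours is the raw personal estimate cited in the references.
    print(f"{hours_to_words(15_000):,}")  # prints 30,000,000
```

Both inputs are rough estimates, so the resulting word count is only an order of magnitude for comparing projects.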
Competitive advantages
Each site has a different focus.
- Tatoeba focuses on written sentences and parallel sentences (translations) to feed learning applications.
- Common Voice focuses on audio sentences by various speakers, seeking very diverse audio, to feed Speech2Text and Text2Speech systems.
- LinguaLibre focuses on clean audio words to illustrate Wikimedia Wiktionaries (so far), but is by design convenient for vocabulary applications and dictionaries (requires a working datasets page).
- (Forvo - TBC)
SWOT
- SWOT analysis (strengths, weaknesses, opportunities, and threats) is a method for identifying and analyzing internal strengths and weaknesses and external opportunities and threats that shape current and future operations and help develop strategic goals.
SWOT for Lingualibre's UI
Strength | Weakness/Lags |
---|---|
|
|
Opportunities | Threats |
|
|
Other assessments
Communication pages
- Lingualibre:Mailing – empty stub needing purposeful writing
- Doing... →LinguaLibre:Events/2023 Editathon
Help pages
Category:Lingua Libre:Help pages: overall review, recategorization of identified orphan pages, basic improvements, and needs assessment (below) done yesterday. Help pages would benefit from some care.
Needs merge :
- LinguaLibre:Language codes systems used across LinguaLibre & Help:Langtags
- Help:Choosing a microphone & Help:Configure your microphone
- Help:Data structure into Help:Documentation opérationelle Mediawiki or a template ?
Needs split:
Needs better inclusion into LinguaLibre:Stats/Languages or links:
Needs expansions (Category:Drafts):
- Help:Ethics
- LinguaLibre:Hackathon
- LinguaLibre:Jargon
- LinguaLibre:Roles
- Help:SPARQL 2
- Template:User ratelimit
- LinguaLibre:Wikidata
- LinguaLibre:Workshops
Comment:
- orphan pages likely missed.
- other namespaces not assessed.
- maintenance ideas, improvements, templates could help
Gadgets scripts
LinguaLibre Gadgets are JS scripts enhancing the site by adding some features.
- MediaWiki:Gadget-Demo.js - a words list generator and demo
- MediaWiki:Gadget-ExternalTools.js - possibly the current word list generator (not confirmed).
- MediaWiki:Gadget-Normalizer.js
- MediaWiki:Gadget-Upload local file.js
Future
- mw:Wikimedia Apps/Reading list browser extension: to create a vocabulary e-learning web application
- WDQS editable: to allow logged-in users to edit items via the result tables.
See also
References
- ↑ "Summary by language size". Ethnologue. Archived from the original on 12 March 2019.
- ↑ https://tatoeba.org/en/sentences/index
- ↑ https://tatoeba.org/en/users/all
- ↑ https://tatoeba.org/en/audio/index
- ↑ https://tatoeba.org/en/stats/sentences_by_language
- ↑ a b https://commonvoice.mozilla.org/en/languages
- ↑ Rapid approximated sum of participants
- ↑ Extrapolated from : 15,000 hours of voices (raw personal estimate)[1]
- ↑ https://lingualibre.org/wiki/LinguaLibre:List_of_languages
- ↑ https://lingualibre.org/wiki/LinguaLibre:Speakers