Lingua Libre/2022 Review

Status: stale draft (irrelevant as long as the tech side is not fixed). The page below is mainly a strategic review of existing assets (human resources, knowledge base, data), competition, comparative advantages and SWOT.

On LinguaLibre, we are starting to design a broad, 6-month PR campaign. I wonder whether such an overall PR campaign, based on mailing popular language and tech blogs/newsrooms, creating base PR materials followed by translations, and writing the ~8 needed downstream grant requests, could itself be funded by a grant of some 50k USD?

Current state
We have a lot to do: there are thousands of languages, and most of them are in decline. We do not want to stay solely on major languages. We have to reach out to very small languages early, to imprint marginalized languages and their conservation into our DNA and brand.

Aspects and associated human resources

Estimates as of spring 2022.
Column groups: Assets | Human resources | Reactive/regular user on this front
Type | Aspects or Lingualibre:Roles | Final score | Importance | People | Hrs/mths | Bus factor | A | Y | P | P | W | L
Technology | Core technology: Recording Studio. Backlog of: workflow features; valorizing our assets features; UX-damaging bugs (ratelimit, online only, …) | 5/10–fragile | Critical | n.a. | 10h - tests | n.a | No | Yes | Yes | Yes | No | No
Technology | Lists management system | 7/10–correct | Medium | n.a. | 10h | n.a | No | Yes | No | Yes | Yes | No
Documented code | Github repositories documentations | 9/10–good | High | 3 | 1h | Pass | No | Lead | Yes | Yes | Yes | No
Tickets | Phabricator tickets | 8/10–good | Medium | 2 | 3h | Pass | No | Yes | Lead | Lead | Yes | No
Human Resources | Developers back: MediaWiki, server | 6/10–stable | Critical | 1+ | 6h | Fragile | No | No | Yes | No | No | No
Human Resources | Developers bots | 6/10–stable | Critical | 1+ | 6h | Fragile | No | No | Yes | Lead | No | No
Human Resources | Developers front: CSS, HTML, JS | 2/10–insufficient | High | 2 | 3h | Fragile | No | Yes | Yes | No | Yes | No
Wikipages | Editors on existing Help:* pages | 7/10–correct | High | n.a. | 10h | n.a | No | Lead | No | No | Yes | No
Human Resources | Editors on other project pages (not discussions) | 8/10–correct | Medium | 3 | 40h | Pass | No | Yes | No | No | Yes | No
Human Resources | Editors on welcoming, onboarding | 5/10–basic | Medium | 3 | 5h | Fragile | No | No | No | Yes | Yes | No
Human Resources | Editors on strategy, grants requests writing | 4/10–basic | High | 2 | 10h | Fragile | Yes | Lead | Yes | Yes | Yes | No
Human Resources | Editors on communication, outreach, PR | 8/10–good | High | 1 (Mel) | 28h | Good | Yes | Lead | No | No | Yes | No
Institutional resources | Support staff: Wikimedia France, WIR. | 8/10–correct | High | 2 | ? | Fragile | Lead | Lead | No | No | No | No
Funding flow | Looking for development funds. | 4/10–basic | High | n.a. | n.a. | Fragile | Lead | Yes | No | No | No | 2016
TOTAL | | ~5 | | | | Fragile
The status score integrates importance, existing resources, fragility, and opportunity cost. For example, the critical server side is well managed by volunteers and staff, but its bus factor stays too weak, resulting in a modest 6/10 score.

Risk/opportunity: We must use the current calm to double the community and ensure the long-term sustainability of human resources and know-how.

Distribution of recordings per languages

See also Languages gallery.

LinguaLibre linguistic coverage is expanding, first with major Western languages, then toward other large languages and minority languages in Western countries, and last toward non-Western marginalized communities. This is a starting point.

 
Current distribution of human languages into 9 root families.
Demographic[1] | World languages: number | World languages: ratio | Lili languages: number | Lili languages: coverage | Supported language's profile | Examples | Community's presence
Major (>30M) | 30 | 0.5% | 20 | 66% | Mostly major Western or Indian languages. | FRA, SPA, BEN | Solid: Several productive speakers. Sustained or periodic.
Large (1~30M) | 350 | 5% | 150 | 40% | Mostly Western languages, other notable languages | NLD, AFR, CAT | Emerging: One productive speaker, few not-retained speakers. Fragile.
Marginalized (<1M) | 6500 | 94% | 80 | 1% | Mostly larger minorities in Western countries. | ATJ, BRE, EUS | Contact point: No productive speaker, one not-retained speaker. Below fragile.
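
The ratio and coverage columns above can be recomputed from the raw counts. The following is a minimal sketch, assuming only the three rows of the table; the names `TIERS` and `coverage_report` are illustrative and not part of any LinguaLibre tooling. Recomputed values are close to, but not always identical with, the rounded figures shown in the table.

```python
# Rows copied from the table above:
# (demographic tier, number of world languages, number of languages on LinguaLibre).
TIERS = [
    ("Major (>30M)", 30, 20),
    ("Large (1~30M)", 350, 150),
    ("Marginalized (<1M)", 6500, 80),
]


def coverage_report(tiers):
    """Return (tier, share of world languages in %, LinguaLibre coverage in %)."""
    world_total = sum(world for _, world, _ in tiers)
    report = []
    for name, world, lili in tiers:
        share = 100 * world / world_total   # "Ratio" column
        coverage = 100 * lili / world       # "Coverage" column
        report.append((name, round(share, 1), round(coverage, 1)))
    return report


for name, share, coverage in coverage_report(TIERS):
    print(f"{name}: {share}% of world languages, {coverage}% covered")
```

The marginalized tier holds the overwhelming majority of languages while having by far the lowest coverage, which is the gap the risk/opportunity note addresses.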

Risk/opportunity: We must raise awareness of Lingualibre, then seed, demo and train it among truly marginalized communities in the 7 Wikimedia regions, so that know-how moves away from Western nerds and into truly local, linguistically at-risk communities.

Distribution of recordings publications per beneficiaries

After 6~8 years, beneficiary users are by and large Western wiktionarists. French users lead, with 54.8% of Wiktionary recording publications going to the French and Occitan Wiktionaries.

Updated August 20th, 2024
Local wiki | Bot name | First edit | Edit count | % | Groups | Region of most beneficiaries
Wiktionaries
fr.wiktionary.org | Lingua Libre Bot | 12 June 2018 | 477,000+ | 51.9% | bot | Europe/France
pl.wiktionary.org | Olafbot (g) | 4 March 2020 | 240,000+ | 26.1% | bot | Europe/Poland
ku.wiktionary.org | Lingua Libre Bot | 30 November 2021 | 64,000+ | 7.0% | bot | Asia
or.wiktionary.org | Lingua Libre Bot | 10 January 2023 | 33,000+ | 3.5% | | Asia/India
oc.wiktionary.org | Lingua Libre Bot | 16 December 2018 | 25,000+ | 2.7% | bot | Europe/France
shy.wiktionary.org | Lingua Libre Bot | 8 September 2021 | 2,500 | 0.3% | bot | Africa
All other projects | | | 0 | 0.0% | | World
Wikidata
www.wikidata.org | Lingua Libre Bot | 10 June 2018 | 78,000+ | 8.5% | bot | Unclear
Technical projects
lingualibre.org | Olafbot | 26 February 2021 | 5,208 | | bot | Unclear
meta.wikimedia.org | | 10 June 2018 | 6 | |
References : Lingua Libre Bot (g), Olafbot (g).
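
The percentage column can be approximated from the edit counts. This is a minimal sketch under two assumptions that are not stated in the source: the "+" suffixes are dropped, and the denominator is the sum of Wiktionary and Wikidata edits only (technical projects excluded). The names `EDITS` and `shares` are illustrative.

```python
# Approximate edit counts from the table above ("+" suffixes dropped).
# Assumption: shares are computed over Wiktionary + Wikidata edits only.
EDITS = {
    "fr.wiktionary.org": 477_000,
    "pl.wiktionary.org": 240_000,
    "ku.wiktionary.org": 64_000,
    "or.wiktionary.org": 33_000,
    "oc.wiktionary.org": 25_000,
    "shy.wiktionary.org": 2_500,
    "www.wikidata.org": 78_000,
}


def shares(edits):
    """Return each wiki's share of all listed edits, in percent (one decimal)."""
    total = sum(edits.values())
    return {wiki: round(100 * count / total, 1) for wiki, count in edits.items()}


for wiki, pct in shares(EDITS).items():
    print(f"{wiki}: {pct}%")
```

Because the counts are rounded lower bounds, a recomputed share may differ from the table by about 0.1 percentage point.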

See also :

Wiktionaries overall pageviews

Wiktionary, which is the main point of reuse, is properly documented and visited in only a few privileged languages.

Number of files uses

Number of times listened

Other data

Baglama2 has rich data which would shed new light on usage.

PR outreach

On the PR side, we observe the following fronts and opportunities:

Type | Description | M.L. | Workload | Budget | Comment
Communication | Overall outreach campaign seeding the idea of LinguaLibre into a diverse, geographically dispersed community. Brings a new influx of advocates, speakers, event organizers, devs. Ensures the bus factor is no longer a risk, which secures the project's sustainability. | n.a. | 7 months | 26k USD |
Sub-projects
Outreach/External | Email campaign to dozens of language blogs and newsrooms. Email exchanges, coordination, interviews, copyediting with their authors. | n.a. | 4 months | 16k USD |
Outreach/Toolkit | Improve base PR materials with: base emails; base presentations; base flyers. | n.a. | 1 month | 4k USD |
Outreach/Medias | Coordinating and hiring to create testimony video material (example) | n.a. | 1/2 month | 2k USD (coordinator) + 2k USD (filming day) | Could reuse existing short online videos on language diversity?
Outreach/Crowdfunding | Create a KissKissBankBank crowdfunding campaign on the model of WikiCheese. | n.a. | 1 month | 4k USD | The objective is equally PR, by increasing awareness, and raising some public money. Helps assess this avenue of autonomous future funding via crowdfunding.
Outreach/Translations | Coordinating translation of refined base documents from EN to major languages. Languages: ES, FR, AR, RU, HI, ZH, Indonesian, Swahili (language spheres with largest language diversity). | n.a. | 1/2 month | 0€ | Wikimedians lead translations.

Grants to request

There are also opportunities in writing funding requests and initiating the following:

Type | Description | M.L. | Duration (est.) | Budget (est.) | Comment
Coordination | Overall strategy, coordination, planning and multiple fund requests. Get things on rails and rolling at the required speed. Strategy sub-projects: technologies, proofs of concept, communication. | n.a. | 8 months | 32k USD | This position leads the writing of the fund requests below.
Technical improvements
WMF Technical Fund 1 | LinguaLibre backlog feature requests and improvements | n.a. | 3 months | 15k EUR | API: Sparql. Front: CSS, JS, Vuejs, MediaWiki, PHP.
WMF Technical Fund 2 | Anki e-learning plugin generator | n.a. | 1 month | 5k EUR | API: Sparql. Front: CSS, HTML, JS.
WMF Technical Fund 3 | Integrated e-learning webpage for words | n.a. | 1~2 months | 5~10k USD | API: Sparql. Front: HTML, CSS, JS, VueJS.
WMF Technical Fund 4 | Unilex 1000 languages lexical database update | n.a. | 1~2 months | 5~10k USD | Python.
WMF Technical Fund 5 | Dashboard for linguistic coverage: visualize LinguaLibre coverage vs linguistic world heritage; emphasize marginalized communities as our core partners (+90%); emphasize the need to reach out to and serve those smaller language communities; will help redefine LinguaLibre. | n.a. | 1~2 months | 5~10k USD | API: Sparql. Front: HTML, CSS, JS, D3js.
Total: 35-50k USD
Field outreach: training local community, recording 5000 words
WMFR Micro-fi 1 | Field outreach marginalized language(s), Region W. Europe, "France's Whistled Gascon (1)" | 1 | 4 days | 500 EUR | Volunteer in that region, implies minimal travel, hosting costs.
WMF Community Fund 1 | Field outreach marginalized language(s), Region Lat. America, "Peru Amerindians languages (3)" | 3 | 2 months | 6~8k EUR | Partnership with Aquaverde.
WMF Community Fund 2 | Field outreach marginalized language(s), Region US/Canada, "Canada Amerindians languages (3)" | 3 | 2 months | 6~8k EUR | Partnership with Wikimedia Canada and Atikamekw's wiki.
WMF Community Fund 3 | Field outreach marginalized language(s), Region Africa | 1+ | 2 months | 6~8k EUR | No contact at the moment.
WMF Community Fund 3 | Field outreach marginalized language(s), Region CEE / Russia | 1+ | 2 months | 6~8k EUR | No contact at the moment.
WMF Community Fund 3 | Field outreach marginalized language(s), Region ESEAP: E./S.E. Asia, the Pacific region | 1+ | 2 months | 6~8k EUR | No contact at the moment.
Others
WMF Community Fund 4 | 2022 Contribuling Conference | n.a. | 2 days | 4k EUR | Partnership with INALCO.
WMF Alliances Fund 1 | Taiwan aboriginal languages (16) Wikimedian in residence | 16 | 9 months | 50k EUR | Partnership with
TOTAL | | 24 | 21-23 months | 95-120k |
M.L.: marginalized languages supported, creating a solid range of demonstrations of in-community LinguaLibre usage. Field outreach: proof-of-concept and seeding expeditions, implying contact, informed consent by the minority, travel, possible linguistic research, training of locals, supervision of recording sessions, hosting fees.

Hires to onboard

The wave of newcomers should be welcomed and onboarded properly.

Type | Description | M.L. | Duration (est.) | Budget (est.) | Comment
Community engagement | Guiding / onboarding the wave of newcomers per LinguaLibre:Roles: (Speakers: follow up with leads in order to increase the current low retention rate, with a focus on speakers of minority languages); Devs: point them to the Github and Phabricator, profile their skillsets, guide them to suitable repositories/projects; Online advocates: onboard them into the PR team, may help lower the workload; Local coordinators in the target community: able to organize local events and training, which will increase long-term impact. | n.a. | 8 months (part time) | |
Campaign solidity

I do not believe the full effort drafted above can be achieved by volunteers with occasional involvement. We can expect such a team to achieve 1/4 to 1/3 of that plan (that is, to go 3-4 times slower). I wonder whether this overall coordination plan, with two coordinators for 6 months (favored) or one for one year (will do) initiating multiple projects via grants, could itself benefit from the Wikimedia Fund we previously discussed? One or two central 2022 LinguaLibre Campaign coordinators, funded at ~50k€, would make it more likely that most of this wishlist actually happens in 2022.

@Yug: Thanks for sharing these ideas with me to support LinguaLibre. Because this is a complex set of related but separate proposals, my suggestion for next steps would be to have the applying individuals or organizations who are proposing these projects contact the Regional Program officer over e-mail depending on where they are physically based. You can find more information about each funding region here, and a listing of our team is provided here. I JethroBT (WMF) (talk) 15:03, 27 November 2021 (UTC)
As a follow-up anecdote, the volunteer-based LL PR campaign designed and led by Marreromarco, which my overall PR+tech funding requests plan aimed to secure and solidify, has just been called off. Marreromarco has assessed the software side to be too basic to satisfy the non-Wikimedian public, and therefore not worth his ambitious volunteer-powered PR campaign. In our case, tech and PR go together. I'm still drafting a proposal for Marti. Yug (talk) 11:48, 29 November 2021 (UTC)

Competition

Metrics

Project | Licence | Languages | Members | Recorded words | Recorded sentences | Written sentences | Comment
Tatoeba.org | Open | 410[2] | 56,406[3] | n.a. | 929,389[4] | 10,192,845[5] | UI: excellent and lively UI, to learn from
CommonVoices.org | Open | 90+ | 200,000[6][7] | 30,000,000[6][8] | n.a. | n.a. | UI: site has clean and dynamic UI to learn from.
LinguaLibre.org | Open | 250+[9] | 2200+[10] | 1,300,000+ | n.a. | n.a. | UI: « best opportunities for progress »
Forvo.com | NC | | | | | | UI: excellent and futurist UI, to learn from
Google | (c) | 1000+ | | | | |
Facebook | (c) | 1000+ | | | | |
Note: Common Voice has no aggregated count available, only per-language counts from which a sum can be made. One hour of audio is estimated as equivalent to 2000 words.
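
The conversion in the note can be sketched as follows. This is an illustration only: the 2000 words per hour rate comes from the note, while the function names and the sample per-language hours are hypothetical and are not real Common Voice data.

```python
WORDS_PER_HOUR = 2000  # estimate stated in the note above


def words_equivalent(hours_by_language):
    """Sum per-language recorded hours and convert the total to a word count."""
    return sum(hours_by_language.values()) * WORDS_PER_HOUR


def hours_equivalent(words):
    """Inverse conversion: word-equivalents back to hours of audio."""
    return words / WORDS_PER_HOUR


# Hypothetical per-language hours, for illustration only.
sample_hours = {"lang_a": 1200.0, "lang_b": 350.5, "lang_c": 12.0}
print(words_equivalent(sample_hours))

# Hours of audio implied by the table's 30,000,000 word-equivalents.
print(hours_equivalent(30_000_000))
```

Under this rate, the 30,000,000 word-equivalents in the table correspond to 15,000 hours of audio.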

Competitive advantages

Each site has a different focus.

  • Tatoeba focuses on written sentences and parallel sentences (translations) to feed learning applications.
  • Common Voice focuses on audio sentences by various speakers, deliberately seeking very diverse audio, to feed Speech2Text and Text2Speech systems.
  • LinguaLibre focuses on clean audio words to illustrate Wikimedia Wiktionaries (so far), but is by design convenient for vocabulary applications and dictionaries (requires a working datasets page).
  • (Forvo - TBC)

SWOT

SWOT (strengths, weaknesses, opportunities, and threats) analysis is a method for identifying and analyzing the internal strengths and weaknesses and the external opportunities and threats that shape current and future operations and help develop strategic goals.

SWOT for Lingualibre's UI

Strengths
  • LinguaLibre is unique in its focus on words.
  • LinguaLibre is unique in its native integration into the Wikimedia ecosystem, especially Wiktionaries.
  • Lingualibre is catching up with Tatoeba in terms of amounts (660k vs 930k).
  • MediaWiki allows the community flexible collaboration.
Weaknesses/Lags
  • Lingualibre lags behind Tatoeba in terms of linguistic diversity (145 vs 410 languages).
  • Lingualibre lags behind Common Voice in audio content (650k words vs est. 30M words-equivalent).
  • LinguaLibre's UI (home page, stats) and communication (home page content) are keeping LinguaLibre down compared to these (open content) competitors.
  • LinguaLibre's community is too small to leverage the wiki for event organization and the like.
Opportunities
  • LinguaLibre can improve its design by learning from and duplicating competitors' best practices.
    • CSS snippets can be created, Help:SPARQL can provide data.
  • LinguaLibre can improve its playfulness by learning from and duplicating competitors' best practices.
    • Gamification can be increased via visual calls to action and forward competition.
Threats
  • Lingualibre's stagnating and raw user interface is not engaging enough.
  • Lingualibre's community could stagnate, and therefore go into relative decline compared to competitors.

Other assessments

Communication pages

Help pages

Category:Lingua Libre:Help pages: overall review, recategorization of identified orphan pages, basic improvements and needs assessment (below) done yesterday. Help pages would benefit from some care.

Needs merge :

Needs split:

Needs better inclusion into LinguaLibre:Stats/Languages or links:

Needs expansions (Category:Drafts):

Comment:

  • Some orphan pages were likely missed.
  • Other namespaces were not assessed.
  • Maintenance ideas, improvements and templates could help.

Gadgets scripts

LinguaLibre Gadgets are JS scripts that enhance the site by adding some features.

Future

See also

References
