Research:Expanding Wikipedia articles across languages/Inter language approach/Section Alignment at Large Scale

This page documents a completed research project.



Following our previous work on Cross-lingual Section Alignment, we have expanded the language coverage and updated the ML pipelines in order to compute section alignment across 205 different languages. This page describes the implementation and performance of our new algorithm.

System description

Features

  • Title similarity: The cosine similarity between the vector representations of two section titles.
  • Link similarity (sum): The sum of link similarity over all section pairs with the same source and target sections. Link similarity is defined here as the Jaccard index between the two sets of links, represented as Wikidata items, found in each section (see the sketch after this list).
  • Link similarity (mean): Mean link similarity for all section pairs with the same source and target sections.
  • Edit distance: The Levenshtein distance between two section titles.
  • Normalized co-occurrence count: The number of times two section titles co-occur across all articles, normalized by the maximum number of times the source section co-occurs with any target section.
  • Source count: Total number of times the source section occurs across all articles.
  • Target count: Total number of times the target section occurs across all articles.
  • Source position (relative to the top): The source section's relative position in its article, measured from the top.
  • Target position (relative to the top): The target section's relative position in its article, measured from the top.
  • Source position (relative to the bottom): The source section's relative position, measured from the bottom.
  • Target position (relative to the bottom): The target section's relative position, measured from the bottom.
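
To make the two core similarity features concrete, here is a minimal sketch in Python; the function names and inputs are illustrative, not the pipeline's actual code:

  import numpy as np

  def title_similarity(u: np.ndarray, v: np.ndarray) -> float:
      """Cosine similarity between two section-title embeddings."""
      return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

  def link_similarity(source_links: set, target_links: set) -> float:
      """Jaccard index between the sets of Wikidata items linked
      from the source and the target section."""
      if not source_links and not target_links:
          return 0.0
      return len(source_links & target_links) / len(source_links | target_links)

  # Example: two sections linking to overlapping sets of Wikidata items.
  print(link_similarity({"Q1", "Q2", "Q3"}, {"Q2", "Q3", "Q4"}))  # 0.5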

Feature extraction

The above features are extracted in two phases. During the first phase, the latest revision of every article in a language is read from the wikitext_current Hive table. Each revision is then parsed to extract all level-2 headings, the list of links found under each heading (represented as Wikidata items), the relative position of each heading in the article, and the total number of times each heading occurs across all articles in that language.
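
As an illustration of the parsing step, the sketch below uses the mwparserfromhell library to pull level-2 headings, their links, and their relative positions out of one revision's wikitext. It is a simplification: the actual pipeline runs at scale over the Hive table, and the mapping of link targets to Wikidata items is not shown.

  import mwparserfromhell

  def extract_sections(wikitext: str):
      """Yield (heading, links, relative_position) for each level-2 section."""
      code = mwparserfromhell.parse(wikitext)
      sections = code.get_sections(levels=[2])
      for position, section in enumerate(sections):
          heading = section.filter_headings()[0].title.strip_code().strip()
          links = [str(link.title) for link in section.filter_wikilinks()]
          # Relative position of the heading within the article.
          yield heading, links, (position + 1) / len(sections)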

In the second phase, all articles in the source language and its target language(s) are aligned using their Wikidata ID. Then, for each Wikidata item, every possible combination (pair) of headings in the source language and headings in the target language is generated. Next, using the data retrieved during the previous phase, the remaining features are calculated for each pair.
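
The pairing step amounts to a Cartesian product per Wikidata item, roughly as sketched below (the data structures are hypothetical; in practice this is a distributed join):

  from itertools import product

  def generate_pairs(source_sections, target_sections):
      """Generate every (source heading, target heading) combination
      for one Wikidata item aligned across two languages."""
      return list(product(source_sections, target_sections))

  # Example: an article with 3 source and 2 target headings yields 6 pairs.
  pairs = generate_pairs(["History", "Geography", "Economy"],
                         ["Historia", "Geografía"])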

Doing this in two phases decouples the features that depend on the target language from those that do not. The latter do not have to be re-calculated every time the former are computed for a new target language, which results in a significant speed-up across the board.

Models

Natural Language Processing

In order to calculate title similarity, section titles first need to be encoded into vectors. Since the similarity between two vectors is measured using cosine similarity, similar sections should produce similar vector representations even when they belong to different languages. To generate these representations, a cross-lingual sentence embedding model, LaBSE, is used. LaBSE maps sentences from different languages into a shared 768-dimensional vector space, eliminating the need to align embeddings before they can be compared.
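
As an example, title similarity can be reproduced with the publicly released LaBSE checkpoint via the sentence-transformers package (a sketch; the pipeline's actual model-loading code may differ):

  from sentence_transformers import SentenceTransformer, util

  # Load the public LaBSE checkpoint (768-dimensional shared vector space).
  model = SentenceTransformer("sentence-transformers/LaBSE")

  # Titles in different languages map into the same space...
  embeddings = model.encode(["History", "Historia", "Geography"])

  # ...so cosine similarity is directly comparable across languages.
  print(util.cos_sim(embeddings[0], embeddings[1]))  # high: same concept
  print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: different concept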

Machine Learning

To combine the features of each section pair into a single similarity score, a gradient boosting classifier is used. The classifier generates a score between 0 and 1: the probability that the pair's target is an accurate translation of its source. The targets for each source are then ranked according to this probability.
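
A minimal sketch of this scoring-and-ranking step, assuming scikit-learn's GradientBoostingClassifier and a pandas frame of pair features (the feature column names are illustrative, and the actual hyperparameters are not documented here):

  import pandas as pd
  from sklearn.ensemble import GradientBoostingClassifier

  FEATURES = ["title_similarity", "link_similarity_sum", "link_similarity_mean",
              "edit_distance", "norm_cooccurrence", "source_count", "target_count",
              "source_pos_top", "target_pos_top",
              "source_pos_bottom", "target_pos_bottom"]

  def score_and_rank(model: GradientBoostingClassifier,
                     pairs: pd.DataFrame) -> pd.DataFrame:
      """Score each (source, target) pair and rank the targets per source."""
      pairs = pairs.copy()
      # Probability of the target being a translation of the source.
      pairs["probability"] = model.predict_proba(pairs[FEATURES])[:, 1]
      pairs["rank"] = (pairs.groupby("source")["probability"]
                            .rank(method="first", ascending=False).astype(int))
      return pairs.sort_values(["source", "rank"])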

Training data

The training data is generated by combining the ground truth with the extracted features. The ground truth consists of crowdsourced section translations in six languages: Arabic, English, French, Japanese, Spanish, and Russian. These section pairs are joined with the data generated by the feature extraction pipeline on source, target, source language, and target language. This yields a dataset with both positive and negative examples, unlike the ground truth, which contains only positive ones. The resulting dataset, however, is imbalanced, with a disproportionately high number of negative examples, and any machine learning task involving it has to address that.
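
One common way to address such an imbalance is to up-weight the rare positive class when fitting; the sketch below illustrates this with sample weights, though whether the pipeline uses this particular remedy is not documented here:

  import numpy as np
  from sklearn.ensemble import GradientBoostingClassifier

  def fit_with_balancing(X: np.ndarray, y: np.ndarray) -> GradientBoostingClassifier:
      """Fit a gradient boosting classifier, weighting positives so that
      both classes contribute equally despite the skewed label counts."""
      weights = np.where(y == 1, (y == 0).sum() / max((y == 1).sum(), 1), 1.0)
      model = GradientBoostingClassifier()
      model.fit(X, y, sample_weight=weights)
      return model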

Testing data

Test data is generated by joining the section translations made with the Content Translation tool (CXT) with the extracted features on source, target, source language, and target language. The translations from CXT are labelled 'True' and the rest of the pairs 'False'. Precision on this dataset is measured as the fraction of source sections for which the CXT translation is among the top-n targets.
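
Under this definition, precision@n can be computed roughly as follows (a sketch assuming a pandas frame of ranked pairs for one language pair, with a boolean CXT label column; the column names are illustrative):

  import pandas as pd

  def precision_at_n(ranked_pairs: pd.DataFrame, n: int = 5) -> float:
      """Fraction of distinct sources whose CXT translation (label == True)
      appears among the top-n ranked targets. A source with several CXT
      targets counts once: a hit on any of them counts as a hit."""
      top_n = ranked_pairs[ranked_pairs["rank"] <= n]
      hits = top_n.groupby("source")["label"].any()
      sources = ranked_pairs.loc[ranked_pairs["label"], "source"].nunique()
      return hits.sum() / sources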

Limitations

LaBSE, the embedding model used to generate vector representations for section titles, currently supports 109 languages[1]. This means that sections from unsupported languages might not be correctly encoded.

Output description

The generated section alignments for each language are available for download as SQLite databases.

Output schema

col_name         data_type  comment
source           text       source section title
target           text       target section title
source_language  text       the Wikipedia the source section comes from
target_language  text       the Wikipedia the target section comes from
probability      real       probability of the target being the source's translation
rank             integer    target's rank according to probability
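
For example, the top-ranked alignments for a given source section could be queried like this (the database file name and table name below are assumptions; check the downloaded files for the actual names):

  import sqlite3

  # Connect to one of the downloaded databases (file name is hypothetical).
  conn = sqlite3.connect("section_alignment_enwiki.sqlite")

  # Top-ranked Spanish candidates for the English "History" section.
  rows = conn.execute(
      """SELECT target, probability
         FROM alignments          -- table name assumed; check the download
         WHERE source = 'History'
           AND source_language = 'en' AND target_language = 'es'
         ORDER BY rank LIMIT 5"""
  ).fetchall()
  conn.close()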

Performance

The following results cover the top 100 language pairs by number of section pairs tested. Precision here denotes the fraction of source sections for which the CXT translation was among the top 5 aligned targets. Note that any source section occurring more than once per (source language, target language) pair in the CXT dataset was counted as one pair and tested by checking whether any of its corresponding targets ended up among the top 5.

Source language Target language Precision @ 5 Pairs tested
enwiki eswiki 0.970 12988
enwiki frwiki 0.939 9165
enwiki arwiki 0.937 8456
enwiki viwiki 0.946 6054
ruwiki ukwiki 0.986 5980
ruwiki bawiki 0.919 5382
enwiki jawiki 0.906 5328
enwiki zhwiki 0.915 5153
enwiki itwiki 0.941 5039
enwiki ukwiki 0.944 4934
enwiki ptwiki 0.964 4691
enwiki trwiki 0.953 4246
enwiki ruwiki 0.912 4110
enwiki hewiki 0.925 4062
enwiki idwiki 0.973 3495
enwiki fawiki 0.946 3402
enwiki rowiki 0.964 3048
enwiki bnwiki 0.962 2832
enwiki tawiki 0.963 2707
enwiki elwiki 0.946 2685
enwiki cawiki 0.940 2604
eswiki cawiki 0.971 2296
frwiki ocwiki 0.989 2094
enwiki dewiki 0.876 1884
enwiki pawiki 0.982 1781
enwiki mlwiki 0.952 1632
enwiki cswiki 0.917 1466
enwiki kowiki 0.905 1375
enwiki mkwiki 0.966 1308
enwiki srwiki 0.928 1212
enwiki sqwiki 0.971 1178
enwiki nlwiki 0.925 1176
enwiki mswiki 0.957 1174
enwiki afwiki 0.977 1089
enwiki huwiki 0.897 1041
dewiki frwiki 0.852 1026
frwiki eswiki 0.920 995
ruwiki hywiki 0.959 991
frwiki enwiki 0.918 922
dewiki enwiki 0.895 893
enwiki urwiki 0.948 828
enwiki plwiki 0.891 824
enwiki tewiki 0.953 813
eswiki enwiki 0.913 797
ukwiki ruwiki 0.958 754
jawiki zhwiki 0.856 750
enwiki fiwiki 0.888 732
enwiki thwiki 0.920 679
enwiki hiwiki 0.938 659
enwiki dawiki 0.933 658
frwiki itwiki 0.921 648
eswiki euwiki 0.946 635
enwiki slwiki 0.959 631
dewiki itwiki 0.872 626
enwiki cywiki 0.955 616
ruwiki hewiki 0.874 595
ruwiki enwiki 0.906 595
enwiki tlwiki 0.939 594
eswiki glwiki 0.927 587
enwiki orwiki 0.926 582
enwiki svwiki 0.930 568
enwiki kawiki 0.952 568
enwiki bgwiki 0.929 564
ruwiki bewiki 0.978 544
enwiki hywiki 0.918 538
enwiki mywiki 0.929 535
eswiki frwiki 0.882 534
enwiki guwiki 0.958 524
frwiki cawiki 0.922 523
enwiki knwiki 0.965 510
enwiki glwiki 0.901 506
dewiki nlwiki 0.876 499
ruwiki ttwiki 0.950 497
cawiki eswiki 0.961 491
enwiki hawiki 0.924 487
eswiki ptwiki 0.960 475
dewiki eswiki 0.870 453
enwiki ckbwiki 0.642 450
frwiki arwiki 0.824 449
plwiki ukwiki 0.918 426
itwiki frwiki 0.903 423
zhwiki enwiki 0.899 414
enwiki siwiki 0.951 412
enwiki euwiki 0.926 404
enwiki hrwiki 0.948 400
itwiki enwiki 0.932 385
ruwiki tgwiki 0.916 382
enwiki jvwiki 0.866 372
itwiki eswiki 0.923 364
enwiki eowiki 0.893 355
enwiki etwiki 0.915 354
dewiki ukwiki 0.852 352
jawiki kowiki 0.937 350
ptwiki enwiki 0.935 336
ruwiki kkwiki 0.955 332
frwiki ptwiki 0.927 329
enwiki gawiki 0.966 323
enwiki mrwiki 0.944 322
ruwiki sahwiki 0.729 321
enwiki bswiki 0.974 312

Code & Data

References

  1. Feng, Fangxiaoyu, et al. "Language-agnostic BERT Sentence Embedding." arXiv preprint arXiv:2007.01852 (2020).