Talk:List of Wikipedias by sample of articles/Archives/2012

Please do not post any new comments on this page. This is a discussion archive first created in 2012, although the comments contained were likely posted before and after this date. See current discussion or the archives index.

1, 4, 9

I haven't found the discussion about the class weights. I mean the factors of 1 for stubs, 4 for articles and 9 for long articles that are used in the score. Can someone point me to the right link or maybe summarize the reasoning? Thanks in advance, CasteloBranco^msg 14:04, 8 February 2012 (UTC)

After looking at the original source code, I would guess the numbers were chosen because they are squares.

Group number	Name	weight
0	Absent	0² = 0
1	Stubs	1² = 1
2	Articles	2² = 4
3	Long art.	3² = 9

They were discussed here, and possibly somewhere else, but they have never been changed. Boivie 16:04, 8 February 2012 (UTC)
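As a rough illustration of the squared weights in the table above, here is a minimal Python sketch. The helper names (`raw_score`, `normalized_score`), the example counts, and the normalization against the maximum possible score are assumptions made for illustration; they are not taken from the actual score program.

```python
# Weights are the squares of the group numbers, as in the table above.
WEIGHTS = {"absent": 0 ** 2, "stub": 1 ** 2, "article": 2 ** 2, "long": 3 ** 2}


def raw_score(counts):
    """Weighted sum of article counts per size class (hypothetical helper)."""
    return sum(WEIGHTS[group] * n for group, n in counts.items())


def normalized_score(counts):
    """Raw score as a percentage of the maximum (every article long).

    This normalization is an assumption, not the real program's formula.
    """
    total = sum(counts.values())
    return 100.0 * raw_score(counts) / (WEIGHTS["long"] * total)


# Invented example counts for a 1000-article list.
example = {"absent": 100, "stub": 400, "article": 300, "long": 200}
print(raw_score(example))         # 400*1 + 300*4 + 200*9
print(normalized_score(example))
```

Because the weights grow quadratically, one long article counts as much as nine stubs in this sketch.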

The perils of jumping to conclusions (re Latin)

Liliana, which Latin translation of the UDHR did you choose, and why? It makes a huge difference! The UN's site offers two translations. Since these are of different lengths, your choice will affect the weight given to Latin. See how widely translations can vary. Here are Articles 3 through 6, in English and each of the Latin versions:

Everyone has the right to life, liberty and security of person. No one shall be held in slavery or servitude; slavery and the slave trade shall be prohibited in all their forms. No one shall be subjected to torture or to cruel, inhuman or degrading treatment or punishment. Everyone has the right to recognition everywhere as a person before the law.

Suae quisque ipsius vitae, libertatis, incolumitatis potestatem habet. Homo nemo iugo et servitute oppressus teneri poterit; nullo pacto, servitus et mancipiorum commercium. Homo nemo in cruciatum poterit dari, suppliciis atrocibus adhibendis. Suae quisque ipsius probationis potestatem habet, ubicumque gentium, personae rationalis et civilis.

Quisque jus habet vivere, liber esse et in tuto vivere. Nemo servitudine tenebitur. Omnes servitudines omnibus modis prohibentur. Nullus cruciabitur neque poenis nec tractationibus, quae humanae sunt, objicietur. Quisque jus habet suam juridicam personam in omnibus locis accipi.

The first Latin version is a tiny bit shorter than the English, and the second is much shorter. ("Eyeballing" the whole texts at the UN's site suggests that other such segments, if similarly compared, would yield similar, though not identical, results.) If I've counted correctly, the English has 349 characters, the first Latin version has 344, and the second Latin version has 279. So Latin's weight implied by the first version is 1.01, and Latin's weight implied by the second version is 1.25. To my ear, the second version "sounds" more Latinate, as the first is gratuitously wordy (for example, homo nemo, twice, where nemo is adequate by itself). The second version would be even shorter (and Latin's implied weight higher) if the translators had rendered "everywhere" by the ordinary term, ubique, rather than going for the flourish of in omnibus locis (literally 'in all places'). So a reasonable guess at Latin's "true" (but unknowable) weight against English is closer to 1.25 than to 1.01. The lessons here for everybody, in all our wonderful languages, are that short samples can vary a lot, reflecting the translators' purposes & abilities and drastically affecting the weights, and that we shouldn't jump to conclusions on the basis of the samples that are in play so far. Jacob. 71.163.192.88 02:25, 12 February 2012 (UTC)
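The arithmetic behind those two figures can be checked with a short Python sketch. It assumes, as the numbers in this comment imply, that a language's weight is the English character count divided by the translation's character count; the function name `weight` is made up for illustration.

```python
def weight(base_chars, translation_chars):
    """Weight of a language against the base text: base length / translation length."""
    return base_chars / translation_chars


# Character counts as reported in the comment above.
ENGLISH = 349
LATIN_FIRST = 344
LATIN_SECOND = 279

print(round(weight(ENGLISH, LATIN_FIRST), 2))   # 1.01
print(round(weight(ENGLISH, LATIN_SECOND), 2))  # 1.25
```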

I used the first one. For reference, the first version uses 9907 characters (thus a weight of 1.070), the second version uses 9825 characters, yielding a weight of 1.079. So as you see, they don't actually differ all too much. -- Liliana • 02:35, 12 February 2012 (UTC)

That's good to know, as it suggests that, in larger samples, differences of the sort highlighted above may "even themselves out." But why did you choose the first version? and why didn't you mention that you'd made a choice? 71.163.192.88 02:53, 12 February 2012 (UTC)

I didn't know the difference between them so I simply went with the first one in the list, thinking they'd be near identical. As it seems, that doesn't appear to be the case though. -- Liliana • 13:16, 12 February 2012 (UTC)

To avoid scaring the Catalans, that's the sort of thing that should not be done: you chose a translation that downgrades a particular language's weight more than another translation does, and you hid that fact from public view. (What other choices may have been made?) Everything done here with regard to weights should be aboveboard, evenhanded, transparent. Jacob. 71.163.73.162 03:50, 14 February 2012 (UTC)

This is a useful case study, because we've got two different translations into the same language, made around the same time. They are different—and I agree with Jacob that the second one is more idiomatic in this passage—but they end up very close to the same length, differing by less than one percent, once we have a long enough sample. This supports the idea that a longer text for comparison should give us more accurate weights, whether that longer text be the UDHR, a longer passage from the Bible, or something from the Aeneid. (That's been translated into a lot of languages, though not necessarily all the ones we care about. It could help us test the speculation that translations are often longer than the base texts -- if we were to take, say, book 4 of the Aeneid as a base text, would Latin's weight end up lower than English's? Just curious.) A. Mahoney 23:45, 12 February 2012 (UTC)
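To put numbers on how much the two Latin versions differ in the short excerpt versus the full text, here is a small Python sketch using only the character counts quoted in this thread. The helper name `relative_difference` is an assumption for illustration.

```python
def relative_difference(a, b):
    """Difference between two lengths as a fraction of the larger one."""
    return abs(a - b) / max(a, b)


# Articles 3 to 6 only: 344 vs 279 characters.
short_sample = relative_difference(344, 279)
# Full UDHR: 9907 vs 9825 characters.
full_text = relative_difference(9907, 9825)

print(f"{short_sample:.1%}")
print(f"{full_text:.2%}")
```

The short excerpt differs by nearly a fifth, while the full texts differ by less than one percent, which is the pattern described above.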

Not a chance! Here's the death of Dido, from the end of book 4 (translated by G. P. Goold in the Loeb series):

tum Iuno omnipotens, longum miserata dolorem difficilisque obitus, Irim demisit Olympo, quae luctantem animam nexosque resolveret artus. nam quia nec fato, merita nec morte peribat, sed misera ante diem subitoque accensa furore, nondum illi flavum Proserpina vertice crinem abstulerat Stygioque caput damnaverat Orco. ergo Iris croceis per caelum roscida pinnis, mille trahens varios adverso sole colores, devolat et supra caput adstitit. "hunc ego Diti sacrum iussa fero teque isto corpore solvo": sic ait et dextra crinem secat; omnis et una dilapsus calor atque in ventos vita recessit.

Then almighty Juno, pitying her long agony and painful dying, sent Iris down from heaven to release her struggling soul from the prison of her flesh. For since she perished neither in the course of fate nor by a death she had earned, but wretchedly before her day, in the heat of sudden frenzy, not yet had Proserpina taken from her head the golden lock and consigned her to the Stygian underworld. So Iris on dewy saffron wings flits down through the sky, trailing athwart the sun a thousand shifting tints, and halted above her head. "This offering, sacred to Dis, I take as bidden, and from your body set you free": so she speaks and with her hand severs the lock; and therewith all the warmth passed away, and the life vanished into the winds.

If I've counted correctly, the Latin has 589 characters and the English has 747, making the estimated weight for Latin (against English) 1.27. The world has hundreds of books of Latin texts and their English translations printed on facing pages, and anybody who flips through them will agree that Latin on average is terser than English. Our question here is how much so. My rough estimate, based on flipping through numerous volumes, is that Latin has an average weight (against English) somewhere between 1.10 and 1.25, perhaps near the midpoint of that range. Liliana's 1.07 seems too low. Jacob. 71.163.73.162 17:55, 13 February 2012 (UTC)

I found this interesting -- to use a Latin text as base -- especially since I happen to have an Icelandic translation of the Aeneid on my shelf here.

Júnó hin almáttka, sem horfði full samúðar á langæja kvöl, hart helstríð, sendi loks Írisi ofan af Ólympusi til að enda glímu sálarinnar og leysa fjötraða limi. Af því að hún dó hvorki í tímans fyllingu né verðskulduðum dauða, heldur af sorg fyrir aldur fram, gripin ástaræði, þá hafði Próserpína ekki enn tekið til sín lokk úr ljósu hárinu og helgað höfuðið hinum stýgverska Orkusi. Og nú svífur Íris niður himininn á döggvuðum sóllauksvængjum, þúsundlit í skini sólar, og nemur staðar yfir höfði hennar: „Hárlokk þennan, helgaðan Dis, tek ég, eins og mér var skipað, og leysi þig úr líkama þínum.“ Svo mælir hún og sker hárið með hægri hendi: jafnskjótt hvarf allur ylur og lífið vék burt út í blæinn.

The Icelandic translation (which is a prose translation) uses 703 characters (with spaces), yielding a weight of 1.063 relative to the English instead of the 1.041 that the UN declaration yields. Incidentally, there is something wrong with that text (in all the links provided in the discussions above); random letters inserted in random places make me think that the text was originally scanned in from a hard copy and transformed into a text document without being proofread. It's also archaizing in both spelling and punctuation. Anyway, the Aeneid comparison yields almost exactly the same weight for Icelandic as the Bible comparison did, but somewhat more than the UN text. Interesting result. --Cessator (talk) 19:48, 19 May 2012 (UTC)
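The weights quoted for the Dido passage, and what they would look like with Latin as the base text instead of English, can be sketched in Python as follows. The character counts are the ones reported in this thread; the function name `weights_against` is hypothetical.

```python
# Character counts reported in this thread for the death-of-Dido passage.
COUNTS = {"Latin": 589, "English": 747, "Icelandic": 703}


def weights_against(base, counts):
    """Weight of each language relative to the chosen base text."""
    base_len = counts[base]
    return {lang: round(base_len / n, 3) for lang, n in counts.items()}


print(weights_against("English", COUNTS))  # English as base text
print(weights_against("Latin", COUNTS))    # Latin as base text
```

Changing the base rescales every weight by the same factor, so in this sample Latin remains the tersest of the three whichever text serves as the base.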

A larger question remains: why does switching to the UDHR text downgrade the weights of most languages? A reasonable guess at an answer is that, since the style of the UDHR is somewhat legalistic, translators have been careful to hew rather closely to its vocabulary & syntax, sometimes glossing word for word, rather than availing themselves of (possibly more natural-sounding) idiomatic expressions. Jacob. 71.163.73.162 03:50, 14 February 2012 (UTC)

Actually I think it's the other way around; the Babel text was too conservative in writing style. My native language is German and to me it always seemed that English is much shorter than most other languages, yet with the Babel weights English has one of the *lowest* weights, which doesn't make a lot of sense to me. -- Liliana • 04:19, 14 February 2012 (UTC)

Why not use the whole Bible?

It's huge, it has a wide variety of styles, and it's been translated into most of the languages likely to have wikipedias. Moreover, treating each book as a unit could give us a sense of how weights will respond to the stylistic variations that occur from book to book (and from translator to translator, since, in at least some cases, different individuals have been responsible for translating different books.) Each book could serve as an independent sample; and for each language, the mean of the samples could become the weight. (Obviously, books rejected from particular versions couldn't serve as samples.) Or is it that not enough versions are available online to serve as samples? or that the programming involved in measuring the texts would be prohibitively complicated? 71.163.73.162 03:50, 14 February 2012 (UTC)
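The per-book idea above could be sketched in Python roughly as follows. Everything concrete here is an assumption for illustration: the function names are hypothetical, and the character counts are invented placeholders, not measurements of any real translation.

```python
from statistics import mean, stdev

# Placeholder character counts per book. These numbers are invented for
# illustration only and do not describe any real Bible translation.
BASE = {"Genesis": 1000, "Ruth": 200, "Mark": 500}
TRANSLATION = {"Genesis": 900, "Ruth": 190}  # "Mark" absent from this version


def per_book_weights(base, translation):
    """One weight per book; books missing from the translation are skipped."""
    return {
        book: base[book] / translation[book]
        for book in base
        if book in translation
    }


def language_weight(base, translation):
    """Mean of the per-book weights, plus their spread as a rough reliability hint."""
    values = list(per_book_weights(base, translation).values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


weight, spread = language_weight(BASE, TRANSLATION)
print(round(weight, 3), round(spread, 3))
```

Reporting the spread alongside the mean would show how strongly the weight responds to stylistic variation from book to book.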

Where would you find complete Bibles in various languages? -- Liliana • 04:17, 14 February 2012 (UTC)

Complete Bibles have been printed in many hundreds of languages, the New Testament probably in more. For a start, try Biblos and Bible Gateway. Jacob. 71.163.198.183 13:34, 14 February 2012 (UTC)

Surprisingly, this isn't all that hard. Three minutes' searching turns up a site calling itself "Bible Gateway," a roundup listing of translations, with some pointers, not to mention a page about exactly this. In other words, I expect lots of them are available in suitable form on line, if not every language we care about. A. Mahoney 13:25, 14 February 2012 (UTC)

I agree completely with this idea, if it lies within the realm of technical feasibility. The bigger the corpus, the more reliable the weights. It should be added as an option to the Wikimedia vote. Leptictidium 14:20, 15 February 2012 (UTC)

That's my biggest gripe, I'm not positive you'd be able to find Bibles in enough languages. Feel free to convince me otherwise. -- Liliana • 15:42, 15 February 2012 (UTC)

If we were able to find Genesis in all these languages, it's obvious we'll be able to find the entire Bible. Just go to the source from where we got Genesis in the first place. Leptictidium (talk) 16:08, 20 February 2012 (UTC)

Since the books in the Bible vary from denomination to denomination, it would be necessary to work from a specific list of books, rather than from an edition of a supposedly "whole" Bible. A related but slightly different problem is that certain texts are variable. For example, the best texts of Mark end at 16:8, but less reliable manuscripts tack on a seemingly gratuitous ending, 16:9-20. Whether the extra verses are included in each translation might have to be checked. Jacob. 71.163.198.245 00:26, 1 March 2012 (UTC)

That doesn't seem overly complicated. Leptictidium (talk) 15:59, 6 March 2012 (UTC)

While we discuss the corpus or the source for comparison, one thing is clear: changes should be debated and voted on (and with more people). --Barcelona (talk) 11:10, 17 February 2012 (UTC)

+1 for the Bible, since it has translations into all 200+ Wikipedia languages. The "Declaration of Human Rights" has, I think, translations into the UN members' languages. --Alex Blokha (talk) 17:16, 5 March 2012 (UTC)

I believe I said it before: if you want to do this, please make a table like I did above with the new weights. -- Liliana • 18:45, 5 March 2012 (UTC)

Lezghian Wikipedia

I just want to say that in this new project we created a special category for "1000 articles". Also, you can see a list with page sizes here.--Soul Train (talk) 14:06, 1 May 2012 (UTC)

July 2012

There was a bug in the score program because the "HIV/AIDS" article has a slash in the name. I don't have time to fix this or run the program again. The way the scoring works, "errors" don't necessarily reduce your score, because an error reduces the total number of articles from 1000 to 999. So, in some odd cases it can actually increase your score if it was a tiny article. Anyway, if anyone else wants to re-run it, go ahead. --MarsRover 21:13, 2 July 2012 (UTC)
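A toy Python sketch of why a skipped article can raise a score. It assumes the score behaves like an average over the articles that were actually counted, which is a simplification rather than the real program's formula; the article weights below are invented.

```python
def average_score(weights):
    """Mean per-article weight over the articles that were actually counted."""
    return sum(weights) / len(weights)


# Invented example: 999 ordinary articles of weight 4 plus one stub of weight 1.
full_list = [4] * 999 + [1]
with_error = full_list[:-1]  # the stub is skipped, so only 999 articles count

print(average_score(full_list))
print(average_score(with_error))
```

Dropping a below-average article from both the sum and the count raises the mean, while dropping an above-average one lowers it.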

It is easy to fix the bug in a quick way -- just change "HIV/AIDS" to "HIV AIDS" in file "ArticleList.txt". -- Ace111 (talk) 22:04, 2 July 2012 (UTC)

Would it suffice to change the link in the list itself, either to "HIV AIDS" or to something else that re-directs there? That might be easier if there are more programs than just yours that parse the list. A. Mahoney (talk) 12:08, 5 July 2012 (UTC)

Yeah, that might work. Just to clarify, the only reason en:HIV AIDS brings up en:HIV/AIDS is because it's a redirect. So, programs using this list would have to handle redirects. Anyway, I'll try to fix the program before next month. --MarsRover 16:34, 5 July 2012 (UTC)
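The thread does not say where exactly the slash breaks the score program. As a sketch under the assumption that the title ends up in a file name or a URL, a program could percent-encode titles, and could ask the MediaWiki API to follow redirects via the `redirects` parameter of `action=query`. The function names below are hypothetical, and the code only builds strings; it performs no network request.

```python
from urllib.parse import quote, urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def safe_name(title):
    """Percent-encode a title so that "/" cannot be read as a path separator."""
    return quote(title, safe="")


def query_url(title):
    """Build an action=query URL that asks the API to resolve redirects."""
    params = {"action": "query", "titles": title, "redirects": 1, "format": "json"}
    return API_ENDPOINT + "?" + urlencode(params)


print(safe_name("HIV/AIDS"))
print(query_url("HIV AIDS"))
```

With encoding in place the original "HIV/AIDS" entry could stay in the list unchanged; the redirect approach is only needed if the list entry is renamed.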