Talk:List of Wikipedias by sample of articles

Archives of this page


2007 | 2008 | 2009 | 2010 | 2011 | 2012

Lezghian Wikipedia

I just want to say that in this new project we created a special category for "1000 articles". Also, you can see a list with page sizes here.--Soul Train (talk) 14:06, 1 May 2012 (UTC)

July 2012

There was a bug in the score program because the "HIV/AIDS" article has a slash in the name. I don't have time to fix this or run the program again. The way the scoring works, "errors" don't necessarily reduce your score, because an error reduces the total number of articles from 1000 to 999. So, in some odd cases it can actually increase your score if it was a tiny article. Anyway, if anyone else wants to re-run it, go ahead. --MarsRover 21:13, 2 July 2012 (UTC)

It is easy to fix the bug in a quick way -- just change "HIV/AIDS" to "HIV AIDS" in file "ArticleList.txt". -- Ace111 (talk) 22:04, 2 July 2012 (UTC)Reply
Would it suffice to change the link in the list itself, either to "HIV AIDS" or to something else that re-directs there? That might be easier if there are more programs than just yours that parse the list. A. Mahoney (talk) 12:08, 5 July 2012 (UTC)Reply
Yeah, that might work. Just to clarify, the only reason en:HIV AIDS brings up en:HIV/AIDS is because it's a redirect. So, programs using this list would have to handle redirects. Anyway, I'll try to fix the program before next month. --MarsRover 16:34, 5 July 2012 (UTC)
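
The scoring quirk described at the top of this thread can be illustrated with a small sketch. It assumes the 1/4/9 weights quoted elsewhere on this page, and that the score is the raw score divided by the maximum attainable for the articles actually counted. The function name and the sample counts are made up for illustration.

```python
def score(stubs, articles, long_articles, total):
    """Score out of 100, assuming weights 1/4/9 and normalisation by 9 * total."""
    raw = stubs + 4 * articles + 9 * long_articles
    return 100.0 * raw / (9 * total)


# Hypothetical wiki: 300 stubs, 400 articles, 300 long articles, none absent.
with_all = score(300, 400, 300, total=1000)

# Same wiki, but one stub is skipped because of an error,
# so only 999 articles are counted.
one_dropped = score(299, 400, 300, total=999)

print(round(with_all, 2), round(one_dropped, 2))
```

Under these assumptions, losing a stub to an error raises the score whenever the wiki's raw score is larger than the number of articles counted, which matches the "tiny article" case mentioned above.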

Weight of kk.wiki and az.wiki

Many Turkish, Azeri and Kazakh words are similar, and the lengths of many sentences are identical. Therefore, Azeri and Kazakh should have a weight of "1.3", like Turkish. --Bolatbek (talk) 12:43, 7 October 2012 (UTC)

Thanks, now using the following weights:
Kazakh Cyrillic 908 char ~1.3
Azeri 956 char ~1.2
--MarsRover 06:49, 3 November 2012 (UTC)Reply
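
A rough sketch of how a weight could be derived from a character count like the ones above. It assumes the weight is the length of an English reference passage divided by the length of the same passage in the other language, rounded to one decimal. The reference length of 1145 characters is a hypothetical value chosen only because it reproduces the rounded weights quoted here; the real reference count is not given on this page.

```python
def language_weight(reference_chars, translation_chars):
    """Ratio of reference length to translation length, rounded to one decimal."""
    return round(reference_chars / translation_chars, 1)


REFERENCE_CHARS = 1145  # hypothetical English passage length, not from the source

print(language_weight(REFERENCE_CHARS, 908))  # Kazakh (Cyrillic) count from above
print(language_weight(REFERENCE_CHARS, 956))  # Azeri count from above
```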

rawscore shouldn't emphasize long articles so much

Thanks for this cool automated tracking metric! But I'm surprised to see the large weight that long articles get using the current weights (9 times the weight for a stub):

rawscore = stubs + articles*4 + long.articles*9

Here are some sample numbers of absent articles and score for 2013-01-04:

language wiki | absent | score
Võro (fiu-vro) | 469 | 6.07
Mirandés (mwl) | 635 | 13.47
ગુજરાતી (gu) | 743 | 9.75

Even though gu only has about 1/4 of the vital articles, it gets over twice the score that fiu-vro does with over 1/2 of them. It also must be discouraging to have written articles for half the entries, and yet have a score of 6 out of 100. And even the Swedish (Svenska) wiki, 15th on the list, has the low score of 43 even with none absent and 2/3 bigger than stubs.

I suggest that we set the weights so that for new language wikis, the best way to improve the score is to get at least stubs in place, rather than motivating them to get 9 times that many points by writing new long articles. Mature wikis would have appropriately high scores with a more asymptotic approach to 100. So I'd suggest something like this:

rawscore = stubs + articles*1.5 + long.articles*1.7

It also seems that the stub threshold is pretty high, at 10,000 characters. Very few wikis have zero stubs, and in many cases for a broad topic like the ones on the list, it is appropriate to have a short article and refer to other "main articles" for more depth. I didn't run across any guidance on that in the articles on stubs, but I'd suggest something more like 7000 or 5000: you can say a lot in that many characters. See e.g. en:Integumentary_system with perhaps 6300 characters. Perhaps reducing the lower bound on the size of a large article would also be appropriate, since large articles are often unwieldy. Nealmcb (talk) 00:30, 4 February 2013 (UTC)Reply
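
To compare the current and proposed weightings, here is a minimal sketch. It assumes the score is the raw score divided by the largest weight times the number of listed articles (1000), scaled to 100. For the 1/4/9 weights this reproduces the scores in the June 2013 table further down this page; applying the same normalisation to the proposed 1/1.5/1.7 weights is an assumption, and the sample wiki is made up.

```python
def score(stubs, articles, long_articles, weights=(1, 4, 9), total=1000):
    """Score out of 100 for a given (stub, article, long article) weighting."""
    w_stub, w_article, w_long = weights
    raw = w_stub * stubs + w_article * articles + w_long * long_articles
    return 100.0 * raw / (max(weights) * total)


# Hypothetical young wiki: 500 stubs, 30 articles, 5 long articles, 465 absent.
current = score(500, 30, 5)
proposed = score(500, 30, 5, weights=(1, 1.5, 1.7))

print(round(current, 2), round(proposed, 2))
```

With the flatter weights, the same wiki scores much higher because stubs alone already earn most of the available points.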

I agree with the argument about thresholds; I said once before that 10000 characters is quite an article already, but on the other hand, a lot of this is often filled by wikicode (references, infoboxes etc.). Lowering the stub threshold to 7000 would therefore be quite enough. The weight distribution you propose looks quite radical, and I think it would be better if we didn't go that far. 1/3/5 at most, otherwise there would be almost no incentive to write long articles. — Yerpo Eh? 18:45, 17 February 2013 (UTC)Reply
Huh - now I notice there are archives of this talk page, and this has been touched on before, e.g. http://meta.wikimedia.org/wiki/Talk:List_of_Wikipedias_by_sample_of_articles/Archives/2012#1.2C_4.2C_9 and http://meta.wikimedia.org/wiki/Talk:List_of_Wikipedias_by_sample_of_articles/Archives/2008#Suggestion:_A_more_granular_curve and http://meta.wikimedia.org/wiki/Talk:List_of_Wikipedias_by_sample_of_articles/Archives/2010#Article_metric
(FWIW, it doesn't make sense to me to automatically archive this page every year - there aren't that many comments, and now discussions are fractured) Nealmcb (talk) 04:35, 18 February 2013 (UTC)Reply
While I agree with "the best way to improve the score is to get at least stubs in place", I don't think "9 times that many points" is an overly generous reward for writing new long articles. It is actually a ton of work to get an article to 30k characters. On the other hand, writing stubs just requires you to have an article with a title (i.e. an article with more than zero characters). Also, if a wiki has stubs for all 1000 articles, that would be a score of 10, which is a rank of 82 out of 285 (top third). If that doesn't motivate people, then tweaking the scoring won't either. As for articles like "Integumentary system" that don't have much hope of becoming long articles, there is a good argument to just remove them from this list. --MarsRover 18:02, 18 February 2013 (UTC)
I agree that it is a lot more work to write a 30k character article than a stub. But if we consider how useful an article is to an average reader, then I would say that a stub is much more useful than no article at all. And I guess for most readers a very long article is not that much more useful than a stub (assuming that most readers don't take the time to read it all). So I support changing the scores a bit. Boivie (talk) 16:07, 20 February 2013 (UTC)
I agree with MarsRover that long articles deserve a significantly higher score. The smaller the difference between stubs and long articles, the easier it is to game the system by creating lots of one-line stubs. Leptictidium (talk) 13:55, 26 April 2013 (UTC)

Interwiki links

Btw, to make sure your article is counted in the score, go to the following page, http://www.wikidata.org/wiki/Special:ItemByTitle, and make sure the wiki is linked to the English word for the article. Or, if you wait long enough, bots will eventually update this list. --MarsRover 22:32, 3 March 2013 (UTC)

I do not know whether other Wikipedias are affected similarly, but the new list of missing articles for Yiddish (yiwiki) includes four which have already been written and linked: Akbar (Jan 2013), Gabriel García Márquez (Jan 2013), Mount Kilimanjaro (Dec 2012) and Ultraviolet (Jan 2013). Does the algorithm have a problem interfacing with Wikidata? --Redaktor (talk) 19:33, 4 March 2013 (UTC)
There is no problem with accessing wikidata. But in the Yiddish case I don't think the links are there yet. (Akbar iw links used). --MarsRover 23:00, 4 March 2013 (UTC)Reply
Btw, this is the reason the links failed to get moved to Wikidata --MarsRover 23:40, 4 March 2013 (UTC)
59 articles are counted as absent on Japanese Wikipedia, but I'm pretty sure their links exist on Wikidata: d:Q7242 (beauty, ja:美), d:Q12280 (bridge, ja:橋), d:Q450 (mind, ja:心), to name a few. From a quick look, every one of these "absent" ones seems to have a very short title: only one kanji character, as far as I saw. Chinese seems to suffer from the same problem (probably for the same reason). Would there be any problem with one-character titles? --whym (talk) 23:27, 4 March 2013 (UTC)
Interesting. If I figure out the bug in this change, then I'll redo the score for the month. --MarsRover 23:44, 4 March 2013 (UTC)
Right, the Belarusian (Taraškievica) Wikipedia is now counted with 104 absent articles, although all 1000 articles were created a long time ago. The bots are all to blame :) Wizardist (talk) 13:08, 5 March 2013 (UTC)
The problem with one character article names was caused by a bug in my code. Sorry about that. The absent articles from Belarusian (Taraškievica) seem to be because the entries in Wikidata were missing. I checked some samples, and it seems like Legobot has added most (all?) be_x_oldwiki-entries there now. Boivie (talk) 22:15, 5 March 2013 (UTC)Reply
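
For anyone checking a link by hand, here is a minimal sketch of reading a sitelink out of a Wikidata entity record. It assumes the JSON layout used by the Wikidata API for entities (a "sitelinks" mapping keyed by database name). The sample record is hand-written from the example above rather than fetched, and the function name is made up. The actual cause of the one-character-title bug is not described in this thread, so this only shows a lookup that does not depend on title length.

```python
def sitelink_title(entity, dbname):
    """Return the page title linked for a wiki (e.g. 'jawiki'), or None if absent."""
    link = entity.get("sitelinks", {}).get(dbname)
    if link is None:
        return None
    return link["title"]


# Hand-written sample shaped like an entity record; not real API output.
sample = {
    "id": "Q7242",
    "sitelinks": {"jawiki": {"site": "jawiki", "title": "美"}},
}

print(sitelink_title(sample, "jawiki"))
print(sitelink_title(sample, "yiwiki"))
```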

June 2013 update

This month's numbers seem a bit off. There are suspiciously few growth score decreases despite a few changes in relevant Wikidata objects and such. :slwiki, for example, had a score increase of 0.34, but the only change was one article going from the "Article" category to "Missing". :thwiki went up 0.95 with one more of both "Stubs" and "Long articles", and two fewer "Articles" (which should amount to zero change). Has the method changed or is it an error due to switching to Wikidata objects? -- Yerpo Eh? 19:27, 4 June 2013 (UTC)

Yes, bewiki only has 2 fewer absent articles, and the page says that it grew by 4.96. --JaviP96 talk me 19:39, 4 June 2013 (UTC)
I looked into slwiki. The scores are correct, they went from 24.30 to 24.26. That would give a growth of -0.04. But I wrote +0.34, so I must have made a mistake somewhere. I'll look into that. Boivie (talk) 19:55, 4 June 2013 (UTC)
The growth number in the table is the difference between March 2013 and June 2013. I'll try to find a way to fix it. Boivie (talk) 20:06, 4 June 2013 (UTC)Reply
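
As a sanity check on the expected month-to-month changes, here is a small sketch. It assumes the 1/4/9 weights quoted elsewhere on this page and a score normalised by 9 * 1000; the function and dictionary names are made up for illustration.

```python
WEIGHTS = {"stub": 1, "article": 4, "long": 9}


def score_change(delta_counts, total=1000):
    """Expected change in score for a change in the per-category counts."""
    raw_change = sum(WEIGHTS[kind] * n for kind, n in delta_counts.items())
    return 100.0 * raw_change / (9 * total)


# slwiki: one page moved from "Article" to "Missing".
slwiki = score_change({"article": -1})

# thwiki: one more stub, one more long article, two fewer articles.
thwiki = score_change({"stub": 1, "long": 1, "article": -2})

print(round(slwiki, 2), round(thwiki, 2))
```

Under these assumptions the slwiki change comes to about -0.04, in line with the figure above, and the thwiki change comes to about +0.02 rather than exactly zero, which is still far from the reported +0.95.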

Maximum score - mission completed!

The Russian Wikipedia was able to get the maximum score of 100 in this rating according to the sample of 1000 articles. The top three Wikipedias had today (June 17 around 14 UTC) the following scores:

# | Wiki | Language | Weight | Mean Article Size | Median Article Size | Absent (0k) | Stubs (< 10k) | Articles (10-30k) | Long Art. (> 30k) | Score | Growth
1 | ru | Русский | 1.4 | 61,055 | 40,734 | 0 | 0 | 0 | 1,000 | 100.00 | +2.20
2 | ca | Català | 1.1 | 45,463 | 36,591 | 1 | 0 | 8 | 991 | 99.46 | +0.40
3 | en | English | 1.0 | 74,004 | 65,374 | 1 | 10 | 140 | 849 | 91.23 | +0.14

Congratulations to all participants who worked hard for many months on the project! -- Ace111 (talk) 18:10, 17 June 2013 (UTC)

Yes, congratulations to the writers at the Russian Wikipedia. They have done impressive work! Boivie (talk) 18:47, 17 June 2013 (UTC)
Good job, I thought it would be another couple of years before this happened. --MarsRover 20:59, 17 June 2013 (UTC)Reply
Congratulations! And to the Catalan Wikipedia as well, the long-standing and stalwart leaders in this game. A. Mahoney (talk) 18:52, 18 June 2013 (UTC)Reply
Five days later, the Catalan Wikipedia became the second to reach the maximum score. Congratulations to them as well! Boivie (talk) 18:43, 22 June 2013 (UTC)Reply
Can we now finally change the weights around like I proposed ages ago so they can have a second try at reaching 100? -- Liliana 20:51, 22 June 2013 (UTC)Reply
Can someone make a bot to count how many Featured articles or Good articles there are in each wiki? Each Featured article or Good article could get more than 9 marks; then, when all the articles are featured articles, the score would become 100.--CaseyLeung (talk) 07:47, 24 June 2013 (UTC)
A problem is that different Wikipedias have different criteria for featured articles. So the stats wouldn't really be comparable. Boivie (talk) 07:51, 24 June 2013 (UTC)

Let's discuss the next stage. At the Catalan wiki (also with the top score) we have 3 different proposals: each wiki promotes its own list and we compile them into a mega list (the extended sample), we work on growing the number of featured articles, or we reach consensus on a longer list.--Barcelona (talk) 09:10, 30 June 2013 (UTC)

There is a bigger list being made here. --V3n0M93 (talk) 16:54, 3 July 2013 (UTC)Reply

How many kilobytes in Tamil equal 10000 characters?

Hi, we are trying to improve the quality of these articles in Tamil Wikipedia. Last month we expanded more than 100 articles above 15 kilobytes. But, based on the 2nd July 2013 update here, it seems only 15 qualified to be listed as articles (10-30k). Can someone explain roughly how many kilobytes in Tamil it will take to surpass the 10000 character definition? Thanks.--Ravidreams (talk) 05:56, 5 July 2013 (UTC)

MediaWiki counts bytes, but this table counts characters, which is a big difference. You'd need to divide the bytes roughly by three to get the amount of characters (it isn't very precise but close enough). -- Liliana 16:59, 5 July 2013 (UTC)Reply
Liliana, thanks for the clarification. It helped. Is this dividing by three applicable only for Roman letters? Because a Tamil text with the same number of characters as English usually has more bytes. Any way to deduce this exact ratio from the weight assigned in the formula?--Ravidreams (talk) 05:44, 6 July 2013 (UTC)
One Tamil character = three bytes. The space and most other Roman characters only take one byte. -- Liliana 07:28, 6 July 2013 (UTC)Reply
Thanks for the clarification. I tested with the help of a text editor and found Tamil characters taking from 3 to 6 bytes, depending on the complexity of their rendering.--Ravidreams (talk) 17:23, 6 July 2013 (UTC)
Note also that Tamil has a "language weight" of 0.9. This means that a Tamil text that says the same thing as a given English text is estimated to be longer than the English text: the English has about 90% as many characters as the Tamil. These weights are estimates based on comparing translations; they're intended to help smooth out some of the differences between languages (how long typical words are, how wordy typical expressions are, things like that). So to get an "article" in the sense of this table, you need 10K "language-weighted characters," or 11,112 actual characters. In Latin we have a gadget that will count the characters in an article, multiply by our weight, and show the result on a page you're editing -- it's quite useful for this purpose. See la:MediaWiki:Gadget-edittools-magnitudo.js -- to adapt it for Tamil you'd change the variable lang_weight near the bottom from 1.1 (Latin's weight) to 0.9 (Tamil's). A. Mahoney (talk) 12:29, 9 July 2013 (UTC)
Thanks for this info, the gadget is really useful. I adapted it for Slovene Wikipedia. For those that are less adept at MediaWiki, the installation process is as follows:
  • Create MediaWiki page "MediaWiki:Gadget-edittools-size.js" and copy the source code of the script there. The part that eliminates interwikis is not needed anymore (example)
  • Modify the script to calculate using your language's weight. The value is defined by var lang_weight near the end.
  • Pick an interface element on your wiki's edit page where the output will be appended after and type its div name under var box parameter (example - I inserted it after the editing tools in slwiki's interface, but a suitable name depends on your wiki's layout).
  • Create MediaWiki page for description that will show up in the user's preferences, should have the same title but without the .js extension (example)
  • Add the gadget's definition to MediaWiki:Gadgets-definition, under the "editing-gadgets" heading (example)
  • Clear the browser's cache and enable the gadget in your user preferences.
Hope this helps. — Yerpo Eh? 13:26, 23 July 2013 (UTC)Reply
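
The byte/character distinction and the language weight discussed in this thread can be sketched as follows. This is only an illustration: it assumes "characters" means Unicode code points and that the weighted size is the character count multiplied by the language weight, as the gadget description above suggests. The function names are made up.

```python
import math


def utf8_bytes(text):
    """Size as MediaWiki-style byte count, assuming UTF-8 storage."""
    return len(text.encode("utf-8"))


def min_chars_for(threshold, lang_weight):
    """Smallest number of actual characters whose weighted count reaches threshold."""
    return math.ceil(threshold / lang_weight)


# The word "Tamil" in Tamil script, written with escapes so the
# code points are unambiguous: 5 code points in total.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

print(len(word), utf8_bytes(word))   # code points vs bytes
print(min_chars_for(10_000, 0.9))    # actual characters needed for an "article"
```

Each of these Tamil code points takes three bytes in UTF-8, while a visible letter can consist of two code points (a consonant plus a vowel sign), which is consistent with the 3 to 6 bytes per letter reported above.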

Language weighting of some different Wikipedias

Currently zh-classical has the same language weight as Chinese. However, texts in Classical Chinese are shorter than those in Modern Standard Chinese. There is a Classical Chinese version of the "Tower of Babel" at http://www.omniglot.com/babel/wenyan.htm, but the text there is just a description of the story copied from the zh-classical Wikipedia rather than a translation of the Bible text. The actual Classical Chinese translation of the text can be found at s:zh:聖經 (文理和合)/創世記#第十一章. Calculated using that translation, the ratio for Classical Chinese should be 6.3. However, the http://www.omniglot.com/babel/wenyan.htm site carries Min Dong (Eastern Min) and Hakka versions of the text at the bottom of the page, which seem to be ordinary translations of the text and can be used to calculate the weights for these two Wikipedias, which currently have no data. And using the samples provided there, Cantonese's ratio should be roughly 3.4 instead of the current 3.7 inherited from the Chinese Wikipedia.

P.S. For the Mandarin Chinese Wikipedia, I found that the text used to calculate Chinese's language ratio incorrectly included a line of notes, (就是变乱的意思), in the calculation; deleting it makes the character count of the passage drop from 306 to 297, and thus the ratio would increase from 3.7 to 3.9. However, the text has been re-translated (for a more modern use of language) by the original publisher of that version, and the revised translation, after removing the notes etc., has 313 characters for that passage, so I think the Chinese Wikipedia could keep the ratio of 3.7.[1]

And Simple English version of those text should be able to be found at [2] C933103 (talk) 03:23, 23 July 2013 (UTC)Reply

A lot of articles will contain Latin letters weighted as Chinese letters (e.g. "The Beatles" article usually lists the songs/albums in English). Or you can have references with URLs in English. Basing the weighting off a paragraph that has absolutely no non-native characters is sort of inaccurate to begin with. And then Classical Chinese may be fine for an older text, but wouldn't it need wordy phrasing for a modern topic? I sort of doubt the 6.3 weight for Classical Chinese, but if you have a valid way to calculate the character count in the "Tower of Babel" we can change it. --MarsRover 08:43, 23 July 2013 (UTC)
Are there technical ways to separate the weighting of Latin/numeric characters from other characters? As for the phrasing in Classical Chinese, I checked some GA articles like Big bang theory, Global warming and ARIA (manga), and think that wouldn't be a problem except for some proper nouns that are directly transliterated, although I am personally a bit suspicious about whether they should be GA.... The method of validating the weighting has been mentioned above. C933103 (talk) 01:20, 26 August 2013 (UTC)
With this method of determining article size (simple character count), there isn't. However, I think it wouldn't be too hard to program a subscript that looks for latin characters and handle those appropriately. — Yerpo Eh? 19:48, 26 August 2013 (UTC)Reply
An even greater problem: unless I'm wrong, wiki mark-up is not excluded from the character count, which means wikitables and other character-intensive forms of mark-up give languages such as Russian and Chinese a massive in-built advantage. For example, the wiki mark-up in this small table is worth 799.2 characters in Chinese, but only 216 characters in English! So the Chinese article basically gets 583.2 characters for free. And this is just one small table, the problem is far greater if you take all the wiki mark-up in the article into account. Something must be done to correct this.--Leptictidium (talk) 07:47, 5 September 2013 (UTC)Reply
Yes, that's true. A math-heavy article with a lot of formulas has the same problem. Like Yerpo said, we could have two character counts: one for all Latin or symbol characters (e.g. ASCII < 128) and one for the rest (e.g. ASCII >= 128). We could then set a rule that limits the language weight used for the first count. I think it would solve the problem. But it probably would be unpopular, since it's hard to calculate manually and would adversely affect existing wiki scores. --MarsRover 17:03, 5 September 2013 (UTC)
Applying the rule at least to the wiki mark-up shouldn't be too controversial, since all Wikipedias use the same mark-up and would therefore be equally affected. Leptictidium (talk) 17:23, 5 September 2013 (UTC)
I agree it's fair and should be implemented, but I expect it to be controversial nonetheless, because it would make a lot of Wikipedias lose a large part of their scores. — Yerpo Eh? 08:49, 10 September 2013 (UTC)Reply
Bumping the topic... Do we want a character count that is accurate, or one that maintains artificially inflated scores for certain wikis? It's not really fair for wikis using the Latin script.--Leptictidium (talk) 11:00, 2 January 2016 (UTC)Reply
I still agree with this change, but I'm afraid that someone else will have to program it. — Yerpo Eh? 19:23, 4 January 2016 (UTC)Reply
@MarsRover: Would your ASCII <> 128 suggestion be hard to program? --Leptictidium (talk) 11:06, 5 January 2016 (UTC)Reply
Lots of languages use the latin a-z characters for their normal text (ascii 65-90, 97-122). That text should be weighted differently depending on language. Boivie (talk) 13:25, 5 January 2016 (UTC)Reply
Reviving the discussion after almost a decade, @MarsRover and Yerpo: What about counting by bytes instead of by the number of characters? As the UTF-8 encoding of a Chinese character is usually 3 bytes in size, while each Latin character is only 1 byte, counting by bytes can cancel the extreme shortening effect of languages that use Chinese characters.
Also, Wikipedias in Chinese language families that use Latin script, like hak and cdo, should probably follow Min-Nan's ratio (1.2), since they follow a similar romanization method and belong to the same language family. C933103 (talk) 22:34, 28 March 2022 (UTC)
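
The three ways of measuring size discussed in this thread can be compared with a short sketch. It is illustrative only: the cap of 1.0 on the weight for ASCII characters is an assumed value for the "rule that limits the language weight", the function names are made up, and the 216-character string merely stands in for the table mark-up mentioned above. As noted in the thread, an ASCII cap would also affect the ordinary text of Latin-script languages whose weight is above the cap.

```python
def weighted_chars(text, lang_weight):
    """Simple approach described above: every character gets the language weight."""
    return len(text) * lang_weight


def split_weighted_chars(text, lang_weight, ascii_cap=1.0):
    """Two counts: ASCII characters (< 128) use a capped weight, the rest the full weight."""
    ascii_count = sum(1 for ch in text if ord(ch) < 128)
    other_count = len(text) - ascii_count
    return ascii_count * min(lang_weight, ascii_cap) + other_count * lang_weight


def utf8_bytes(text):
    """Byte-based size, as in the most recent suggestion."""
    return len(text.encode("utf-8"))


markup = "|" * 216  # stand-in for 216 characters of ASCII-only wiki mark-up

print(weighted_chars(markup, 3.7))        # about 799.2 with the Chinese weight
print(split_weighted_chars(markup, 3.7))  # 216.0 once the ASCII weight is capped
print(utf8_bytes("美"), utf8_bytes("a"))   # 3 bytes vs 1 byte
```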

Tyva Wikipedia

Can you add Tyva Wikipedia? It was opened in August 2013, but there are still no statistics for it.--Soul Train (talk) 17:07, 5 October 2013 (UTC)

Someone needs to update the "pywikimedia" framework first. It is still missing from the list of wikipedias. --MarsRover 17:45, 6 October 2013 (UTC)Reply
It's sad that a Wikipedia that has existed for 4 months hasn't yet been added to the list.--Soul Train (talk) 14:16, 6 December 2013 (UTC)
What is "pywikimedia"? And which version are you using? — Ace111 (talk) 02:28, 15 December 2013 (UTC)Reply
The current source code references this library: http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia I am not sure why, but "tyv" is not in the families.py files or other files. It might be that the library has been replaced by another (better?) library. Also, look at this: https://meta.wikimedia.org/wiki/Pywikipediabot This is where I found the original library. -MarsRover 22:31, 15 December 2013 (UTC)
tyv-wiki was activated on 13 September with gerrit:84114. Pywikibot has moved from svn to git. Please read mw:Manual:Pywikibot/Installation for how to migrate your local working copy to the new repository.  @xqt 13:14, 23 December 2013 (UTC)
Thanks, this solved the problem. I converted the scripts to use pywikibot instead of pywikimedia and Tyva now works. --MarsRover 22:35, 1 January 2014 (UTC)Reply
Hoorah! Thanks, waiting for update :-)--Soul Train (talk) 20:32, 4 January 2014 (UTC)Reply

Update for December 2013

All the Wikipedias are missing d:Q11464 (en:Plough) this month because someone made a merge in the wrong direction on Wikidata a few days ago. — Yerpo Eh? 14:04, 6 December 2013 (UTC)Reply

But it's been fixed, so no real damage done. A. Mahoney (talk) 13:45, 9 December 2013 (UTC)Reply
I just explained it for people who might be wondering why their Wikipedia misses one (extra) article. — Yerpo Eh? 12:11, 10 December 2013 (UTC)Reply

List of Wikipedias by sample of articles per area of knowledge

I'm sorry if it has already been proposed (I don't know): How about ranking the language versions of Wikipedia by fields of study, something like making lists per field, each one with the 500(?) most important articles about each of some areas like Physics, Chemistry, Mathematics, Biology, Philosophy, Geography, Psychology, etc... So people can see the priorities of each project, and it can also help to track the progress in a more detailed way. Just a suggestion. Best wishes, Chen10k2 (talk) 05:07, 14 March 2014 (UTC)Reply

Interesting. One way could be using each sub-page of List of articles every Wikipedia should have/Expanded as a field of study. Boivie (talk) 06:32, 14 March 2014 (UTC)Reply

This has indeed been proposed a few times before (see for example "What about sportswomen and sportmen?"), but nobody was persistent enough to actually do it so far. — Yerpo Eh? 08:59, 14 March 2014 (UTC)Reply

perhaps we should start working on it... starting from categories?--Barcelona (talk) 12:50, 23 March 2014 (UTC)Reply
Multiple lists for each category are a bit overwhelming. I worked on an idea a few years ago where the extended list was broken into six broad categories. Each wiki could see how it stacks up using the six categories. It also included every category as a possible best achievement for the wiki. This is a way to get all the information that seemed interesting into one page. (https://en.wikipedia.org/wiki/User:MarsRover/Sandbox#huh). The broad categories seem to mirror the overall scores, so they are not so interesting. But the smaller categories were interesting for seeing which wiki excelled at what topics. --MarsRover 17:02, 24 March 2014 (UTC)
I like that model. The list is already broken into general areas (10 of them: see the headings in the TOC); that might be the simplest way to categorize articles -- not that it's perfect but that it's easily visible. How hard would it be to piggy-back this calculation onto the existing one? A. Mahoney (talk) 12:26, 25 March 2014 (UTC)Reply

Stub size (once more?)

I am a very rare visitor here, so forgive me if I bring up a topic you have already had. I am just preparing an overview of African-language Wikipedias for the WikiIndaba in June. I notice that bot-generated articles are often VERY short: a few hundred bytes, just one sentence, sometimes only 2-5 words and maybe an image. On the other hand, there are short articles which may have a few thousand bytes (far less than 10,000) but have seen real work: you need to check your vocabulary, think about the right links, structure, etc. A full printed A4 sheet will have around 4,000 characters excluding spaces. That can be a lot of structured info. I do not see the point of lumping these together with those one-liners that just say: "Hey, there is also something called XYZ!" Kipala (talk) 10:13, 24 May 2014 (UTC)

This has been discussed in the archives since 2008. Having watched this competition for several years, this may seem unfair, but I am against making the scale more granular. Essentially we would be providing tiny rewards for tiny amounts of work. I am not sure that really motivates people. There are plenty of articles that really should be stubs. How many wikis just have a big table for their "Periodic Table" article? When you have a thousand articles, it all averages out. --MarsRover 07:42, 25 May 2014 (UTC)
It should also be noted that this number includes everything - actual readable text plus wiki-markup, references, infoboxes, categories etc. Excluding technicalities, the actual encyclopedic content needed to go above the "stub" designation will in some cases amount to less than 5000 characters without spaces. That, in my opinion, is not too harsh. An alternative to granulation (which I agree doesn't make much sense) would be to count pages with less than, say, 200 characters as non-existing. Arguably, such a page with a line of text and a picture can hardly be called an encyclopedic article or even a lexicon entry, so it would make sense. Of course editors could still circumvent this new feature by adding templates, categories, and such, but that is true for every threshold we use. — Yerpo Eh? 10:10, 25 May 2014 (UTC)Reply
Yerpo, I support your idea. Yes, one-liners (less than 100 or 200 characters) should not be counted. A number of wikis do not accept them anyway (at swwiki we try to keep them out), but a number of wikis have these tiny non-articles. E.g. I found a lot in yowiki. It has 31k content but only 5.4k in the "alternate article count" (Articles that contain at least one internal link and 200 (ja,ko,zh:50) characters readable text, disregarding wiki- and html codes, hidden links, etc.; also headers do not count). On swwiki the relation is 28k/24k, but yowiki will always be on top. Why is the easily accessible count so misleading and the other one so difficult to find? And why is this misleading table put in such a prominent position? 31.59.81.84 13:08, 3 June 2015 (UTC)

I was curious, so I ran the script again after this month's update, set so that it counted pages with less than 200 characters as absent. About two thirds of Wikipedias would lose at least 0.01 point on this account, with 27 losing 0.5 or more, and four losing over 1 point. The infamous record goes to :pnbwiki where 3.31 points are due to nano-stubs like that (which is nicely indicated by their average and median sizes). — Yerpo Eh? 19:38, 5 June 2015 (UTC)Reply
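
A minimal sketch of the kind of rule tested above, not the actual script. It assumes the 200-character cutoff applies to the unweighted character count, that the other thresholds apply to the language-weighted count, and that the category boundaries follow the table headings on this page (stubs under 10k, articles 10-30k, long articles above 30k). The function name is made up.

```python
def classify(raw_chars, lang_weight, min_chars=200):
    """Return 'absent', 'stub', 'article' or 'long' for one page."""
    # Missing pages and nano-stubs below the cutoff both count as absent.
    if raw_chars < min_chars:
        return "absent"
    weighted = raw_chars * lang_weight
    if weighted < 10_000:
        return "stub"
    if weighted <= 30_000:
        return "article"
    return "long"


print(classify(150, 1.0))     # a one-liner
print(classify(11_112, 0.9))  # just past the "article" threshold at weight 0.9
```

Setting min_chars to 1 gives back a rule in which any existing page counts at least as a stub.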

That speaks for changing to that. I see two reasons for nano-stubs: inexperienced users trying to start an entry which stays as a very short definition, and very experienced users who work with an eye to rankings like this and found out that even one-word entries help your statistics. Changing to the alternative count will discourage that behaviour and encourage work on mini stubs. And if we use an alternative count, use the same one as Erik Zachte does, because it is not very helpful to find different scales everywhere. Kipala (talk) 04:42, 12 June 2015 (UTC)

Please help updating this

The list is supposed to be updated on 1 August, but it has still not been updated after one-third of the month. Would somebody help update it? Thank you very much. --Yaukasin (talk) 02:47, 10 August 2014 (UTC)

I haven't got enough time this month, hopefully someone else can install mw:Manual:Pywikibot and run the scripts at List of Wikipedias by sample of articles/Source code. Boivie (talk) 05:56, 11 August 2014 (UTC)Reply
I'm afraid this is not as simple as just running the script and getting results. I think someone needs to have some additional lists or data (i.e. ItemList.txt for getting the list of articles), and MarsRover has done this before, so if he could make these data/lists publicly available, someone else could update it too. --C3r4 (talk) 08:45, 11 August 2014 (UTC)
ItemList.txt is created when running the script GetItemList.py. I wouldn't say it's easy to set up the Pywikibot, and run the scripts as expected. But it can be done, without any other lists or data. Boivie (talk) 13:07, 11 August 2014 (UTC)Reply

Any word from MarsRover? Did he give up on this or is he just on vacation? — Yerpo Eh? 07:36, 13 August 2014 (UTC)Reply

I updated it. But I'll probably need to hand it off to someone else in the future. --MarsRover 08:07, 15 August 2014 (UTC)Reply
So if one is interested in taking over (theoretically), how much effort does it actually take to build the neat tables on these pages aside from running the script(s) and copying the output? — Yerpo Eh? 09:49, 17 August 2014 (UTC)Reply
2 days to run the scripts and about 30 minutes to copy the output. That's basically all there is to do. I've been meaning to make this a bot process, but alas I have no time. --MarsRover
I've been struggling to run GetItemList.py, is the version at List of Wikipedias by sample of articles/Source code the most recent one? In particular, the "wikipedia" module for Python has a very different set of attributes than what's called by the script (for example, the line meta_wiki = wikipedia.site('meta', 'meta') doesn't work because the "site" attribute doesn't exist). Or is there something else that I've missed? — Yerpo Eh? 13:33, 31 August 2014 (UTC)Reply
Seems they weren't updated after the last big change of pywikibot. Should be fixed now. Boivie (talk) 16:00, 31 August 2014 (UTC)Reply
Oh, so it was a pywikibot attribute. Works like a charm now, thanks. :) — Yerpo Eh? 16:47, 31 August 2014 (UTC)Reply

Okay then, I will run the script for this month (probably next weekend) and see how it goes. If everything works smoothly, I can take over the updating at least for a while, but my Python skills are rather rudimentary, so I'll have to ask for help in case of problems in the future (like the abovementioned Pywikibot update). — Yerpo Eh? 09:02, 1 September 2014 (UTC)Reply

Thanks, Yerpo! And good luck. A. Mahoney (talk) 16:20, 2 September 2014 (UTC)Reply
Thanks for Yerpo's update--Wolfch (talk) 14:20, 3 September 2014 (UTC)Reply

It did go smoothly; there were only a few minor glitches, listed here in case somebody would like to try solving them.

  • NoSuchSite results for only a few sites; other locked wikis caused the script to keep retrying, so I had to cancel it myself (like :aawiki).
  • weird result for :mowiki - 974 missing, 1 stub and nothing else (same as last month)
  • there's a case or two where a page is counted absent for too much English, but it's mostly because the page in question is a stub with a lot of English-language sources (example: th:เอกภพ).
  • for :warwiki, it says that "human" is missing because "animalia" is a redirect, but d:Q5 lists war:Tawo which is a more or less legitimate page. It had some broken redirects there until I removed them right now, but they didn't have anything to do with "animalia".

Yerpo Eh? 07:52, 7 September 2014 (UTC)Reply

Weight of ba.wiki

Hi, can you change weight of ba.wiki? Bashkir on Cyrillic has 805 characters. -- Регион102 (talk) 06:46, 17 August 2014 (UTC)Reply

I count 824 characters in WinWord statistics. But it is still a 1.4 weight when you round the number. Whoever does the calculation please update the script. --MarsRover 03:16, 5 September 2014 (UTC)Reply
The new weight was applied. — Yerpo Eh? 07:56, 7 September 2014 (UTC)Reply
Thanks! -- Регион102 (talk) 11:32, 7 September 2014 (UTC)Reply

Wikipedia Maithili

I am surprised that mai.wp came back as a NoSuchSite error. It was created on 6 Nov 2014 and it appears to be active. Does the Python wikipedia library still need to be updated? --MarsRover 18:43, 20 February 2015 (UTC)Reply

I would guess so, yes. I added the language code, but the script returns the error month after month. — Yerpo Eh? 07:34, 21 February 2015 (UTC)Reply

maiwiki was finally picked up this month. It seems they've been quite busy in the meantime. I know this score represents half a year's worth of growth, but I decided to keep it for encouragement. — Yerpo Eh? 20:30, 5 June 2015 (UTC)Reply

why 2 lists???

Why is there this list and, separately, List_of_articles_every_Wikipedia_should_have? I just came across handicap, which is not in the other list (and in this case bullshit anyway, as it is linked to a disambiguation page; same for "state"). Kipala (talk) 19:16, 9 March 2015 (UTC)Reply

This list ranks Wikipedias by how large their articles from the other list are, it's a kind of a competition. As you said, including a disambiguation would not make sense, that is why disability is included instead. — Yerpo Eh? 20:29, 9 March 2015 (UTC)Reply
The item names at List of Wikipedias by sample of articles/Absent Articles come from the labels at Wikidata. On sw:Wikipedia:Makala za msingi za kamusi elezo those names are wrongly assumed to be article names at the English Wikipedia. Boivie (talk) 20:45, 9 March 2015 (UTC)Reply
Well, on sw the entry is just there for practical reasons, to have a list to work on. So it links to English WP (which probably has nearly all of those 1000) as an entry point to how the topic is dealt with in the other languages. I pick entry by entry and try to make a Swahili article on each topic. Kipala (talk) 20:23, 15 March 2015 (UTC)Reply
That's the idea, yes, you should just make sure first that the links are correct in your local list - i.e. linking to en:Disability instead of en:Handicap, for example. All the links should be checked through their Wikidata pages. — Yerpo Eh? 05:53, 16 March 2015 (UTC)Reply

Two new wikipedias (Northern Luri and Goan Konkani)

Need to add these to the script and probably have to update the library code:

--MarsRover 07:33, 24 June 2015 (UTC)Reply

Yup, I get "does not exist in Wikipedia family" for :lrc and :gom. Is there someone to be pinged about updating the library? — Yerpo Eh? 07:27, 27 June 2015 (UTC)Reply
Once tried to report the issue on the newsgroup for the python wikipedia library (mailing list here) but since it was moderated not sure anyone even saw the message. Btw, moderating the reporting of bugs is pretty asinine in the open source development world. Sort of gave up after seeing that. --MarsRover 21:16, 2 July 2015 (UTC)Reply

please clarify

In the list, the Malayalam Wikipedia's Jesus Christ article (യേശു) has a size of more than 40,000, but it is not included among the long articles. Why?--AJITH MS (talk) 06:42, 15 July 2015 (UTC)Reply

This list is based on character count, not byte size. ml:യേശു currently has 21,819 characters plus some comments that are ignored, which, multiplied with the factor 1.1 is 24,001 - less than 30,000. Each character of the Malayalam script takes more than one byte to write, so the size reported in page history is about twice the character count. — Yerpo Eh? 08:59, 15 July 2015 (UTC)Reply
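
The difference between character count and byte size can be checked directly. The following is only a minimal sketch: the helper name `weighted_length` is hypothetical, and stripping HTML comments with a regular expression is an assumption about what "comments that are ignored" means; the actual script may clean the text differently.

```python
import re


def weighted_length(wikitext: str, weight: float) -> float:
    """Character count with HTML comments removed, times the language weight.

    Hypothetical helper for illustration; not taken from the published script.
    """
    without_comments = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)
    return len(without_comments) * weight


title = "യേശു"
chars = len(title)                      # Unicode code points
utf8_bytes = len(title.encode("utf-8"))  # bytes as stored in UTF-8

print(chars, utf8_bytes)

# The figures quoted above: 21,819 characters with weight 1.1.
print(round(weighted_length("x" * 21819, 1.1)))
```

Each Malayalam letter takes three bytes in UTF-8, while wiki markup and Latin text take one byte per character, which is why the byte size of a whole page ends up at roughly twice its character count rather than three times.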

 Thanks!--AJITH MS (talk) 19:17, 15 July 2015 (UTC)Reply

How to see?

Please help me to see the lists of stubs and of articles (10k-30k) that "List of Wikipedias by sample of articles" counts for the Malayalam Wikipedia.--AJITH MS (talk) 20:04, 15 July 2015 (UTC)Reply

If you want to see which articles in Malayalam Wikipedia are shorter or longer, you'll need to make your own list. Here's an example: la:Usor:Amahoney/1000 Paginae Epitome. Here's how I did it: start with List of articles every Wikipedia should have from here in Meta, which is the list of articles. For each one, go to Wikidata and look in the Wikidata table for the title of the article in your Wiki (if it's not there, then you have a missing article). Then for each title, get the article from your own Wiki and count up how long the text is, ignoring comments; multiply by the "language weight" for your language to get the size as it's calculated here. I wrote programs to do all this; it takes about half an hour to run these programs over Latin Vicipaedia, which has all 1000 of the articles. A. Mahoney (talk) 14:26, 16 July 2015 (UTC)Reply
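
The last step of the procedure above (weighting each character count and sorting the articles into size classes) can be sketched as follows. This is an illustration only: the 10,000 and 30,000 limits are the ones mentioned in this discussion, but how the exact boundary values are treated is an assumption, and the names `classify`, `STUB_LIMIT` and `LONG_LIMIT` as well as the sample data are hypothetical.

```python
from collections import Counter
from typing import Optional

STUB_LIMIT = 10_000  # weighted characters; boundary handling is an assumption
LONG_LIMIT = 30_000


def classify(char_count: Optional[int], weight: float) -> str:
    """Return the size class of one article; None means no local article."""
    if char_count is None:
        return "absent"
    size = char_count * weight
    if size < STUB_LIMIT:
        return "stub"
    if size < LONG_LIMIT:
        return "article"
    return "long"


# Made-up character counts for four list entries, comments already removed.
counts = {"entry A": 21819, "entry B": 4000, "entry C": None, "entry D": 40000}
summary = Counter(classify(n, 1.1) for n in counts.values())
print(summary)
```

Collecting the titles per class instead of only counting them gives the kind of per-wiki list shown in the Latin example.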

 Thanks!--AJITH MS (talk) 14:33, 17 July 2015 (UTC)Reply

For a very rough approximation, you can say (size in bytes / 2 = size for this list), but all borderline sizes will be uncertain. If you know how to use Python, you can install pywikibot and simply run the script yourself, it is published here; it enables you to process :mlwiki only, and the script will create a txt file with character counts for all the articles. It doesn't take more than a few minutes to run. If nobody from :mlwiki's community can use Python, I can copy you the output I got, but it won't include changes after 5 July. — Yerpo Eh? 05:36, 17 July 2015 (UTC)Reply

 Thanks!--AJITH MS (talk) 14:33, 17 July 2015 (UTC)Reply

update delayed

Does anyone know why the July update is taking longer than usual? --Chairego apc (talk) 17:45, 8 July 2016 (UTC)Reply

I was unavailable this week. Updating now. — Yerpo Eh? 08:14, 9 July 2016 (UTC)Reply
Thanks. --Chairego apc (talk) 19:06, 10 July 2016 (UTC)Reply

olo

Don't forget about new Wikipedia - olo. (Livvi-Karelian language).--Soul Train (talk) 11:31, 21 October 2016 (UTC)Reply

It will be included in the next update, thanks for the heads-up. — Yerpo Eh? 17:31, 22 October 2016 (UTC)Reply

Correction for Konkani (gom)

The label used in this list for gom - ' गोवा कोंकणी / Gova Konknni ' is incorrect. Please could you update the scripts used to generate List of Wikipedias by sample of articles and List of Wikipedias by expanded sample of articles, so that the local name is preferably ' कोंकणी / Konknni '. Alternatively, if you prefer the longer form, then you could make it ' गोंयची कोंकणी / Gõychi Konknni '. Thanks! The Discoverer (talk) 19:27, 10 May 2017 (UTC)Reply

I refer you to the community discussion of July 2015 on this topic, where the conclusion was that the name for Konkani (gom) should be ' कोंकणी / Konknni '. In case some disambiguation is required, then ' गोंयची कोंकणी / Gõychi Konknni ' can be used. The Discoverer (talk) 18:01, 24 May 2017 (UTC)Reply
Thank you for the corrections. :) The Discoverer (talk) 17:21, 6 June 2017 (UTC)Reply

Three new Wikis

There are three new languages: Ingush (inh), Gorontalo (gor) and Lingua Franca Nova (lfn). Please add them to the list. Orbwiki107 (talk) 18:14, 13 May 2018 (UTC)Reply

Thanks, Orbwiki107, I added them in the current run, but I was getting the "UnknownSite: Language '[code]' does not exist in family wikipedia" error message in test runs, so the score won't be calculated. I now recall that these codes should be added to some master table so they can be detected by Pywikibot; do you know where to ask for that? — Yerpo Eh? 15:51, 5 June 2018 (UTC)Reply

Update

Can someone update this? -Theklan (talk) 11:47, 9 November 2018 (UTC)Reply

I'm having some problems with the script, but I'll try to do it before the end of the weekend. — Yerpo Eh? 07:05, 10 November 2018 (UTC)Reply
Thanks--Wolfch (talk) 08:42, 14 November 2018 (UTC)Reply
Is there any progress?--Nickispeaki (talk) 07:18, 18 November 2018 (UTC)Reply
I haven't solved the problem. Update (at least) for November may have to be skipped. — Yerpo Eh? 08:52, 20 November 2018 (UTC)Reply

I ran the script again, and this time it went almost OK, so I updated the list, even though it's missing one entry. I've no idea why it couldn't read Q902. — Yerpo Eh? 21:05, 20 November 2018 (UTC)Reply

@Yerpo: Some of the wikidata pages have grown rather large, to the point that retrieving usual sets of 50 data pages becomes a problem. Reducing the “article_group” size in fn. “GetIwLinks” to for example 10, might solve your problem. I am running an updated version of your script, which does not have an issue with Q902. --Dcirovic (talk) 22:55, 20 November 2018 (UTC)Reply
Ooh, good call. A bunch of data on GDP and life expectancy was indeed dumped in items about countries last month, and I didn't realize the script read the whole item. — Yerpo Eh? 05:54, 21 November 2018 (UTC)Reply
@Yerpo: @Dcirovic: I had some problems with the script, mainly because of different Python versions. Could you update this page on a regular basis? -Theklan (talk) 19:05, 9 December 2018 (UTC)Reply
@Dcirovic: thanks for the December update, and congrats to the :srwiki community for achieving 100%. I ran the script today before I saw that you already did it. Reducing the article group size to 10 did the trick, but not 25 - some new solution will be needed if Wikidata pages continue to grow so fast. — Yerpo Eh? 21:18, 11 December 2018 (UTC)Reply
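
For anyone adapting the script, the effect of the group size on the number of requests can be sketched like this. The helper `in_groups` and the item IDs are hypothetical and only illustrate the batching idea; the real GetIwLinks function is not reproduced here.

```python
from typing import Iterator, List, Sequence


def in_groups(items: Sequence[str], group_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most group_size items."""
    for start in range(0, len(items), group_size):
        yield list(items[start:start + group_size])


# Placeholder IDs standing in for the 1000 Wikidata items on the list.
item_ids = [f"Q{n}" for n in range(1, 1001)]

groups = list(in_groups(item_ids, 10))
print(len(groups))                          # number of requests with groups of 10
print(len(list(in_groups(item_ids, 50))))  # number of requests with groups of 50
```

Smaller groups mean more requests, but each response carries fewer large items, which is the trade-off described above.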

Error score

I notice that Catalan got the maximum score even though it had an article that has too much untranslated English. Maybe the score should be calculated differently? Boivie (talk) 13:45, 21 February 2019 (UTC)Reply

Really? Have you seen the mentioned articles? I don't see any English text to translate, and most of it is actually references and bibliography, correctly formatted.--Manlleus (talk) 20:02, 12 July 2021 (UTC)Reply
You should note that Boivie's comment is over two years old. — Yerpo Eh? 06:02, 14 July 2021 (UTC)Reply

Oceania

Why did Oceania change to Australia and Oceania in the count? -Theklan (talk) 08:23, 6 August 2021 (UTC)Reply

Somebody changed the list without discussion (see talk), which was overlooked. It's reverted now, so next month's update should be normal again in this regard. — Yerpo Eh? 13:16, 6 August 2021 (UTC)Reply

Sigh. Some people apparently decided to swap the focus of Insular Oceania (Q538) and Oceania (Q55643), so now the earlier one linked from this list refers to "insular Oceania". Now it's a mess. Should we keep Q538 here or change? — Yerpo Eh? 16:45, 16 August 2021 (UTC)Reply

Oceania and more

Likewise, it is indicated that there are NINE articles between 10 and 30 kb when in fact there are only FOUR. Oceania is 88 kb.

Please confirm these two items. Thank you very much. Adolfobrigido (talk) 21:42, 22 August 2021 (UTC)Reply

I know this is an old comment, but I just want to note that this page says nothing about the size in kB. The classification is made by number of characters weighted by a language factor. Boivie (talk) 09:00, 30 March 2022 (UTC)Reply

Cornish (Kernowek) wikipedia

Can anyone explain to me why Cornish wikipedia shows up as having 344 absent articles in the list on this page, yet a count on our list of the 1,000 articles to be completed https://kw.wikipedia.org/wiki/Wikipedia:1,000_erthygel seems to indicate only c260 needing completion. Am I missing something, do some articles not count, has the list changed or is the bot not picking up on some articles, or does our list miss out on some articles needed? I'm not very technically minded so don't know what to look for to improve our list. Thanks for any comments in advance. Brwynog (talk) 16:30, 8 November 2022 (UTC)Reply

I don't think this page is updated using a bot. Just look at the editing history.-- Ideophagous (talk) 13:07, 22 November 2022 (UTC)Reply
@Brwynog: I didn't check everything so I'm not sure it's the complete explanation, but your local list seems to differ at least a bit from the List of articles every Wikipedia should have. Only those articles on the Meta list are counted by the automated script, you might want to update your local list. — Yerpo Eh? 14:23, 22 November 2022 (UTC)Reply

Changing color code to Viridis

Hello! I would like to propose changing the color code to Viridis, which is more accessible for color-blind people (en:Color_blindness#Ordered_Information). Also, I have made a small change in the scale. @Yerpo: would you be able to add this to the code?

       #color code score
       if score >= 100.00:
           color = "|style = \"background: "+'\u0023'+color10000+"\""
       elif score >= 80.00:
           color = "|style = \"background: "+'\u0023'+color8000+"\""
       elif score >= 60.00:
           color = "|style = \"background: "+'\u0023'+color6000+"\""
       elif score >= 40.00:
           color = "|style = \"background: "+'\u0023'+color4000+"\""
       elif score >= 30.00:
           color = "|style = \"background: "+'\u0023'+color3000+"\""
       elif score >= 20.00:
           color = "|style = \"background: "+'\u0023'+color2000+"\""
       elif score >= 10.00:
           color = "|style = \"background: "+'\u0023'+color1000+"\""
       elif score >= 5.00:
           color = "|style = \"background: "+'\u0023'+color500+"\""
       elif score >= 1.00:
           color = "|style = \"background: "+'\u0023'+color100+"\""
       else:
           color = "|style = \"background: "+'\u0023'+color0+"\""

And then:

#score colors
color10000 = '440154'
color8000 = '472d7b'
color6000 = '3b528b'
color4000 = '2c728e'
color3000 = '21918c'
color2000 = '28ae80'
color1000  = '5ec962'
color500  = 'addc30'
color100  = 'fde725'
color0    = 'EFEFEF'

I can change it in the source code, but I think you have a local copy so the changes should also be done there. Theklan (talk) 08:52, 5 January 2023 (UTC)Reply

@Theklan: this scheme lacks the "higher score = hotter" symbolism, but I'll consider. However, most text colors should also be changed here, is there a complementary scale of gray shades or do I just make them either white or black? — Yerpo Eh? 08:37, 6 January 2023 (UTC)Reply
Well, it is darker every step, but if you want, there are other color schemes here: https://waldyrious.net/viridis-palette-generator/. Plasma could be interesting. The best part of these schemes is that they are also logical for color-blind people, because every step is darker than the previous one. Theklan (talk) 12:49, 6 January 2023 (UTC)Reply
I'll play with it until next month. Didn't have time today. — Yerpo Eh? 17:37, 6 January 2023 (UTC)Reply
Great! No rush with this. Theklan (talk) 09:56, 7 January 2023 (UTC)Reply
I wanted to thank you for the change. I think this is much more clear now, and the categories are more evident. Theklan (talk) 15:55, 5 March 2023 (UTC)Reply
You're welcome. Let's see if there's any comments from others. — Yerpo Eh? 17:13, 5 March 2023 (UTC)Reply
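
For reference, the threshold chain and the text-color question from this thread can be sketched as a lookup table plus a contrast check. This is not the code that was put into the script: the thresholds and hex values are copied from the proposal above, while the table layout, the function names and the choice of black or white text by WCAG contrast ratio are assumptions made for illustration.

```python
# (minimum score, background color), highest threshold first.
SCORE_COLORS = [
    (100.0, "440154"),
    (80.0, "472d7b"),
    (60.0, "3b528b"),
    (40.0, "2c728e"),
    (30.0, "21918c"),
    (20.0, "28ae80"),
    (10.0, "5ec962"),
    (5.0, "addc30"),
    (1.0, "fde725"),
]
DEFAULT_COLOR = "EFEFEF"


def background_for(score: float) -> str:
    """Return the color of the first threshold that the score reaches."""
    for threshold, color in SCORE_COLORS:
        if score >= threshold:
            return color
    return DEFAULT_COLOR


def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of an sRGB color given as 'rrggbb'."""
    linear = []
    for i in (0, 2, 4):
        c = int(hex_color[i:i + 2], 16) / 255
        linear.append(c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def text_color_for(background: str) -> str:
    """Pick white or black text, whichever contrasts more with the background."""
    lum = relative_luminance(background)
    contrast_with_white = 1.05 / (lum + 0.05)
    contrast_with_black = (lum + 0.05) / 0.05
    return "FFFFFF" if contrast_with_white > contrast_with_black else "000000"


def cell_style(score: float) -> str:
    """Build a wikitable cell style with background and text color."""
    bg = background_for(score)
    return f'|style = "background: #{bg}; color: #{text_color_for(bg)}"'


print(cell_style(100.0))
print(cell_style(3.2))
```

With a table like this, changing the palette means editing only the list of pairs, and the text color follows automatically.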

How do I update the list of targets?

I apologize for using machine translation to say this. Looking at the source code, it looks like the target is hard-coded, but is there a place that manages it separately? How are the following changes made?

  • Add: anp, awa, fat, gpe, guc, shi, tly
  • Change?: be-x-old to be-tarask

Thank you for your help. --Amayus (talk) 23:42, 16 September 2023 (UTC)Reply

@Amayus: thanks, I'll add the missing languages in the next update. As for be-x-old/be-tarask, I tried changing that before, but it returns an error, so apparently the old code remains in the part of the system that the script depends on. — Yerpo Eh? 11:16, 18 September 2023 (UTC)Reply
Thank you for your reply! And I understand the current status of be-x-old/be-tarask. najlepša hvala. --Amayus (talk) 11:03, 19 September 2023 (UTC)Reply
@Yerpo: It appears that 'fon' (Fon Wikipedia) has been newly added. Probably already updating the list this month, so it would be greatly appreciated if you could include it next month or later.--Amayus (talk) 10:34, 4 October 2023 (UTC)Reply
@Amayus: for the record, I set the script to include all the new languages, but I forgot to update the Wikipedia family file of my Pywikibot installation, so it didn't process them. They will be done next month. — Yerpo Eh? 18:47, 14 October 2023 (UTC)Reply

Should weights be standarized?

Currently we have two different sets of weights, one used here and one in the Expanded list. Should we standardize them, so both projects can be measured equally? Theklan (talk) 16:22, 4 January 2024 (UTC)Reply
