Community Wishlist Survey 2017/Miscellaneous/Word count on statistics
Word count on statistics
- Problem: We haven't had an up-to-date word count since 2014, and this is a basic statistic for measuring Wikipedia's size
- Who would benefit: Statistics lovers and everyone who wants to show the size of Wikipedia
- Proposed solution: Having a word count from the dump would be the solution
- More comments:
- Phabricator tickets: Done at gerrit:392471
- Proposer: Theklan (talk) 22:40, 13 November 2017 (UTC)
- Translations: none yet
Discussion
This is relatively straightforward: we already have the per-article word counts broken out (they are in search results); there just isn't a public way to ask for a sum. FWIW a sum on the en.wikipedia.org content index currently reports: 3.049711774E9 EBernhardson (WMF) (talk) 03:17, 18 November 2017 (UTC)
- Where did you find this number? -Theklan (talk) 17:57, 20 November 2017 (UTC)
- I wrote a custom query against the Elasticsearch cluster to aggregate the stored word count (as I'm a developer working on search at WMF). I've put up a patch in code review to integrate this into Special:Statistics. I would expect this to be merged and rolled out sometime in December. This is only the raw word count of pages considered articles, not any of the more advanced things discussed below. EBernhardson (WMF) (talk) 19:09, 28 November 2017 (UTC)
- @Theklan: This has now rolled out to all wikis; you can get the counts from the Special:Statistics page. 2601:648:8402:C015:307E:5334:1490:C6B9 19:09, 15 December 2017 (UTC)
- @EBernhardson (WMF): Are you sure that this number is correct? The number was considerably higher in 2014 according to Wikistats. -Theklan (talk) 00:57, 16 December 2017 (UTC)
- @Theklan: Wikistats may have been calculating something different; I would have to dig into what they counted. This particular count takes the content (main namespace), removes some non-content portions (tables, hatnotes, etc.) and then counts the number of individual words (as determined by tokenization with Lucene, the same used for full-text search). If we were to include non-content pages the value would increase from 3.1 billion to 11.3 billion. EBernhardson (WMF) (talk) 17:58, 17 January 2018 (UTC)
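For illustration only, the counting steps described in this thread (keep main-namespace pages, drop some non-content portions, tokenize, sum) could be sketched roughly as follows. This is not the actual implementation: the regular expressions, the `count_words` and `total_words` helpers, and the sample pages are all hypothetical, and a plain `\w+` match stands in for Lucene tokenization, so the numbers it produces would not match Special:Statistics.

```python
import re

# Hypothetical patterns for "non-content" wikitext. The real pipeline
# strips such portions differently; these are simplified stand-ins.
TABLE_RE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
HATNOTE_RE = re.compile(r"\{\{\s*hatnote\s*\|.*?\}\}", re.IGNORECASE | re.DOTALL)

# Assumption: a run of word characters approximates one token.
# Lucene's tokenizer follows different, more elaborate rules.
WORD_RE = re.compile(r"\w+")


def count_words(wikitext: str) -> int:
    """Count words in one page after dropping tables and hatnotes."""
    text = TABLE_RE.sub(" ", wikitext)
    text = HATNOTE_RE.sub(" ", text)
    return len(WORD_RE.findall(text))


def total_words(pages, content_only=True):
    """Sum word counts over (namespace, wikitext) pairs.

    Namespace 0 is the main (article) namespace. With content_only=False
    every page is counted, which is why the total grows.
    """
    return sum(
        count_words(text)
        for namespace, text in pages
        if not content_only or namespace == 0
    )


# Made-up sample data, not real wiki content.
pages = [
    (0, "{{hatnote|For other uses, see X.}}\nCats are small mammals."),
    (0, 'Dogs bark.\n{| class="wikitable"\n| a || b\n|}\nThey also run.'),
    (1, "Talk page text is not counted here."),
]

print(total_words(pages))                      # main namespace only
print(total_words(pages, content_only=False))  # all namespaces
```

The second call shows, on toy data, why including non-content pages gives a larger total than the main-namespace count.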
I would suggest taking this further with basic readability statistics. There are various well-established metrics, but even simple things like average words-per-sentence and syllables-per-word would be helpful. T.Shafee(Evo﹠Evo)talk 11:02, 18 November 2017 (UTC)
- Readability metrics are misleading and bullshit. Source: I built one. --Dispenser (talk) 18:03, 20 November 2017 (UTC)
Note: This idea was also suggested at wikitech-l a few days ago, and a reply pointed out a userscript that does a very simple version. Quiddity (WMF) (talk) 19:52, 20 November 2017 (UTC)
- User:Dr pda made a byte and word counter years back and lists issues with counting "article text". The reason why people like word count is "100 words = 1 minute of reading" (without regard to textual difficulty). Naturally excludes infoboxes, tables, images, navboxes, etc. --Dispenser (talk) 21:10, 20 November 2017 (UTC)
- Yes, but I don't want a script that measures the word count of a given article, but the global number of words in the whole Wikipedia project. There's a difference there! -Theklan (talk) 12:03, 21 November 2017 (UTC)
Voting
- Support David1010 (talk) 10:59, 28 November 2017 (UTC)
- Support --Liuxinyu970226 (talk) 13:12, 28 November 2017 (UTC)
- Support — Draceane talkcontrib. 18:17, 28 November 2017 (UTC)
- Support Thomas Obermair 4 (talk) 21:56, 28 November 2017 (UTC)
- Support Shizhao (talk) 03:03, 29 November 2017 (UTC)
- Support Donald Trung (Talk 🤳🏻) (My global lock 🔒) (My global unlock 🔓) 10:21, 29 November 2017 (UTC)
- Support Ermahgerd9 (talk) 21:03, 29 November 2017 (UTC)
- Support Meiloorun (talk) 23:52, 29 November 2017 (UTC)
- Support Zhangj1079 talk 01:41, 30 November 2017 (UTC)
- Support - yona B. (D) 08:17, 30 November 2017 (UTC)
- Support קובץ על יד (talk) 12:41, 30 November 2017 (UTC)
- Support Whats new? (talk) 00:16, 2 December 2017 (UTC)
- Support J947 02:16, 2 December 2017 (UTC)
- Support Tom Ja (talk) 14:40, 2 December 2017 (UTC)
- Support ديفيد عادل وهبة خليل 2 (talk) 14:56, 2 December 2017 (UTC)
- Support enL3X1 ¡‹delayed reaction›¡ 16:05, 3 December 2017 (UTC)
- Support Ciao • Bestoernesto • ✉ 01:52, 4 December 2017 (UTC)
- Support Guycn2 · ☎ 19:30, 4 December 2017 (UTC)
- Support I am interested to see this happening. ··· 🌸 Rachmat04 · ☕ 07:29, 5 December 2017 (UTC)
- Support Ruslik (talk) 18:09, 10 December 2017 (UTC)
- Support Perrak (talk) 20:58, 10 December 2017 (UTC)
- Support, useful for example for tracking average number of words per article — NickK (talk) 16:51, 11 December 2017 (UTC)