Talk:List of Wikipedias by sample of articles/Source code (original)


Modified source

Suggestions:

  • include .encode('cp437','replace') whenever printing to console to avoid errors
  • optimize by caching English pages
  • remove interwiki text for article length calculation
  • weight text length
  • color code score

--MarsRover 11:05, 2 December 2007 (UTC)
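
The first and third suggestions can be sketched in a few lines. This is only an illustration, not code from the script: the function names and the interwiki pattern are mine, and the pattern is rough (it assumes interwiki prefixes are short runs of lowercase letters and hyphens).

```python
import re

def safe_console(text, codec='cp437'):
    # Encode for the console codec, replacing unmappable characters
    # with '?' instead of raising UnicodeEncodeError.
    return text.encode(codec, 'replace').decode(codec)

# Interwiki links look like [[de:Beispiel]]; strip them before
# measuring article length.
interwiki_re = re.compile(r'\[\[[a-z-]{2,12}:[^\]]+\]\]')

def strip_interwiki(text):
    return interwiki_re.sub('', text)
```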

Modifying source

I was looking at modifying this program for my own use (namely, directing it towards a different page; for example, Vital Articles / Extended, a specific wikiproject's topic list, or a specific topic outline's list). Who would be the right person to ask about doing that? Almafeta 05:50, 1 October 2009 (UTC)

Smeira is the original author, but he has been missing for a couple of years. I could probably help. I've been working on code to create an extended article list (see below). It may need some tweaking for your needs, but it can read from the lists you mentioned. --MarsRover 07:36, 1 October 2009 (UTC)
I've been working on that (apparently my installation of Python had... issues), and finally have it working with the groups I'm interested in. Thank you. =)
Also, it's too bad Smeira's gone... it occurs to me that the original was probably the most significant piece of code ever written in Volapük. Almafeta 16:49, 26 October 2009 (UTC)

GetExtendedArticleList.py

# -*- coding: utf_8 -*-
import sys

sys.path.append('./pywikipedia')

import wikipedia
import pagegenerators
import re

# A list entry: one or more bullets, an optional bolded [[link]], and an
# optional parenthesized sibling [[link]], e.g. "** [[Apple]] ([[Fruit]])".
# (The bullet class is [*#]; the original [\*|#] also matched a literal "|".)
entry_re = re.compile(r"([*#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")
# The inside of a link: optional "lang:" prefix, page name, optional "|alias".
link_re  = re.compile(r'(:?([a-z\-]+):)?([^\]\|:]+)(\|([^\]]+))?')

def parseEntry(line):
    # Return the entry's link text, sibling link, indent depth and match
    # span, or None if the line is not a list entry.
    m = entry_re.search(line)
    if m:
        return {'name':m.group(4),'sibling':m.group(7),'indent':len(m.group(1)),'span':m.span()}

def parseLink(link, wiki_name):
    # Split a link into wiki code (defaulting to the current wiki),
    # page name and display alias, or return None if it does not parse.
    m = link_re.search(link)
    if m:
        linkWiki = m.group(2) or wiki_name
        return {'wiki':linkWiki,'name':m.group(3),'alias':m.group(5)}

def findAll(text, parseFunction):
    # Repeatedly apply parseFunction, advancing past each match, and
    # collect every parsed entry with its absolute position in the text.
    return_list = []
    pos  = 0
    item = parseFunction(text)
    while item:
        pos += item['span'][1]
        item['pos'] = pos
        del item['span']
        return_list.append(item)
        item = parseFunction(text[pos:])
    return return_list

def getArticle(wiki_name, wiki_family, article_name):
    # Fetch the page's wikitext from the wiki.
    print "reading %s" % (article_name)
    wiki         = wikipedia.Site(wiki_name, wiki_family)
    page         = wikipedia.Page(wiki, article_name)
    article_text = page.get(get_redirect=False)
    return {'text':article_text}

def getArticleList(wiki_name, wiki_family, article_name):
    # Parse every list entry on the page and resolve each entry's link.
    article = getArticle(wiki_name, wiki_family, article_name)['text']
    arts = findAll(article, parseEntry)
    for art in arts:
        art['link'] = parseLink(art['name'], wiki_name)
    return arts

print "working..."

lists = {}

vital_topics = ['People', 'History', 'Geography', 'Arts',
                'Philosophy and religion', 'Everyday life',
                'Society and social sciences', 'Health and medicine',
                'Science', 'Technology', 'Mathematics', 'Measurement']
for topic in vital_topics:
    title = 'Wikipedia:Vital articles/Expanded/%s' % topic
    lists[':en:%s' % title] = getArticleList('en', 'wikipedia', title)

lists[':m:List of articles every Wikipedia should have/Version 1.1'] = getArticleList('meta','meta',     'List of articles every Wikipedia should have/Version 1.1')
lists[':en:Films considered the greatest ever']                      = getArticleList('en',  'wikipedia','Films considered the greatest ever')
lists[':en:Outline of biology']                                      = getArticleList('en',  'wikipedia','Outline of biology')

print "merge lists..."

fullList = {}
for x in lists.values():
    for i in x:
        # Deduplicate case-insensitively, keeping the first spelling seen.
        if i['link']['name'].lower() not in fullList:
            fullList[i['link']['name'].lower()] = i['link']['name']

print len(fullList)

print "sorting..."
# Use a lambda rather than key=str.lower: the page titles are unicode
# objects, and the unbound str.lower rejects unicode arguments.
sortedFullList = sorted(fullList.values(), key=lambda s: s.lower())

for i in sortedFullList:
    print i
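
For anyone adapting the parser, a standalone check of the entry pattern may help. This is a simplified copy of the script's regex (not the script itself) applied to a sample list line; bullets land in group 1, the page name in group 4, and the sibling link in group 7:

```python
import re

# Simplified copy of the entry pattern: bullets, a [[link]], and an
# optional parenthesized sibling [[link]].
entry_re = re.compile(r"([*#]+)(\s*)('*)\[\[([^\]]+)\]\](\s*)\(?(\[\[([^\]]+)\]\])?\)?")

m = entry_re.search("** [[Apple]] ([[Fruit]])")
```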

Perl version?

Has anybody implemented this in Perl? I've been working on a similar routine, just for my own amusement, looking only at articles in my home WP (= Latin), and I can't get the sizes to come out right. Grab the page, take out the inter-wiki links, take out the comments, see how many characters you've got, and multiply by the language weight -- how hard can it be? I'm wondering if I've run up against some Perl-ish oddity about Unicode (which I thought I was handling correctly), or just made some fluff-ball error. A. Mahoney 18:09, 8 November 2011 (UTC)

I think you might be the first to try Perl. Yeah, I think you're right. It is probably related to Unicode. Make sure the length() function returns the number of characters and not the number of bytes (http://stackoverflow.com/questions/1326539/how-do-i-find-the-length-of-a-unicode-string-in-perl). --MarsRover 22:53, 8 November 2011 (UTC)
What I ended up having to do was to trim trailing white space; Unicodeity was OK. The numbers are still a little off but close enough for planning purposes. If anybody else wants to use Perl, the MediaWiki::Bot package is the way to go; it's quite straightforward. A. Mahoney 17:36, 13 January 2012 (UTC)
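
The character-versus-byte pitfall discussed above is not Perl-specific. In Python, for instance, the length of a decoded string counts characters while the length of its encoded form counts bytes:

```python
s = u'Volap\u00fck'   # "Volapük"

len(s)                 # 7 characters
len(s.encode('utf-8')) # 8 bytes: the u-umlaut takes two bytes in UTF-8
```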

Tuvan language

btw, pywikipedia doesn't seem to support "tyv." yet. --MarsRover 06:33, 5 September 2013 (UTC)
