WikiMiner
WikiMiner is a search engine dedicated to the DVD edition of Wikipedia. It was created in 2006 for the DVD edition of the Polish Wikipedia (235,000 articles), but it can easily be localized to any other language version. So far two language files have been created: Polish and English. The program is now being tested under various operating systems, and some minor changes are being implemented. The program's original name (WikiBrowser) was changed because of a conflict with another Wikimedia project, WikiBrowse.
Main features of the application:
- The program is a standalone Java application. It requires the Java Runtime Environment (JRE) version 1.5 or higher. The index can be placed on the hard drive (fast search) or on the DVD.
- It supports case-insensitive searching
- Boolean phrases are supported.
- Search result entries include the page title, the number of occurrences of the searched keywords, Wikipedia categories and excerpts from the article content.
- The index is meant to be installed on the user's hard drive, so the DVD is not used until the user clicks a link to an article.
- The resulting index for the whole Polish Wikipedia (2 GB of text) takes 120 MB.
- In minimum installation mode, only the Java Runtime Environment has to be installed on the hard drive.
- Only part of the index is loaded into memory.
- Searching is fast once the index has been loaded at program startup.
- You can search in Japanese as well as in French or Polish: full Unicode is supported.
- Grammatical suffixes can be specified and stripped during indexing and searching.
- Command line mode, stopwords and redirects are also supported.
- Alphabetical sorting uses a simplified UCA (Unicode Collation Algorithm), which respects the order of non-ASCII characters.
- The index is created from HTML pages, which have to be formatted in a specific way (UTF-8 encoding, article text starting at <p> and ending at <div id='footer'>, etc.).
- The program is released under the GNU GPL license.
- It is independent of the operating system (tested and working at least under Windows and Linux).
- Unlike some GNU tools such as Regain, it installs no WWW server on the client's hard drive, raises no security alert on Windows XP, and demands no special rights for any applet or application. Standard security settings are sufficient.
- Search results are written to a temporary HTML file, and then the default HTML browser is launched (or another one, depending on the program configuration). While opening the result page, the program checks whether the Wikipedia DVD with the article base is present and, if not, shows an appropriate warning. Temporary files are removed when the program exits.
- The program doesn't use JavaScript or Java applets (just Java).
- The DVD is not required to perform searches.
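The feature list mentions alphabetical sorting that respects non-ASCII characters. WikiMiner's own simplified UCA code is not shown here, so the following is only a sketch of the same idea using the JDK's standard Collator as a stand-in. The class name TitleSortSketch and the sample titles are assumptions made for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of WikiMiner: sorts titles with the JDK Collator
// instead of the program's own simplified UCA implementation.
class TitleSortSketch {

    /** Returns a copy of the titles sorted with a locale-aware Collator. */
    static List<String> sortTitles(List<String> titles, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(collator); // Collator is a Comparator, so it can be passed directly
        return sorted;
    }

    public static void main(String[] args) {
        // "\u0105" is the Polish letter a-with-ogonek.
        List<String> titles = Arrays.asList("zebra", "\u0105bc", "abc", "b");
        System.out.println(sortTitles(titles, Locale.forLanguageTag("pl")));
    }
}
```

A plain code-point sort would place the title starting with a-with-ogonek after "zebra", which is the problem that collation-based sorting avoids.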
Building index
Section under construction
You have to prepare a set of HTML pages to be indexed. Each HTML file must be encoded in UTF-8.
The following options should be set in the WIKI.INI file: TODO
Then execute the command:
java -jar WIKIMINER.JAR -make
The names of all files used by the program are written in capital letters to be consistent with the ISO 9660 Level 2 standard. This allows maximum compatibility of the DVD installation.
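The page format required for indexing (article text starting at <p> and ending at <div id='footer'>) suggests a simple extraction step. The sketch below shows one way such a step could look. It is not WikiMiner's indexer: the class ArticleTextSketch, its method names and the tag-stripping regular expression are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of WikiMiner.
class ArticleTextSketch {
    static final String START = "<p>";
    static final String END = "<div id='footer'>";

    /** Returns the HTML from the first <p> up to the footer div, or null if a marker is missing. */
    static String extractBody(String html) {
        int start = html.indexOf(START);
        if (start < 0) return null;
        int end = html.indexOf(END, start);
        if (end < 0) return null;
        return html.substring(start, end);
    }

    /** Crude tag removal (an assumption): replaces tags with spaces and collapses whitespace. */
    static String toPlainText(String bodyHtml) {
        return bodyHtml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = args.length > 0
                // Pages must be UTF-8, so the file is decoded explicitly as UTF-8.
                ? new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8)
                : "<h1>Title</h1><p>Hello <b>world</b></p><div id='footer'>footer</div>";
        String body = extractBody(html);
        System.out.println(body == null ? "markers not found" : toPlainText(body));
    }
}
```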
Searching
The program searches for any words in all articles in the main Wikipedia namespace (ns=0). Grammatical suffixes (like -s in English, or about 30 suffixes in Polish) can be stripped. The list of suffixes is configurable.
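Suffix stripping of this kind can be sketched as follows. The source does not describe the actual rules, so the choice of removing the longest matching suffix and the minimum stem length parameter are assumptions, as is the class name SuffixStripSketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of WikiMiner.
class SuffixStripSketch {

    /**
     * Removes the longest configured suffix from the word, provided that at least
     * minStem characters remain. Returns the word unchanged if no suffix applies.
     */
    static String strip(String word, List<String> suffixes, int minStem) {
        String best = word;
        for (String suffix : suffixes) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= minStem && stemLength < best.length()) {
                best = word.substring(0, stemLength);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> suffixes = Arrays.asList("s", "es");
        System.out.println(strip("cats", suffixes, 3));
        System.out.println(strip("bus", suffixes, 3));
    }
}
```

Applying the same function to both the indexed words and the query words is what lets a query for one grammatical form match another.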
The search is case-insensitive. Unicode non-ASCII characters that resemble Latin letters, such as ą, Ü and about 500 others, can also be typed as their nearest ASCII equivalent. The standard transliteration of German and Dutch letters (Ü=ue, etc.) is supported as well.
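One common way to get this kind of folding is Unicode decomposition followed by removal of combining marks. The sketch below uses java.text.Normalizer (available since Java 6) and is only an approximation: WikiMiner's own table of about 500 letters is not shown in this document, and the class AsciiFoldSketch with its explicit handling of ł and of Ü=ue is an assumption.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of WikiMiner.
class AsciiFoldSketch {

    /** Lower-cases the text and maps accented Latin letters to their base letters. */
    static String fold(String text) {
        // NFD splits a letter such as a-with-ogonek into 'a' plus a combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Some letters have no decomposition and need an explicit entry,
        // for example the Polish l-with-stroke (U+0142 / U+0141).
        stripped = stripped.replace('\u0142', 'l').replace('\u0141', 'L');
        return stripped.toLowerCase(Locale.ROOT);
    }

    /** Variant for the transliteration named in the text: u-umlaut becomes "ue". */
    static String foldGerman(String text) {
        String expanded = text.replace("\u00FC", "ue").replace("\u00DC", "Ue");
        return fold(expanded);
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0141\u00F3d\u017A"));   // the city name Lodz with Polish diacritics
        System.out.println(foldGerman("\u00DCber"));
    }
}
```

Indexing both folded forms of a word would let a user type either the nearest ASCII letter or the transliteration.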
When searching for a sequence of words, the and operator is assumed by default, so the program finds all articles that contain all the given words (in any order).
The result list can be navigated using hotkeys, which is especially important for blind users. On Windows, Alt+1 jumps to the first result, Alt+2 to the second one, etc.
Boolean search
The operators and, or, not and parentheses can be used in a search query.
Examples:
- George and not Bush - finds all pages that contain the word George and do not contain the word Bush.
- Betty has (cat or hamster) - finds all pages that contain the words Betty and has and at least one of the words cat or hamster.
Alternative notations:
- Instead of and you can type & or &&. You can also omit it.
- Instead of or you can type | or ||.
- Instead of not you can type ! or ~.
Title search
To search in article titles only, use the title: keyword.
Examples:
- title:Adam Mickiewicz finds all pages with the word Adam in the title and Mickiewicz in the title or in the article body.
- title:(Adam Mickiewicz) finds all pages with both Adam and Mickiewicz in the title (in any order).
- title:Adam or title:Eve finds all pages with the word Adam or the word Eve in the title.
Category search
To obtain all articles in:
- categories with a given word in the category title, and
- their subcategories,
use the categ: keyword.
Examples:
- categ:History returns all pages from the categories History, History of Poland, etc., and their subcategories.
- key categ:Databases returns all pages containing the word key in a database context.
- categ:Islam categ:Christianity returns all pages connected with both Islam and Christianity.
- sun and not categ:astronomy returns all pages containing the word sun that are not connected with astronomy.
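The expansion from a word to matching categories and all their subcategories can be sketched as a breadth-first traversal. How WikiMiner stores the category tree is not described here, so the map from category title to direct subcategories, the class CategorySearchSketch and the assumption that every category appears as a key of the map are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of WikiMiner.
class CategorySearchSketch {

    /** Returns every category whose title contains the word, plus all their subcategories. */
    static Set<String> expand(String word, Map<String, List<String>> subcategories) {
        String needle = word.toLowerCase(Locale.ROOT);
        Set<String> result = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String category : subcategories.keySet()) {
            if (titleContainsWord(category, needle) && result.add(category)) {
                queue.add(category);
            }
        }
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String child : subcategories.getOrDefault(current, Collections.emptyList())) {
                if (result.add(child)) { // add() returns false for already visited categories
                    queue.add(child);
                }
            }
        }
        return result;
    }

    /** Case-insensitive whole-word match inside a category title. */
    static boolean titleContainsWord(String title, String lowerCaseWord) {
        for (String part : title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (part.equals(lowerCaseWord)) return true;
        }
        return false;
    }
}
```

The pages returned by a categ: query would then be the pages belonging to any category in the resulting set.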
Operator precedence
Operators, unless grouped by parentheses, are evaluated in the following order (from first to last):
- title:, categ:
- not
- and
- or
These keywords can also be translated into other languages with no programming required. For example, in the Polish Wikipedia we were able to use kateg, tytuł, i, lub and nie alongside categ, title, and, or and not.
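The precedence order above (not before and, and before or, with and implied between adjacent words) can be illustrated with a small recursive-descent evaluator. This is a sketch under assumptions, not WikiMiner's parser: the class BooleanQuerySketch is hypothetical, a page is modelled as a set of lower-case words, and the title: and categ: prefixes are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical evaluator, not part of WikiMiner.
class BooleanQuerySketch {
    // Tokens: parentheses, the symbolic operators, or a run of other non-space characters.
    private static final Pattern TOKEN =
            Pattern.compile("\\(|\\)|&&|\\|\\||[&|!~]|[^\\s()&|!~]+");

    private final List<String> tokens = new ArrayList<>();
    private final Set<String> pageWords;
    private int pos = 0;

    private BooleanQuerySketch(String query, Set<String> pageWords) {
        Matcher m = TOKEN.matcher(query.toLowerCase(Locale.ROOT));
        while (m.find()) tokens.add(m.group());
        this.pageWords = pageWords;
    }

    /** Returns true if a page with the given lower-case words satisfies the query. */
    static boolean matches(String query, Set<String> pageWords) {
        BooleanQuerySketch parser = new BooleanQuerySketch(query, pageWords);
        boolean result = parser.parseOr();
        if (parser.pos != parser.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + parser.tokens.get(parser.pos));
        }
        return result;
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }
    private static boolean isOr(String t)  { return "or".equals(t) || "|".equals(t) || "||".equals(t); }
    private static boolean isAnd(String t) { return "and".equals(t) || "&".equals(t) || "&&".equals(t); }
    private static boolean isNot(String t) { return "not".equals(t) || "!".equals(t) || "~".equals(t); }

    // Lowest precedence: or.
    private boolean parseOr() {
        boolean value = parseAnd();
        while (isOr(peek())) {
            pos++;
            boolean right = parseAnd(); // always parse the operand, even if value is already true
            value = value || right;
        }
        return value;
    }

    // Middle precedence: and, which may be omitted between adjacent terms.
    private boolean parseAnd() {
        boolean value = parseNot();
        while (true) {
            String t = peek();
            if (t == null || ")".equals(t) || isOr(t)) return value;
            if (isAnd(t)) pos++;
            boolean right = parseNot();
            value = value && right;
        }
    }

    // Highest precedence handled here: not.
    private boolean parseNot() {
        if (isNot(peek())) {
            pos++;
            return !parseNot();
        }
        return parsePrimary();
    }

    private boolean parsePrimary() {
        String t = peek();
        if (t == null) throw new IllegalArgumentException("Unexpected end of query");
        pos++;
        if ("(".equals(t)) {
            boolean value = parseOr();
            if (!")".equals(peek())) throw new IllegalArgumentException("Missing )");
            pos++;
            return value;
        }
        if (")".equals(t) || isAnd(t) || isOr(t)) {
            throw new IllegalArgumentException("Unexpected token: " + t);
        }
        return pageWords.contains(t);
    }

    public static void main(String[] args) {
        Set<String> page = new HashSet<>(Arrays.asList("george", "washington"));
        System.out.println(matches("George and not Bush", page));
    }
}
```

Translated keywords such as the Polish ones could be supported in a sketch like this by extending the three operator checks with the words read from a language file.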
Stopwords
The program removes from the query:
- words that are too short (the minimum keyword length can be set during index creation),
- words that are too frequent (for example the).
The user can view the list of stopwords using the -stopwords command line option.
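The two removal rules can be sketched as a single filter over the query words. The class StopwordFilterSketch, the stopword set and the minimum length of 3 used in the example are assumptions made for illustration; the real values come from the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of WikiMiner.
class StopwordFilterSketch {

    /** Keeps only words that are long enough and are not in the stopword set. */
    static List<String> filter(List<String> queryWords, Set<String> stopwords, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String word : queryWords) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (lower.length() >= minLength && !stopwords.contains(lower)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the"));
        List<String> query = Arrays.asList("The", "history", "of", "Poland");
        System.out.println(filter(query, stopwords, 3));
    }
}
```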
Command line options
The program can also be run from the command line, which allows Wikipedia search to be used in scripts.
Command line options:
Option | Description |
---|---|
-ini file.ini | use a different configuration file instead of WIKI.INI |
-lang file[.lng] | use a different language file. Currently EN and PL are supported, but you can easily create other language files. |
query | search for the given query |
-out file | write the result to the given file (applies to command line search only) |
-format txt | create text output instead of an HTML file |
-format wiki | create output in Wikipedia format (links in the form [[title]]) |
-start n | start the list from result #n |
-max n | limit the number of results on the list to n |
-stopwords | show the list of stopwords |
-asort | sort results alphabetically by title |
-verbose | show debug information |
Example:
Create a text list for the query John III Sobieski, sorted by title:
java -jar WIKIMINER.JAR John III Sobieski -format txt -out john.txt -asort
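For use from another Java program rather than a shell script, the same command can be assembled and started with ProcessBuilder. The sketch below only uses the options listed in the table above. The class WikiMinerCommandSketch is hypothetical, and it assumes that WIKIMINER.JAR is in the working directory and that java is on the PATH.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of WikiMiner.
class WikiMinerCommandSketch {

    /** Builds the command from the example: text output, written to outFile, sorted by title. */
    static List<String> buildCommand(String query, String outFile) {
        List<String> command = new ArrayList<>(Arrays.asList("java", "-jar", "WIKIMINER.JAR"));
        // The query words are passed as separate arguments, as in the example above.
        command.addAll(Arrays.asList(query.trim().split("\\s+")));
        command.addAll(Arrays.asList("-format", "txt", "-out", outFile, "-asort"));
        return command;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> command = buildCommand("John III Sobieski", "john.txt");
        System.out.println(String.join(" ", command));
        // Only launch the search if the jar is actually present.
        if (new File("WIKIMINER.JAR").isFile()) {
            int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
            System.out.println("exit code: " + exitCode);
        }
    }
}
```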
Configuration file: WIKI.INI
The program uses the configuration file WIKI.INI. It should be encoded in UTF-8.
Options | Description |
---|---|
root=path | path to the main directory with HTML pages. If more than one path is specified (separated by semicolons), the program will search all of them. |
browser=path | HTML browser used to view result pages |
lang=file[.lng] | path to the language file. Currently EN and PL are supported. |
deleteTemp=yes/no | delete temporary HTML files in GUI mode. In command line mode created files are not removed. |
index=path | path to WIKIINDEX.DAT file |
map=path | path to MAP.DAT file |
strings=path | path to STRINGS.DAT |
charorder=path | path to CHAR_ORDER.DAT file |
css=path | path to SEARCH.CSS |
logo=path | path to file with logo of Wikipedia |
icon=path | path to png file with icon of Wikipedia |
front=path | path to the front HTML page (C link on the snapshot above) |
help=path | path to the help HTML page (D link on the snapshot above) |
checkDVD=yes/no | whether the program should check if the DVD is in the drive |
alwaysOnTop=yes/no | if set to yes, the search window will always be visible |
WWWLinks=yes/no | if set to yes, additional links to online Wikipedia articles will appear in search results |
maxLinksOnPage=number | maximum number of links on one result page |
maxCategInResult=number | maximum number of categories presented in a search result item |
Paths to files in WIKI.INI can be in form of:
- absolute path, for example c:\wikidvd\map.dat in Windows
- path relative to a directory with program .JAR file, starting from {{JARPATH}}, for example
map={{JARPATH}}MAP.DAT
- path relative to the database root directory on DVD, specified in root option. It should start from {{ROOT}}, for example
index={{ROOT}}WIKIINDEX.DAT
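The {{JARPATH}} and {{ROOT}} placeholders described above can be resolved by plain string substitution. The sketch below shows the idea for key=value lines such as those in WIKI.INI. It is not WikiMiner's configuration reader: the class WikiIniSketch is hypothetical, and it assumes that both directory paths already end with a path separator, since the examples above put the file name directly after the placeholder.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WikiMiner.
class WikiIniSketch {

    /** Parses key=value lines into a map, skipping lines without a key. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq <= 0) continue; // blank or malformed line
            options.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
        }
        return options;
    }

    /** Replaces the two placeholders; absolute paths pass through unchanged. */
    static String resolve(String value, String jarDir, String rootDir) {
        return value.replace("{{JARPATH}}", jarDir).replace("{{ROOT}}", rootDir);
    }

    public static void main(String[] args) {
        Map<String, String> options = parse(Arrays.asList(
                "map={{JARPATH}}MAP.DAT",
                "index={{ROOT}}WIKIINDEX.DAT"));
        // The two directories below are made-up example locations.
        for (Map.Entry<String, String> e : options.entrySet()) {
            System.out.println(e.getKey() + " -> "
                    + resolve(e.getValue(), "/opt/wikiminer/", "/media/dvd/"));
        }
    }
}
```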
Copyleft and author
In order to launch the HTML browser and show the search results, the program uses a modified BrowserLauncher class. Its license notice:
This code is Copyright 1999-2001 by Eric Albert (ejalbert at cs.stanford.edu) and may be redistributed or modified in any form without restrictions as long as the portion of this comment from this paragraph through the end of the comment is not removed. The author requests that he be notified of any application, applet, or other binary that makes use of this code, but that's more out of curiosity than anything and is not required. This software includes no warranty. The author is not repsonsible for any loss of data or functionality or any adverse or unexpected effects of using this software. Credits: Steven Spencer, JavaWorld magazine (Java Tip 66) Thanks also to Ron B. Yeh, Eric Shapiro, Ben Engber, Paul Teitlebaum, Andrea Cantatore, Larry Barowski, Trevor Bedzek, Frank Miedrich, and Ron Rabakukk @author Eric Albert (ejalbert at cs.stanford.edu) @version 1.4b1 (Released June 20, 2001)
The Java sources of the program are included in its jar file. You can open it using, for example, WinRAR.
Contact the author: pl:User:Olaf (Olaf Matyja).
Program download: [1] (This is a part of the DVD edition of the Polish Wikipedia. Only WikiMiner and its index are included. You can search, but links point to the locations where the files are expected on the DVD.)