Processing MediaWiki XML with STX

XML files exported by MediaWiki can be too large to process with powerful transformation languages such as XSLT, which typically load the whole document into memory. One option is a SAX parser, which is available for almost any programming language. You can also try to parse parts of the XML directly, but this approach is very difficult to maintain. An alternative is Streaming Transformations for XML (STX), a one-pass transformation language for XML documents. You can also combine STX and XSLT.
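As a rough illustration of the SAX option, here is a minimal sketch in Python using the standard library's `xml.sax` module. It streams through a dump and collects page titles. The names `TitleCollector`, `collect_titles` and `SAMPLE` are hypothetical, and the sample is a heavily simplified stand-in for a real export, which carries many more elements and attributes.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element while streaming."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks,
        # so the chunks are buffered until the element closes.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def collect_titles(stream):
    """Parse an XML byte stream and return the list of page titles."""
    handler = TitleCollector()
    xml.sax.parse(stream, handler)
    return handler.titles


# Simplified stand-in for a MediaWiki export (an assumption, not a real dump).
SAMPLE = b"""<mediawiki>
  <page><title>Main Page</title><revision><text>Hello</text></revision></page>
  <page><title>Help:Contents</title><revision><text>World</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(collect_titles(io.BytesIO(SAMPLE)))
```

For a real dump you would pass an open file (or `sys.stdin.buffer` when reading from a pipe) instead of the in-memory sample, and you would probably act on each title as it arrives rather than keep them all in a list.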

STX enables the processing of large documents and streams. For instance, you can pipe a compressed dump directly into Joost, an STX processor written in Java, without unpacking it first:

zcat pages_full.xml.gz | java -jar joost.jar - myscript.stx

STX Scripts