An XML representation of wiki markup is a frequently requested feature and has been much discussed. I decided to give it a shot. I'm aware that several people have started working on it, but I have decided to do a few things differently. Most importantly: it's already working.

Here are the most important differences:

  • Do not use XML as an intermediate format to create HTML output
  • Do not rewrite the parser
  • Do not build a full parse tree in memory
  • Separate "renderer" (back end) logic from "parser" logic (front end), i.e. factor out knowledge about HTML from the parser into a separate "renderer" class.
  • Provide a "symbolic" representation (no template substitution, etc.) as well as an "expanded" view.
  • Make the back end versatile enough to be able to produce different types of XML as well as TeX, etc.
  • Provide an interface to the different back ends. This is done via ...?action=convert&converter=TeXRenderer, etc.
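
The split between parser and renderer described above can be illustrated with a minimal sketch. It is not the actual implementation: the class `XMLRenderer`, the function `convert`, the method names and the toy line-based grammar are all assumptions made for illustration (only the name `TeXRenderer` appears on this page). The point is that the front end knows nothing about the output format and emits output piece by piece without building a parse tree.

```python
import re
from xml.sax.saxutils import escape, quoteattr


class XMLRenderer:
    """Hypothetical back end producing XML like the example on this page."""

    def heading(self, level, text):
        return "<xhtml:h%d>%s</xhtml:h%d>" % (level, escape(text), level)

    def include(self, ref):
        # Symbolic mode: the template is referenced, not substituted.
        return "<mw:include ref=%s/>" % quoteattr(ref)

    def text(self, text):
        return escape(text)


class TeXRenderer:
    """Hypothetical back end producing TeX; no TeX escaping is attempted."""

    COMMANDS = {2: "section", 3: "subsection"}

    def heading(self, level, text):
        return "\\%s{%s}" % (self.COMMANDS.get(level, "paragraph"), text)

    def include(self, ref):
        return "%% include: %s" % ref

    def text(self, text):
        return text


# Toy grammar (an assumption): one construct per line.
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1$")
INCLUDE = re.compile(r"^\{\{(.+?)\}\}$")


def convert(lines, renderer):
    """Front end: recognise markup and delegate all output to the renderer.

    Output is yielded line by line, so no parse tree is held in memory.
    """
    for line in lines:
        line = line.strip()
        match = HEADING.match(line)
        if match:
            yield renderer.heading(len(match.group(1)), match.group(2))
            continue
        match = INCLUDE.match(line)
        if match:
            yield renderer.include(match.group(1))
            continue
        if line:
            yield renderer.text(line)


if __name__ == "__main__":
    source = ["=== In the news ===", "{{In the news}}"]
    for backend in (XMLRenderer(), TeXRenderer()):
        print("\n".join(convert(source, backend)))
```

Adding another output format in this sketch means writing one more class with the same three methods; `convert` itself stays untouched.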

This design allows us to move more logic into the renderer(s) step by step. Right now the renderer consists mainly of getter and maker functions; the output state and text are still maintained in the parser. This could change over time.

I have tried to benchmark my hacked parser in normal HTML mode against the old (current) parser and could not find much of a difference, so it seems to be nice and fast.

Some other things that need some more thought:

ToDo

  • Have a formal DTD (or better: Schema)
  • Integrate with XHTML (is the schema mixing OK?)
  • Deal with named entities.
  • Change markup around (this is easily done)
  • Templates directly after headings behave strangely
  • ...

Example Output

Here's an example of what the XML renderer produces so far (in "symbolic" mode, i.e. without substitutions):

XML rendering of en:Main Page (the pretty printing is for convenience here, it's not done automatically):

 <?xml version="1.0"?>
 <mw:wikitext xmlns:mw="http://wikimedia.org/schemas/wikitext/1.0" xmlns:xhtml="http://www.w3.org/1999/xhtml" version="1.5.0">
  <xhtml:p>
    <mw:include ref="MainPageIntro"/>
    <xhtml:div style="padding-bottom: .3em; margin: 0 .5em .5em">
      <mw:include ref="Main Page banner"/>
    </xhtml:div>
  </xhtml:p>
  <xhtml:table cellspacing="3">
    <xhtml:tr valign="top">
      <xhtml:td width="55%" class="MainPageBG" style="border: 1px solid #ffc9c9; color: #000; background-color: #fff3f3">
        <xhtml:div style="padding: .4em .9em .9em">
          <xhtml:h3>Today's featured article</xhtml:h3>
          <mw:dynamic-include>
            <mw:link-target>Wikipedia:Today's featured article/<mw:var-ref ref="CURRENTMONTHNAME"/> <mw:var-ref ref="CURRENTDAY"/>, <mw:var-ref ref="CURRENTYEAR"/></mw:link-target>
          </mw:dynamic-include>
          <xhtml:h3>Selected anniversaries</xhtml:h3>
          <mw:dynamic-include>
            <mw:link-target>Wikipedia:Selected anniversaries/<mw:var-ref ref="CURRENTMONTHNAME"/>_<mw:var-ref ref="CURRENTDAY"/></mw:link-target>
          </mw:dynamic-include>
        </xhtml:div>
      </xhtml:td>
      <xhtml:td width="45%" class="MainPageBG" style="border: 1px solid #c6c9ff; color: #000; background-color: #f0f0ff">
        <xhtml:div style="clear: right; text-align: left; float: right; padding: .4em .9em .9em">
          <xhtml:h3>In the news</xhtml:h3>
          <mw:include ref="In the news"/>
          <xhtml:h3>Did you know...</xhtml:h3>
          <mw:include ref="Did you know"/>
        </xhtml:div>
      </xhtml:td>
    </xhtml:tr>
  </xhtml:table>
  <xhtml:div class="MainPageBG" style="padding: .5em 1em 0; margin: 0 3px 3px; border-bottom: 2px solid #ccc">
    <xhtml:h3 id="lang">Wikipedia in other languages</xhtml:h3>
    You may read and edit articles in many different languages:
    <mw:include ref="Wikipedialang"/>
  </xhtml:div>
  <xhtml:div class="MainPageBG" style="padding: .5em 1em 1em; margin: 3px;">
    <xhtml:h3 id="sister">Wikipedia's sister projects</xhtml:h3>
    <mw:include ref="WikipediaSister"/>
    <xhtml:div style="clear:left"/>
  </xhtml:div>
  <xhtml:div class="MainPageBG" style="border: 1px solid #ffad80; padding: .5em 1em; color: #000; background-color: #fff7cb; margin: 3px 3px 0; text-align: center">
    <xhtml:div style="font-size:90%">
      <mw:include ref="donate"/>
    </xhtml:div>
  </xhtml:div>
  <mw:magic-word name="__NOTOC__"/>
  <mw:magic-word name="__NOEDITSECTION__"/>
  <xhtml:div class="MainPageBG" style="padding: .5em 1em 0; margin: 3px 3px 0; text-align: center;">
    <mw:include ref="newpagelinksmain"/>
  </xhtml:div>
 </mw:wikitext>
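
As a rough illustration of how a consumer might work with this output, the following sketch uses Python's standard `xml.etree.ElementTree` to list the templates and magic words referenced by a page in symbolic mode. The namespace URIs are copied from the example above; the shortened `SAMPLE` document and the function names `list_includes` and `list_magic_words` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the example output above.
NS = {
    "mw": "http://wikimedia.org/schemas/wikitext/1.0",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# A shortened, hand-made excerpt in the same shape as the example output.
SAMPLE = """<?xml version="1.0"?>
<mw:wikitext xmlns:mw="http://wikimedia.org/schemas/wikitext/1.0" xmlns:xhtml="http://www.w3.org/1999/xhtml" version="1.5.0">
  <xhtml:div>
    <xhtml:h3>In the news</xhtml:h3>
    <mw:include ref="In the news"/>
    <xhtml:h3>Did you know...</xhtml:h3>
    <mw:include ref="Did you know"/>
  </xhtml:div>
  <mw:magic-word name="__NOTOC__"/>
</mw:wikitext>"""


def list_includes(xml_text):
    """Return the ref attribute of every mw:include element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.get("ref") for el in root.iterfind(".//mw:include", NS)]


def list_magic_words(xml_text):
    """Return the name attribute of every mw:magic-word element."""
    root = ET.fromstring(xml_text)
    return [el.get("name") for el in root.iterfind(".//mw:magic-word", NS)]


if __name__ == "__main__":
    print(list_includes(SAMPLE))
    print(list_magic_words(SAMPLE))
```

Because the symbolic view keeps template references as elements instead of substituting them, a question like "which templates does this page use?" becomes a simple element query.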