Data summit 2011/Parsers

(Copied from original at etherpad:DataSummitParsers -- needs cleanup)

Ye olde questions:

http://www.mediawiki.org/wiki/Wikitext/2011-01_format_discussion

Who's here?

  • Trevor (WMF)
  • Russ Nelson
  • Hannes Dohrn
  • Karl Matthias (aboutus - kiwi)
  • Brion

(Trevor chat notes) Intermediate structure for a wiki document -- this is our key

  • for near-term wikitext editing, we can restore the original source format, provided the document format is tightly tied to it and preserves all the information

Danese says she'd like us not to invent a new standard for this intermediate format; we should use existing standards where possible. XML? JSON? (XML and JSON are great serialization formats which can be used to save data structures. What's really important is to determine what the structure looks like. Some structures will fit better natively in XML than in JSON. XML can also benefit from being able to directly include subsets of other XML grammars, such as using a subset of the HTML 5 XML serialization directly. -brion)

A lightweight AST would also be useful for things like localized messages -- send something that says where the magic words are going to be and let them be handled on the client side, such as plurals. API consumers should be able to get the AST without having to know all the freaky parsing themselves. [Plus this will be future-compatible if/when wikitext dies ;) -brion]

Trevor says, and I agree, that we need to eliminate some of the truly weird-ass cases in the syntax so that we can end up with something even a little bit closer to LALR(1). The biggest problem in being able to present partial pages is that a template can enter into an HTML structure with another template exiting that structure -- specifically, starting and stopping a table. That makes it much, much harder to parse just a portion of the text and understand its semantics. Using things like the peg/leg-based parsers to identify weird syntax usage helps us clean up and upgrade slowly to a cleaner system.

Trevor is writing on his pad: Wikitext -> Parser -> Document -> Post-Processor. If we stored parsed text, we could generate HTML much, much more quickly. Brion points out that not all templates are the same; some we might need to look into while parsing, others not. He draws a sketch on the board, which I will attempt to reproduce here:

Doc Model:

a page is comprised of blocks
a block is a paragraph | table | template box | image box | list | (list item) | heading

   ^ list items must be in a list block; list blocks are implied in present markup - dangerous
   ^ paragraphs are generally implied, which complicates many of those things (edge cases like finding a template or comment first)

a block may contain other blocks
blocks cannot start or end another block

   ^--- **many known violations of this**
   ^ requires fixup; can we enforce this one template at a time? :D
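
For illustration, here is the classic violation mentioned above (the template names are invented for this example): one template's expansion opens a table and another's closes it, so no single piece of source contains a well-formed table on its own:

   {{table-start}}   <- expands to: {| class="wikitable"
   | a cell
   | another cell
   {{table-end}}     <- expands to: |}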

A block template may be comprised of multiple blocks.
A paragraph has one or more inlines.
An inline is a text span | link | math | parser function | template generating inline | image.
Inlines cannot contain blocks.

   ^ violation in templates is likely. need to learn to distinguish block & inline template context

Inlines can contain inlines.

PEG is a backtracking parser generator, and seems to be quite capable. Karl is generating HTML, and Hannes is generating an abstract syntax representation. They don't handle some of the worst edge cases; their philosophy is "if it generates ugly results, you need to fix your wikitext."
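
For a sense of what such a structure might look like, here is one hypothetical JSON rendering of the model sketched above -- all field names are invented for illustration, nothing here is an agreed format:

   {
     "type": "page",
     "blocks": [
       { "type": "heading", "level": 2,
         "inlines": [ { "type": "text", "text": "History" } ] },
       { "type": "paragraph",
         "inlines": [
           { "type": "text", "text": "See " },
           { "type": "link", "target": "History of France",
             "inlines": [ { "type": "text", "text": "the main article" } ] },
           { "type": "text", "text": "." }
         ] },
       { "type": "list", "ordered": false,
         "items": [
           { "type": "listItem",
             "blocks": [ { "type": "paragraph",
                           "inlines": [ { "type": "text", "text": "a list item" } ] } ] }
         ] }
     ]
   }

Note how the link (an inline) contains another inline, and how the list item lives inside an explicit list block rather than being implied.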

Flagging and fixing weird broken things

We need to do something similar, but we need to let people know that THIS, which used to work, will not work in the future. If we can flag individual items as clean, we can run them through the new system henceforth, forever, safely. Flagged-broken items can have a big warning in the meantime... Hear, hear!

Discussion of the difficulty of parsing '''''''''''''''''. Hannes points out that people will open a link, open italics, close the link, and it works -- but it shouldn't. We can mark such pages automatically on saving, with a warning message.

   (this is simple bad nesting, same as with all other structures including <em> or <b> explicit usage)
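
A concrete case of the overlap Hannes describes would be something like this -- italics opened inside the link text but closed outside it:

   [[France|''French]] cuisine''
   ^ cannot be represented as a well-formed tree; either the italics or the link has to be split or rejected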

... more whinging about bad nesting ...

   l'escargot
   ^ oh the terror 

We need to actually start hashing out a DOM-ish document model that covers our needs fairly consistently. Each new parser has this sort of half-done for its own specific needs. Trevor's thingy, based on some of Magnus's thingies, is doing some testing based on the ability to round-trip for editing

   includes 'origin' markings to aid in the round-tripping -- but these could be dropped in future, for instance when doing editing on a fully upgraded page
       (freaky whitespace and weird broken stuff needs to be preserved until a page and its dependencies can be verified as nesting properly; so being able to keep it for round-tripping is important)

[using intermediate DOM also can aid in conversions & import/export with other wikis, etc]
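
One hypothetical shape for those 'origin' markings (field names invented here): each node carries character offsets back into the original wikitext, so untouched nodes can be emitted verbatim when saving:

   { "type": "paragraph",
     "origin": { "start": 1042, "end": 1153 },
     "inlines": [ ... ] }

An editor that only changed one paragraph could then copy every other byte range straight from the stored source, freaky whitespace and all.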

Collect existing work

  • Trevor's work? -- not checked in yet ;) [will be checked in for review. no solid docs, but plenty of comments]
  • Magnus's work? -- code is in parsers; may be some docs somewhere
  • Crystal ball? -- no description yet but can pop into the code for now. (Does additional postprocessing from the AST to normalize to document tree)
  • Other?
   Neil K is threatening to check something PEG-related in
  • Kiwi doesn't really produce a structure (goes straight to HTML)

Don't forget mwlib, which has its own document object model, and which is the most complete alternative parser implementation (written in Python). It is production-tested on WMF projects to handle all kinds of weird stuff cleanly. It also already supports output to PDF via ReportLab (with prototype writers for XHTML, DocBook and OpenDocument). Drawback: it is optimized for print, not web. But it may be possible to build on. http://code.pediapress.com/git

Big question: when storing a parsed document, do we store the exact document, or do we store a normalized document which is more structured but possibly doesn't render as exactly the same thing?

Editing workflow for fun and profit

Keep all the power in the users' hands! Pages that are document-format-friendly can be presented for editing both visually and via source -- and they can be presented in a slightly better wikitext source format too, such as removing the '''''''' ambiguities and crap while still being keyboard-friendly for the future. When syntax gets upgraded "for you", make it clear. Don't break things unexpectedly during an edit.

   batch detection & todo lists -- these could even be done before upgrading things in some cases.

WEIRD TEMPLATE HORRORS can generally be replaced with better systems... do we need to create those systems though?

   What do we not know?
       Some things are easy to fix by moving a tag
       Some things like oddly nested templates might actually need NEW FEATURES to make their output work. FIND OUT WHAT THESE ARE AND DEVISE SYSTEMS.

Template expansion and other fun

Decision made: push rendering to as late in the process as possible. Same for template substitution. (Yay for decisions!)

Things like images-or-links can be a bit ambiguous, but at our DOM level they should be reasonably clear about what they are -- e.g. a File:whatever becomes a file-reference node, which the *renderer* might render as a link or as an image, depending on whether the file is an existing image, a non-image file, or doesn't exist.

Non-legacy template expansion should be done at the DOM level, and can retain the origins of parameter replacement (a rough sketch follows the list below). This means:

  • no need to re-parse entire pages; drop in the template's node tree to replace the template-inclusion node, then iterate through parameters and replace them in
  • keep the reference through to HTML output and a visual editor can do things like finding the bit of original page source that has text for you to edit in an individual table cell of an infobox
  • further, could allow re-collapsing the template so we can edit with a pre-filled tree, then collapse the templates back out for saving (oooooh)
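
A rough before/after sketch of that expansion step, with hypothetical field names again. Before, the page tree holds an unexpanded template-inclusion node:

   { "type": "template", "title": "Infobox city",
     "params": { "name": "Berlin" },
     "origin": { "start": 0, "end": 46 } }

After expansion, the template's own node tree is dropped in, and the substituted parameter text keeps a reference back to the call site:

   { "type": "table", "from": { "template": "Infobox city" },
     "blocks": [
       { "type": "tableCell",
         "inlines": [ { "type": "text", "text": "Berlin",
                        "origin": { "param": "name" } } ] }
     ] }

That per-parameter origin is what would let a visual editor map a click in an infobox cell back to the template parameter in the page source.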

(Legacy template expansion -- the kind that deals with the horrible evil things we see sometimes with bad template nesting -- has to be done in the legacy preprocessor or source stages. Very ugly. That's why we don't ever want to do it in the far future, and want to stop doing it on particular pages/templates as soon as we know they're good.)

Editing

Where we have totally DOM-clean pages, a visual editor doesn't have to worry about wikitext source at all. Whee!!!!

XML vs JSON?

Decision: we're gonna aim for a JSON representation when serialization is needed. JSON is friendlier to native data structures in JS and PHP (and Python and Ruby and...), and more compact -- likely beneficial for front-end code. "We do love us some jQuery."

XML has some niceties, like existing systems for querying into structures, which could be beneficial. Do we need those? Do we need to reinvent them for a JSON-y structure? Further time spent bikeshedding XML vs JSON.

JSON note: make sure node types are cleanly extensible for plugins that tweak syntax.
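
On that extensibility note, namespaced type strings might be enough (purely illustrative names):

   { "type": "ext:timeline", "source": "..." }
   { "type": "ext:graph", "source": "..." }

A plugin registers its own node types; core code that doesn't recognize a type can pass the node through, or fall back to its stored source text.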

3pm summary

  • we mostly agree on the general structure items needed
  • migrating without breaking existing non-normal structures will require some side-by-side stuff with alternate modes [eek!]
  • some details, like original-source-whitespace-tweak chunks, will only really be needed for some uses (supporting round-tripping on non-normalized source pages); we need to include these in addition to things that are output-only
  • we probably like JSON for standard serialization of the structure.

Next steps

Set up Parsers as the "Big Topic" at Berlin Developer's Meeting

  • folks need to experiment with the kiwi-style mass parsing to help identify known structure problems -> supply todo lists for editors to do cleanup, and prepare for the future built-in upgrade warnings
  • folks need to start hashing out a preliminary doc structure (JSON?) and start experimenting
   a) confirm that we can cover pretty much any reasonable page structure
   b) confirm how well round-tripping with original source can work for editing (with and without template expansion)
   c) start firming up the issue of what markup *won't* work
   d) if necessary, start devising alternatives for problem areas like bad nesting in templates
  • recommend having some good things to demo by the May hackathons