Preserving source formatting: I need samples of real-world markup

Imagine the following:

  1. You have a XML or HTML document from a version control system. (SVN, Hg, git, etc.)
  2. You have an editor which parses that into a DOM document.
  3. You edit the document using a GUI, not looking at the source code – just adding in paragraphs, new widgets, math formulae, vector graphics, etc. with a visual editor of some kind.
  4. You finish your changes and save the document.
  5. You get ready to commit your altered document back to the source code repository, looking over the change set, and wonder what your editor did to your document:  it only looks vaguely like what you started with.

This is a hard problem, I think, and for the most part it comes down to handling white space inside a XML tag of some kind.  For instance, attributes may appear all on the same line as the element they belong to, or on new lines by themselves, with different indentations.

On the one hand, a visual editor like this isn’t required to preserve source code formatting.  As long as two different formats of the same XML document’s white space parse to equal DOM nodes, the editor normally does not care.  On the other hand, version control systems don’t often deal with DOM nodes – only with the source markup that an editor generates.

In trying to build a better XML editor (Verbosio) which supports editing in visual modes, I feel I have to respect both visual editing and existing source formatting.  Now, there may be a solution out there – a set of algorithms and a specification – but I have certainly not heard of it.  On the other hand, many source-controlled versions of XML documents have certain patterns enforced from coding patterns their authors require.  I am much less concerned with formatting a document the way I think it should be than with honoring existing patterns of source formatting.

Given a few samples of real-world markup, I can make some educated guesses – write a few JavaScript objects into a model which tries to respect the original source.  What I’m asking is for diverse samples of human-generated and human-edited markup – not computer-generated 1MB+ size files, but something more reasonable for people to work with.  It’s especially significant if these samples live in (or are derived from) version control repositories.

So, would you please post relevant links as comments to this blog entry?  Either an existing specification for this sort of problem, or diverse XML samples where people have their hands on the code regularly.