XHTML not dead, despite reports

With the W3C's recent announcement that work on XHTML 2.0 is not being continued, it would be tempting to think that the HTML vs XHTML war has been 'won', and not by the side a lot of people wanted.

However, that's a misconception. XHTML is alive and well as part of HTML5, on more or less equal terms with 'plain' HTML. It's just not going to be replacing 'tag soup' any time soon unless people start using it!

I'll be taking a look at the options authors have for producing XHTML markup, but let's first look at why someone might want to use XML.

Pros and cons of XHTML

Pros:

  • Parsing - XHTML pages can be parsed using standard XML toolkits.
  • Transforms - pages can have XSL transforms applied to them.
  • Embedded content - pages can have other XML namespaces embedded into them.
  • Unambiguous - the author doesn't have to remember whether a tag needs closing or not, or whether the tag should be upper or lowercase.
  • XML is 'cool' - this taps into the general developer's 'coolness' gene.
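
To make the parsing and embedded-content points a bit more concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample page, the embedded SVG fragment and the helper names extract_links and count_foreign_elements are all made up for illustration, and the sketch assumes the page really is well-formed XHTML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page: XHTML with an SVG fragment embedded via its own namespace.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>See <a href="/one">one</a> and <a href="/two">two</a>.</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
      <circle cx="5" cy="5" r="4"/>
    </svg>
  </body>
</html>"""


def extract_links(xhtml: str) -> list[str]:
    """Collect href values from every XHTML <a> element, using a generic XML parser."""
    root = ET.fromstring(xhtml)
    hrefs = []
    for anchor in root.iter(f"{{{XHTML_NS}}}a"):
        href = anchor.get("href")
        if href is not None:
            hrefs.append(href)
    return hrefs


def count_foreign_elements(xhtml: str) -> int:
    """Count elements that live outside the XHTML namespace (here, the SVG ones)."""
    root = ET.fromstring(xhtml)
    prefix = f"{{{XHTML_NS}}}"
    return sum(1 for element in root.iter() if not element.tag.startswith(prefix))


if __name__ == "__main__":
    print(extract_links(PAGE))
    print(count_foreign_elements(PAGE))
```

No HTML-specific library is involved here: the same toolkit that reads any other XML document reads the page, which is really the whole appeal.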

Cons:

  • Brittleness - it's a lot easier to make a mistake that renders the page invalid XHTML.
  • Deprecated JS - you can't use some constructs like document.write().
  • No iframes - these aren't supported in the XHTML Strict doctypes.
  • MIME type - there is confusion about which type XHTML should be served as (see next section).

To be honest, it's worth asking whether those pros outweigh the cons for you. There are plenty of people who think you should just use HTML 4.01 and not worry about XML.

Option 1: Use XHTML 1.1 and serve it as text/html

Just because XHTML 2.0 has gone away, it doesn't mean you have to jump on board XHTML5 - browsers will continue supporting XHTML 1.1 for decades to come.

There are a few caveats to consider when using XHTML but serving it as HTML. The W3C XHTML Media Types recommendation states:

"In general, this media type is NOT suitable for XHTML except when the XHTML is conforms to the guidelines in Appendix A. In particular, 'text/html' is NOT suitable for XHTML Family document types that add elements and attributes from foreign namespaces, such as XHTML+MathML [...] XHTML documents served as 'text/html' will not be processed as XML"

In short, even if your DOCTYPE says your document is XHTML 1.1, your browser will not believe you and will basically parse it as HTML. There is good reason for this, however: surveys suggest that fewer than 40% of documents online validate against the doctype they declare.

Anecdotally, the Microformats community has developed a number of proxy services that take HTML pages as input and transform them to other document types, and nearly all of them have run into the problem of supposedly-XHTML documents causing errors when fed into XSL engines. Most now run content through Tidy before processing.
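
As a rough illustration of the problem those proxy services hit, here is a small sketch of a check a consumer could run before handing a page to an XML tool. It uses Python's standard xml.etree.ElementTree; the function name xml_error and the sample snippets are hypothetical, and it only tests well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def xml_error(markup: str) -> Optional[str]:
    """Return None if the markup is well-formed XML, otherwise the parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    samples = [
        "<p>Hello<br/>world</p>",   # well-formed
        "<p>Hello<br>world</p>",    # HTML-style unclosed <br>
        "<p>Fish &nbsp; chips</p>", # HTML entity that plain XML doesn't define
    ]
    for sample in samples:
        print(repr(sample), "->", xml_error(sample))
```

The last two snippets are perfectly ordinary when parsed as HTML, which is exactly why 'supposedly XHTML' pages slip through unnoticed until something tries to read them as XML.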

This option means:

  • Your pages have to be valid HTML 4.01 anyway, by following the Appendix A compatibility guidelines.
  • You can't use any of the XML features like namespaces.
  • Consumers are advised not to trust that it's XML anyway, so browsers will render it the same as HTML 4.01.
  • Consequently, this option is best thought of as a subset of HTML 4.01 where the document just happens to also be valid XML.

Option 2: Use XHTML 1.1 served as application/xhtml+xml

The same W3C document quoted above says:

"Family documents. 'application/xhtml+xml' should be used for serving XHTML documents to XHTML user agents (agents that explicitly indicate they support this media type)."

In short, even if you're serving documents as text/html to some clients, you should use application/xhtml+xml for those that say they can support it, like most modern browsers except IE. Very few sites actually do this.
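
A server-side sketch of that rule might look like the following. It's plain Python with no framework, pick_content_type is a made-up name, and the Accept parsing is deliberately simplistic: it treats a client as XHTML-capable only if it names application/xhtml+xml explicitly with a non-zero q value, and ignores wildcards like */*.

```python
def pick_content_type(accept_header: str) -> str:
    """Choose the media type to serve from a raw HTTP Accept header value."""
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        quality = 1.0  # default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unparsable q value as "not acceptable"
        if quality > 0:
            return "application/xhtml+xml"
    # No explicit support advertised: fall back to the HTML-compatible route.
    return "text/html"


if __name__ == "__main__":
    print(pick_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
    print(pick_content_type("text/html, */*"))
```

A real deployment would also want to send a Vary: Accept response header so that caches don't hand the XHTML version to a client that never asked for it.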

The reason is that when served as application/xhtml+xml, browsers actually trust that your document is going to be XML and throw validation errors or break when it's not. Why is that bad? Well, it turns out it's actually pretty hard to guarantee this, even at the server side.

Developers are reasonably conversant in XML syntax, but does that apply to everyone that gets to generate content on your site? Maybe there's a junior front-end developer who knows how to bash HTML in Dreamweaver, maybe users can post content onto the site themselves, or maybe your super-genius senior developers will occasionally make a typo.

Essentially the upshot of this option is:

  • If you do this for some clients, you still have to do Option 1 for other users, which means you either have to do everything twice, or you have to duplicate your work.
  • You have to do a lot more work ensuring that your markup is valid XML at the server side.
  • You do get to use XML-specific features, as long as you don't mind not serving your content to some users.

Option 3: Use XHTML5

At this point in the post, you're probably thinking that XHTML is a bit of a mess, and might be hoping that I'm going to say that HTML5 solves all the problems. Sadly, that's not the case.

HTML5 does, however, clean things up somewhat.

For a start, unlike HTML 4.01 vs XHTML 1.1, the XML side of HTML5 is part of the core specification. The HTML5 spec (or proposed spec, I should say) defines the elements and attributes in terms of a parsed DOM, and then explains how they should be serialised into XML and 'HTML' forms.

The spec also goes to great lengths to specify a parsing and error-handling model for the HTML serialisation, so that browsers can know the 'right' way to parse seemingly-malformed content like <p><b></p></b> and get a consistent result. This may seem like a waste of time in a world where XML exists, but in reality with so many hand-coded sites out there, it's a good idea to have a consistent set of rules about how to handle bad markup.

One other way that HTML5 clarifies the split between HTML and XHTML is that the two serialisations share a single DOCTYPE:
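
To see the difference in attitude between the two kinds of parser, here is a small sketch using two modules from Python's standard library. One caveat up front: html.parser is just a forgiving tokenizer, not an implementation of the HTML5 parsing algorithm, so this shows tolerance versus strictness rather than the exact DOM a browser would build. The class and function names are my own.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class EventLogger(HTMLParser):
    """Record start and end tags in the order the lenient HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup: str) -> list:
    """Feed markup to the tolerant HTML tokenizer and return the tag events it saw."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


def parses_as_xml(markup: str) -> bool:
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    soup = "<p><b></p></b>"
    print(html_events(soup))    # the tokenizer carries on regardless
    print(parses_as_xml(soup))  # the XML parser gives up at the mismatched tag
```

The HTML side reports every tag without complaint and leaves the repair work to whatever builds the tree, which is precisely the part the HTML5 spec pins down.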

<!DOCTYPE html>

Think about the current generation's situation:

  • Documents served as text/html with the HTML 4 doctype are HTML
  • Documents served as text/html with the XHTML 1.1 doctype are HTML and valid XHTML
  • Documents served as application/xhtml+xml with the XHTML 1.1 doctype are XHTML

By having a single DOCTYPE, HTML5 avoids the awkward middle situation:

  • Documents served as text/html with the HTML5 doctype are HTML
  • Documents served as application/xhtml+xml with the HTML5 doctype are XHTML
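
In other words, the media type alone decides how the document is read. The sketch below spells that out in Python; parser_for and root_tag_as_xml are hypothetical helper names, the sample page is invented, and the second helper simply shows that a page carrying the short DOCTYPE can still be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML5 page: the short DOCTYPE plus the XHTML namespace.
XHTML5_PAGE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hello</title></head>
  <body><p>Same DOCTYPE, either parser.</p></body>
</html>"""


def parser_for(content_type: str) -> str:
    """Map a Content-Type header value to the parser a consumer should use."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/xhtml+xml":
        return "xml"
    if media_type == "text/html":
        return "html"
    raise ValueError(f"not an HTML or XHTML media type: {media_type!r}")


def root_tag_as_xml(markup: str) -> str:
    """Parse the markup with a strict XML parser and return the namespaced root tag."""
    return ET.fromstring(markup).tag


if __name__ == "__main__":
    print(parser_for("text/html; charset=utf-8"))
    print(parser_for("application/xhtml+xml"))
    print(root_tag_as_xml(XHTML5_PAGE))
```

Note that the DOCTYPE never appears in the decision: it's the same in both cases, so there is nothing for a consumer to second-guess.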

This option seems like a good one going forwards, as HTML5 elements become more and more supported by browsers:

  • Authors who can guarantee their pages are valid XML can serve them as XHTML to supporting browsers.
  • Authors who think their pages are probably XML can serve them as text/html and if any mistakes sneak in, it won't matter too much. The HTML parsing engine in HTML5 should render their documents as intended anyway.
  • Authors who don't want to worry about XML can write their pages as HTML and continue not to care about whether they're parsable by XML toolkits, as they know that the content-type and DOCTYPE they're using offer no such guarantees to consumers.

The future

My own feeling is that the XHTML 1.1 specification didn't bring much to the table for browser manufacturers. It defined a new MIME type for them to recognise, and started asking them to validate pages as XML in some instances, but didn't actually add much in terms of useful features - browsers already had an HTML rendering engine after all, so it's not as if XHTML made anything easier for them.

Also, from the development side, XHTML was adopted on sites for a few different reasons, such as a sense of neatness or the wish to keep up with the latest thing, but there was rarely a business case for it - very few sites rely on their output being parsable as XML.

Hopefully by having XML at the core of the new specification, XHTML will be more and more embraced. When user-agents are upgraded to parse the new markup, programming teams will also feel the need to implement the XML parser as well as the HTML parser. Similarly, when sites are redesigned to be HTML5-compatible, hopefully in a lot of cases XML-compatibility will be included in that work.

However, in another sense hopefully HTML5's strong specification of how HTML should be parsed will enable browser manufacturers to further standardise how they treat 'tag soup', and maybe those amateur hand-coders out there will find that their non-validating badly-written code is at least achieving the goal of working roughly the same no matter what browser looks at it.
