HTML is great. It's the lingua franca of the web, and a fantastic format for exchanging hyperlinked information. However, it has its drawbacks: it typically relies on multiple external files, different browsers interpret it in different ways, and printing it is a bit of a minefield, even with the limited print CSS currently available.
So, sometimes it makes sense to present documents as a PDF as well. I've done so on this very site with my CV, after finding that most recruitment sites won't accept an HTML document, and that recruiters just get confused when you attach one to an email (or send them a hyperlink).
The component I use is called dompdf. At its heart it is an HTML->PDF converter written completely in PHP, and is pretty simple to use. The code to convert some HTML to a PDF looks something like this:
<?php
// include in the dompdf library
require_once('dompdf_config.inc.php');
spl_autoload_register('DOMPDF_autoload');
// instance dompdf
$dompdf = new DomPDF();
$dompdf->set_paper('a4');
// tell the user-agent to expect a PDF
header('Content-type: application/pdf');
// load the HTML, convert it to PDF and output
$src = file_get_contents('document.html');
$dompdf->load_html($src);
$dompdf->render();
echo $dompdf->output();
?>
Fairly straightforward, but there are some issues with dompdf that complicate the matter a fair bit.
In short, dompdf is old. Its last release was two years ago and there's not much sign of a new one. This means that its CSS support has a lot of holes, so just throwing your existing HTML at dompdf isn't going to produce fantastic results.
(As an aside, if you have a huge budget it's worth looking at Prince. Prince is a similar product that has incredible CSS support, but costs a fortune while dompdf is free.)
One option would be to write your HTML pages in a fairly old, table-based style so that dompdf can render them properly, but my preferred solution is to add a conversion step that takes the nice, semantic XHTML I've written my CV in and converts it, on the fly, into the sort of HTML that dompdf likes, using XSL.
If you're unfamiliar with XSL there's a brief introduction here, but essentially an XSL template is a template for converting XML documents (such as our XHTML original) into XML, HTML or plain-text output. It changes the last block of code into something like the following:
<?php
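// load the XHTML and the XSL into DOM objects
// (DOMDocument::load() is an instance method, so create the objects first)
$htmlDom = new DOMDocument();
$htmlDom->load('document.html');
$styleDom = new DOMDocument();
$styleDom->load('convert.xsl');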
// instance an XSLT processor and give it the stylesheet
$xslProc = new XSLTProcessor();
$xslProc->importStylesheet($styleDom);
// apply the XSL to the XHTML to get the old-style HTML
$src = $xslProc->transformToXML($htmlDom);
// render the old-style HTML ($dompdf is set up as in the first listing)
$dompdf->load_html($src);
$dompdf->render();
echo $dompdf->output();
?>
The task therefore is to write an XSL that converts your semantic XHTML into the old-style HTML that dompdf can handle. That's a bit trickier than it sounds, so I'll take you through what I've done for my CV. The starting point, really, is to do nothing. The XSL for 'do nothing' looks like this:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="utf-8"
omit-xml-declaration="yes" />
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The xsl:stylesheet and xsl:output nodes just specify that this is an XSL and that we want to output some HTML. The 'engine' of XSL is the xsl:template node: an xsl:template specifies that for a certain node in the source document, a certain output should be produced.
This template essentially says that it matches any node in the source document, and that the appropriate behaviour is to copy the node in the output, and then go on to process any of the child nodes in the same way. The processor will start at the root node of the input (in our case the html tag) and slavishly copy out the entire DOM tree into the output.
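To see the 'do nothing' behaviour in isolation, here is a minimal sketch that runs the identity stylesheet against a small inline document. It assumes PHP's dom and xsl extensions are available; the sample markup and the variable names are made up for illustration, and I've switched the output method to xml here (rather than the html used above) so that the serializer adds nothing of its own.

```php
<?php
// Sketch only: the sample document below is invented for illustration.
$identityXsl = <<<'STYLESHEET'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" />
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
STYLESHEET;

$source = '<html><body><p class="intro">Hello <em>world</em></p></body></html>';

// load the document and the stylesheet from strings instead of files
$htmlDom = new DOMDocument();
$htmlDom->loadXML($source);
$styleDom = new DOMDocument();
$styleDom->loadXML($identityXsl);

$xslProc = new XSLTProcessor();
$xslProc->importStylesheet($styleDom);

// every element, attribute and text node is copied through untouched
$identityOutput = trim($xslProc->transformToXML($htmlDom));
echo $identityOutput, "\n";
```

The echoed markup should have the same elements, attributes and text as the input, which is exactly what we want from a starting point.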
Now that we have an XSL that outputs whatever's put into it, we can start specifying exceptions to this rule. The first thing that I notice from the dompdf output is that dompdf underlines my A tags and colours them in blue, but doesn't make them clickable. The solution is to add a template into the document that removes A tags:
<xsl:template match="a">
<xsl:apply-templates select="node()"/>
</xsl:template>
This would slot into the previous XSL just underneath the existing xsl:template, inside the xsl:stylesheet. What will now happen is that as the processor traverses the DOM tree of the source document, for most nodes it will copy them into the output because they match the first template. However, when it gets to an A node it will match the second template, and just copy the child nodes, ignoring the A.
For example, the following line:
<p><a href="test.html">Hello <em>world</em></a></p>
Would be converted into:
<p>Hello <em>world</em></p>
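If you want to check that conversion for yourself, the following sketch applies the two templates to the example line. The stripLinks() helper is hypothetical (it's not part of dompdf or PHP), and I've again used the xml output method so the result is a bare fragment.

```php
<?php
// Hypothetical helper: copy everything except A elements, whose children are kept.
function stripLinks(string $xhtml): string
{
    $xsl = '<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" omit-xml-declaration="yes" />
      <xsl:template match="node()|@*">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
      <xsl:template match="a">
        <xsl:apply-templates select="node()"/>
      </xsl:template>
    </xsl:stylesheet>';

    $htmlDom = new DOMDocument();
    $htmlDom->loadXML($xhtml);
    $styleDom = new DOMDocument();
    $styleDom->loadXML($xsl);

    $xslProc = new XSLTProcessor();
    $xslProc->importStylesheet($styleDom);

    return trim($xslProc->transformToXML($htmlDom));
}

$stripped = stripLinks('<p><a href="test.html">Hello <em>world</em></a></p>');
echo $stripped, "\n"; // prints: <p>Hello <em>world</em></p>
```

The more specific match="a" template wins over the catch-all copy template, which is why the exception takes effect without touching the first template.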
In the case of my CV, I decided to ditch the existing stylesheet completely and insert my own rules for the PDF version. The XSL for replacing a node with something else is fairly simple: if you don't call xsl:apply-templates, all of the children of that node are ignored, and the node itself is replaced with whatever you put in the template:
<xsl:template match="style">
<style type="text/css">
body{font-size: 10pt;}
/* rest of the styles here */
</style>
</xsl:template>
Using this technique of specifying exceptions, we can methodically set templates for all of the other elements in the source that aren't being rendered correctly in the output.
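As a rough check of the style replacement, this sketch runs a style template like the one above against an invented document. The original CSS rule (body{color:red;}) and the variable names are assumptions for the sake of the example; the xml output method is used to keep the result predictable.

```php
<?php
// Sketch only: the sample document and its original CSS rule are invented.
$styleXsl = <<<'STYLESHEET'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" />
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- no xsl:apply-templates here, so the original rules are dropped -->
  <xsl:template match="style">
    <style type="text/css">body{font-size: 10pt;}</style>
  </xsl:template>
</xsl:stylesheet>
STYLESHEET;

$styledSource = '<html><head><style type="text/css">body{color:red;}</style></head>'
    . '<body><p>Hi</p></body></html>';

$htmlDom = new DOMDocument();
$htmlDom->loadXML($styledSource);
$styleDom = new DOMDocument();
$styleDom->loadXML($styleXsl);

$xslProc = new XSLTProcessor();
$xslProc->importStylesheet($styleDom);

// the style element now carries the PDF rules; everything else is copied
$restyled = trim($xslProc->transformToXML($htmlDom));
echo $restyled, "\n";
```

The old rule disappears because nothing in the style template asks the processor to visit the element's children.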
The process is pretty simple. I tend to do it in an iterative loop of:
- Render the PDF
- Spot something that doesn't work
- Specify an xsl:template that makes an exception and fixes it
- Repeat until the output is perfect
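One thing that may help with this loop is keeping a copy of the intermediate HTML, so you can see exactly what dompdf is being given. The sketch below is only a suggestion: transformForPdf() and the debug file path are hypothetical names of mine, and the dompdf rendering step itself is left out so the example stands alone.

```php
<?php
// Hypothetical helper: apply the stylesheet and save the intermediate HTML
// so it can be inspected alongside the rendered PDF.
function transformForPdf(string $xhtml, string $xsl, string $debugPath): string
{
    $htmlDom = new DOMDocument();
    $htmlDom->loadXML($xhtml);
    $styleDom = new DOMDocument();
    $styleDom->loadXML($xsl);

    $xslProc = new XSLTProcessor();
    $xslProc->importStylesheet($styleDom);

    $html = (string) $xslProc->transformToXML($htmlDom);
    file_put_contents($debugPath, $html);

    return $html;
}

// the 'do nothing' stylesheet stands in for a real convert.xsl here
$loopXsl = '<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="utf-8" />
  <xsl:template match="node()|@*">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:stylesheet>';

$debugPath = sys_get_temp_dir() . '/dompdf-input.html';
$src = transformForPdf('<html><body><p>Hi</p></body></html>', $loopXsl, $debugPath);
// $src is what would then be handed to $dompdf->load_html($src)
```

Opening the saved file in a browser next to the PDF makes it easier to tell whether a problem comes from the XSL or from dompdf's rendering.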
The hardest part is remembering how to lay stuff out using tables, which, if you're like me, you've been avoiding for the last few years. The good news is that the table-based layout you're designing only has to work for one user-agent, dompdf, and will never be seen in public. There's no 'view source' in a PDF, after all.
I won't go through the entire XSL for my CV, but as a final example here is the XSL for converting the page header, which in the source uses a UL and a couple of floated elements, and in the PDF ends up relying on a good old 100% width table:
<xsl:template match="div[@id='header']">
<table cellspacing="0"
cellpadding="0"
border="0"
width="100%"
style="border-bottom: solid 0.5pt black; margin: 0;">
<tr>
<td valign="bottom" align="left">
<span style="font-size: 2em; font-weight: bold">
<xsl:value-of select="h1" />
</span>
</td>
<td valign="bottom" align="right">
<xsl:value-of select="//*[@class='tel']" />
<br />
<xsl:value-of select="//*[@class='email']" />
</td>
</tr>
</table>
</xsl:template>
Hopefully this has given you an insight into how powerful XSL can be, and how you can leverage it to give new life to tools like dompdf that otherwise seem a little behind the times.
1.
Being a newly qualified Zend Framework Genius I thought you might suggest using Zend_Pdf?
Russell
12th August 2008, 15:54