I blogged a while back about delivering pages as PDF using PHP, and at the time DOMPDF seemed to be the best-of-breed package for converting HTML into PDF for the purposes of delivering PDF versions of web content.
However, I noted at the time that DOMPDF's last release was in July 2007, and it still doesn't look like being updated any time soon. The fundamental problem with packages like DOMPDF is that they tend to implement their own rendering engine. The thing is, HTML and CSS are both pretty huge now - writing a rendering engine that can cope with all the different combinations is a huge task, so projects like DOMPDF end up missing out important bits of functionality.
A better approach would be to use an existing rendering engine from a browser, and then build a binary around it that can take a website as input and produce a PDF as output. That way you get results consistent with how browsers would print a page, and if you pick the right engine you won't have to keep up with changes to HTML standards: the engine developers will do that for you.
This is essentially the approach wkhtmltopdf takes: it takes the open-source WebKit renderer used inside browsers like Safari and Chrome and bundles it into a Linux CLI application that produces some pretty impressive results.
I thought I'd jump right in and start by compiling it on my Debian webserver. The wkhtmltopdf site has some instructions for building it on Ubuntu, which I thought were worth a try. The basic procedure was as follows:
#apt-get update
#apt-get install libqt4-dev qt4-dev-tools build-essential cmake
#svn checkout http://wkhtmltopdf.googlecode.com/svn/trunk/ wkhtmltopdf
#cd wkhtmltopdf
#cmake -D CMAKE_INSTALL_PREFIX=/usr .
#make
#sudo make install
In my case, this installed a terrifying number of new packages on my server, but everything went very smoothly. I was left with a binary in /usr/bin and ploughed right in!
#wkhtmltopdf http://ciaranmcnulty.com /tmp/ciaranmcnulty.pdf
wkhtmltopdf: cannot connect to X server
Argh. The rendering engine depends on there being a GUI running on the machine so it can do cool things like generate graphics, render fonts and so forth. A typical webserver won't be running X, but luckily there are ways around it.
One such way is Xvfb, the X virtual framebuffer. This is a handy bit of code that basically runs an X instance but without a lot of the overheads. You can create a temporary X display and run a command in it using the xvfb-run binary, the benefit of which is that the X instance gets thrown away afterwards. I installed xvfb and then invoked it as follows:
#apt-get install xvfb
#xvfb-run -a -s "-screen 0 640x480x16" wkhtmltopdf --dpi 200 --page-size A4 http://ciaranmcnulty.com /tmp/ciaranmcnulty.pdf
The options should be fairly self-explanatory. The key things to note are that -a makes xvfb-run pick an unused display number (to avoid collisions), and -s passes its argument through to the X server, where -screen starts up the virtual framebuffer with a display of the given dimensions and bit depth.
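If you end up calling this from a script, the invocation above can be wrapped in a small shell function. The following is only a sketch: the function name render_pdf and the DRY_RUN switch are made up for illustration, and the flags are simply the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: render_pdf URL OUTPUT
# Wraps the xvfb-run + wkhtmltopdf invocation shown above.
# Set DRY_RUN=1 to print the command instead of running it.
render_pdf() {
    url=$1
    out=$2
    if [ -z "$url" ] || [ -z "$out" ]; then
        echo "usage: render_pdf URL OUTPUT" >&2
        return 2
    fi

    # Rebuild the positional parameters as the full command line.
    set -- xvfb-run -a -s "-screen 0 640x480x16" \
        wkhtmltopdf --dpi 200 --page-size A4 "$url" "$out"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the words of the command, separated by spaces.
        printf '%s\n' "$*"
        return 0
    fi

    "$@"
}

# Example usage (needs xvfb and wkhtmltopdf installed):
# render_pdf http://ciaranmcnulty.com /tmp/ciaranmcnulty.pdf
```

With DRY_RUN=1 set, the function only prints the command it would run, which is handy for checking the arguments on a machine without xvfb installed. The printed form drops the quotes around the -s argument, so it is for inspection rather than copy-and-paste.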
The results are fairly good, certainly better than DOMPDF would generate given the same input. My site layout uses a fair bit of floating and absolute positioning, and the PDF came out exactly as I'd expect:
It's important to note that this isn't a bitmap; the text in the PDF is still 'text'.
A quick dig around showed that to print the backgrounds I'd need to have Qt 4.5 installed, something I wasn't really prepared to risk my server for. However, I thought I'd quickly try doing what I should have done in the first place. The wkhtmltopdf project provides a Linux binary that's statically compiled against Qt.
I downloaded this binary and gave it a whirl. The results were much better:
Frankly, I think this is a great rendition of the page, and certainly good enough for an autogenerated PDF on a website. A bit of further investigation and experimentation has left me pretty impressed with the breadth of CSS print functionality WebKit can support.
The next step for me is going to be to try replacing DOMPDF on some of my smaller sites and see how wkhtmltopdf performs under load. The time taken to generate a PDF is pretty high, and I've not really checked how xvfb copes with concurrency, so I'd hesitate to throw it onto a production site straight away, but it'll be my first port of call next time I want to do something with a PDF.
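A crude way to get a first feel for generation time is to wrap the command in a timer. This is a sketch only: time_cmd is a hypothetical helper, it measures whole seconds using date +%s, and it says nothing about behaviour under concurrent requests.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print how many whole seconds it took.
# Assumes a date(1) that supports +%s (GNU and BSD date both do).
time_cmd() {
    start=$(date +%s)
    status=0
    # Discard the command's own output so only the elapsed time is printed.
    "$@" >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    echo "$((end - start))"
    # Pass the command's exit status back to the caller.
    return "$status"
}

# Example usage (needs xvfb and wkhtmltopdf installed):
# time_cmd xvfb-run -a -s "-screen 0 640x480x16" \
#     wkhtmltopdf http://ciaranmcnulty.com /tmp/ciaranmcnulty.pdf
```

Because the helper returns the wrapped command's exit status, a failed conversion can still be told apart from a slow one.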


1.
Creating a PDF from a web page is not that common an occurrence for me. However this looks great and I am racking my brains to think of somewhere cool to implement this!
Russell
9th April 2009, 21:45