Converting HTML to PDF using wkhtmltopdf

I blogged a while back about delivering pages as PDF using PHP, and at the time DOMPDF seemed to be the best-of-breed package for converting HTML into PDF for the purposes of delivering PDF versions of web content.

However, I noted at the time that DOMPDF's last release was in July 2007, and it still doesn't look like being updated any time soon. The fundamental problem with packages like DOMPDF is that they tend to implement their own rendering engine. The thing is, HTML and CSS are both pretty huge now - writing a rendering engine that can cope with all the different combinations is a huge task, so projects like DOMPDF end up missing out important bits of functionality.

A better approach would be to use an existing rendering engine from a browser, and then build a binary around it that can take a website as input and produce a PDF as output. That way you can get results consistent with how browsers would print a page and if you pick the right engine you'll not have to keep up with any changes to HTML standards, the engine developers will do that for you.

This is essentially the approach wkhtmltopdf takes: it extracts the open-sourced Webkit renderer used inside browsers like Safari and Chrome and bundles it up into a Linux CLI application which produces some pretty impressive results.

I thought I'd jump right in and start by compiling it on my Debian webserver. The wkhtmltopdf site has some instructions for building it on Ubuntu, which I thought were worth a try. The basic procedure was as follows:

#apt-get update
#apt-get install libqt4-dev qt4-dev-tools build-essential cmake

#svn checkout http://wkhtmltopdf.googlecode.com/svn/trunk/ wkhtmltopdf
#cd wkhtmltopdf
#cmake -D CMAKE_INSTALL_PREFIX=/usr .
#make
#sudo make install

In my case, this installed a terrifying amount of new packages to my server, but everything went very smoothly. I was left with a binary in /usr/bin and ploughed right in!

#wkhtmltopdf http://ciaranmcnulty.com /tmp/ciaranmcnulty.pdf
wkhtmltopdf: cannot connect to X server

Argh. The rendering engine depends on there being a GUI running on the machine so it can do cool things like generate graphics, render fonts and so forth. A typical webserver won't be running X, but luckily there are ways around it.

One such way is xvfb, or the X Virtual Frame Buffer. This is a handy bit of code that basically runs an X instance but without a lot of the overheads. You can create a temporary X buffer and run a command in it using the xvfb-run binary, the benefit of which is that the x instance gets thrown away afterwards. I installed xvfb and then invoked it as follows:

#apt-get install vfb
#xvfb-run -a -s "-screen 0 640x480x16" wkhtmltopdf --dpi 200 
  --page-size A4 http://ciaranmcnulty.com /tmp/ciaranmcnulty.pdf

The options should be fairly self-explanatory, the key things to note are that -a makes xvfb pick an unused display number (to avoid collisions) and -screen starts up the virtual framebuffer with a display with the correct bit depth and dimensions.

The results are fairly good, certainly better than PHPDOM would generate given the same input. My site layout uses a fair bit of floating and absolute positioning, and the PDF came out exactly as I'd expect:

Website PDF

It's important to note that this isn't a bitmap, the text in the PDF is still 'text'.

A quick dig around showed that to print the backgrounds I'd need to have Qt4.5 installed, something I wasn't really prepared to risk my server for. However, I thought I'd quickly try doing what I should have in the first place. The wkhtml project provides a linux binary that's statically compiled against Qt.

I downloaded this binary and gave it a whirl. The results were much better:

Website PDF with backgrounds

Frankly I think this is a great rendition of the page, and certainly good enough for an autogenerated PDF on a website. A bit of further investigation and experimentation has left me pretty impressed with the breadth of CSS print functionality webkit can support.

The next step for me is going to be to try and replace some of the DOMPDF installations in some of my smaller sites, and see how it performs under load. The time taken to generate a PDF is pretty high, and I've not really checked out how xvfb is with concurrency so I'd hesitate to throw it onto a production site straight away, but it'll be my first port of call next time I want to do something with a PDF.

Bookmark and Share

Comments

1.

Creating a PDF from a web page is not that common an occurrence for me. However this looks great and I am racking my brains to think of somewhere cool to implement this!

Russell
9th April 2009, 21:45

2.

I think that creating a PDF from an existing Web page is only one use case for this. You could use it in *any* situation where you want to generate a PDF in some dynamic, automated way such as via PHP. Rather than try to figure out the PDF format, or the workings of some arcane library, you can generate a page of HTML (which you already know, and is well documented) and then convert that.

It'll be interesting to see how well this plays with concurrency and load etc. I'll look forward to a follow-up post somewhere down the line!

Simon Harris
9th April 2009, 22:14

3.

Yeah as Simon says, there are really two use cases - one is PDF versions of existing pages, but the other is generating PDFs for different uses using HTML as your templating system.

The idea that your PDF templates can be in HTML and sit in your application alongside page templates, and basically be edited by the same people, is pretty attractive.

Ciaran McNulty
9th April 2009, 23:05

4.

thanks for help

Kowalikus
17th February 2010, 12:56

Add a comment