Sunday, January 06, 2013


PDF is still the most widely used format for distributing electronic documents - especially in "paper-based" workflows. There are obvious reasons to keep using PDF, even though evolving technologies have brought alternatives to the table.

On the other hand, HTML and related technologies (namely CSS and JavaScript) have taken a major step forward and can now produce enterprise-looking documents with additional value that a PDF cannot provide.

I do not want to compare the pros and cons here - I just want to mention some well-suited tools for converting from one format to the other.


Whenever you have to create "paper-based" workflows from data for which you already have an HTML(5) version, the best tool I'm aware of is PrinceXML. Styling is done using CSS and JavaScript, so there is no additional technology you need to learn.

This technology cannot fully replace advanced rendering technologies like XSL-FO (and the corresponding commercial layout engines), or even more advanced ones based on vendor-specific layout definitions.
But there are many use cases where you simply have HTML that you need to distribute as PDF (e.g. order confirmations, archiving of a business transaction, etc.), and this product might enable that for a reasonable price.

Note: there are many commercial and open source alternatives out there for this approach - in my experience, none of them is as robust as the solution mentioned above.


If you already have PDFs created within your workflow and need HTML5 to support more advanced integration into your content distribution, you can try out PDF2HTML5 Converter.

There are different options for how the content is transformed (one of them creates SVG instead of HTML5, but that is a different story). The most useful conversion mode, if the content is to be integrated further, creates:
  • one HTML file per page
  • each distinct block of content wrapped in a div tag
  • layout applied using CSS
  • some JavaScript
Note: The generated HTML5 is not a responsive design (it uses absolute positioning and non-semantic markup) - but that is more or less by design and cannot be solved in a general-purpose implementation.

Both tools are commercial, but if you have a good use case for one of the two conversions mentioned, the price is reasonable. Try them out first - both tools provide easy ways to test the results.

Regular Expressions: Still Two Problems?

You sometimes stumble upon regular expressions that are difficult to understand, e.g. what exactly does the following mean:
The most common problem is recovering the structure of deeply nested regular expressions.
Using the online service makes it much easier to see the structure:

Now you can see that this regular expression matches all content except "foo1" or "foo2", using the pattern explained here:
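One common way to express "everything except foo1 or foo2" is a negative lookahead. The following is a minimal sketch under that assumption; the expression `notFoo` and the helper `isAllowed` are illustrative names and are not taken from the example above.

```javascript
// Hypothetical pattern (an assumption, not the expression from the post):
// accept a whole string unless it is exactly "foo1" or "foo2".
// (?!...) is a negative lookahead: it fails the match if what follows
// the current position matches the inner pattern.
const notFoo = /^(?!(?:foo1|foo2)$).*$/;

// Illustrative helper: true when the input is neither "foo1" nor "foo2".
function isAllowed(text) {
  return notFoo.test(text);
}

console.log(isAllowed("foo1"));  // false
console.log(isAllowed("foo2"));  // false
console.log(isAllowed("foo3"));  // true
console.log(isAllowed("foo12")); // true
```

Because the lookahead is anchored with `$`, only the exact strings are rejected; a longer string such as "foo12" still passes.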

Using the example taken from Regular Expressions: Now You Have Two Problems:

<\/?p>|<br\s?\/?>|<\/?b>|<\/?strong>|<\/?i>|<\/?em>|<\/?s>|<\/?strike>|<\/?blockquote>|<\/?sub>|<\/?super>|<\/?h(1|2|3)>|<\/?pre>|<hr\s?\/?>|<\/?code>|<\/?ul>|<\/?ol>|<\/?li>|<\/a>|<a[^>]+>|<img[^>]+\/?>

Try it out - you will probably see what the author tried to achieve.
Note: you have to write regular expressions in JavaScript style to use the service.
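To experiment with the example outside the service, it can also be run directly as a JavaScript regular expression. The sketch below assumes the pattern is meant as a whitelist of allowed HTML tags; the names `parts`, `whitelist` and `isWhitelisted`, and the `^(?:...)$` anchoring, are additions for illustration only.

```javascript
// The alternatives of the pattern above, one literal per entry
// (no whitespace between them, since whitespace is significant in JavaScript).
const parts = [
  /<\/?p>/, /<br\s?\/?>/, /<\/?b>/, /<\/?strong>/, /<\/?i>/, /<\/?em>/,
  /<\/?s>/, /<\/?strike>/, /<\/?blockquote>/, /<\/?sub>/, /<\/?super>/,
  /<\/?h(1|2|3)>/, /<\/?pre>/, /<hr\s?\/?>/, /<\/?code>/, /<\/?ul>/,
  /<\/?ol>/, /<\/?li>/, /<\/a>/, /<a[^>]+>/, /<img[^>]+\/?>/,
];

// Join the alternatives and anchor them so that a single tag must match as a whole.
// The anchoring is an assumption made for this sketch.
const whitelist = new RegExp(
  "^(?:" + parts.map((p) => p.source).join("|") + ")$"
);

// Illustrative helper: true when the given tag matches one of the alternatives.
function isWhitelisted(tag) {
  return whitelist.test(tag);
}

console.log(isWhitelisted("<p>"));          // true
console.log(isWhitelisted("<br />"));       // true
console.log(isWhitelisted("<h2>"));         // true
console.log(isWhitelisted("<h4>"));         // false
console.log(isWhitelisted("<script>"));     // false
console.log(isWhitelisted('<a href="x">')); // true
console.log(isWhitelisted("<abbr>"));       // true, because <a[^>]+> is quite permissive
```

The last call shows the kind of detail that is easy to miss in the flat expression: the `<a[^>]+>` alternative accepts any tag whose name starts with "a".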