Wednesday, April 23, 2014

The small scale approach: How to clean up messy data, standardize and enrich it

Have you ever faced the following challenges?

  • You have a huge amount of master data that needs to be cleaned up. Over time the same terms are used differently, names are built up in different styles, etc.
  • You have to split data currently stored in one column into several, e.g. a product name contains a dimension which should be processed in a separate column in the future.
  • Common faults in the data need to be harmonized (multiple whitespaces, upper vs. lower case letters, etc.).
  • You have to transform well-structured data into a table for relational usage.
  • The data needs to be enriched based on publicly available or internal services, e.g. you have your customers' addresses available and want to add latitude and longitude to display the number of customers on a map.
  • You have to merge two sets of data from two different systems and want to identify identical entries even if the two lists do not share a common key.
  • ....
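Some of the smaller tasks above (harmonizing whitespace and case, splitting a dimension out of a product name) can be sketched in a few lines of Python; the "<width>x<height> cm" dimension format below is a made-up assumption for illustration:

```python
import re

def harmonize(value):
    # Collapse repeated whitespace and unify case.
    return re.sub(r"\s+", " ", value).strip().lower()

def split_dimension(product_name):
    # Move a trailing dimension like "10x20 cm" into its own column
    # (the "<width>x<height> cm" format is a made-up assumption).
    match = re.search(r"(\d+\s*x\s*\d+\s*cm)\s*$", product_name)
    if match:
        return product_name[:match.start()].strip(), match.group(1)
    return product_name, None

name, dim = split_dimension(harmonize("  Office   DESK  10x20 cm"))
# name == "office desk", dim == "10x20 cm"
```

For a handful of rules like this a script is fine; the point of a tool like the one below is that you get such transforms interactively, with preview and undo.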
There are several ways to approach those tasks.

One interesting tool which is very useful in those kind of tasks is "OpenRefine" (formerly known as Google Refine).
You can easily download the latest version, install it on your local client (or server) and start it just by following the simple installation instructions provided on the homepage.

A good overview of the main functions is shown in:

The tool is very easy to use. It works for huge amounts of data, and for most tasks it is pretty fast (especially compared to doing the same in spreadsheet software). It is more suited to ad-hoc tasks than to fully automated, repeated tasks in the enterprise, but it provides handy functions to reapply and reuse already created rules.

For more sophisticated analyses that go beyond the columns and rows of a single table, tools like RapidMiner are better suited. But those tools require a bit more upfront investment to get the first problem solved.

Tuesday, April 22, 2014

Copy tables from PDF for further processing

Have you ever tried to extract tabular data from a PDF to process it further in MS Excel or any other spreadsheet application? That can be a painful job.

You might use one of the many PDF-to-Excel conversion tools, but most of them cannot be used without submitting your PDF to an online service or buying a commercial license. In addition, you have to evaluate the quality of the results, especially for tabular data.

A few days ago I stumbled upon the tool Tabula. It is still marked as experimental, but the results are pretty useful, at least for data-oriented tables.



As an example, the PDF "The Mobile Economy 2013" contains a table on page 56 which you want to process in Excel:


You have to follow these steps to extract it as CSV:

  • Download the file to your local disk.
  • Install and start the tool following the instructions on the homepage.
  • Upload the PDF and select "Submit".
  • Navigate to the page and select the table:

  • Choose "Repeat this selection" if you want to select the following tables as well, using the same coordinates.
  • Choose "Download all data" and you get:

  • Choose "Download data" to get a CSV file with the extracted tables. This file can be opened with MS Excel or any other application which can read the CSV format for further processing.
The results are very useful, and it works for this kind of data as well:

It won't work with data embedded as graphics in the PDF; that is a topic for another story.
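Once you have the CSV export, the further processing no longer needs a spreadsheet at all; a minimal Python sketch using the standard csv module (the column layout is a made-up stand-in, not the one from the example PDF):

```python
import csv
import io

# Stand-in for the CSV downloaded from Tabula (made-up rows).
raw = "country,subscribers\nA,100\nB,250\n"

# DictReader gives one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
total = sum(int(r["subscribers"]) for r in rows)
# total == 350
```

For a real export you would open the downloaded file instead of the inline string.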

Sunday, January 06, 2013


PDF is still the most used format to distribute electronic documents, especially for "paper based" workflows. There are obvious reasons to keep using PDF even though evolving technologies have brought alternatives to the table.

On the other hand, HTML and related technologies (namely CSS and JavaScript) have made a major step forward in producing enterprise-grade documents with additional value a PDF cannot provide.

I do not want to compare the pros and cons; I just want to mention well-suited tools to convert from one format into the other.


Whenever you have to create "paper based" workflows based on data for which you already have an HTML(5) version, the best tool I'm aware of is PrinceXML. Styling is done using CSS and JavaScript; there is no additional technology you need to learn.

This technology is not able to fully replace advanced rendering technologies like XSL-FO (and corresponding commercial layout engines) or even more advanced ones based on vendor-specific layout definitions.
But there are many use cases where you just have HTML that needs to be distributed as PDF (e.g. order confirmations, archiving of a business transaction, etc.), and this product enables that for a reasonable price.

Note: there are many commercial and open source alternatives out there for this approach; based on my experience, none of them is as robust as the mentioned solution.


If you already have PDFs created within your workflow and need HTML5 to support more advanced integration in your content distribution, you can try out the PDF2HTML5 Converter.

There are different options you can choose from for how the content is transformed (one of them creates SVG instead of HTML5, but that is a different story). The most useful conversion mode, if the content should be integrated further, creates:
  • one HTML file per page
  • each distinct block of content stored within a div tag
  • layout applied using CSS
  • some JavaScript
Note: The created HTML5 does not result in a responsive design (it uses absolute positioning and non-semantic markup), but that is more or less by design and cannot be solved in a general-purpose implementation.

Both tools are commercial, but if you have a good use case for one of the two mentioned conversions, the price is reasonable. Try before you buy; both tools provide easy ways to evaluate the results.

Regular Expressions: Still Two Problems?

You sometimes stumble upon regular expressions which are difficult to understand.
The most common problem is to get the structure out of deeply nested regular expressions.
Using the online service makes it much easier to see the structure:

Now you can see that this regular expression matches all content besides "foo1" or "foo2", using the pattern explained here:

Using the example taken from Regular Expressions: Now You Have Two Problems:

<\/?p>|<br\s?\/?>|<\/?b>|<\/?strong>|<\/?i>|<\/?em>|<\/?s>|<\/?strike>|<\/?blockquote>|<\/?sub>|<\/?super>|<\/?h(1|2|3)>|<\/?pre>|<hr\s?\/?>|<\/?code>|<\/?ul>|<\/?ol>|<\/?li>|<\/a>|<a[^>]+>|<img[^>]+\/?>

Try it out; you will probably see what the author is trying to achieve.
Note: you have to use regular expressions in JavaScript style to use the service.
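To see in running code what such a tag-whitelist expression does, here is the same idea in a Python sketch; this is a reduced subset of the pattern from the article, applied to a made-up sample input:

```python
import re

# Reduced subset of the whitelist pattern from the article:
# it matches the markup tags an editor is allowed to emit.
allowed = re.compile(
    r"</?p>|<br\s?/?>|</?(b|strong|i|em)>|</?h[123]>|</a>|<a[^>]+>"
)

sample = '<p>Hello <a href="x">world</a><br/></p><script>evil()</script>'
hits = [m.group(0) for m in allowed.finditer(sample)]
# hits == ['<p>', '<a href="x">', '</a>', '<br/>', '</p>']
# note: "<script>" is not matched, which is the point of the whitelist
```

Note that Python does not require the `\/` escapes the JavaScript-style original uses.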

Thursday, November 01, 2012

interactive timelines

If you have to create, view and provide interactive timelines for your articles, you might try out:

It is easy to use and customize, with only a few bugs, all of which can easily be worked around.
And last but not least, the service is free of charge right now... but you should not contribute confidential content.

Legacy leads to complexity? Windows 8!

If you have ever tried to understand why the success of the past introduces complexity for the future, you should read "Turning to the past to power Windows' future: An in-depth look at WinRT".

A massive summary of what WinRT (Windows 8) is and how the existing technologies fit together to create their next big thing. Reading the complete article with a bit of IT background gives you a very good overview of what they have built.

Do you really think that such a complex beast will lead to a more consistent and improved user experience "on any device"?

The site "" was introduced to collect best practices for new web technologies like HTML5. Many major vendors contribute to the platform, e.g. W3C, Google, Microsoft, etc.

Worth taking a look at.

To understand the state of the technologies so far: the site is provided as HTML5, but try to play the intro on the homepage... still Flash on FF16 (yes, I know it is simply a YouTube-hosted video).

Time will change...

Decision oriented user assistance

If you think about business applications (e.g. ERP, PLM, CMS), you become aware that the number of functions in all kinds of applications increases with each new version, and most of those applications are feature complete in terms of the critical functions relevant for daily operations.

Does this mean that users are happy with the applications they have to work with? What is the problem with most applications available today? Complexity is the major pain of today's software applications.

All major IT trends of the last couple of years have led to additional functions, additional user interfaces and, in the end, to complex applications.


  1. IT trend "mobile devices"
    The trend to access services using mobile devices leads to additional UIs which make an application accessible from a mobile device. Vendors want to make their applications mobile-ready and therefore provide the existing functions, or a selected subset of them, on mobile devices. In the best case, they optimize the application behavior to the look and feel of the specific mobile device.
  2. IT trend "social media"
    The trend to socialize daily operations leads to new functions as well. You can now comment on the work of your colleague right from within your application. Great. In the best case, you can collaborate on the same piece of work within your team.

Paradigm "Automation"

But all these extensions are focused on one single paradigm: "automation" of tasks that substitute manual operations. This approach has been the major topic for business-oriented applications for at least 20 years. You can calculate the return on investment of this approach without too much effort and thinking.
On the other hand, most applications with adequate market penetration have already implemented the tasks with measurable value; otherwise they would not have won any software selection process.

So far so good. But what should happen next? Do the next two functions really provide enough return on investment to justify the effort of upgrading the application? What might be a structural improvement and unique selling point for a business application in the future?

Paradigm "Decision"

Automation is all about efficiency. But at the end of the day, each business process or operational step contains at least one valuable decision which drives the success of the process output and, in many cases, the value of the result now and in the future.

Making the right decision is about ensuring the effectiveness of the process output without losing efficiency.


To set up a decision-oriented application you need two basic principles:
  1. The application must know the business context of the user who operates within the application.
  2. The application must reuse the knowledge of already completed processes.

Knowing the business context

This means the application is driven by the business process (the relevant subset which is in the scope of the application). In most of today's business applications the process simply automates certain tasks and notifies different users of certain events (based on the states of a resource and state transitions). But the real process is not in the scope of the application.

The IT trend BPM has found its way into many products and projects. Some business applications even use a BPM approach and infrastructure to implement workflows.

But there is no application out there whose core is based on BPM, meaning that each operation takes place as part of an underlying business process, each function is more or less just a decision by the user on how to proceed in the process, and each automation is just a replacement of a human task.

Creating such an application means you have to provide:
  • A collection of automated tasks
  • A collection of human tasks to request human input and choices
  • Triggers for a user or software to make a decision for the next step (e.g. select "edit content", "send to review", etc.)
  • A backend that lets you model the process using the above building blocks and create, run and complete those processes
  • A UI that makes all of the items above visible to the user
Now all functions of the application are invoked in a well-known context. So far so good. Sounds like a traditional BPM project, right? And yes, there are already LOB projects out there following this approach.
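The building blocks listed above can be modeled minimally; a Python sketch in which all names and fields are made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    automated: bool                              # automated task vs. human task
    choices: list = field(default_factory=list)  # decisions offered to the user

@dataclass
class Process:
    steps: list
    decisions: dict = field(default_factory=dict)  # step name -> chosen option

    def decide(self, step_name, choice):
        # Record the user's decision so later process instances can learn from it.
        self.decisions[step_name] = choice

# A tiny publishing process: one human decision step, one automated step.
review = Process(steps=[
    Step("edit content", automated=False, choices=["send to review", "publish"]),
    Step("archive", automated=True),
])
review.decide("edit content", "send to review")
```

The point is not the data structure itself but that every decision is persisted together with its process context, which is the prerequisite for the reuse described below.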

Reusing the business context

What is the next big thing? You are able to store all decisions of your users in the context of your persistent processes and use the results to improve the decisions of other users.

In each process, different users can learn from the experience of others, and individual users can be guided to avoid making the same wrong decision again, or to reuse best practices from the past. To achieve this goal you have to take the information collected from previous processes and transform it into valuable guidelines for your users:
  • show them what other users did when they were in the same context as the current user's operation
  • prevent them from doing an operation which, based on the experience of previous processes, leads to errors in later steps of the process
  • show them additional information other users searched for in the same situation as the current user
  • let them add guidelines for later steps in the same process
  • let them attach additional information they always need when they are in the same context again (e.g. reference material for an operation, etc.)
  • etc.
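The first guideline above ("show them what other users did in the same context") boils down to counting past decisions per context; a sketch with made-up context keys:

```python
from collections import Counter, defaultdict

# Past decisions collected from completed processes: context -> chosen actions.
history = defaultdict(Counter)
for context, action in [
    ("edit content", "send to review"),
    ("edit content", "send to review"),
    ("edit content", "publish"),
]:
    history[context][action] += 1

def suggest(context):
    # Most common past decision in the same context, or None if unseen.
    counts = history.get(context)
    return counts.most_common(1)[0][0] if counts else None
```

Here `suggest("edit content")` returns "send to review", the majority choice of previous users; a real system would of course weight by outcome, not just frequency.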
Also, very obvious and easy things can be done:
  • Only provide functions which make sense in the context of the process (real context-aware functions) to reduce the number of choices an individual user has to choose from
  • Identify functions no one uses in a particular context, in order to remove them or to add hints / best practices that make these functions usable
  • Identify good and bad practices from what the users did and improve the process (the application) based on real-world usage
  • etc.
With the availability of tools in the area of "Big Data" you might think of enhanced KPIs, e.g.:
  • identify patterns in the process that lead to results not valuable to your business, by querying the process, the corresponding data and decisions, and the results according to the question you have to answer
  • identify related choices in the process and the corresponding information as a baseline for process improvement
  • etc.


This kind of approach leads to:
  • structural usability
    The application can guide users as much as possible and provide as much information as possible for their next relevant decision (operation).
  • reduced complexity
    Only relevant information and functions are provided, which reduces complexity for the user of the application.
  • social experience (common improvement)
    The application can share information, discussions and best practices in the context in which they are relevant.
  • improved effectiveness
    Best practices can be established for all users based on real-world experience, not only on theoretical thoughts.
  • improved efficiency
    Critical operations can be identified and additional automation can be added based on real-world business value.

Too complicated and complex?

No, that is not the case. Today's IT tools available on the marketplace make this kind of application easy to implement (even a core subset of the mentioned approach as a baseline for future extensions).
BUT it is not possible to simply extend existing products with this kind of approach without re-implementing the core part of the application from scratch.

This means existing, strong vendors might struggle to do this. But in case you are thinking about creating a new line-of-business application, think about using a different approach than your competitors...

Monday, October 01, 2012

Stumbled upon: Write the Freaking Manual

I stumbled upon the thread "WTFM: Write the Freaking Manual" triggered by the following blog post

I would recommend following the thread (which already contains more than 200 thoughts) in case you want to understand:
  • the different views of the developers
  • the different views of users of a certain software
  • the different views of tech writers
I have also had discussions with companies creating software products about what needs to be documented, and whether it is possible to create a software product which doesn't require additional documentation because of "intuitive usability", etc.

The answer is easy and difficult at the same time:

You have to deliver relevant information for your audience.

This means you have to understand:
  • Who is your audience?
  • What is relevant to them?
And always keep in mind: "Your user does not have the same context as you".


If you develop software infrastructure that should support other developers in doing their job faster, you should deliver:
  • orientation for your users (which tasks does the library support?)

    the concepts of all major parts of your framework, from top down
    => a basic overview of all implemented concepts, then a description of each concept

    A good example is provided by IBM for their ICU library.
    This library isn't trivial, but well-described concepts are provided for all components of the library.
  • how-to setup
    provide a how-to for setting up the software for initial use
  • how-to use
    provide as many code samples / demos / pieces of real working code as possible for your users' operations,
    e.g. by providing your well-documented unit-test library.

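A unit test that doubles as usage documentation could look like this; the `slugify` helper is a hypothetical example, not something from the post:

```python
import re
import unittest

def slugify(title):
    # Lower-case the title and replace runs of non-alphanumerics with "-".
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

class TestSlugify(unittest.TestCase):
    """Each test shows one supported call pattern, so the suite reads as a how-to."""

    def test_basic_title(self):
        self.assertEqual(slugify("Write the Freaking Manual"),
                         "write-the-freaking-manual")

    def test_punctuation_is_collapsed(self):
        self.assertEqual(slugify("WTFM: Write!!"), "wtfm-write")
```

Run with `python -m unittest`; a user who reads the test class knows exactly how the function is meant to be called.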
How can you identify which information your audience needs? 

You have to understand their daily work with your software and all the questions which cannot be answered by the software itself without additional information in a short amount of time.

If you identify those areas well, the resulting documentation will add value to the software and will increase the audience using your software.

Friday, September 14, 2012

HTML5 for any device? yet?

Mark Zuckerberg made a widely recognized statement on the use of HTML5 for mobile devices:

When I’m introspective about the last few years I think the biggest mistake that we made, as a company, is betting too much on HTML5 as opposed to native… because it just wasn’t there. And it’s not that HTML5 is bad. I’m actually, on long-term, really excited about it. One of the things that’s interesting is we actually have more people on a daily basis using mobile Web Facebook than we have using our iOS or Android apps combined. So mobile Web is a big thing for us.
More technically detailed feedback is provided here:

This means two major things:
  • HTML5 is not ready yet (that is no real news) to simply replace native apps.
  • HTML5 is the major enabling technology to deploy feature-rich content to the mobile web.
If you have ever tried to create a production web application using the HTML5 stack which should run on "all" common mobile devices, you are aware that this is a pretty tough job and still requires limiting yourself to a small subset of functionality and, as a result, of UI experience. If you have to provide a feature-rich application like Facebook, you obviously have to work around hundreds of issues, and the result is still not satisfying for an individual user on one device.

As a very helpful overview of the state of the different mobile browsers, Facebook introduced ringmark, a test suite (including results for the most common mobile browsers) which shows which relevant API functions are implemented in a particular mobile browser, prioritized by different levels of importance.

The current state of the standards is published by the W3C on a regular basis; latest release:

What you see in the test results is that HTML5 can be used if you want to deploy content-driven applications focused on online access and integration.

In any case, start small, then test and verify the behavior for your defined target audience. The HTML5 path is definitely the right path to follow, but it still requires a lot of work from both the vendors and the standardization groups.

Thursday, September 13, 2012

First web page of the INTERNET

I'm not 100% sure but according to this article the first web page of the INTERNET was published by the W3C and is still available here

All links still OK.

If you created a web page today with this number of links and came back in 10 years, how many links would still point to the correct content? Even if you only link to content you own yourself, do you think you would be able to reproduce your targets in ten years?

Compare PDF in automated test scenarios

Have you ever had to test a process which creates a PDF based on well-defined test data? You want to ensure that the result is equal or conforms to given acceptance criteria which you can describe with an existing PDF? You want to automate this operation?

In this case you are looking for a tool which compares two PDF files and at least answers the question "are these two files the same?". Depending on your use case this means:
  • the contained text is the same on the same page of the PDF
  • the appearance (layout) is the same
In addition you need a command line interface to use the function within an automated test procedure.

As you might know, Adobe Acrobat provides a compare function which works very well, but it requires a commercial license, and integrating this function into your automated test environment isn't simple (from both a technical and a commercial point of view).

Fortunately, the tool comparepdf is available as free software. It is very simple to install, integrate and use, and it provides different compare modes for the scenarios mentioned above. In addition, a rough overview of the kind of difference is provided and can be integrated into automated test reports.
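Integrating such a command-line compare into a test harness is a small subprocess wrapper; a sketch in which the `--compare` flag name and the exit-code convention (0 means "equal") are assumptions to verify against your installed comparepdf version:

```python
import subprocess

def pdfs_match(expected, actual, mode="text", tool="comparepdf"):
    """Return True if the two PDFs compare equal in the given mode.

    The flag name and the "exit code 0 == equal" convention are
    assumptions about comparepdf; check your installed version.
    """
    result = subprocess.run([tool, f"--compare={mode}", expected, actual])
    return result.returncode == 0
```

In an automated test you would then simply assert `pdfs_match("reference.pdf", "generated.pdf")` and let the test report show the failure.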

Once you have identified a difference, you might be interested in what the difference is and where it appears. For that, the GUI-based DiffPDF tool can be used free of charge. It is not as powerful as the Adobe Acrobat compare function, but in many scenarios it helps you see what is going wrong without the need to buy an Adobe Acrobat license.