Today the buzzword "big data" is getting more and more popular. It is a nice label for a common observation: the amount of data, and the value of using it, are becoming more important.
There are millions of information sources and products out there that promise to help you store and analyze this data. But one of the major issues with data is not its current usage; it is the maintenance of the information over time.
The "Web of Data" is one common example. It is the biggest data store we currently face, and it is pretty simple to access and analyze. So far so good. But is there any maintenance of this data (and is it required)? Collect 100 links to resources on the web today, then try to access them 24 months later. How many of those links still work? And if they work, does the resulting information still carry the same semantics as it did when you created the link?
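The link experiment above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: `Snapshot`, `fingerprint` and `audit_links` are hypothetical names, the `fetch` callable stands in for whatever HTTP client you use, and a changed content hash is only a rough proxy for changed semantics (a page can change bytes without changing meaning, and the other way around).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Snapshot:
    """What we remembered about a link on the day we collected it."""
    url: str
    digest: str  # SHA-256 of the content seen at collection time


def fingerprint(content: bytes) -> str:
    """Return a hex SHA-256 digest of the raw content."""
    return hashlib.sha256(content).hexdigest()


def audit_links(
    snapshots: Iterable[Snapshot],
    fetch: Callable[[str], Optional[bytes]],
) -> Dict[str, str]:
    """Classify each link as 'dead', 'changed' or 'unchanged'.

    `fetch` is an assumption: it returns the current content of a URL,
    or None if the resource can no longer be reached.
    """
    report: Dict[str, str] = {}
    for snap in snapshots:
        content = fetch(snap.url)
        if content is None:
            report[snap.url] = "dead"
        elif fingerprint(content) != snap.digest:
            report[snap.url] = "changed"
        else:
            report[snap.url] = "unchanged"
    return report
```

Run once at collection time to store the snapshots and again 24 months later; the share of "dead" and "changed" entries gives a first impression of how much of the original value survived.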
The "Web of Data" has currently decided not to maintain data: it just provides data now, enriches it, and replaces it with different semantics. The Wayback Machine (http://archive.org/web/web.php) is an approach to help individual users preserve the value that data has for them in some scenarios.
Now think about the corporate information you are collecting right now. The speed at which this data changes will increase, and new demands to enrich the data will appear. Have you ever thought about how to ensure that all this data can be adapted to new needs? In my personal experience, more than 60% of the overall project costs in IT projects dealing with information in a certain domain of the organization are related to data migration. Those costs come from adapting data to the new tools that maintain it, converting data between different data models and formats, and ensuring the quality of the data and its usage in existing business processes.
What does this mean for each IT project dealing with data?
- Initial load is important
You always have to define how to get the data you need for the initial start (and not only during the regular operation of your business process) and how to verify that this data is valid for your future needs.
- Expandability of your data might be important
You can use static data models and tools (e.g. classical relational data models), or more flexible approaches such as typed graphs of data, where content using different models can coexist more easily.
- Adaptability of your IT systems might be important
What happens to your existing data once the model is extended or changed? Do not only take care of the data itself; also take into account the relationships between data items. Today you may only access a specific level of your data; a few years later, some use case requires you to access the individual steps or to introduce an additional level that does not yet exist.
- Ensure the maintenance of your data.
Do not "use" any data that has no value in your primary business process. Using information requires the data to be correct. Your data will never be correct if the process creating it gains no value from the data itself. Such data will simply be partially incorrect or incomplete.
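To make the points on expandability and adaptability a bit more concrete, here is a minimal sketch of a typed graph in which an additional level is introduced later without touching the existing data. Everything in it is an assumption for illustration: the `TypedGraph` class, the node types (`Order`, `Item`, `Shipment`) and the edge types are hypothetical and not taken from any specific product.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    node_id: str
    node_type: str  # the "type" in "typed graph"
    props: Dict[str, Any] = field(default_factory=dict)


class TypedGraph:
    """A tiny in-memory typed graph: typed nodes, typed directed edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        # Each edge is (source id, edge type, target id).
        self.edges: List[Tuple[str, str, str]] = []

    def add_node(self, node_id: str, node_type: str, **props: Any) -> None:
        self.nodes[node_id] = Node(node_id, node_type, props)

    def add_edge(self, source: str, edge_type: str, target: str) -> None:
        self.edges.append((source, edge_type, target))

    def targets(self, source: str, edge_type: str) -> List[str]:
        """Return the ids reachable from `source` via edges of `edge_type`."""
        return [t for s, e, t in self.edges if s == source and e == edge_type]


# Today: an order directly contains its items.
g = TypedGraph()
g.add_node("order-1", "Order")
g.add_node("item-1", "Item", sku="A")
g.add_edge("order-1", "contains", "item-1")

# A few years later: a use case needs a shipment level in between.
# The old edge stays valid, so existing queries keep working.
g.add_node("ship-1", "Shipment")
g.add_edge("order-1", "ships_as", "ship-1")
g.add_edge("ship-1", "contains", "item-1")
```

In a static relational model the same change would typically mean a new table, changed foreign keys and a migration of existing rows; here the old and the new view of the data simply coexist.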
The most expensive IT task in your organization is, and will remain, how to preserve the value of big data over time.