
Time to ditch your Data Warehouse - and add your ERP content

For most of the life of enterprise software we have seen the classic separation between transactional and analytical applications. OLTP vs. OLAP is a distinction as old as data warehouses themselves.

A recent discussion on LinkedIn though triggered some thinking, discussion and research for me - and made me question the traditional, conventional wisdom.


Why did we separate OLTP and OLAP?

Right, it was for a few relevant reasons, but mainly:
  • Storage Cost
    Let's keep in mind that the venerable relational database concept was also created to save disk space. For 98% of the life of computing, disk space has been the critical and expensive resource.
    But that time is over - when a smart startup like BackBlaze can offer unlimited backup for as little as $3.95 per month, with internal costs for one TB being around 50 cents, it will be hard to make disk cost an issue in any IT budget these days.
  • Performance
    Transactional systems were tuned for good record-insert performance - not so much for the needs of reporting and for creating dimensions on the data (or facts). So the information needed to be stored in a different way.
    But that time seems to be coming to an end, too - thanks to Hadoop et al. we can search data at will - not yet with the performance we would like, but that is getting better every few months.
So let's be aggressive and - to use the famous Gretzky quote - skate to where the puck will be.

Taken from Indicee Webinar announcement here.


Ditch the data warehouse...

... and move all its content into Hadoop clusters. For any report built upon the data warehouse - find a solution on top of Hadoop. You may have to set up a report generation infrastructure - so canned reports are instantly available. You did - or even still do - this for your existing reporting, too. But you will have the instant benefit of allowing skilled users to find new insights in the enterprise's data.
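To make the "canned report on top of Hadoop" idea concrete, here is a minimal sketch using Hadoop Streaming with plain Python. It assumes the warehouse facts were exported to HDFS as tab-separated lines with hypothetical region and revenue columns - the field names and file layout are illustrative assumptions, not a prescription.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: a "canned report" that sums revenue by region.
# Assumption (not from the post): facts sit in HDFS as lines of <region>\t<revenue>.
import sys

def mapper():
    # Emit one (region, revenue) pair per input line; skip malformed lines.
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        print("%s\t%s" % (parts[0], parts[1]))

def reducer():
    # Hadoop sorts mapper output by key, so all values for one region arrive together.
    current_region, total = None, 0.0
    for line in sys.stdin:
        region, revenue = line.rstrip("\n").split("\t")
        if region != current_region:
            if current_region is not None:
                print("%s\t%.2f" % (current_region, total))
            current_region, total = region, 0.0
        total += float(revenue)
    if current_region is not None:
        print("%s\t%.2f" % (current_region, total))

if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```

You would ship the script to the cluster with the streaming jar (roughly: hadoop jar hadoop-streaming.jar -input warehouse/facts -output reports/revenue -mapper "report.py map" -reducer "report.py reduce" -file report.py) and store the output wherever your canned reports live.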

  • Go back to the wish list of dimensions that there was never time and / or money to build.
  • Go back to the reports that were not feasible or were deemed merely nice to have.
  • Go back to the insights the business suspected were there - but never got to because of... you know.
  • Don't forget the requests that were made but postponed because that data mining software package was too expensive.
  • Make your user community happy by announcing that any query is now possible... start with a "hack a query" workshop with your technical users and then an "ask for any insight" workshop with your business users, and make sure no queries are excluded as silly. Pretty sure you will run into some insights.

... and add the enterprise OLTP data en route!

So why stop with decommissioning the data warehouse? Doesn't most of the data come from the transactional systems in your enterprise anyway? So while you are changing your ETL software to feed your Hadoop clusters - why stop at the information that was foreseen for the data warehouse? You will end up with limited data and thus limited insights if you follow that approach... So get all the data from its origin in the transactional systems to its destination in your analytical system - your Hadoop clusters. Something truly revolutionary - as traditionally the insight questions triggered the transfer of data into facts and dimensions. Now you just move the data from the transactional system to the analytical system... and execute the analysis on it later.
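A minimal sketch of that "move everything" idea: dump every table of the transactional system to flat files that can then be pushed into HDFS, instead of extracting only the warehouse-bound subset. sqlite3 stands in for the real OLTP database here, and the database path, table discovery query and staging directory are assumptions for illustration only.

```python
import csv
import os
import sqlite3

def export_all_tables(db_path, out_dir):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # Discover every table instead of hand-picking the ones the warehouse used.
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = [row[0] for row in cur.fetchall()]
    for table in tables:
        cur.execute("SELECT * FROM %s" % table)
        columns = [d[0] for d in cur.description]
        with open(os.path.join(out_dir, table + ".csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(columns)          # header row for later schema discovery
            writer.writerows(cur.fetchall())  # full table contents, no filtering
    conn.close()

if __name__ == "__main__":
    # Hypothetical paths; afterwards the staging directory would be copied into
    # HDFS, e.g. with "hadoop fs -put staging /data/oltp".
    export_all_tables("erp.db", "staging")
```

In practice a tool like Apache Sqoop plays this role for relational sources; the point is simply that the extract is no longer shaped by the questions you already know you want to ask.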

Wanted: Identifiers

But how would Hadoop do its magic and combine the data? After all, you can't relate something that is not related. So it's time to enrich your transactional data. Some examples:
  • Your customer master should include the website address of that business. In a batch job you can add the IP address (no need to confuse a business user by displaying it) - see the sketch after this list. And your web traffic starts making sense.
  • Your contacts should have email addresses and more social identifiers - like Twitter and Facebook IDs. And your social data can get tied into your transactional content.
  • Your employee records should have links to all the user IDs you have in house - as well as all the social identifiers you can get the employees to disclose. And your system usage data becomes meaningful.
  • ...
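The enrichment batch job referenced above could be as small as the following sketch: resolve each customer's website to an IP address so web server logs can later be joined against the customer master. The file names and the customer_id / website column layout are hypothetical, for illustration only.

```python
import csv
import socket

def resolve(website):
    # gethostbyname expects a bare host name, so strip a possible scheme and path.
    host = website.replace("http://", "").replace("https://", "").split("/")[0]
    try:
        return socket.gethostbyname(host)
    except socket.error:
        return ""  # leave blank if the lookup fails; a later run can retry

with open("customers.csv") as src, open("customers_enriched.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["ip_address"])
    writer.writeheader()
    for row in reader:
        row["ip_address"] = resolve(row["website"])
        writer.writerow(row)
```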

Crazy - could you do it? LinkedIn and Facebook will help...

The good news is that the large internet properties are working hard on this, have larger problems to solve than the average enterprise, and are ... open sourcing their tools. Just this week LinkedIn announced it will open source its DataBus code - that will help you keep all the different operational stores in sync. A very good tool to run your Hadoop clusters in experimentation mode.
And as I mentioned above - performance is still a concern right now - but if it really matters to you, have a look at memcached (to which Facebook contributed), which is equally open source. If it can serve 150 GB per second from a cluster of flash memory servers at Facebook, it should give you some confidence that the Hadoop performance problem is addressable today and will soon be put to bed for good.
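A minimal sketch of how memcached takes the edge off: cache canned report results in memory so repeated requests never hit the Hadoop cluster. It assumes the python-memcached client and a local memcached instance; run_hadoop_report() is a hypothetical placeholder for whatever actually executes the report.

```python
import memcache  # python-memcached client (pip install python-memcached)

mc = memcache.Client(["127.0.0.1:11211"])

def run_hadoop_report(name):
    # Hypothetical placeholder for whatever actually runs the report on the cluster.
    return "report body for %s" % name

def get_report(name, ttl=3600):
    cached = mc.get("report:" + name)
    if cached is not None:
        return cached                    # served from memory, no cluster round trip
    result = run_hadoop_report(name)     # slow path: compute on the cluster
    mc.set("report:" + name, result, time=ttl)
    return result

print(get_report("revenue_by_region"))
```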

MyPOV:

The purpose of this post was to push the needle from what is possible today, imposed by conventional wisdom, to what will be possible soon - and what could be done by an aggressive, type A company (as Gartner calls them) today. Well ok, I would experiment with it first, too.
