A difficult decision faced by many businesses is selecting which tools to use to meet their desired strategic outcomes. With a host of Business Intelligence (BI) storage solutions available, how do you choose the right one for your organisation? With extensive experience using a range of solutions, we can help you make that decision. One of Business Data Partners’ recommended solutions is Hadoop.
What is Hadoop?
Hadoop is a set of open-source programs, frameworks, and procedures for an organisation’s big data solution. Traditionally, Hadoop has been used as a Data Lake to hold a company’s data. When useful data is identified within the lake, it is extracted to a data warehouse implemented on a traditional Relational Database Management System (RDBMS) for BI purposes.
Given that the Hadoop platform includes a number of SQL engines, can it cover the BI use case and remove the need for a traditional RDBMS? We think so.
BDP & Hadoop
Here at Business Data Partners (BDP) we have extensive experience implementing data warehouses directly on Hadoop using the Impala SQL engine. Through this work, we’ve learnt a lot about Hadoop and how to use it successfully as a business storage solution. We asked Nick White, BDP Big Data expert, to share the key lessons he has learnt whilst using Hadoop for BI data storage solutions and to explain why we’ll continue to use it.
Key Lessons Learned
- Don’t get too hung up on Hadoop being a Big Data platform; it is also “just” a database. Much of the knowledge and experience you have when implementing traditional RDBMSs will still apply.
- There are reasons why the data industry uses dedicated ETL tools to load traditional RDBMSs. Just as we wouldn’t hand-craft SQL to load an Oracle DB, don’t consider hand-crafting Sqoop (or any other) code to load a Hadoop DB.
- Performance tuning takes time and experience. If you don’t have someone with this experience, be prepared for your team to learn on the job and give them the time and space to do this.
- The standard Hadoop SQL engines (Hive and Impala) don’t support updates. There are ways that updates can be implemented on these SQL engines, but they are relatively time-consuming to build and costly in processing effort to run. Therefore, consider carefully what updates you will need, especially if you are implementing Slowly Changing Dimensions. Apache Kudu appears to be an interesting new technology that may provide solutions to this type of problem.
- While ANSI SQL is supported, the more advanced functionality you may be used to in your traditional RDBMS (analytical functions, hierarchical queries, etc.) is unlikely to be available. If you have use cases that need this type of functionality, make sure you are aware of them and have solutions before you begin the build.
- The ecosystem around Hadoop is limited compared to what you might be used to with a traditional RDBMS. If you are used to interacting with your database using a tool such as TOAD, you may well find the limitations of Hue frustrating. There don’t seem to be many (any?) SQL editors that can connect to Hadoop other than via ODBC/JDBC, with the limited functionality that offers.
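To make the point about updates more concrete, here is a minimal sketch of the "rebuild rather than update" approach for a Type 2 Slowly Changing Dimension. It is plain Python standing in for the logic only, not our production code and not a Hadoop API: the names `DimRow`, `rebuild_dimension`, `customer_id` and `city` are hypothetical, chosen purely for illustration. On Hive or Impala the equivalent result would typically be written back with a statement such as INSERT OVERWRITE, which is why the whole table (or partition) gets reprocessed even when only a few rows change.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a customer in a hypothetical Type 2 dimension."""
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date]  # None marks the current version


def rebuild_dimension(existing: List[DimRow],
                      changes: Dict[int, str],
                      load_date: date) -> List[DimRow]:
    """Return a complete replacement copy of the dimension.

    Nothing is updated in place: every existing row is read and
    written out again, which is what makes this approach costly.
    """
    rebuilt: List[DimRow] = []
    keys_with_current_row = set()

    for row in existing:
        is_current = row.valid_to is None
        if is_current:
            keys_with_current_row.add(row.customer_id)

        new_city = changes.get(row.customer_id)
        if is_current and new_city is not None and new_city != row.city:
            # Close the old version and add a new current version.
            rebuilt.append(replace(row, valid_to=load_date))
            rebuilt.append(DimRow(row.customer_id, new_city, load_date, None))
        else:
            # History rows and unchanged rows are copied as they are.
            rebuilt.append(row)

    # Keys with no current row yet are inserted as new versions.
    for customer_id, city in changes.items():
        if customer_id not in keys_with_current_row:
            rebuilt.append(DimRow(customer_id, city, load_date, None))

    return rebuilt
```

Even in this toy form, a single changed customer forces every row to be read and rewritten. That is the processing cost to weigh up before committing to Slowly Changing Dimensions on an engine without native updates.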
Whilst we recommend building a BI solution on Hadoop, there are a few things to keep in mind. Don’t forget everything you’ve learnt working with traditional RDBMSs, as much of it is still applicable. Bear in mind that Hadoop is relatively immature compared to what you may be used to working with, so be open to this way of working.
Business Data Partners have experience implementing BI data storage solutions directly onto Hadoop. If you have any questions or would like to get in touch to see how we could help you do the same, reach out today.