Tableau and Impala Hadoop for Business Analytics

In 2018, the important role Big Data plays in industry, finance, entertainment and medicine is hard to dispute. In fact, the 2018 NewVantage Partners annual survey shows that 97% of the executives interviewed are investing in Big Data, data analytics, and AI projects (1). According to a study by the Massachusetts Institute of Technology (MIT), data, and Big Data in particular, bring higher productivity, market value and profitability to firms (2). A crucial step toward realizing this profit is data visualization because, as the saying goes, a picture is worth a thousand words.

More than 50,000 customers, including companies such as Audi AG, Citibank, LinkedIn, The New York Times, and Skype, use Tableau for business analytics and data visualization. Tableau can connect to data stored on your computer, in a database on a company server, in a public source on the web, or in a cloud database, enabling users to access and visualize data without writing SQL or complex code.

Cloudera Impala is an open source massively parallel processing engine that lets users run low-latency, SQL-like queries interactively on top of the Hadoop Distributed File System (HDFS), returning results in seconds. These near real-time results are ideal for visualization tools like Tableau, which already ships with connectors that let users query the data and build visualizations directly from its graphical user interface. Tableau connects to Cloudera Hadoop through Impala, using Impala's ODBC driver installed on your client machine (3). These native connectors make linking Tableau to Hadoop easy, without the need for special configuration.

Impala supports numeric, character and date data types, has built-in support for the file formats commonly used with Hadoop, and provides aggregation functions such as SUM, AVG, MIN, MAX and COUNT. It also offers conditional functions that choose a result dynamically based on the values in the data, and it supports JOIN operations. However, joining large tables is not recommended: Impala performs operations in memory as much as possible and writes to disk only when the data is too large to fit, which increases response time. For example, if your data set is over 1 billion rows, data visualizations will not be fully interactive. Impala's Massively Parallel Processing (MPP) architecture scans the entire Hadoop dataset to answer a query, so the more complex the query, the more scans are needed and the longer the visualization takes to update.

One way around this problem is to work on a subset extracted from the Impala data, although this limits the analysis to that particular subset. Tableau's extract function saves the current view of the data to your local or Tableau Server file system. Another alternative is to create new files on HDFS that back new tables, for example by denormalizing the data: the data referenced by the foreign keys of the fact table (the table holding the content of the data warehouse) is inserted into the fact table itself (4).
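The denormalization idea can be illustrated with a small, self-contained sketch. It does not talk to Impala: an in-memory SQLite database stands in for it so the example runs anywhere, and the table names (sales, customers, sales_denorm), the columns and the sample rows are all hypothetical. The SQL is kept to common constructs (CREATE TABLE ... AS SELECT, JOIN, CASE, SUM, AVG, COUNT), but whether it carries over unchanged to a given Impala version is an assumption to verify.

```python
import sqlite3


def summarize_sales():
    """Denormalize a tiny fact table once, then aggregate without a join.

    SQLite is only a stand-in for Impala; all table and column names
    are hypothetical and exist purely for illustration.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # A dimension table (customers) and a fact table (sales) that
    # references it through a foreign key.
    cur.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE sales (
            sale_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        INSERT INTO customers VALUES (1, 'EU'), (2, 'US');
        INSERT INTO sales VALUES (10, 1, 40.0), (11, 1, 160.0), (12, 2, 90.0);
    """)

    # Pay for the join once: copy the dimension attribute next to each
    # fact row so that later dashboard queries read a single flat table.
    cur.execute("""
        CREATE TABLE sales_denorm AS
        SELECT s.sale_id, s.amount, c.region
        FROM sales AS s
        JOIN customers AS c ON c.customer_id = s.customer_id
    """)

    # Aggregation functions plus a conditional (CASE) on the flat table.
    rows = cur.execute("""
        SELECT region,
               COUNT(*)    AS n_sales,
               SUM(amount) AS total,
               AVG(amount) AS average,
               SUM(CASE WHEN amount >= 100 THEN 1 ELSE 0 END) AS large_sales
        FROM sales_denorm
        GROUP BY region
        ORDER BY region
    """).fetchall()

    conn.close()
    return rows


if __name__ == "__main__":
    for row in summarize_sales():
        print(row)
```

The trade-off is the usual one: the flat table takes more storage and has to be rebuilt when the dimension data changes, in exchange for join-free queries behind the visualization.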

Among the successful use cases of Impala and Tableau working together to produce meaningful data visualizations is GameFly, the number one video game subscription service in the US, which saved hundreds of hours and increased its trial offers fivefold by leveraging these technologies (5). Ebates, the online coupon and cash-back company with more than 30 million members globally, also made the transition to running business intelligence on Tableau connected to Hadoop (6,7).

The main advantage of combining Impala and Tableau for business analytics is the ability to visualize huge amounts of data in (almost) real time. We can therefore expect to hear of many more successful business use cases built on this combination in the near future.

Recommended Links



2- E. Brynjolfsson, L. M. Hitt and H. H. Kim. Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance? Thirty Second International Conference on Information Systems, Shanghai, 2011.