The world of databases has evolved considerably over the years. First there were standalone, single-user databases. Then came the RDBMS, followed by schema-less, distributed databases. And finally, Big Data made its entry.
Big Data has shaken up the world of business like never before. The insights offered by data unlock several new possibilities, and business processes become far more efficient with it. However, for all the positives, businesses are still short of viable tools. In this context, Apache Kudu, a new open-source data engine developed by Cloudera, is turning heads.
Cloudera started work on Kudu in late 2012, aiming to bridge a noteworthy gap between the Hadoop Distributed File System (HDFS) and the incumbent HBase database for Hadoop, and also to leverage the newer hardware that had emerged in the meantime. Cloudera has since donated Kudu, along with the accompanying Impala query engine, to the Apache Software Foundation, and Kudu is now a top-level Apache project.
Why Apache Kudu?
The question is why businesses should embrace the newly released Apache Kudu in an already overcrowded Big Data engine space, more so when it does not support SQL directly.
Hadoop, for all its advantages, has critical gaps in its storage layer. As technology evolved and new possibilities emerged, most incumbent Big Data engines grew hybrid architectures, stitched together to paper over such gaps, leading to complexity and inefficiency. The limitation is especially acute in use cases that require several capabilities at once, where no single tool can provide all of them. For instance, HDFS is good for analytics over batch uploads, but it is not suitable for updating data in real time. HBase, on the other hand, is good for real-time streaming, allowing applications to write data quickly with low latency and read it back randomly, but it is poor on the analytics front.
Meanwhile, hardware has advanced notably over the years, but legacy software is incapable of exploiting the full power of such new hardware. A case in point is the steady growth of RAM: where 32GB per node was the norm in 2012, 128GB or 256GB is the norm today. The architectures of HBase, HDFS, and other Hadoop components were shaped in a context where the speed of the disks underlying the Hadoop cluster was the most common bottleneck to overall system performance, but this is no longer the case today.
Kudu is good on all these fronts, offering both real-time and heavy analytical capabilities. It offers the capabilities of HBase and matches the performance of HDFS for analytics, without the complications usually involved in achieving both. In fact, the very raison d’être of Kudu is to be a storage system for the Hadoop ecosystem suitable for mixed workloads.
When Cloudera started its work on Kudu, the goals were to develop a Big Data engine which would:
- Deliver strong performance for both scan and random access
- Extract high CPU and IO efficiency
- Have the capability to update data in place
- Have the ability to support active-active replicated clusters spanning several data centers across the world
Since then, Kudu has evolved into the first Big Data storage engine to closely resemble a traditional relational store, while delivering exceptional performance, handling huge data volumes, and distributing data across a cluster.
A Columnar Structure Overcomes the Limitations of Key-Value Engines
Kudu is a key-value engine, with no direct support for SQL. Key-value engines offer an inherent advantage of extremely fast and reliable writes. They are easy to use, but come with a big disadvantage with regard to queries. These databases scatter data all over the cluster, often based on a key hash. When users query the data, the engine performs extensive table scans to “gather” all the data back before it can produce a sensible result. As such, even the most common database operations – sort, search, grouping, and more – consume a lot of resources, rendering the database highly inefficient.
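The cost described above can be sketched in a few lines of Python. This is a hypothetical toy store, not any real engine's API: rows are spread across partitions by a key hash, so a query that filters on a non-key field has no choice but to scan every partition.

```python
# Toy hash-partitioned key-value store (illustrative only, not a real engine).
from hashlib import md5

NUM_NODES = 4
partitions = {n: {} for n in range(NUM_NODES)}

def node_for(key):
    # Place each key on a node by hashing it.
    return int(md5(key.encode()).hexdigest(), 16) % NUM_NODES

def put(key, row):
    partitions[node_for(key)][key] = row

def query_by_city(city):
    # No index on 'city': every partition must be scanned ("gathered").
    return [row for part in partitions.values()
            for row in part.values() if row["city"] == city]

put("u1", {"name": "Ada", "city": "London"})
put("u2", {"name": "Alan", "city": "London"})
put("u3", {"name": "Grace", "city": "Arlington"})

assert len(query_by_city("London")) == 2   # touched all 4 partitions to find 2 rows
```

Fast lookups by key come for free here; anything else degenerates into a full scan, which is exactly the weakness the following sections address.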
Incumbent key-value databases such as HBase and Cassandra adopt column families, which mitigate the situation to an extent. However, purely random queries are still not easy, and it takes clever database design to make the engine do what is required. Other databases have invested in indexing capabilities and query languages as a workaround, but these add extra layers to compensate for the limitations of the underlying design, leading to complexity.
Apache Kudu adopts a flexible columnar structure, organizing data into columns as opposed to the traditional row structure of an RDBMS, to overcome this big limitation of conventional key-value engines. Such a column structure is similar to the organization of files in the Apache Parquet data format. A query still resembles a legacy table scan, but it is much faster because the engine needs to read only the given columns rather than the entire database. The column data may also be compressed or otherwise encoded for even faster scanning.
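A minimal sketch of that row-versus-column trade-off, assuming nothing about Kudu's actual storage format: the same data stored both ways, with an aggregate query that only needs one column.

```python
# Row layout vs. columnar layout for the same data (illustrative only).

rows = [
    {"id": 1, "name": "Ada",   "age": 36},
    {"id": 2, "name": "Alan",  "age": 41},
    {"id": 3, "name": "Grace", "age": 45},
]

def avg_age_row(rows):
    # Row layout: each record is read whole, so 'id' and 'name'
    # are dragged in even though the query never uses them.
    return sum(r["age"] for r in rows) / len(rows)

# Columnar layout: each column is stored contiguously.
columns = {k: [r[k] for r in rows] for k in rows[0]}

def avg_age_col(columns):
    ages = columns["age"]          # only this column is touched
    return sum(ages) / len(ages)

assert avg_age_row(rows) == avg_age_col(columns)
```

At three rows the difference is invisible, but at billions of rows with wide schemas, reading one contiguous column instead of every record is what lets a columnar scan approach Parquet-on-HDFS speeds.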
Hadoop implementations using Kudu also have the option of utilizing query engines such as Apache Spark, Apache Impala, or Apache Drill as the interface to the engine. The advantage of such query engines is familiarity, for most users are already accustomed to similar interfaces from traditional database engines.
Fast and Real-Time Capabilities
Apache Kudu reads and writes in any way the application wants, making it the closest Big Data engine to a traditional database. Most columnar engines make a trade-off, favoring read efficiency over write efficiency and flexibility. Apache Kudu offers the best of both worlds, supporting concurrent high-speed writes even while queries are in progress. It writes data as it arrives and runs queries concurrently, sparing the need, and the associated hassle, of staging data into the engine before querying it. Such functionality is invaluable as newer IoT-based applications unfold.
Data distribution and reliability
Apache Kudu is a fully distributed database with built-in reliability.
Its tables come with a well-defined schema and a preset number of typed columns. Each table features a primary key made up of one or more columns, which enforces a uniqueness constraint. The tables are made up of a series of logical subsets of data called tablets, which are similar to partitions in relational database systems. The engine uses the Raft consensus algorithm, enabling configuration of multiple copies of the data, usually three, and thereby ensuring data safety in the event of hardware failure. The other data safety and integrity features on offer include built-in failover and rebalancing.
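The arithmetic behind the three-copy default is worth spelling out. In Raft-style replication, a write is durable once a majority of replicas acknowledge it, so a three-replica tablet keeps accepting writes with one replica down. A rough sketch of that majority rule (illustrative only, not Kudu's internals):

```python
# Why three replicas tolerate one failure: majority-commit arithmetic.

def majority(n):
    # Smallest number of replicas that constitutes a majority of n.
    return n // 2 + 1

def write_committed(acks, replicas=3):
    # A write commits once a majority of replicas acknowledge it.
    return acks >= majority(replicas)

assert majority(3) == 2
assert write_committed(2)        # 2 of 3 acks: commit succeeds
assert not write_committed(1)    # the leader alone cannot commit
# With one replica down, the surviving two still form a majority,
# so the tablet remains available for both reads and writes.
```

The same rule explains why replica counts are odd: five replicas tolerate two failures, but four replicas still only tolerate one.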
A variation of log-structured storage buffers updates, inserts, and deletes in memory before merging them into the columnar storage, thereby offering effective protection against the spikes in query latency usually associated with such architectures. This process also keeps maintenance work flowing as a steady stream of small operations, preempting large, disruptive maintenance operations.
Kudu’s two-tier sharding mechanism, by key hash and/or by key range, supports data distribution. This ground-up design makes multiple query types easy to support: looking up a specific value by its key, scanning a range of keys in key order, or executing an arbitrary query across any number of columns. The engine can thus return fast results for specific data items, and still handle arbitrary random queries when a targeted lookup is not feasible.
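A sketch of what two-tier sharding means in practice, under assumed example columns (a `host` column hashed for load spreading, a `year` column range-partitioned for ordered scans); real Kudu tables declare their partitioning in the schema at creation time rather than in application code.

```python
# Two-tier sharding sketch: hash on one column, range on another.
import zlib

NUM_BUCKETS = 4                      # tier 1: hash buckets spread write load
RANGE_BOUNDS = [2015, 2016, 2017]    # tier 2: year boundaries for range scans

def hash_bucket(host):
    # crc32 keeps the example deterministic across runs.
    return zlib.crc32(host.encode()) % NUM_BUCKETS

def range_index(year):
    # Time-adjacent rows land in the same range partition,
    # so a scan over one year touches few tablets.
    for i, bound in enumerate(RANGE_BOUNDS):
        if year < bound:
            return i
    return len(RANGE_BOUNDS)

def tablet_for(host, year):
    # A row's tablet combines both tiers.
    return (hash_bucket(host), range_index(year))

assert range_index(2014) == 0 and range_index(2017) == 3
assert tablet_for("hostA", 2015) == tablet_for("hostA", 2015)
```

The hash tier prevents hot-spotting when many writers insert at once, while the range tier keeps key-ordered scans cheap, which is how the one design serves both point lookups and analytical range queries.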
Direct APIs in C++ and Java facilitate point and batch retrieval of rows, as well as writes, deletes, and schema changes, among other operations.
While its API resembles the semantics of a standalone engine, Apache Kudu is a well-developed distributed columnar engine, with a feature set that fills a big void in the Big Data space. The software is still in the pre-1.0 release phase, though many organizations have already embraced it for production with keen interest. The real-time analytical capabilities, the efficient utilization of modern CPU and I/O resources, the ability to update data in place, and a simple, extensible data model make it a database with a bright future.
The data engine has been named quite aptly too – Kudu is an African antelope with vertical stripes, which has a striking resonance with the columnar data store of the Apache Kudu project!