Engineers at Netflix and Apple created Apache Iceberg several years ago to address the performance and usability challenges of using Apache Hive tables in large and demanding data lake environments. Now the data table format is the focus of a burgeoning ecosystem of data services that could automate time-consuming engineering tasks and unleash a new era of big data productivity.
Apache Hive emerged over a decade ago to make Apache Hadoop clusters look and function more like standard relational databases accessible through SQL. While Hadoop usage has waned in the age of cloud data lakes like AWS S3 and Azure Data Lake Storage (ADLS), the Hive legacy continues, both as a query engine for large, batch-oriented analytic jobs and, arguably more so, as a table format and metadata catalog used by other query engines, including Apache Spark and Presto.
In this manner, Hive acts as a unifying layer that enables these engines to function on a common set of data stored in Hadoop clusters and S3-compatible data lakes. While Hive marked a significant step forward in big data storage and analytics a decade ago, its technical limitations today are forcing data engineers and analysts to embark upon expensive and time-consuming workarounds to store and analyze massive data sets effectively.
One of the big problems with Hive is that it doesn’t adapt well to changing datasets. It can handle static data just fine, but if a user or an application makes changes to the data, such as an ETL job that updates a Parquet file, then those changes have to be coordinated with other applications or users. If this coordination does not happen, then the data is at risk of becoming corrupted and returning wrong answers when queried.
This was one of the main drivers behind the Apache Iceberg project. Engineers at Apple and Netflix started the Iceberg project around 2018 to address the limitations in using Hive tables to store and query massive data sets. Ryan Blue, a senior engineer at Netflix and the PMC Chair of the Apache Iceberg project, recently discussed the genesis of Iceberg and the direction it’s headed in a session at the Subsurface 2021 conference, which was sponsored by Dremio and held last month.
“Iceberg exists because Netflix slowly realized we needed a new table format,” Blue said. “Many different services and engines were using Hive tables. But the problem was, we didn’t have that correctness guarantee. We didn’t have atomic transactions. And sometimes changes from one system caused another system to get the wrong data, and that sort of issue caused us to just never use these services, not make changes to our tables, just to be safe.”
Iceberg to the Rescue
The number one goal of the Iceberg project was to ensure correctness in the data, Blue said.
“Quite simply, tables shouldn’t lie to you when you query them,” he said. “It’s a really simple thing. But we survived for a very, very long period of time where these tables were being updated, or your file system was, say, S3 and didn’t provide a consistent listing, your tables could easily lie to you.”
Iceberg, which is written in Java and also offers a Scala API, effectively solves this dilemma by enforcing transactional consistency in the data, even when it’s accessed by multiple applications. According to Dremio’s description of Iceberg, the Iceberg table format “has similar capabilities and functionality as SQL tables in traditional databases but in a fully open and accessible manner such that multiple engines (Dremio, Spark, etc.) can operate on the same dataset.”
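Iceberg’s core trick is that readers and writers never touch a shared mutable file listing: each write produces a new immutable snapshot, and a commit is just an atomic swap of the table’s metadata pointer. The toy Python sketch below models that idea conceptually; it is not the real Iceberg API, and the class, names, and lock (standing in for an atomic catalog swap) are all illustrative assumptions.

```python
import threading

class IcebergLikeTable:
    """Toy model of snapshot-based commits (not the real Iceberg API).

    Readers see whatever snapshot the current metadata pointer
    references; writers build a new snapshot and swap the pointer
    atomically, so a query never observes a half-written table.
    """
    def __init__(self):
        self._lock = threading.Lock()      # stands in for an atomic catalog swap
        self._snapshots = [{"files": []}]  # snapshot 0: empty table
        self._current = 0                  # the "metadata pointer"

    def read(self):
        # A reader pins a snapshot; later commits cannot change it.
        return self._snapshots[self._current]["files"]

    def commit(self, new_files):
        with self._lock:
            base = self._snapshots[self._current]["files"]
            # New snapshot = old file list plus the new files; old
            # snapshots are never mutated, only superseded.
            self._snapshots.append({"files": base + list(new_files)})
            self._current = len(self._snapshots) - 1  # atomic pointer swap

table = IcebergLikeTable()
table.commit(["part-00000.parquet"])
snapshot = table.read()                 # pinned, consistent view
table.commit(["part-00001.parquet"])    # concurrent writer lands a commit
print(snapshot)                         # still the old view: one file
print(table.read())                     # new snapshot: both files
```

Because old snapshots are immutable, a long-running query keeps a consistent view even while other jobs commit, which is exactly the correctness guarantee Hive tables lacked.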
In addition to support for atomic transactions, the second major obstacle the Iceberg project tackled was enabling operations to be performed at a finer-grained level than the partition, Blue said.
“We needed to be able to rewrite data at the file level in order to do more efficient writes,” he said. “We wanted appends that could append to multiple partitions at the same time. And we also had a lot of performance needs on the query side. We also wanted to be able to…filter down to individual files that are needed for a particular query.”
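Filtering down to individual files works because Iceberg tracks per-file statistics (such as column min/max values) in its manifest metadata, so a query planner can skip files that cannot possibly match. The sketch below is a simplified, hypothetical illustration of that pruning idea; the file paths, field names, and `prune` helper are invented for the example.

```python
# Hypothetical per-file statistics, loosely modeled on the column
# min/max values Iceberg keeps in its manifest files.
data_files = [
    {"path": "s3://bucket/t/file-a.parquet", "min_ts": "2021-01-01", "max_ts": "2021-01-07"},
    {"path": "s3://bucket/t/file-b.parquet", "min_ts": "2021-01-08", "max_ts": "2021-01-14"},
    {"path": "s3://bucket/t/file-c.parquet", "min_ts": "2021-01-15", "max_ts": "2021-01-21"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range can overlap the query range."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# A query over Jan 9-10 only needs one of the three files.
print(prune(data_files, "2021-01-09", "2021-01-10"))
```

The planner never lists or opens the other two files, which is why Blue says job planning and execution get faster: far less data is read.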
The third major need with Iceberg was to improve other tasks that were consuming large amounts of time for Netflix’s data engineers.
“The way our schemas were evolving, you could evolve by name or you could evolve by partition,” Blue said. “You had different behaviors depending on what underlying file formats you had. We wanted to build something where all of those problems were solved and we didn’t have to worry about ‘Is this system identifying columns by name? If so, oh, then I can’t delete and add something that has the same name.’ Just things that tables should not do. Those concerns were costing our data engineers quite a bit of time.”
For example, with Iceberg a dataset can be changed from partitioning data by the day to partitioning it by the hour in a fairly painless manner, without copying all the data to a brand new table with the right attributes and then rewriting all the queries to use the new table. Under Hive, “that was just a huge cost and could seriously be a month of someone’s time just to slightly change the layout of a table,” Blue said. “So we wanted to have in-place table evolution.”
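In-place partition evolution is possible because each data file records which partition spec wrote it: old files keep their daily layout while new writes use the hourly one, and nothing gets rewritten. The Python sketch below models that idea with two toy transform functions loosely named after Iceberg’s `days()`/`hours()` transforms; the file names and structure are illustrative assumptions, not the real implementation.

```python
from datetime import datetime

# Toy partition transforms, loosely modeled on Iceberg's days()/hours().
def days(ts):  return ts.strftime("%Y-%m-%d")
def hours(ts): return ts.strftime("%Y-%m-%d-%H")

# Each data file remembers the spec that wrote it, so old files keep
# their daily partition values after the table's spec changes.
table_files = [
    {"path": "old.parquet", "spec": days, "partition": days(datetime(2021, 1, 5, 13))},
]

# "Evolve" the spec in place: new writes use hours(); no data is copied.
current_spec = hours
table_files.append(
    {"path": "new.parquet", "spec": hours, "partition": hours(datetime(2021, 3, 2, 9))}
)

print([f["partition"] for f in table_files])  # mixed daily and hourly layouts
```

Query planning then applies the right transform per file, which is what turns a month of table-migration work into a metadata change.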
Those three goals have been met with Iceberg, which is emerging as a new standard for table formats in S3-compatible data lakes and Hadoop. In addition to Apple and Netflix, it’s been adopted by Expedia and Adobe, among others.
“We’ve solved those three challenges,” Blue said. “We have high performance tables that operate at tens of petabytes, if not hundreds of petabytes–we simply haven’t gone up that far. And we’re able to do job planning much more quickly, job execution much more quickly because we’re reading far less data. We have vectorization. We really have achieved high-performance tables at scale, meeting the performance and usability [goals] that we set out to do. That’s really exciting.”
In addition to the Spark and Presto support, integrations have been built that enable Iceberg to be used in Trino (formerly Presto SQL), Apache Flink, and the Dremio query engine. An integration that would enable Apache Beam to read and write data in Iceberg table formats is also in the works.
A New Data Service Ecosystem
“We’re seeing that the way to move forward, or at least the way we are moving forward, is that we’re building services that actually coordinate through Iceberg tables,” Blue said. “We’re building services that are actually really good at one thing, like knowing when to delete data, to clean it up, and to focus on our compliance concerns, or to compact data, or do some other maintenance. And all of those disparate services and processing engines are able to coordinate through Iceberg tables.”
Hive was great because it provided a standard method that multiple applications could use to store and access big data. As the successor to Hive tables, Iceberg is building on that Hive legacy, while providing the correctness guarantee that is critical to operating at massive scale.
“We were putting so much work on data engineers and data analysts and anybody writing data,” Blue said. “We were telling them that they had to handle the small file problem. They needed to think about sort order. They needed to think about how do I tune Parquet settings and all these responsibilities.”
Iceberg allowed users to take those responsibilities off the shoulders of the analysts and engineers and instead put them on the data platform. Data compaction and other data maintenance tasks, such as optimizing data, can become the sole responsibility of various data services that run automatically, Blue said.
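A compaction service of the kind Blue describes is, at its core, a bin-packing planner: it scans the table’s file metadata, groups small files up to a target size, and rewrites each group as one larger file. The sketch below shows just the planning step; the 128 MB target, file names, and `plan_compaction` helper are illustrative assumptions rather than any service’s actual logic.

```python
# Toy bin-packing plan for compaction: group small files until a target
# size is reached, the way an automated maintenance service might plan
# its rewrite jobs. 128 MB is an assumed target, not a standard.
TARGET_BYTES = 128 * 1024 * 1024

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Group (name, size) pairs into rewrite groups of roughly `target` bytes."""
    groups, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target:
            groups.append(current)            # flush a full group
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)                # flush the remainder
    return groups

small_files = [(f"f{i}.parquet", 40 * 1024 * 1024) for i in range(7)]  # 7 x 40 MB
print(plan_compaction(small_files))  # three rewrite groups
```

Because commits are atomic snapshots, such a service can rewrite each group and swap it in without disturbing readers or writers on the same table.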
“I’m really excited about the potential in this area for data services and I really want to see what everyone out there decides to build,” he said. “Indexing, metadata changes, view materialization–all of these things are now on the table and I’m really excited to see what the community produces. Hopefully we have an ecosystem of services that are coordinating through Iceberg tables and really doing great things.”