Cloudera and Apache Iceberg – Collaborating on the same data
Cloudera recently announced the general availability of Apache Iceberg in the Cloudera Data Platform (CDP). This article provides background on data lake storage, the challenges of organizing data in data lake storage, the emergence of Apache Iceberg as the data management standard in data lakes, and finally, the benefits for existing and potential users of Cloudera CDP.
Data Organization Challenges in Data Lake Storage
Data lakes provide virtually unlimited storage for structured and unstructured data. A data lake is a shared repository that an organization’s applications can access for various tasks, including reporting, analysis, and processing.
The Apache Hadoop Distributed File System (HDFS), where Cloudera has its roots, formed the foundation of traditional data lakes. The trend today is towards cloud data lakes that use object storage systems such as Amazon S3 and Microsoft Azure Data Lake Storage (ADLS).
Data is stored in the data lake exactly as it was collected. A structured dataset retains its original structure without further indexing or metadata. Likewise, unstructured data such as social media posts, images, and MP3 files land in their original native format.
A data lake can only work if the data can be extracted and used for analysis, which requires data governance. Data catalogs, such as Hive Metastore (HMS), apply metadata and hierarchical logic to incoming data, so datasets receive the necessary context and traceable lineage.
The limits of a catalog
While catalogs provide a shared definition of the structure of a dataset in data lake storage, they do not track data changes or schema evolution across applications. For example, the structure of a large dataset, including column names and datatypes, may be cataloged by Hive, but the catalog does not know which data files make up the dataset. Therefore, applications must read file metadata to identify which files are part of a dataset at any given time.
Data integrity isn’t much of an issue if the dataset is static and doesn’t change. However, when one application writes to and modifies the dataset, any other application reading from the same dataset must stay synchronized with the changes. For example, if an ETL (Extract, Transform, Load) process updates the dataset by adding and removing multiple files from storage, another application reading the dataset at the same time may process a partial or inconsistent view of it and generate incorrect results.
What is Apache Iceberg?
Apache Iceberg is a new open table format that allows multiple applications to work together on the same data in a transactional way. It tracks the state of each dataset as it changes over time.
Anyone familiar with traditional SQL tables will find the Iceberg table format immediately recognizable. The format is open, so multiple engines can operate on the same dataset.
HMS, for example, tracks data at the “folder” level, which requires file list operations whenever an engine works with data in a table and can often degrade performance.
Iceberg avoids this by keeping track of a complete list of all files in a table using a persistent tree.
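The difference between listing a folder and reading a tracked file list can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Iceberg's actual metadata layout or API: the `Snapshot` and `ToyTable` names and the file names are hypothetical. Each commit publishes a new, complete, immutable file list, so a reader that has pinned a snapshot keeps a consistent view while a writer adds and removes files.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """An immutable, complete list of the data files that make up the table."""
    snapshot_id: int
    files: FrozenSet[str]


@dataclass
class ToyTable:
    """Hypothetical stand-in for table metadata; not Iceberg's real API."""
    snapshots: List[Snapshot] = field(
        default_factory=lambda: [Snapshot(0, frozenset())]
    )

    def current(self) -> Snapshot:
        # Readers ask the metadata for the file list instead of listing a folder.
        return self.snapshots[-1]

    def commit(self, added: Iterable[str] = (), removed: Iterable[str] = ()) -> Snapshot:
        # Build the full new file list first, then publish it in a single step.
        base = self.current()
        new_files = (base.files - set(removed)) | set(added)
        snapshot = Snapshot(base.snapshot_id + 1, frozenset(new_files))
        self.snapshots.append(snapshot)
        return snapshot


table = ToyTable()
table.commit(added=["a.parquet", "b.parquet"])

reader_view = table.current()  # a reader pins the snapshot it started with

# An ETL job rewrites the table while the reader is still running.
table.commit(added=["c.parquet"], removed=["a.parquet"])

print(sorted(reader_view.files))      # the reader's view is unchanged
print(sorted(table.current().files))  # new readers see the rewritten table
```

In this sketch the reader never observes a half-finished rewrite, because the set of files it works from was fixed when it pinned the snapshot.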
Apache Iceberg was developed at Netflix to address issues with huge petabyte-scale tables and was donated to the open source community in 2018 as an Apache Incubator project.
Benefits for Cloudera CDP users
General Availability covers running Iceberg in the essential data services of Cloudera Data Platform (CDP), including Cloudera Data Warehouse (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML).
Cloudera has integrated Iceberg into the Shared Data Experience (SDX) layer of CDP, so the productivity and performance benefits of the open table format are immediate. Additionally, the native Iceberg integration takes advantage of SDX’s enterprise-grade features, such as data lineage, auditing, and security.
Iceberg tables in CDP integrate into the SDX Metastore for table structure and access validation, enabling the creation of fine-grained access and auditing policies. Iceberg allows CDP to expose the same dataset to multiple analytics engines, including Spark, Hive, Impala, and Presto.
The CDP Iceberg integration has four other benefits that users will appreciate:
In-place table evolution saves time
Users can evolve a table schema or change the partition layout with a single command, just as they would in SQL. Iceberg does not require laborious and expensive processes, such as rewriting table data or migrating to a new table.
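One reason a schema change can be a metadata-only operation is that Iceberg tracks columns by unique IDs rather than by name. The Python sketch below illustrates that idea only; the `ToySchema` class, its methods, and the column names are hypothetical and do not mirror any real Iceberg API. Stored rows are keyed by field ID, so renaming or adding a column changes the name-to-ID mapping while the stored data stays untouched.

```python
from typing import Any, Dict


class ToySchema:
    """Hypothetical schema that maps column names to stable field IDs."""

    def __init__(self) -> None:
        self._next_id = 1
        self._name_to_id: Dict[str, int] = {}

    def add_column(self, name: str) -> int:
        # A new column gets a fresh ID; existing data files are not rewritten.
        field_id = self._next_id
        self._name_to_id[name] = field_id
        self._next_id += 1
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        # Only the name changes; the field ID stored with the data stays the same.
        self._name_to_id[new] = self._name_to_id.pop(old)

    def project(self, stored_row: Dict[int, Any]) -> Dict[str, Any]:
        # Resolve names through IDs; columns missing from old files read as None.
        return {name: stored_row.get(fid) for name, fid in self._name_to_id.items()}


schema = ToySchema()
schema.add_column("customer_id")  # field ID 1
schema.add_column("country")      # field ID 2

# A row as written by an older job, keyed by field ID rather than by name.
old_row = {1: 42, 2: "DE"}

schema.rename_column("country", "country_code")
schema.add_column("segment")      # field ID 3, absent from the old row

print(schema.project(old_row))
```

The old row is read correctly under the new schema without being rewritten, which is the property that makes in-place evolution cheap in this toy model.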
Time travel for forensic visibility and regulatory compliance
Iceberg retains previous table snapshots, enabling time travel queries and table rollbacks.
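How retained snapshots make time travel and rollback possible can be sketched in a few lines of Python. This is an illustrative toy, not Iceberg's implementation: the `SnapshotLog` class, its method names, and the integer timestamps are assumptions made for the example, and commits are assumed to arrive in increasing timestamp order.

```python
import bisect
from typing import Iterable, List, Tuple


class SnapshotLog:
    """Hypothetical log of table states; commit times are assumed to increase."""

    def __init__(self) -> None:
        self._times: List[int] = []
        self._states: List[Tuple[str, ...]] = []
        self._current = -1

    def commit(self, committed_at: int, rows: Iterable[str]) -> None:
        # Every commit keeps the previous states instead of overwriting them.
        self._times.append(committed_at)
        self._states.append(tuple(rows))
        self._current = len(self._states) - 1

    def _index_as_of(self, timestamp: int) -> int:
        # Find the latest snapshot committed at or before the given time.
        idx = bisect.bisect_right(self._times, timestamp) - 1
        if idx < 0:
            raise LookupError("no snapshot exists at or before that time")
        return idx

    def read(self) -> Tuple[str, ...]:
        return self._states[self._current]

    def read_as_of(self, timestamp: int) -> Tuple[str, ...]:
        # Time travel: read an older state without changing the current one.
        return self._states[self._index_as_of(timestamp)]

    def rollback_to(self, timestamp: int) -> None:
        # Rollback: point the table back at an older snapshot.
        self._current = self._index_as_of(timestamp)


log = SnapshotLog()
log.commit(100, ["r1"])
log.commit(200, ["r1", "r2"])
log.commit(300, ["r2"])

as_of_250 = log.read_as_of(250)  # state as of time 250
latest = log.read()              # current state before the rollback

log.rollback_to(100)
after_rollback = log.read()
```

A time travel read leaves the current table state alone, while a rollback moves the current pointer. Both rely only on snapshots that were already retained.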
Multifunctional analysis from edge to AI
Iceberg enables seamless integration between different streaming and processing engines while maintaining data integrity between them. Multiple engines can modify the table simultaneously, even with partial writes, without correctness issues and without the need for expensive read locks.
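Concurrent writers can stay correct without long-held locks by using optimistic concurrency: each writer prepares its change against the version it read and then attempts an atomic swap, retrying if another writer got there first. The Python sketch below shows that general technique under stated assumptions. The `ToyCatalog` class, the `append_with_retry` helper, and the file names are hypothetical, and the short internal lock merely stands in for whatever atomic swap primitive a real catalog provides.

```python
import threading
from typing import FrozenSet, Tuple


class CommitConflict(Exception):
    """Raised when the table changed since the writer read it."""


class ToyCatalog:
    """Hypothetical catalog holding a versioned pointer to the table's files."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stands in for an atomic swap primitive
        self._version = 0
        self._files: FrozenSet[str] = frozenset()

    def load(self) -> Tuple[int, FrozenSet[str]]:
        with self._lock:
            return self._version, self._files

    def compare_and_swap(self, expected_version: int, files: FrozenSet[str]) -> None:
        # The swap succeeds only if nobody else committed in the meantime.
        with self._lock:
            if self._version != expected_version:
                raise CommitConflict(expected_version)
            self._files = files
            self._version += 1


def append_with_retry(catalog: ToyCatalog, new_file: str, max_attempts: int = 10) -> int:
    """Append a file optimistically; return the number of attempts needed."""
    for attempt in range(1, max_attempts + 1):
        version, files = catalog.load()
        try:
            catalog.compare_and_swap(version, files | {new_file})
            return attempt
        except CommitConflict:
            continue  # reload the latest state and try again
    raise CommitConflict(f"gave up after {max_attempts} attempts")


catalog = ToyCatalog()
writers = [
    threading.Thread(target=append_with_retry, args=(catalog, f"file-{i}.parquet"))
    for i in range(8)
]
for writer in writers:
    writer.start()
for writer in writers:
    writer.join()

final_version, final_files = catalog.load()
```

Every failed attempt in this sketch means some other writer committed successfully, so with eight writers no writer can fail more than seven times and the retry limit of ten is never reached.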
Improved performance with very large scale datasets
Partitioning divides a table into parts based on certain attributes, grouping similar rows together at write time so that queries can skip data they do not need.
Iceberg simplifies this with hidden partitioning: it handles the details of partitioning and querying itself, so users do not need to know how a table is physically laid out.
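The idea behind hidden partitioning can be sketched in Python as follows. This is a toy illustration rather than Iceberg's implementation: the `HiddenPartitionedTable` class, the `day_transform` function, and the `event_time` column are hypothetical names chosen for the example. The table derives the partition value from a regular column when writing, and it translates a filter on that same column into partition pruning when reading, so the user never refers to a partition column.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Callable, DefaultDict, Dict, List, Tuple

Row = Dict[str, object]


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the day from a timestamp."""
    return ts.date()


class HiddenPartitionedTable:
    """Hypothetical table that partitions on a value derived from event_time."""

    def __init__(self, transform: Callable[[datetime], date]) -> None:
        self._transform = transform
        self._partitions: DefaultDict[date, List[Row]] = defaultdict(list)

    def append(self, row: Row) -> None:
        # The writer supplies only event_time; the partition value is derived.
        self._partitions[self._transform(row["event_time"])].append(row)

    def scan(self, start: datetime, end: datetime) -> Tuple[List[Row], int]:
        """Return rows with start <= event_time <= end and the partitions read."""
        low, high = self._transform(start), self._transform(end)
        matched: List[Row] = []
        partitions_read = 0
        for key, rows in self._partitions.items():
            if low <= key <= high:  # prune partitions outside the time range
                partitions_read += 1
                matched.extend(r for r in rows if start <= r["event_time"] <= end)
        return matched, partitions_read


events = HiddenPartitionedTable(day_transform)
events.append({"id": 1, "event_time": datetime(2022, 1, 1, 10, 0)})
events.append({"id": 2, "event_time": datetime(2022, 1, 2, 9, 0)})
events.append({"id": 3, "event_time": datetime(2022, 1, 2, 20, 0)})
events.append({"id": 4, "event_time": datetime(2022, 1, 3, 11, 0)})

# The query filters on event_time only; no partition column is mentioned.
rows, partitions_read = events.scan(
    datetime(2022, 1, 2, 8, 0), datetime(2022, 1, 2, 18, 0)
)
```

Of the three daily partitions in this example, only the one for January 2 is read, even though the query was written purely in terms of the timestamp column.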
I like what Cloudera has done here. Analysts and data scientists can easily collaborate on the same data using analytical tools and engines. Because Iceberg is built into CDP, users get its benefits with no extra effort. No more locking, unnecessary data transformations, or moving data between tools and clouds to extract insights from data.
This is very much in line with Cloudera’s strategy to take open-source technologies and add enterprise-grade quality and stability. Larger enterprises with large amounts of data look to Cloudera as the company to manage that data end-to-end on-premises or in the public cloud or even to collect data that comes from a SaaS application. Cloudera does a great job bringing it all together as a one-stop-shop for data management.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.
Moor Insights & Strategy, like all research and technology industry analytics companies, provides or has provided paid services to technology companies. These services include research, analysis, consulting, benchmarking, acquisition matching and conference sponsorship. The company has had or currently has paid business relationships with 8×8, Accenture, A10 Networks, Advanced Micro Devices, Amazon, Amazon Web Services, Ambient Scientific, Anuta Networks, Applied Brain Research, Applied Micro, Apstra, Arm, Aruba Networks (now HPE), Atom Computing, AT&T, Aura, Automation Anywhere, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, C3.AI, Calix, Campfire, Cisco Systems, Clear Software, Cloudera, Clumio, Cognitive Systems, CompuCom, Cradlepoint, CyberArk, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Dialogue Group, Digital Optics, Dreamium Labs, D-Wave, Echelon, Ericsson, Extreme Networks, Five9, Flex, Foundries.io, Foxconn, Frame (now VMware), Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Revolve (now Google), Google Cloud, Graphcore, Groq, Hiregenics, Hotwire Global, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, IBM, Infinidat, Infosys, Inseego, IonQ, IonVR, Infiot, Intel, Interdigital, Jabil Circuit, Keysight, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, Lightbits Labs, LogicMonitor, Luminar, MapBox, Marvell Technology, Mavenir, Marseille Inc, Mayfair Equity, Meraki (Cisco), Merck KGaA, Mesophere, Micron Technology, Microsoft, MiTEL, Mojo Networks, MongoDB, MulteFire Alliance, National Instruments, Neat, NetApp, Nightwatch, NOKIA (Alcatel-Lucent), Nortek, Novumind, NVIDIA, Nutanix, Nuvia (now Qualcomm), onsemi, UNOG, OpenStack Foundation, Oracle, Palo Alto Networks, Panasas, Peraso, Pexip, Pixelworks, Plume Design, PlusAI, Poly (formerly Plantronics), Portworx, Pure Storage, Qualcomm, Quantinuum, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Renesas, Residio, Samsung Electronics, Samsung Semi, SAP, SAS, Scale Computing, Schneider Electric, SiFive, Silver Peak (now Aruba-HPE), SkyWorks, SONY Optical Storage, Splunk, Springpath (now Cisco), Spirent, Sprint (now T-Mobile), Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, Telesign, TE Connectivity, TensTorrent, Tobii Technology, Teradata, T-Mobile, Treasure Data, Twitter, Unity Technologies, UiPath, Verizon Communications, VAST Data, Ventana Micro Systems, Vidyo, VMware, Wave Computing, Wellsmith, Xilinx, Zayo, Zebra, Zededa, Zendesk, Zoho, Zoom and Zscaler. Patrick Moorhead, Founder, CEO and Chief Analyst of Moor Insights & Strategy, is an investor in dMY Technology Group Inc. VI, Dreamium Labs, Groq, Luminar Technologies, MemryX and Movandi.