1. Introduction

This tutorial will discuss Apache Iceberg, a popular open table format in today’s big data landscape.

We’ll explore Iceberg’s architecture and some of its important features through a hands-on example with open-source distributions.

2. Origin of Apache Iceberg

Iceberg was started at Netflix by Ryan Blue and Dan Weeks around 2017. It came into existence mainly because of the limitations of the Hive table format. One of the critical issues with Hive was its inability to guarantee correctness, since it lacked support for atomic transactions.

The design goals of Iceberg were to address these issues and provide three key improvements:

  • Support ACID transactions and ensure the correctness of data
  • Improve performance by allowing fine-grained operations at the level of files
  • Simplify and automate table maintenance

Iceberg was later open-sourced and contributed to the Apache Software Foundation, where it became a top-level project in 2020. Since then, Apache Iceberg has become the most popular open standard for table formats, and almost all major players in the big data landscape now support Iceberg tables.

3. Architecture of Apache Iceberg

One of Iceberg's key architectural decisions was to track the complete list of data files within a table instead of directories. This approach has many advantages, such as better query performance.

This all happens in the metadata layer, one of the three layers in Iceberg's architecture:

[Diagram: the Apache Iceberg table format architecture]

Let's walk through what happens here. When a table is read, Iceberg loads the table's metadata and uses the current snapshot (s1). If we update the table, the update optimistically creates a new metadata file with a new snapshot (s2).

Then, the value of the current metadata pointer is atomically updated to point to this new metadata file. If the snapshot on which this update was based (s1) is no longer current, the write operation must be aborted.

3.1. Catalog Layer

The catalog layer has several functions, but most importantly, it stores the location of the current metadata pointer. Any compute engine that wishes to operate on Iceberg tables must access the catalog and get this current metadata pointer.

The catalog also supports atomic operations while updating the current metadata pointer. This is essential for allowing atomic transactions on Iceberg tables.

The available features depend on the catalog we use. For instance, Nessie provides Git-inspired data version control.

3.2. Metadata Layer

The metadata layer contains a hierarchy of files. At the top is the metadata file, which stores metadata about an Iceberg table. It tracks the table's schema, partitioning configuration, custom properties, snapshots, and which snapshot is the current one.

The metadata file points to a manifest list, which is a list of manifest files. The manifest list stores metadata about each manifest file that makes up a snapshot, such as the location of the manifest file and the snapshot it was added as part of.

Finally, the manifest file tracks data files and provides additional details about them. Manifest files allow Iceberg to track data at the file level and contain useful information that improves the efficiency and performance of read operations.
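
Once we have a query engine connected (we'll set up Trino in the hands-on section below), we can peek into this hierarchy through Iceberg's metadata tables. Here's a minimal sketch, assuming the iceberg.demo.customer table that we create later in this tutorial:

-- List the snapshots recorded in the table's metadata file
SELECT snapshot_id, manifest_list FROM iceberg.demo."customer$snapshots";

-- List the manifest files that make up the current snapshot
SELECT path, added_data_files_count FROM iceberg.demo."customer$manifests";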

3.3. Data Layer

The data layer is where data files sit, most likely in a cloud object storage service like AWS S3. Iceberg supports several file formats, such as Apache Parquet, Apache Avro, and Apache ORC.

Parquet is the default file format for storing data in Iceberg. It’s a column-oriented data file format. Its key benefit is efficient storage. Moreover, it comes with high-performance compression and encoding schemes. It also supports efficient data access, especially for queries that target specific columns from a wide table.

4. Important Features of Apache Iceberg

Apache Iceberg offers transactional consistency, allowing multiple applications to work together on the same data.

It also has features like snapshots, complete schema evolution, and hidden partitioning.

4.1. Snapshots

Iceberg table metadata maintains a snapshot log that represents the changes applied to a table.

Hence, a snapshot represents the state of the table at some point in time. Iceberg supports reader isolation and time travel queries based on snapshots.
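
For example, with a query engine like Trino (which we'll set up later in this tutorial), a time travel query reads the table as of an older snapshot. Here's a minimal sketch, assuming the iceberg.demo.customer table from the hands-on section and a hypothetical snapshot ID:

-- Read the table as it was at a past point in time
SELECT * FROM iceberg.demo.customer
FOR TIMESTAMP AS OF TIMESTAMP '2024-09-14 10:00:00 UTC';

-- Read the table as of a specific snapshot ID (hypothetical value)
SELECT * FROM iceberg.demo.customer
FOR VERSION AS OF 4348240736349342044;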

For snapshot lifecycle management, Iceberg also supports branches and tags, which are named references to snapshots:

[Diagram: snapshots with branches and tags in Apache Iceberg]

Here, we tagged the important snapshots as “end-of-week,” “end-of-month,” and “end-of-year” to retain them for auditing purposes. Their lifecycle management is controlled by branch- and tag-level retention policies.

Branches and tags can have multiple use cases, like retaining important historical snapshots for auditing.

The schema tracked for a table is valid across all branches. However, querying a tag uses the snapshot’s schema.

4.2. Partitioning

Iceberg partitions the data by grouping similar rows when writing. For example, it can partition log events by date and group them into files with the same event date. This way, it can skip files for other dates that don’t have useful data and make queries faster.

Interestingly, Iceberg supports hidden partitioning. That means it handles the tedious and error-prone task of producing partition values for rows in a table. Users don’t need to know how the table is partitioned, and the partition layouts can evolve as needed.

This is a fundamental difference from partitioning supported by earlier table formats like Hive. With Hive, we must provide the partition values. This ties our working queries to the table’s partitioning scheme, so it can’t change without breaking queries.
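
To illustrate, Trino's Iceberg connector lets us declare a partitioning transform when creating a table, and Iceberg then derives the partition values on its own. Here's a minimal sketch with a hypothetical logs table:

-- Iceberg computes day(event_time) for every row behind the scenes;
-- queries filtering on event_time can skip files for other days
CREATE TABLE iceberg.demo.logs (
    event_time TIMESTAMP(6) WITH TIME ZONE,
    level VARCHAR,
    message VARCHAR
)
WITH (partitioning = ARRAY['day(event_time)']);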

4.3. Evolution

Iceberg supports table evolution seamlessly and refers to it as “in-place table evolution.” For instance, we can change the table schema, even in a nested structure. Further, the partition layout can also change in response to data volume changes.

To support this, Iceberg doesn't require rewriting table data or migrating to a new table. Behind the scenes, Iceberg performs schema evolution through metadata changes alone, so no data files get rewritten to perform the update.

We can also update the partitioning of an existing Iceberg table. Old data written with the earlier partition spec remains unchanged, while new data is written using the new partition spec. Metadata for each partition version is kept separately.
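
As an example, both kinds of evolution can be expressed as metadata-only statements in Trino. Here's a hedged sketch, reusing the hypothetical logs table from the previous section:

-- Schema evolution: add a column without rewriting any data files
ALTER TABLE iceberg.demo.logs ADD COLUMN source VARCHAR;

-- Partition evolution: new data is written with the new spec,
-- existing files stay under the old one
ALTER TABLE iceberg.demo.logs SET PROPERTIES partitioning = ARRAY['hour(event_time)'];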

5. Hands-on With Apache Iceberg

Apache Iceberg has been designed as an open community standard. It’s a popular choice in modern data architectures and is interoperable with many data tools.

In this section, we’ll see Apache Iceberg in action by deploying an Iceberg REST catalog over Minio storage with Trino as the query engine.

5.1. Installation

We'll use Docker images to deploy and connect Minio, the Iceberg REST catalog, and Trino. It's preferable to have a solution like Docker Desktop or Podman to complete these installations.

Let’s begin by creating a network within Docker:

docker network create data-network

The commands in this tutorial are meant for a Windows machine. Changes might be required for other operating systems.

Let's now deploy Minio with persistent storage, mounting the host directory "data" as a volume:

docker run --name minio --net data-network -p 9000:9000 -p 9001:9001 \
  --volume .\data:/data quay.io/minio/minio:RELEASE.2024-09-13T20-26-02Z.fips \
  server /data --console-address ":9001"

As the next step, we'll deploy the Iceberg REST catalog. This is a Tabular-contributed image with a thin server that exposes a server-side Iceberg REST catalog implementation backed by an existing catalog implementation:

docker run --name iceberg-rest --net data-network -p 8181:8181 \
  --env-file ./env.list \
  tabulario/iceberg-rest:1.6.0

Here, we provide the environment variables as a file containing all the configuration necessary for the Iceberg REST catalog to work with Minio:

CATALOG_WAREHOUSE=s3://warehouse/
CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
CATALOG_S3_ENDPOINT=http://minio:9000
CATALOG_S3_PATH-STYLE-ACCESS=true
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_REGION=us-east-1

Now, we’ll deploy Trino to work with the Iceberg REST catalog. We can configure Trino to use the REST catalog and Minio that we deployed earlier by providing a properties file as a volume mount:

docker run --name trino --net data-network -p 8080:8080 \
  --volume .\catalog:/etc/trino/catalog \
  --env-file ./env.list \
  trinodb/trino:449

The properties file contains the details of the REST catalog and Minio:

connector.name=iceberg 
iceberg.catalog.type=rest 
iceberg.rest-catalog.uri=http://iceberg-rest:8181/
iceberg.rest-catalog.warehouse=s3://warehouse/
iceberg.file-format=PARQUET
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true

As before, we also feed the environment variables as a file with the access credentials for Minio:

AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_REGION=us-east-1

The property hive.s3.path-style-access is required for Minio and isn’t necessary if we use AWS S3.

5.2. Data Operations

We can use Trino to perform different operations on the REST catalog. Trino comes with a built-in CLI to make this easier for us. Let’s first get access to the CLI from within the Docker container:

docker exec -it trino trino

This should provide us with a shell-like prompt to submit our SQL queries. As we saw earlier, an Iceberg client must first access the catalog. Let's see if we have any default catalogs available to us:

trino> SHOW CATALOGS;
 Catalog
---------
 iceberg
 system
(2 rows)

We’ll use iceberg. Let’s begin by creating a schema in Trino (which translates to a namespace in Iceberg):

trino> CREATE SCHEMA iceberg.demo;
CREATE SCHEMA

Now, we can create a table inside this schema:

trino> CREATE TABLE iceberg.demo.customer (
    -> id INT,
    -> first_name VARCHAR,
    -> last_name VARCHAR,
    -> age INT);
CREATE TABLE

Let’s insert a few rows:

trino> INSERT INTO iceberg.demo.customer (id, first_name, last_name, age) VALUES
    -> (1, 'John', 'Doe', 24),
    -> (2, 'Jane', 'Brown', 28),
    -> (3, 'Alice', 'Johnson', 32),
    -> (4, 'Bob', 'Williams', 26),
    -> (5, 'Charlie', 'Smith', 35);
INSERT: 5 rows

We can query the table to fetch the inserted data:

trino> SELECT * FROM iceberg.demo.customer;
 id | first_name | last_name | age
----+------------+-----------+-----
  1 | John       | Doe       |  24
  2 | Jane       | Brown     |  28
  3 | Alice      | Johnson   |  32
  4 | Bob        | Williams  |  26
  5 | Charlie    | Smith     |  35
(5 rows)

As we can see, we can use the familiar SQL syntax to work with a highly scalable and open table format for massive volumes of data.
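
Since Iceberg tables are transactional, we can also modify rows in place. Each statement is committed atomically as a new snapshot; the sketch below assumes a Trino version that supports row-level writes on Iceberg tables:

-- Row-level update: committed as a new snapshot, existing files aren't modified in place
UPDATE iceberg.demo.customer SET age = 25 WHERE id = 1;

-- Row-level delete: also committed atomically as a new snapshot
DELETE FROM iceberg.demo.customer WHERE id = 5;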

5.3. A Peek Into the Files

Let’s see what type of files are generated in our storage.

Minio provides a console that we can access at http://localhost:9001. We find two directories under warehouse/demo:

  • data
  • metadata

Let’s first look into the metadata directory:

Iceberg Minio Metadata

It contains the metadata files (*.metadata.json), manifest lists (snap-*.avro), manifest files (*.avro), and statistics files (*.stats). The .stats file contains statistics about the table's data that are used to improve query performance.

Now, let’s see what’s there in the data directory:

Iceberg Minio data directory

It has a data file in the Parquet format that contains the actual data we inserted through our queries.
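
We can cross-check these files from Trino as well: the Iceberg connector exposes a files metadata table that lists the data files tracked by the current snapshot. A small sketch:

-- List the data files tracked by the table's current snapshot
SELECT file_path, file_format, record_count
FROM iceberg.demo."customer$files";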

6. Conclusion

Apache Iceberg has become a popular choice for implementing data lakehouses today. It offers features like snapshots, hidden partitioning, and in-place table evolution.

Together with the REST catalog specification, it's fast becoming the de facto standard for open table formats.