What are open data formats?

Open data formats are file formats that are publicly available and free from restrictions, allowing anyone to use, modify, and share data without needing proprietary software or licensing. They are typically designed to ensure interoperability, transparency, and accessibility across different systems and platforms.

Open data formats are column-oriented with impressive data compression, which can lower the cost of storing large data volumes by 5-15X. Supported open data formats on Hydra:

  • CSV (Comma-separated values)

  • Parquet

  • Delta

  • Iceberg

Continue to the next section to learn how to use open data formats to query existing data from a data lake, back up specific PostgreSQL tables to an object store, import data from the data lake to support operational applications, and more.

Using open data formats

  • Querying data stored in Parquet, CSV, and Iceberg format by enabling DuckDB extensions read_parquetread_csv, and iceberg_scan

    • For example, use the DuckDB Iceberg extension by SELECT duckdb.install_extension('iceberg') and read Iceberg files with iceberg_scan.
  • Read and Write support for object storage (AWS S3, Cloudflare R2, or Google GCS):

    • Read parquet and CSV files:

      • SELECT n FROM read_parquet('s3://bucket/file.parquet') AS (n int)

      • SELECT n FROM read_csv('s3://bucket/file.csv') AS (n int)

      • You can pass globs and arrays to these functions, just like in DuckDB

    • Write a query — or an entire table — to parquet in object storage.

      • COPY (SELECT foo, bar FROM baz) TO 's3://...'

      • COPY table TO 's3://...'

      • Read and write to Parquet format in a single query

        COPY (
        SELECT count(*), name
        FROM read_parquet('s3://bucket/file.parquet') AS (name text)
        GROUP BY name
        ORDER BY count DESC
        ) TO 's3://bucket/results.parquet';
        
  • Query and JOIN data in object storage with Postgres tables, views, and materialized views.

  • Toggle DuckDB execution on/off with a setting:

    • SET duckdb.force_execution = true|false
  • Cache remote object locally for faster execution using SELECT duckdb.cache('path', 'type'); where

    • ‘path’ is HTTPFS/S3/GCS/R2 remote object

    • ‘type’ specify remote object type: ‘parquet’ or ‘csv’

Setup

Hydra supports connecting your data lake in AWS S3, Cloudflare R2, or Google GCS.

  1. Add a credential to enable DuckDB’s httpfs support.

    -- Session Token is Optional
    INSERT INTO duckdb.secrets
    (type, key_id, secret, session_token, region)
    VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');
    
  2. Copy data directly from Postgres to your bucket - no ETL pipeline!

    COPY (SELECT user_id, item_id, price, purchased_at FROM purchases)
    TO 's3://your-bucket/purchases.parquet;
    
  3. Perform analytics on your data. Example:

    SELECT SUM(price) AS total, item_id
    FROM read_parquet('s3://your-bucket/purchases.parquet')
    AS (price float, item_id int)
    GROUP BY item_id
    ORDER BY total DESC
    LIMIT 100;
    

Examples

Sample Data

-- Remote Parquet scans
SELECT count(*) FROM read_parquet('https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet') AS (o_orderkey int);
SELECT avg(c_acctbal) FROM read_parquet('https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet') AS (c_acctbal float);
SELECT count(*)::int as aws_service_cnt FROM read_parquet('https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet') AS (service_id int);
SELECT code, city, state FROM read_parquet('https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet') AS (code text, city text, state text);
SELECT cloud_provider, sum(ip_address_cnt)::int as cnt FROM read_parquet('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.parquet') AS (cloud_provider text, ip_address_cnt int) GROUP BY cloud_provider;

-- Remote CSV scan
SELECT * FROM read_csv('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv') AS (cloud_provider text, cidr_block text, ip_address text) limit 30;

Reading a Parquet file

The following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.

SELECT Title, max("Days In Top 10")::int as MaxDaysInTop10
FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title
ORDER BY MaxDaysInTop10 DESC
LIMIT 5;
             Title              | MaxDaysInTop10
--------------------------------+----------------
 Cocomelon                      |             99
 Tiger King                     |             44
 Jurassic World Camp Cretaceous |             31
 Tiger King: Murder, Mayhem …   |              9
 Ozark                          |              9
(5 rows)

Reading an Iceberg table

In order to query against data in Iceberg, you first need to install the DuckDB Iceberg extension. In pg_duckdb, installing duckdb extensions is done using the  duckdb.install_extension(<extension name>) function.

-- Install the iceberg extension
SELECT duckdb.install_extension('iceberg');
-- Total quantity of items ordered for each `l_shipmode`
SELECT l_shipmode, SUM(l_quantity) AS total_quantity
FROM iceberg_scan('s3://us-prd-motherduck-open-datasets/iceberg/lineitem_iceberg', allow_moved_paths := true)
    AS l(l_shipmode varchar, l_quantity int)
GROUP BY l_shipmode
ORDER BY total_quantity DESC;
 l_shipmode | total_quantity
------------+----------------
 TRUCK      |         219078
 MAIL       |         216395
 FOB        |         214219
 REG AIR    |         214010
 SHIP       |         213141
 RAIL       |         212903
 AIR        |         211154
(7 rows)

Writing back to your Data Lake

Access to Data Lakes is not just read-only in pg_duckdb, you can also write back by using the COPY command. Note that you can mix and match native PostgreSQL data, so you can use this to export from your PostgreSQL tables to external Data Lake storage.

COPY (
  SELECT Title, max("Days In Top 10")::int AS MaxDaysInTop10
  FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
  WHERE Type = 'TV Show'
  GROUP BY Title
  ORDER BY MaxDaysInTop10 DESC
  LIMIT 5
) TO 's3://my-bucket/results.parquet';