Connect Your Data Lake

Hydra supports AWS S3, Cloudflare R2, or Google GCS buckets.

  1. Add a credential to enable DuckDB’s httpfs support.
   -- Session Token is optional
   INSERT INTO duckdb.secrets
   (type, key_id, secret, session_token, region)
   VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');
  1. Copy data directly from Postgres to your bucket - no ETL pipeline!

    COPY (SELECT user_id, item_id, price, purchased_at FROM purchases)
    TO 's3://your-bucket/purchases.parquet';
    
  2. Perform analytics on your data. Example:

    SELECT SUM(price) AS total, item_id
    FROM read_parquet('s3://your-bucket/purchases.parquet')
    AS (price float, item_id int)
    GROUP BY item_id
    ORDER BY total DESC
    LIMIT 100;
    

Reading a Parquet file

The following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.

SELECT Title, max("Days In Top 10")::int as MaxDaysInTop10
FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title
ORDER BY MaxDaysInTop10 DESC
LIMIT 5;
             Title              | MaxDaysInTop10
--------------------------------+----------------
 Cocomelon                      |             99
 Tiger King                     |             44
 Jurassic World Camp Cretaceous |             31
 Tiger King: Murder, Mayhem …   |              9
 Ozark                          |              9
(5 rows)

Reading an Iceberg table

In order to query against data in Iceberg, you first need to install the DuckDB Iceberg extension. In pg_duckdb, installing duckdb extensions is done using the duckdb.install_extension(extension_name) function.

-- Install the iceberg extension
SELECT duckdb.install_extension('iceberg');

-- Total quantity of items ordered for each `l_shipmode`
SELECT l_shipmode, SUM(l_quantity) AS total_quantity
FROM iceberg_scan('s3://us-prd-motherduck-open-datasets/iceberg/lineitem_iceberg', allow_moved_paths := true)
    AS l(l_shipmode varchar, l_quantity int)
GROUP BY l_shipmode
ORDER BY total_quantity DESC;
l_shipmode | total_quantity
------------+----------------
 TRUCK      |         219078
 MAIL       |         216395
 FOB        |         214219
 REG AIR    |         214010
 SHIP       |         213141
 RAIL       |         212903
 AIR        |         211154
(7 rows)

Writing back to your Data Lake

Access to Data Lakes is not just read-only in pg_duckdb, you can also write back by using the COPY command. Note that you can mix and match native PostgreSQL data, so you can use this to export from your PostgreSQL tables to external Data Lake storage.


COPY (
  SELECT Title, max("Days In Top 10")::int AS MaxDaysInTop10
  FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
  WHERE Type = 'TV Show'
  GROUP BY Title
  ORDER BY MaxDaysInTop10 DESC
  LIMIT 5
) TO 's3://my-bucket/results.parquet';

This opens up many possibilities for performing the following operations directly in PostgreSQL:

  • Query existing data from a Data Lake

  • Back up specific PostgreSQL tables to an object store

  • Import data from the Data Lake to support operational applications.