Hydra supports AWS S3, Cloudflare R2, and Google GCS buckets.
Add a credential to enable DuckDB’s httpfs support.
-- Session Token is optional
INSERT INTO duckdb.secrets (type, key_id, secret, session_token, region)
VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');
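Credentials for Cloudflare R2 or Google GCS go into the same table. As a hedged sketch (it assumes the secrets table also accepts a 'GCS' type with HMAC-style key pairs, which the example above does not show), a GCS credential might look like:

-- Hypothetical sketch: assumes a 'GCS' secret type with HMAC key_id/secret is accepted
INSERT INTO duckdb.secrets (type, key_id, secret)
VALUES ('GCS', 'gcs_hmac_key_id', 'gcs_hmac_secret');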
Copy data directly from Postgres to your bucket - no ETL pipeline!
COPY (
  SELECT user_id, item_id, price, purchased_at
  FROM purchases
) TO 's3://your-bucket/purchases.parquet';
Perform analytics on your data. Example:
SELECT SUM(price) AS total, item_id
FROM read_parquet('s3://your-bucket/purchases.parquet') AS (price float, item_id int)
GROUP BY item_id
ORDER BY total DESC
LIMIT 100;
The following query uses pg_duckdb to read Parquet files stored in S3 and find the top TV shows in the US during 2020-2022.
SELECT Title, max("Days In Top 10")::int AS MaxDaysInTop10
FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet')
  AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title
ORDER BY MaxDaysInTop10 DESC
LIMIT 5;
             Title              | MaxDaysInTop10
--------------------------------+----------------
 Cocomelon                      |             99
 Tiger King                     |             44
 Jurassic World Camp Cretaceous |             31
 Tiger King: Murder, Mayhem …   |              9
 Ozark                          |              9
(5 rows)
To query data in Iceberg, you first need to install the DuckDB Iceberg extension. In pg_duckdb, DuckDB extensions are installed with the duckdb.install_extension(extension_name) function.
-- Install the iceberg extension
SELECT duckdb.install_extension('iceberg');

-- Total quantity of items ordered for each `l_shipmode`
SELECT l_shipmode, SUM(l_quantity) AS total_quantity
FROM iceberg_scan('s3://us-prd-motherduck-open-datasets/iceberg/lineitem_iceberg', allow_moved_paths := true)
  AS l(l_shipmode varchar, l_quantity int)
GROUP BY l_shipmode
ORDER BY total_quantity DESC;
 l_shipmode | total_quantity
------------+----------------
 TRUCK      |         219078
 MAIL       |         216395
 FOB        |         214219
 REG AIR    |         214010
 SHIP       |         213141
 RAIL       |         212903
 AIR        |         211154
(7 rows)
Data Lake access in pg_duckdb is not read-only: you can also write back using the COPY command. Note that you can mix and match native PostgreSQL data, so you can use this to export your PostgreSQL tables to external Data Lake storage.
COPY (
  SELECT Title, max("Days In Top 10")::int AS MaxDaysInTop10
  FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
  WHERE Type = 'TV Show'
  GROUP BY Title
  ORDER BY MaxDaysInTop10 DESC
  LIMIT 5
) TO 's3://my-bucket/results.parquet';
This opens up many possibilities for performing the following operations directly in PostgreSQL:
Query existing data from a Data Lake
Back up specific PostgreSQL tables to an object store
Import data from the Data Lake to support operational applications (a sketch of this follows below)
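For the import direction, a hedged sketch could load Data Lake data into a native PostgreSQL table; the top_tv_shows table is hypothetical, and the s3://my-bucket/results.parquet path reuses the export example above:

-- Hypothetical sketch: load Parquet data from the Data Lake into a native PostgreSQL table
CREATE TABLE top_tv_shows (title text, max_days_in_top_10 int);

INSERT INTO top_tv_shows
SELECT Title, MaxDaysInTop10
FROM read_parquet('s3://my-bucket/results.parquet') AS (Title varchar, MaxDaysInTop10 int);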