The following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.
SELECT
    Title,
    max("Days In Top 10")::int AS MaxDaysInTop10
FROM read_parquet('s3://duckdb-md-dataset-121/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title
ORDER BY MaxDaysInTop10 DESC
LIMIT 5;
             Title              | MaxDaysInTop10
--------------------------------+----------------
 Cocomelon                      |             99
 Tiger King                     |             44
 Jurassic World Camp Cretaceous |             31
 Tiger King: Murder, Mayhem …   |              9
 Ozark                          |              9
(5 rows)
To query data in Iceberg tables, you first need to install the DuckDB Iceberg extension. In pg_duckdb, DuckDB extensions are installed with the duckdb.install_extension(extension_name) function.
-- Install the iceberg extension
SELECT duckdb.install_extension('iceberg');

-- Total quantity of items ordered for each `l_shipmode`
SELECT
    l_shipmode,
    SUM(l_quantity) AS total_quantity
FROM iceberg_scan('s3://us-prd-motherduck-open-datasets/iceberg/lineitem_iceberg',
                  allow_moved_paths := true)
    AS l(l_shipmode varchar, l_quantity int)
GROUP BY l_shipmode
ORDER BY total_quantity DESC;
 l_shipmode | total_quantity
------------+----------------
 TRUCK      |         219078
 MAIL       |         216395
 FOB        |         214219
 REG AIR    |         214010
 SHIP       |         213141
 RAIL       |         212903
 AIR        |         211154
(7 rows)
Access to Data Lakes in pg_duckdb is not read-only: you can also write back using the COPY command. Note that you can mix and match native PostgreSQL data, so you can use this to export your PostgreSQL tables to external Data Lake storage.
COPY (
    SELECT
        Title,
        max("Days In Top 10")::int AS MaxDaysInTop10
    FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
        AS ("Days In Top 10" varchar, Title varchar, Type varchar)
    WHERE Type = 'TV Show'
    GROUP BY Title
    ORDER BY MaxDaysInTop10 DESC
    LIMIT 5
) TO 's3://my-bucket/results.parquet';
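The same COPY ... TO syntax also works against a native PostgreSQL table, not just a query over external files. As a sketch (the `orders` table name and bucket path here are illustrative, not from the example above), backing up a local table to Parquet in S3 might look like:

```sql
-- Hypothetical: export a native PostgreSQL table to Parquet in S3.
-- Assumes a local table `orders` and write access to `my-bucket`.
COPY orders TO 's3://my-bucket/backups/orders.parquet';
```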
This opens up many possibilities for performing the following operations directly in PostgreSQL:
Query existing data from a Data Lake
Back up specific PostgreSQL tables to an object store
Import data from the Data Lake to support operational applications
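For the import case, the result of a Data Lake scan can be materialized into a regular PostgreSQL table with standard CREATE TABLE ... AS. A minimal sketch, reusing the Netflix dataset from the queries above (the target table name `netflix_top_shows` is illustrative):

```sql
-- Materialize a Parquet query result as a native PostgreSQL table
CREATE TABLE netflix_top_shows AS
SELECT
    Title,
    max("Days In Top 10")::int AS MaxDaysInTop10
FROM read_parquet('s3://us-prd-motherduck-open-datasets/netflix/netflix_daily_top_10.parquet')
    AS ("Days In Top 10" varchar, Title varchar, Type varchar)
WHERE Type = 'TV Show'
GROUP BY Title;
```

Once imported, the data is ordinary PostgreSQL: it can be indexed, joined with other tables, and updated by operational queries.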