Read & write to open data formats (CSV, Parquet, Iceberg)
Modern data lakehouses benefit from open data formats in storage, such as Parquet, CSV, and Iceberg. Columnar formats like Parquet offer impressive data compression, which can lower the cost of storing large data volumes by 5-15X.
An open data format is a file format that is publicly documented and free from legal restrictions on its use. It allows data to be easily shared, accessed, analyzed, and modified by anyone. Open data formats are important because they ensure interoperability between different software applications and systems.
This opens up many possibilities for performing the following operations directly in Hydra Postgres:
- Query existing data from a Data Lake
- Back up specific PostgreSQL tables to an object store
- Import data from the Data Lake to support operational applications
Feature Overview
- Query data stored in Parquet, CSV, and Iceberg format using the DuckDB functions `read_parquet`, `read_csv`, and `iceberg_scan`
  - For example, install the DuckDB Iceberg extension with `SELECT duckdb.install_extension('iceberg')` and read Iceberg tables with `iceberg_scan`.
- Read and write support for object storage (AWS S3, Cloudflare R2, or Google GCS):
  - Read Parquet and CSV files:
    - `SELECT n FROM read_parquet('s3://bucket/file.parquet') AS (n int)`
    - `SELECT n FROM read_csv('s3://bucket/file.csv') AS (n int)`
    - You can pass globs and arrays to these functions, just like in DuckDB
  - Write a query, or an entire table, to Parquet in object storage:
    - `COPY (SELECT foo, bar FROM baz) TO 's3://...'`
    - `COPY table TO 's3://...'`
  - Read and write to Parquet format in a single query
- Query and `JOIN` data in object storage with Postgres tables, views, and materialized views.
- Toggle DuckDB execution on/off with a setting: `SET duckdb.force_execution = true|false`
- Cache remote objects locally for faster execution using `SELECT duckdb.cache('path', 'type');` (see the snippet below), where:
  - `path` is an HTTPFS/S3/GCS/R2 remote object
  - `type` specifies the remote object type: `parquet` or `csv`
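For instance, a minimal illustration of the last two items (the bucket and file names are placeholders, not values from this guide):

```sql
-- Route query execution through DuckDB for this session
SET duckdb.force_execution = true;

-- Cache a remote Parquet file locally so repeated reads skip the network round trip
SELECT duckdb.cache('s3://my-bucket/file.parquet', 'parquet');
```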
Setup
Hydra supports connecting to your data lake in AWS S3, Cloudflare R2, or Google GCS.
- Add a credential to enable DuckDB's httpfs support.
- Copy data directly from Postgres to your bucket - no ETL pipeline!
- Perform analytics on your data (see the sketch below).
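A minimal sketch of those three steps, assuming an S3 bucket named `my-bucket` and the `duckdb.secrets` table described in pg_duckdb's documentation (newer releases may expose a different credentials interface, and every name and key below is a placeholder):

```sql
-- 1. Add an S3 credential so DuckDB's httpfs support can reach the bucket
INSERT INTO duckdb.secrets (type, key_id, secret, region)
VALUES ('S3', 'access_key_id', 'secret_access_key', 'us-east-1');

-- 2. Copy a Postgres table straight to the bucket as Parquet (no ETL pipeline)
COPY my_table TO 's3://my-bucket/my_table.parquet';

-- 3. Run analytics over the file you just wrote
SELECT count(*) FROM read_parquet('s3://my-bucket/my_table.parquet') AS (id int);
```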
Examples
Sample Data
Reading a Parquet file
The following query uses `pg_duckdb` to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.
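The original query and dataset are not reproduced here; the sketch below shows the shape such a query takes. The S3 path and column list are placeholders, and the column names in the `AS` clause must match those in the Parquet file:

```sql
-- Placeholder path and schema: substitute the Parquet file and columns you actually have
SELECT "Title", max("Days In Top 10") AS max_days_in_top_10
FROM read_parquet('s3://my-bucket/tv_top_10.parquet')
     AS ("Title" text, "Days In Top 10" int, "Type" text)
WHERE "Type" = 'TV Show'
GROUP BY "Title"
ORDER BY max_days_in_top_10 DESC
LIMIT 10;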
Reading an Iceberg table
In order to query data in Iceberg, you first need to install the DuckDB Iceberg extension. In `pg_duckdb`, DuckDB extensions are installed using the `duckdb.install_extension(<extension name>)` function.
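For example, with a placeholder table path and column list (the column definition list mirrors the `read_parquet` examples above):

```sql
-- One-time install of the DuckDB Iceberg extension
SELECT duckdb.install_extension('iceberg');

-- Query an Iceberg table; spell out the columns you want, as with read_parquet
SELECT count(*)
FROM iceberg_scan('s3://my-bucket/warehouse/events') AS (event_id int);
```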
Writing back to your Data Lake
Access to Data Lakes is not read-only in `pg_duckdb`; you can also write back using the `COPY` command. Note that you can mix in native PostgreSQL data, so you can use this to export your PostgreSQL tables to external Data Lake storage.
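For instance (the table, columns, and bucket below are placeholders):

```sql
-- Export the result of a query over a native Postgres table to Parquet in S3
COPY (SELECT order_id, amount FROM orders WHERE created_at >= '2024-01-01')
TO 's3://my-bucket/exports/orders_2024.parquet';

-- Or export an entire table in one statement
COPY orders TO 's3://my-bucket/exports/orders.parquet';
```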