Open data formats
What are open data formats?
Open data formats are file formats that are publicly available and free from restrictions, allowing anyone to use, modify, and share data without needing proprietary software or licensing. They are typically designed to ensure interoperability, transparency, and accessibility across different systems and platforms.
Columnar open data formats like Parquet also offer strong compression, which can lower the cost of storing large data volumes by 5-15X. Supported open data formats on Hydra:
- CSV (comma-separated values)
- Parquet
- Delta
- Iceberg
Continue to the next section to learn how to use open data formats to query existing data from a data lake, back up specific PostgreSQL tables to an object store, import data from the data lake to support operational applications, and more.
Using open data formats
- Query data stored in Parquet, CSV, and Iceberg format using the DuckDB functions `read_parquet`, `read_csv`, and `iceberg_scan`.
  - For example, enable the DuckDB Iceberg extension with `SELECT duckdb.install_extension('iceberg')` and read Iceberg files with `iceberg_scan`.
- Read and write support for object storage (AWS S3, Cloudflare R2, or Google GCS):
  - Read Parquet and CSV files:
    - `SELECT n FROM read_parquet('s3://bucket/file.parquet') AS (n int)`
    - `SELECT n FROM read_csv('s3://bucket/file.csv') AS (n int)`
    - You can pass globs and arrays to these functions, just like in DuckDB (see the sketch after this list).
  - Write a query, or an entire table, to Parquet in object storage:
    - `COPY (SELECT foo, bar FROM baz) TO 's3://...'`
    - `COPY table TO 's3://...'`
  - Read and write to Parquet format in a single query.
- Query and `JOIN` data in object storage with Postgres tables, views, and materialized views.
- Toggle DuckDB execution on or off with a setting: `SET duckdb.force_execution = true|false`
- Cache remote objects locally for faster execution using `SELECT duckdb.cache('path', 'type');`, where:
  - `path` is an HTTPFS/S3/GCS/R2 remote object
  - `type` specifies the remote object type: `parquet` or `csv`
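The sketch below ties several of these pieces together. The bucket path, table, and column names (`s3://analytics-bucket/...`, `users`, and so on) are hypothetical placeholders, not part of Hydra's API:

```sql
-- Read every Parquet file matching a glob, just as in DuckDB
-- (bucket path and column layout are hypothetical).
SELECT user_id, event_type
FROM read_parquet('s3://analytics-bucket/events/*.parquet') AS (user_id int, event_type text);

-- JOIN object-storage data against a regular Postgres table.
SELECT u.name, count(*) AS events
FROM read_parquet('s3://analytics-bucket/events/*.parquet') AS r(user_id int, event_type text)
JOIN users u ON u.id = r.user_id
GROUP BY u.name;

-- Toggle DuckDB execution for subsequent queries.
SET duckdb.force_execution = true;

-- Cache a remote object locally for faster repeated reads.
SELECT duckdb.cache('s3://analytics-bucket/events/2022.parquet', 'parquet');
```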
Setup
Hydra supports connecting your data lake in AWS S3, Cloudflare R2, or Google GCS.
- Add a credential to enable DuckDB's httpfs support (see the sketch after this list).
- Copy data directly from Postgres to your bucket - no ETL pipeline!
- Perform analytics on your data.
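As a rough sketch of the first step: earlier pg_duckdb releases registered credentials through a `duckdb.secrets` table, though the exact column names can vary by version, and all values below are placeholders:

```sql
-- Register S3 credentials so DuckDB's httpfs can reach your bucket.
-- Assumes the duckdb.secrets table described in the pg_duckdb README;
-- every value here is a placeholder.
INSERT INTO duckdb.secrets (type, key_id, secret, region)
VALUES ('S3', 'your_access_key_id', 'your_secret_access_key', 'us-east-1');
```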
Examples
Sample Data
Reading a Parquet file
The following query uses pg_duckdb to query Parquet files stored in S3 to find the top TV shows in the US during 2020-2022.
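A sketch of such a query, assuming a hypothetical dataset with one row per show per country per week and a placeholder S3 path:

```sql
-- Count how many weeks each show spent in the US top 10, 2020-2022.
-- The bucket path and column layout are assumptions, not the actual dataset.
SELECT show_title, count(*) AS weeks_in_top_10
FROM read_parquet('s3://your-bucket/netflix_top10.parquet')
     AS (week date, country_name text, show_title text)
WHERE country_name = 'United States'
  AND week BETWEEN '2020-01-01' AND '2022-12-31'
GROUP BY show_title
ORDER BY weeks_in_top_10 DESC
LIMIT 10;
```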
Reading an Iceberg table
To query data in Iceberg, you first need to install the DuckDB Iceberg extension. In pg_duckdb, DuckDB extensions are installed using the `duckdb.install_extension(<extension name>)` function.
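For example (the install call comes from the docs above; the table path and column definitions in the scan are placeholders):

```sql
-- One-time setup: install the DuckDB Iceberg extension.
SELECT duckdb.install_extension('iceberg');

-- Scan an Iceberg table in object storage; the path and columns
-- below are hypothetical.
SELECT l_orderkey, l_quantity
FROM iceberg_scan('s3://your-bucket/lineitem_iceberg')
     AS (l_orderkey bigint, l_quantity numeric);
```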
Writing back to your Data Lake
Access to data lakes is not read-only in pg_duckdb: you can also write back using the `COPY` command. Note that you can mix in native PostgreSQL data, so you can use this to export your PostgreSQL tables to external data lake storage.
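A minimal sketch, with placeholder table and bucket names:

```sql
-- Export an entire Postgres table to Parquet in object storage.
COPY my_table TO 's3://your-bucket/my_table.parquet';

-- Export a query result that mixes native Postgres data with
-- data read back from the lake.
COPY (
  SELECT u.name, r.event_type
  FROM read_parquet('s3://your-bucket/events.parquet') AS r(user_id int, event_type text)
  JOIN users u ON u.id = r.user_id
) TO 's3://your-bucket/joined.parquet';
```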