Architecture
Hydra’s all-in-one database architecture is commonly known as a data lakehouse.
Our Data Lakehouse Architecture
A data lakehouse is a relational, column-oriented SQL database management system (DBMS) that decouples data storage from compute to unlock efficient analytical processing, scalability, and cost optimization.
A lakehouse architecture allows data storage and data processing resources to be managed independently — storage can be scaled up to accommodate large datasets without needing to increase compute resources. Likewise, a lakehouse’s compute resources can be scaled up temporarily without affecting storage costs.
A data lakehouse can achieve sub-second response times on complex analytical queries with:
- Small to large datasets, up to billions of rows
- Filtering, complex JOINs, GROUP BYs, aggregate queries, window functions, CTEs (Common Table Expressions), UNIONs, INTERSECTs, full-text search, and more (see the example query after this list)
- Time-series, logging, event, and trace data
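As a sketch of what that looks like in practice, here is a representative analytical query combining a CTE, a JOIN, an aggregate, and a window function. The `events` and `users` tables and their columns are hypothetical.

```sql
-- Hypothetical schema: an "events" fact table and a "users" dimension table.
-- Rank subscription plans by daily event volume over the last 30 days.
WITH daily_counts AS (
    SELECT
        user_id,
        date_trunc('day', created_at) AS day,
        count(*) AS events
    FROM events
    WHERE created_at >= now() - interval '30 days'
    GROUP BY user_id, date_trunc('day', created_at)
)
SELECT
    u.plan,
    d.day,
    sum(d.events) AS total_events,
    rank() OVER (PARTITION BY d.day ORDER BY sum(d.events) DESC) AS plan_rank
FROM daily_counts d
JOIN users u ON u.id = d.user_id
GROUP BY u.plan, d.day
ORDER BY d.day, plan_rank;
```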
Modern lakehouses store data in open formats such as Parquet and Iceberg. These formats are column-oriented and highly compressible, which can lower the cost of storing large data volumes by 5-15X.
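Because the storage format is open, those files can be queried in place. A minimal sketch using pg_duckdb's read_parquet function (introduced later on this page); the bucket path and column names are placeholders, and exact syntax may vary by pg_duckdb version:

```sql
-- Query Parquet files directly in object storage; no import step required.
-- The s3 path and column names are placeholders.
SELECT r['user_id'], r['created_at']
FROM read_parquet('s3://my-bucket/events/*.parquet') r
LIMIT 10;
```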
What’s an all-in-one database?
Hydra is a multi-workload data store with the following core capabilities:
- High-throughput, ACID-compliant online transaction processing (OLTP)
- Vectorized, parallel execution of online analytical processing (OLAP)
It is a unified database management system (DBMS) where transactional and analytical operations can be employed independently or interchangeably. With a multi-workload data store, developers can ship applications that perform efficient, sub-second analytical processing on recently committed transactions without the need for ETL data pipelines.
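As a minimal sketch of that workflow, assuming a hypothetical `orders` table: a transactional write is immediately visible to analytical queries, with no pipeline in between.

```sql
-- Transactional write, as in any Postgres application.
BEGIN;
INSERT INTO orders (customer_id, amount, created_at)
VALUES (42, 99.50, now());
COMMIT;

-- The committed row is immediately available to analytical queries.
SELECT customer_id, sum(amount) AS lifetime_value
FROM orders
GROUP BY customer_id
ORDER BY lifetime_value DESC
LIMIT 10;
```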
Hydra removes the complexity, costs, and bottlenecks associated with running multiple heterogeneous data stores that rely on data movement and data duplication across silos. Hydra has access to a broad set of an organization's data, such as user events, billing data, logging, product usage, and more.
Architecture, in depth
Hydra is built on pg_duckdb, an open-source Postgres extension co-developed with the creators of DuckDB and MotherDuck that processes analytical queries 100-1000X faster than standard Postgres.
pg_duckdb embeds DuckDB's state-of-the-art analytics engine and features into Postgres. It also connects your Postgres database to Hydra's bottomless, globally distributed object storage to unlock petabyte scale without a database migration. Voila, your lakehouse is ready for move-in day!
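Setup is a standard extension install. A minimal sketch, assuming the pg_duckdb binaries are already installed on the host; the `duckdb.force_execution` setting comes from pg_duckdb's documentation:

```sql
-- Enable the extension (assumes pg_duckdb is installed on the server).
CREATE EXTENSION pg_duckdb;

-- Optional: force queries through DuckDB's execution engine for this
-- session instead of letting pg_duckdb decide per query.
SET duckdb.force_execution = true;
```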
Here are the high-level principles:
- Transactions write normally to Postgres tables (heap), shown in green in the diagram.
- DuckDB executes all SELECTs inside Postgres, against both row tables (green) and columnar tables (blue).
- Columnar tables from Hydra Lake are cached to Postgres.
- DuckDB performs JOINs across row tables and columnar tables in any combination (see the sketch after this list).
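As a sketch of that last point, a hot heap table joined against archived columnar data; the table, bucket path, and columns are hypothetical:

```sql
-- Join a hot Postgres heap table ("users") with archived events stored
-- as Parquet in object storage. All names and paths are placeholders.
SELECT u.plan, count(*) AS archived_events
FROM users u
JOIN read_parquet('s3://my-bucket/events_archive/*.parquet') r
  ON r['user_id'] = u.id
GROUP BY u.plan;
```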
In summary, Postgres is best for hot data that is frequently updated, such as the last 30 days of data. Older data can be incrementally moved into columnar format in object storage for lower-cost storage with efficient compression. Depending on the data, columnar compression can range from 5X to 15X (for example, 150GB becomes 15GB at a 10X rate).
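A minimal sketch of such an incremental tiering job, with a pinned cutoff so both statements agree on the same boundary; the table, cutoff date, and bucket path are placeholders:

```sql
-- Archive rows older than the cutoff to Parquet in object storage...
COPY (
    SELECT * FROM events WHERE created_at < '2025-01-01'
) TO 's3://my-bucket/events_archive/2024.parquet';

-- ...then remove them from the hot Postgres table.
DELETE FROM events WHERE created_at < '2025-01-01';
```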
Where possible, we recommend limiting Postgres' on-disk storage to 500GB. Keeping data small in Postgres supports higher throughput and reduces the potential for resource bottlenecks.
For ideal performance, Parquet and Iceberg files should be cached on the server where Postgres runs to eliminate network overhead. When possible, we recommend creating standard Postgres views over cached Parquet or Iceberg files for efficient processing.
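As a sketch of that pattern, a view wrapping archived Parquet files so applications can query them like any other table; names and paths are placeholders, and depending on the pg_duckdb version the view may require DuckDB execution to be enabled:

```sql
-- Wrap archived Parquet files in a view (all names are placeholders).
CREATE VIEW events_archive AS
SELECT r['user_id']    AS user_id,
       r['amount']     AS amount,
       r['created_at'] AS created_at
FROM read_parquet('s3://my-bucket/events_archive/*.parquet') r;

-- Applications query the view like a normal table.
SELECT count(*) FROM events_archive;
```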