Hydra is multistore database designed for transactional and analytical processing (OLTP, OLAP). It integrates online transactions with a scalable data lakehouse architecture.

Hydra’s multistore eliminates the friction of moving and transforming data between a rowstore and columnstore with ETL processing. The transactional rowstore and analytical columnstore are managed within Hydra automatically, making it easier to build applications and analytics on realtime data.

What’s a rowstore?

In a rowstore, table rows are stored sequentially. This arrangement enables fast retrieval of rows because the column values for each row are grouped together. A rowstore make it easy to add / modify a record, but may scan unnecessary data during read operations. For example, PostgreSQL is a rowstore.

What’s a columnstore?

In a columnstore, tables are organized by storing all the values of each column in sequence. This format enhances the efficiency of filtering or aggregating columns but complicates the retrieval of individual rows due to the gaps between row data. For example, DuckDB is a columnstore.

What’s a data lakehouse?

A data lakehouse is a relational, column-oriented SQL database management system (DBMS) which decouples data storage from compute to unlock efficient analytical processing, scalability, and cost optimization.

A lakehouse architecture allows data storage and data processing resources to be managed independently — storage can be scaled up to accommodate large datasets without needing to increase compute resources. Likewise, a lakehouse’s compute resources can be scaled up temporarily without affecting storage costs.

A data lakehouse can achieve sub-second response times on complex analytical queries with:

  • Small to large datasets — billions of rows

  • Filtering, complex JOINs, GROUP BYs, aggregate queries, window functions, CTE (Common Table Expressions), UNIONs, INTERSECTs, full text search, and more.

  • timeseries, logging, events, traces data

A modern lakehouse leverages open data formats in storage, such as parquet, iceberg, and delta. These common data formats are column-oriented with high degrees of data compression, which can help lower costs of storing large data volumes by 5-15X.

What’s a multistore database?

Hydra is a multistore, “multi-workload data store” with the following core capabilities:

  • high throughput, ACID-compliant online transactional processing (OLTP)

  • vectorized, parallel execution of online analytical processing (OLAP)

It is a unified database management system (DBMS) where transactional and analytical operations can be employed independently and interchangeably. With a multistore, developers can ship applications that perform efficient sub-second analytical processing on recently committed transactions without the need for ETL data pipelines.

Hydra removes the complexity, costs, and bottlenecks associated with running multiple heterogenous datastores that rely on data movement and transformation between the transactional rowstore and the analytical columnstore. Hydra has access to a broad set of an organizations data, such as user events, billing data, logging, product usage, and more.

Hydra multistore, in depth

Hydra runs pg_duckdb, an open source program co-developed with the creators of DuckDB & MotherDuck which processes analytical queries 100-1000X faster than standard Postgres.

pg_duckdb embeds DuckDB’s state-of-the-art analytics engine and features into Postgres. Also, pg_duckdb connects your Postgres database to Hydra’s bottomless, globally distributed object storage to unlock petabyte scale without a database migration. Voila, your lakehouse is ready.

Here are the high-level principles:

  • Transactions write normally to Postgres tables (heap)

  • DuckDB performs all SELECTs inside Postgres on row (heap) and columnar tables

  • Columnar tables from Hydra Lake are cached to Postgres.

  • DuckDB performs JOINs between row tables, columnar tables, and with each other.

In summary, Postgres is best for hot data that’s being frequently updated. For example, the last 30 days of data. Data beyond 30 days can be incrementally moved into columnar format in object storage for lower cost storage with efficient data compression. Depending on the data, Columnar compression can range from 5X - 15X (for example: 150GB becomes 15GB with a 10X rate).

If it can be avoided, we recommend limiting Postgres’ on-disk storage to 500GB. Keeping data small in Postgres is ideal for supporting higher throughput and reduces the potential for resource bottlenecks.

For ideal performance, parquet / iceberg files should be cached to the server where Postgres is to eliminate network overhead. When possible, we recommend creating standard Postgres VIEWS with cached parquet or iceberg files for efficient processing.