Storage
Hydra uses both on-disk Postgres storage and Amazon EFS. On-disk storage is used for Postgres’ rowstore “heap” tables, and EFS is used for the analytics schema’s columnstore.
Bottomless storage
Events, time series, user sessions, clicks, logs, IoT sensor readings, and similar sources produce a lot of data over time. To guarantee high-performance analytics at every size, we concluded that on-disk Postgres storage just won’t cut it. On-disk storage works well for Postgres’ rowstore, known as “heap” tables, but it’s a poor choice for scaling event data. Traditionally, large projects dump sizable time-series and event data into an object store such as AWS S3. However, analytics on S3 can run slowly because reads go over the network (high latency). Additionally, siloing event and time-series data in AWS S3 can make it difficult to combine with application and user data in Postgres. To offer bottomless storage for analytics without impacting Postgres’ transactional heap tables, it was necessary to separate compute from storage.
The benefit of separating storage and compute is the ability to scale each resource independently. When compute needs peak, you provision only more compute. When storage needs grow, you add only more storage.
To accomplish this, the analytics schema sits on a FUSE-based filesystem that keeps track of metadata indicating which data blocks are live and where to find them. The storage works by aggregating writes into a single file: when a block is overwritten, the old copy is marked stale and the metadata is updated to point to the new location of the data. When a read is executed, the metadata points the query to the block’s location and the block is read directly. Because full pages are 256KB in size and are always flushed, this performs well. Large runs of consecutive data work especially well, because they can be tracked as ranges rather than as individual blocks.
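The write and read paths described above can be illustrated with a small in-memory model. This is a minimal sketch, not Hydra’s implementation: the `LogStore` class, its field names, and the `ranges` helper are hypothetical, and a `bytearray` stands in for the single aggregated file. Only the 256KB page size and the append, mark-stale, and lookup behavior come from the description.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 256 * 1024  # 256KB pages, as described above


@dataclass
class LogStore:
    """Toy append-only store: one data 'file' plus a block-location map (hypothetical)."""
    log: bytearray = field(default_factory=bytearray)  # stands in for the single file
    live: dict = field(default_factory=dict)           # block id -> offset of live copy
    stale: list = field(default_factory=list)          # offsets of overwritten copies

    def write(self, block_id: int, page: bytes) -> None:
        if len(page) != PAGE_SIZE:
            raise ValueError("pages are always written whole")
        if block_id in self.live:
            # Overwrite: the old copy stays in the log but is marked stale.
            self.stale.append(self.live[block_id])
        self.live[block_id] = len(self.log)  # metadata points to the new location
        self.log += page                     # data is only ever appended

    def read(self, block_id: int) -> bytes:
        offset = self.live[block_id]                       # metadata lookup
        return bytes(self.log[offset:offset + PAGE_SIZE])  # direct read at that offset

    def ranges(self) -> list:
        """Collapse consecutive, contiguously stored blocks into
        (first_block, offset, block_count) tuples."""
        out = []
        for block_id in sorted(self.live):
            offset = self.live[block_id]
            if out:
                first, start, count = out[-1]
                if block_id == first + count and offset == start + count * PAGE_SIZE:
                    out[-1] = (first, start, count + 1)  # extend the current range
                    continue
            out.append((block_id, offset, 1))
        return out
```

Writing four consecutive blocks yields a single range entry; overwriting one of them in the middle splits that range, which is why large consecutive writes are the cheapest case to track.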
On-disk storage (rowstore)
Standard Postgres heap tables are stored on disk, up to 500GB. To learn about continuous backups and point-in-time recovery, see our backups documentation.
New capabilities
Past snapshot layer files in the analytics schema are immutable, which makes several new capabilities possible. The most important are:
- Zero-copy snapshots and forks
- Automatic cacheability
Zero-copy snapshots enable data sharing with additional teams in your organization. Hydra’s serverless processing guarantees that these different users can access the analytics schema concurrently without sharing compute resources.
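To see why immutable layer files make snapshots and forks cheap, consider the following sketch. It assumes a simple layered model in which each commit adds a read-only layer and reads check the newest layer first. The `LayeredTable` class and `read` function are hypothetical illustrations, not Hydra’s API, and the source does not describe the layer format.

```python
from types import MappingProxyType


class LayeredTable:
    """Toy table built from immutable layers (hypothetical model)."""

    def __init__(self, base=()):
        self._layers = list(base)  # references to existing layers, oldest first

    def commit(self, changes):
        # Each commit becomes a new read-only layer; older layers are never modified.
        self._layers.append(MappingProxyType(dict(changes)))

    def snapshot(self):
        # Zero-copy: only layer references are copied, never the layer contents.
        return tuple(self._layers)

    def fork(self):
        # A fork starts from the same layers and then diverges with its own commits.
        return LayeredTable(self.snapshot())


def read(layers, key):
    """Return the newest value for key, searching layers from newest to oldest."""
    for layer in reversed(layers):
        if key in layer:
            return layer[key]
    raise KeyError(key)
```

Because a snapshot holds only references to layers that can no longer change, several readers can use it at the same time without coordinating with writers, and a fork can add its own layers without affecting the table it was created from.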