What is Amberstone?

Amberstone is a massively scalable compression, storage and aggregation engine for machine generated data. Machine generated data is loaded into it and stored in tightly compressed, column oriented fact tables. Once loaded, Amberstone can very efficiently output time series aggregate tables which can be loaded into any database or column store for reporting.

Amberstone is a batch aggregation engine that incorporates core data warehouse design principles such as star schemas, column orientation, pivot tables, time series and incremental aggregation.

Scalability Through Efficiency

Amberstone's core design philosophy is to achieve high scalability through efficiency. Due to its high compression rates and high speed aggregation, Amberstone can work with hundreds of billions of records on a single server.

Amberstone can be run on a single node, or clustered to increase storage capacity and load/aggregation throughput.

Core Use Case

Amberstone is designed to be used as part of the data pipeline feeding a time series reporting system. In this scenario, machine generated data flows into the data warehouse throughout the day. When it arrives, the data is loaded into Amberstone for long term storage and aggregation. After the data is loaded, Amberstone outputs time series aggregate tables which are loaded into a database for reporting. The reporting system uses the aggregate tables to generate on-demand time series reports and pivot tables for end users.

Compression

Amberstone's compression strategy is two-pronged. The first part of the strategy is to store data in column oriented fact tables. This design allows compression to be applied to individual columns, which can result in very high compression rates. The second part is to provide built-in functions at load time that replace text data with surrogate integer keys. This creates on-the-fly star schemas from machine generated data, and leads to very compact fact tables.
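
The surrogate key idea can be illustrated with a short Python sketch. This is not Amberstone's API or its on-disk format; the class and function names (SurrogateKeyDictionary, load_rows) and the sample columns are hypothetical, chosen only to show how repeated text values collapse into small integers stored column by column.

```python
from typing import Dict, List


class SurrogateKeyDictionary:
    """Hypothetical helper: maps each distinct text value to an integer key."""

    def __init__(self) -> None:
        self.key_for_value: Dict[str, int] = {}
        self.value_for_key: List[str] = []

    def encode(self, value: str) -> int:
        # Assign the next free integer the first time a value is seen.
        key = self.key_for_value.get(value)
        if key is None:
            key = len(self.value_for_key)
            self.key_for_value[value] = key
            self.value_for_key.append(value)
        return key

    def decode(self, key: int) -> str:
        return self.value_for_key[key]


def load_rows(rows, text_columns):
    """Turn row dicts into per-column lists, replacing text with surrogate keys."""
    dictionaries = {name: SurrogateKeyDictionary() for name in text_columns}
    columns = {}
    for row in rows:
        for name, value in row.items():
            if name in dictionaries:
                value = dictionaries[name].encode(value)
            columns.setdefault(name, []).append(value)
    return columns, dictionaries


# Illustrative data only.
rows = [
    {"ts": 1000, "url": "/home", "bytes": 512},
    {"ts": 1001, "url": "/cart", "bytes": 2048},
    {"ts": 1002, "url": "/home", "bytes": 256},
]
columns, dictionaries = load_rows(rows, text_columns=["url"])
```

In this sketch the fact table holds only integers, and the dictionary plays the role of a dimension table in a star schema. Each column can then be compressed on its own.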

By combining these two approaches Amberstone can often achieve 90% or better compression rates on machine generated data. High compression rates allow for longer data retention on less hardware.

Aggregation

Amberstone creates time series aggregates from the data stored in its fact tables. These aggregates are multi-dimensioned and automatically include time series dimensions. Amberstone can include any column from the fact table as a dimension and can sum any column in the fact table to build its aggregates.
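
As a rough illustration of a multi-dimensioned time series aggregate, the Python sketch below groups fact table rows by an hourly time bucket plus a chosen set of dimension columns and sums one measure column. The function name, the column names and the hourly bucket are assumptions made for the example, not Amberstone's interface.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def aggregate(columns, dimensions, measure, time_column="ts"):
    """Sum `measure` grouped by (hour bucket, *dimensions) using a hash table."""
    totals = defaultdict(int)
    row_count = len(columns[time_column])
    for i in range(row_count):
        # Truncate the timestamp to the start of its hour: the time dimension.
        hour = columns[time_column][i] // SECONDS_PER_HOUR * SECONDS_PER_HOUR
        key = (hour,) + tuple(columns[name][i] for name in dimensions)
        totals[key] += columns[measure][i]
    return dict(totals)


# Illustrative column oriented fact table; url_key holds surrogate integer keys.
columns = {
    "ts": [0, 10, 3600, 3700],
    "url_key": [0, 0, 1, 0],
    "bytes": [100, 50, 200, 25],
}
result = aggregate(columns, dimensions=["url_key"], measure="bytes")
```

Because the group keys are tuples of small integers, hashing and comparing them is cheap, which is one reason surrogate keys suit this style of aggregation.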

Amberstone performs aggregation at very high speeds. It deploys a read-ahead thread to read data off disk in the background as aggregation is being performed. Aggregation is performed in memory in high performance hash tables optimized to work with the surrogate integer keys in the fact table. Combining these approaches with a high end 8-12 core CPU can result in aggregation speeds approaching 30 million records per second on a single node. When run in a cluster, Amberstone can perform aggregation at hundreds of millions of records per second.

Amberstone has built-in support for incremental aggregation.

Amberstone also aggregates sessions and transactions.

Handling High Cardinality

Amberstone builds its aggregations in memory, which is much faster than the disk-based sort approach taken by Hadoop. But how does Amberstone deal with the issue of high cardinality?

Amberstone has a two-pronged approach for dealing with high cardinality. The first approach is sliding window aggregation. Amberstone's fact tables are sorted in ascending time order. Amberstone reads the fact tables linearly and cuts daily or hourly aggregate files as it goes. Using this approach, only a window of the aggregation is kept in memory at any given time. This allows Amberstone to build aggregations on high cardinality dimensions that could not be built if the entire aggregation were kept in memory at once.
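
A minimal Python sketch of the sliding window idea follows. It assumes records arrive as (timestamp, dimension key, value) tuples already sorted by time, and it yields each finished window instead of writing an aggregate file. The function name and record layout are hypothetical.

```python
def windowed_aggregates(records, window_seconds):
    """Yield (window_start, totals) for each window of time-sorted records.

    Only the totals for the current window are held in memory.
    """
    current_window = None
    totals = {}
    for ts, dim_key, value in records:
        window = ts // window_seconds * window_seconds
        if current_window is not None and window != current_window:
            # The input is time ordered, so the previous window is complete.
            yield current_window, totals
            totals = {}
        current_window = window
        totals[dim_key] = totals.get(dim_key, 0) + value
    if current_window is not None:
        # Flush the final, still-open window.
        yield current_window, totals


# Illustrative records, sorted in ascending time order.
records = [(0, 7, 1), (100, 7, 2), (3600, 8, 5), (3601, 7, 1)]
hourly = list(windowed_aggregates(records, window_seconds=3600))
```

Peak memory in this sketch is bounded by the number of distinct keys in a single window, not by the number of distinct keys in the whole fact table.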

The second approach for handling high cardinality is horizontal partitioning. Amberstone allows fact tables to be split across any number of partitions and then aggregated separately. Using this approach, a fact table can be split on a high cardinality dimension across N partitions. Each partition can then be aggregated separately, reducing the cardinality of the dimension by a factor of N. The aggregates from each partition can then be quickly merged into a single master aggregate.
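
The partition, aggregate and merge steps can be sketched in Python as shown here. The modulo split on the dimension key, the function names and the sample data are assumptions for illustration; the source does not specify how Amberstone assigns rows to partitions.

```python
from collections import Counter


def partition_records(records, num_partitions):
    """Split (dim_key, value) records on the dimension key (assumed modulo split)."""
    partitions = [[] for _ in range(num_partitions)]
    for dim_key, value in records:
        partitions[dim_key % num_partitions].append((dim_key, value))
    return partitions


def aggregate_partition(partition):
    """Aggregate one partition on its own; it sees only its share of the keys."""
    totals = Counter()
    for dim_key, value in partition:
        totals[dim_key] += value
    return totals


def merge_aggregates(aggregates):
    """Combine per-partition aggregates into one master aggregate."""
    master = Counter()
    for partial in aggregates:
        master.update(partial)  # Counter.update adds values for matching keys
    return dict(master)


# Illustrative data: four distinct keys split across two partitions.
records = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 10)]
partials = [aggregate_partition(p) for p in partition_records(records, 2)]
master = merge_aggregates(partials)
```

With four distinct keys and two partitions, each partial aggregate in this example holds two keys, which is the factor of N reduction described above. Since every key lands in exactly one partition, the merge never has to reconcile overlapping keys.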

Combining sliding window aggregation with horizontal partitioning allows Amberstone to effectively manage high levels of cardinality.

License

Amberstone is released under the Apache 2.0 open source license.

http://www.apache.org/licenses/LICENSE-2.0