Anssr Analytics can utilise a number of different storage solutions. The two main options are Amazon S3 (and S3-compatible stores) and HDFS, provided by the Apache Hadoop project.
Amazon S3 (Simple Storage Service) is an object storage system capable of storing data files up to 5 terabytes in size. It’s durable, scalable and versatile, and can be very cost efficient. It also offers lifecycle policies that automatically migrate your older data to lower-cost storage classes, and versioning that lets you easily restore data which has accidentally been overwritten or deleted.
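Both of those features can be switched on programmatically. Here is a minimal sketch using Python and boto3 (the bucket name is a placeholder) that enables versioning and adds a lifecycle rule archiving objects after 90 days; the calls shown are standard boto3 operations.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-analytics-data"  # hypothetical bucket name

# Enable versioning so overwritten or deleted objects can be restored.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle rule: migrate objects older than 90 days to lower-cost
# Glacier storage.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-data",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```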
If S3 is the big data storage option you choose, we can provide the technical support and expertise to ensure Druid and S3 work together as seamlessly as possible.
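Pointing Druid at S3 is largely a matter of configuration. The sketch below appends the relevant deep-storage entries to Druid’s common.runtime.properties; the property keys are Druid’s documented S3 options, while the bucket name and base key are placeholders you would replace for your own deployment. Credentials are normally supplied separately, for example via an instance role or the druid.s3.accessKey and druid.s3.secretKey properties.

```python
# The keys below are Druid's documented S3 deep-storage settings;
# the bucket and base key are illustrative placeholders.
S3_DEEP_STORAGE = {
    "druid.extensions.loadList": '["druid-s3-extensions"]',
    "druid.storage.type": "s3",
    "druid.storage.bucket": "my-druid-segments",
    "druid.storage.baseKey": "druid/segments",
}

with open("common.runtime.properties", "a") as f:
    for key, value in S3_DEEP_STORAGE.items():
        f.write(f"{key}={value}\n")
```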
HDFS is an open source, distributed file system stewarded by the Apache Hadoop community and designed specifically for processing large data sets. The wider Hadoop platform adds services for data access, data governance, security and operations, and it can store, manage and analyse vast amounts of structured and unstructured data quickly, reliably and at extremely low cost.
Hadoop’s many benefits include scalability and performance (data can be stored, processed and analysed at petabyte scale), resilience (if a node fails, processing is immediately redirected to the remaining nodes in the cluster) and flexibility (data can be stored in any format).
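Day-to-day interaction with HDFS is straightforward. As a minimal sketch, assuming a NameNode with WebHDFS enabled and using the Python hdfs client package, loading a file into the distributed filesystem looks like this (the host, user and paths are illustrative):

```python
from hdfs import InsecureClient  # the PyPI `hdfs` WebHDFS client

# Hypothetical NameNode address and user; WebHDFS listens on port 9870
# by default in Hadoop 3.x.
client = InsecureClient("http://namenode.example.com:9870", user="analytics")

# Upload a local file; HDFS splits it into blocks and replicates those
# blocks across the cluster automatically.
client.upload("/data/events/clicks.csv", "clicks.csv")

# Confirm the file landed.
print(client.list("/data/events"))
```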
At Spicule, we operate Druid on Hadoop via the innovative Anssr platform we developed in conjunction with Canonical.
For our purposes, Hadoop is especially valuable when interrogating huge amounts of data both in real time and historically.
Because Druid supports both streaming and batch ingestion, and combines seamlessly with Hadoop’s distributed filesystem, Druid powered by Hadoop takes the headache out of running interactive analytics at scale and delivers the best possible query latencies.
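On the batch side, Druid can ingest files sitting in HDFS directly via a Hadoop-based indexing task submitted to the Overlord. The skeleton below shows the shape of such a submission against Druid’s task API; the hostname, datasource and paths are placeholder assumptions, and most of the data schema is elided.

```python
import requests

OVERLORD = "http://druid-overlord.example.com:8090"  # hypothetical host

# Skeleton of a Hadoop batch ingestion task; a real spec also needs
# a timestampSpec, dimensions, metrics, granularity and a tuningConfig.
task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            # ... timestampSpec, dimensionsSpec, metricsSpec, granularitySpec ...
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {
                "type": "static",
                "paths": "hdfs:///data/events/2024-01-01/",
            },
        },
    },
}

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=task, timeout=30)
resp.raise_for_status()
print("Submitted task:", resp.json()["task"])
```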
Hadoop is open source and doesn’t require any expensive or specialist hardware to implement.
A Hadoop cluster can consist of thousands of nodes, providing enormous storage capacity and massive computing power.
Hadoop splits your data into blocks and distributes them across all the nodes within a cluster, replicating each block on several machines. If any node unexpectedly fails, your data won’t be lost and your analysis will continue uninterrupted.
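The replication factor is under your control. As a small sketch, assuming the standard hdfs CLI is on the PATH and using a hypothetical segments path, you can raise the number of copies kept of a critical dataset like this:

```python
import subprocess

# Keep three copies of everything under a (hypothetical) Druid segments
# path; -w waits until the target replication is actually reached.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "3", "/druid/segments"],
    check=True,
)
```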
Deploying on Hadoop also gives you scope to utilise other Hadoop functionality: transform your data before it gets ingested, or run further post-query analysis on your Druid results.
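For the post-query side, one common pattern is to pull results out of Druid’s SQL API and continue the analysis in Python. A minimal sketch, assuming a hypothetical Broker address and datasource:

```python
import pandas as pd
import requests

BROKER = "http://druid-broker.example.com:8082"  # hypothetical host

# Run a SQL query against Druid's /druid/v2/sql endpoint, then hand the
# resulting rows to pandas for further post-query analysis.
resp = requests.post(
    f"{BROKER}/druid/v2/sql",
    json={"query": "SELECT channel, COUNT(*) AS edits FROM wikipedia GROUP BY channel"},
    timeout=30,
)
resp.raise_for_status()

df = pd.DataFrame(resp.json())
print(df.sort_values("edits", ascending=False).head())
```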
Parallel processing means data can be processed simultaneously across all nodes in the cluster (saving a lot of time), and heterogeneous cluster support means each node can be from a different vendor and run a different type or version of operating system.
Like Druid, Hadoop can be scaled up or down depending upon your requirements. You won’t be paying for more processing power than you need.