Maximize the Power of Deduplication

By Gaurav Yadav | Archive & DR

Deduplication is a powerful technique used by most storage solutions to reduce the amount of data stored at rest. For all-flash infrastructures, deduplication is a must-have to offset the higher cost of flash appliances. Backup and archival applications generate a lot of duplicate data, which lends itself especially well to deduplication savings. There is no arguing the benefits of deduplication, but the devil is always in the details of the implementation. There are multiple ways to implement deduplication, but the key idea is always the same: first, discover the duplicate data; then keep only the unique instances of the data chunks and use references to access everything else. In this blog, we will discuss different ways to implement deduplication and explore the benefits of Hedvig's deduplication implementation.
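To make the "discover duplicates, keep only unique instances, reference the rest" idea concrete, here is a minimal, illustrative Python sketch of content-based deduplication. It is not Hedvig's implementation; the fixed chunk size and the SHA-256 fingerprint are assumptions chosen for the example.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: unique chunks are kept once,
    duplicate chunks are stored as references (fingerprints) only."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}    # fingerprint -> chunk bytes (unique instances)
        self.volumes = {}   # volume name -> ordered list of fingerprints

    def write(self, volume, data):
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()  # discover duplicates by fingerprint
            if fp not in self.chunks:               # keep only unique instances
                self.chunks[fp] = chunk
            self.volumes.setdefault(volume, []).append(fp)  # duplicates become references

    def read(self, volume):
        return b"".join(self.chunks[fp] for fp in self.volumes[volume])


store = DedupStore()
store.write("vm-backup-1", b"A" * 8192)  # two identical chunks -> one stored
store.write("vm-backup-2", b"A" * 4096)  # same chunk from another volume -> still one stored
print(len(store.chunks))                 # 1 unique chunk kept at rest
```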

Let’s start with the different ways to implement deduplication:

Inline deduplication – Data is deduplicated before it is written to persistent media on the backend servers. This ensures that only unique data ever hits the disks or tape drives.

Offline deduplication – Data is stored on the persistent media first and deduplicated later. This avoids any real-time deduplication overhead but has the side effect of a higher number of media read and write operations, because all ingested data must be written to the media first and then read back as part of the offline deduplication process (see the sketch after this list).

Client-side deduplication – Data is deduplicated at the client or source, before being sent to the server or backend. Depending on the storage solution, this can prevent duplicate data from being sent over the wire from the compute to the storage layer. A higher deduplication ratio therefore translates into a significant boost in effective network throughput.

Server-side deduplication – Data is first sent to the backend server(s), and then an inline, offline, or hybrid deduplication process is implemented. There is no change in the network load from compute to storage layer with this implementation.

Global deduplication – Data is deduplicated across multiple source volumes, clients, or workloads. The storage cluster processes all data ingested through different mediums and executes the deduplication process as if all data were coming from a single source.
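To illustrate the extra media traffic that offline deduplication incurs, here is a hedged Python sketch of a post-process pass: every block is written first and must be read back before duplicates can be reclaimed. The block layout and fingerprinting are assumptions for the example, not any particular vendor's implementation.

```python
import hashlib

def offline_dedup(stored_blocks):
    """Post-process pass over blocks already written to media: re-read every
    block, fingerprint it, and replace duplicates with references."""
    unique = {}     # fingerprint -> index of the block kept as the unique instance
    block_map = []  # logical block index -> index of the unique block it maps to
    for idx, block in enumerate(stored_blocks):   # every stored block is read back once
        fp = hashlib.sha256(block).hexdigest()
        if fp in unique:
            block_map.append(unique[fp])          # duplicate: keep a reference only
        else:
            unique[fp] = idx
            block_map.append(idx)
    # Blocks that are not the retained unique instance can now be reclaimed from media.
    reclaimable = [i for i in range(len(stored_blocks)) if i not in unique.values()]
    return block_map, reclaimable


blocks = [b"alpha", b"beta", b"alpha", b"beta", b"gamma"]
mapping, freed = offline_dedup(blocks)
print(mapping)  # [0, 1, 0, 1, 4]
print(freed)    # [2, 3] -- duplicates that were written to media and then read again
```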

Hedvig's advanced deduplication process is inline, client-side, and global. Hedvig provides native block, file, and object storage implementations, and deduplication is supported for all of these storage protocols. Before going any further, let's take a quick look at Hedvig's architecture. Hedvig uses a two-tier architecture consisting of the Hedvig Storage Service (HSS) at the storage layer and the Hedvig Storage Proxy (HSP) at the compute layer.

The HSS is the brain behind the distributed storage cluster and is responsible for managing the data stored on all cluster nodes. The HSS is installed on every storage node in the cluster and consolidates the locally attached storage media to present a virtual disk abstraction that can be consumed by the end user. These virtual disks can act as iSCSI volumes, NFS volumes, or object storage buckets.

The HSP, on the other hand, acts as the first point of contact for all user-facing volume-level I/O operations and is responsible for translating the storage protocols to simple read and write operations directed toward the HSS. The HSP acts as an iSCSI initiator, NFS server, and object storage server for the block, file, and object storage, respectively. Multiple HSPs can point to the same Hedvig cluster and serve a large number of Hedvig volumes. For detailed information on Hedvig architecture, refer to Hedvig’s technical overview whitepaper.

Hedvig's deduplication process is executed in real time at the HSP layer. Before transferring any chunk of data to the backend HSS process, the HSP first consults the storage layer to determine whether the given chunk has previously been written by any dedupe-enabled volume in the entire cluster. When a duplicate chunk is encountered, only a reference to the previously written unique instance is persisted at the backend, and no data blocks are transferred.
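The write path below is a conceptual Python sketch of that flow, not Hedvig's actual code: a client-side proxy fingerprints each chunk and asks the storage layer whether it has been seen before, transferring data only for new chunks. The class and method names (StorageBackend, chunk_exists, put_chunk, put_reference) are hypothetical stand-ins for the proxy-to-storage protocol.

```python
import hashlib

class StorageBackend:
    """Stand-in for the storage layer: one fingerprint index shared by
    every client, which is what makes the deduplication global."""
    def __init__(self):
        self.chunks = {}  # fingerprint -> unique chunk data
        self.refs = []    # ordered write log of references

    def chunk_exists(self, fp):
        return fp in self.chunks

    def put_chunk(self, fp, data):
        self.chunks[fp] = data
        self.refs.append(fp)

    def put_reference(self, fp):
        self.refs.append(fp)  # only metadata is persisted for duplicates


class StorageProxy:
    """Client-side proxy: fingerprints each chunk inline and consults the
    backend before deciding whether the data itself must be transferred."""
    def __init__(self, backend):
        self.backend = backend

    def write_chunk(self, data):
        fp = hashlib.sha256(data).hexdigest()
        if self.backend.chunk_exists(fp):
            self.backend.put_reference(fp)    # duplicate: no data blocks sent
        else:
            self.backend.put_chunk(fp, data)  # unique: full chunk sent once


backend = StorageBackend()
proxy_a, proxy_b = StorageProxy(backend), StorageProxy(backend)
proxy_a.write_chunk(b"payload")
proxy_b.write_chunk(b"payload")                # second client, same data
print(len(backend.chunks), len(backend.refs))  # 1 unique chunk, 2 references
```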

Benefits of Hedvig’s deduplication process

Workload agnostic deduplication

Data sprawl is one of the biggest problems with modern applications; as a side effect, multiple copies of data are generated and stored without the knowledge of the infrastructure admins. Hedvig solves this problem by enabling deduplication for primary Tier-1 workloads, along with secondary workloads such as backup and archival solutions, on the same storage cluster. For instance, a music application might create multiple instances of its audio files for backup and test/dev purposes, but Hedvig will make sure only one unique copy of the data is stored, saving a substantial amount of storage capacity that would otherwise be lost to data sprawl.

Granular dedupe control

Enabling dedupe for Hedvig volumes is just a matter of flipping an option during volume creation. Some storage solutions provide deduplication only at the cluster level, where the sole control given to the end user is to turn it on or off for all volumes. The problem with this approach is that if you want to store encrypted or already-compressed data, which rarely contains duplicates, the deduplication process is simply a performance and management overhead. By granting users control over the deduplication property of each volume, Hedvig opens the door to storing a diverse range of data.
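As a hedged illustration of why per-volume control matters, the sketch below treats the dedup setting as part of a volume's metadata, so volumes holding encrypted or already-compressed data can skip the fingerprint lookup entirely. The names and flag are hypothetical and do not represent Hedvig's API.

```python
from dataclasses import dataclass
import hashlib

@dataclass
class Volume:
    name: str
    deduplication: bool = False  # chosen per volume at creation time

seen = {}  # cluster-wide fingerprint index used by dedupe-enabled volumes

def write(volume, chunk):
    if not volume.deduplication:
        return "stored as-is"        # e.g. encrypted data: no dedup overhead paid
    fp = hashlib.sha256(chunk).hexdigest()
    if fp in seen:
        return "reference only"      # duplicate found across the cluster
    seen[fp] = True
    return "stored unique chunk"

vm_disk = Volume("tier1-vm", deduplication=True)
enc_backup = Volume("encrypted-backup")  # dedup deliberately left off
print(write(vm_disk, b"block"), "/", write(enc_backup, b"block"))
```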

Enhanced savings with compression

It's well known that most data that compresses well also deduplicates well, and vice versa. Hedvig's compression is always turned on when deduplication is selected. Compression in Hedvig is inline, which means all data blocks are first deduplicated, and only the unique set of blocks is then compressed before being written to persistent media at the backend. Inline deduplication and compression not only yield significant storage reduction; they also extend the durability of the storage media, because far fewer read and write operations are performed than with offline deduplication and compression.
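Here is a hedged sketch of that ordering, assuming simple fixed chunks and zlib as a stand-in compressor: duplicates are eliminated first, so the compression cost and the write to media are paid only once per unique chunk.

```python
import hashlib
import zlib

def dedupe_then_compress(chunks):
    """Inline pipeline sketch: deduplicate first, then compress only the
    unique chunks before they would be written to persistent media."""
    stored = {}  # fingerprint -> compressed unique chunk
    layout = []  # ordered references that describe the logical data
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in stored:
            stored[fp] = zlib.compress(chunk)  # compression paid once per unique chunk
        layout.append(fp)
    return stored, layout


data = [b"hello world" * 100, b"hello world" * 100, b"goodbye" * 100]
stored, layout = dedupe_then_compress(data)
raw_bytes = sum(len(c) for c in data)
at_rest_bytes = sum(len(c) for c in stored.values())
print(f"{raw_bytes} logical bytes -> {at_rest_bytes} bytes at rest")
```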

Dedupe for distributed applications

Hedvig can be installed across any combination of on-premises datacenter(s) and public cloud location(s) and provides a single pane of glass view for all your geographically distant applications. Hedvig’s deduplication process is independent of the number of clients and their locations. User applications can span multiple data centers and public clouds and still leverage Hedvig’s deduplication to store only unique data. This is another way Hedvig mitigates data sprawl risks caused by globally distributed applications.

Highly available storage

Keeping multiple copies of data is a common requirement these days for high availability. Hedvig not only provides control over the number of replicas of the unique data, it also lets the user select data residence from a list of racks, datacenters, and public cloud locations. This provides data locality at multiple isolated locations and minimizes data transfer over the WAN. Hedvig utilizes the power of a distributed cluster and stripes all unique data from dedupe-enabled volumes so that multiple storage nodes can participate in read and write operations.
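The placement sketch below is illustrative only: it spreads each unique chunk's replicas across distinct failure domains (racks, datacenters, or cloud regions). The domain names and the simple round-robin policy are assumptions, not Hedvig's placement algorithm.

```python
from itertools import cycle

def place_replicas(unique_chunks, domains, replication_factor=3):
    """Assign each unique chunk to replication_factor distinct failure domains."""
    placement = {}
    offsets = cycle(range(len(domains)))  # rotate the starting domain per chunk
    for fp in unique_chunks:
        start = next(offsets)
        placement[fp] = [domains[(start + i) % len(domains)]
                         for i in range(replication_factor)]
    return placement


domains = ["rack-1", "rack-2", "dc-west", "cloud-us-east"]
for fp, where in place_replicas(["chunk-a", "chunk-b"], domains).items():
    print(fp, "->", where)  # each chunk's replicas land in three different domains
```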

Hedvig's software-defined approach to storage, coupled with advanced storage reduction algorithms, has enabled our customers to move away from traditional, expensive deduplication appliances. Hedvig is designed for modern applications seeking cloud-like simplicity and flexibility from their storage solution. For more on Hedvig's features, take a look at our product datasheet here.

Contact us if you have any questions or want to learn more.