Why and how Hedvig provides non-disruptive upgrades

By Srividhya Anantharamakrishnan | Software-defined Storage

Hedvig: It’s time to upgrade your software version!
Customer:  I see. How much time is that going to take?
Hedvig:  30 minutes to one hour.
Customer:  That’s really quick, are you sure?
Hedvig:  Yes, of course.
Customer:  Do I need to stop my services? Will I have access to my data? Can I continue running my applications?
Hedvig:  It will be completely non-disruptive. We can upgrade the storage system while it remains fully operational.
Customer:  Voila!

Yes, that pretty much sums up the upgrade process with Hedvig software. Customers find it smooth and easy to move forward with software versions. We write code with Non-Disruptive Upgrades (NDU) in mind. Upgrading a large-scale distributed system that may span hundreds, if not thousands, of nodes while maintaining data access (without downtime) is complex, yet at Hedvig we have engineered our software to do it gracefully every single time, without missing a beat. Let’s face it: manual processes are error-prone, time-consuming, and heavily reliant on individuals with the appropriate expertise. Automation eliminates these pain points.

Why am I focused on upgrades? They are, of course, a normal part of maintaining any system. In the traditional storage world, upgrades were typically available and performed once a year. And that assumes a standard software update, not a forklift upgrade of the entire array, which is even more painful. In the age of Continuous Integration / Continuous Deployment (CI/CD), however, software releases deliver features and improvements more rapidly and more frequently than traditional solutions did. This faster cadence in modern IT makes simple upgrades even more important!

Let’s see how we do it from a technical standpoint.

The Hedvig software is structured to provide uninterrupted access in a truly distributed fashion, which makes upgrades easy. Depending on how the cluster is structured, from single-data-center deployments to deployments spanning multiple data centers, we employ sequential node upgrades with different levels of parallelism. Storage nodes running the Hedvig Storage Service are upgraded first. Agnostic clusters are upgraded one node at a time. Rack-aware clusters are upgraded one rack at a time. And as usual, Hedvig excels at operations across data centers: datacenter-aware clusters are upgraded one data center at a time. This keeps manual intervention to a minimum and makes the progress of the upgrade fully transparent.
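To make the three granularities concrete, here is a minimal Python sketch of how nodes could be grouped into upgrade batches. It is an illustration only, not Hedvig's implementation: the `Node` type, the `upgrade_batches` function, and the awareness labels `"agnostic"`, `"rack"`, and `"datacenter"` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a storage node's placement."""
    name: str
    rack: str
    datacenter: str


def upgrade_batches(nodes, awareness):
    """Group nodes into batches that would be upgraded one after another.

    awareness is an assumed label: "agnostic", "rack", or "datacenter".
    """
    if awareness == "agnostic":
        # One node at a time.
        return [[node] for node in nodes]
    if awareness == "rack":
        # One rack at a time; racks are identified per data center.
        key = lambda node: (node.datacenter, node.rack)
    elif awareness == "datacenter":
        # One data center at a time.
        key = lambda node: node.datacenter
    else:
        raise ValueError(f"unknown awareness: {awareness!r}")

    groups = {}  # dicts keep insertion order, so batch order follows input order
    for node in nodes:
        groups.setdefault(key(node), []).append(node)
    return list(groups.values())


cluster = [
    Node("n1", rack="r1", datacenter="dc1"),
    Node("n2", rack="r1", datacenter="dc1"),
    Node("n3", rack="r2", datacenter="dc1"),
    Node("n4", rack="r3", datacenter="dc2"),
]
for mode in ("agnostic", "rack", "datacenter"):
    batches = upgrade_batches(cluster, mode)
    print(mode, [[node.name for node in batch] for batch in batches])
```

The larger the batch, the more nodes are upgraded in parallel, which is why the batch boundary follows the same unit (node, rack, or data center) that the cluster is already built to tolerate losing.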

Customers simply download the upgrade software packages and extract the binaries. We then run a single command that performs the following steps in sequence:

  • It selects the next unit to upgrade (a single node or an entire data center, depending on the cluster type), stops the Hedvig processes on those nodes, and pushes the latest binaries over, including any config changes required for the new software version.
  • It then restarts the Hedvig processes and performs health-checks to ensure that the nodes are up with the latest software.
  • It repeats the above on the rest of the nodes in the cluster.
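The sequence above can be sketched as a simple rolling loop. This is a hedged illustration rather than the actual Hedvig upgrade command: `rolling_upgrade` and the four callables it takes (`stop`, `push`, `start`, `is_healthy`) are hypothetical stand-ins for whatever the real tooling does, and halting on a failed health check is an assumption of this sketch.

```python
def rolling_upgrade(batches, stop, push, start, is_healthy):
    """Upgrade one batch at a time and return the nodes that were upgraded.

    All four callables are hypothetical hooks supplied by the caller:
      stop(node)       stops the storage processes on the node
      push(node)       copies new binaries and config changes to the node
      start(node)      restarts the storage processes
      is_healthy(node) returns True once the node is up on the new version
    """
    upgraded = []
    for batch in batches:
        # Step 1: stop processes and push the new software to the batch.
        for node in batch:
            stop(node)
            push(node)
        # Step 2: restart the processes, then health-check the batch.
        for node in batch:
            start(node)
        unhealthy = [node for node in batch if not is_healthy(node)]
        if unhealthy:
            # Assumption: do not touch further batches if this one failed.
            raise RuntimeError(f"health check failed on: {unhealthy}")
        upgraded.extend(batch)
        # Step 3: the loop repeats for the remaining batches.
    return upgraded
```

Passing the steps in as callables keeps the ordering logic separate from how processes are actually stopped or binaries are copied, so the loop can be exercised with simple stubs.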

For customers with a multi-datacenter replication policy, which stores data at more than one site, applications continue to store and access data through the other data centers while one set of nodes is being upgraded. When the upgrade is complete, all data is resynchronized so that up-to-date data is available at all sites per the configured policy.

Once all the storage nodes are upgraded, we move on to the Hedvig Storage Proxies. Storage proxies are typically deployed in active/passive HA failover pairs. During the upgrade, we upgrade the passive storage proxy of the pair first. We then make the newly upgraded proxy active and upgrade the second proxy, which is now the passive member of the HA pair. The process is seamless because a virtual IP (VIP) assigned to the HA pair automatically redirects network traffic to whichever proxy is active at any given time. This eliminates any interruption to reads or writes during the upgrade procedure.
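The proxy procedure can be modeled in a few lines of Python. This is only a toy model under stated assumptions: `Proxy`, `HAPair`, `vip_target`, `failover`, and `upgrade_pair` are hypothetical names, and the VIP is represented simply as "whichever proxy is currently active".

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    """Hypothetical storage proxy with a name and a software version."""
    name: str
    version: str


class HAPair:
    """Toy active/passive pair; the VIP always points at the active proxy."""

    def __init__(self, active, passive):
        self.active = active
        self.passive = passive

    def vip_target(self):
        # Traffic sent to the VIP lands on the active proxy.
        return self.active

    def failover(self):
        # Swap roles: the passive proxy becomes active and vice versa.
        self.active, self.passive = self.passive, self.active


def upgrade_pair(pair, new_version):
    """Upgrade both proxies, recording which proxy serves the VIP at each step."""
    served_by = []
    # 1. Upgrade the passive proxy while the active one keeps serving.
    pair.passive.version = new_version
    served_by.append(pair.vip_target().name)
    # 2. Make the upgraded proxy active; the VIP follows it.
    pair.failover()
    served_by.append(pair.vip_target().name)
    # 3. Upgrade the other proxy, which is now passive.
    pair.passive.version = new_version
    served_by.append(pair.vip_target().name)
    return served_by


pair = HAPair(active=Proxy("proxy-1", "old"), passive=Proxy("proxy-2", "old"))
trace = upgrade_pair(pair, "new")
print(trace, pair.active, pair.passive)
```

At every step the VIP resolves to some proxy that is not being upgraded, which is the property that keeps reads and writes flowing in this model.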

Throughout the entire process of upgrading storage nodes and proxies, you can still access the Hedvig UI and get a complete view of the cluster without any hiccups. This allows admins to continue their operations and management work even while the cluster upgrade is taking place.

Writing software that eases the management burden on customers is a discipline to develop and sustain, especially as enterprises move forward with distributed, software-defined architectures and infrastructures. Non-disruptive upgrade functionality is a perfect example of this thinking. The culture here at Hedvig empowers engineers to focus on real-world customer requirements and write software that attacks the critical pain points standing in the way of running a no-friction, scalable infrastructure to support today’s business apps and needs.

To learn more about the inner workings of our software-defined storage solution, check out a few of our technical whiteboard videos.