Nutanix Benefit 5: Enterprise Grade Replication and Disaster Recovery

View all current content in this series and make sure you don’t miss upcoming installments: Nutanix Top 10 Benefits Series.

In our previous post, we discussed how Nutanix natively provides granular and efficient snapshot technology, how the Nutanix implementation of snapshots differs, and why those differences matter. In this blog, we will discuss the benefits of the native enterprise-grade replication and disaster recovery (DR) orchestration that is built into the Nutanix platform.

Replication

From the beginning, Nutanix has included replication at a granular level in our platform. We called this extent-based replication (EBR), and it worked at the scope of a storage extent: 1 MB of logically contiguous blocks of data. But what if only 64 KB of data has changed? Prior to AOS 6.5 Long Term Support (LTS), the platform would replicate the entire 1 MB extent containing the changed data to the remote destination, and the remote system would then discard what wasn’t needed. As an example, if 16 KB of data changed inside each of 100 different extents, 100 MB of data would be replicated to the target cluster.

Included in AOS 6.5 LTS is the all-new Range Based Replication (RBR). With RBR, the Nutanix platform replicates only the exact ranges of new or updated data. In the example above, RBR would replicate about 1.6 MB (100 × 16 KB) instead of 100 MB. In our testing using up to 9 snapshots and four different data change rates, RBR resulted in up to 63% savings in time spent replicating data. Also, data that already exists on the target cluster never touches the network, which saves bandwidth as well. This helps customers stay within their desired RPO, removing risk from the DR plan. To learn more about RBR, please check out my range based replication blog post.
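The arithmetic behind the EBR and RBR example can be checked with a short sketch. This is only an illustration of the numbers in the text, not Nutanix code: the function names `ebr_bytes` and `rbr_bytes` are hypothetical, and the sketch assumes each 16 KB change lands in a different 1 MB extent.

```python
from __future__ import annotations

EXTENT_SIZE = 1024 * 1024  # 1 MB extent, as described in the text


def ebr_bytes(changed_extents: int) -> int:
    """Extent-based replication: every touched extent is sent whole."""
    return changed_extents * EXTENT_SIZE


def rbr_bytes(changed_ranges: list[int]) -> int:
    """Range-based replication: only the changed byte ranges are sent."""
    return sum(changed_ranges)


# 16 KB changed inside each of 100 different extents (the example above).
changes = [16 * 1024] * 100

ebr = ebr_bytes(len(changes))
rbr = rbr_bytes(changes)

print(f"EBR sends {ebr // EXTENT_SIZE} MB")  # EBR sends 100 MB
print(f"RBR sends {rbr // 1024} KB")         # RBR sends 1600 KB
```

1600 KB is the roughly 1.6 MB quoted above, so in this scenario RBR moves about 1/64 of the data that EBR would.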

Nutanix replication is policy-based and defined by the recovery point objective (RPO), which is the maximum amount of data, measured in time, that a customer is willing to lose. The RPO defines the oldest acceptable point in time to recover to if a failure occurs. This is done through a protection policy where the administrator defines the snapshot interval, the replication interval, and the replication target(s). As an example, ACME Inc. requires that there be no more than 2 hours of data loss for their systems. The administrator would configure a snapshot and replication schedule of 2 hours. This means that, at any given time, restoring from the latest snapshot would lose at most 2 hours of data.

Let’s go through each of the supported RPOs:

  • Asynchronous replication (Async) – Policies with a configured RPO of 60 minutes or greater are considered asynchronous. This means that snapshots are taken no more often than every 60 minutes. These snapshots can be kept locally and replicated to one or more clusters.
  • Near-Synchronous replication (Near-Sync) – Policies with a configured RPO of 1 minute to 15 minutes are considered Near-Sync. For ESXi, we also offer vStore based replication with an RPO of 20 seconds. Near-Sync uses a new snapshot technology called light-weight snapshot (LWS). Unlike the traditional vDisk based snapshots used by Async, which are done in the Extent Store, LWS leverages markers and is completely OpLog based. This architecture makes LWS highly scalable and performant. LWS are replicated continuously to the remote site. We create an intermediate snapshot every hour and retain it for 6 hours to serve as a checkpoint to help with RTO.
  • Synchronous replication (Metro Availability/Sync) – Policies with a configured RPO of 0 minutes are considered synchronous. With synchronous replication, we can achieve a zero RPO at the VM granularity level. Synchronous replication is supported between sites that have less than 5 ms of network round trip time between them. To achieve continuous availability of applications and zero data loss, a secondary copy of all data, including VM data, VM metadata, and the protection policies applied to VMs, is maintained across two clusters. This ensures that there is no data loss in case of site failure. Synchronous replication also allows VM live migration to be easily supported between sites.
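The RPO ranges in the list above can be written down as a small lookup. This is a minimal sketch of the tiers exactly as the list states them, with a hypothetical function name `classify_rpo`; it is not how the platform validates a policy. The list does not say how an RPO between 15 and 60 minutes is treated, so the sketch reports such values as unlisted rather than guessing, and it does not model the 20-second vStore option for ESXi.

```python
def classify_rpo(rpo_seconds: int) -> str:
    """Map a configured RPO (in seconds) to the tier named in the list above."""
    if rpo_seconds < 0:
        raise ValueError("RPO cannot be negative")
    if rpo_seconds == 0:
        return "sync"        # zero RPO: synchronous replication
    if 60 <= rpo_seconds <= 15 * 60:
        return "near-sync"   # 1 to 15 minutes
    if rpo_seconds >= 60 * 60:
        return "async"       # 60 minutes or greater
    return "unlisted"        # not covered by the three tiers described above


# ACME Inc.'s 2-hour requirement from the earlier example is an Async policy.
print(classify_rpo(2 * 60 * 60))  # async
print(classify_rpo(5 * 60))       # near-sync
print(classify_rpo(0))            # sync
```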

Disaster Recovery Orchestration

Disaster recovery is measured in terms of the recovery point objective (RPO), the maximum amount of data the customer is willing to lose, and the recovery time objective (RTO), the time allowed to restore operations when a failure occurs. A disaster recovery solution is a combination of data replication and recovery orchestration.

Nutanix Disaster Recovery and Orchestration relies on a policy-driven approach to configuring snapshot schedules, replication, and failover orchestration. This ensures repeatable results when a disaster strikes.

Let’s take a look at a few constructs inside of Nutanix Disaster Recovery:

  • Availability Zones (AZs) – A logical grouping of physically separate infrastructure to fail over to in the event of a disaster. Pairing AZs is done by connecting two Prism Central instances, which can then share categories, DR policies, and other VM metadata. Paired AZs allow failover from one AZ or Prism Central to another.
  • Category – A logical grouping of VMs using a simple text-based tag. As an example, a category could be an application grouping, an RPO, or any other label that identifies a common attribute of these VMs.
  • Protection Policies – Define the RPO (snapshot frequency), recovery location (remote cluster / Nutanix Disaster Recovery as a Service (DRaaS)), snapshot retention (local vs. remote cluster), and associated categories. Nutanix also supports granular retention of snapshots in the form of linear and roll-up policies. A linear retention policy specifies the number of recovery points to retain. As an example, if the RPO is 1 hour and the retention is set to 6, the system keeps the latest 6 hours of recovery points. A roll-up retention policy “rolls up” snapshots depending on the RPO and retention duration. For example, if the RPO is 1 hour and the retention is set to 5 days, the system keeps 1 day of hourly and 4 days of daily recovery points. For more information on retention policies, refer to the Leap DR section of The Nutanix Bible.
  • Recovery Plans – Define how the recovery should take place. This includes network and IP address mapping, runbook automation where the customer defines the power-on sequence of VMs, and post-boot scripts to be called when executing a failover or test failover.
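The two retention examples in the Protection Policies item can be reproduced with a short sketch. It mirrors only the examples given there (1-hour RPO with linear retention of 6, and 1-hour RPO with 5 days of roll-up retention); the function names are hypothetical, and the bucketing rule in `rollup_example` is an assumption made for illustration, not the platform's actual roll-up algorithm.

```python
from datetime import datetime, timedelta


def linear_retention(points, keep):
    """Linear policy: keep the newest `keep` recovery points."""
    return sorted(points)[-keep:]


def rollup_example(points, now):
    """Keep hourly points for the last day, then one point per day
    for the 4 days before that (the 5-day example above)."""
    kept = []
    daily_buckets = set()
    for p in sorted(points, reverse=True):  # newest first
        age = now - p
        if age < timedelta(0) or age >= timedelta(days=5):
            continue                        # outside the retention window
        if age < timedelta(days=1):
            kept.append(p)                  # hourly granularity
        else:
            bucket = age // timedelta(days=1)  # 1, 2, 3, or 4 days old
            if bucket not in daily_buckets:
                daily_buckets.add(bucket)
                kept.append(p)              # newest point of that day
    return kept


now = datetime(2024, 1, 6, 12, 0)
# One recovery point per hour for the last 5 days (RPO of 1 hour).
points = [now - timedelta(hours=h) for h in range(5 * 24)]

print(len(linear_retention(points, 6)))  # 6
print(len(rollup_example(points, now)))  # 28 (24 hourly + 4 daily)
```

Under these assumptions, roll-up retention covers 5 days with 28 recovery points, where keeping every hourly point would need 120.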

Administrators execute failover and test failovers by running one or more recovery plans from the target AZ Prism Central. While a failover or test failover is being executed, a detailed audit trail is created where each step is listed with its timestamp, giving businesses a clear picture of how well their recovery plan performed.
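To make the idea of a recovery plan with a timestamped audit trail concrete, here is a minimal sketch. Everything in it is hypothetical: the `RecoveryPlan` class, its fields, and the step wording are invented for illustration and do not correspond to a Prism Central API. It only models the pieces named above: a network mapping, a staged power-on sequence, and a log of each step with its timestamp.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RecoveryPlan:
    name: str
    boot_stages: list[list[str]]   # VMs to power on, stage by stage
    network_map: dict[str, str]    # source network -> recovery network
    audit_trail: list[tuple[datetime, str]] = field(default_factory=list)

    def _log(self, step: str) -> None:
        # Record each step together with the time it happened.
        self.audit_trail.append((datetime.now(timezone.utc), step))

    def run(self, test: bool = False) -> list[tuple[datetime, str]]:
        mode = "test failover" if test else "failover"
        self._log(f"start {mode} for plan {self.name}")
        for src, dst in self.network_map.items():
            self._log(f"map network {src} -> {dst}")
        for number, stage in enumerate(self.boot_stages, start=1):
            for vm in stage:
                self._log(f"stage {number}: power on {vm}")
        self._log(f"{mode} complete")
        return self.audit_trail


plan = RecoveryPlan(
    name="demo-app",
    boot_stages=[["db01"], ["app01", "web01"]],  # database first
    network_map={"prod-vlan": "dr-vlan"},
)
for timestamp, step in plan.run(test=True):
    print(timestamp.isoformat(), step)
```

Keeping the boot order as an explicit list of stages is what makes a run repeatable: the same plan produces the same sequence of steps every time, and the trail shows how long each one took.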

Observability

While Nutanix Disaster Recovery (DR) solutions have expanded from on-premises to hybrid multicloud, the DNA has remained the same. With expanded topologies such as hybrid cloud, Nutanix DRaaS, and Nutanix Cloud Clusters (NC2), observability has become a priority for customers in their hybrid cloud journey with Nutanix. With the AOS 6.0 release, we added an observability dashboard in Prism Central that provides insights into key DR performance indicators such as RPO SLAs, DR readiness, alerts, and more.

For a deeper look into DR Observability, please have a look at the disaster recovery dashboard blog post.

Why Does This Matter?

Today’s CIOs and IT leaders expect their DR environments to be smart and efficient, auto-protect new VMs, prioritize VM-level replication, and provide meaningful and actionable recommendations on improving the RPO and RTO for applications and workloads. In short, customers are looking to simplify their DR operations while meeting RPO and RTO SLAs as well as regulatory compliance, all within a budget. 

Nutanix has recognized these challenges and has simplified DR for our customers by including these features in our platform without the need to install or manage another piece of software or VM:

  • Efficient Replication – By leveraging RBR, we reduce the time it takes to replicate data by up to 63%, helping ensure the business RPO SLAs are met.
  • Disaster Recovery Orchestration – Policy-driven DR plans allow for quickly creating policies and adding VMs or applications to them. Assign a boot order for the VMs and map your networks for IP address management after failover. All of this happens in the same user interface (UI), without the hassle of plugins, adapters, and extra VMs to manage.
  • Observability – Our customers see their entire DR infrastructure in an easy-to-understand dashboard. It shows which VMs are protected (being replicated and in a recovery plan), how many replications are currently running, whether any replications have failed, configuration alerts, and whether the business is recovery ready (recovery plans are being tested successfully).

Nutanix has our customers covered for their DR needs whether that is private cloud, hybrid cloud, DRaaS, or hybrid multi-cloud.

In the next blog, Brian Suhr will show how Nutanix allows freedom of hypervisor choice so you can tailor your deployment to meet your business and application needs.

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.