Farsighted News SNIA

IT Corner

Failover and Failback for Disaster Recovery with Asynchronous Mirroring
By Yael Tzur, LSI Technologies Israel Ltd.

Introduction
The essence of a disaster recovery plan is the planning and testing required to ensure that operations remain resilient during an outage. If the plan doesn't include testing failover and failback, the plan may be considered incomplete. Failing over is a complex procedure, especially when it comes to disaster recovery with asynchronous replication. Because the data at the two sites is not identical, the operational considerations can become complicated, and relying on untested procedures and policies introduces risk into the very function designed to mitigate risk during an outage.

The rule of thumb is simple and known to all: if the plan has never been tested, it may not work. Like fire drills in schools, disaster recovery procedures must be periodically tested and verified. However, some tests, like some drills, may seem to be more trouble than they're worth, and some organizations may simply abandon testing, hoping that disaster will never strike.

This article describes data handling for disaster recovery failover procedures, outlining the problems uncovered during testing, and suggests solutions.

Asynchronous Mirror-Based Disaster Recovery - Concept
A well-crafted business continuity plan protects critical business data by replicating it outside of the data center while keeping it accessible so that operations can continue during an outage. For the purposes of this paper, tape has been determined not to meet the recovery time and recovery point objectives (RTO/RPO) of this client. With asynchronous replication, data is updated on the main site and then asynchronously copied to the disaster recovery site, so the application remains unaffected by distance-induced latency. With synchronous replication, by contrast, data is written to the target location before the I/O completes, which delays the next I/O operation until the current synchronous operation is complete.
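
The latency difference can be illustrated with a rough calculation. The sketch below is not from the original article: the function name, the 0.5 ms local write time, and the figure of roughly 200,000 km/s for signal propagation in optical fibre are assumptions chosen for illustration, and real links add switching and protocol overhead on top.

```python
# Rough model of per-write latency seen by the application.
# Assumption: signals travel about 200,000 km/s in fibre, so one
# round trip costs roughly 1 ms per 100 km of distance.
FIBRE_KM_PER_SECOND = 200_000.0


def write_latency_ms(local_ms: float, distance_km: float, synchronous: bool) -> float:
    """Estimate the latency of one write, in milliseconds.

    A synchronous write waits for the remote site to acknowledge, so it
    pays the round trip. An asynchronous write completes locally and the
    copy to the remote site happens later.
    """
    if not synchronous:
        return local_ms
    round_trip_ms = 2 * distance_km / FIBRE_KM_PER_SECOND * 1000
    return local_ms + round_trip_ms


sync_ms = write_latency_ms(0.5, 1000, synchronous=True)
async_ms = write_latency_ms(0.5, 1000, synchronous=False)
print(f"synchronous: {sync_ms:.1f} ms, asynchronous: {async_ms:.1f} ms")
```

Under these assumptions, a site 1,000 km away adds about 10 ms to every synchronous write, while the asynchronous write keeps its local latency.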

Asynchronous mirroring does not require the fast response times and high throughput that synchronous mirroring does, and therefore enables the use of much more affordable IP lines. IP lines usually have lower bandwidth and slower response times compared with Fibre Channel lines, but are very cost-effective and support practically unlimited distances.

A challenge for asynchronous mirroring is creating the initial full copy of volumes at the target disaster recovery site. Available IP bandwidth may be sufficient for transferring data changes but in many cases may not be sufficient for copying all of the application's volumes within a reasonable time.
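
A back-of-the-envelope estimate shows why the initial copy is the hard part. The sketch below is illustrative only: the function name, the 80% link efficiency, and the example volumes (10 TB of data, 50 GB of daily change, a 100 Mbit/s line) are assumptions, not figures from the article.

```python
def transfer_days(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days needed to send data_gb (decimal gigabytes) over a link rated
    at link_mbps (megabits per second).

    efficiency is the assumed fraction of the rated bandwidth that is
    actually usable for replication traffic.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, bandwidth, or efficiency")
    bits = data_gb * 8 * 1000**3                      # gigabytes to bits
    seconds = bits / (link_mbps * 1000**2 * efficiency)
    return seconds / 86_400                           # seconds per day


initial_copy = transfer_days(10_000, 100)   # full copy of 10 TB
daily_changes = transfer_days(50, 100)      # one day's worth of changes
print(f"initial copy: {initial_copy:.1f} days")
print(f"daily changes: {daily_changes * 24:.1f} hours")
```

With these assumed numbers the daily changes fit comfortably in the link, taking well under two hours, while the full copy takes more than eleven days.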

Creating the initial copy may take days or weeks, and the risk to data currency during that time creates new challenges for the plan. Using tape may build the target site quickly, but the resulting copy may be too far out of sync to be useful or too difficult to reconcile.

Asynchronous Mirror Implementation Methods
There are several types of asynchronous replication behavior, and two implementation methods are the most popular. The first queues all "write" commands at the source site and sends them one after the other to the disaster recovery (target) site. The second collects all the changed blocks on the source site disk and then sends them to the disaster recovery site while resuming the collection of newly changed blocks. The latter is occasionally combined with snapshots (each collection of changed blocks is practically a snapshot) and is then known as snapshot-based asynchronous mirroring. Each methodology has certain advantages over the other.
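
The difference between the two methods can be seen in a toy model. The sketch below is a simplified illustration, not any vendor's implementation: the class names `QueueMirror` and `ChangedBlockMirror` are hypothetical, volumes are modelled as dictionaries from block number to content, and everything runs in a single thread.

```python
from collections import deque


class QueueMirror:
    """Queues every write and replays the writes, in order, at the target."""

    def __init__(self):
        self.pending = deque()   # writes not yet sent to the target
        self.target = {}         # block number -> content at the DR site

    def write(self, block, data):
        self.pending.append((block, data))

    def replicate(self):
        """Send all queued writes; return how many transfers were needed."""
        sent = 0
        while self.pending:
            block, data = self.pending.popleft()
            self.target[block] = data
            sent += 1
        return sent


class ChangedBlockMirror:
    """Tracks which blocks changed and sends each changed block once."""

    def __init__(self):
        self.source = {}         # block number -> content at the main site
        self.dirty = set()       # blocks changed since the last cycle
        self.target = {}

    def write(self, block, data):
        self.source[block] = data
        self.dirty.add(block)

    def replicate(self):
        """Send the collected blocks; return how many transfers were needed."""
        batch, self.dirty = self.dirty, set()   # resume collecting at once
        for block in batch:
            self.target[block] = self.source[block]
        return len(batch)


queue_mirror, block_mirror = QueueMirror(), ChangedBlockMirror()
for mirror in (queue_mirror, block_mirror):
    for version in ("v1", "v2", "v3"):
        mirror.write(7, version)     # the same block is rewritten three times

print(queue_mirror.replicate(), block_mirror.replicate())
```

In this model the queue preserves the exact order of writes at the cost of sending every one of them, while the changed-block approach sends a rewritten block only once per cycle.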

Issues with Failovers
The time delay between the moment data is saved on the main site and the moment it is sent to the disaster recovery site may result in data loss if an outage occurs while data is in flight to the target/disaster recovery site. Proper testing helps develop procedures to avoid or recover data lost in flight. The real goal is to recover the application at the disaster recovery site. Since a real failover may involve some data loss, a good disaster recovery procedure must include a test of the data-loss scenario.

While operations run from the disaster recovery site, whether due to an outage or due to testing, a new element of risk is introduced: the data center is no longer protected by disaster recovery. A contingency plan for this scenario must be considered.

Failback is in many instances even more challenging. It is required not only at the end of each test but also after recovering from a disaster. A good example would be companies with offices in Manhattan after September 11th. Although these offices did not suffer any physical damage, all roads were blocked and employees could not reach their workplace. This case clearly required failover to the disaster recovery site so that once the roads were reopened, the companies could fail back to the main site.

Nonetheless, since the application has been running at the disaster recovery site for quite some time and the data there is more up to date, failback to the main site remains a major challenge.

Failover Procedure
Failover to a disaster recovery site is fairly simple. In a disaster or disaster simulation, the main site is shut down immediately and the application resumes at the disaster recovery site with the available data. In tests that impose no data loss, the application may be gracefully shut down. After a short time the mirror completes copying all remaining changes, at which point the mirror may be broken and the application on the disaster recovery site may be started using the copied data.
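
The two failover paths can be sketched as follows. This is a minimal model under stated assumptions, not a real product's procedure: the `Mirror` class and the `failover` function are hypothetical names, and each site's volume is modelled as a dictionary from block number to content.

```python
from dataclasses import dataclass, field


@dataclass
class Mirror:
    main: dict = field(default_factory=dict)     # data at the main site
    dr: dict = field(default_factory=dict)       # data at the DR site
    pending: list = field(default_factory=list)  # writes not yet copied
    active: bool = True                          # False once the mirror is broken

    def write(self, block, data):
        """Application write at the main site; the copy to DR is deferred."""
        self.main[block] = data
        self.pending.append((block, data))

    def drain(self):
        """Copy all remaining changes to the DR site."""
        for block, data in self.pending:
            self.dr[block] = data
        self.pending.clear()


def failover(mirror: Mirror, graceful: bool) -> list:
    """Break the mirror and return the writes lost by the failover."""
    if graceful:
        # The application is already shut down, so the mirror can catch up.
        mirror.drain()
        lost = []
    else:
        # The main site is gone: in-flight writes never reach the DR site.
        lost = list(mirror.pending)
        mirror.pending.clear()
    mirror.active = False
    return lost   # the application now starts on mirror.dr


m = Mirror()
m.write(1, "a")
m.drain()
m.write(2, "b")                       # still in flight when disaster strikes
print(failover(m, graceful=False))    # the write to block 2 is lost
```

A test that passes only with `graceful=True` says nothing about how the application behaves when it restarts on data that is missing the last few writes.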

If the disaster recovery site is to operate for an extended period, a disaster recovery plan for the disaster recovery site itself should be put into practice. The issue is the initial build of the mirror, which is a resource-consuming process. Occasionally the main site can be used as the disaster recovery site; however, bringing its data up to date then becomes an issue. This paper does not discuss the building of a new mirror. Building a mirror to the main site is explained below.

Failback To The Main Site
The main problem with failback is getting the updated data back at the main site. The failback process should not result in any data loss and this may require stopping the application at the disaster recovery site and copying the data by tape to the main site. However, this technique will result in very lengthy downtime and therefore is not applicable for testing and is a very poor solution for real failback.

Since building a new mirror is a long and problematic process, why not use the data already in the main site? This has the potential to dramatically reduce the amount of time and effort required to get the good data in the right place. In order to use the main site's data, one must find a point in time when the data is synchronized between the main site and the disaster recovery site, to ensure data integrity when resuming production. From such a point in time, a new asynchronous mirror from the disaster recovery site to the main site can be started, and the failback procedure is the same as a graceful failover: shut down the application, wait for the last changes to be copied, and then resume work.

The question is how to find the point in time at which the data is synchronized.

The best way to find the point of synchronization for asynchronous replication based on queuing I/Os is by gracefully quiescing the application. Quiescing the application means putting it in a special mode in which it closes all open files and ensures the data is consistent; while in this mode the data on the disk will not be changed. Obviously, when dealing with a real disaster, or testing a real disaster that involves data loss, there is no possible way to find that synchronized point in time.

With snapshot-based asynchronous mirroring, however, snapshots serve as multiple synchronized points in time. They enable an asynchronous mirror back from the disaster recovery site to the main site that copies only the changes that have occurred since the last snapshot was recorded.
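
The idea can be sketched in a few lines. The sketch below is an illustration under assumptions, not a description of any product: the function name `plan_failback` is hypothetical, volumes are dictionaries from block number to content, blocks are never deleted, and the main site is assumed to be able to roll itself back locally to the last snapshot that both sites share.

```python
def plan_failback(main_now: dict, dr_now: dict, snapshot: dict):
    """Bring the main site up to date from a snapshot both sites share.

    Returns (restored, delta, discarded):
      restored  - the main site's data after failback
      delta     - blocks that must cross the link from DR to main
      discarded - main-site blocks rolled back because their last
                  writes never reached the DR site
    """
    # Blocks written at the main site after the snapshot were in flight
    # when the failover happened; rolling back to the snapshot drops them.
    discarded = {b for b, d in main_now.items() if snapshot.get(b) != d}

    # Only blocks the DR site changed since the snapshot need to be sent.
    delta = {b: d for b, d in dr_now.items() if snapshot.get(b) != d}

    restored = dict(snapshot)   # local rollback, no network traffic
    restored.update(delta)      # apply the changes shipped from DR
    return restored, delta, discarded


snapshot = {0: "a", 1: "b", 2: "c"}
main_now = {0: "a", 1: "b2", 2: "c"}   # block 1 was written but never replicated
dr_now = {0: "a", 1: "b", 2: "c9"}     # block 2 changed while running at DR

restored, delta, discarded = plan_failback(main_now, dr_now, snapshot)
print(delta, discarded)
```

Here only one of the three blocks crosses the link, and the main site ends up identical to the disaster recovery site without a full rebuild of the mirror.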

Summary
It seems that although disaster recovery has been a major trend in the last few years, failback procedures (a critical part of it) are not getting the attention they deserve. Unfortunately, very few vendors offer failback with their asynchronous mirror. Most organizations have implemented only very expensive, risky and time-consuming procedures for operating the disaster recovery site, testing it and failing back to the main site.

Due to the complexity and cost of disaster recovery testing, many organizations give up disaster recovery testing or perform only minimal testing, putting their business at risk in case of a real disaster.

Snapshot-based mirroring has a clear technological advantage and can provide much easier failover and failback options that cover all cases, as opposed to I/O queuing mirroring, which covers only one simple case.

About the Author
Yael Tzur is the Software Business International Marketing Manager for LSI Technologies Israel Ltd.









