Failover and Failback for
Disaster Recovery with Asynchronous Mirroring
By Yael Tzur, LSI Technologies Israel Ltd.
Introduction
The essence of a disaster recovery plan is the planning and testing
required to ensure that operations remain resilient during an outage. If the
plan doesn't include testing failover and failback, the plan may be
considered incomplete. Failing over is a complex procedure, especially when
it comes to disaster recovery with asynchronous replication. Since the data
is not identical at the two sites, the operational considerations can
become complicated, and relying on untested procedures and policies
introduces risk into the very function designed to mitigate risk during an
outage.
The rule of thumb is simple and known to all: if the plan has never been
tested, it may not work. Like fire drills in schools, disaster recovery
procedures must be periodically tested and verified. However, some tests
and drills may seem to be more trouble than they are worth, and some
organizations simply abandon testing, hoping that disaster will never strike.
This article describes data handling for disaster recovery failover
procedures, outlines the problems uncovered during testing, and suggests
solutions.
Asynchronous Mirror-Based Disaster Recovery - Concept
A well-crafted business continuity plan protects critical business data by
replicating it outside of the data center while keeping it accessible so
that operations can continue during an outage. For the purposes of this
paper, tape has been determined not to meet the RTO/RPO objectives of this
client. With asynchronous replication, data is updated on the main site and
then asynchronously copied to the disaster recovery site, so the application
remains unaffected by distance-induced latency. With synchronous
replication, by contrast, data is written to the target location before the
I/O completes, which delays the next I/O operation until the current
synchronous operation is complete.
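To make the latency difference concrete, here is a minimal sketch of
per-write latency under the two modes. It assumes a propagation speed of
roughly 200 km per millisecond in optical fiber and ignores equipment and
protocol overhead. The function names and sample figures are illustrative
and are not taken from any product.

```python
# Assumption: light in optical fiber covers about 200 km per millisecond, one way.
FIBER_KM_PER_MS = 200.0


def sync_write_ms(local_ms: float, distance_km: float) -> float:
    """A synchronous write waits for the remote site: local time plus one round trip."""
    return local_ms + 2 * distance_km / FIBER_KM_PER_MS


def async_write_ms(local_ms: float, distance_km: float) -> float:
    """An asynchronous write is acknowledged locally, so distance adds nothing."""
    return local_ms


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km} km: sync {sync_write_ms(1.0, km)} ms, "
              f"async {async_write_ms(1.0, km)} ms")
```

Under these assumptions, every synchronous write pays the round trip, while
the asynchronous write time stays flat regardless of distance.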
Asynchronous mirroring does not require the fast response times and
throughput of high-speed communications that synchronous mirroring does, and
therefore enables the use of much more affordable IP lines. IP lines usually
have lower bandwidth and slower response times compared with Fibre Channel
lines, but are very cost-effective and support practically unlimited
distances.
A challenge for asynchronous mirroring is creating the initial full copy
of volumes at the target disaster recovery site. Available IP bandwidth may
be sufficient for transferring data changes but in many cases may not be
sufficient for copying all of the application's volumes within a reasonable
time.
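A rough calculation shows why the initial copy is the bottleneck. The sketch
below estimates transfer time from volume size and link speed. The 80% link
efficiency, the 10 TB volume size, the 50 GB daily change rate and the
100 Mbit/s line are all hypothetical figures chosen for illustration.

```python
def transfer_days(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days needed to move data_gb (decimal gigabytes) over a link_mbps line.

    efficiency is the assumed usable fraction of the nominal bandwidth.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = data_gb * 8 * 1000**3
    seconds = bits / (link_mbps * 1_000_000 * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    full_copy = transfer_days(10_000, 100)   # hypothetical 10 TB of volumes
    daily_delta = transfer_days(50, 100)     # hypothetical 50 GB changed per day
    print(f"initial full copy: {full_copy:.1f} days")
    print(f"one day of changes: {daily_delta * 24:.1f} hours")
```

With these example figures the same line that carries a day of changes in
under two hours needs more than eleven days for the full copy.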
The risk associated with data currency during the creation of the initial
copy creates new challenges for the plan, since this process may take days or
weeks to complete. Using tape may quickly address building the target site,
but the tape copy may be too far out of sync to be useful or too difficult
to reconcile.
Asynchronous Mirror Implementation Methods
There are several types of asynchronous replication behavior. Of the two
most popular implementation methods of asynchronous mirroring, one involves
queuing all "write" commands from the source site and sending them one after
the other to the disaster recovery (target) site. The other method collects
all the changed blocks on the source site disk and then sends them to the
disaster recovery site while resuming the collection of changed blocks. The
latter is occasionally combined with snapshots (each collection of changed
blocks is practically a snapshot) and is then known as snapshot-based
asynchronous mirroring. Each method has certain advantages over the other.
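The difference between the two methods can be sketched with a toy model in
which a disk is a dictionary of block number to data. The class and method
names are hypothetical and do not correspond to any vendor's implementation.
The point of the sketch is that a write queue forwards every write in order,
while changed-block collection sends each changed block once per cycle.

```python
from collections import deque


class WriteQueueMirror:
    """Queues every write and replays the queue in order at the target."""

    def __init__(self):
        self.queue = deque()
        self.target = {}

    def write(self, block: int, data: bytes) -> None:
        self.queue.append((block, data))

    def drain(self) -> int:
        """Send all queued writes; returns how many were transferred."""
        sent = 0
        while self.queue:
            block, data = self.queue.popleft()
            self.target[block] = data
            sent += 1
        return sent


class ChangedBlockMirror:
    """Tracks which blocks changed and sends their latest contents per cycle."""

    def __init__(self):
        self.source = {}
        self.changed = set()
        self.target = {}

    def write(self, block: int, data: bytes) -> None:
        self.source[block] = data
        self.changed.add(block)

    def send_cycle(self) -> int:
        """Send the collected blocks and start a fresh collection."""
        batch, self.changed = self.changed, set()
        for block in batch:
            self.target[block] = self.source[block]
        return len(batch)


if __name__ == "__main__":
    q, c = WriteQueueMirror(), ChangedBlockMirror()
    for mirror in (q, c):
        for data in (b"a", b"b", b"c"):
            mirror.write(7, data)      # same block rewritten three times
        mirror.write(9, b"x")
    print("queued writes sent:", q.drain())
    print("changed blocks sent:", c.send_cycle())
```

In this toy model both targets end up identical, but the queue transfers
every intermediate write while the changed-block cycle transfers only the
final state of each block.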
Issues with Failovers
The time delay between the moment data is saved on the main site and the
moment it is sent to the disaster recovery site may result in data loss if
an outage occurs while data is in flight to the target/disaster recovery
site. Proper testing helps develop procedures to avoid or recover data lost
in flight. The real goal is to recover the application at the disaster
recovery site. Since a real failover may involve some data loss, a good
disaster recovery procedure must include testing a scenario that includes
data loss.
While operating from a disaster recovery site, whether due to an outage
or due to testing, a new element of risk is introduced: the data center is
no longer disaster-recovery protected. A contingency plan for this scenario
must be considered.
Failback is in many instances even more challenging. It is required at the
end of each test and also after recovering from a disaster. A good example
is companies with offices in Manhattan after September 11th. Although these
offices did not suffer any physical damage, all roads were blocked and
employees could not reach their workplace. This case clearly required
failover to the disaster recovery site and, once the roads were reopened,
failback to the main site. Nonetheless, since the application had been
running at the disaster recovery site for quite some time and the data there
was more up to date, failback to the main site remained a major challenge.
Failover Procedure
Failover to a disaster recovery site is fairly simple. In a disaster or
disaster simulation, the main site is shut down immediately and the
application resumes at the disaster recovery site with the available data.
In tests that must not lose data, the application may be gracefully shut
down. After a short time the mirror completes copying all remaining changes,
after which the mirror may be broken and the application at the disaster
recovery site may be started using the copied data.
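The two failover paths described above can be sketched as follows. This is a
toy model under stated assumptions: MirrorPair, graceful_failover and
disaster_failover are hypothetical names, a disk is a dictionary of blocks,
and "pending" stands for changes saved at the main site but not yet copied.

```python
from dataclasses import dataclass, field


@dataclass
class MirrorPair:
    main: dict = field(default_factory=dict)
    dr: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)   # changes still in flight
    mirrored: bool = True
    active_site: str = "main"

    def write(self, block: int, data: str) -> None:
        """Application write at the main site; copied to DR later."""
        self.main[block] = data
        self.pending.append((block, data))

    def copy_pending(self) -> None:
        """Asynchronously copy outstanding changes to the DR site."""
        for block, data in self.pending:
            self.dr[block] = data
        self.pending.clear()


def graceful_failover(pair: MirrorPair) -> int:
    """Planned test: application already shut down, so no new writes arrive."""
    pair.copy_pending()          # wait for the mirror to copy remaining changes
    pair.mirrored = False        # break the mirror
    pair.active_site = "dr"      # start the application at the DR site
    return 0                     # number of writes lost


def disaster_failover(pair: MirrorPair) -> int:
    """Real or simulated disaster: in-flight changes never reach the DR site."""
    lost = len(pair.pending)
    pair.pending.clear()
    pair.mirrored = False
    pair.active_site = "dr"
    return lost
```

For example, if one write is still pending when each function is called,
graceful_failover leaves both sites identical, while disaster_failover
reports one lost write and the disaster recovery site resumes without it.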
If the disaster recovery site is to operate for an extended period, a
disaster recovery plan for the disaster recovery site itself should be put
into practice. The difficulty is the initial build of the mirror, which is a
resource-consuming process. Occasionally the main site can serve as the
disaster recovery site; however, bringing its data up to date then becomes
an issue. This paper does not discuss the building of a new mirror. Building
a mirror to the main site is explained below.
Failback To The Main Site
The main problem with failback is getting the updated data back to the main
site. The failback process should not result in any data loss, and this may
require stopping the application at the disaster recovery site and copying
the data by tape to the main site. However, this technique results in very
lengthy downtime; it is therefore impractical for testing and a very poor
solution for real failback.
Since building a new mirror is a long and problematic process, why not
use the data already at the main site? This has the potential to
dramatically reduce the time and effort required to get the good data in
the right place. In order to use the main site's data, one must find a point
in time at which the data is synchronized between the main site and the
disaster recovery site, to ensure data integrity when resuming production.
From such a point in time, a new asynchronous mirror from the disaster
recovery site to the main site can be started, and the failback procedure is
then the same as a graceful failover: shutting down the application, waiting
for the last changes to be copied and subsequently resuming work.
The question is: how does one find the point in time when the data is
synchronized?
The best way to find the point of synchronization for asynchronous
replication based on queuing I/Os is by gracefully quiescing the
application. Quiescing the application means putting it in a special mode
in which it closes all open files and ensures the data is consistent; while
in this mode the data on the disk will not be changed. Obviously, when
dealing with a real disaster, or testing a real disaster that involves data
loss, there is no possible way to find that synchronized point in time.
With snapshot-based asynchronous mirroring, however, snapshots serve as
multiple synchronized points in time. They enable an asynchronous mirror
back from the disaster recovery site to the main site that copies back only
the changes that have occurred since the last snapshot was recorded.
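A minimal sketch of this idea follows. It assumes that snapshots carry
increasing integer IDs, that each snapshot is a full image (block number to
data), that blocks are never deleted, and that the main site is rolled back
to the last snapshot both sites hold before the changes are applied. All
names are hypothetical.

```python
def last_common_snapshot(main_snaps: dict, dr_snaps: dict) -> int:
    """Newest snapshot ID that exists at both sites."""
    common = main_snaps.keys() & dr_snaps.keys()
    if not common:
        raise LookupError("no common snapshot; a full copy would be needed")
    return max(common)


def fail_back(main_snaps: dict, dr_snaps: dict, dr_current: dict):
    """Return (snapshot ID used, blocks to copy back, resulting main image)."""
    snap_id = last_common_snapshot(main_snaps, dr_snaps)
    base = main_snaps[snap_id]   # assumed: main site rolled back to this image
    # Only blocks that differ from the common snapshot travel back.
    delta = {b: d for b, d in dr_current.items() if base.get(b) != d}
    restored = {**base, **delta}
    return snap_id, delta, restored


if __name__ == "__main__":
    main_snaps = {
        1: {0: "a", 1: "b"},
        2: {0: "a", 1: "c"},
        3: {0: "a", 1: "c", 2: "in-flight"},   # never reached the DR site
    }
    dr_snaps = {1: main_snaps[1], 2: main_snaps[2]}
    dr_current = {0: "a", 1: "c", 2: "new", 3: "x"}   # state after running at DR
    snap_id, delta, restored = fail_back(main_snaps, dr_snaps, dr_current)
    print("common snapshot:", snap_id)
    print("blocks copied back:", sorted(delta))
```

In the example, snapshot 3 was taken at the main site but lost in flight, so
snapshot 2 is the synchronized point and only blocks 2 and 3 are copied back
instead of the whole volume.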
Summary
It seems that although disaster recovery has been a major trend in the last
few years, failback procedures (being a critical part of it) are not getting
the attention they deserve. Unfortunately, very few vendors offer failback
with their asynchronous mirror. Most organizations have implemented only
very expensive, risky and time-consuming procedures for operating the
disaster recovery site, testing it and failing back to the main site.
Due to the complexity and cost of disaster recovery testing, many
organizations give up disaster recovery testing or perform only minimal
testing, putting their business at risk in case of a real disaster.
Snapshot-based mirroring has a clear technological advantage and can
provide much easier failover and failback options that cover all cases, as
opposed to I/O queuing mirroring, which covers only one simple case.
About the Author
Yael Tzur is the Software Business International Marketing Manager for LSI
Technologies Israel Ltd.