Were you to have been asked that question some years ago, the answer most likely would have been “I’m not sure”...
No Transaction Left Behind or Forgotten
A special thought leadership piece with:
With apologies to Lilo and Stitch, that was the idea behind RDF/ZLT, or Remote Database Facility/Zero Lost Transactions.
Based on a custom version of RDF written for a major UK bank, along with “Nomadic Disk” technology, RDF/ZLT used a remote mirror to ensure that transactions committed to the primary database, but not yet replicated to the backup database, would not be lost in the case of a catastrophic failure.
You can read the brochure linked above for the technical details, while I focus on some of the more interesting aspects of marketing it.
RDF was marketed as the way to turn the TMF-driven fault tolerance of the NonStop server into disaster tolerance. While TMF could survive any single failure, and some double failures, it could not survive a direct hit from a small thermonuclear device. By replicating transactions from one NonStop server to another in a different building or a different city, a whole new level of NonStop server survivability was achieved. But there was a secret.
You Can Lose What!?
Because RDF is reading what TMF writes to the audit trail, there is a very small window of time in which a transaction can be committed and acknowledged, but not yet replicated. The UK bank knew this, which is why they had custom code to prevent this from occurring.
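To make that window concrete, here is a toy model in Python. It is only a sketch of asynchronous replication in general, not of how RDF or TMF work internally; the names `Txn` and `lost_on_failure`, the fixed replication lag, and all the timings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    txn_id: int
    commit_time: float  # seconds; when the primary acknowledged the commit


def lost_on_failure(txns, replication_lag, failure_time):
    """Return ids of transactions committed on the primary before the
    failure but not yet applied on the backup (hypothetical fixed lag)."""
    lost = []
    for t in txns:
        committed = t.commit_time <= failure_time
        replicated = t.commit_time + replication_lag <= failure_time
        if committed and not replicated:
            lost.append(t.txn_id)
    return lost


# Illustrative numbers only: four commits, a 1-second lag, failure at t=2.0.
txns = [Txn(1, 0.0), Txn(2, 1.0), Txn(3, 1.5), Txn(4, 2.5)]
print(lost_on_failure(txns, replication_lag=1.0, failure_time=2.0))  # [3]
print(lost_on_failure(txns, replication_lag=0.0, failure_time=2.0))  # []
```

Transaction 3 was acknowledged to the application but never reached the backup. With a lag of zero, which is the effect a zero-lost-transactions design aims for, nothing acknowledged is lost.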
Other customers either didn’t know about this small hole, or they didn’t care. Even more interesting was that several HP employees didn’t realize it either. When RDF/ZLT was announced, the secret was no more, which caused a large amount of discussion in the Ridgeview Court halls.
Initial sales results of RDF/ZLT were very positive. Even better, changes made to the Diskprocess to support HP StorageWorks enterprise storage units meant that we were also helping another division increase its footprint while lowering our customers’ RPO to zero.
RTO/RPO Are Only Wishes
Anyone who attended my business continuity presentations should be aware of the meaning of RTO and RPO – the recovery time objective and recovery point objective. The first is how quickly you need to be back in business and the second is how much data can be lost when you come back online.
The keyword in both phrases is “objective,” and the objectives should come from the line of business. But after many snafus, business continuity professionals have started to use two new terms: recovery time capability (RTC) and recovery point capability (RPC). The difference is between what the line of business has requested and what the organization can actually achieve. If there is a mismatch, you have a continuity gap.
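The continuity gap is simple arithmetic, and a short Python sketch shows it. The `Recovery` type, the `continuity_gap` function, and the sample figures are all assumptions made up for this illustration; real objectives and capabilities come from the line of business and from tested recoveries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recovery:
    time_minutes: float   # how long until you are back in business
    point_minutes: float  # how many minutes of data may be lost


def continuity_gap(objective, capability):
    """Shortfall of capability (RTC/RPC) against objective (RTO/RPO).
    Zero in a field means the capability meets that objective."""
    return Recovery(
        time_minutes=max(0.0, capability.time_minutes - objective.time_minutes),
        point_minutes=max(0.0, capability.point_minutes - objective.point_minutes),
    )


# Hypothetical figures: the business asks for 60 minutes and zero data loss;
# the last test showed 90 minutes and 5 minutes of lost data.
objective = Recovery(time_minutes=60, point_minutes=0)
capability = Recovery(time_minutes=90, point_minutes=5)
gap = continuity_gap(objective, capability)
print(gap.time_minutes, gap.point_minutes)  # 30.0 5.0
```

Any non-zero value is something to close with technology or to renegotiate with the line of business.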
And Active/Active May Not be the Answer
Many IT professionals and vendors believe that active/active database replication is the solution to continuous availability. But there is much more to continuous availability. When one site goes away, how are your transactions routed to the surviving site? Does someone need to push a button to make that happen? How do you decide when to push the button? Is the decision time included in your RTC?
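One way to answer that last question is to add up every phase of a failover, including the human ones. The sketch below is purely illustrative: the phase names, the minute values, and the 15-minute objective are assumptions, not measurements from any real site.

```python
def recovery_time_capability(phases):
    """Sum the minutes spent in each failover phase (hypothetical model)."""
    return sum(phases.values())


# Illustrative timings in minutes for an active/active site failure.
phases = {
    "detect outage": 5,
    "decide to fail over": 20,
    "reroute transactions": 3,
}
rto_minutes = 15  # assumed objective from the line of business

rtc_minutes = recovery_time_capability(phases)
slowest_phase = max(phases, key=phases.get)
print(rtc_minutes, slowest_phase)  # 28 decide to fail over
print(rtc_minutes <= rto_minutes)  # False
```

In this made-up case the replication technology is not the problem; the decision to push the button takes longer than the whole objective allows.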
If ransomware somehow infects one system, is it immediately replicated to the other system, trashing both? Do you have offline backups that will allow you to recover from ransomware? Does your incident response plan include a response for ransomware?
What is the Answer?
This is a trick question, since there may not be a single answer to achieving continuous availability. Your organization needs to understand the risks it faces, create plans and implement technology to address those risks, test those plans and that technology until they run like clockwork, then run a live fire exercise to ensure that everything will work in the real world – and that you can either meet your RTO/RPO or force the line of business to rethink its requirements.
You also need to keep up with the latest cyber threats through various information sharing portals such as the National Vulnerability Database (NVD), InfraGard, or the EU computer emergency response site (CERT-EU).
Finally, whether it’s politically correct or not, business continuity and cyber security must be joined at the hip. The two teams need to be practicing their incident response together, since a cyber incident could instantly become a disaster recovery response.