A Quintuple Parity Error Correcting Code - A Game Changer in Data Protection

Publish Date: 
Wednesday, September 29, 2021
Abstract: 

We describe a replacement for RAID 6 based on a new linear, systematic code that detects and corrects any combination of E errors (unknown location) and Z erasures (known location), provided that Z+2E≤4. The code is at the core of a RAID technology called PentaRAID, for which the two co-inventors were awarded a US utility patent.

The problem we address is the weak data protection of RAID 6 systems. The known vulnerability is that RAID 6 can recover from at most 2 failed disks in an array, and only if we know which disks failed. If we do not know which disks are the source of errors, we are protected from only 1 disk failure. Moreover, if 1 disk fails, its data must be rebuilt, and with hard disks reaching 40TB, the recovery process lasts for weeks, during which the array runs in degraded mode. While in degraded mode, a second disk failure leaves the system with no error detection or correction. In addition, Undetected Disk Errors (UDE) can only be detected, but not corrected, even with one failed disk.

The natural solution is to increase redundancy from 2 disks to more. There is very little payoff from using 3 disks. It turns out that a practical solution is possible with 5 redundant disks, and this solution is employed in PentaRAID. The payoff is immense: the RAID extends Mean Time to Data Loss (MTDL) from days to far beyond the age of the universe (100 quadrillion years) under typical assumptions about disk error rates. The new RAID can tolerate the loss of 2 drives at unknown locations (drives that seemingly operate normally but generate UDE), or the loss of up to 4 disks at known locations, e.g. due to power failure (typically detected by the disk controller).

In addition, the recovery process involves a fixed, small number of Galois field operations per error, and therefore has a virtually fixed computational cost per error, independent of the number of disks in the array. Parity calculation also takes constant time per byte of data if distributed computation is used. In short, the computational complexity is on par with that of RAID 6. Notably, the solution does not require the Chien search commonly used in Reed-Solomon decoding, which makes it possible to use large Galois fields. Applying the new technology will dramatically increase data durability while significantly reducing the number of hard disks necessary to maintain data integrity, and will simplify the logistics of operating a RAID in degraded mode and recovering from disk failures.
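
To make the bound Z+2E≤4 concrete, the following minimal Python sketch enumerates the failure patterns it admits. It only checks the inequality stated in the abstract; it does not implement the PentaRAID code or its decoder, and the function names and the `budget` parameter are hypothetical, chosen for illustration.

```python
def is_correctable(errors: int, erasures: int, budget: int = 4) -> bool:
    """Check the abstract's condition Z + 2E <= budget.

    errors   -- E, failed disks at unknown locations
    erasures -- Z, failed disks at known locations
    budget   -- right-hand side of the bound (4 in the abstract); the
                parameter name is an assumption made for this sketch
    """
    if errors < 0 or erasures < 0:
        raise ValueError("counts must be non-negative")
    return erasures + 2 * errors <= budget


def correctable_patterns(budget: int = 4) -> list[tuple[int, int]]:
    """List every (E, Z) pair that satisfies Z + 2E <= budget."""
    return [
        (e, z)
        for e in range(budget // 2 + 1)
        for z in range(budget - 2 * e + 1)
    ]


if __name__ == "__main__":
    for e, z in correctable_patterns():
        print(f"E={e} errors, Z={z} erasures")
```

The extreme cases match the abstract: (E=2, Z=0) is two silently corrupting drives, and (E=0, Z=4) is four drives lost at known locations. Mixed cases such as (E=1, Z=2) also satisfy the bound, while (E=2, Z=1) does not.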

  • Giving the audience an overview of the new error correcting code and the decoding algorithm intended as a replacement for RAID 6.
  • Explaining quantitatively how data protection and durability will be affected by the new technology.
  • Explaining the technology from the point of view of an IT department.
  • Explaining the technology implementation from the point of view of an electrical or software engineer.
