For StorPool, it’s a never-ending mission to provide the best software-defined block storage on the market. We are really excited to be featured on Architecting IT. In this series of posts, you’ll learn more about StorPool’s technology, with a hands-on and in-depth look at how the distributed storage architecture works, how it performs, and how it integrates into an on-premises public cloud strategy.
In this post, we will look at failure modes and data recovery.
Persistent storage forms the basis for ensuring the long-term retention of data in the enterprise. Before the development and formalisation of RAID in 1987, enterprise customers (mainly mainframe-based) had to rely on data recovery from backups in the event of a hardware failure. Modern IT systems offer highly resilient storage using RAID or other data redundancy techniques to ensure close to 100% uptime, with little or no need to recover data from backups after hardware failures.
Keeping systems running after a system or component failure is only one aspect of data resiliency. Although modern disk drives and SSDs are incredibly reliable, both types of devices can produce intermittent errors, such as failed sector reads or unrecoverable bit errors (UBEs), where a read request cannot return valid data.
Modern hard drives offer reliability ratings based on AFR – Annual Failure Rate – typically around 0.44%. This means that in a set of 1,000 unprotected drives, 4-5 devices will fail each year. This is, of course, an average, and some customers will see higher or lower failure rates in their infrastructure. Unrecoverable bit error rates (UBERs) are typically around 1 in 10^15 bits read, which works out to roughly one failed sector per 125TB of data read, or about one error in every 60 full reads of a 2TB HDD. The UBER risk may seem small, but it isn't evenly distributed across drives and sectors, so some drives may see many more errors than others, and in a much shorter time frame.
Solid-state disks offer similar levels of AFR to hard drives. UBER rates are generally much better, at around 1 in 10^17 bits read, although SSDs have limited write endurance compared to HDDs.
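As a back-of-the-envelope check, these reliability figures translate into concrete expectations. The short Python sketch below (illustrative arithmetic only, using the drive size and spec values quoted above) computes the expected annual failures in a 1,000-drive fleet and how many full reads of a 2TB HDD correspond to one expected unrecoverable bit error:

```python
# Back-of-the-envelope reliability arithmetic (illustrative only).

afr = 0.0044                 # 0.44% Annual Failure Rate
fleet = 1000                 # unprotected drives
expected_failures = afr * fleet   # drives expected to fail per year

# UBER of 1 in 10^15 bits read (a typical HDD spec figure)
uber_bits = 1e15
drive_bits = 2e12 * 8        # a 2TB drive, expressed in bits
full_reads_per_error = uber_bits / drive_bits

print(f"Expected drive failures per year: {expected_failures:.1f}")
print(f"Full 2TB reads per expected UBE: {full_reads_per_error:.1f}")
```

Numbers like these are why redundancy is treated as mandatory at scale: with thousands of drives, both whole-device failures and read errors become routine events rather than rarities.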
In addition to device/component failures and media read errors, a third recovery scenario occurs with distributed storage solutions such as StorPool. In a distributed architecture, nodes communicate over the network to replicate and share data. If a node drops out of the system for a short time, whether for planned maintenance or due to a network or server error, any data changed or written in the meantime is immediately out of synchronisation. Distributed systems must re-establish consistency without compromising data integrity. This scenario can also occur when individual drives are unexpectedly removed, either by mistake or due to a system fault.
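A common way for distributed systems to avoid a full re-copy after a short outage is to track which regions of a volume changed while a replica was absent, then resynchronise only those. The sketch below is a generic illustration of that idea; the class, names and the 1MiB extent size are assumptions for the example, not StorPool's actual implementation:

```python
# Generic sketch of out-of-sync tracking for a replica that rejoins
# after a short absence. All names and the 1 MiB extent size are
# illustrative assumptions, not StorPool's implementation.

EXTENT = 1 << 20  # track changes at 1 MiB granularity

class DirtyLog:
    def __init__(self):
        self.dirty = set()  # extents written while a replica was absent

    def record_write(self, offset, length):
        first = offset // EXTENT
        last = (offset + length - 1) // EXTENT
        self.dirty.update(range(first, last + 1))

    def resync_plan(self):
        # Only the changed extents are copied to the returning replica,
        # instead of re-replicating the entire volume.
        return sorted(self.dirty)

log = DirtyLog()
log.record_write(0, 4096)            # write at the start of extent 0
log.record_write(3 * EXTENT, 8192)   # write within extent 3
print(log.resync_plan())             # [0, 3]
```

The key property is that recovery time scales with the amount of data changed during the outage, not with the total capacity of the volume.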
Data Management Processes
From the challenges already described in managing media and server nodes, we can summarise the tasks that a distributed storage solution needs to perform.
- Data resiliency – implementation of RAID or data mirroring.
- Data recovery – automated recovery from device failure using redundant data.
- Data consistency – recovering from node failure.
- Data integrity – validating written data is correctly stored on persistent media.
We will look at each of these four requirements and show how StorPool addresses each of them.
The standard process for protecting data in modern storage systems is to use data redundancy, either through a RAID architecture or through data mirroring. In the first post of this series, we looked at data resiliency and placement across placement groups and disk sets.
This placement process ensures that logical volumes use all available storage performance (a process called wide striping) while maintaining resiliency. Data is distributed across nodes, so any single media or node failure will not result in data loss.
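The wide-striping idea can be shown with a deliberately simplified placement function. Real placement (StorPool's included) is considerably more sophisticated; this sketch only demonstrates the core invariant that every chunk gets multiple copies, each on a different node, while load spreads across all nodes:

```python
# Illustrative wide-striping placement of volume chunks across nodes.
# A real system uses far more sophisticated placement; this sketch only
# shows the invariant: `replicas` copies per chunk, on distinct nodes.

def place_chunks(num_chunks, nodes, replicas=3):
    placement = {}
    for chunk in range(num_chunks):
        # rotate the starting node so load spreads across all nodes
        start = chunk % len(nodes)
        placement[chunk] = [nodes[(start + r) % len(nodes)]
                            for r in range(replicas)]
    return placement

layout = place_chunks(num_chunks=6, nodes=["n1", "n2", "n3", "n4"])
for chunk, copies in layout.items():
    # losing any single node still leaves two copies of every chunk
    print(chunk, copies)
```

Because every chunk's replicas land on distinct nodes, any single media or node failure removes at most one copy of any given chunk, which is exactly the resiliency property described above.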
However, as clusters grow and become more complex, individual workloads may need additional protection, whereas some temporary workloads could run with no data protection in place. StorPool provides all these capabilities and allows changes to be made dynamically using Placement Groups and Templates.
It’s also possible to drop physical drives from logical volumes. This capability is useful when a drive is known to be failing and is exhibiting errors: the drive can be removed from critical or highly active volumes while redundancy for that data is rebuilt elsewhere.
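Conceptually, draining a suspect drive means finding every chunk with a copy on it and rebuilding that copy on a healthy drive. The sketch below illustrates this; the function and drive names are hypothetical and this is not StorPool's actual rebuild logic:

```python
# Sketch of draining a failing drive: each chunk with a copy on the
# failing drive gets a replacement copy on a healthy drive that does
# not already hold that chunk. Names are illustrative only.

def drain(placement, failing, healthy):
    for chunk, copies in placement.items():
        if failing in copies:
            # pick a healthy drive not already holding this chunk
            target = next(d for d in healthy if d not in copies)
            copies[copies.index(failing)] = target
    return placement

layout = {0: ["d1", "d2", "d3"], 1: ["d2", "d3", "d4"]}
drained = drain(layout, failing="d2", healthy=["d4", "d5"])
print(drained)
```

Throughout the process the data remains fully redundant: the failing drive's copies are only retired as replacements come online elsewhere.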