How to prevent silent data corruption

Silent data corruption is a class of errors caused by bugs in disk firmware, drivers, memory, networks, or various kinds of drive failures. These faults develop in the background, without any notification to the users of a system. Silent errors are among the most dangerous threats to data durability and reliability, because there is no indication that anything is wrong – you usually discover them when it is too late. As infrastructure size and drive capacity keep growing, silent data corruption is becoming an even greater concern.

What causes data corruption?

Hardware does not last forever. The most common causes of silent data corruption are hardware failures or interruptions, firmware or software bugs, and failures of RAM, disks, cables, or networks. Silent data corruption can also result from problems at higher levels – for example, a cloud orchestration system attaching the same disk to two virtual machines.

Data corruption can happen in multiple places:

  • In the storage hardware (hard disks, SSDs, NVMes)
  • In the operating system
  • During network transfers

Are corrupted files dangerous?

It is debatable which is worse – losing data or unknowingly using corrupted files. The biggest danger in such a situation is continuing to use corrupted files without any way to notice the problem. Undetected data corruption can lead to loss of customer trust and direct financial damage.

No matter what type of data you are managing, it is vital to keep backups or use data replication, so that your data stays safe and your business is not affected by silent data corruption.
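At the application level, the simplest defense is to record a checksum when data is written and verify it when the data is read back, so corruption is detected before the data is used. The sketch below is a hypothetical illustration of that idea (not a StorPool API), using SHA-256 from the Python standard library:

```python
import hashlib

def write_with_checksum(path, data):
    """Store data alongside a SHA-256 digest so later reads can be verified."""
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(digest)

def read_verified(path):
    """Read the data back and fail loudly if it no longer matches its digest."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path + ".sha256") as f:
        expected = f.read()
    if hashlib.sha256(data).hexdigest() != expected:
        # The bytes changed on disk without any I/O error - silent corruption.
        raise IOError(f"silent corruption detected in {path}")
    return data
```

Failing with an explicit error is the whole point: a corrupted file that raises an exception can be restored from a backup or replica, while one that is silently accepted cannot.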

How does StorPool combat silent data corruption?

StorPool is the only storage solution on the market that implements end-to-end data integrity. Other storage systems apply data integrity checks at individual steps of the data path, but not end-to-end. In contrast, StorPool protects data from the moment it first sees it until the moment it is returned to the end-user: it calculates a strong checksum when the data is submitted and verifies it before returning the data in read requests. The checksum always travels with the data and is recorded together with it.
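The end-to-end principle can be illustrated with a toy client/server model: the checksum is computed at the source, verified by the server before persisting, stored together with the data, and verified again by the client just before the read is returned. This is a simplified sketch, with CRC32 standing in for the stronger checksum a real system would use:

```python
import zlib

class Server:
    """Toy storage server: verifies the client's checksum before persisting."""
    def __init__(self):
        self.store = {}  # block_id -> (data, checksum)

    def write(self, block_id, data, checksum):
        if zlib.crc32(data) != checksum:          # corrupted in transit?
            raise IOError("checksum mismatch on write - rejecting")
        self.store[block_id] = (data, checksum)   # checksum is recorded with the data

    def read(self, block_id):
        return self.store[block_id]               # data and checksum travel together

class Client:
    """Toy client: computes the checksum at the source and re-verifies on read."""
    def __init__(self, server):
        self.server = server

    def write(self, block_id, data):
        self.server.write(block_id, data, zlib.crc32(data))

    def read(self, block_id):
        data, checksum = self.server.read(block_id)
        if zlib.crc32(data) != checksum:          # verify just before returning
            raise IOError("checksum mismatch on read")
        return data
```

Because the same checksum is checked at both ends, corruption introduced anywhere in between – on the network, in server memory, or at rest – surfaces as an explicit error instead of bad data.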

In addition, StorPool includes the following measures to prevent silent data corruption:

  • StorPool does not return I/O errors on timeouts and other transient conditions; instead, it retries indefinitely until the error is corrected. This is necessary because there is no good, meaningful handling of I/O errors at the application level, or even at the filesystem level.

We have seen this in some of our tests (we are working on publishing an article showing how XFS can discard data on I/O errors without notifying user space), and it also came up in the PostgreSQL community's discussion of the proper ways to persist data and handle I/O errors: https://lwn.net/Articles/752063/ .

  • StorPool does end-to-end checksums of all writes and reads, meaning that a checksum is computed in the client, verified by the server when writing, and verified in the client when reading. This protects against network errors and some memory errors.
  • StorPool does periodic “scrubbing” of the data – once a week it reads everything that’s stored and verifies its checksums. This ensures that any corruption that has occurred during the period is detected.
  • StorPool is triple-replicated by default, meaning that if an error is detected, it can be corrected from the other two copies of the data. Even if another drive in the system has failed, with three copies there is still a healthy copy from which redundancy can be restored and the data kept safe.
  • StorPool employs the “fail-fast” paradigm and stops processing when abnormal situations are detected (for example, memory errors or misbehaving hardware). This, combined with node redundancy, ensures that operations continue uninterrupted and remain correct.
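The scrubbing and replica-repair measures above can be sketched together: walk every stored block, recompute its checksum in each copy, and overwrite any corrupt copy with a verified one. This is an illustrative toy (the replicas are plain dicts, CRC32 stands in for a stronger checksum), not StorPool's actual implementation:

```python
import zlib

def scrub(replicas):
    """Re-read every block in every replica and repair corrupt copies.

    `replicas` is a list of dicts mapping block_id -> (data, checksum) -
    a stand-in for the three copies a triple-replicated system keeps.
    Returns the number of copies repaired.
    """
    repaired = 0
    for block_id in replicas[0]:
        good = None
        bad = []
        for replica in replicas:
            data, checksum = replica[block_id]
            if zlib.crc32(data) == checksum:
                good = (data, checksum)     # at least one verified copy
            else:
                bad.append(replica)
        if good is None:
            raise IOError(f"block {block_id}: all copies corrupt")
        for replica in bad:                 # restore redundancy from a good copy
            replica[block_id] = good
            repaired += 1
    return repaired
```

Running such a pass periodically bounds the window in which corruption can accumulate: a single bad copy is always repairable as long as the other replicas are still healthy.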

How to deal with silent data corruption?

The most important thing to understand when you manage data is that you must take preventive measures before such a situation occurs – so that you never have to search for information on how to fix corrupted files. The best approach is to ensure, via backups or replication, that any such corruption cannot affect you.
