Command Completion Time Limit (CCTL) RAID Error Recovery
Let's start from a foundation we all understand: desktop drives in desktop computers. A drive invokes error recovery when a normal read returns bad data from the media. How long recovery takes depends on the method used. Read recovery procedures include re-reading, off-track reads, modifying the channel parameters, resetting the head element, changing the error correction parameters, software correction, and so on. Recovering the data can take up to several seconds, and if the error spreads across multiple sectors, recovery can take a long time.

For a non-RAID desktop application, this long recovery time is needed because the data is not available anywhere else. A longer recovery has a better chance of retrieving the data, especially in situations where multiple blocks are involved and must be reread and rewritten.

Most often the desired file simply will not read or write. In more extreme situations, either the application or the O/S will hang, and the user will perform a three-key reset or push the reset button.
For RAID systems, the situation is different. The RAID controller is capable of reconstructing data when one of the disks is bad. It is important to understand that the RAID controller can deal with a failed drive, and that it can also deal with bad blocks on an operable disk drive. Think of Command Completion Time Limit (CCTL) as error handling coordination between the disk drive and the RAID controller. The key benefit of CCTL error handling is that it helps the RAID controller avoid mistaking some bad blocks for an entirely failed drive.
If the data is bad, the RAID system can use the data from the other HDDs to recreate the correct data for the host within a few milliseconds. The RAID system therefore wants the HDD to perform basic, essential retries but to avoid extensive retries that take up much of the time.
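As a minimal sketch of how a controller can recreate a bad block from the other drives, the example below assumes a single-parity layout in which the parity block is the byte-wise XOR of the data blocks. The text does not name a RAID level, so that layout, the function name xor_blocks and the sample stripe are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] reports a read error instead of data.
# XOR of the surviving data blocks and the parity block yields the lost block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Under these assumptions, rebuilt equals the block the failing drive could not return, which is why the controller gains little from waiting on a long drive-side recovery.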
The trade-off is as follows. The longer a retry takes, the longer the RAID delays the data, and many retries hurt performance. In addition, during a RAID rebuild, retry time also makes the rebuild slower. When enterprise performance depends on I/O per second, long retries can be a problem.

We will now examine the use of desktop drives with RAID controllers, including RAID controllers as typically found in servers, NAS or SAN systems. Desktop drives lack error handling coordination similar to CCTL. This creates a situation where a drive will attempt error recovery for a long time (more than 8 seconds). RAID controllers are typically set to wait 8 seconds on a non-responding disk drive; if the drive still does not respond, the RAID controller (mistakenly) assumes the drive has failed, drops it from the RAID array, and starts operating the RAID volume in RAID parity recovery mode.
In the simplest of terms, the solution is this: most RAID controllers wait 8 seconds on a non-responding drive, so we program the drive to send an error message to the RAID adapter after 7 seconds of attempted error recovery.
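The timing relationship can be sketched as a small model. This is an illustration only, not drive firmware or controller code: the function controller_outcome, its parameters and the outcome strings are hypothetical, and the only figures taken from the text are the 8 second controller wait and the 7 second drive limit.

```python
from typing import Optional

CONTROLLER_TIMEOUT_S = 8.0  # controller wait quoted in the text
CCTL_LIMIT_S = 7.0          # drive-side limit quoted in the text


def controller_outcome(recovery_needed_s: float,
                       cctl_limit_s: Optional[float]) -> str:
    """Classify what the controller sees for one troubled read command.

    recovery_needed_s: time the drive would need to finish its recovery.
    cctl_limit_s: drive-side time limit, or None for a drive without CCTL.
    """
    if cctl_limit_s is not None and recovery_needed_s > cctl_limit_s:
        # The drive stops retrying at the limit and reports an error.
        response_time_s = cctl_limit_s
        drive_reports_error = True
    else:
        # The drive keeps retrying until it finishes.
        response_time_s = recovery_needed_s
        drive_reports_error = False

    if response_time_s >= CONTROLLER_TIMEOUT_S:
        return "drive dropped from array"
    if drive_reports_error:
        return "read error reported, block rebuilt from other drives"
    return "data returned by the drive"


desktop = controller_outcome(15.0, None)             # no time limit on the drive
enterprise = controller_outcome(15.0, CCTL_LIMIT_S)  # CCTL-enabled drive
quick = controller_outcome(2.0, CCTL_LIMIT_S)        # recovery finishes in time
```

In this model the same 15 second recovery gets the desktop drive dropped, while the CCTL-enabled drive answers with an error at 7 seconds and stays in the array.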
Command Completion Time Limit is error handling logic (similar logic has been proven in SCSI systems for many years). Its purpose is to allow the Enterprise disk drive to perform error handling and to coordinate error messages with the RAID controller, so that the RAID controller will not mistake error recovery for a disk drive failure.
Because long error recovery by a disk drive usually involves only one or more files and not a completely failed drive, the RAID controller should treat long error recovery differently from a complete drive failure.

The example above shows an Enterprise disk drive programmed to send an error message to the RAID controller before the RAID controller deems the drive failed and drops it from the RAID array.
Most RAID controllers wait 8 seconds on a non-responding drive; a Samsung CCTL-enabled drive sends an error message to the controller after 7 seconds of attempted error recovery. This avoids a situation where the RAID controller mistakes bad block recovery for a failed drive.
Enterprise workloads involve lots of I/O, including random I/O and streaming I/O. Applications that create high levels of I/O, with many read/write requests, include email, video surveillance, call centers, software builds, supercomputing, broadcast video, databases, eCommerce, disk-to-disk backup and archival tasks. Drives in enterprise systems also typically encounter higher levels of heat and vibration.
Error recovery can occur and drive failures can occur, but an Enterprise workload with lots of I/O does not stop. Once a disk drive has failed (or been dropped from the RAID array), the remaining disk drives operate in parity mode. In parity mode, every read or write sent to the RAID volume involves a parity calculation, with a big impact on performance. Next, a systems administrator needs to replace the drive in the server. At that point the RAID controller rebuilds the RAID volume: during the parity rebuild, the data on the failed drive is recreated on the new drive using parity recalculation. Parity recalculation involves high I/O and has a substantial impact on performance. Furthermore, with large-capacity disk drives, parity recalculation for a large RAID volume can take many hours.
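A rough estimate shows why a rebuild can run to many hours. The sketch below assumes the rebuild is limited only by a steady write rate to the replacement drive. The capacity and rates are illustrative assumptions, not figures from the text or from any drive specification, and rebuild_hours is a hypothetical helper.

```python
def rebuild_hours(capacity_gb: float, rebuild_rate_mb_s: float) -> float:
    """Hours needed to rewrite one replacement drive at a steady rate."""
    if capacity_gb <= 0 or rebuild_rate_mb_s <= 0:
        raise ValueError("capacity and rate must be positive")
    seconds = (capacity_gb * 1000.0) / rebuild_rate_mb_s  # 1 GB = 1000 MB
    return seconds / 3600.0


# Assumed figures: a 1000 GB drive rebuilt at 50 MB/s on an otherwise idle array
idle = rebuild_hours(1000, 50)
# and the same rebuild slowed to 10 MB/s because the array keeps serving I/O.
busy = rebuild_hours(1000, 10)
```

With these assumed numbers the idle rebuild takes about 5.6 hours and the busy one about 27.8 hours, and the array stays unprotected for the whole of that time.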
During parity mode and parity recovery mode, the data is accessible but not protected. It is not unlike driving a car without a spare tire: if you encounter a second flat tire, your car is unusable. Similarly, if another drive fails (including through an error correction timeout) while the RAID volume is in parity rebuild, the data on the entire RAID volume is lost.
It bears repeating: if a RAID array is under parity rebuild, and therefore experiencing increased I/O, and a second disk drive fails or has an extended error correction, then the RAID array will fail and all the data on the RAID volume becomes unusable. Command Completion Time Limit (CCTL) error management, which handles coordination between the RAID controller and the Enterprise SATA disk drive, helps to avoid the initial error recovery problem and helps to avoid a catastrophic second error recovery problem as well.
Well-known, mature industry standards apply. The vast majority of Enterprise disk drives and the vast majority of SATA RAID controllers adhere to the T13 SCT command for CCTL (Command Completion Time Limit). This gives the RAID system options to set the command retry time for various conditions. The command used to set the time limit is defined in the T13 specification, so it is a standard method that all systems can use. The RAID application can use this option on every HDD that supports the function, as a consistent way of setting the time limit. The time limit can be set from hundreds to thousands of milliseconds. Samsung F1R Enterprise SATA disk drives support this SCT option, so Samsung F1R drives can be used successfully with SATA and SAS RAID controllers adhering to the T13 SCT command set.
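On a host where the operating system talks to the drive directly, one way to set this limit is the smartctl utility from smartmontools, whose "-l scterc,READTIME,WRITETIME" option takes both limits in units of 100 ms. The text does not mention smartmontools, so treat this as an assumption about tooling. The helper below only builds the argument list and does not run it. The device path and the function name are hypothetical, and the 1 to 999 range check is a bound chosen for this sketch.

```python
def scterc_command(device: str, read_limit_s: float, write_limit_s: float) -> list:
    """Build a smartctl argument list that sets SCT error recovery time limits."""

    def to_deciseconds(seconds: float) -> int:
        # smartctl expects tenths of a second, so 7.0 s becomes 70.
        value = round(seconds * 10)
        if not 1 <= value <= 999:
            raise ValueError("limit must be between 0.1 and 99.9 seconds")
        return value

    read_ds = to_deciseconds(read_limit_s)
    write_ds = to_deciseconds(write_limit_s)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# 7 second limits, matching the figure used earlier; /dev/sda is a placeholder.
cmd = scterc_command("/dev/sda", 7.0, 7.0)
```

The resulting list can be passed to subprocess.run on a system with smartmontools installed. Whether a given drive accepts the setting, and whether it persists across power cycles, depends on the drive.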






