HITACHI HDD DMA errors, or end of life

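The error I wrote about earlier *1 still isn't fixed.
I wrote that other servers with the same hardware didn't show the error, but now roughly 7 out of 10 machines are showing it. This is bad.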

Sep  3 23:00:49 node2 kernel: [5538990.017176] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  3 23:00:49 node2 kernel: [5538990.035765] ata3.00: failed command: WRITE DMA
Sep  3 23:00:49 node2 kernel: [5538990.045068] ata3.00: cmd ca/00:08:58:c7:58/00:00:00:00:00/e1 tag 0 dma 4096 out
Sep  3 23:00:49 node2 kernel: [5538990.045069]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep  3 23:00:49 node2 kernel: [5538990.082851] ata3.00: status: { DRDY }
Sep  3 23:00:49 node2 kernel: [5538990.092359] ata3: hard resetting link
Sep  3 23:00:54 node2 kernel: [5538995.609179] ata3: link is slow to respond, please be patient (ready=0)
Sep  3 23:00:59 node2 kernel: [5539000.033140] ata3: SRST failed (errno=-16)
Sep  3 23:00:59 node2 kernel: [5539000.042265] ata3: hard resetting link
Sep  3 23:01:04 node2 kernel: [5539005.557139] ata3: link is slow to respond, please be patient (ready=0)
Sep  3 23:01:09 node2 kernel: [5539010.037139] ata3: SRST failed (errno=-16)
Sep  3 23:01:09 node2 kernel: [5539010.046119] ata3: hard resetting link
Sep  3 23:01:14 node2 kernel: [5539015.561139] ata3: link is slow to respond, please be patient (ready=0)
Sep  3 23:01:36 node2 kernel: [5539037.457202] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep  3 23:01:36 node2 kernel: [5539037.489655] ata3.00: configured for UDMA/33
Sep  3 23:01:36 node2 kernel: [5539037.498744] ata3.00: device reported invalid CHS sector 0
Sep  3 23:01:36 node2 kernel: [5539037.507778] ata3: EH complete

It happens on other servers too. The hardware configuration is exactly the same, but a different error shows up.

Sep  4 00:57:09 node4 kernel: [5840828.368425] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Sep  4 00:57:09 node4 kernel: [5840828.368457] ata1.00: BMDMA stat 0x65
Sep  4 00:57:09 node4 kernel: [5840828.368483] ata1.00: failed command: READ DMA EXT
Sep  4 00:57:09 node4 kernel: [5840828.368513] ata1.00: cmd 25/00:00:00:16:00/00:04:00:00:00/e0 tag 0 dma 524288 in
Sep  4 00:57:09 node4 kernel: [5840828.368514]          res 51/40:6c:94:16:00/40:01:00:00:00/e0 Emask 0x9 (media error)
Sep  4 00:57:09 node4 kernel: [5840828.368606] ata1.00: status: { DRDY ERR }
Sep  4 00:57:09 node4 kernel: [5840828.368632] ata1.00: error: { UNC }
Sep  4 00:57:09 node4 kernel: [5840828.392583] ata1.00: configured for UDMA/133
Sep  4 00:57:09 node4 kernel: [5840828.408359] ata1.01: configured for UDMA/100
Sep  4 00:57:09 node4 kernel: [5840828.408491] ata1: EH complete
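
With 7 of 10 machines affected, it helps to tally which host, port and command fail and why, instead of reading the logs by eye. Below is a minimal sketch, assuming syslog-style lines like the ones above; the function name `summarize_ata_errors` is made up for illustration and the patterns are only matched against the two log excerpts shown here.

```python
import re
from collections import Counter

# "ataN.NN: failed command: <name>" names the ATA command that failed.
FAILED_RE = re.compile(r"(ata\d+(?:\.\d+)?): failed command: (.+)$")
# The "res ..." line that follows ends with the error mask and its reason.
RES_RE = re.compile(r"\bres \S+ Emask (0x[0-9a-f]+) \(([^)]+)\)")

def summarize_ata_errors(lines):
    """Count (host, device, command, reason) tuples in syslog-style kernel lines."""
    counts = Counter()
    pending = None  # (host, device, command) still waiting for its "res" line
    for line in lines:
        line = line.rstrip()
        failed = FAILED_RE.search(line)
        if failed:
            host = line.split()[3]  # syslog layout: month day time host ...
            pending = (host, failed.group(1), failed.group(2))
            continue
        res = RES_RE.search(line)
        if res and pending:
            counts[pending + (res.group(2),)] += 1
            pending = None
    return counts
```

Feeding it the two excerpts above would give one `timeout` on node2 ata3.00 (WRITE DMA) and one `media error` on node4 ata1.00 (READ DMA EXT), which already separates the link/timeout type of failure from the unreadable-sector type.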

Checking the drive information with hdparm.

# hdparm -i /dev/sda

/dev/sda:

 Model=Hitachi HDT721032SLA360, FwRev=ST2OA31B, SerialNo=STF202ML1RP4VP
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=56
 BuffType=DualPortCache, BuffSize=15001kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=625142448
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5 udma6 
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode


# hdparm -i /dev/sdb

/dev/sdb:

 Model=Hitachi HDT721032SLA360, FwRev=ST2OA31B, SerialNo=STF202ML1RPAAP
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=56
 BuffType=DualPortCache, BuffSize=15001kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=625142448
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode

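The OS and the exact versions differ, but this post says:

"Googling for VT8237, you very occasionally run into threads along the lines of 'I'm getting errors around the hard disk.' They only raise the problem, though; there is no information that leads to a fix.
Actually, it seems the Serial ATA version of this HITACHI HDD has poor compatibility with the motherboard and does not work."
http://d.hatena.ne.jp/eel3/20081124/1229782890

The way one of the two HDDs switched itself down to udma2 is also similar.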
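
The downgrade is visible in the `hdparm -i` output above as the position of the `*` on the "UDMA modes:" line. Here is a small sketch for pulling that out when checking many drives; `active_udma_mode` is a hypothetical helper name, and the speed table lists the nominal ATA transfer rates in MB/s (udma2 is the "UDMA/33" that the kernel log reports after the reset).

```python
import re

# Nominal UDMA transfer rates in MB/s, as used in names like "UDMA/33".
UDMA_MBPS = {0: 16, 1: 25, 2: 33, 3: 44, 4: 66, 5: 100, 6: 133}

def active_udma_mode(hdparm_output):
    """Return the UDMA mode number marked with '*' in `hdparm -i` output, or None."""
    for line in hdparm_output.splitlines():
        if line.strip().startswith("UDMA modes:"):
            match = re.search(r"\*udma(\d+)", line)
            return int(match.group(1)) if match else None
    return None

def describe(hdparm_output):
    """Format the active mode the way the kernel log names it, e.g. 'UDMA/33'."""
    mode = active_udma_mode(hdparm_output)
    return None if mode is None else f"UDMA/{UDMA_MBPS[mode]}"
```

For the two drives above this yields UDMA/33 for /dev/sda and UDMA/133 for /dev/sdb.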

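The HDD model number is different, but are HITACHI HDDs just bad?
For what it's worth, without Xen + DRBD the errors almost never occur. Though I have a feeling I did see them in the past. The logs are gone, so there is no way to tell now.
With Xen + DRBD + software RAID, an HDD that throws errors frequently micro-freezes so often that processing speed drops to about 1/100.
The smartctl output for the HDD with the most errors: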

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   011   011   016    Pre-fail  Always   FAILING_NOW 3322075356
  2 Throughput_Performance  0x0005   087   087   054    Pre-fail  Offline      -       1087
  3 Spin_Up_Time            0x0007   121   121   024    Pre-fail  Always       -       225 (Average 187)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       32
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       12729
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       53
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       53
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Lifetime Min/Max 25/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

Raw_Read_Error_Rate looks like it's done for (VALUE 011 is below THRESH 016, hence FAILING_NOW).
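
smartctl treats an attribute as failed when its normalized VALUE is at or below THRESH. The sketch below applies that rule to the attribute table, and additionally reports non-zero raw counts for the reallocation and pending-sector attributes (IDs 5, 196, 197, 198). The second check is my own assumption about what is worth watching, not something smartctl itself flags, and the function names are invented for this example.

```python
def parse_smart_attributes(text):
    """Parse the rows of smartctl's 'Vendor Specific SMART Attributes' table."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)  # the 10th field (RAW_VALUE) may contain spaces
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # header line, blank line, or unrelated text
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "thresh": int(parts[5]),
            "raw": parts[9].strip(),
        })
    return rows

def failing(rows):
    """Names of attributes whose normalized value is at or below the threshold."""
    return [r["name"] for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]

# Assumption: these sector-health counters are worth a look whenever non-zero.
WATCHED_IDS = {5, 196, 197, 198}

def nonzero_sector_counters(rows):
    """Map attribute ID to raw count for watched attributes with a non-zero raw value."""
    found = {}
    for r in rows:
        if r["id"] in WATCHED_IDS:
            count = int(r["raw"].split()[0])
            if count > 0:
                found[r["id"]] = count
    return found
```

Run on the table above, `failing` returns only Raw_Read_Error_Rate; a drive can pass that check and still show reallocated or pending sectors in the raw counts.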

The same HDD model, on a machine where no errors are showing up.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   091   091   016    Pre-fail  Always       -       2883598
  2 Throughput_Performance  0x0005   099   099   054    Pre-fail  Offline      -       348
  3 Spin_Up_Time            0x0007   123   123   024    Pre-fail  Always       -       225 (Average 181)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   123   123   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       13966
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       43
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Lifetime Min/Max 23/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

An HDD that shows errors occasionally.

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   130   130   054    Pre-fail  Offline      -       123
  3 Spin_Up_Time            0x0007   120   120   024    Pre-fail  Always       -       228 (Average 188)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       21
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   125   125   020    Pre-fail  Offline      -       33
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       12936
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       543
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       543
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Lifetime Min/Max 25/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       24
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       64
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

...

Error 81 occurred at disk power-on lifetime: 12872 hours (536 days + 8 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 24 5c bd 19 e8  Error: UNC 36 sectors at LBA = 0x0819bd5c = 135904604

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 80 bc 19 e0 08   2d+14:43:28.800  READ DMA EXT
  27 00 00 00 00 00 e0 08   2d+14:43:28.800  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 08   2d+14:43:28.800  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08   2d+14:43:28.700  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 08   2d+14:43:28.700  READ NATIVE MAX ADDRESS EXT

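Error 81, really. Raw_Read_Error_Rate doesn't seem to be doing its job (it reads 0 on this drive despite 81 logged errors).
I thought these drives hadn't been used much, but the power-on hours have added up. About a year and a half? Did something like a Sony timer kick in?
Can more than half of them really be on the verge of failing in under two years, and all within such a short period? I'd expect more spread. Well, in my experience so far, drives that fail do tend to fail at around one and a half to two years.
My vague impression was that HITACHI drives don't break easily, but apparently that was wrong.
I suspected a HITACHI + SATA compatibility problem, but it may simply be end of life.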
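
As a quick check on the "about a year and a half" estimate, Power_On_Hours can be converted to days and years. This is plain arithmetic; the helper name is made up, and a year is taken as 365.25 days.

```python
def power_on_time(hours):
    """Split power-on hours into (days, leftover hours, fractional years)."""
    days, rem = divmod(hours, 24)
    years = hours / (24 * 365.25)
    return days, rem, years

for h in (12729, 12872, 13966):
    days, rem, years = power_on_time(h)
    print(f"{h} h = {days} days + {rem} h = {years:.2f} years")
```

The 12872-hour figure comes out as 536 days + 8 hours, matching the smartctl error log, and all three drives land between roughly 1.4 and 1.6 years.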

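So I swapped the HDDs for Seagate ones.
On the same configuration I've been writing large amounts of data and rebuilding the RAID, and so far everything is going smoothly. No errors in particular.

The Seagate HDD I use for backups gets a fair amount of writes every day, has been running for over 20,000 hours, and is still healthy.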

*1: http://d.hatena.ne.jp/n314/20110630/1309409374
