I'm working with an embedded ARM platform with built-in NAND flash. My rootfs partition is squashfs. Both u-boot and the kernel use OMAP_ECC_BCH8_CODE_HW. The problem is that several boards (not just one) stopped working after a power outage, after about 2 months in use.
These errors can be seen while booting:
[ 8.270507] end_request: I/O error, dev mtdblock9, sector 25184
[ 8.278930] SQUASHFS error: squashfs_read_data failed to read block 0xc40396
[ 8.286376] SQUASHFS error: Unable to read fragment cache entry [c40396]
[ 8.293579] SQUASHFS error: Unable to read page, block c40396, size d696
[ 8.300628] SQUASHFS error: Unable to read fragment cache entry [c40396]
[ 8.307647] SQUASHFS error: Unable to read page, block c40396, size d696
[ 8.314819] SQUASHFS error: Unable to read fragment cache entry [c40396]
[ 8.321838] SQUASHFS error: Unable to read page, block c40396, size d696
[ 8.328887] SQUASHFS error: Unable to read fragment cache entry [c40396]
[ 8.335906] SQUASHFS error: Unable to read page, block c40396, size d696
[ 8.343017] SQUASHFS error: Unable to read fragment cache entry [c40396]
[ 8.350006] SQUASHFS error: Unable to read page, block c40396, size d696
/usr/sbin/lighttpd: '/usr/lib/libpcre.so.1' is not an ELF file
/usr/sbin/lighttpd: can't load library 'libpcre.so.1'
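As a sanity check on the log above, here is a small sketch that tests whether the failing mtdblock sector falls inside the squashfs block reported as unreadable. It assumes the sector number is in 512-byte units and that the squashfs block address and size are both printed in hex; the variable names are mine, only the numbers come from the log.

```python
# Assumption: block-layer sector numbers are in 512-byte units.
SECTOR_SIZE = 512

failed_sector = 25184   # "end_request: I/O error, dev mtdblock9, sector 25184"
block_start = 0xC40396  # "squashfs_read_data failed to read block 0xc40396"
block_size = 0xD696     # "Unable to read page, block c40396, size d696" (assumed hex)

# Byte offset of the failing sector within the partition.
failed_offset = failed_sector * SECTOR_SIZE

# Half-open byte range [block_start, block_end) covered by the squashfs block.
block_end = block_start + block_size
inside = block_start <= failed_offset < block_end

print(f"failed sector starts at byte 0x{failed_offset:x}")
print(f"squashfs block spans 0x{block_start:x}..0x{block_end:x}")
print(f"failing sector inside squashfs block: {inside}")
```

Under those assumptions, the single I/O error on mtdblock9 and the repeated SQUASHFS errors refer to the same region of the partition.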
How should I debug this? I haven't erased the flash, so it's still possible to run some tests on it.
What I've done so far:
I used nanddump (with -o, to read OOB data) on the bad partition and noticed three ECC correction warnings. When I wrote this dump to another board, it booted without a problem.
When I used nanddump with the additional option -n (--noecc, read without error correction) and wrote the dump to another board (using nandwrite -n), the second board was unable to boot.
It seems to me that these errors are recoverable, and that's why nanddump corrected them in the first case. I compared the two dumps and there are only three differences (presumably the 3 ECC corrections reported by nanddump):
# diff mtd_without_ecc.hex mtd_with_ecc.hex
486347c486347
< 076bca0: 59d2 d8bc 3e89 1c67 a6c2 74a0 bc38 4873 Y...>..g..t..8Hs
---
> 076bca0: 59d2 d8bc 3e09 1c67 a6c2 74a0 bc38 4873 Y...>..g..t..8Hs
783769c783769
< 0bf5980: e31e f50a e5b5 6ae5 5a67 8be1 7636 9cf2 ......j.Zg..v6..
---
> 0bf5980: e31e f50a e5b5 6aa5 5a67 8be1 7636 9cf2 ......j.Zg..v6..
1315929c1315929
< 1414580: a9ec ef89 ac52 c8a5 61f5 5d0b 6ee2 af41 .....R..a.].n..A
---
> 1414580: a9ec af89 ac52 c8a5 61f5 5d0b 6ee2 af41 .....R..a.].n..A
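To check whether each difference is a single flipped bit, here is a minimal sketch that XORs the raw and ECC-corrected bytes. The hex strings are copied from the three differing lines of the diff above; it assumes the hexdump is in xxd-style layout (bytes in file order, 16 per line), and the helper name `bit_flips` is mine.

```python
# Each tuple: (dump offset of the hexdump line, raw bytes, ECC-corrected bytes).
# Values are copied from the diff; the layout is assumed to be xxd-style.
PAIRS = [
    (0x076BCA0, "59d2d8bc3e891c67a6c274a0bc384873", "59d2d8bc3e091c67a6c274a0bc384873"),
    (0x0BF5980, "e31ef50ae5b56ae55a678be176369cf2", "e31ef50ae5b56aa55a678be176369cf2"),
    (0x1414580, "a9ecef89ac52c8a561f55d0b6ee2af41", "a9ecaf89ac52c8a561f55d0b6ee2af41"),
]


def bit_flips(base, raw_hex, fixed_hex):
    """Return (offset, xor_mask, flipped_bit_count) for every differing byte."""
    raw = bytes.fromhex(raw_hex)
    fixed = bytes.fromhex(fixed_hex)
    flips = []
    for i, (r, f) in enumerate(zip(raw, fixed)):
        mask = r ^ f  # set bits mark positions where the two dumps disagree
        if mask:
            flips.append((base + i, mask, bin(mask).count("1")))
    return flips


RESULTS = [flip for pair in PAIRS for flip in bit_flips(*pair)]

for offset, mask, nbits in RESULTS:
    print(f"dump offset 0x{offset:07x}: xor mask 0x{mask:02x}, {nbits} bit(s) flipped")
```

Each of the three differences turns out to be exactly one flipped bit in one byte, which is well within what BCH8 is designed to correct per ECC step. Note that the offsets are positions in the dump file, which includes OOB data, so they are not directly offsets into the partition data.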
The question is: why weren't these errors corrected automatically by the system? Is it because squashfs is not an "mtd-aware" filesystem and shouldn't be used directly on mtd devices? If so, should I use squashfs over UBI? And what about the kernel then (as far as I know, it has to be a raw image in order to boot it from u-boot)?
Thanks for any help!