I'm working with an embedded ARM platform with built-in NAND flash. My rootfs partition is squashfs. Both U-Boot and the kernel use OMAP_ECC_BCH8_CODE_HW. The problem is that several boards (not just one) stopped working after a power outage, after about 2 months of use.

These errors can be seen while booting:

[    8.270507] end_request: I/O error, dev mtdblock9, sector 25184
[    8.278930] SQUASHFS error: squashfs_read_data failed to read block 0xc40396
[    8.286376] SQUASHFS error: Unable to read fragment cache entry [c40396]
[    8.293579] SQUASHFS error: Unable to read page, block c40396, size d696
[    8.300628] SQUASHFS error: Unable to read fragment cache entry [c40396]
[    8.307647] SQUASHFS error: Unable to read page, block c40396, size d696
[    8.314819] SQUASHFS error: Unable to read fragment cache entry [c40396]
[    8.321838] SQUASHFS error: Unable to read page, block c40396, size d696
[    8.328887] SQUASHFS error: Unable to read fragment cache entry [c40396]
[    8.335906] SQUASHFS error: Unable to read page, block c40396, size d696
[    8.343017] SQUASHFS error: Unable to read fragment cache entry [c40396]
[    8.350006] SQUASHFS error: Unable to read page, block c40396, size d696
/usr/sbin/lighttpd: '/usr/lib/libpcre.so.1' is not an ELF file
/usr/sbin/lighttpd: can't load library 'libpcre.so.1'

How should I debug this? I haven't erased the flash, so it's still possible to run some tests on it.

What I've done so far:

  1. I ran nanddump (with -o, read OOB data) on the bad partition and noticed three ECC correction warnings. When I wrote this dump to another board, it booted without a problem.

  2. When I ran nanddump with the additional option -n (--noecc, read without error correction) and wrote the result to another board (using nandwrite -n), the second board was unable to boot.

It seems to me that these errors are recoverable, which is why nanddump corrected them in the first case. I compared the two dumps and there are only three differences (the 3 ECC corrections reported by nanddump?):

# diff mtd_without_ecc.hex mtd_with_ecc.hex 

486347c486347
< 076bca0: 59d2 d8bc 3e89 1c67 a6c2 74a0 bc38 4873  Y...>..g..t..8Hs
---
> 076bca0: 59d2 d8bc 3e09 1c67 a6c2 74a0 bc38 4873  Y...>..g..t..8Hs
783769c783769
< 0bf5980: e31e f50a e5b5 6ae5 5a67 8be1 7636 9cf2  ......j.Zg..v6..
---
> 0bf5980: e31e f50a e5b5 6aa5 5a67 8be1 7636 9cf2  ......j.Zg..v6..
1315929c1315929
< 1414580: a9ec ef89 ac52 c8a5 61f5 5d0b 6ee2 af41  .....R..a.].n..A
---
> 1414580: a9ec af89 ac52 c8a5 61f5 5d0b 6ee2 af41  .....R..a.].n..A
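
Each of the three differences appears to be a single flipped bit, which a short script can confirm. This is only a minimal sketch: the byte values and offsets are copied by hand from the diff above, and `flipped_bits` is an illustrative helper, not part of any MTD tool.

```python
def flipped_bits(raw, corrected):
    """Return the bit positions (0 = LSB) that differ between two bytes."""
    diff = raw ^ corrected
    return [bit for bit in range(8) if diff & (1 << bit)]


# (offset in the dump, byte read with --noecc, byte read with ECC),
# taken from the three diff hunks above.
differences = [
    (0x076BCA5, 0x89, 0x09),
    (0x0BF5987, 0xE5, 0xA5),
    (0x1414582, 0xEF, 0xAF),
]

for offset, raw, corrected in differences:
    bits = flipped_bits(raw, corrected)
    print(f"0x{offset:07x}: {len(bits)} bit(s) flipped, position(s) {bits}")
```

One flipped bit per location is well within what an 8-bit BCH code can correct, which matches nanddump being able to recover the data.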

The question is: why weren't these errors corrected by the system automatically? Is it because squashfs is not an "MTD-aware" filesystem and shouldn't be used on MTD devices? If so, should I use squashfs over UBI? And what about the kernel then (as far as I know, it has to be a raw image in order to boot it from U-Boot)?

Thanks for any help!

1 Answer

Indeed, the Linux MTD layer doesn't do any maintenance on the NAND/NOR memory.

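For example, when a bitflip happens on your NAND, it's corrected by the ECC. The MTD layer is aware of that, but it doesn't DO anything about it: it just reports the condition to its caller and never rewrites (scrubs) the affected block, so the bitflip stays in the flash.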

So you need another layer on top of MTD to take care of that.
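
To make "take care of that" a bit more concrete, here is a rough sketch of the kind of decision such a layer could make for each read. It assumes the layer sees the Linux MTD status codes `-EUCLEAN` (bitflips found and corrected) and `-EBADMSG` (uncorrectable ECC error); `handle_read_status` and `scrub_queue` are hypothetical names for illustration, not UBI's actual implementation.

```python
import errno

# Linux errno values. EUCLEAN is not defined on every platform,
# so fall back to its Linux value (117) if it is missing.
EUCLEAN = getattr(errno, "EUCLEAN", 117)  # bitflips corrected by ECC
EBADMSG = errno.EBADMSG                   # ECC could not correct the data


def handle_read_status(status, block, scrub_queue):
    """Decide what to do with the status of one flash read.

    Returns True if the data is usable, False otherwise.
    """
    if status == 0:
        return True
    if status == -EUCLEAN:
        # The data is valid, but the block is degrading:
        # remember it so it can be rewritten (scrubbed) later.
        if block not in scrub_queue:
            scrub_queue.append(block)
        return True
    # -EBADMSG or any other error: the data cannot be trusted.
    return False
```

The point of the sketch is the middle branch: a corrected bitflip is treated as usable data plus a reminder to rewrite the block, rather than as a read failure.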

One solution is to use UBI, which is designed to solve this kind of problem. Have a look at the UBI documentation on linux-mtd. If you want to stick with squashfs, it's possible to add another MTD abstraction on top of UBI (gluebi), then run squashfs on top of that. The result looks like this:

---------------------
|      SquashFS     |
---------------------
|     MTD block     |
---------------------
| MTD API (gluebi)  |
---------------------
|        UBI        |
---------------------
|     MTD driver    |
---------------------
|     Flash Chip    |
---------------------

It makes a scary picture, but it works pretty well ;)

Have a look at these slides from free-electrons for more info (the picture comes from slide 47).

About the kernel, I'm not sure but I think U-Boot does support UBI. Never tried it though...