6
votes

I am struggling to flash a previous ROM dump back onto an embedded device running Linux. My previous dump contains OOB data. I wrote it with nandwrite -n -N -o /dev/mtd0 backup.bin, and then took a ROM dump again.

Comparing the old and new ROM dumps, I see something I cannot explain: the last 24 bytes of the OOB (the ECC bytes) of any empty block (filled with 0xFF) ought to be 0xFF as well, but in the new ROM dump they are filled with 0x00, which causes later write failures.

oob ought to be:

FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF

but after nandwrite it is:

FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
FFFFFFFF FFFFFFFF 00000000 00000000
00000000 00000000 00000000 00000000

Does anyone have any idea why?


I added a hack to the nandwrite code that skips writing to NAND if the content to be written is all 0xFF, and it worked. So does the problem occur when writing an empty page to the NAND?


ADDED:

Now I am also having this problem when writing a bootloader image. The image isn't page-aligned, so nandwrite padded it with 0xFF. But for pages containing only 0xFF, the ECC bytes are still polluted with 0x00, just like above. It seems that my hack doesn't completely solve my problem. Can anyone help? Could it be a bug in kernel 2.6.35?

This is my hack:

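/* Skip the write if the page consists solely of 0xFF (already erased). */
int i;
int needwrite = 0;
for (i = 0; i < len; ++i) {
    if (((uint8_t *)data)[i] != 0xff) {
        needwrite = 1;  /* found real data, so the page must be written */
        break;
    }
}
if (!needwrite)
    return 0;  /* blank page: leave the erased flash untouched */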
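For reuse outside nandwrite, the same check can be written as a standalone helper. This is only a sketch: the name `page_is_blank` is hypothetical and not part of mtd-utils, and it assumes the caller passes one page (optionally including its OOB) as a byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of mtd-utils): returns 1 if every byte in
 * buf[0..len) is 0xFF, i.e. the buffer looks like erased NAND, else 0.
 * An empty buffer is treated as blank. */
int page_is_blank(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] != 0xFF)
            return 0;   /* at least one programmed bit: must be written */
    }
    return 1;           /* nothing but 0xFF: the write can be skipped */
}
```

A caller would test each page buffer before issuing the write and simply advance to the next page when the helper returns 1.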
4
What is the question here? How to nanddump on my bizarre hardware? The people are right to tell you not to ignore the OOB data. Can we close this question as too localized? As it stands, this sounds like a comment on how to hack up nanddump for one specific piece of hardware. Can you revise the question to make it relevant to more people? I am using nanddump -qobf file /dev/mtdXXX and nandwrite -p /dev/mtdXXX file and want to know if I am wrong. Google showed me this question. – artless noise
@artlessnoise Feel free to vote to close if you think so. Different versions of nanddump have different functionality (as well as command-line arguments), so I don't think this question can be applied generally. Check the help page of your version. – Alvin Wong

4 Answers

2
votes

Sorry, Alvin, but the backup really will not "only work on that particular flash", because you cannot know when a particular bit will go from good to marginal or marginal to bad. You may read it in one state, attempt to write it in the exact same state and fail, on any given day, with any given backup.

The ONLY way to safely back up the data in a NAND device is WITH ECC TURNED ON. You read from the device with ECC corrections to get good data. You then write the known-good data back to NAND with ECC turned on, so that any bits which have become marginal or bad since you read them can be corrected using the NEW ECC values.

0
votes

This is because it is a warning to you. It will not work reliably.

Consider a situation where you have a block (1) with an error in position 0. The "controller" of the NAND flash device stores an error-correcting code (ECC) so that this error can be corrected.

You copy the data from block 1 together with the ECC, BUT when you write the data to a new NAND flash device, you are cloning the data. If that new NAND flash device has an error in position 1, then the data you write back will be wrong on the following read, because position 1 is bad. But the system will think it is right, because the ECC does not show an error in position 1.

You cannot reliably clone 1 nand-flash to another directly, because the hard/soft error positions are not identical.

The only way to do it reliably is to read the data out, using the system's ECC algorithms to correct any errors, and then write the data out to the new device, again letting the system's algorithms handle any bit errors.

You may think the devices are the same, but the results are data/program corruption due to mismatches in the bit error maps.

In response to Alvin's comment:

I am quite confident that I am cloning the exact same NAND, i.e. I made a backup of that particular chip and then wrote it back to THAT particular chip. It's not me who thinks it's the same, but there is only one single chip from the beginning to the end. It is quite strange, but some other people state that it worked on their own device while mine doesn't; could there be a bug in the driver? – Alvin Wong Aug 5 at 5:16

Sorry not possible (unless you are really..really..really lucky and get chips with 0 defects)

Each NAND flash chip has its own unique set of defective bits. The way a user gets around this is to use a file system that masks out bad blocks once the number of bad bits exceeds what the ECC can correct. When you copy a NAND chip to another device, the ECC data matches the master chip. When you do a 1:1 clone of the device, some of the data bits will flip after the write (bad cells), and since you are making a verbatim copy, the ECC you wrote does not account for those flipped bits.

The fact that it "works" for some people does not mean it is correct, any more than a car is safe just because I can drive it, when I only find out the brakes don't work at the moment I need them. Even worse is the fact that many of these so-called 'experts' on the net actually erase the defect map supplied by the manufacturer when they "clone" the device, or perform a "chip erase" before saving the defect map.

This is what happens with many of the 'dodgy' NAND flash USB sticks sold on eBay: they are actually chips with the "defect map" erased. As a result they look like good devices, until you try to save content to them.

0
votes

My hack adds a check to nandwrite: if the whole page about to be written is empty (i.e. full of 0xFF), the program skips writing it (since a flash_erase has already been done).

An extra benefit is that the whole nandwrite process got faster because empty pages are skipped. Hooray!


ADDED:

It turned out that my hack didn't actually solve the problem...


ADDED again: (real solution)

The problem is in fact that the PXA310 fills the hardware ECC bytes with 0x00 for a blank page, so if the software writes an empty page, those bytes become 0x00. This is strange, because I should already have disabled ECC through the nandwrite arguments. Luckily, skipping empty pages prevents the problem when re-writing a ROM dump.

More information can be found in my blog post.

A patch sent to the linux-mtd list actually mentions this behavior.
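A likely reason the 0x00 ECC bytes cause later write failures is that programming NAND can only clear bits (1 to 0); only an erase sets them back to 1. The toy model below illustrates that general property and nothing more. It is not the PXA310 driver's behavior in detail, and the function names and sample ECC values in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of erasing NAND cells: every bit goes back to 1. */
void nand_erase(uint8_t *cells, size_t len)
{
    assert(cells != NULL || len == 0);
    for (size_t i = 0; i < len; ++i)
        cells[i] = 0xFF;
}

/* Toy model of programming: bits can only go from 1 to 0, so the stored
 * value becomes the bitwise AND of the old contents and the new data. */
void nand_program(uint8_t *cells, const uint8_t *data, size_t len)
{
    assert((cells != NULL && data != NULL) || len == 0);
    for (size_t i = 0; i < len; ++i)
        cells[i] &= data[i];
}

/* Returns 1 if data can be stored exactly without erasing first, i.e.
 * every bit that must be 1 in data is still 1 in the cells. */
int nand_can_program(const uint8_t *cells, const uint8_t *data, size_t len)
{
    assert((cells != NULL && data != NULL) || len == 0);
    for (size_t i = 0; i < len; ++i) {
        if ((cells[i] & data[i]) != data[i])
            return 0;
    }
    return 1;
}
```

In this model, once a blank-page write has left 0x00 in the ECC area, programming any real ECC value such as 0xA5 into those cells still reads back as 0x00 until the block is erased, which is consistent with skipping blank pages avoiding the problem.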

0
votes

In my embedded world, you'd first use flash_erase to blast everything, then nandwrite -p to pad the rest of the page beyond your data with 0xFF.

Usage: nandwrite [OPTION] MTD_DEVICE [INPUTFILE|-]
Writes to the specified MTD device.

  -m, --markbad           Mark blocks bad if write fails
  -N, --noskipbad         Write without bad block skipping
  -o, --oob               Image contains oob data
  -O, --onlyoob           Image contains oob data and only write the oob part
  -r, --raw               Image contains the raw oob data dumped by nanddump
  -s addr, --start=addr   Set start address (default is 0)
  -p, --pad               Pad to page size
  -b, --blockalign=1|2|4  Set multiple of eraseblocks to align to
  -q, --quiet             Don't display progress messages
      --help              Display this help and exit
      --version           Output version information and exit
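To make the padding step concrete, here is a minimal sketch of what padding to the page size with 0xFF amounts to. It is not the mtd-utils implementation; the function names are hypothetical, and the page size is whatever the device reports (2048 is only an example value).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round len up to the next multiple of page_size. */
size_t padded_length(size_t len, size_t page_size)
{
    assert(page_size > 0);
    return (len + page_size - 1) / page_size * page_size;
}

/* Fill the tail of buf, from len up to the padded length, with 0xFF so the
 * final partial page looks like erased flash beyond the real data.
 * buf must have room for padded_length(len, page_size) bytes.
 * Returns the padded length. */
size_t pad_to_page(uint8_t *buf, size_t len, size_t page_size)
{
    size_t total = padded_length(len, page_size);
    assert(buf != NULL || total == 0);
    if (total > len)
        memset(buf + len, 0xFF, total - len);
    return total;
}
```

For example, a 2049-byte image on a device with 2048-byte pages pads out to 4096 bytes, with the last 2047 bytes set to 0xFF.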