I'm trying to modify a .doc file with Apache POI in Java.

At first, test.doc could not be read at all; POI threw this exception:

"org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x6576206C6D783F3C, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document "

So I opened the file in Word and re-saved it in the "Word 97-2003" format, after which POI could read it properly.

But when I write the content to a new file with nothing changed, the resulting output.doc cannot be opened by MS Office.

When I create a new document myself in MS Office, POI works well and everything goes right.

So the problem seems to be test.doc itself.

test.doc is generated by a program whose source code I cannot access, so I don't know what goes wrong.

My questions are:

1. Since MS Office can read test.doc, why can't POI read it unless I first re-save it in another format?
2. Since POI can read the re-saved doc, why can't it write it back to a new file that MS Office can open?

Here is my code:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    // The class name is only a placeholder for this example
    public class RewriteDoc {
        public static void main(String[] args) {
            // try-with-resources closes all four resources in reverse order,
            // even if opening or writing fails part-way through
            try (FileInputStream fis = new FileInputStream("test.doc");
                 POIFSFileSystem pfs = new POIFSFileSystem(fis);
                 HWPFDocument hwdf = new HWPFDocument(pfs);
                 FileOutputStream fos = new FileOutputStream("output.doc")) {
                // Write the document back out without modifying it
                hwdf.write(fos);
            } catch (IOException e) {
                // Also covers FileNotFoundException, which is an IOException
                e.printStackTrace();
            }
        }
    }

1 Answer

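The hex value in the exception (0x6576206C6D783F3C), read as little-endian bytes and interpreted as ASCII, decodes to "<?xml ve". This indicates that test.doc is actually an XML file, not a binary .doc (OLE2) file or a .docx (ZIP) file, despite its extension.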
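
To check that decoding yourself, here is a minimal sketch using only the JDK. The class and method names (SignatureDecoder, decode) are made up for this example and are not part of POI; the only input is the value copied from the exception message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a POI class
public class SignatureDecoder {

    // Lays the 64-bit signature out in little-endian byte order
    // (least significant byte first) and reads the bytes as ASCII.
    static String decode(long signature) {
        byte[] bytes = ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Value taken from the NotOLE2FileException message
        System.out.println(decode(0x6576206C6D783F3CL)); // prints "<?xml ve"
    }
}
```

The low byte 0x3C is '<', followed by 0x3F '?', 0x78 'x', 0x6D 'm', 0x6C 'l', 0x20 ' ', 0x76 'v', 0x65 'e', which is the start of an XML declaration.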

Word will sometimes open other file formats gracefully, regardless of the file extension, and when you save from Word the file is written in a proper Word format. That is why POI could read the copy you re-saved.

Therefore you should open test.doc in a hex editor and look at its contents. If it really is in some broken or unexpected format, you need to find out where the file comes from and how its creation can be fixed.
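
If you would rather not reach for a hex editor, the same check can be sketched in plain Java by looking at the first few bytes of the file. This is only a rough illustration: the class name FormatSniffer and the labels it returns are made up for the example, it needs Java 9+ for readNBytes, and it only distinguishes the three cases discussed here (the OLE2 signature from the exception message, the "PK" prefix of ZIP-based .docx files, and an XML declaration).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not a POI class
public class FormatSniffer {

    // On-disk byte order of the expected signature 0xE11AB1A1E011CFD0
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Classifies a file by its leading bytes
    static String sniff(byte[] head) {
        if (head.length >= 8 && Arrays.equals(Arrays.copyOf(head, 8), OLE2)) {
            return "OLE2";   // binary .doc, the format HWPF expects
        }
        if (head.length >= 2 && head[0] == 'P' && head[1] == 'K') {
            return "ZIP";    // e.g. .docx
        }
        if (new String(head, StandardCharsets.US_ASCII).startsWith("<?xml")) {
            return "XML";    // some XML-based format
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        String name = args.length > 0 ? args[0] : "test.doc";
        byte[] head = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(name))) {
            n = in.readNBytes(head, 0, head.length); // may read fewer than 8 bytes
        }
        System.out.println(name + ": " + sniff(Arrays.copyOf(head, n)));
    }
}
```

For the original test.doc this should report XML, given the bytes shown in the exception. Running it on output.doc as well may help narrow down whether the rewritten file is at least a structurally valid OLE2 container.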