Can someone help me figure out why my text does not line up when I read a .doc file? I am using WordExtractor, but the extracted text is not aligned the way it is in the document. Here is my code, written for Java 1.7.

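import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class Doc {
    public void read() {
        File docFile = new File("blue.doc");
        // try-with-resources (Java 7+) closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(docFile)) {
            HWPFDocument doc = new HWPFDocument(fis);
            WordExtractor docExtractor = new WordExtractor(doc);
            // Print inside the try block: if loading fails, the extractor is never
            // created, so printing after the catch would throw a NullPointerException
            System.out.println(docExtractor.getText());
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}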

This is how the program displays the document:

 A                                                                      E
I'm stuck in Folsom Prison, and time keeps draggin on.  

It is supposed to be displayed like this:

     A                                              E
 I'm stuck in Folsom Prison, and time keeps draggin on.  

1 Answer

This will not work as written. getText() extracts the content of the document into a plain String, which discards the formatting (fonts, indentation, paragraph layout). You are then printing that text to the console and expecting it to look exactly as it does in Microsoft Word.

Next, decide what you actually want to do. I will assume you want to check both the formatting and the content of the document. Converting the document to plain text with getText() gives you the content without its formatting, which does not help you. Instead, use the POI library to access each paragraph and table in the document and read, verify, or write whatever you need.

doc.getRange() returns a Range object. With the Range API (see http://poi.apache.org/apidocs/org/apache/poi/hwpf/usermodel/Range.html) you can access all the paragraphs, tables, and sections in the document. That should help you work through the Word document programmatically.
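
As a rough sketch of what walking the Range could look like: the class name ParagraphDump and the helper stripParagraphMark are made up for illustration, and "blue.doc" is just the file name from the question. It assumes the POI HWPF jars (poi and poi-scratchpad) are on the classpath. It only prints each paragraph with its left indent, so treat it as a starting point rather than a full answer to the alignment problem.

```java
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public class ParagraphDump {

    // Hypothetical helper: paragraph text from HWPF usually ends with a
    // carriage return (the paragraph mark), so drop it before printing.
    static String stripParagraphMark(String text) {
        return text.endsWith("\r") ? text.substring(0, text.length() - 1) : text;
    }

    public static void main(String[] args) throws IOException {
        // "blue.doc" is the file from the question; pass another path as the first argument.
        String path = args.length > 0 ? args[0] : "blue.doc";
        try (FileInputStream fis = new FileInputStream(path)) {
            HWPFDocument doc = new HWPFDocument(fis);
            Range range = doc.getRange();
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph p = range.getParagraph(i);
                // getIndentFromLeft() is the left indent as stored in the document
                // (Word measures indents in twips, 1/1440 of an inch).
                System.out.printf("[%d] indent=%d inTable=%b text=%s%n",
                        i, p.getIndentFromLeft(), p.isInTable(),
                        stripParagraphMark(p.text()));
            }
        }
    }
}
```

From there you can decide how to map the indent (or any tab characters in the paragraph text) onto console columns. The console uses a fixed-width font, so an exact match with what Word shows is not guaranteed.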