
I am working on a project in which I need to extract XML (sitemap) data from a gz file using Apache Tika (I am new to Tika). The file name is something like sitemap01.xml.gz. I can extract data from a plain text or HTML file, but I don't know how to extract the XML from the gz file and then get the metadata and content from that XML. I have searched Google for the past two days.

Do I need to use a delegating parser in Tika to extract data from the XML? Please point me to a sample or some articles.

Here is my attempt:

public void parseXml() throws IOException {
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    InputStream stream = this.getClass().getResourceAsStream("sitemap.xml.gz");
    try {
        parser.parse(stream, handler, metadata, context);
        // Print every metadata entry, then the extracted text
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
        System.out.println(handler.toString());
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    } finally {
        if (stream != null) {
            stream.close();
        }
    }
}

2 Answers

1
votes

The thing you're missing is setting a recursing parser on your ParseContext. You probably want something like:

Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(....)

By setting a Parser on the ParseContext, you tell Tika to call that parser when it encounters embedded documents (such as the XML inside your GZip).
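
To make "the XML inside your GZip" concrete, here is a minimal sketch that does the two steps by hand using only JDK classes, not Tika: it decompresses the stream, then reads the `<loc>` entries from the sitemap. The class name `SitemapGzSketch` and its methods are hypothetical names made up for illustration, and the sample sitemap is built in memory so the sketch runs without a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; this is plain JDK code, not a Tika API.
class SitemapGzSketch {

    // Decompresses a gzip stream and returns the text of every <loc> element.
    static List<String> extractLocs(InputStream gzStream) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // sitemaps declare a default namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        try (InputStream xml = new GZIPInputStream(gzStream)) {
            Document doc = builder.parse(xml);
            // "*" matches <loc> regardless of which namespace it is in
            NodeList locs = doc.getElementsByTagNameNS("*", "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                urls.add(locs.item(i).getTextContent().trim());
            }
            return urls;
        }
    }

    // Builds an in-memory gzip payload so the sketch needs no file on disk.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String sitemap =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
            + "<url><loc>http://example.com/a</loc></url>"
            + "<url><loc>http://example.com/b</loc></url>"
            + "</urlset>";
        List<String> urls = extractLocs(new ByteArrayInputStream(gzip(sitemap)));
        urls.forEach(System.out::println);
    }
}
```

For a real file you would pass a `FileInputStream` for sitemap01.xml.gz to `extractLocs` instead of the in-memory bytes. This route skips Tika's type detection and metadata, so it only fits if you already know the input is a gzipped sitemap.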

0
votes

Here is how you can use the XML parser from Apache Tika for your case:

  // Set up the handler (-1 disables the write limit), metadata and input file
  BodyContentHandler handler = new BodyContentHandler(-1);
  Metadata metadata = new Metadata();
  File inFile = new File("sitemap.xml.gz");
  System.out.println(inFile.isFile());
  // XMLParser expects plain XML, so decompress the gzip stream first
  // (GZIPInputStream is java.util.zip.GZIPInputStream)
  InputStream inputstream = new GZIPInputStream(new FileInputStream(inFile));
  ParseContext pcontext = new ParseContext();

  // XML parser
  XMLParser xmlparser = new XMLParser();
  xmlparser.parse(inputstream, handler, metadata, pcontext);

  // The handler now holds the text content of the XML with the tags removed
  System.out.println("Contents of the document:" + handler.toString());
  System.out.println("Metadata of the document:");
  String[] metadataNames = metadata.names();

  for(String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));