Extract XML data from a gzip file using Apache Tika?

Question

I am working on a project in which I need to extract XML (sitemap) data from a .gz file using Apache Tika (I am new to Tika). The file name is something like sitemap01.xml.gz. I can extract data from a plain text or HTML file, but I don't know how to extract the XML from the .gz file and then get the metadata and content out of that XML. I have searched Google for the past two days.

Do I need to use delegateParser in Tika to extract data from the XML? Please point me to a sample or some articles.

Here is my attempt:

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public void parseXml() throws IOException {
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();

    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = getClass().getResourceAsStream("sitemap.xml.gz")) {
        parser.parse(stream, handler, metadata, context);

        // Print every metadata entry Tika reported
        for (String name : metadata.names()) {
            System.out.println(name + " : " + metadata.get(name));
        }
        // Print the extracted body text
        System.out.println(handler.toString());
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
}

Gagravarr · Accepted Answer · 2011-03-31T17:25:50

The thing you're missing is setting a recursing parser on your ParseContext. You probably want something like:

Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Register the parser so Tika can reuse it for embedded documents
context.set(Parser.class, parser);
parser.parse(....)

By setting a Parser on the ParseContext, you tell Tika which parser to call when it encounters embedded documents (such as the XML inside your gzip file).
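
If it helps to sanity-check what Tika should be finding inside the archive, the same "container first, then the XML inside it" idea can be sketched with nothing but the JDK. To be clear, this is not the Tika approach described above: it swaps in java.util.zip.GZIPInputStream plus a SAX handler, and it assumes the file is a standard sitemap whose URLs sit in <loc> elements. The class name SitemapGzReader and the methods extractLocs and gzip are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Tika.
public class SitemapGzReader {

    /** Decompresses a gzipped sitemap and returns the text of every <loc> element. */
    public static List<String> extractLocs(InputStream gzipped)
            throws IOException, SAXException, ParserConfigurationException {
        List<String> locs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <loc>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("loc".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("loc".equals(qName) && text != null) {
                    locs.add(text.toString().trim());
                    text = null;
                }
            }
        };
        // GZIPInputStream unwraps the container; the SAX parser then sees plain XML.
        try (InputStream xml = new GZIPInputStream(gzipped)) {
            SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        }
        return locs;
    }

    /** Demo-only helper: gzips a string in memory so the example needs no file. */
    public static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String sitemap = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/a</loc></url>"
                + "<url><loc>http://example.com/b</loc></url>"
                + "</urlset>";
        // Prints each <loc> value on its own line
        for (String loc : extractLocs(new ByteArrayInputStream(gzip(sitemap)))) {
            System.out.println(loc);
        }
    }
}
```

For a real file you would pass something like getClass().getResourceAsStream("sitemap.xml.gz") to extractLocs instead of the in-memory bytes. Tika remains the better fit if you also want its type detection and metadata.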
