
Goal: use a SAX parser to parse different XML files in parallel, using multiple threads.

I found multiple posts on this topic, but none of them gives a clear answer.


I know that SAXParserFactory and SAXParser are not thread safe. From my research, I need to create new instances of SAXParserFactory and SAXParser (and also a new instance of MySAXHandler) for each thread. How can I achieve this?

Here is my current implementation.

Initialization of SAXParser

@Override
public GameStatisticsDTO processStatsGameStatXML(File gameStatsStatFile) {
    try(InputStream inputStream = new FileInputStream(gameStatsStatFile)) {
        // New Handler instance
        GameStatsSAXHandler gameStatsSAXHandler = new GameStatsSAXHandler();

        Reader reader = new InputStreamReader(inputStream, Constants.ENCODING_TYPE_UTF_8);
        InputSource inputSource = new InputSource(reader);
        inputSource.setEncoding(Constants.ENCODING_TYPE_UTF_8);

        // New Instance of SAXParserFactory
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        // New Instance of SAXParser
        SAXParser saxParser = factory.newSAXParser();

        // Create an XML reader to set the entity resolver.
        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setEntityResolver(new StatsCustomResolver());
        xmlReader.setContentHandler(gameStatsSAXHandler);
        xmlReader.parse(inputSource);
        return gameStatsSAXHandler.getGameStatisticsDTO();
    } catch (Exception e) {
        throw new UnprocessableEntityException();
    }
}

This calls GameStatsSAXHandler to parse the XML nodes. Within that class I keep instance fields that store the parsed data.

public class GameStatsSAXHandler extends DefaultHandler {

   // Instance Reference Variable - Hope this is thread safe
   private GameStatisticsDTO gameStatisticsDTO = new GameStatisticsDTO();

   protected GameStatisticsDTO getGameStatisticsDTO() {
      return this.gameStatisticsDTO;
   }

   @Override
   public void startElement (String uri, String localName, String 
     elementName, Attributes attributes) throws SAXException {
        // Process the data and add it to the gameStatisticsDTO
   }

   @Override
   public void endElement (String uri, String localName, String 
      elementName) throws SAXException {
        // Do some processing in gameStatisticsDTO
   }
}

gameStatisticsDTO itself contains multiple instance fields (objects and lists).

So I have 2 questions.

1) As I understand it, only local primitive variables are inherently thread safe. Are GameStatsSAXHandler and its GameStatisticsDTO thread safe?

My thought: if I create a new GameStatsSAXHandler instance for each thread, then its GameStatisticsDTO is only ever touched by that thread and is therefore thread safe.

2) How can I convert this to run in a multi-threaded environment, so that files are parsed in parallel?

My thought: create a ThreadPoolExecutor and, for each task, create a new SAXParserFactory, a new SAXParser, and a new GameStatsSAXHandler, then pass them to the base method (processStatsGameStatXML) for processing.

But how can I create a new instance for each thread? A code sample would be great! Thanks.

Does it need to be multi-threaded in the first place? Also, is SAX the right parsing model? – vtd-xml-author

1 Answer


You submit tasks to a ThreadPoolExecutor. A task is normally where you keep all the context (that is, the state) for one unit of work; in your case, the parsing of one file.

So something like this:

class ParsingTask implements Runnable {
    // Per-task state: each task owns its own instances, so nothing is shared between threads.
    private SAXParserFactory factory;
    private SAXParser parser;
    private GameStatsSAXHandler handler;
    // whatever else is needed for parsing

    @Override
    public void run() {
        // actual parsing code
    }
}
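
To make the task idea concrete, here is a minimal sketch rather than a drop-in implementation. It makes several assumptions: CountingHandler is a hypothetical stand-in for GameStatsSAXHandler that only counts start tags, the documents are passed as in-memory strings instead of files so the example is self-contained, and the task is a Callable so that each result can be collected through a Future. Each task creates its own factory, parser, and handler inside call(), so no parser state is shared between threads.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ParallelSaxSketch {

    /** Hypothetical stand-in for GameStatsSAXHandler: counts start tags. */
    static final class CountingHandler extends DefaultHandler {
        private int elementCount; // instance state, only touched by the owning task

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            elementCount++;
        }

        int getElementCount() {
            return elementCount;
        }
    }

    /** One task per document; everything it needs is created inside call(). */
    static final class ParsingTask implements Callable<Integer> {
        private final String xml;

        ParsingTask(String xml) {
            this.xml = xml;
        }

        @Override
        public Integer call() throws Exception {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            CountingHandler handler = new CountingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.getElementCount();
        }
    }

    /** Parses all documents on a fixed-size pool; results keep the input order. */
    public static List<Integer> parseAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String document : documents) {
                futures.add(pool.submit(new ParsingTask(document)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Parsing failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(List.of("<a><b/><b/></a>", "<a/>"), 2));
    }
}
```

In the real code, the task would presumably take the File and return the GameStatisticsDTO from the handler instead of an element count. Because the handler is created per task, its fields are confined to one thread while parsing, and Future.get() publishes the finished result safely to the caller.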

[Edit] On a side note, I think implementations of SAXParserFactory are usually thread-safe. Unless the factory needs to be configured differently for different parsing tasks, it probably does not need to be instantiated again for every new parsing task.
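
A rough sketch of that variant follows, with the factory configured once and reused. The class and method names (SharedFactorySketch, newParser, countElements) are made up for illustration. Since the thread safety of the factory depends on the implementation, this sketch takes the cautious route and synchronizes the newSAXParser() call; only the parser and the handler are created per task.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SharedFactorySketch {

    // Configured once; never reconfigured after class initialization.
    private static final SAXParserFactory FACTORY = createFactory();

    private static SAXParserFactory createFactory() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            return factory;
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Returns a fresh parser; the lock avoids relying on the factory being thread-safe. */
    public static SAXParser newParser() {
        try {
            synchronized (FACTORY) {
                return FACTORY.newSAXParser();
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAXParser", e);
        }
    }

    /** Parses one document with its own parser and handler; returns the start-tag count. */
    public static int countElements(String xml) {
        final int[] count = {0}; // local to this call, so not shared between threads
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                count[0]++;
            }
        };
        try {
            newParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return count[0];
    }
}
```

The lock is held only while the parser is created, not while parsing, so it should not serialize the actual work. If profiling shows that parser creation matters, the per-task factory from the question remains the simplest alternative.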