I recently started playing around with Apache Solr and am currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am mainly interested in the throughput (documents indexed per second) and the index size on disk.

I am doing all this on Ubuntu.

Benchmarking Technique

* Run the following 5 times & get the average total time taken *

  • Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
    • Get the value of the element named 'Time taken' from the XML response once status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
    • Get size of 'data/index' directory
  • Delete Index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
  • Commit [curl http://localhost:8983/solr/w5/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
  • Re-index documents
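
The import-and-poll part of these steps could be scripted roughly as follows. This is a minimal sketch rather than a standard tool: the helper names, the injectable `fetch` callable and the polling interval are assumptions made for illustration, and the URL is the one from the list above. It also assumes the handler does not report 'idle' together with a 'Time taken' value until the new import has finished. Deleting the index between runs is left out for brevity.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://localhost:8983/solr/core/dataimport"  # URL from the steps above


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def import_status(xml_text):
    """Return (status, time_taken) from a dataimport status response."""
    root = ET.fromstring(xml_text)
    status = root.findtext("str[@name='status']")
    time_taken = root.findtext(
        "lst[@name='statusMessages']/str[@name='Time taken']"
    )
    return status, time_taken


def run_full_import(fetch=http_get, poll_seconds=1.0):
    """Start a full import, poll until idle, and return the 'Time taken' string."""
    fetch(BASE + "?command=full-import")
    while True:
        status, time_taken = import_status(fetch(BASE))
        # Assumption: 'idle' plus a 'Time taken' entry means the import is done.
        if status == "idle" and time_taken is not None:
            return time_taken
        time.sleep(poll_seconds)
```

Calling `run_full_import()` five times and collecting the returned strings would give the raw data for the average; passing a different `fetch` makes it possible to try the loop without a running Solr instance.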

Questions

  1. I intend to calculate my throughput by dividing the number of documents indexed by the average total time taken; is this fine?
  2. Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
  3. Is my approach fine?
  4. Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
  5. Where can I find information on how to interpret the XML response attributes (see sample output below)? For instance, I would like to know the difference between the QTime and Time taken values.
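
The calculation in question 1 can be written out as a short sketch. The function names are hypothetical, and the 'Time taken' string is read as hours:minutes:seconds.milliseconds, with the part after the dot treated as a count of milliseconds; that reading is an assumption about the format, not something confirmed here.

```python
def parse_time_taken(value):
    """Convert a 'Time taken' string such as '0:0:10.233' to seconds.

    Assumption: the format is h:m:s.ms and the fraction is a millisecond count.
    """
    hours, minutes, rest = value.split(":")
    seconds, _, millis = rest.partition(".")
    return (int(hours) * 3600 + int(minutes) * 60
            + int(seconds) + int(millis or 0) / 1000)


def throughput(doc_count, time_taken_values):
    """Documents per second, using the mean duration over all runs."""
    durations = [parse_time_taken(v) for v in time_taken_values]
    mean_seconds = sum(durations) / len(durations)
    return doc_count / mean_seconds


# One run with 1600 documents and a 'Time taken' of 0:0:10.233
print(round(throughput(1600, ["0:0:10.233"]), 1))  # prints 156.4
```

With five runs, the list would simply hold five 'Time taken' strings.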

* XML Response Used to Get Throughput *

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">w5-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">0</str>
    <str name="Total Rows Fetched">3200</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2012-12-11 14:06:19</str>
    <str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
    <str name="Total Documents Processed">1600</str>
    <str name="Time taken">0:0:10.233</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>
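
For illustration, the values needed for the throughput figure can be pulled out of a response like this one with Python's standard library. This is only a sketch: `SAMPLE` is a shortened copy of the response above, and `status_messages` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample response
SAMPLE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Total Rows Fetched">3200</str>
    <str name="Total Documents Processed">1600</str>
    <str name="Time taken">0:0:10.233</str>
  </lst>
</response>"""


def status_messages(xml_text):
    """Return the <str> entries under statusMessages as a name -> text dict."""
    root = ET.fromstring(xml_text)
    entries = root.find("lst[@name='statusMessages']")
    return {e.get("name"): e.text for e in entries.findall("str")}


messages = status_messages(SAMPLE)
print(messages["Total Documents Processed"], messages["Time taken"])
# prints: 1600 0:0:10.233
```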

1 Answer


To question 1:

I would suggest indexing more than one XML file (each with a different dataset) and comparing the results. That way you will know whether it's OK to simply divide the time taken by your number of documents.

To question 2:

I didn't find any such tools; I did it on my own by developing a short Java application.

To question 3:

Which approach do you mean? I would refer you to my answer to question 1.

To question 4:

The size of the index folder gives you the correct size of the whole index, so why don't you want to use it?
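
If the measurement should live in the same script as the rest of the benchmark, the folder size can also be summed without shelling out to 'du'. A minimal sketch, with a hypothetical function name; it adds up apparent file sizes, so the figure may differ slightly from the block-based number 'du' prints by default.

```python
import os


def directory_size_bytes(path):
    """Sum the apparent sizes of all files below path, in bytes."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

For the setup in the question this would be called as `directory_size_bytes("data/index")` after each import.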

To question 5:

The results you get in the posted XML are passed through an XSL file, which you can find in the /bin/solr/conf/xslt folder. There you can look up what exactly the terms mean, and you can write your own XSL to display the results and information. Note: if you create a new XSL file, you have to change the settings in your solrconfig.xml. If you don't want to change those settings, edit the existing file instead.

Edit: I think the difference is that QTime is the rounded value of the Time taken value. There are only even numbers in QTime.

Best regards