I recently started playing around with Apache Solr and am currently trying to figure out the best way to benchmark the indexing of a corpus of XML documents. I am mainly interested in the throughput (documents indexed per second) and the index size on disk.
I am doing all this on Ubuntu.
Benchmarking Technique
* Run the following 5 times & get the average total time taken *
- Index documents [curl http://localhost:8983/solr/core/dataimport?command=full-import]
- Get 'Time taken' name attribute from XML response when status is 'idle' [curl http://localhost:8983/solr/core/dataimport]
- Get size of 'data/index' directory
- Delete index [curl http://localhost:8983/solr/core/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8']
- Commit [curl http://localhost:8983/solr/core/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8']
- Re-index documents
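For the "get size of 'data/index'" step, here is a minimal sketch of how the size could be collected from a script instead of calling 'du' by hand. The function name `directory_size_bytes` is my own, and the path passed to it is whatever your core's index directory is. Note that this sums file lengths, so the figure may differ slightly from 'du', which by default reports allocated disk blocks.

```python
import os


def directory_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symbolic links so that linked files are not counted.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


if __name__ == "__main__":
    # 'data/index' is the directory named in the steps above; adjust as needed.
    print(directory_size_bytes("data/index"))
```

Calling this once after each full-import (and before the delete) would give one size reading per run.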
Questions
- I intend to calculate my throughput by dividing the number of documents indexed by average total time taken; is this fine?
- Are there tools (like SolrMeter for query benchmarking) or standard scripts already available that I could use to achieve my objectives? I do not want to re-invent the wheel...
- Is my approach fine?
- Is there an easier way of getting the index size as opposed to performing a 'du' on the data/index/ directory?
- Where can I find information on how to interpret the XML response attributes (see sample output below)? For instance, I would like to know the difference between the QTime and Time taken values.
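To make the throughput calculation from the first question concrete, this is roughly what I have in mind, as a sketch only. The name `average_throughput` is made up for illustration, and it assumes every run indexes the same number of documents.

```python
from statistics import mean


def average_throughput(doc_count, run_times_seconds):
    """Documents per second: doc_count divided by the mean run time."""
    if not run_times_seconds:
        raise ValueError("at least one run time is required")
    return doc_count / mean(run_times_seconds)


if __name__ == "__main__":
    # Hypothetical timings (in seconds) for 5 runs of 1600 documents each.
    print(average_throughput(1600, [10.2, 10.4, 9.9, 10.1, 10.3]))
```

Dividing by the mean time is the same as dividing total documents by total time across all runs; it is not the same as averaging the per-run throughputs, which gives a slightly different number when the run times vary.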
* XML Response Used to Get Throughput *
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">w5-data-config.xml</str>
</lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">3200</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2012-12-11 14:06:19</str>
<str name="">Indexing completed. Added/Updated: 1600 documents. Deleted 0 documents.</str>
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
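For reference, this is a sketch of how the values could be pulled out of a response like the one above using Python's standard library. The function names are my own, and the 'H:M:S.mmm' layout of 'Time taken' is an assumption based only on this sample; the response itself warns that the format is experimental and may change.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the status response shown above.
SAMPLE = """<response>
<str name="status">idle</str>
<lst name="statusMessages">
<str name="Total Documents Processed">1600</str>
<str name="Time taken">0:0:10.233</str>
</lst>
</response>"""


def parse_time_taken(value):
    """Convert an 'H:M:S.mmm' string into seconds (format assumed)."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def import_stats(xml_text):
    """Extract status, document count and elapsed seconds from the response."""
    root = ET.fromstring(xml_text)
    messages = root.find("lst[@name='statusMessages']")
    # Map each <str name="..."> child to its text content.
    fields = {el.get("name"): el.text for el in messages.findall("str")}
    return {
        "status": root.findtext("str[@name='status']"),
        "documents": int(fields["Total Documents Processed"]),
        "seconds": parse_time_taken(fields["Time taken"]),
    }


if __name__ == "__main__":
    stats = import_stats(SAMPLE)
    if stats["status"] == "idle":
        # Throughput for this single run, in documents per second.
        print(stats["documents"] / stats["seconds"])
```

The check on 'idle' mirrors the step above: the timing is only read once the import has finished.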