We have implemented an Azure Cosmos DB database (SQL API) in the cloud. Through Java, we'd like to generate reports based on the data stored there. I'm not yet happy with the performance of my read queries, and I was wondering what can be improved in my current setup.
As mentioned, I use Java to query the database, via the Microsoft Azure DocumentDB library:
<dependency>
    <groupId>com.microsoft.azure</groupId>
    <artifactId>azure-documentdb</artifactId>
    <version>1.16.2</version>
</dependency>
Currently, the best performance I have been able to get is reading around 38,000 documents into memory in around 20 seconds, with 50,000 RU/s configured (on the local Cosmos DB emulator). I would really like to improve this, because we might be querying millions of documents soon.
I have the feeling that the way we store the data might not be optimal. Each document currently looks as follows:
{
    "deviceid": "xxx",
    "devicedata": {
        "datetime": "2018-08-28T00:00:02.104Z",
        "sensors": [
            {
                "p_A2": "93095",
                "p_A3": "303883",
                "p_batterycurrent": "4294967.10000",
                "p_batterygauge": "38.27700",
                "p_batteryvoltage": "13.59400",
                ... around 200 more key-value pairs ...
            }
        ]
    },
    "id": "aa5d3cf5-10fa-48dd-a0d2-a536284eddac",
    "_rid": "PtEIANkbMQABAAAAAAAAAA==",
    "_self": "dbs/PtEIAA==/colls/PtEIANkbMQA=/docs/PtEIANkbMQABAAAAAAAAAA==/",
    "_etag": "\"00000000-0000-0000-4040-006a7f2501d4\"",
    "_attachments": "attachments/",
    "_ts": 1535619672
}
A query that we would use a lot looks as follows:
SELECT c.deviceid,
       c.devicedata.datetime,
       c.devicedata.sensors[0].p_A2,
       c.devicedata.sensors[0].p_A3,
       c.devicedata.sensors[0].p_batterycurrent,
       c.devicedata.sensors[0].s_humidity
FROM c
WHERE c.deviceid = 'xxx'
  AND c.devicedata.datetime >= '2018-08-28T00:00:00.000Z'
  AND c.devicedata.datetime < '2018-08-30T00:00:00.000Z'
ORDER BY c.devicedata.datetime DESC
I split these queries up per deviceid, so per device I run one thread with this query. This seems to go a lot faster than a single thread with a single query. A query like the one above takes around 20 seconds.
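For reference, this is roughly what that per-device fan-out looks like. It is only a minimal sketch: buildQuery(...) and queryDevice(...) are hypothetical placeholders around the SQL string above and the DocumentClient snippet further down.

import com.microsoft.azure.documentdb.Document;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the per-device fan-out: one query task per deviceid.
// buildQuery(...) and queryDevice(...) are hypothetical helpers around the
// SQL string above and the DocumentClient setup shown further down.
List<Document> queryAllDevices(List<String> deviceIds) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(deviceIds.size());
    List<Future<List<Document>>> futures = new ArrayList<>();

    for (String deviceId : deviceIds) {
        futures.add(pool.submit(() -> queryDevice(buildQuery(deviceId))));
    }

    List<Document> allResults = new ArrayList<>();
    for (Future<List<Document>> future : futures) {
        allResults.addAll(future.get()); // blocks until that device's query finishes
    }
    pool.shutdown();
    return allResults;
}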
I have noticed, however, that if I only project the deviceid and devicedata.datetime, the query finishes within 2 seconds. It seems that getting the sensor data out of the sensors list is a really tough cookie. Even a SELECT * (so no picking out individual sensor fields) is faster than letting the SQL API project the sensor fields: around 15 seconds.
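To make the comparison concrete, this is the narrow variant I mean, written here as the query string I would build in Java; only the projection differs from the full query above.

// Narrow projection (deviceid and datetime only); this variant returns in
// roughly 2 seconds, versus roughly 20 seconds with all the sensor fields.
String narrowQuery =
        "SELECT c.deviceid, c.devicedata.datetime "
      + "FROM c "
      + "WHERE c.deviceid = 'xxx' "
      + "AND c.devicedata.datetime >= '2018-08-28T00:00:00.000Z' "
      + "AND c.devicedata.datetime < '2018-08-30T00:00:00.000Z' "
      + "ORDER BY c.devicedata.datetime DESC";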
My question is: what can I do to improve on this? Is the sensor list in my documents too long? Is there any way I can set this up differently? The sensor key-value pairs are not fixed and can differ per device.
Some more technical details: I have an unlimited collection, partitioned on /deviceid. I have tried both the standard Azure indexing policy (which indexes everything) and a policy that excludes the sensors from indexing.
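For completeness, this is roughly how that kind of exclusion is configured with the 1.16.x SDK. It is a sketch only: the path "/devicedata/sensors/*" is an illustrative way of expressing "exclude the sensors", and documentClient / getAlldataCollection() are the same objects as in the snippet further down.

import com.microsoft.azure.documentdb.DocumentCollection;
import com.microsoft.azure.documentdb.ExcludedPath;
import com.microsoft.azure.documentdb.IndexingPolicy;
import java.util.Collections;

// Sketch: exclude the sensors subtree from indexing while keeping the default
// policy for everything else. The path "/devicedata/sensors/*" is illustrative.
DocumentCollection collection = getAlldataCollection();
IndexingPolicy indexingPolicy = collection.getIndexingPolicy();

ExcludedPath excludedSensors = new ExcludedPath();
excludedSensors.setPath("/devicedata/sensors/*");
indexingPolicy.setExcludedPaths(Collections.singletonList(excludedSensors));

collection.setIndexingPolicy(indexingPolicy);
documentClient.replaceCollection(collection, null);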
I have tried all the tips as described here: https://docs.microsoft.com/en-us/azure/cosmos-db/performance-tips-java
This is my current Java setup, although I have tried lots of different things:
// This piece of code currently runs in a separate thread; there is one thread per deviceid to query.
documentClient = new DocumentClient(HOST, MASTER_KEY,
        ConnectionPolicy.GetDefault(), ConsistencyLevel.Session);

FeedOptions options = new FeedOptions();
options.setEnableCrossPartitionQuery(true);

documentList = documentClient
        .queryDocuments(getAlldataCollection().getSelfLink(), query, options)
        .getQueryIterable().toList();
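For comparison, this is the kind of variant I have experimented with based on the linked performance tips: Direct connectivity instead of Gateway, one shared client for all threads, and a larger page size per round trip. This is a sketch under the assumption that FeedOptions.setPageSize is available in this SDK version; the value 1000 is just a placeholder.

import com.microsoft.azure.documentdb.ConnectionMode;
import com.microsoft.azure.documentdb.ConnectionPolicy;
import com.microsoft.azure.documentdb.ConsistencyLevel;
import com.microsoft.azure.documentdb.Document;
import com.microsoft.azure.documentdb.DocumentClient;
import com.microsoft.azure.documentdb.FeedOptions;
import java.util.List;

// Tuned sketch: Direct (TCP over HTTPS) connectivity and a single DocumentClient
// instance shared by all query threads, instead of one client per thread.
ConnectionPolicy connectionPolicy = ConnectionPolicy.GetDefault();
connectionPolicy.setConnectionMode(ConnectionMode.DirectHttps);

DocumentClient sharedClient = new DocumentClient(HOST, MASTER_KEY,
        connectionPolicy, ConsistencyLevel.Session);

FeedOptions options = new FeedOptions();
options.setEnableCrossPartitionQuery(true);
options.setPageSize(1000); // placeholder page size; fewer round trips per query

List<Document> documents = sharedClient
        .queryDocuments(getAlldataCollection().getSelfLink(), query, options)
        .getQueryIterable().toList();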
I'm fairly sure MongoDB can query hundreds of thousands of documents within seconds, so I'm pretty sure I'm doing something wrong with my current setup.
Any suggestions?