I have some data (approximatly 5 Mio of items in 1500 tables, 10GB) in azure tables. The entities can be large and contain some serialized binary data in the protobuf format.
I have to process all of them and transform it to another structure. This processing is not thread safe. I also process some data from a mongodb replica set using the same code (the mongodb is hosted in another datacenter).
For debugging purposes I log the throughput and realized that it is very low. With mongodb I have a throughput of 5000 items / sec, with azure table storage only 30 items per second.
To improve the performance I try to use TPL dataflow, but it doesnt help:
public async Task QueryAllAsync(Action<StoredConnectionSetModel> handler)
{
List<CloudTable> tables = await QueryAllTablesAsync(companies, minDate);
ActionBlock<StoredConnectionSetModel> handlerBlock = new ActionBlock<StoredConnectionSetModel>(handler, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });
ActionBlock<CloudTable> downloaderBlock = new ActionBlock<CloudTable>(x => QueryTableAsync(x, s => handlerBlock.Post(s), completed), new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });
foreach (CloudTable table in tables)
{
downloaderBlock.Post(table);
}
}
private static async Task QueryTableAsync(CloudTable table, Action<StoredConnectionSetModel> handler)
{
TableQuery<AzureTableEntity<StoredConnectionSetModel>> query = new TableQuery<AzureTableEntity<StoredConnectionSetModel>>();
TableContinuationToken token = null;
do
{
TableQuerySegment<AzureTableEntity<StoredConnectionSetModel>> segment = await table.ExecuteQuerySegmentedAsync<AzureTableEntity<StoredConnectionSetModel>>(query, token);
foreach (var entity in segment.Results)
{
handler(entity.Entity);
}
token = segment.ContinuationToken;
}
while (token != null)
}
I run the batch process on my local machine (with 100mbit connection) and in azure (as worker role) and it is very strange, that the throughput on my machine is higher (100 items / sec) than on azure. I reach my max capacity of the internet connection locally but the worker role should not have this 100mbit limitation I hope.
How can I increase the throughput? I have no ideas what is going wrong here.
EDIT: I realized that I was wrong with the 30items per second. It is often higher (100/sec), depending on the size of the items I guess. According to the documentation (http://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/#subheading10) there is a limit:
The scalability limit for accessing tables is up to 20,000 entities (1KB each) per second for an account. This are only 19MB / sec, not so impressive, if you keep in mind, that there are also normal requests from the production system). Probably I test it to use multiple accounts.
EDIT #2: I made two single tests, starting with a list of 500 keys [1...500] (Pseudo Code)
Test#1 Old approach (TABLE 1)
foreach (key1 in keys)
foreach (key2 in keys)
insert new Entity { paritionkey = key1, rowKey = key2 }
Test#2 New approach (TABLE 2)
numpartitions = 100
foreach (key1 in keys)
foreach (key2 in keys)
insert new Entity { paritionkey = (key1 + key2).GetHashCode() % numParitions, rowKey = key1 + key2 }
Each entity gets another property with 10KB of random text data.
Then I made the query tests, in the first case I just query all entities from Table 1 in one thread (sequential)
In the next test I create on task for each partitionkey and query all entities from Table 2 (parallel). I know that the test is no that good, because in my production environment I have a lot more partitions than only 500 per table, but it doesnt matter. At least the second attempt should perform well.
It makes no difference. My max throughput is 600 entities/sec, varying from 200 to 400 the most of the time. The documentation says that I can query 20.000 entities / sec (with 1 KB each), so I should get at least 1500 or so in average, I think. I tested it on a machine with 500MBit internet connection and I only reached about 30mbit, so this should not be the problem.