0 votes

I have some data (approximately 5 million items in 1,500 tables, about 10 GB) in Azure Table Storage. The entities can be large and contain some serialized binary data in the protobuf format.

I have to process all of them and transform them into another structure. This processing is not thread-safe. I also process some data from a MongoDB replica set using the same code (the MongoDB is hosted in another datacenter).

For debugging purposes I log the throughput and realized that it is very low. With MongoDB I get a throughput of 5,000 items/sec, with Azure Table Storage only 30 items/sec.

To improve the performance I tried to use TPL Dataflow, but it doesn't help:

public async Task QueryAllAsync(Action<StoredConnectionSetModel> handler)
{
    List<CloudTable> tables = await QueryAllTablesAsync(companies, minDate);

    // The handler is not thread-safe, so it runs on a single worker.
    ActionBlock<StoredConnectionSetModel> handlerBlock = new ActionBlock<StoredConnectionSetModel>(
        handler,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1 });

    // Up to 20 tables are downloaded in parallel; each one feeds the handler block.
    ActionBlock<CloudTable> downloaderBlock = new ActionBlock<CloudTable>(
        x => QueryTableAsync(x, s => handlerBlock.Post(s)),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });

    foreach (CloudTable table in tables)
    {
        downloaderBlock.Post(table);
    }

    // Signal completion and wait until all tables have been downloaded and handled.
    downloaderBlock.Complete();
    await downloaderBlock.Completion;
    handlerBlock.Complete();
    await handlerBlock.Completion;
}

private static async Task QueryTableAsync(CloudTable table, Action<StoredConnectionSetModel> handler)
{
    TableQuery<AzureTableEntity<StoredConnectionSetModel>> query = new TableQuery<AzureTableEntity<StoredConnectionSetModel>>();

    // Page through the table segment by segment until no continuation token is returned.
    TableContinuationToken token = null;
    do
    {
        TableQuerySegment<AzureTableEntity<StoredConnectionSetModel>> segment =
            await table.ExecuteQuerySegmentedAsync<AzureTableEntity<StoredConnectionSetModel>>(query, token);

        foreach (var entity in segment.Results)
        {
            handler(entity.Entity);
        }

        token = segment.ContinuationToken;
    }
    while (token != null);
}

I run the batch process on my local machine (with a 100 Mbit connection) and in Azure (as a worker role), and it is very strange that the throughput on my machine is higher (100 items/sec) than in Azure. Locally I max out my internet connection, but the worker role should not have this 100 Mbit limitation, I hope.

How can I increase the throughput? I have no idea what is going wrong here.


EDIT: I realized that I was wrong about the 30 items per second. It is often higher (100/sec), depending on the size of the items, I guess. According to the documentation (http://azure.microsoft.com/en-us/documentation/articles/storage-performance-checklist/#subheading10) there is a limit:

The scalability limit for accessing tables is up to 20,000 entities (1 KB each) per second for an account. That is only about 19 MB/sec, which is not so impressive if you keep in mind that there are also normal requests from the production system. I will probably test using multiple storage accounts.
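If the per-account limit really is the bottleneck, spreading the 1,500 tables over several storage accounts could look roughly like the sketch below (the connection strings, the TableClients field and the ResolveTable helper are made-up names for illustration, not existing code):

private static readonly CloudTableClient[] TableClients =
    new[]
    {
        // Placeholders: one connection string per additional storage account.
        "DefaultEndpointsProtocol=https;AccountName=account1;AccountKey=...",
        "DefaultEndpointsProtocol=https;AccountName=account2;AccountKey=..."
    }
    .Select(cs => CloudStorageAccount.Parse(cs).CreateCloudTableClient())
    .ToArray();

private static CloudTable ResolveTable(string tableName)
{
    // Map a table name to one of the accounts (the same name always goes to the same account within a process).
    int index = Math.Abs(tableName.GetHashCode()) % TableClients.Length;
    return TableClients[index].GetTableReference(tableName);
}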


EDIT #2: I made two separate tests, starting with a list of 500 keys [1...500] (pseudo code):

Test #1: old approach (Table 1)

foreach (key1 in keys)
    foreach (key2 in keys)
        insert new Entity { partitionKey = key1, rowKey = key2 }

Test #2: new approach (Table 2)

numPartitions = 100
foreach (key1 in keys)
    foreach (key2 in keys)
        insert new Entity { partitionKey = (key1 + key2).GetHashCode() % numPartitions, rowKey = key1 + key2 }

Each entity gets an additional property with 10 KB of random text data.
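Translated into the storage client API, the Test #2 insert could look roughly like this (a sketch; InsertTestDataAsync, the Payload property name and the 'x' filler text are assumptions for illustration):

private static async Task InsertTestDataAsync(CloudTable table2, IReadOnlyList<string> keys)
{
    const int numPartitions = 100;
    string payload = new string('x', 10 * 1024); // stand-in for the 10 KB of random text

    foreach (string key1 in keys)
    {
        foreach (string key2 in keys)
        {
            string combined = key1 + key2;
            string partitionKey = (Math.Abs(combined.GetHashCode()) % numPartitions).ToString();

            // DynamicTableEntity avoids having to define an entity class just for the test.
            var entity = new DynamicTableEntity(partitionKey, combined);
            entity.Properties["Payload"] = new EntityProperty(payload);

            await table2.ExecuteAsync(TableOperation.Insert(entity));
        }
    }
}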

Then I ran the query tests. In the first case I just query all entities from Table 1 in one thread (sequential).

In the next test I create one task for each partition key and query all entities from Table 2 (parallel, see the sketch below). I know that the test is not that good, because in my production environment I have a lot more partitions than only 500 per table, but it doesn't matter. At least the second attempt should perform well.
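A sketch of that parallel read (QueryPartitionsInParallelAsync is a made-up name; the partition keys would be the 100 hash buckets from above, and the per-entity handling is omitted):

private static async Task QueryPartitionsInParallelAsync(CloudTable table2, IEnumerable<string> partitionKeys)
{
    IEnumerable<Task> tasks = partitionKeys.Select(async partitionKey =>
    {
        // Restrict each query to a single partition and page through its segments.
        TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>().Where(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey));

        TableContinuationToken token = null;
        do
        {
            TableQuerySegment<DynamicTableEntity> segment =
                await table2.ExecuteQuerySegmentedAsync(query, token);

            // Count or process segment.Results here.
            token = segment.ContinuationToken;
        }
        while (token != null);
    });

    await Task.WhenAll(tasks);
}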

It makes no difference. My max throughput is 600 entities/sec, varying between 200 and 400 most of the time. The documentation says that I can query 20,000 entities/sec (with 1 KB each), so I would expect to get at least 1,500 or so on average. I tested it on a machine with a 500 Mbit internet connection and only reached about 30 Mbit, so the network should not be the problem.

Did you check the Performance and storage scalability checklist? azure.microsoft.com/en-us/documentation/articles/… If not, please have a look. – Jalpesh Vadgama

"To improve the performance I tried to use TPL Dataflow, but it doesn't help" – this tells me you have no idea why this is slow and you have tried a random change to make it go faster. Not a good way to go. Find out why this is slow. Is the network maxed out locally? Is the CPU 100% busy (even on one core)? – usr

@usr I think this is not true. As I mentioned above, the network is maxed out locally and the CPU is at 20% (on all cores), so the problem seems to be getting the data fast enough. The import from MongoDB shows that there is a lot of room to increase the throughput and keep the CPU busy. I use Dataflow to make multiple requests in parallel and to synchronize the operations for processing the data. – SebastianStehle

What is the ratio of count(partitionKey) to count(rowKey)? Scans across partitions are very slow, so ideally you want to parallelize the calls per partition. – Thomas Jungblut

I don't know exactly. We get the data from a partner. Each data set is identified by two keys and a date. We have one table per date and use one key as the row key and one key as the partition key. They have about 60,000 different keys, so there might be a lot of partitions. Typically we only need one entity (or data set) from the table and query it by date and keys. – SebastianStehle

1 Answer

0 votes

You should also check out the Table Storage Design Guide. Hope this helps.
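For completeness, the performance checklist linked in the question also recommends a few client-side settings that are easy to miss. Roughly (a sketch to run once at startup, before the first storage request; connectionString is a placeholder):

// Client-side settings recommended by the Azure Storage performance checklist.
ServicePointManager.DefaultConnectionLimit = 100; // the default of 2 severely limits parallel requests
ServicePointManager.Expect100Continue = false;    // saves one round trip per request
ServicePointManager.UseNagleAlgorithm = false;    // Nagle hurts small table/queue payloads

CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
CloudTableClient tableClient = account.CreateCloudTableClient();

// JSON without metadata reduces the response size per entity.
tableClient.DefaultRequestOptions.PayloadFormat = TablePayloadFormat.JsonNoMetadata;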