1
votes

I am trying to find a way to further improve the performance of my console app (it is already fully working).

I have a CSV file which contains a list of addresses (about 100k). I need to query a Web API whose POST response contains the geographical coordinates of each address. Then I write a GeoJSON file to the file system with the address data enriched with the geographical coordinates (latitude and longitude).

My current solution splits the data into batches of 1000 records and sends asynchronous POST requests to the Web API using HttpClient (.NET Core 3.1 console app with a class library targeting .NET Standard 2.0). GeoJSON is my DTO class:

public class GeoJSON
{
    public string Locality { get; set; }
    public string Street { get; set; }
    public string StreetNumber { get; set; }
    public string ZIP { get; set; }
    public string Latitude { get; set; }
    public string Longitude { get; set; }
}


public static async Task<List<GeoJSON>> GetAddressesInParallel(List<GeoJSON> geos)
{
    // geoJSONs and batchSize are fields defined elsewhere in the class (batchSize = 1000)
    // Calculate the number of batches based on the batch size
    int numberOfBatches = (int)Math.Ceiling((double)geos.Count / batchSize);

    for (int i = 0; i < numberOfBatches; i++)
    {
        // Send the whole batch concurrently and wait for all of its responses
        var currentIds = geos.Skip(i * batchSize).Take(batchSize);
        var tasks = currentIds.Select(id => SendPOSTAsync(id));
        geoJSONs.AddRange(await Task.WhenAll(tasks));
    }

    return geoJSONs;
}

My async POST method looks like this:

public static async Task<GeoJSON> SendPOSTAsync(GeoJSON geo)
{
    // client (a shared HttpClient) and URL are fields defined elsewhere in the class
    string payload = JsonConvert.SerializeObject(geo);
    HttpContent c = new StringContent(payload, Encoding.UTF8, "application/json");
    using HttpResponseMessage response = await client.PostAsync(URL, c).ConfigureAwait(false);

    if (response.IsSuccessStatusCode)
    {
        // Copy the coordinates from the response onto the original object
        var address = JsonConvert.DeserializeObject<GeoJSON>(await response.Content.ReadAsStringAsync());
        geo.Latitude = address.Latitude;
        geo.Longitude = address.Longitude;
    }
    return geo;
}

The Web API runs on my local machine as a self-hosted x86 application. The whole application finishes in less than 30s, and the most time-consuming part is the async POST part (about 25s). The Web API accepts only one address per POST, otherwise I would have sent multiple addresses in one request.

Any ideas on how to improve the performance of the requests against the Web API?

ericlippert.com/2012/12/17/performance-rant is a very good read about performance from Eric Lippert. (zaggler)
"The most time consuming part is the Async POST part" - yeah, so the external server is probably throttling you... (CodeCaster)
1000 concurrent requests sounds excessive. Have you tried with smaller numbers? (Theodor Zoulias)
@TheodorZoulias I started with 100 and went up during my tests. 1000 seems to return the best results. With 100 concurrent requests the app took on average about 6 seconds more than with 1000. (fpsanti)

2 Answers

1
votes

A potential problem with your batching approach is that a single delayed response may delay the completion of a whole batch. It may not be an actual problem, because the web service you are calling may have very consistent response times, but in any case you could try an alternative approach that allows controlling the concurrency without batching. The example below uses the TPL Dataflow library, which is built into the .NET Core platform and available as a package for .NET Framework:

public static async Task<List<GeoJSON>> GetAddressesInParallel(List<GeoJSON> geos)
{
    // Each posted item is processed by SendPOSTAsync, with at most
    // MaxDegreeOfParallelism requests in flight at any moment.
    var block = new ActionBlock<GeoJSON>(async item =>
    {
        await SendPOSTAsync(item);
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 1000
    });

    // Feed the block with all the items, then signal that no more are coming.
    foreach (var item in geos)
    {
        await block.SendAsync(item);
    }
    block.Complete();

    await block.Completion;
    return geos;
}

Your SendPOSTAsync method just returns the same GeoJSON instance that it receives as an argument, so GetAddressesInParallel can also return the same List<GeoJSON> that it receives as an argument.

The ActionBlock is the simplest of the blocks available in the library. It just executes a sync or async action for every item, and allows configuring the MaxDegreeOfParallelism among other options. You could also try splitting your workflow into multiple blocks and then linking them together to form a pipeline. For example:

  1. TransformBlock<GeoJSON, (GeoJSON, string)> that serializes the GeoJSON objects to JSON.
  2. TransformBlock<(GeoJSON, string), (GeoJSON, string)> that makes the HTTP requests.
  3. ActionBlock<(GeoJSON, string)> that deserializes the HTTP responses and updates the GeoJSON objects with the received values.

Such an arrangement would allow you to fine-tune the MaxDegreeOfParallelism of each block, and hopefully achieve the optimal performance.
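
A rough sketch of such a pipeline could look like the one below. It reuses the GeoJSON DTO and the Newtonsoft.Json serialization from your question, takes the HttpClient and endpoint URL as parameters, and the MaxDegreeOfParallelism values are only placeholders to show that each stage can be tuned independently, not measured optimums:

using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
using Newtonsoft.Json;

public static class GeoPipeline
{
    public static async Task ProcessAsync(List<GeoJSON> geos, HttpClient client, string url)
    {
        // 1. Serialize each GeoJSON to its JSON payload (CPU-bound, low parallelism).
        var serializeBlock = new TransformBlock<GeoJSON, (GeoJSON Geo, string Payload)>(
            geo => (geo, JsonConvert.SerializeObject(geo)),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

        // 2. Make the HTTP request (I/O-bound, high parallelism).
        var postBlock = new TransformBlock<(GeoJSON Geo, string Payload), (GeoJSON Geo, string Body)>(
            async item =>
            {
                var content = new StringContent(item.Payload, Encoding.UTF8, "application/json");
                using var response = await client.PostAsync(url, content).ConfigureAwait(false);
                var body = response.IsSuccessStatusCode
                    ? await response.Content.ReadAsStringAsync().ConfigureAwait(false)
                    : null;
                return (item.Geo, body);
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });

        // 3. Deserialize the response and copy the coordinates onto the original object.
        var updateBlock = new ActionBlock<(GeoJSON Geo, string Body)>(item =>
        {
            if (item.Body == null) return;
            var result = JsonConvert.DeserializeObject<GeoJSON>(item.Body);
            item.Geo.Latitude = result.Latitude;
            item.Geo.Longitude = result.Longitude;
        });

        // Link the blocks and propagate completion down the pipeline.
        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        serializeBlock.LinkTo(postBlock, linkOptions);
        postBlock.LinkTo(updateBlock, linkOptions);

        foreach (var geo in geos) await serializeBlock.SendAsync(geo);
        serializeBlock.Complete();
        await updateBlock.Completion;
    }
}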

0
votes

The answer above is probably correct, but this kind of dependency is not necessary. You can just use Task.WhenAll. This code is from a different REST library, but the concept is the same:

// A concurrent collection (System.Collections.Concurrent) is used here because
// List<T>.Add is not thread-safe when called from parallel iterations.
var tasks = new ConcurrentBag<Task<Response<Person>>>();
const int maxCalls = 100;

Parallel.For(0, maxCalls, (i) =>
{
    var client = clientFactory.CreateClient();
    tasks.Add(client.GetAsync<Person>(new Uri("JsonPerson", UriKind.Relative)));
});

var results = await Task.WhenAll(tasks);

The client is created and the request started in parallel 100 times. Then all the tasks are awaited together with Task.WhenAll. This means that all available resources are utilized.
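
If you want to apply the same Task.WhenAll idea directly to the question's code, a minimal sketch could look like this. It assumes the question's GeoJSON DTO and SendPOSTAsync method, and adds a SemaphoreSlim (not part of the original code) purely to cap concurrency, with 100 as an example limit only:

// Requires using System.Linq; and using System.Threading; in addition to the question's usings.
public static async Task<List<GeoJSON>> GetAddressesWithWhenAll(List<GeoJSON> geos)
{
    // Example limit only: allow at most 100 requests in flight at the same time.
    using var throttler = new SemaphoreSlim(100);

    var tasks = geos.Select(async geo =>
    {
        await throttler.WaitAsync();
        try
        {
            return await SendPOSTAsync(geo);
        }
        finally
        {
            throttler.Release();
        }
    });

    // All requests are started and awaited together, without fixed batches.
    return (await Task.WhenAll(tasks)).ToList();
}

Unlike Parallel.For, this starts the requests from a single loop and relies on the asynchronous I/O itself for concurrency, so no threads are blocked while the requests are in flight.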

Full code