5
votes

To Developers,

I am benchmarking Azure Data Lake and am seeing about 7.5 MB/s when reading from an ADL Store and writing to a VHD, all in the same region. This holds for both PowerShell and C#, using code taken from the following examples:

PowerShell code: https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-get-started-powershell/

C# code: https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-get-started-net-sdk/

Are the above code samples acceptable for a benchmark test or will a new SDK be delivered that will enhance the throughput? Also, are there expected throughput numbers when ADL Store becomes generally available?

Thanks, Marc

2
Since the Azure Data Lake services are still in preview, I don't think any benchmark will deliver valid results; performance will change between now and GA anyway. There are also caching mechanisms in place, which will distort your results. Once the services are GA, I would be happy to see benchmarks like the one you started. - Sascha Dittmann
You also need to consider that the analytical services will be able to retrieve several blocks simultaneously. - Sascha Dittmann
@SaschaDittmann I don't disagree with your premise, but I still find OP's question valid. OP is asking whether the linked code is suitable for use in a benchmarking context. Creating a fair benchmark execution path is always non-trivial, and it is reasonable to ask whether the code gives an accurate or a misleading view of performance. - Chris Marisic
To Sascha and Chris, thanks for the input. Saveen Reddy (MS Azure Data Lake Evangelist) commented that Azure Data Lake Store will be GA'ed Summer-ish 2016. Regards, Marc - user1154422
I totally agree that this is a valid question, which is why I gave it a +1, and I already have some ideas for how to set up a benchmark test. As soon as I'm done with my solution, I'll post it here. - Sascha Dittmann

2 Answers

2
votes

The code provided in the documentation can be used to build benchmark tests. The SDK will go through a few releases and updates prior to Azure Data Lake being generally available. These will include performance improvements in addition to features.

On the topic of performance benchmarks, our general guidance is as follows. The Azure Data Lake services are currently in preview, and we are continually working to improve them, including their performance, throughout this preview phase. As we get closer to general availability, we will consider releasing additional guidance on the type of performance results to expect. Performance results depend heavily on many factors, such as test topology, configuration and workload, so it is difficult to comment on your observations without examining all of these. If you can reach us offline with the details, we will be happy to take a look.
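To illustrate how much the workload shape alone can change a measured number, here is a minimal Python sketch that reads one file with a varying number of parallel readers and reports the aggregate rate. It is not taken from the SDK or the linked documentation: the helper names `read_range` and `parallel_read_throughput` are hypothetical, and a local temporary file stands in for the store, so the absolute numbers say nothing about Azure Data Lake itself.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length, chunk_size=1024 * 1024):
    """Read `length` bytes starting at `offset`; return the number of bytes read."""
    done = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while done < length:
            chunk = f.read(min(chunk_size, length - done))
            if not chunk:  # reached end of file early
                break
            done += len(chunk)
    return done


def parallel_read_throughput(path, workers):
    """Split the file into `workers` ranges, read them concurrently,
    and return (total_bytes_read, aggregate_MB_per_second)."""
    size = os.path.getsize(path)
    part = max(1, -(-size // workers))  # ceiling division, never zero
    ranges = [(off, min(part, size - off)) for off in range(0, size, part)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(lambda r: read_range(path, *r), ranges))
    elapsed = time.perf_counter() - start
    rate = (total / 1_000_000) / elapsed if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__":
    # Assumption: a 16 MiB local temp file is only a stand-in for a remote store.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(os.urandom(16 * 1024 * 1024))
        for workers in (1, 2, 4, 8):
            total, rate = parallel_read_throughput(path, workers)
            print(f"{workers} reader(s): {total} bytes at {rate:.1f} MB/s")
    finally:
        os.remove(path)
```

Against a remote store, `read_range` would be replaced by whatever ranged-read call the client in use provides. The point of the sketch is only that a single-stream figure and a multi-reader figure are different measurements and should be reported separately.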

Amit Kulkarni (Program Manager - Azure Data Lake)

0
votes

I started to write an Azure Data Lake Storage Throughput Analyzer and put the first code bits on GitHub.

You should run that tool on an Azure VM so that you are not measuring your internet connection.
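For readers who want to see the basic idea behind such a measurement, here is a minimal Python sketch of a timed chunked copy. It is not the analyzer mentioned above: `measure_copy_throughput` and `run_trials` are hypothetical names, and in-memory `io.BytesIO` buffers stand in for the ADL Store source and the VHD destination. It discards a warm-up pass and reports the median of several trials, which reduces (but does not eliminate) the effect of caching and one-off stalls.

```python
import io
import statistics
import time


def measure_copy_throughput(src, dst, chunk_size=4 * 1024 * 1024):
    """Copy src to dst in chunks; return (bytes_copied, seconds, MB_per_second)."""
    total = 0
    start = time.perf_counter()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (total / 1_000_000) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


def run_trials(open_src, open_dst, trials=5, warmup=1, chunk_size=4 * 1024 * 1024):
    """Run warmup + trials copies and return the median MB/s of the timed trials.

    open_src / open_dst are zero-argument callables returning fresh
    file-like objects that can be used as context managers.
    """
    rates = []
    for i in range(warmup + trials):
        with open_src() as src, open_dst() as dst:
            _, _, rate = measure_copy_throughput(src, dst, chunk_size)
        if i >= warmup:  # ignore warm-up passes
            rates.append(rate)
    return statistics.median(rates)


if __name__ == "__main__":
    # Assumption: an 8 MiB in-memory payload replaces the real source and sink.
    payload = bytes(8 * 1024 * 1024)
    rate = run_trials(lambda: io.BytesIO(payload), io.BytesIO, trials=3)
    print(f"median throughput: {rate:.1f} MB/s")
```

In a real test, `open_src` would open the remote file through the client library in use and `open_dst` would open a file on the target disk. Sweeping `chunk_size` is worthwhile, since the buffer size can noticeably affect the result.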

Please feel free to add your thoughts and code contributions to my GitHub repo as well.

I hope this helps.