
I would like some advice/tips about the right technology to select to store some forecast data on Azure. My team and I scrape weather forecast data every day from various sources and store it as-is in Azure File Storage. The files are in "grib2" format, which is a standard format for weather forecast data. We are able to extract the data from those "grib2" files using a Python script running on an Azure VM.
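For context, here is a minimal sketch of that extraction step, assuming the pygrib library (one of several Python readers for grib2; cfgrib/xarray is an alternative) and a placeholder file name:

    import pygrib

    # open the grib2 file and list the messages it contains
    grbs = pygrib.open("forecast.grib2")   # placeholder file name
    for grb in grbs:
        print(grb.name, grb.validDate)

    # read the first message back and extract values plus coordinate grids
    grbs.seek(0)
    grb = grbs.read(1)[0]
    values, lats, lons = grb.data()
    grbs.close()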

We now have many files representing hundreds of gigabytes of data to store, and I'm struggling to find which Azure data store best suits our needs in terms of practicality and cost.
We started with "Azure Table Storage" because it's a cheap solution, but I've read in many posts that it is a bit old and not well suited to our case: for example, it returns no more than 1,000 entities per query and offers no aggregation over the data.

I considered using Azure SQL Database, but it seems it can become very expensive very fast.
I also considered Azure Data Lake Storage Gen2 (and HDInsight), but I am not very at ease with blob storage and cannot really tell whether it suits my needs in terms of practicality and whether it is "easy to query".

For now, we plan to do the following:

1) Extract data from the grib2 files with a Python script running on an Azure VM
2) Insert the transformed data into [Azure storage]
3) Query the [Azure storage] from Azure Machine Learning Service or a local R script (for example)
4) Insert the computed data into [Azure storage]

where the [Azure storage] technology is still to be determined.

Any help or advice would be much appreciated, thanks.

You want to store files, lots of files, and run Python over them. ADLS Gen2 would seem to be the obvious choice. The important question is what you do with the computed data. Is it tabular/modelled? Do you run Python or relational analysis on it? You haven't defined that in your question, which is one reason you're having difficulty picking a technology. – Nick.McDermaid
Just mix Databricks in there and you have everything you need. – Nick.McDermaid
To make it simple, I want data scientists to be able to build predictive models using the stored data. (By computed data I meant the output of the ML service, for example.) – Flo

1 Answer


A couple of things I would suggest here:

  1. Store the downloaded files in raw format (grib2 in your case) on good ol' Azure Blob Storage. Cheap storage, exactly for your needs. (See the upload sketch after this list.)
  2. Use Azure Databricks to load the data from the storage account and unpack it into memory (Python or Scala); a sketch follows below.
  3. Keep the data in memory - still in Databricks - to run your ML inferencing. You could also use SparkR if you really want to.
  4. Store the computed data in a serving layer. This really depends on what you want to do with it later; Azure SQL Database is often an obvious choice. There is a native Spark connector which efficiently writes data from Databricks to SQL DB (sketched after this list).
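
A minimal sketch of step 1, assuming the azure-storage-blob Python SDK; the connection string, container and paths are placeholders:

    from azure.storage.blob import BlobServiceClient

    # connect with a placeholder connection string and pick a target blob
    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="raw-grib2",
                                   blob="2019/06/01/forecast.grib2")

    # upload the raw grib2 file as-is
    with open("forecast.grib2", "rb") as f:
        blob.upload_blob(f, overwrite=True)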
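
For steps 2 and 3, one hedged option: mount the storage container in Databricks, after which it is visible to ordinary Python code through the /dbfs FUSE path, so the same pygrib extraction from the question works unchanged. Mount point and paths are placeholders:

    import pygrib

    # the container was mounted beforehand, e.g. with dbutils.fs.mount(...),
    # so it is reachable from plain Python via the /dbfs FUSE path
    grbs = pygrib.open("/dbfs/mnt/raw-grib2/2019/06/01/forecast.grib2")
    grb = grbs.read(1)[0]
    values, lats, lons = grb.data()   # numpy arrays, ready for inferencing
    grbs.close()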
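
And a sketch of step 4, shown here with Spark's built-in JDBC writer for simplicity (the dedicated connector mentioned above exposes a similar DataFrame API); server, table and credentials are placeholders:

    # computed_df is the Spark DataFrame produced in step 3
    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

    (computed_df.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.forecasts")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())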

In addition to using Databricks as your inferencing environment, it's also a good choice for ML training (e.g. utilizing Azure ML Service).