0 votes

I have data saved from a Spark dataframe to Azure Blob storage in JSON format. Now I have written a Stream Analytics job to fetch the data from Azure Blob storage and store it in Cosmos DB.
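For context, the write from Spark looked roughly like the sketch below (PySpark; the storage account name, key and the stand-in dataframe are placeholders rather than my real values; only the container and folder names match my setup):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob-export").getOrCreate()

    # Placeholder storage account name and key.
    spark.conf.set(
        "fs.azure.account.key.<storageaccount>.blob.core.windows.net",
        "<storage-account-key>",
    )

    # Stand-in for the real dataframe built earlier in the pipeline.
    df = spark.createDataFrame([(1, "example")], ["id", "value"])

    # Write as JSON into the dataframecopy container under dataload/testdata,
    # which is the path the Stream Analytics blob input points to.
    df.write.mode("overwrite").json(
        "wasbs://dataframecopy@<storageaccount>.blob.core.windows.net/dataload/testdata"
    )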

When I tested the Stream Analytics job with a sample file (less than 1 MB) containing 10K records, it returned all 10K records as output, which is the expected result.

The problem is that when I sampled data directly from Blob storage and tested, only around 700 records were returned. There is about 5 GB of data in Blob storage, so the expected output should be far more than 700 rows.

Any idea why this discrepancy in the number of records is happening? My Blob storage structure is as below. The container name is dataframecopy and dataload/testdata is the location where the files are stored. [screenshot: container and folder structure]

Below are the sizes of the files available. [screenshot: file sizes]

The Blob input settings configured for the Stream Analytics job are given below. [screenshot: Blob input settings]

The output for data sampled from the Blob input is 783 rows, as shown below, whereas if I upload a 1 MB sample data file from my local machine it returns 10K rows. [screenshot: sampling output]
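To double-check what the full output should be, I can count the records across all the JSON files in that folder. This is only a rough sketch assuming the azure-storage-blob v12 SDK and a placeholder connection string (and it downloads everything, so it is slow on 5 GB), but it gives the exact expected row count:

    from azure.storage.blob import ContainerClient

    # Placeholder connection string; container and folder match my setup above.
    conn_str = "<storage-account-connection-string>"
    container = ContainerClient.from_connection_string(conn_str, "dataframecopy")

    total = 0
    for blob in container.list_blobs(name_starts_with="dataload/testdata"):
        # Skip _SUCCESS markers and other non-data files written by Spark.
        if not blob.name.endswith(".json"):
            continue
        data = container.download_blob(blob.name).readall().decode("utf-8")
        # Spark writes line-delimited JSON, so each non-empty line is one record.
        total += sum(1 for line in data.splitlines() if line.strip())

    print("Total records in dataload/testdata:", total)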


2 Answers

0 votes

Sampling events from a live source will retrieve up to 1000 events or 1 MB (whichever comes first), so the data sampled may not represent the full time interval specified.

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-test-query

Your question isn't super clear to me, but does this fit your scenario?

0 votes

When adding a sample file from your local machine, the maximum file size you can upload is 2 MB. If you sample from the Blob input itself, it will not take the entire data from the Blob; it only fetches less than 1 MB as sample data. So the number of rows obtained in the output will be comparatively smaller.

Once you run the Stream Analytics job, you can see that the entire data in the blob gets processed. So the behaviour described in the question is not an error or issue.
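If you want to confirm that after the job has run, one rough way (a sketch only; it assumes the azure-cosmos Python SDK, and the endpoint, key, database and container names are placeholders) is to count the documents that landed in the Cosmos DB output container:

    from azure.cosmos import CosmosClient

    # Placeholder endpoint, key and database/container names for the Cosmos DB output.
    client = CosmosClient("https://<account>.documents.azure.com:443/", "<account-key>")
    container = client.get_database_client("<database>").get_container_client("<container>")

    # Cross-partition count of all documents written by the Stream Analytics job.
    count = list(container.query_items(
        query="SELECT VALUE COUNT(1) FROM c",
        enable_cross_partition_query=True,
    ))[0]

    print("Documents in the Cosmos DB container:", count)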