1
votes

We need to convert some huge files stored in Azure data lake store from nested JSON to CSV. Since the python modules pandas, numpy are supported in Azure data lake analytics apart from the standard modules I believe its pretty much possible to achieve this using python. Does anyone have the python code to achieve this?

Source Format:

{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}

Expected Output

TDM,location,location/80:7a:bf:d4:d6:50,974851970,1490004475,OP_UPDATE,151002334,xxxxxxx,gHq/1NZQ,977.7259,638.8827,490,1,n5THo6IINuOSVZ/cTidNVA==,7hY/jVh9NRqqxF6gbqT7Jw==,LV/ZiQRQMS2wwKiKTvYNBQ==,H5rrAD/jg1Fnkmo1Zmquau/Qn1U=,ALGORITHM_ESTIMATION,FEET

1

1 Answers

0
votes

According to your description, based on my understanding, I think your key needs is how to convert data stored in Azure Data Lake Store from JSON format to CSV format in Python with pandas/numpy packages. So I viewed your source data, and assumed there is not array type in JSON, then I designed the code below for the sample data conversion.

Here is my sample code for a JSON format object string. As reference, I add some comments for understanding my idea which the key is flattern method for converting the structure {"A": 0, "B": {"C": 1}} to the structure [["A", "B.C"], [0, 1]].

import json
import pandas as pd

# Source Data string
json_raw = '''{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}'''

# Load source data string to a Python dict
json_data = json.loads(json_raw)

# The key method `flattern` for converting `dict` to `2D-list`
def flattern(data, key):
    keys = []
    values = []
    if key is None:
        for key in data:
            if type(data[key]) is dict:
                keys.extend(flattern(data[key], key)[0])
                values.extend(flattern(data[key], key)[1])
            else:
                keys.append(key)
                values.append(data[key])
    else:
        for subkey in data:
            if type(data[subkey]) is dict:
                keys.extend(flattern(data[subkey], key+"."+subkey)[0])
                values.extend(flattern(data[subkey], subkey)[1])
            else:
                keys.append(key+"."+subkey)
                values.append(data[subkey])
    return [keys, values]

list2D = flattern(json_data, None)
df = pd.DataFrame([list2D[1],], columns=list2D[0])

# If you want to extract the items `Loc` & `Topic` & others like `location.staEthMac.addr`, you just need to create a list for them.
selected = ["Loc", "Topic"]
# Use `selected` list to select the columns you want.
result = df.ix[:,selected]
# Transform DataFrame to csv string
csv_raw = "\n".join([",".join(lst) for lst in pd.np.array(result)])

Hope it helps.