Step 1: Reading entire data into a single column and convert into ArrayType
input_file = "../data/datafile.csv"
initial_df = sqlContext.read.format("csv").csv(input_file)
initial_df.show(n=100, truncate=False)
Output Result:
+------------------------------------------------------------------------+
|_c0 |
+------------------------------------------------------------------------+
|ABC:"MobileData"|XYZ:"TableData"|ZXC:"MacData"|MNB:"" |
|ABC:"value1" |XYZ:"value2" |ZXC:"value3" |MNB:"value4" |
|ABC: "valueA" |XYZ:"ValueB" |ZXC:"valueC" |MNB:"valueD"|POI:"valueE"|
|ABC:"value11" |XYZ:"value12" |ZXC:"value13" |MNB:"value14" |
|ABC:"value1A" |XYZ:"value2A" |ZXC:"value3A" |
+------------------------------------------------------------------------+
Step 2: Converting StringType to ArrayType.
inter_df = initial_df.withColumn("array", F.split(initial_df['_c0'], '\|'))
inter_df.show(n=100, truncate=False)
+------------------------------------------------------------------------+------------------------------------------------------------------------------+
|_c0 |array |
+------------------------------------------------------------------------+------------------------------------------------------------------------------+
|ABC:"MobileData"|XYZ:"TableData"|ZXC:"MacData"|MNB:"" |[ABC:"MobileData", XYZ:"TableData", ZXC:"MacData", MNB:""] |
|ABC:"value1" |XYZ:"value2" |ZXC:"value3" |MNB:"value4" |[ABC:"value1" , XYZ:"value2" , ZXC:"value3" , MNB:"value4"] |
|ABC: "valueA" |XYZ:"ValueB" |ZXC:"valueC" |MNB:"valueD"|POI:"valueE"|[ABC: "valueA" , XYZ:"ValueB" , ZXC:"valueC" , MNB:"valueD", POI:"valueE"]|
|ABC:"value11" |XYZ:"value12" |ZXC:"value13" |MNB:"value14" |[ABC:"value11" , XYZ:"value12" , ZXC:"value13" , MNB:"value14"] |
|ABC:"value1A" |XYZ:"value2A" |ZXC:"value3A" |[ABC:"value1A" , XYZ:"value2A" , ZXC:"value3A"] |
+------------------------------------------------------------------------+------------------------------------------------------------------------------+
Step 3: Converting ArrayType to Dictionary Type so based on key am going to take the Respective key Values. Here am using UDF for converting ArrayType to MapType. For this conversion, it's taking a huge time. (Currently am running code with 300GB file, for processing its taking 3Hour time ) I want to reduce consuming time. Can anyone help me with this please*
def create_dict(input_string):
result_list = {}
for ele in input_string:
internal_ele = ele.strip()
internal_ele = internal_ele.split(":")
internal_ele = [ele.strip() for ele in internal_ele]
result_list[internal_ele[0]] = internal_ele[1].replace('"', "")
return result_list
create_dict_udf = F.udf(create_dict, MapType(keyType=StringType(), valueType=StringType()))
inter_df = inter_df.withColumn("dictionary", create_dict_udf(F.col("array")))
inter_df.show(n=100, truncate=False)
+------------------------------------------------------------------------+------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|_c0 |array |dictionary |
+------------------------------------------------------------------------+------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|ABC:"MobileData"|XYZ:"TableData"|ZXC:"MacData"|MNB:"" |[ABC:"MobileData", XYZ:"TableData", ZXC:"MacData", MNB:""] |[MNB -> , XYZ -> TableData, ABC -> MobileData, ZXC -> MacData] |
|ABC:"value1" |XYZ:"value2" |ZXC:"value3" |MNB:"value4" |[ABC:"value1" , XYZ:"value2" , ZXC:"value3" , MNB:"value4"] |[MNB -> value4, XYZ -> value2, ABC -> value1, ZXC -> value3] |
|ABC: "valueA" |XYZ:"ValueB" |ZXC:"valueC" |MNB:"valueD"|POI:"valueE"|[ABC: "valueA" , XYZ:"ValueB" , ZXC:"valueC" , MNB:"valueD", POI:"valueE"]|[MNB -> valueD, XYZ -> ValueB, ABC -> valueA, POI -> valueE, ZXC -> valueC]|
|ABC:"value11" |XYZ:"value12" |ZXC:"value13" |MNB:"value14" |[ABC:"value11" , XYZ:"value12" , ZXC:"value13" , MNB:"value14"] |[MNB -> value14, XYZ -> value12, ABC -> value11, ZXC -> value13] |
|ABC:"value1A" |XYZ:"value2A" |ZXC:"value3A" |[ABC:"value1A" , XYZ:"value2A" , ZXC:"value3A"] |[XYZ -> value2A, ABC -> value1A, ZXC -> value3A] |
+------------------------------------------------------------------------+------------------------------------------------------------------------------+---------------------------------------------------------------------------+
Without using UDF how can I convert ArrayType to MapType?