
I imported data from BigQuery using the following code in PySpark:

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

The output is an RDD, but each record's value is a JSON string:

[(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]

I need to extract all the values into an RDD. The main concern is that the resulting RDD should not contain double quotes around each record.

Required:

Value1,Value4
Value2

and not:

"Value1,Value4"
"Value2"
Can you show your result in a valid python data structure? Do you need another rdd returned as well? - Psidom
I need an RDD since I would be using MLlib to implement an algorithm. - Nivi
If the JSON value is a comma-separated string, its type will already be str. How would you know the type of each value, such as float, int, str and so on? - Willian Fuks
Do you need a RDD of sc.parallelize(["Value1,Value4", "Value2"]) for instance? - Psidom
@WillianFuks Every value is a string. - Nivi

2 Answers

1 vote

You can parse each record with the json module:

import json

table_data.map(lambda t: json.loads(t[1]).get("colA")).collect()
# [u'Value1,Value4', u'Value2']
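Because json.loads parses the JSON string, the extracted values come back as plain Python strings with no surrounding double quotes. A quick check outside Spark, with a plain list standing in for the RDD, illustrates this:

```python
import json

# Sample records in the same shape as the RDD: (offset, JSON string)
records = [(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]

# json.loads turns each JSON string into a dict; indexing "colA"
# yields the bare string value, without quotes.
values = [json.loads(payload)["colA"] for _, payload in records]
print(values)  # ['Value1,Value4', 'Value2']
```

The same lambda works unchanged inside `table_data.map(...)`; `.get("colA")` returns None instead of raising if a record is missing the key.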
1 vote

From what I understood from your question, this is what you are looking for:

import json
data = sc.parallelize([(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')])
data = data.map(lambda x: (json.loads(x[1])['colA']))
print(data.collect())

Results:

['Value1,Value4', 'Value2']
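If each comma-separated value should become its own record before the RDD is handed to MLlib (an assumption about the shape you need), a flatMap over split(',') would do it. The same logic sketched with plain lists, assuming no value contains an embedded comma:

```python
import json

# Sample records in the same shape as the RDD: (offset, JSON string)
records = [(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]

# Spark equivalent (sketch): data.flatMap(lambda x: json.loads(x[1])['colA'].split(','))
flat = [value
        for _, payload in records
        for value in json.loads(payload)["colA"].split(",")]
print(flat)  # ['Value1', 'Value4', 'Value2']
```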