
I imported data from BigQuery using the following code in PySpark:

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

The output is an RDD, but each record's value is a JSON string:

[(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]

I need to extract all the values into an RDD. The main concern is that the resulting RDD should not contain double quotes around each record.

Required:

Value1,Value4
Value2

and not:

"Value1,Value4"
"Value2"
Can you show your result in a valid python data structure? Do you need another rdd returned as well? - Psidom
I need an RDD since I would be using MLlib to implement an algorithm. - Nivi
If the JSON value is a comma-separated string, its type will already be str. How would you know the type of each value, such as float, int, str and so on? - Willian Fuks
Do you need a RDD of sc.parallelize(["Value1,Value4", "Value2"]) for instance? - Psidom
@WillianFuks Every value is a string. - Nivi

2 Answers

1 vote

You can parse each record with the json module:

import json

table_data.map(lambda t: json.loads(t[1]).get("colA")).collect()
# [u'Value1,Value4', u'Value2']
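Because json.loads parses the JSON string, the extracted values come back as plain Python strings with no surrounding double quotes. A quick check outside Spark, with a plain list standing in for the RDD, illustrates this:

```python
import json

# Sample records in the same shape as the RDD: (offset, JSON string)
records = [(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]

# json.loads turns each JSON string into a dict; indexing "colA"
# yields the bare string value, without quotes.
values = [json.loads(payload)["colA"] for _, payload in records]
print(values)  # ['Value1,Value4', 'Value2']
```

The same lambda works unchanged inside `table_data.map(...)`; `.get("colA")` returns None instead of raising if a record is missing the key.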
1 vote

From what I understood from your question, this is what you are looking for:

import json
data = sc.parallelize([(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')])
data = data.map(lambda x: (json.loads(x[1])['colA']))
print(data.collect())

Results:

['Value1,Value4', 'Value2']
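If each comma-separated value should become its own record before the RDD is handed to MLlib (an assumption about the shape you need), a flatMap over split(',') would do it. The same logic sketched with plain lists, assuming no value contains an embedded comma:

```python
import json

# Sample records in the same shape as the RDD: (offset, JSON string)
records = [(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]

# Spark equivalent (sketch): data.flatMap(lambda x: json.loads(x[1])['colA'].split(','))
flat = [value
        for _, payload in records
        for value in json.loads(payload)["colA"].split(",")]
print(flat)  # ['Value1', 'Value4', 'Value2']
```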