I imported data from BigQuery using the following code in PySpark:
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
The output is a pair RDD, with each value holding the record as a JSON string:
[(0, u'{"colA":"Value1,Value4"}'), (52, u'{"colA":"Value2"}')]
I need to extract all the values into an RDD. The main concern is that the resulting records should not be wrapped in double quotes.
Required:
Value1,Value4
Value2
and not:
"Value1,Value4"
"Value2"
Comments:
- How would you know the type of each Value, such as float, int, str and so on? – Willian Fuks
- `sc.parallelize(["Value1,Value4", "Value2"])`, for instance? – Psidom