I have a large dataset stored into a BigQuery table and I would like to load it into a pypark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and pyspark should be able to use this interface in order to create an RDD by using the method "newAPIHadoopRDD".
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Is there anybody who has figured out how to do this?