I am running a Spark 2.2 job on Dataproc and need to access a number of Avro files located in a Google Cloud Storage bucket. Specifically, I need to access the files DIRECTLY from the bucket (i.e. NOT have them first copied to the master machine, both because they may be very large and for compliance reasons).
I am using the gs://XXX notation to refer to the bucket inside the Spark code, as recommended in this doc:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
Everything seems to work. However, I am seeing the following warnings repeatedly:
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns2.avro' is not open.
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns1.avro' is not open.
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns3.avro' is not open.
Is this a serious warning? Would it have any material impact on real-world performance (speed), particularly when many large files are involved? If so, how should I fix it, or should I just ignore it?
**** UPDATE:
Here is the most basic Java code that reproduces this:
import java.util.List;
import java.util.UUID;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

// Enclosing class for main (the name is arbitrary).
public class AvroTest
{
    public static void main(String[] args) throws Exception
    {
        SparkConf spConf = new SparkConf().setAppName("AVRO-TEST-" + UUID.randomUUID().toString());
        Master1 master = new Master1(spConf);
        master.readSpark("gs://ff_src_data");
    }
}

class Master1
{
    private SparkConf m_spConf;
    private JavaSparkContext m_jSPContext;

    public Master1(SparkConf spConf)
    {
        m_spConf = spConf;
        m_jSPContext = new JavaSparkContext(m_spConf);
    }

    public void readSpark(String srcDir)
    {
        SQLContext sqlContext = SQLContext.getOrCreate(JavaSparkContext.toSparkContext(m_jSPContext));

        // Read all Avro files directly from the GCS bucket via the spark-avro data source.
        Dataset<Row> trn = sqlContext.read().format("com.databricks.spark.avro").load(srcDir);
        trn.printSchema();
        trn.show();

        // Collect the rows back to the driver and print the first column of each.
        List<Row> rows = trn.collectAsList();
        for (Row row : rows)
        {
            System.out.println("Row content [0]:\t" + row.getDouble(0));
        }
    }
}
For now, this is just a simple setup to test the ability to load a set of Avro files directly from the GCS bucket.
Also, to clarify: this is Dataproc image version 1.2 with Spark version 2.2.1.
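For completeness, here is a sketch of how a job like this could be submitted to the cluster (the cluster name, class name, and jar path are placeholders; the spark-avro coordinates assume the Scala 2.11 build that matches Spark 2.2 on Dataproc 1.2):

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --class=AvroTest \
    --jars=target/avro-test-1.0.jar \
    --properties=spark.jars.packages=com.databricks:spark-avro_2.11:4.0.0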