I am running a Spark 2.2 job on Dataproc, and I need to access a bunch of Avro files located in a GCP storage bucket. To be specific, I need to access the files DIRECTLY from the bucket (i.e. NOT copy them onto the master machine first), both because they might be very large and for compliance reasons.

I am using the gs://XXX notation to refer to the bucket inside the Spark code, based on the recommendations in this doc: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage

Everything seems to work. However, I am seeing the following Warnings repeatedly:

18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns2.avro' is not open.
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns1.avro' is not open.
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns3.avro' is not open.

Is this a serious warning? Would it have any material impact on real-life performance (speed), particularly when many or very large files are involved? If so, how should I fix it, or should I just ignore it?

**** UPDATE:

Here's the most basic Java code to reproduce this:

    import java.util.List;
    import java.util.UUID;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;

    public class AvroTest
    {
        public static void main(String[] args) throws Exception
        {
            SparkConf spConf = new SparkConf().setAppName("AVRO-TEST-" + UUID.randomUUID().toString());
            Master1 master = new Master1(spConf);
            master.readSpark("gs://ff_src_data");
        }
    }

    class Master1
    {
        private SparkConf m_spConf;
        private JavaSparkContext m_jSPContext;

        public Master1(SparkConf spConf)
        {
            m_spConf = spConf;
            m_jSPContext = new JavaSparkContext(m_spConf);
        }

        public void readSpark(String srcDir)
        {
            SQLContext sqlContext = SQLContext.getOrCreate(JavaSparkContext.toSparkContext(m_jSPContext));

            // Load every Avro file under the given GCS directory via the spark-avro data source.
            Dataset<Row> trn = sqlContext.read().format("com.databricks.spark.avro").load(srcDir);
            trn.printSchema();
            trn.show();

            // Collect the rows to the driver and print the first (double) column of each row.
            List<Row> rows = trn.collectAsList();
            for (Row row : rows)
            {
                System.out.println("Row content [0]:\t" + row.getDouble(0));
            }
        }
    }

For now, this is just a silly setup to test the ability to load a bunch of Avro files directly from the GCS bucket.

Also, to clarify: this is Dataproc image version 1.2 and Spark version 2.2.1.

1 Answer

This warning means that the code closed the GoogleCloudStorageReadChannel after it was already closed. It's a harmless warning message, but it could signal that input streams are handled inconsistently in the code when reading files.
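
As a rough illustration (a minimal, hypothetical sketch with made-up names, not the connector's actual implementation), the pattern being flagged is a close() call arriving on a channel that is no longer open, for example when both a record reader and an outer stream wrapper close the same underlying channel:

    import java.nio.channels.ReadableByteChannel;

    // Hypothetical wrapper sketching the double-close pattern behind the warning.
    // The redundant close() is harmless; it is merely logged.
    class LoggingReadChannel implements AutoCloseable
    {
        private final ReadableByteChannel channel;
        private final String resourceName;

        LoggingReadChannel(ReadableByteChannel channel, String resourceName)
        {
            this.channel = channel;
            this.resourceName = resourceName;
        }

        @Override
        public void close() throws Exception
        {
            if (!channel.isOpen())
            {
                // A second close() lands here and produces
                // "Channel for 'gs://...' is not open."
                System.err.println("WARN Channel for '" + resourceName + "' is not open.");
                return;
            }
            channel.close();
        }
    }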

Could you provide a simplified version of your job that reproduces this warning (the more concise, the better)? With that repro I will be able to check whether this is an issue in the GCS connector or in the Hadoop/Spark Avro input format.

Update: This warning message was removed in GCS connector 1.9.10, so upgrading to a connector (or a Dataproc image) that includes 1.9.10 or later makes it disappear.