
I'm trying to save some test data to S3 from my local laptop using Java, and I'm getting the following error:

java.io.IOException: No FileSystem for scheme: s3a
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1443)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:209)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:266)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:489)

Below is my code:

private void testSaveToS3(SysS3Configuration s3Configuration) {
    try {
        Schema avroSchema = TestDTO.getClassSchema();

        Path path = new Path("s3a://" + s3Configuration.getBucketName() + "/test.parquet");

        Configuration config = new Configuration();
        config.set("fs.s3a.access.key", s3Configuration.getAccessKeyId());
        config.set("fs.s3a.secret.key", s3Configuration.getSecretKey());

        ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(avroSchema)
                .withConf(config)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .build();

        List<TestDTO> list = new ArrayList<>();
        TestDTO l1 = new TestDTO();
        l1.setId(1);
        l1.setValue(11);

        TestDTO l2 = new TestDTO();
        l2.setId(2);
        l2.setValue(22);

        list.add(l1);
        list.add(l2);

        for (TestDTO d : list) {
            final GenericRecord record = new GenericData.Record(avroSchema);
            record.put("id", d.getId());
            record.put("value", d.getValue());
            writer.write(record);
        }

        writer.close();

    } catch (Exception e) {
        e.printStackTrace();
    }
}

I googled around but didn't find an answer. Any thoughts? Thanks in advance.

UPDATE:

  1. This is a Java application and my local laptop doesn't have Hadoop installed.
  2. I have the following dependencies:
compile 'com.amazonaws:aws-java-sdk:1.11.747'
compile 'org.apache.parquet:parquet-avro:1.8.1'
compile 'org.apache.hadoop:hadoop-aws:3.3.0'

UPDATE: I changed the hadoop-aws version to 3.3.0 as suggested, but I still get the same error:

java.io.IOException: No FileSystem for scheme: s3a
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
......

I then tried changing the "s3a://" in my path string to "s3n://". Now I get a different error:

java.io.IOException: The s3n:// client to Amazon S3 is no longer available: please migrate to the s3a:// client
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:82)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)

......

Any ideas, guys?

Sorry, I did not notice your update with the dependency versions; the version you are using is probably too old to contain the s3a implementation. I updated my answer. – Jörn Horstmann

2 Answers


The first thing to check would be the dependencies; the S3 filesystem implementation is in a separate artifact from the rest of Hadoop. For example, in Gradle syntax:

api("org.apache.hadoop:hadoop-aws:$hadoopVersion")

Update: Since you added your dependencies: Hadoop version 1.2.1 is really old; the current version as of August 2020 is 3.3.0. With the older version you might be able to use S3 with the s3:// or s3n:// prefixes, but you should really update, since the newer s3a implementation contains a lot of improvements.
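
Once the right artifacts are on the classpath, you can verify that the scheme resolves without touching the network. A quick sanity-check sketch (my own suggestion, not part of the original answer):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Resolves the class registered for the s3a scheme without connecting to S3.
// This throws "No FileSystem for scheme: s3a" when hadoop-aws (and a
// matching hadoop-common) is missing from the classpath.
Class<?> s3aClass = FileSystem.getFileSystemClass("s3a", new Configuration());
System.out.println(s3aClass.getName()); // expect org.apache.hadoop.fs.s3a.S3AFileSystem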


Adding this to the config works for me:

conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

My dependencies in build.gradle to read a Parquet file from S3:

compile 'org.slf4j:slf4j-api:1.7.5'
compile 'org.slf4j:slf4j-log4j12:1.7.5'
compile 'org.apache.parquet:parquet-avro:1.12.0'
compile 'org.apache.avro:avro:1.10.2'
compile 'com.google.guava:guava:11.0.2'
compile 'org.apache.hadoop:hadoop-client:2.4.0'
compile 'org.apache.hadoop:hadoop-aws:3.3.0'   
compile 'org.apache.hadoop:hadoop-common:3.3.0'      
compile 'com.amazonaws:aws-java-sdk-core:1.11.563'
compile 'com.amazonaws:aws-java-sdk-s3:1.11.563'

And if you have some data with Date and byte[] values, you also need to add this to the config:

conf.setBoolean(org.apache.parquet.avro.AvroReadSupport.READ_INT96_AS_FIXED, true);
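
Putting it together, a minimal read sketch using the same config (the bucket path and credentials are placeholders):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

Configuration conf = new Configuration();
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
conf.set("fs.s3a.access.key", "<access-key>");  // placeholder
conf.set("fs.s3a.secret.key", "<secret-key>");  // placeholder
conf.setBoolean(AvroReadSupport.READ_INT96_AS_FIXED, true);

// Read records back from S3; try-with-resources closes the reader.
try (ParquetReader<GenericRecord> reader =
         AvroParquetReader.<GenericRecord>builder(new Path("s3a://my-bucket/test.parquet"))
                 .withConf(conf)
                 .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}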