2
votes

So I'm trying to run the following simple query on redshift spectrum:

select * from company.vehicles where vehicle_id is not null

and it return 0 rows(all of the rows in the table are null). However when I run the same query on athena it works fine and return results. Tried msck repair but both athena and redshift are using the same metastore so it shouldn't matter. I also don't see any errors.

The format of the files is orc.

The create table query is:

CREATE EXTERNAL TABLE 'vehicles'(
  'vehicle_id' bigint, 
  'parent_id' bigint, 
  'client_id' bigint, 
  'assets_group' int, 
  'drivers_group' int)
PARTITIONED BY ( 
  'dt' string, 
  'datacenter' string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0',  
  'classification'='orc', 
  'compressionType'='none')

Any idea?

2

2 Answers

1
votes

How did you create your external table ?? For Spectrum,you have to explicitly set the parameters to treat what should be treated as null

add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns with '' will be treated as NULL to your external table in spectrum

**

CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
    fields terminated by ','
    stored as textfile 
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');

**

Alternatively,you can setup the SERDE-PROPERTIES while creating the external table which will automatically recognize NULL values

1
votes

Eventually it turned out to be a bug in redshift. In order to fix it, we needed to run the following command:

ALTER TABLE table_name SET TABLE properties(‘orc.schema.resolution’=‘position’);