
In Spark 2.1, I often use something like

df = spark.read.parquet("/path/to/my/files/*.parquet")

to load a folder of Parquet files, even when they have different schemata. Then I run some SQL queries against the DataFrame using Spark SQL.
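
The same read can also be written as pure Spark SQL, with the mergeSchema option making the schema merge explicit (the view name is just for illustration):

-- Spark SQL (2.x): expose the folder as a temporary view
CREATE OR REPLACE TEMPORARY VIEW my_parquet
USING parquet
OPTIONS (path "/path/to/my/files/*.parquet", mergeSchema "true");

SELECT * FROM my_parquet LIMIT 10;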

Now I want to try Impala, because I read the wiki article, which contains sentences like:

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...].

Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.

So it sounds like it could also fit my use case (and maybe perform a bit faster).

But when I try things like:

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/*.parquet'
STORED AS PARQUET
LOCATION '/tmp';

I get an AnalysisException

AnalysisException: Cannot infer schema, path is not a file

So now my questions: Is it even possible to read a folder containing multiple Parquet files with Impala? Will Impala perform a schema merge like Spark does? What query do I need for this? I couldn't find any information about it using Google (always a bad sign...).

Thanks!

Personally, I would try Drill before Impala, only because Impala's installation isn't straightforward. Even Hive or Pig would be quicker to try if you're not already using Cloudera CDH. – OneCricketeer
I had a related question, which might also help: stackoverflow.com/questions/56581105/… – Markus

1 Answer


From what I understand, you have some Parquet files and you want to query them through Impala tables. Below is my explanation of how to do that.

You can create an external table and set its location to the directory containing the Parquet files, like below:

CREATE EXTERNAL TABLE ingest_parquet_files (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION '/path/to/my/files/';
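
Once the table exists, files that are later added to that directory outside of Impala are not picked up automatically; a quick sanity check (using the column names defined above) could be:

-- make Impala re-scan the table's directory, then peek at the data
REFRESH ingest_parquet_files;
SELECT col1, col2 FROM ingest_parquet_files LIMIT 5;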

Another option is to load the Parquet files into the table after creating it:

LOAD DATA INPATH "Your/HDFS/PATH" INTO TABLE schema.ingest_parquet_files;
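
Note that LOAD DATA moves the files from the given HDFS path into the table's directory rather than copying them, and the new data is visible right away:

-- rows from the moved files show up immediately
SELECT COUNT(*) FROM schema.ingest_parquet_files;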

What you are trying will also work; you just have to remove the wildcard, because Impala expects the path of an existing Parquet data file after LIKE PARQUET and infers the table schema from that file. So point it at one of the files (the file name below is just an example):

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET
'/path/to/my/files/some_file.parquet'
STORED AS PARQUET
LOCATION '/tmp';
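
One caveat: the schema is inferred from that single file only, so Impala will not do a schema merge across files the way Spark does; all files under the location have to be compatible with the inferred schema. You can inspect what Impala inferred with:

DESCRIBE ingest_parquet_files;
SHOW CREATE TABLE ingest_parquet_files;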

Below is the template you can refer to, pulled from the Cloudera Impala documentation:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE PARQUET 'hdfs_path_of_parquet_file'
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
  ]
  [LOCATION 'hdfs_path']
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]

data_type:
    primitive_type
  | array_type
  | map_type
  | struct_type
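
A filled-in instance of that template for your setup could look like the below (database name and data file name are illustrative):

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.ingest_parquet_files
  LIKE PARQUET '/path/to/my/files/some_file.parquet'
  COMMENT 'schema inferred from a single Parquet data file'
  STORED AS PARQUET
  LOCATION '/path/to/my/files/';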

Please note that the user you are running as should have read/write access to any path you give to Impala. You can set this up with the steps below (these use Sentry-style roles):

# Log in as a Hive superuser to perform the steps below
create role <role_name_x>;

# For granting access to a database
grant all on database <db_name> to role <role_name_x>;

# For granting access to an HDFS path
grant all on URI '/hdfs/path' to role <role_name_x>;

# Granting the role to the group of the user that will run the Impala job
grant role <role_name_x> to group <your_group_name>;

# After you perform the above steps, you can validate with the commands below.
# "show grant role" should list the URI or database access when you run it
# on the role name, as below:

show grant role <role_name_x>;

# Now to validate that the group has been granted the role:

show role grant group <your_group_name>;

More on how roles and permissions work can be found in the Cloudera documentation.