In Spark 2.1 I often use something like
df = spark.read.parquet("/path/to/my/files/*.parquet")
to load a folder of Parquet files, even with different schemata. Then I perform some SQL queries against the DataFrame using Spark SQL.
Now I want to try Impala because I read the wiki article, which contains sentences like:
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...].
Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.
So it sounds like it could also fit my use case (and maybe perform a bit faster).
But when I try things like:
CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET
'/path/to/my/files/*.parquet'
STORED AS PARQUET
LOCATION '/tmp';
I get an AnalysisException:
AnalysisException: Cannot infer schema, path is not a file
So now my questions: Is it even possible to read a folder containing multiple Parquet files with Impala? Will Impala perform a schema merge like Spark? What query do I need to perform this action? I couldn't find any information about it using Google (always a bad sign...).
Thanks!