We have our data in SQL Server at the moment, and we are trying to move it to our S3 bucket as Parquet files. The intention is to analyse this S3 data in AWS EMR (mainly Spark, Hive and Presto). We don't want to store our data in HDFS.
What are the choices here? So far as we know, it seems we can use either Spark or Sqoop for this import. Although Sqoop is faster than Spark in this case due to its parallelism (parallel DB connections), it seems that writing Parquet files from Sqoop directly to S3 is not possible - Sqoop + S3 + Parquet results in a "Wrong FS" error. The workaround is to write to HDFS first and then copy to S3, but this seems inefficient. How about using Spark SQL to pull this data from SQL Server and write it as Parquet to S3?
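For reference, a minimal sketch of what the Spark SQL route could look like. Everything here is a placeholder assumption (hostname, credentials, table name, split column, bucket path), and it assumes the Microsoft SQL Server JDBC driver jar is on the Spark classpath. The partitioned-read options are what give Spark parallel DB connections, comparable to Sqoop's `--split-by` / `--num-mappers`:

```python
# Hypothetical JDBC options for a partitioned read from SQL Server.
# partitionColumn/lowerBound/upperBound/numPartitions make Spark open
# numPartitions parallel connections, each reading one slice of the range.
jdbc_options = {
    "url": "jdbc:sqlserver://myhost:1433;databaseName=mydb",  # placeholder
    "dbtable": "dbo.table_a",                                 # placeholder
    "user": "etl_user",
    "password": "***",
    "partitionColumn": "id",     # assumed numeric column to split on
    "lowerBound": "1",
    "upperBound": "10000000",
    "numPartitions": "8",        # 8 parallel JDBC connections
}

def copy_table_to_s3(spark, day):
    """Read the table through JDBC and write it under the day_N prefix."""
    df = spark.read.format("jdbc").options(**jdbc_options).load()
    df.write.mode("overwrite").parquet("s3://mybucket/table_a/day_%d/" % day)

# Usage (inside an EMR step or spark-submit job):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("sqlserver-to-s3").getOrCreate()
#   copy_table_to_s3(spark, day=1)
```

On EMR the `s3://` scheme is handled by EMRFS, so no HDFS staging step is needed.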
Once we have loaded this data as Parquet in this layout:

s3://mybucket/table_a/day_1/ (parquet files 1 ... n)
s3://mybucket/table_a/day_2/ (parquet files 1 ... n)
s3://mybucket/table_a/day_3/ (parquet files 1 ... n)
how can I combine them into a single table and query it using Hive? I understand that we can create a Hive external table pointing to S3, but can one table point to multiple directories?
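A sketch of how this could be done with a partitioned external table (table and column names here are hypothetical). Since the directories are named `day_1`, `day_2`, ... rather than the Hive convention `day=1`, each one would need to be registered explicitly with its own `LOCATION`:

```sql
-- Assumed schema; replace columns/types with the real ones.
CREATE EXTERNAL TABLE table_a (
  id BIGINT,
  name STRING
)
PARTITIONED BY (day INT)
STORED AS PARQUET
LOCATION 's3://mybucket/table_a/';

-- Register each day_N directory as a partition of the one table.
ALTER TABLE table_a ADD PARTITION (day=1) LOCATION 's3://mybucket/table_a/day_1/';
ALTER TABLE table_a ADD PARTITION (day=2) LOCATION 's3://mybucket/table_a/day_2/';
ALTER TABLE table_a ADD PARTITION (day=3) LOCATION 's3://mybucket/table_a/day_3/';
```

If the writer instead used Hive-style paths (`.../day=1/`, `.../day=2/`), `MSCK REPAIR TABLE table_a;` could discover all the partitions automatically instead of adding them one by one.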
Thanks.
EDIT: Adding this as requested.
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
    at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)