3
votes

We have a use case where we run an ETL written in Spark on top of some streaming data. The ETL writes results to the target Hive table every hour, but users commonly run queries against that table, and we have seen query errors caused by Spark loading the table at the same time. What alternatives do we have to avoid or minimize these errors? Is there any property for the Spark job (or for the Hive table)? Or something like creating a temporary table?

The error is:

java.io.FileNotFoundException: File does not exist [HDFS PATH]

I think this happens because the metadata says there is a file A that then gets deleted during the job execution.

The table is partitioned by year, month and day (using HDFS as storage), and every time the ETL runs it updates (via a partition overwrite) only the current date's partition. Currently no "transactional" tables are enabled in the cluster (and even with them enabled, I tested the use case on a test cluster without luck).
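
For reference, the hourly load is roughly like the following (a minimal sketch, assuming PySpark with dynamic partition overwrite; the database, table and source names are placeholders):

from pyspark.sql import SparkSession

# Minimal sketch of the hourly load; table and source names are placeholders.
spark = (
    SparkSession.builder
    .appName("hourly-etl")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")  # overwrite only the touched partitions
    .enableHiveSupport()
    .getOrCreate()
)

hourly_results = spark.table("staging.events")  # hypothetical source of the streamed data

# Overwriting the current date's partition replaces its files in place,
# which is what breaks queries that are already reading those files.
hourly_results.write.mode("overwrite").insertInto("warehouse.target_table")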

3
Where exactly are you storing your data? S3? HDFS? Also, what's the full stack trace? And are you directly manipulating files, or are you using the table abstractions? – Andrew Long
Data is stored in HDFS, using the table abstractions (writing Spark dataframes to Hive). The more detailed error is: Exception: ERROR : Status: Failed ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1585784648138_16268_15_00, diagnostics=[Task failed, taskId=task_1585784648138_16268_15_00_000022, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: <HDFS PATH> – Luis Leal

3 Answers

3
votes

The easy option is to use a table format that's designed to handle concurrent reads and writes, like Hudi or Delta Lake. The more complicated version involves using a partitioned, append-only table that the writer writes to; on completion, the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
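
For example, with Delta Lake a partition overwrite becomes an atomic commit in the transaction log, so concurrent readers keep seeing the previous snapshot instead of missing files. A minimal sketch (assuming the Delta Lake package is on the Spark classpath; the path, source name and partition predicate are placeholders):

from pyspark.sql import SparkSession

# Assumes the Delta Lake (delta-core) package is available to the Spark session.
spark = (
    SparkSession.builder
    .appName("hourly-etl-delta")
    .enableHiveSupport()
    .getOrCreate()
)

hourly_results = spark.table("staging.events")  # hypothetical source

# Replace only the current date's partition; the swap is committed atomically
# in the Delta transaction log, so a reader sees either the old files or the
# new ones, never a half-deleted mix.
(hourly_results.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "year = 2020 AND month = 4 AND day = 15")  # placeholder predicate
    .save("/warehouse/target_table_delta"))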

0
votes

Have a set of two tables and a view over them:

CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;

The first iteration of the ETL process needs to:

  1. Synchronize foo_a -> foo_b
  2. Do the work on foo_b
  3. Drop view foo and recreate it pointing to foo_b

Until step 3, user queries run against table foo_a. From the moment of the switch they run against foo_b. The next iteration of the ETL will work in the opposite direction.
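
Driving the switch from the Spark job could look like this (a minimal sketch, assuming PySpark with Hive support; the sync step is shown as a plain full copy, and the staging source name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Synchronize foo_a -> foo_b (shown here as a simple full copy).
spark.sql("INSERT OVERWRITE TABLE foo_b SELECT * FROM foo_a")

# 2. Do the work on foo_b: load this run's results into it.
new_data = spark.table("staging.events")  # hypothetical source
new_data.write.mode("append").insertInto("foo_b")

# 3. Repoint the view; from this moment user queries against foo hit foo_b.
spark.sql("DROP VIEW IF EXISTS foo")
spark.sql("CREATE VIEW foo AS SELECT x, y, z FROM foo_b")

If the gap between the DROP and the CREATE is a concern, Hive also supports ALTER VIEW foo AS SELECT ..., which replaces the view definition in a single statement and narrows that window.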

This is not perfect. You need double the storage and some extra complexity in the ETL. Even so, this approach might fail if:

  • the user is unlucky enough to hit the short window between dropping and recreating the view
  • the user submits a query that's heavy enough to run across two iterations of the ETL

-1
votes

Not sure, but check it out:

CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);