3
votes

We have a use case where we run an ETL written in Spark on top of some streaming data. The ETL writes results to the target Hive table every hour, but users commonly run queries against that table, and we have seen query errors caused by Spark loading the table at the same time. What alternatives do we have to avoid or minimize these errors? Is there any property for the Spark job (or for the Hive table)? Or something like creating a temporary table?

The error is:

java.io.FileNotFoundException: File does not exist [HDFS PATH]

I think this happens because the metadata says there is a file A that then gets deleted during the job execution.

The table is partitioned by year, month and day (using HDFS as storage), and every time the ETL runs it updates (via a partition overwrite) only the current date's partition. Currently no "transactional" tables are enabled in the cluster (and even with them enabled, I tested the use case on a test cluster without luck).
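
For reference, the hourly load is roughly like the following (a minimal sketch, assuming PySpark with dynamic partition overwrite; the database, table and source names are placeholders):

from pyspark.sql import SparkSession

# Minimal sketch of the hourly load; table and source names are placeholders.
spark = (
    SparkSession.builder
    .appName("hourly-etl")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")  # overwrite only the touched partitions
    .enableHiveSupport()
    .getOrCreate()
)

hourly_results = spark.table("staging.events")  # hypothetical source of the streamed data

# Overwriting the current date's partition replaces its files in place,
# which is what breaks queries that are already reading those files.
hourly_results.write.mode("overwrite").insertInto("warehouse.target_table")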

3
Where exactly are you storing your data? S3? HDFS? Also, what's the full stack trace? And are you directly manipulating files, or are you using the table abstractions? – Andrew Long
Data is stored in HDFS, using the table abstractions (writing Spark dataframes to Hive). The more detailed error is: Exception: ERROR : Status: Failed ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1585784648138_16268_15_00, diagnostics=[Task failed, taskId=task_1585784648138_16268_15_00_000022, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: <HDFS PATH> – Luis Leal

3 Answers

3
votes

The easy option is to use a table format that's designed to handle concurrent reads and writes, like Hudi or Delta Lake. The more complicated version involves using a partitioned, append-only table that the writer writes to; on completion, the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
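
For example, with Delta Lake a partition overwrite becomes an atomic commit in the transaction log, so concurrent readers keep seeing the previous snapshot instead of missing files. A minimal sketch (assuming the Delta Lake package is on the Spark classpath; the path, source name and partition predicate are placeholders):

from pyspark.sql import SparkSession

# Assumes the Delta Lake (delta-core) package is available to the Spark session.
spark = (
    SparkSession.builder
    .appName("hourly-etl-delta")
    .enableHiveSupport()
    .getOrCreate()
)

hourly_results = spark.table("staging.events")  # hypothetical source

# Replace only the current date's partition; the swap is committed atomically
# in the Delta transaction log, so a reader sees either the old files or the
# new ones, never a half-deleted mix.
(hourly_results.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "year = 2020 AND month = 4 AND day = 15")  # placeholder predicate
    .save("/warehouse/target_table_delta"))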

0
votes

Have a set of two tables and a view over them:

CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;

The first iteration of the ETL process needs to:

  1. Synchronize foo_a -> foo_b
  2. Do the work on foo_b
  3. Drop view foo and recreate it pointing to foo_b

Until step 3, user queries run against table foo_a. From the moment of the switch they run against foo_b. The next iteration of the ETL will work in the opposite direction.
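
Driving the switch from the Spark job could look like this (a minimal sketch, assuming PySpark with Hive support; the sync step is shown as a plain full copy, and the staging source name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Synchronize foo_a -> foo_b (shown here as a simple full copy).
spark.sql("INSERT OVERWRITE TABLE foo_b SELECT * FROM foo_a")

# 2. Do the work on foo_b: load this run's results into it.
new_data = spark.table("staging.events")  # hypothetical source
new_data.write.mode("append").insertInto("foo_b")

# 3. Repoint the view; from this moment user queries against foo hit foo_b.
spark.sql("DROP VIEW IF EXISTS foo")
spark.sql("CREATE VIEW foo AS SELECT x, y, z FROM foo_b")

If the gap between the DROP and the CREATE is a concern, Hive also supports ALTER VIEW foo AS SELECT ..., which replaces the view definition in a single statement and narrows that window.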

This is not perfect. You need double the storage and some extra complexity in the ETL. Even so, this approach might fail if:

  • the user is unlucky enough to hit the short window between dropping and recreating the view
  • the user submits a query that's heavy enough to run across two iterations of the ETL

-1
votes

Not sure, but check it out:

CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);