If I build a Hive table on top of some S3 (or HDFS) directory like so:
create external table newtable (name string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';
When I add files to that S3 location, the Hive table doesn't automatically pick them up. The new data only shows up if I create a new Hive table over that location. Is there a way to build the Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the table automatically reflects that data, without having to recreate it?
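For reference, this is the kind of partitioned setup I'm imagining; the dt partition column and the dt=... subdirectory names are just placeholders I made up for illustration, not something I've actually deployed:

-- Hypothetical partitioned variant of the table above (dt is an invented partition column)
create external table newtable_partitioned (name string)
partitioned by (dt string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';

-- Each batch of new files would land in its own subdirectory,
-- e.g. s3a://location/subdir/dt=2017-05-01/, and then be registered explicitly:
alter table newtable_partitioned add if not exists
partition (dt='2017-05-01') location 's3a://location/subdir/dt=2017-05-01/';

-- Or, if the subdirectories follow the dt=... naming convention,
-- let Hive discover them:
msck repair table newtable_partitioned;

Even with this, it looks like I'd still have to run msck repair (or add partition) after every upload, so I'm not sure it gets me the "automatic" behaviour I'm after.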
Are the files added directly to s3a://location/subdir/ or to any subdirectories under this location? - franklinsijo
Directly to s3a://location/subdir/. @Dudu Every file is supposed to be scanned, which is why, if I add another file to that subdirectory, I would expect the data to show up when I run 'select *' on the table. But it doesn't; the query returns the same results (without the newly added data). - covfefe