0 votes

I know that data uploaded into HDFS is replicated across datanodes in a Hadoop cluster as blocks. My question is: what happens when the combined capacity of all the datanodes in the cluster is insufficient? For example, I have 3 datanodes, each with 10 GB of capacity (30 GB altogether), and I want to insert 60 GB of data into HDFS on the same cluster. I don't see how the 60 GB of data can be split into blocks (typically ~64 MB) and accommodated by the datanodes.
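
To make the numbers concrete, here is the back-of-the-envelope arithmetic I have in mind (the 64 MB block size and the replication factor of 3 are just assumed defaults, not values I've checked on the cluster):

    # Rough back-of-the-envelope check, not HDFS code; the block size and
    # replication factor below are assumed defaults, not measured values.
    GB = 1024 ** 3
    MB = 1024 ** 2

    data_size = 60 * GB              # data I want to upload
    cluster_capacity = 3 * 10 * GB   # 3 datanodes x 10 GB each
    block_size = 64 * MB             # typical HDFS block size
    replication = 3                  # assumed default dfs.replication

    num_blocks = -(-data_size // block_size)   # ceiling division -> 960 blocks
    raw_needed = data_size * replication       # space needed once replicas are counted

    print(f"{num_blocks} blocks, {raw_needed // GB} GB needed "
          f"vs {cluster_capacity // GB} GB available")

Even before counting replicas, 60 GB doesn't fit into 30 GB, and with replication the gap is even wider.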

Thanks


2 Answers

1 vote

I haven't tested it, but it should fail with an out-of-storage error. As each block of data is written into HDFS, it goes through the replication-factor process, so your upload would get about halfway through and then die.
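
If you want to confirm this before starting the upload, a rough check like the sketch below could compare the file size (times the replication factor) against the cluster's remaining DFS space. It assumes hdfs is on the PATH and that the "hdfs dfsadmin -report" summary prints a "DFS Remaining:" line in bytes, which may vary between Hadoop versions:

    import re
    import subprocess

    # Sketch: ask the namenode how much DFS space is left before a big put.
    # Assumes `hdfs` is on the PATH and that the report's cluster summary
    # contains a line like "DFS Remaining: 12345678 (...)"; the exact output
    # format can differ between Hadoop versions.
    report = subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    ).stdout

    match = re.search(r"DFS Remaining:\s*(\d+)", report)
    if match:
        remaining = int(match.group(1))        # bytes left across the cluster
        file_size = 60 * 1024 ** 3             # size of the file to upload
        replication = 3                        # assumed dfs.replication
        if file_size * replication > remaining:
            print("Not enough space: the upload would die partway through.")
        else:
            print("The file should fit.")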

That said, you could gzip the data (with high compression) before the upload and potentially squeeze it in, depending on how compressible the data is.
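
For example, a quick local gzip pass before the put might look like the sketch below (the file names are just placeholders):

    import gzip
    import shutil

    # Compress the local file before uploading; how much this saves depends
    # entirely on how compressible the data is. File names are placeholders.
    with open("bigdata.csv", "rb") as src, \
         gzip.open("bigdata.csv.gz", "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)

    # Then upload the smaller file, e.g.:
    #   hdfs dfs -put bigdata.csv.gz /data/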

0 votes

I ran into this issue when I was trying to move a large file from the local filesystem to HDFS. The transfer got stuck in the middle, Java reported an out-of-space error, the move/copy command was cancelled, and all the blocks of the file that had already been copied to HDFS were deleted.

So that means we can't copy a single file larger than the total HDFS capacity of the cluster.