4 votes

Is it possible to tell HDFS where to store particular files?

Use case

I've just loaded batch #1 of files into HDFS and want to run a job/application on this data. However, I also have batch #2 that still needs to be loaded. It would be nice if I could run the job/application for the first batch on, say, nodes 1 to 10, and load the new data onto nodes 11 to 20, completely in parallel.

Initially I thought that NameNode federation (Hadoop 2.x) does exactly that, but it looks like federation only splits the namespace, while DataNodes still store blocks for all connected NameNodes.

So, is there a way to control the distribution of data in HDFS? And does it make sense at all?

1
As @climbage mentioned, the answer to the question "How to put files to specific node (in HDFS)?" is to create your own BlockPlacementPolicy. However, even if you wanted to do that, it would be hard to achieve your use case, since it seems specific to what job(s) are running at a specific moment. Can you provide more details on why you want to do this? This kind of tweaking should not be necessary, and in most cases, the best approach is to let the framework take care of distributing load and I/O. - cabad
@cabad: I'm trying to reduce the number of simultaneous disk accesses. When you upload data to a node (writing to disk) while running computations that use data from the same node (reading from disk), these operations can conflict and slow each other down. I don't know how much slower it would be, but I'd like to know the possible solutions (and the effort needed) beforehand. - ffriend
While there may be some slow down due to simultaneous I/O, I doubt the effect will be too noticeable. The Hadoop and OS layers use buffering and caching (in the disk page cache as well as local Hadoop buffers) so that I/O operations can hit disk when it's most convenient without creating contention. - cabad
@cabad: thanks, you dispelled my doubts. - ffriend

1 Answer

7 votes

Technically, you can, but I wouldn't.

If you want full control over where the data goes, you can extend BlockPlacementPolicy (see "how does hdfs choose a datanode to store"). This won't be easy to do, and I don't recommend it.
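For a sense of what that involves, here is a minimal, hedged sketch of such a policy. It targets the Hadoop 2.x-era API; the exact chooseTarget signature changes between releases, and the class name, the hostname pattern, and the isAllowedNode helper are illustrative assumptions, not something HDFS provides.

    // Minimal sketch of a custom block placement policy (Hadoop 2.x-style API).
    // The chooseTarget signature differs between Hadoop releases, so check it
    // against the BlockPlacementPolicy class shipped with your version.
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;
    import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;
    import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo;
    import org.apache.hadoop.net.Node;

    public class BatchOnePlacementPolicy extends BlockPlacementPolicyDefault {

        // Hypothetical helper: true for the datanodes that should receive the
        // data (nodes 1-10 in the question); adapt to your naming scheme.
        private boolean isAllowedNode(Node node) {
            return node.getName().matches(".*node(0?[1-9]|10)(\\..*)?");
        }

        @Override
        public DatanodeStorageInfo[] chooseTarget(String srcPath,
                                                  int numOfReplicas,
                                                  Node writer,
                                                  List<DatanodeStorageInfo> chosen,
                                                  boolean returnChosenNodes,
                                                  Set<Node> excludedNodes,
                                                  long blocksize,
                                                  BlockStoragePolicy storagePolicy) {
            // Add every datanode you do NOT want used to the exclusion set,
            // then let the default policy choose among whatever is left.
            Set<Node> exclusions =
                    (excludedNodes != null) ? excludedNodes : new HashSet<Node>();
            // ... populate 'exclusions' here with the nodes for which
            //     isAllowedNode(...) returns false ...
            return super.chooseTarget(srcPath, numOfReplicas, writer, chosen,
                    returnChosenNodes, exclusions, blocksize, storagePolicy);
        }
    }

If you go down this road, the NameNode is pointed at the custom class via the dfs.block.replicator.classname property in hdfs-site.xml (that is the Hadoop 2.x property name; verify it for your release). Keep in mind the policy applies to every write on the cluster, which is part of what makes this approach so invasive.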

You can probably minimize the traffic between your two sets of nodes with some clever configuration that uses rack awareness to your advantage.
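For example, one way to express the "nodes 1-10 vs. nodes 11-20" split to HDFS is a custom rack mapping, so rack-aware placement keeps most replica traffic inside each group. The sketch below implements Hadoop's DNSToSwitchMapping interface; the class name and hostname pattern are assumptions based on the numbering in your question, and in Hadoop 2.x you would wire it in via net.topology.node.switch.mapping.impl (a topology script configured with net.topology.script.file.name achieves the same thing).

    // Sketch of a rack mapping that puts "batch #1" nodes and "batch #2" nodes
    // on different logical racks. Interface: org.apache.hadoop.net.DNSToSwitchMapping.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.net.DNSToSwitchMapping;

    public class BatchAwareRackMapping implements DNSToSwitchMapping {

        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>();
            for (String name : names) {
                // Hostnames like "node01".."node20" are an assumption; adapt the
                // pattern (or use IP ranges) to match your cluster.
                if (name.matches(".*node(0?[1-9]|10)(\\..*)?")) {
                    racks.add("/batch1-rack");   // nodes 1-10
                } else {
                    racks.add("/batch2-rack");   // nodes 11-20 and anything else
                }
            }
            return racks;
        }

        // Newer Hadoop releases declare these cache-invalidation hooks on the
        // interface; nothing is cached here, so they are no-ops.
        public void reloadCachedMappings() { }

        public void reloadCachedMappings(List<String> names) { }
    }

Note that with the default replication factor of 3 the default placement policy still writes one replica off-rack, so this reduces cross-group traffic rather than eliminating it.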