2 votes

A friend asked me how to store raw video, frame by frame, in HBase. The typical access pattern would be to retrieve the frames for a block of time. Each frame is approximately 7MB and footage is captured at about 30 frames per second, so a 20-minute video, for example, takes roughly 250GB of storage (7MB × 30 frames/s × 1,200 s ≈ 252GB).

I saw an excellent video by Lars George, author of HBase: The Definitive Guide, titled "HBase Schema Design: things you need to know", where he talks about storing video "chunks" (the snippet where he discusses video runs from 1:07:12 to 1:08:52), so it seems like HBase could, potentially, be a fit for this use case.

I came up with a few rowkey options:

Scenario 0: rowkey = video ID + timestamp; frames in a single column (tall, skinny table), e.g.

key                     col
video1|1497567476.123   image=[image BLOB]
video1|1497567476.156   image=[image BLOB]
...
video1|1497567536.014   image=[image BLOB]

advantages:

  • simplicity

disadvantages:

  • hotspotting: the keys are consecutive, so writes and time-range reads for a video all hit a single region (see the sketch below)
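
To make this concrete, here is a minimal sketch of Scenario 0 with the HBase 2.x Java client. The Table handle, the column family "f", and the zero-padded timestamp format are my assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Scenario 0: rowkey = videoId + "|" + timestamp.
// Zero-pad the timestamp so lexicographic key order matches time order.
static byte[] rowKey(String videoId, double ts) {
    return Bytes.toBytes(videoId + "|" + String.format("%017.3f", ts));
}

static void writeFrame(Table table, String videoId, double ts, byte[] frame) throws IOException {
    Put put = new Put(rowKey(videoId, ts));
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("image"), frame);
    table.put(put);
}

// Retrieving a block of time is a single range scan over consecutive keys;
// simple, but every key for a video lands in the same region (the hot spot).
static ResultScanner readRange(Table table, String videoId, double from, double to) throws IOException {
    Scan scan = new Scan().withStartRow(rowKey(videoId, from))
                          .withStopRow(rowKey(videoId, to));
    return table.getScanner(scan);
}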

Scenario 1: rowkey = hash(video ID + round(timestamp, 1 minute)) + timestamp; frames in a single column, e.g.

key                                             col
18ba6892ce0933ece7282b1f2971b3fd|1497567536.014 image=[image BLOB]
...
2ea8ce843615408fb19f8d6e44df32c7|1497567476.123 image=[image BLOB]
2ea8ce843615408fb19f8d6e44df32c7|1497567476.156 image=[image BLOB]

The rowkey has a prefix that ensures that one-minute chunks are distributed across the cluster and, within a one-minute chunk, frames are in consecutive time order.

advantages:

  • chunks are distributed across regions while reads within each chunk remain sequential: a compromise between sequential reads and even distribution of data across HBase regions

disadvantages:

  • a bit inflexible: I'm not sure what the optimal chunk time window is and, once set, it's hard to change (see the sketch below)
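
As a sketch, the Scenario 1 key could be built like this (assuming MD5 as the hash, which matches the 32-hex-character prefixes in the example above; the timestamp padding is carried over from the Scenario 0 sketch):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

// Scenario 1: rowkey = md5(videoId + timestamp rounded down to 1 minute) + "|" + timestamp.
static byte[] rowKey(String videoId, double ts) throws Exception {
    long bucket = (long) (ts / 60) * 60;   // round down to the 1-minute chunk
    byte[] md5 = MessageDigest.getInstance("MD5")
            .digest((videoId + bucket).getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : md5) hex.append(String.format("%02x", b));
    return Bytes.toBytes(hex + "|" + String.format("%017.3f", ts));
}
// Reading a time block becomes one short sequential scan per 1-minute chunk:
// recompute the prefix for every bucket in [from, to] and scan that prefix's range.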

Scenario 2: rowkey = hash(video ID + round(timestamp, 1 minute)); frames in columns offset from 'base' time (columns are OpenTSDB-like):

key                              col:base_time + (0 * x millis) col:base_time + (1 * x millis)  col:base_time + (2 * x millis)
18ba6892ce0933ece7282b1f2971b3fd image=[image BLOB]             ...                             ...
2ea8ce843615408fb19f8d6e44df32c7 image=[image BLOB]             image=[image BLOB]              image=[image BLOB]

advantages:

  • pattern has been proven to work well for timeseries metrics by OpenTSDB (see slide 13 in this presentation)

disadvantages:

  • very large rows: at 60 s × 30 fps × 7 MB that is roughly 12.6 GB per row, and since a row can never be split across regions, rows this large are generally a bad idea in HBase (see the sketch below)
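
A sketch of the Scenario 2 write path; md5Hex() stands for the same hypothetical hashing helper as in the Scenario 1 sketch:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Scenario 2: rowkey = hash(videoId + 1-minute bucket); one column per frame,
// qualified by the frame's millisecond offset from the bucket's base time.
static void writeFrame(Table table, String videoId, double ts, byte[] frame) throws Exception {
    long bucket = (long) (ts / 60) * 60;
    long offsetMillis = Math.round((ts - bucket) * 1000);
    Put put = new Put(Bytes.toBytes(md5Hex(videoId + bucket)));   // md5Hex(): hypothetical helper
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes(offsetMillis), frame);
    table.put(put);
}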

Does anyone have any recommendations or insights for the best rowkey design for video frames?

Note: I know of a couple of similar examples that, instead of using HBase to store the video footage, use sequence files or .har files with separate indexes to capture metadata to allow random access. For now, I'd like to focus on HBase: specifically the rowkey design.


2 Answers

1 vote

I like your approach, but I would suggest using (videoID % number_of_regions) + videoID + timestamp as the rowkey. This way you are not restricted to a one-minute limit, reads are still consecutive, and the whole video is stored in the same region.
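
A rough sketch of that layout, assuming a numeric video ID, a known bucket count, and the timestamp encoded as a big-endian long so byte order matches time order:

import org.apache.hadoop.hbase.util.Bytes;

// rowkey = (videoID % number_of_regions) + videoID + timestamp.
// The salt byte pins a whole video to one bucket (one region, given matching
// pre-split points), while keys within the video remain consecutive.
static byte[] rowKey(long videoId, double ts, int numRegions) {
    byte salt = (byte) (videoId % numRegions);
    long millis = (long) (ts * 1000);   // big-endian long preserves time order
    return Bytes.add(new byte[] { salt }, Bytes.toBytes(videoId), Bytes.toBytes(millis));
}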

1 vote

You have about 210MB of data per second of video (7MB per frame × 30 fps).

Data locality is a good thing when cells are small and it's faster to read everything from a single machine than to wait for many machines to return results. Loading even 5 seconds of video (about 1GB) is already a huge disk-I/O load for a single machine, so data locality does not help here.

I believe salting/prefixing the keys is the best solution here:

key = hash(video_id, timestamp) + video_id + timestamp

You will get even data distribution across the cluster and spread the load. You can store frames in a single row in separate columns or add a frame_id to the key; it does not matter.
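
For illustration, the key might be built like this, taking the salt to be the first four bytes of an MD5 over video_id and timestamp (the 4-byte length is my guess, not part of the answer):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import org.apache.hadoop.hbase.util.Bytes;

// key = hash(video_id, timestamp) + video_id + timestamp
static byte[] rowKey(String videoId, long millis) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5")
            .digest((videoId + millis).getBytes(StandardCharsets.UTF_8));
    byte[] salt = Arrays.copyOf(md5, 4);   // assumed 4-byte salt
    return Bytes.add(salt, Bytes.toBytes(videoId), Bytes.toBytes(millis));
}
// Because the salt depends on the full timestamp, consecutive frames scatter
// across the cluster; a time-range read becomes a batch of point reads
// (table.get(List<Get>)) with each key recomputed, rather than one scan.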

To get better performance, you also need to set correct CF size settings (e.g., HFile block size) to fit your data.
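
For example, with the HBase 2.x admin API (the block size and max file size values below are guesses for ~7MB cells, not tested recommendations):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: size the column family and regions for ~7MB cells.
static void createVideoTable(Admin admin) throws Exception {
    admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("video"))
        .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
            .setBlocksize(8 * 1024 * 1024)        // HFile block >= one frame (default is 64KB)
            .build())
        .setMaxFileSize(20L * 1024 * 1024 * 1024) // bigger regions for big cells (assumption)
        .build());
}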