Optimal space-efficient storage scheme for Sparse Binary Matrix with runs

Question

In the context of implementing a "search window" for fast dynamic time warping in C, I need to build a structure that represents a sparse binary matrix with a very particular structure.

This matrix has some strong structural constraints on it: specifically, there is only one "run" of non-zero data per row, although that run is of variable length. The value of the starting cell's column index and ending cell's column index increases monotonically as the row increases:

Example valid matrices (* = filled, . = unfilled)

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *


  0 1 2 3 4 5
0 * * * * . .
1 . * * * * .
2 . . * * * .
3 . . . . * *
4 . . . . . *

Example invalid matrices:

  0 1 2 3 4 5
0 * * * . * * <-- more than one continuous run
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 * * * * * * <-- start index not monotonically increasing
4 . . * * * *


  0 1 2 3 4 5
0 * * * . . .
1 . * * * * .
2 . * * * . . <-- end index not monotonically increasing
3 . . * * * *
4 . . * * * *

These sparse matrices are mostly populated on the diagonals, but they are not square, so I don't know if I should be using "jagged diagonal storage".

My current approach is to store (start, end) column indices for each row: i.e.

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *

is (logically) stored as

(0, 0) --> (0, 2)
(1, 1) --> (1, 3)
(2, 1) --> (2, 3)
(3, 2) --> (3, 5) 
(4, 2) --> (4, 5)

(although these are internally stored as ravelled indices, i.e.)

(0 * num_cols + 0) --> (0 * num_cols + 2)
(1 * num_cols + 1) --> (1 * num_cols + 3)

so the final structure ends up looking like

[0, 2, 7, 9, 13, 15, 20, 23, 26, 29]

Then, this structure is delta encoded like

[first_value, offset_1, offset_2, ...]
[0, 2, 5, 2, 4, 2, 5, 3, 3, 3]

Because these delta values are small and positive, we can take advantage of packing them into some flavor of VARINT structure.

First question: do these matrices have a well-known name? Can't seem to find much in the literature.

Second question: is this storage scheme taking advantage of all the strong properties of the matrices? Can I abuse the constraints in order to store less data somehow?

I have read a few documents describing sparse matrix storage for general sparse matrices, but it strikes me that this special case structure might have a special case optimal storage / encoding scheme.

So there is always exactly one run of non-zero values per row, or can there be rows with only zeros? — John Bollinger
How much shorter than a whole row can you rely on the non-zero runs to be? — John Bollinger
What bounds, if any, are there on the dimensions of the matrices you need to represent? — John Bollinger
Is minimum storage size your most important criterion? Because smaller size generally correlates with more expensive access. — John Bollinger
@JohnBollinger There is always exactly one run of non-zero values per row. There are no rows with only zeros. It is possible in theory that the run could span the length of a whole row. The matrices are in theory unbounded in size: an m x n matrix is used to compute timewarp distance for a pair of timeseries of length m and length n. In practice, I don't expect to see timeseries of length more than 2^16 or so. Eventually, this data structure will need to support iterating over set indices -- I could be convinced that a different tradeoff between space and time is reasonable. Thoughts? — mrdmnd

Stephan Lechner Stephan Lechner · Accepted Answer · 2016-12-13T00:31:48

I think your approach of considering the absolute index values (and then using delta encoding) already yields good results, but it does not make use of the monotonically incrementing start/end-index constraint.

Considering a storage structure that stores - relative to an absolute start and end index - only the increment of the start and end index should - in average - yield numbers with a smaller range (and consequently a shorter representation).

Translating your first sample into this structure would look as following:

(0/2) - (1/1)(0/0)(1/2)(0/0)

where the first pair (0/2) represents the absolute start/end - indices, and the other pairs stand for the increment of start and end for each line. As these numbers use a smaller range, a variable length integer encoding should yield a better compression.

Consider, for example, the following (simple) integer encoding:

0 .. 00
1 .. 01
2 .. 100
3 .. 101
4 .. 1100
5 .. 1101
6 and higher: 111 + 16 bit integer

With this encoding, the first sample translates into the following bit stream comprising 22 bits:

00 100 01 01 00 00 01 100 00 00

This approach works the better the smaller the increments are; When considering the increments row-wise, this should be the case if the matrix has more rows than columns.

Just an idea for optimizing matrixes with more columns than rows: one could think of using the same storage approach column-wise; I think (yet without mathematical proof) that row-wise monotonically increments with single runs imply also column-wise monotonically increments with single runs.

Optimal space-efficient storage scheme for Sparse Binary Matrix with runs

1 Answers