1
votes

In the context of implementing a "search window" for fast dynamic time warping in C, I need to build a structure that represents a sparse binary matrix with a very particular structure.

This matrix has some strong structural constraints on it: specifically, there is only one "run" of non-zero data per row, although that run is of variable length. The value of the starting cell's column index and ending cell's column index increases monotonically as the row increases:

Example valid matrices (* = filled, . = unfilled)

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *


  0 1 2 3 4 5
0 * * * * . .
1 . * * * * .
2 . . * * * .
3 . . . . * *
4 . . . . . *

Example invalid matrices:

  0 1 2 3 4 5
0 * * * . * * <-- more than one continuous run
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 * * * * * * <-- start index not monotonically increasing
4 . . * * * *


  0 1 2 3 4 5
0 * * * . . .
1 . * * * * .
2 . * * * . . <-- end index not monotonically increasing
3 . . * * * *
4 . . * * * *

These sparse matrices are mostly populated on the diagonals, but they are not square, so I don't know if I should be using "jagged diagonal storage".

My current approach is to store (start, end) column indices for each row: i.e.

  0 1 2 3 4 5
0 * * * . . .
1 . * * * . .
2 . * * * . .
3 . . * * * *
4 . . * * * *

is (logically) stored as

(0, 0) --> (0, 2)
(1, 1) --> (1, 3)
(2, 1) --> (2, 3)
(3, 2) --> (3, 5) 
(4, 2) --> (4, 5)

(although these are internally stored as ravelled indices, i.e.)

(0 * num_cols + 0) --> (0 * num_cols + 2)
(1 * num_cols + 1) --> (1 * num_cols + 3)

so the final structure ends up looking like

[0, 2, 7, 9, 13, 15, 20, 23, 26, 29]

Then, this structure is delta encoded like

[first_value, offset_1, offset_2, ...]
[0, 2, 5, 2, 4, 2, 5, 3, 3, 3] 

Because these delta values are small and positive, we can take advantage of packing them into some flavor of VARINT structure.

First question: do these matrices have a well-known name? Can't seem to find much in the literature.

Second question: is this storage scheme taking advantage of all the strong properties of the matrices? Can I abuse the constraints in order to store less data somehow?

I have read a few documents describing sparse matrix storage for general sparse matrices, but it strikes me that this special case structure might have a special case optimal storage / encoding scheme.

1
So there is always exactly one run of non-zero values per row, or can there be rows with only zeros?John Bollinger
How much shorter than a whole row can you rely on the non-zero runs to be?John Bollinger
What bounds, if any, are there on the dimensions of the matrices you need to represent?John Bollinger
Is minimum storage size your most important criterion? Because smaller size generally correlates with more expensive access.John Bollinger
@JohnBollinger There is always exactly one run of non-zero values per row. There are no rows with only zeros. It is possible in theory that the run could span the length of a whole row. The matrices are in theory unbounded in size: an m x n matrix is used to compute timewarp distance for a pair of timeseries of length m and length n. In practice, I don't expect to see timeseries of length more than 2^16 or so. Eventually, this data structure will need to support iterating over set indices -- I could be convinced that a different tradeoff between space and time is reasonable. Thoughts?mrdmnd

1 Answers

1
votes

I think your approach of considering the absolute index values (and then using delta encoding) already yields good results, but it does not make use of the monotonically incrementing start/end-index constraint.

Considering a storage structure that stores - relative to an absolute start and end index - only the increment of the start and end index should - in average - yield numbers with a smaller range (and consequently a shorter representation).

Translating your first sample into this structure would look as following:

(0/2) - (1/1)(0/0)(1/2)(0/0)

where the first pair (0/2) represents the absolute start/end - indices, and the other pairs stand for the increment of start and end for each line. As these numbers use a smaller range, a variable length integer encoding should yield a better compression.

Consider, for example, the following (simple) integer encoding:

0 .. 00
1 .. 01
2 .. 100
3 .. 101
4 .. 1100
5 .. 1101
6 and higher: 111 + 16 bit integer

With this encoding, the first sample translates into the following bit stream comprising 22 bits:

00 100 01 01 00 00 01 100 00 00

This approach works the better the smaller the increments are; When considering the increments row-wise, this should be the case if the matrix has more rows than columns.

Just an idea for optimizing matrixes with more columns than rows: one could think of using the same storage approach column-wise; I think (yet without mathematical proof) that row-wise monotonically increments with single runs imply also column-wise monotonically increments with single runs.