I'm looking into how BigTable compresses my data.
I've loaded 1.5 GB into one table: about 500k rows with a single column each, where each cell holds about 3 KB on average. In further tests, more columns will be added to these rows, containing similar data of similar size.
The data in each cell is currently a JSON-serialized array of dictionaries (10 elements on average), like:
[{
  "field1": "100.10",
  "field2": "EUR",
  "field3": "10000",
  "field4": "0",
  "field5": "1",
  "field6": "1",
  "field7": "0",
  "field8": "100",
  "field9": "110.20",
  "field10": "100-char field",
  "dateField1": "1970-01-01",
  "dateField2": "1970-01-01",
  "dateTimeField": "1970-01-01T10:10:10Z"
}, {
  "field1": "200.20",
  "field2": "EUR",
  "field3": "10001",
  "field4": "0",
  "field5": "1",
  "field6": "0",
  "field7": "0",
  "field8": "100",
  "field9": "220.30",
  "field10": "100-char field",
  "dateField1": "1970-01-01",
  "dateField2": "1970-01-01",
  "dateTimeField": "1970-01-01T20:20:20Z"
}, ...]
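For context, each row gets a single cell under one column, written along the lines of the sketch below. This is a simplified illustration using the Python google-cloud-bigtable client; the project, instance, table, and column family names are placeholders, not necessarily what I actually use.

import json

from google.cloud import bigtable

# Placeholder identifiers -- substitute real project/instance/table names.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")

def write_cell(row_key: bytes, records: list) -> None:
    # One cell per row: a JSON-serialized array of ~10 dictionaries (~3 KB).
    value = json.dumps(records).encode("utf-8")
    row = table.direct_row(row_key)
    row.set_cell("cf1", b"data", value)  # "cf1" and "data" are placeholder names
    row.commit()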
The BigTable console shows me that the cluster holds 1.2 GB, so it compressed the 1.5 GB I inserted to roughly 80% of the original size. Gzipping a typical string as it is stored in a cell, however, compresses it to about 20% of its original size.
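For reference, the 20% figure comes from a quick check along the lines of the sketch below. The synthetic cell value built here is more uniform than my real data, so the ratio it prints will come out somewhat lower than what I measure on actual cells.

import gzip
import json

# Build a stand-in cell value: a JSON array of 10 dictionaries shaped like the example above.
records = [
    {
        "field1": f"{100 + i}.10",
        "field2": "EUR",
        "field3": str(10000 + i),
        "field4": "0",
        "field5": "1",
        "field6": str(i % 2),
        "field7": "0",
        "field8": "100",
        "field9": f"{110 + i}.20",
        "field10": "x" * 100,  # stand-in for the 100-char field
        "dateField1": "1970-01-01",
        "dateField2": "1970-01-01",
        "dateTimeField": f"1970-01-01T10:10:{i:02d}Z",
    }
    for i in range(10)
]
cell = json.dumps(records).encode("utf-8")

ratio = len(gzip.compress(cell)) / len(cell)
print(f"gzip compresses the cell to {ratio:.0%} of its original size")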
This compression performance of BigTable therefore seems low to me, given that the data I'm inserting contains a lot of repetitive values (e.g. the dictionary keys). I understand that BigTable trades off compression for speed, but I'd hoped it would perform better on my data.
Is a compression ratio of 80% OK for data like the above, or should lower values be expected? Are there any techniques to improve the compression, apart from remodeling the data I'm uploading?
Thanks!