1
votes

I have a store of about 2 GB of hashes, which I want to check against through a public API.

Use Case

Let's say I want to create an API that checks whether a person is known to my product. To respect the person's privacy, I don't want to upload their name, member ID, and so on, so I decided to upload only a hash of the combined information that identifies them. Now I have 2 GB (6*10^7) of SHA256 hashes and want to check them as fast as possible.

This API should be hosted in Azure.

After reading the documentation for Azure storage accounts, I think Azure Table Storage is the right storage solution. I would set the base64 hash as the partition key and leave the row key empty.

Question

  1. First, is Azure Table Storage the right storage for the job?
  2. Will there be a performance difference between:
    1. partition key: base64 hash, row key: empty
    2. partition key: 'Upload Id', row key: base64 hash
  3. Does the time to access an entity through its keys depend on the size of the table?
  4. What is the fastest way to check if a partition key is present? I think my naive first try is not really the best way.

    if (members.Where(x => x.PartitionKey == Convert.ToBase64String(data.Hash)).AsEnumerable().Any())
    {
        return req.CreateResponse(HttpStatusCode.OK, "Found Hash");
    }
    else
    {
        return req.CreateResponse(HttpStatusCode.NotFound, "Don't found Hash");
    }

  5. How should I upload the 2 GB of hashes? I am thinking of uploading one big file and using an Azure Function to split it after every 256 bits and add each value to Azure storage. Or is there a better idea?

2
Sorry for the badly formatted code block, I was unable to format it properly. (hdev)
There's no right answer to #1. If you're doing partition scans or table scans, your query will absolutely get slower as your table grows (#3). #4 can't be done without a table scan (or you keeping track of all partition keys in another table). #5 is a completely different topic. But why would you leave a row key empty? That makes no sense. (David Makogon)
"But why would you leave a row key empty? That makes no sense." How would you design it, if you only need a lookup? (hdev)

2 Answers

3
votes

My take on this:

  1. If the only query you need is "check whether a given hash exists" (and retrieve its details if needed), then Table Storage is a perfect match. Key lookups are fast and cheap, and 2 GB is nothing.

  2. The hash gives the most diversity, so I would use it as the partition key. The row key can then be anything. If Upload Id is never used for (range) lookups, don't use it in the keys.

  3. With a proper partition key, the lookup time should be constant.

  4. If you mean you need to check whether a user's hash is there or not, just retrieve one row by partition key + row key. That's the fastest operation possible. See "Retrieve a single entity" here.

  5. Table Storage supports batch inserts. Again, 2 GB is not much; you probably spent more time asking this question than your upload will take :)
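
As a rough sketch of points 2 and 5, the helper below splits a stream of concatenated 256-bit hashes into one key string per hash. The names `HashKeys`, `ToKey` and `ReadKeys` are made up for illustration, and it assumes .NET 5 or later for `Convert.ToHexString`. It uses hex rather than the base64 from the question because standard base64 can contain `/`, which Table Storage does not allow in a PartitionKey or RowKey.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any Azure SDK.
public static class HashKeys
{
    public const int HashSize = 32; // SHA-256 digest length in bytes

    // Hex output contains only 0-9 and A-F, so it avoids the characters
    // ('/', '\', '#', '?') that are not allowed in table keys.
    public static string ToKey(byte[] hash)
    {
        if (hash.Length != HashSize)
            throw new ArgumentException("Expected a 32-byte SHA-256 hash.", nameof(hash));
        return Convert.ToHexString(hash); // requires .NET 5+
    }

    // Lazily yields one key per 32-byte record from a stream of concatenated hashes.
    public static IEnumerable<string> ReadKeys(Stream input)
    {
        var buffer = new byte[HashSize];
        while (true)
        {
            int read = 0;
            while (read < HashSize)
            {
                int n = input.Read(buffer, read, HashSize - read);
                if (n == 0) break; // end of stream
                read += n;
            }
            if (read == 0) yield break; // clean end on a record boundary
            if (read < HashSize)
                throw new InvalidDataException("Trailing partial hash in input.");
            yield return ToKey(buffer);
        }
    }
}
```

Each key returned by `ReadKeys` could then be used as the partition key of one inserted entity, and the lookup side would apply the same `ToKey` encoding to the incoming hash. One caveat: a Table Storage batch only groups entities that share a PartitionKey, so with one hash per partition each insert is its own request. Running the inserts in parallel is one way to keep the upload quick.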

0
votes

I saw this is tagged with Azure-Functions, so I'll add that Azure Functions lets you bind directly to Table Storage. See https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-table

You can even bind directly to a specific entity. The function.json would look like:

{
    "name": "<Name of input parameter in function signature>",
    "type": "table",
    "direction": "in",
    "tableName": "<Name of Storage table>",
    "partitionKey": "<PartitionKey of table entity to read - see below>",
    "rowKey": "<RowKey of table entity to read - see below>"
}