16
votes

We're creating a multi-tenant application that must segregate data between tenants. Each tenant will save various documents, each of which can fall into several different document categories. We plan to use Azure blob storage for these documents. However, given our user base and the number of documents and size of each one, we're not sure how to best manage storage accounts with our current Azure subscription.

Here are some numbers to consider. With 5,000 users each storing 27,000 documents of 8 MB per year, that is 1,080 TB per year in total. A storage account maxes out at 500 TB.
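As a quick sanity check, the yearly total works out like this (using decimal units, 1 TB = 1,000,000 MB):

```python
# Back-of-the-envelope check of the yearly storage estimate.
users = 5_000
docs_per_user_per_year = 27_000
doc_size_mb = 8

total_mb = users * docs_per_user_per_year * doc_size_mb
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB
print(total_tb)  # 1080.0 -- more than double the 500 TB account limit
```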

So my question is what would be the most efficient and cost effective way to store this data and stay within the Azure limits?

Here are a few things we've considered:

  1. Create a storage account for each client. THIS DOES NOT WORK because you can only have 100 storage accounts per subscription (this would have been the ideal solution).

  2. Create a blob container for each client. A storage account can hold up to 500 TB, so this could potentially work, except that eventually we would have to split off into other storage accounts. I'm not sure how that would work if a user ended up with data in two accounts. It could get messy.

Perhaps we are missing something fundamentally simple here.

UPDATE For now our thought is to use Azure table storage with a table for each document type. Within each table the partition key would be the tenant's ID, and the row key would be the document ID. Each row would also contain metadata for the document, along with a URI (or something) linking to the blob itself.
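The layout described in the update might look roughly like this; the field names and values below are hypothetical, just to illustrate the entity shape:

```python
# Hypothetical entity in a per-document-type table (e.g. "Invoices").
# PartitionKey = tenant ID groups all of one tenant's documents together;
# RowKey = document ID is unique within that partition.
entity = {
    "PartitionKey": "tenant-42",   # tenant ID
    "RowKey": "doc-0001",          # document ID
    "FileName": "contract.pdf",    # metadata about the document
    "SizeBytes": 8 * 1024 * 1024,
    "BlobUri": "https://pod03.blob.core.windows.net/tenant-files/doc-0001",
}

# Point lookups fetch by (PartitionKey, RowKey); listing one tenant's
# documents is a scan over a single partition.
```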

2
Will you be storing the client/files relationship in some kind of table? For example, a master table which would store the list of all files for all clients? – Gaurav Mantri
@Gaurav Mantri: Great question! I have provided an update to address your question. – spoof3r

2 Answers

14
votes

Not really an answer, but think of it as "food for thought" :). Basically, your architecture should account for the fact that each storage account has certain scalability targets, and your design should ensure you don't exceed them, so that storage stays highly available for your application.

Some recommendations:

  • Start by creating multiple storage accounts (say 10 to begin with). Let's call them Pods.
  • Each tenant gets one of the pods. You can pick a pod storage account randomly or use some predefined logic. The pod assignment is stored alongside the tenant information.
  • From the description it seems that you're currently storing the file information in just one table. That puts a lot of stress on a single table/storage account, which is not a scalable design IMHO. Instead, when a tenant is created, assign a pod to the tenant and create a table for that tenant which will store its file information. This has the following benefits: 1) each tenant's data is nicely isolated, 2) read requests are load-balanced, allowing you to stay within the scalability targets, and 3) since each tenant's data lives in a separate table, the PartitionKey becomes free and you can assign it some other value if needed.
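The pod assignment described above could be sketched like this (pod names, the random strategy, and the table-naming rule are all my assumptions for illustration):

```python
import random

# Hypothetical pod pool: 10 storage accounts to begin with.
PODS = [f"podstorage{i:02d}" for i in range(10)]

def assign_pod(tenant_id: str) -> str:
    """Pick a pod storage account for a new tenant. Random here;
    round-robin or least-loaded would work equally well."""
    return random.choice(PODS)

def tenant_table_name(tenant_id: str) -> str:
    """Per-tenant table holding that tenant's file information.
    Azure table names must be alphanumeric, so strip separators."""
    return "files" + tenant_id.replace("-", "")

# Store the assignment alongside the tenant record.
record = {
    "tenant": "tenant-42",
    "pod": assign_pod("tenant-42"),
    "table": tenant_table_name("tenant-42"),
}
```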

Now coming on to storing files:

  • Again you can go with the Pod concept wherein files for each tenant reside in the pod storage account for that tenant.
  • If you see issues with this approach, you can randomly pick the pod storage account and put the file there and store the blob URL in the Files table.
  • You could either go with just one blob container (say tenant-files) or separate blob containers for each tenant.
  • With just one blob container for all tenants, the management overhead is smaller, as you just have to create this container when a new pod is commissioned. The downside is that you can't logically separate files by tenant, so if you want to provide direct access to the files (using a Shared Access Signature), it would be problematic.
  • With separate blob containers for each tenant, the management overhead is higher, but you get nice logical isolation. Here, as a tenant is brought on board, you have to create a container for that tenant in each pod storage account. Similarly, when a new pod is commissioned, you have to ensure a blob container is created for each existing tenant in the system.
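The two provisioning flows in the separate-container-per-tenant model can be sketched as follows, with the storage layout modeled as a plain dict (pod and container names are made up; a real implementation would call the storage API instead):

```python
# Model: {pod storage account -> set of container names}.
pods = {"pod01": set(), "pod02": set()}
tenants = []

def onboard_tenant(tenant_id):
    """New tenant: create that tenant's container in every pod."""
    tenants.append(tenant_id)
    for containers in pods.values():
        containers.add(f"tenant-{tenant_id}")

def commission_pod(pod_name):
    """New pod: create a container for every existing tenant."""
    pods[pod_name] = {f"tenant-{t}" for t in tenants}

onboard_tenant("42")
commission_pod("pod03")
```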

Hope this gives you some idea about how you can go about architecting your solution. We're using some of these concepts in our own solution (which uses Azure Storage exclusively as its data store). It would be really interesting to see what architecture you come up with.

9
votes

I am just going to share my thoughts on the topic, and they do overlap somewhat with Gaurav Mantri's answer. This is based on a design I came up with after doing something very similar at my current job.

Azure Blob storage

  1. Randomly select a pod from the pod pool when a tenant is created, and store its namespace alongside the tenant information.

  2. Provide an API for creating containers, where container names are a composite of the tenant id (Guid::ToString("N")) + <resourcename>. You don't need to sell these to your users as containers; they can be folders, worksets, or a filebox; you find a name.

  3. Provide an api for maintaining documents within these containers.
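The naming scheme from step 2 might look like this (a sketch; the validation here is simplified, and the sample GUID and resource name are made up). The "N" GUID format is 32 hex digits with no hyphens, which Python's `uuid.hex` matches:

```python
import uuid

def container_name(tenant_id: uuid.UUID, resource: str) -> str:
    """Composite container name: tenant GUID in "N" format
    (32 hex digits, no hyphens, like .NET's Guid::ToString("N"))
    plus a resource name. Azure container names must be 3-63
    lowercase letters, digits, or hyphens."""
    name = tenant_id.hex + resource.lower()
    if not (3 <= len(name) <= 63):
        raise ValueError("container name length out of range")
    return name

tenant = uuid.UUID("12345678-1234-1234-1234-123456789abc")
print(container_name(tenant, "invoices"))
# 12345678123412341234123456789abcinvoices
```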

This means you can simply grow the pod pool as you take on more tenants, and, for example, remove pods from the pool as they fill up.

The benefit of this is that you do not need to maintain two systems for your data (both table storage and blob storage). Blob storage already has a way to present data as a directory/file hierarchy.

Extension Points

Blob Storage Api Broker

On top of the above design I made an OWIN middleware that sits between clients and blob storage, basically just forwarding requests from clients to blob storage. This step is of course not needed, as you can hand out normal SAS tokens and let clients talk directly to blob storage. But it makes it easy to hook into actions as they are executed on files. Each tenant gets its own endpoint: files/<tenantid>/<resourcename>/

Such an API would also let you hook into whatever token authentication system you may already be using, to authenticate and authorize the incoming requests and then sign them in this API.

Blob Storage Metadata

Using the above API broker extension combined with metadata, one can take it a step further: modify incoming requests to always include metadata, and add filters on the XML returned from blob storage before sending it to clients, to filter out containers or blobs. One example: when a user deletes a blob, set x-ms-meta-status:deleted and filter such blobs out when returning blobs/containers. This way you can run different deletion procedures behind the scenes.
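The soft-delete filtering described here could look something like this (a sketch; the listing structure is a simplified stand-in for the XML a broker would actually rewrite):

```python
# Blobs carry an x-ms-meta-status metadata value; the broker drops
# "deleted" (and, per the subfolder trick below, "hidden") entries
# before returning a listing to the client.
def visible_blobs(listing):
    """Filter out blobs whose status metadata hides them."""
    return [b for b in listing
            if b.get("metadata", {}).get("status") not in ("deleted", "hidden")]

listing = [
    {"name": "a.pdf", "metadata": {}},
    {"name": "b.pdf", "metadata": {"status": "deleted"}},
    {"name": ".keep", "metadata": {"status": "hidden"}},
]
print([b["name"] for b in visible_blobs(listing)])  # ['a.pdf']
```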

One should be careful here, since you don't want to put too much logic in this layer (it adds a penalty to every request), but done smartly this can work very nicely.

This extension would also allow your users to create "empty" subfolders inside a container, by placing a zero-byte file with a status:hidden that also gets filtered out. (Remember that blob storage can only show virtual folders if there is something in them.) This could also be achieved using table storage.

Azure Search

Another great extension point is that you could index each blob in Azure Search to make its content findable, and this is most likely my favorite. I don't see any good solution using just blob storage or table storage that gives you good search functionality, or to some extent even a good filtering experience. Azure Search would give users a really rich experience for finding their content again.

Snapshots

Another extension is that snapshots could be created automatically every time a file is modified. This becomes even easier with the broker API; otherwise, monitoring the logs is an option.

These ideas come from a project I started and wanted to share, but since I am busy at work over the coming months, I don't see myself releasing it before the summer holidays give me time to finish. The motivation of the project is to provide a NuGet package that enables other developers to quickly set up the broker API I mentioned above and configure a multi-tenant blob storage solution.

I kindly ask you to vote this answer up if you read this and believe such a project could have saved you time in your current development process. That way I can see whether I should put more time into the project.

I think that Gaurav Mantri's answer is more spot-on for the question above, but I just wanted to share my ideas on the topic.