Is CKAN capable of dealing with 100k+ files and TB of data?

Question

What we are wanting to do is create a local data repository for our lab memebers to organize, search upon, access, catalog, reference our data, etc. I feel that CKAN can do all of these things; however, I'm not sure how it will handle these tasks for the data we actually have (I could be wrong, which is why I'm asking).

Our lab is procuring a lot of data for internal use. We would like to be able to catalog and organize this data within our group (maybe CKAN?) so people can push data to the catalog, and pull the data and use it. Some use cases would be, having ACL to the data, web interface, search, browse, organize, add, delete, update datasets etc. While CKAN looks to be a very good fit for this, the problem comes in with the data (more so the amount) we are trying to deal with.

We are wanting to catalog anything from terabytes of images (200k+ images), geospatial data in various formats, twitter streams (TBs of JSON data), database dump files, binary data, machine learning models, etc. I wouldn't think it would be reasonable to add 100k 64MB JSON files as a resource to a CKAN dataset, or is it? We realize we aren't going to be able to search within this JSON/images/geo data, which is fine. But we would like to find information out on if we had the data available (e.g. we search "twitter 2015-02-03"), a type of metadata search if you will. Using a local file store within CKAN, what happens if a user requests 200k images? Would the system become unresponsive when it is having to answer these requests?

I've seen CKAN used on the datahub.io and the vast majority of that stuff is small CSV files, small 2-3MB zip files, and no more than 20 or 30 individual files within a dataset.

So is CKAN capable of doing what we want? If it isn't any suggestions on alternatives?

Edit more specific questions instead of discussion:

I have looked around and googled for information regarding this topic but I haven't see a deployed system with any significant amount of data.

Is there a limit on the file sizes that I am able to upload (e.g. a zipped 400GB database file)?
Is there a limit to the number of files I upload as a resource to a dataset within CKAN? (e.g. I create a dataset and upload 250,000 64MB JSON files and the system be usable?)
The UI doesn't seem to support the capability of uploading multiple files at the time time (e.g. a folder of data as a resource). Is there a tool/extension/plugin that supports this functionality already?
a. is there any restrictions that would prevent me from using the CKAN API to achieve this?

Questions on SO should be a specific problem, not open ended discussions — D Read

Ben Scott Ben Scott · Accepted Answer · 2015-10-21T17:30:30

We're using CKAN at the Natural History Museum (data.nhm.ac.uk) for some pretty hefty research datasets - our main specimen collection has 2.8 million records - and it's handling it very well. We have had to extend CKAN with some custom plugins to make this possible though - but they're open source and available on Github.

Our datasolr extension moves querying large datasets into SOLR, which handles indexing and searching big datasets better than postgres (on our infrastructure anyway) - https://github.com/NaturalHistoryMuseum/ckanext-datasolr.

To prevent CKAN falling over when users download big files, we moved the packaging and download to a separate service and task queue.

https://github.com/NaturalHistoryMuseum/ckanext-ckanpackager https://github.com/NaturalHistoryMuseum/ckanpackager

So yes, CKAN with a few contributed plugins can definitely handle larger datasets. We haven't tested it with TB+ datasets yet, but we will next year when we use CKAN to release some phylogenetic data.

Is CKAN capable of dealing with 100k+ files and TB of data?

2 Answers