We want to create a local data repository so that our lab members can organize, search, access, catalog, and reference our data. I think CKAN can do all of these things; however, I'm not sure how well it will handle them for the data we actually have (I could be wrong, which is why I'm asking).
Our lab is procuring a lot of data for internal use. We would like to catalog and organize this data within our group (maybe with CKAN?) so people can push data to the catalog, pull it back out, and use it. Some requirements would be access control (ACLs) on the data, a web interface, search and browse, and the ability to organize, add, delete, and update datasets. While CKAN looks like a very good fit for this, the problem is the data we are trying to deal with, and more specifically the amount of it.
We want to catalog anything from terabytes of images (200k+ images) to geospatial data in various formats, Twitter streams (TBs of JSON data), database dump files, binary data, machine learning models, etc. I wouldn't think it would be reasonable to add 100k 64MB JSON files as resources to a CKAN dataset, or is it? We realize we aren't going to be able to search within the JSON/image/geo data itself, which is fine. But we would like to be able to find out whether we have a given piece of data available (e.g. by searching for "twitter 2015-02-03"), a kind of metadata search if you will. Using a local file store within CKAN, what happens if a user requests 200k images? Would the system become unresponsive while it is answering these requests?
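To make the metadata search concrete, here is a minimal sketch of the kind of query I have in mind, using CKAN's `package_search` API action. The host name `ckan.example.lab` and the helper function names are placeholders I made up, and I am assuming the datasets are visible without authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: hypothetical address of a local CKAN instance.
CKAN_URL = "http://ckan.example.lab"


def build_search_url(query, rows=10, base_url=CKAN_URL):
    """Return the package_search URL for a free-text metadata query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def search_datasets(query, rows=10, base_url=CKAN_URL):
    """Call package_search and return the names of the matching datasets."""
    with urlopen(build_search_url(query, rows, base_url)) as resp:
        body = json.load(resp)
    # Matching datasets are listed under result -> results.
    return [pkg["name"] for pkg in body["result"]["results"]]
```

Something like `search_datasets("twitter 2015-02-03")` is all we would need; the search only has to match dataset titles, tags, and descriptions, not the file contents.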
I've seen CKAN used on datahub.io, and the vast majority of the data there is small CSV files or small 2-3MB zip files, with no more than 20 or 30 individual files within a dataset.
So is CKAN capable of doing what we want? If it isn't, are there any suggestions for alternatives?
Edit: more specific questions instead of discussion:
I have looked around and googled for information on this topic, but I haven't seen a deployed system with any significant amount of data.
- Is there a limit on the file sizes that I am able to upload (e.g. a zipped 400GB database file)?
- Is there a limit to the number of files I can upload as resources to a dataset within CKAN? (e.g. if I create a dataset and upload 250,000 64MB JSON files, would the system still be usable?)
- The UI doesn't seem to support uploading multiple files at the same time (e.g. a folder of data as a resource). Is there a tool/extension/plugin that supports this functionality already?
- a. Are there any restrictions that would prevent me from using the CKAN API to achieve this?
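
To make the last question concrete, this is roughly the kind of script I have in mind: a sketch that walks a folder and creates one resource per file through the `resource_create` API action. The host name, API key, and function names are placeholders, it relies on the third-party `requests` library for the multipart upload, and some CKAN versions may expect additional fields (such as `url`) that I have left out.

```python
from pathlib import Path

# Assumptions: hypothetical CKAN address and API key.
CKAN_URL = "http://ckan.example.lab"
API_KEY = "replace-with-your-api-key"


def plan_resources(folder, dataset_id, pattern="*.json"):
    """Build one resource_create payload per matching file in the folder."""
    return [
        {"package_id": dataset_id, "name": path.name, "path": str(path)}
        for path in sorted(Path(folder).glob(pattern))
    ]


def upload_folder(folder, dataset_id, pattern="*.json"):
    """Upload every matching file as its own resource, one request per file."""
    import requests  # third-party; imported here so planning works without it

    for item in plan_resources(folder, dataset_id, pattern):
        with open(item["path"], "rb") as fh:
            resp = requests.post(
                f"{CKAN_URL}/api/3/action/resource_create",
                data={"package_id": item["package_id"], "name": item["name"]},
                headers={"Authorization": API_KEY},
                files={"upload": fh},
            )
        # Stop at the first failed upload instead of continuing silently.
        resp.raise_for_status()
```

With 250,000 files this means 250,000 sequential HTTP requests and 250,000 resources on a single dataset, which is exactly the part I am unsure CKAN will cope with.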