We are in the process of evaluating whether to move an multitenant EAV system built on PostGres to Cassandra and I wanted input on our schema approach to see if a test with Cassandra makes sense. Our multitenant system hierarchy consists of account->app where an account can run multiple apps. Queries need to be segregated by app or by account (aggregating all app data for the account). Accounts can create their own dataobjects with their own custom fields in our EAV model.
There are two approaches that I have considered taking with Cassandra. The first is to hold a certain number of apps (say 20) within 1 column family (to reduce the number of column families used). Each row would be identified by a composite column of accountid->appid->dataobjectid->recordid. Columns would be added on the fly for each app's dataobject as needed by that app. This means if the column family had two apps, 1 row for the first app may have 20 columns defined while the second app might have 30 columns defined. This would mean there would be a total of 50 potential columns for those two apps. Right now the average number of columns for an app is 19. This means the avg number of columns in a column family would be 400. Seems reasonable and takes advantage of Cassandra's wide column support. In fact, we could probably easily support more apps per columnfamily. The drawback is that secondary indexes would be difficult as we don't allow user's to create their own indexes so queries could not be made more efficient without .
The second approach is to have two columnfamilies to hold all data for say 1000 apps. The first column family would have the same composite column as above, but it would hold the entire dataobject for that row in a JSON document. A second column family would have the same composite key but would add another value to the key which is fieldid that represents the field within the json document (our apps metadata manager stores UUIDs to identify each "field" within a JSON doc), but would have a "fieldvalue" column for each datatype - string, number, decimal, float (dates and bools get converted to numbers). The nice feature here is that we could easily index each of those columns for search purposes and we are minimizing the number of columnfamilies we create.
What are the pros and cons of the two approaches above? Am I missing something obvious or misunderstanding Cassandra in the scenarios above (for example, can I have composite columns that are so wide in the first place)? Are there other, better schema suggestions for an app of this type?