I am modeling a data set in a graph database (Titan 0.5.2 on top of Cassandra). The data set has entities (represented by vertices) and two kinds of properties: links between entities (naturally represented by edges) and scalar properties (such as strings or numbers). There are a number of property types (about 2000 now). Each property type is always of the same kind (i.e., property P1 is always a link and property P2 is always a string), but each entity can have any set of properties, and properties can be repeated (e.g., entity E1 can have three P2 values and no P1 values).
The question is how best to model the scalar values of P2. Should they be properties of the entity vertex E1? A property on an edge between the entity vertex E1 and a property vertex P2? An edge labeled P2 between E1 and a value vertex containing the actual value? Something else? I am mainly interested in the performance considerations for each solution: is it better to have a lot of properties on each vertex, or "thin" vertices but a lot of them and a lot of edges? Is there a difference in how they can be indexed? I'm also interested in other considerations, such as convenience of querying.
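To make the three candidate layouts concrete, here is a rough sketch in plain Python. It is not Titan or Gremlin code, just dictionaries standing in for vertices and edges, and every name in it (`option_a`, the `has` edge label, the `v0`..`v2` vertex ids, the sample values) is made up for illustration.

```python
# Illustrative sketch only: plain Python dicts, not Titan or Gremlin API.
# All names (option_a, "has", "v0", ...) are hypothetical.

VALUES = ["red", "green", "blue"]  # three made-up P2 values for entity E1

# Option A: the values live on the entity vertex as a multi-valued property.
option_a = {
    "vertices": {"E1": {"P2": list(VALUES)}},
    "edges": [],
}

# Option B: one vertex per property type; each value sits on an edge E1 -> P2.
option_b = {
    "vertices": {"E1": {}, "P2": {}},
    "edges": [
        {"out": "E1", "in": "P2", "label": "has", "value": v} for v in VALUES
    ],
}

# Option C: one vertex per value; the edge label carries the property type.
option_c = {
    "vertices": {
        "E1": {},
        **{f"v{i}": {"value": v} for i, v in enumerate(VALUES)},
    },
    "edges": [
        {"out": "E1", "in": f"v{i}", "label": "P2"} for i in range(len(VALUES))
    ],
}


def values_a(graph, entity, prop):
    """Read the values straight off the entity vertex."""
    return list(graph["vertices"][entity].get(prop, []))


def values_b(graph, entity, prop):
    """Read the values from edges that point at the property-type vertex."""
    return [
        e["value"]
        for e in graph["edges"]
        if e["out"] == entity and e["in"] == prop
    ]


def values_c(graph, entity, prop):
    """Follow edges labeled with the property type to the value vertices."""
    return [
        graph["vertices"][e["in"]]["value"]
        for e in graph["edges"]
        if e["out"] == entity and e["label"] == prop
    ]


def size(graph):
    """Return (number of vertices, number of edges)."""
    return len(graph["vertices"]), len(graph["edges"])
```

For this one entity with three P2 values, option A needs 1 vertex and no edges, option B needs 2 vertices and 3 edges (with the P2 vertex shared by every entity that has a P2 value), and option C needs 4 vertices and 3 edges.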
The data set contains tens of millions of entities (and will potentially grow, probably to hundreds of millions). Each vertex usually has about 10-20 properties, but some vertices can have many more, i.e. hundreds or more. The anticipated queries could use any property, testing both whether it is present and what its value is, and may also require calculations like "the greatest P2 value for this entity" or "does this entity have any P2 value that satisfies a certain condition". The querying is planned to be done with Gremlin-type queries, but using Titan-only features is not excluded if it helps.
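To pin down what those queries mean, here is a small sketch of their intended semantics in plain Python, independent of whichever graph layout is chosen. The function names and the sample numeric values are hypothetical, and an entity is reduced to a mapping from property type to its list of values.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for an entity: property type -> list of scalar values.
Entity = Dict[str, List[float]]


def has_property(entity: Entity, prop: str) -> bool:
    """True if the entity has at least one value for the property."""
    return bool(entity.get(prop))


def greatest(entity: Entity, prop: str) -> Optional[float]:
    """The greatest value of the property, or None if the entity has none."""
    values = entity.get(prop, [])
    return max(values) if values else None


def any_matching(
    entity: Entity, prop: str, predicate: Callable[[float], bool]
) -> bool:
    """True if any value of the property satisfies the condition."""
    return any(predicate(v) for v in entity.get(prop, []))


# Made-up example: E1 has three P2 values and no P1 values.
e1: Entity = {"P2": [3, 17, 8]}
```

With this sample entity, `greatest(e1, "P2")` stands for "the greatest P2 value for this entity" and `any_matching(e1, "P2", lambda v: v > 10)` for "does this entity have any P2 value above 10".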