I'm working on a POC with OrientDB, which I've set up across 3 servers. I've read the OrientDB documentation and would like to know the best way to load data that is in the form of CSV files. The schema has 3 vertex classes and 3 edge classes, which are interconnected with one another.
Below are some of the questions I have:
1) Does it make sense, in terms of ETL performance, to create 3 clusters for each of the classes and assign each cluster to one of the servers? (Based on this link: http://orientdb.com/docs/2.2.x/Distributed-Sharding.html. I'm not worried about fault tolerance at this stage.)
2) Regarding the ETL storage process, I'm considering 3 options:
- The ETL tool provided with OrientDB (with all possible optimizations)
- Utilizing OGraphBatchInsert
- Storing in terms of documents ( http://orientdb.com/docs/2.2.x/Graph-Batch-Insert.html )
For the 2nd and 3rd methods, I'm required to provide record IDs manually. My doubt is: how do I make sure duplicate vertices are not created? Will indexing help avoid this? How do the above 3 methods compare in terms of performance?
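To make the duplicate concern concrete, here is a minimal sketch of what I imagine doing on the client side for the 2nd and 3rd options: keep a map from each vertex's natural key to the ID I assign, so a repeated CSV row reuses the existing ID instead of producing a second vertex. This does not call any OrientDB API; the function name, the `name` key column and the sample rows are all made up for illustration.

```python
import csv
import io


def assign_vertex_ids(rows, key_field):
    """Assign one sequential id per distinct key; repeated keys reuse the id."""
    ids = {}
    for row in rows:
        key = row[key_field]
        if key not in ids:
            # First time this key is seen: give it the next free id.
            ids[key] = len(ids)
    return ids


# Hypothetical CSV content; "alice" appears twice on purpose.
sample = io.StringIO("name,city\nalice,Rome\nbob,Paris\nalice,Rome\n")
rows = list(csv.DictReader(sample))

vertex_ids = assign_vertex_ids(rows, "name")
print(vertex_ids)
```

With this approach the in-memory map has to hold every key for the whole load, which is part of why I'm asking whether an index on the server side would be the better way to handle it.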
3) Is it possible to load the data into one server of the OrientDB cluster, from within that machine, using the "plocal" option in the ETL tool?
4) Is it possible to use the plocal option for ETL even when OrientDB runs in distributed mode?