I'm working on a POC with OrientDB, which I've set up across 3 servers. I read the OrientDB documentation and would like to know the best way to load data that is stored as CSV files. The schema has 3 vertex classes and 3 edge classes, which are interconnected with one another.

Below are the questions I have:

1) In terms of ETL performance, does it make sense to create 3 clusters for each class and assign each cluster to one of the servers? (Based on this link: http://orientdb.com/docs/2.2.x/Distributed-Sharding.html. I'm not worried about fault tolerance at this stage.)

2) Regarding the ETL storage process, I'm considering 3 options:

For the 2nd and 3rd methods, I have to provide record IDs manually. How do I make sure duplicate vertices are not created? Will indexing help avoid this? How do the 3 methods compare in terms of performance?

3) Is it possible to load the data into one server of the OrientDB cluster, locally on that machine, using the "plocal" option in the ETL tool?

4) Is it possible to use the plocal option for ETL even when OrientDB runs in distributed mode?

1 Answer

  1. That makes sense. Also pay attention to the number of copies: with 3 servers, if you replicate the same cluster to all of them, loading will be slower.
  2. I suggest using the ETL tool if you don't need complex transformations. If it turns out to be too slow, you can write your own loader in Java.
  3. and 4. It's supported, but not from the oetl.sh script. You have to write a Java class with a few lines of code that (1) starts a distributed server as embedded and (2) then runs the ETL main class (com.orientechnologies.orient.etl.OETLProcessor).
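
On the duplicate-vertex part of question 2, a UNIQUE index on the key field is the usual safeguard, and the ETL loader can declare it for you. Below is a minimal sketch of an ETL configuration, not a tested setup. The CSV path, the database path, the `Person` class and the `id` field are all placeholders made up for illustration, and `skipDuplicates` is used on the assumption that the ETL version in use supports that option on the vertex transformer.

```json
{
  "source": { "file": { "path": "/tmp/person.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Person", "skipDuplicates": true } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/databases/poc",
      "dbType": "graph",
      "classes": [
        { "name": "Person", "extends": "V" }
      ],
      "indexes": [
        { "class": "Person", "fields": ["id:integer"], "type": "UNIQUE" }
      ]
    }
  }
}
```

With the UNIQUE index in place, a second row carrying the same `id` cannot produce a second vertex. The `plocal:` URL here is only an example; as noted in point 3 and 4, using it against a distributed setup means running the ETL from your own Java class rather than from oetl.sh.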