Gremlin query to load a CSV file with selected columns

Question

I am using the following Gremlin script to create a graph from a CSV file:

graph = TinkerGraph.open()
graph.createIndex('userId', Vertex.class) //(1)
g = graph.traversal()
getOrCreate = { id ->
  g.V().has('userId', id).tryNext().orElseGet{ g.addV('userId', id).next() }
}
new File('wiki-Vote.txt').eachLine {
  if (!it.startsWith("#")) {
    (fromVertex, toVertex) = it.split(',').collect(getOrCreate) //(2)
    fromVertex.addEdge('votesFor', toVertex)
  }
}

The line I am asking about is:

it.split(',').collect(getOrCreate)

This line splits each row of the CSV file on the "," delimiter and then calls the getOrCreate closure on every resulting value, so each value is looked up through the index or added as a new vertex.

If I run g.V().count(), it counts the values from all columns, but I only need to add selected columns as vertices.

What I need: I want to apply getOrCreate only to selected columns instead of to all columns.

For example, if the CSV file has the columns name, age, Id, and marks, I want to apply getOrCreate only to the name and age columns and add those as vertices. g.V().count() should then count only the name and age values.

Your code would be much more readable with better indentation and better variable names (don't use one-character varnames) — glenn jackman

stephen mallette · Accepted Answer · 2017-02-06T12:04:16

That example you provided looks like the one from the Powers of Ten blog post on bulk loading. The blog post presents a bit of an over-simplification of the CSV loading concept to convey the point that a simple Groovy script is the best way to load small graphs. The logic is also tied fairly tightly to the wikivote data, which is an edge list with just user identifiers.

If you have a more complex set of logic for loading, or a CSV file that contains more columns than you care to load, then you'll need to expand on the starting point presented in the blog post. How you do this depends on the structure of your CSV file. Let's assume it is still just an edge list, as the wikivote data was, but with more columns for each vertex of the pair in the edge list:

getOrCreate = { id,name,age ->
  def p = g.V('userId', id)
  p.hasNext() ? p.next() : g.addVertex([userId:id, userName:name, userAge:age])
}

new File('wiki-Vote.txt').eachLine {
  if (!it.startsWith("#")){
    def row = it.split('\t')
    def fromVertex = getOrCreate(row[0],row[1],row[3])
    def toVertex = getOrCreate(row[5],row[6],row[8])
    fromVertex.addEdge('votesFor', toVertex)
  }
}

g.commit()

So instead of using Groovy magic to decompose a row of the CSV file into vertices, we just split the row into a list of columns. Then we call getOrCreate for the "fromVertex" and the "toVertex" with just the columns that we need. I made assumptions about how your data is structured, so hopefully you get the idea that I was able to ignore certain columns in this code. If your CSV file is significantly more complex, you might want to consider getting some help from groovycsv, which is a really nice parsing library and can help simplify your code a bit.
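If the file has a header row, one way to avoid hard-coded positions like row[3] is to look the column indexes up by name. The following is only a minimal sketch in plain Groovy (no graph and no groovycsv involved). The sample rows are made up, and the column names are borrowed from the example in the question.

```groovy
// Hypothetical sample data; column names follow the question's example.
def lines = [
    '# comment lines are skipped',
    'name,age,Id,marks',
    'alice,30,1,88',
    'bob,25,2,92'
]

// The columns we actually want to load (an assumption for this sketch).
def wanted = ['name', 'age']

// Drop comment lines, then separate the header row from the data rows.
def rows = lines.findAll { !it.startsWith('#') }.collect { it.split(',') as List }
def header = rows.head()
def data = rows.tail()

// Map each wanted column name to its position in the header.
def indexes = wanted.collectEntries { [(it): header.indexOf(it)] }

// Build one map per row that contains only the selected columns.
def selected = data.collect { row ->
    indexes.collectEntries { column, i -> [(column): row[i]] }
}

selected.each { println it }
```

Each entry of selected can then be handed to getOrCreate in place of positional arguments, so the Id and marks columns never reach the graph.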

Note that this code (and the blog post) were based on code for TinkerPop 2.x and Titan 0.5.x. Obviously, the Gremlin syntax for "addVertex" would have to be adjusted for TinkerPop 3.x if you needed that.
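For TinkerPop 3.x, a rough equivalent might look like the sketch below. It is written as a standalone Groovy script and assumes the tinkergraph-gremlin library is on the classpath. The in-memory sample rows are hypothetical stand-ins for the file, the column positions are the same assumptions as in the code above, and the vertices are created with addV() and no explicit label.

```groovy
import org.apache.tinkerpop.gremlin.structure.Vertex
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

def graph = TinkerGraph.open()
graph.createIndex('userId', Vertex.class)
def g = graph.traversal()

// Find the vertex by userId, or create it with only the selected properties.
def getOrCreate = { id, name, age ->
  g.V().has('userId', id).tryNext().orElseGet {
    g.addV().property('userId', id).property('userName', name).property('userAge', age).next()
  }
}

// Hypothetical tab-separated rows; in practice read them with
// new File('wiki-Vote.txt').eachLine { ... } as in the code above.
def lines = [
  '# made-up sample data, nine tab-separated columns per row',
  '1\talice\tx\t30\tx\t2\tbob\tx\t25',
  '1\talice\tx\t30\tx\t3\tcarol\tx\t41'
]

lines.each { line ->
  if (!line.startsWith('#')) {
    def row = line.split('\t')
    // Only columns 0, 1, 3 and 5, 6, 8 are used; the rest are ignored.
    def fromVertex = getOrCreate(row[0], row[1], row[3])
    def toVertex = getOrCreate(row[5], row[6], row[8])
    fromVertex.addEdge('votesFor', toVertex)
  }
}

println g.V().count().next()
```

Because getOrCreate looks the vertex up by userId first, the user that appears in both sample rows is created only once.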
