2
votes

I am using Titan 0.4 with Gremlin for traversals. My requirement is to identify duplicate vertices in the graph and merge them. There are more than 15 million vertices in the graph.

gremlin> g.V.has('domain').groupBy{it.domain}{it.id}.cap

==>{google.com=[4], yahoo.com=[16, 24, 20]}

I am able to group the vertices, but I need only those domains (vertices) that exist more than once.

In the above example, I need to return only ==>{yahoo.com=[16, 24, 20]}. The key "domain" is indexed, if that makes any difference.

Please help me here

2 Answers

2
votes

Consider using groupCount rather than groupBy; it saves the step of counting the ids in each collected list:

g.V.has('domain').groupCount{it.domain}.cap.next().findAll{it.value>1}

I suppose this is also cheaper on a larger traversal, since you maintain only a counter per domain rather than lists of identifiers.
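
Note that groupCount returns only counts, so if the vertex ids are needed for the merge, the same findAll filter can probably be applied to the groupBy result instead. The following is a minimal sketch in plain Groovy, runnable without Titan: the `grouped` map is a hand-written stand-in for what `groupBy{it.domain}{it.id}.cap.next()` returned in the question, and the variable names are illustrative assumptions.

```groovy
// Stand-in for the map produced by groupBy{it.domain}{it.id}.cap.next();
// the entries are copied from the output shown in the question.
def grouped = ['google.com': [4], 'yahoo.com': [16, 24, 20]]

// Keep only the domains that map to more than one vertex id.
def duplicates = grouped.findAll { it.value.size() > 1 }

// For comparison: a groupCount-style map (domain -> count, no ids kept),
// filtered the same way as in the one-liner above.
def counts = grouped.collectEntries { [(it.key): it.value.size()] }
def duplicateCounts = counts.findAll { it.value > 1 }

println duplicates
println duplicateCounts
```

Against the real graph, the equivalent would presumably be the question's traversal followed by `.next().findAll{it.value.size() > 1}`, at the cost of holding every id list in memory.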

0
votes

Old question, but did you try the following to force use of the index?

g.V.hasNot('domain', null).groupBy{it.domain}{it.id}.cap