Gremlin: groupBy vertices, having count > 1

Question

I am using Titan 0.4 with Gremlin for traversals. I need to identify duplicate vertices in the graph and merge them. The graph has more than 15 million vertices.

gremlin> g.V.has('domain').groupBy{it.domain}{it.id}.cap

==>{google.com=[4], yahoo.com=[16, 24, 20]}

I am able to group the vertices, but I need only those domains (vertices) that exist more than once.

In the example above, I need to return only ==>{yahoo.com=[16, 24, 20]}. The key "domain" is indexed, if that makes any difference.

Please help me here

stephen mallette · Accepted Answer · 2015-05-11T10:17:44

Consider using groupCount rather than groupBy; it saves the step of counting up the ids in each collected list:

g.V.has('domain').groupCount{it.domain}.cap.next().findAll{it.value>1}

I suppose this is also cheaper on a larger traversal, as you are just maintaining one counter per domain rather than lists of identifiers.
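To make the difference between the two approaches concrete, here is a minimal plain-Groovy sketch that runs without Titan. The `vertices` list is a hypothetical in-memory stand-in for the graph in the question (it is not a Titan API), and `countBy`, `groupBy`, `collectEntries` and `findAll` are ordinary Groovy collection methods, not Gremlin steps.

```groovy
// Hypothetical stand-in for the vertices in the question; a real graph
// would be traversed with g.V.has('domain') instead.
def vertices = [
    [id: 4,  domain: 'google.com'],
    [id: 16, domain: 'yahoo.com'],
    [id: 24, domain: 'yahoo.com'],
    [id: 20, domain: 'yahoo.com'],
]

// Counting variant (analogous to groupCount): one counter per domain,
// then keep only the domains seen more than once.
def duplicateCounts = vertices.countBy { it.domain }
                              .findAll { it.value > 1 }

// List variant (analogous to groupBy{it.domain}{it.id}): collect the ids
// per domain, then keep only the groups holding more than one id.
def duplicateIds = vertices.groupBy { it.domain }
                           .collectEntries { domain, group -> [domain, group*.id] }
                           .findAll { it.value.size() > 1 }

println duplicateCounts
println duplicateIds
```

The counting variant only tells you which domains are duplicated. If the vertex ids are needed for the merge itself, the same filter can presumably be applied to the original groupBy result, for example `g.V.has('domain').groupBy{it.domain}{it.id}.cap.next().findAll{it.value.size() > 1}`, at the cost of holding the id lists in memory.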
