I have a legacy dataset (ENRON data represented as GraphML) that I would like to query. In an comment in a related question, @StefanArmbruster suggests that I use Cypher to query the database. My query use case is simple: given a message id (a property of the Message node), retrieve the node that has that id, and also retrieve the sender and recipient nodes of that message.
It seems that to do this in Cypher, I first have to create an index of the nodes. Is there a way to do this automatically when the data is loaded from the graphML file? (I had used Gremlin to load the data and create the database.)
I also have an external Lucene index of the data (I need it for other purposes). Does it make sense to have two indexes? I could, for example, index the Neo4J node ids into my external index, and then query the graph based on those ids. My concern is about the persistence of these ids. (By analogy, Lucene document ids should not be treated as persistent.)
So, should I:
Index the Neo4j graph internally to query on message ids using Cypher? (If so, what is the best way to do that: regenerate the database with some suitable incantation to get the index built? Build the index on the already-existing db?)
Store Neo4j node ids in my external Lucene index and retrieve nodes via these stored ids?
UPDATE
I have been trying to get auto-indexing to work with Gremlin and an embedded server, but with no luck. In the documentation it says
The underlying database is auto-indexed, see Section 14.12, “Automatic Indexing” so the script can return the imported node by index lookup.
But when I examine the graph after loading a new database, no indexes seem to exist.
The Neo4j documentation on auto indexing says that a bunch of configuration is required. In addition to setting node_auto_indexing = true, you have to configure it
To actually auto index something, you have to set which properties should get indexed. You do this by listing the property keys to index on. In the configuration file, use the node_keys_indexable and relationship_keys_indexable configuration keys. When using embedded mode, use the GraphDatabaseSettings.node_keys_indexable and GraphDatabaseSettings.relationship_keys_indexable configuration keys. In all cases, the value should be a comma separated list of property keys to index on.
So is Gremlin supposed to set the GraphDatabaseSettings parameters? I tried passing in a map into the Neo4jGraph constructor like this:
Map<String,String> config = [
'node_auto_indexing':'true',
'node_keys_indexable': 'emailID'
]
Neo4jGraph g = new Neo4jGraph(graphDB, config);
g.loadGraphML("../databases/data.graphml");
but that had no apparent effect on index creation.
UPDATE 2
Rather than configuring the database through Gremlin, I used the examples given in the Neo4j documentation so that my database creation was like this (in Groovy):
protected Neo4jGraph getGraph(String graphDBname, String databaseName) {
boolean populateDB = !new File(graphDBName).exists();
if(populateDB)
println "creating database";
else
println "opening database";
GraphDatabaseService graphDB = new GraphDatabaseFactory().
newEmbeddedDatabaseBuilder( graphDBName ).
setConfig( GraphDatabaseSettings.node_keys_indexable, "emailID" ).
setConfig( GraphDatabaseSettings.node_auto_indexing, "true" ).
setConfig( GraphDatabaseSettings.dump_configuration, "true").
newGraphDatabase();
Neo4jGraph g = new Neo4jGraph(graphDB);
if (populateDB) {
println "Populating graph"
g.loadGraphML(databaseName);
}
return g;
}
and my retrieval was done like this:
ReadableIndex<Node> autoNodeIndex = graph.rawGraph.index()
.getNodeAutoIndexer()
.getAutoIndex();
def node = autoNodeIndex.get( "emailID", "<2614099.1075839927264.JavaMail.evans@thyme>" ).getSingle();
And this seemed to work. Note, however, that the getIndices() call on the Neo4jGraph object still returned an empty list. So the upshot is that I can exercise the Neo4j API correctly, but the Gremlin wrapper seems to be unable to reflect the indexing state. The expression g.idx('node_auto_index') (documented in Gremlin Methods) returns null.