Reducing the cost of Cypher Query

Question

I have a simple database that I'm using to analyze Twitter data among a specific group.

The data model is:

(:Person)-[:TWEETS_TO]->(:Twitter_Account)

and

(:Twitter_Account)-[:FOLLOWS]->(:Twitter_Account)

There are only a little over 500 (:Person) nodes, but there are about 500,000 (:Twitter_Account) nodes. In other words, most (:Twitter_Account) nodes aren't connected to people.

I want to count the number of following relationships, but only among the 500 or so Twitter accounts that are connected to people. Searching around, I found this Neo4j blog post and this SO post that suggest a query like this:

MATCH (p:Person)-[:TWEETS_TO]->(t1:Twitter_Account)
WITH t1,
     size((t1)-[:FOLLOWS]->(:Twitter_Account)<-[:TWEETS_TO]-(:Person)) AS following
RETURN t1, following
ORDER BY following
LIMIT 5

Profiling gives:

Cypher version: CYPHER 3.2, planner: COST, runtime: INTERPRETED. 2938092 total db hits in 1356 ms.

As you can see, it's relatively quick, but my intuition says there should be a way to write the query without so many db hits, since we are only looking at a small, easily defined subset of the data. Everything else I've tried (such as matching both Twitter accounts first) results in cartesian products that are much slower than the above.

Is there a way to count these relationships without looking at every twitter account?

InverseFalcon · Accepted Answer · 2017-07-31T21:19:29

You may want to consider adding a separate label to the :Twitter_Account nodes that are connected to people, to make your querying a little easier later.

MATCH (t:Twitter_Account)
WHERE exists(()-[:TWEETS_TO]->(t))
SET t:Connected_Account
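
As a quick sanity check after running this, a count of the newly labeled nodes should land close to the number of accounts you expect to be linked to people (a little over 500, if I'm right in assuming roughly one account per person; the alias `connected_accounts` is just illustrative):

```cypher
// Count the nodes that received the new label.
MATCH (t:Connected_Account)
RETURN count(t) AS connected_accounts
```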

If your graph needs to handle updates, then whenever a new account is added (or a new :TWEETS_TO relationship is created) you'll need to check whether a :Person is connected to the account and add the label accordingly.
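
One way to keep the label current is to set it in the same statement that creates the :TWEETS_TO relationship. This is only a sketch: the `id` and `handle` properties and the `$personId` and `$handle` parameters are my assumptions, since the question doesn't show how your nodes are keyed.

```cypher
// Assumed lookup keys: Person.id and Twitter_Account.handle (not from the question).
MATCH (p:Person {id: $personId})
MATCH (t:Twitter_Account {handle: $handle})
// Create the relationship only if it doesn't already exist.
MERGE (p)-[:TWEETS_TO]->(t)
// Setting a label the node already has is a no-op, so this is safe to repeat.
SET t:Connected_Account
```

If :TWEETS_TO relationships can also be deleted, the label would need to be removed when an account loses its last connected :Person.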

Once that's in place, your query later on becomes:

MATCH (t1:Connected_Account)
WITH t1, size((t1)-[:FOLLOWS]->(:Connected_Account)) as following
RETURN t1, following 
ORDER BY following 
LIMIT 5

If there are only 500 :Connected_Account nodes, then this should drastically reduce your db hits and speed up your query.
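
If what you want is a single total rather than a per-account ranking, something along these lines should work with the same label (the alias `follows_among_connected` is just illustrative):

```cypher
// Count FOLLOWS relationships whose start and end nodes are both connected accounts.
MATCH (:Connected_Account)-[f:FOLLOWS]->(:Connected_Account)
RETURN count(f) AS follows_among_connected
```

Prefixing either query with PROFILE lets you compare the db hits against your original. Note that each connected account's outgoing :FOLLOWS relationships still have to be expanded to check the label on the other end, so the cost will likely depend on how many accounts those 500 or so follow rather than on the 500,000 total.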
