I have a simple database that I'm using to analyze Twitter data within a specific group.
The data model is:
(:Person)-[:TWEETS_TO]->(:Twitter_Account)
and
(:Twitter_Account)-[:FOLLOWS]->(:Twitter_Account)
There are only a little over 500 (:Person) nodes, but there are about 500,000 (:Twitter_Account) nodes. In other words, most (:Twitter_Account) nodes aren't connected to people.
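To make that subset concrete: the accounts of interest are exactly those with an incoming TWEETS_TO relationship. A minimal sketch of counting them, assuming the labels and relationship types shown above (the alias linked_accounts is only illustrative):

```cypher
// Count the accounts that at least one person tweets to.
// DISTINCT guards against an account being linked to more than one person.
MATCH (:Person)-[:TWEETS_TO]->(t:Twitter_Account)
RETURN count(DISTINCT t) AS linked_accounts
```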
I want to count the number of FOLLOWS relationships, but only among the 500 or so Twitter accounts that are connected to people. Searching around, I found this Neo4j blog post and this SO post, which suggest a query like this:
MATCH (p:Person)-[:TWEETS_TO]->(t1:Twitter_Account)
WITH t1, size((t1)-[:FOLLOWS]->(:Twitter_Account)<-[:TWEETS_TO]-(:Person)) AS following
RETURN t1, following ORDER BY following LIMIT 5
Profiling gives:
Cypher version: CYPHER 3.2, planner: COST, runtime: INTERPRETED. 2938092 total db hits in 1356 ms.
As you can see, it's relatively quick, but my intuition says there should be a way to write the query without so many DB hits, since we are only looking at a small, easily defined subset of the data. Everything else I've tried (such as matching both Twitter accounts first) results in Cartesian products that are much slower than the above.
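A sketch of what "matching both accounts first" could look like, assuming the same labels and relationship types as above. It is an illustrative reconstruction, not necessarily the exact query tried:

```cypher
// Hypothetical reconstruction of the "match both accounts first" attempt.
// The two MATCH patterns share no variables, so every (t1, t2) pair of
// person-linked accounts is a candidate before FOLLOWS is checked.
MATCH (:Person)-[:TWEETS_TO]->(t1:Twitter_Account)
MATCH (:Person)-[:TWEETS_TO]->(t2:Twitter_Account)
WHERE (t1)-[:FOLLOWS]->(t2)
RETURN t1, count(DISTINCT t2) AS following
ORDER BY following LIMIT 5
```

Note that, unlike the size() version, this form drops accounts that follow no one in the group, because the WHERE clause filters those rows out.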
Is there a way to count these relationships without looking at every Twitter account?