I have a simple database that I'm using to analyze Twitter data among a specific group.

The data model is:

(:Person)-[:TWEETS_TO]->(:Twitter_Account)

and

(:Twitter_Account)-[:FOLLOWS]->(:Twitter_Account)

There are only a little over 500 (:Person) nodes, but about 500,000 (:Twitter_Account) nodes. In other words, most (:Twitter_Account) nodes aren't connected to people.

I want to count the FOLLOWS relationships, but only among the 500 or so Twitter accounts that are connected to people. Searching around, I found this Neo4j blog post and this SO post, which suggest a query like this:

MATCH (p:Person)-[:TWEETS_TO]->(t1:Twitter_Account)
WITH t1, 
size((t1)-[:FOLLOWS]->(:Twitter_Account)<-[:TWEETS_TO]-(:Person)) 
AS following
RETURN t1, following ORDER BY following LIMIT 5

Profiling gives:

Cypher version: CYPHER 3.2, planner: COST, runtime: INTERPRETED. 2938092 total db hits in 1356 ms.

As you can see, it's relatively quick, but my intuition says there should be a way to write the query without so many DB hits, since we are only looking at a small, easily defined subset of the data. Everything else I've tried (such as matching both Twitter accounts first) results in cartesian products that are much slower than the above.

Is there a way to count these relationships without looking at every twitter account?

2 Answers

1
votes

You may want to consider adding a separate label to the :Twitter_Accounts that are connected to people, to make your querying a little easier later.

MATCH (t:Twitter_Account)
WHERE exists(()-[:TWEETS_TO]->(t))
SET t:Connected_Account

If your graph needs to handle updates, you'll need to make sure that whenever an account is added or linked to a :Person, you check for that connection and add the label accordingly.
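One way to keep the label current is to set it in the same statement that creates the :TWEETS_TO relationship. This is only a sketch: the parameters $personId and $handle and the id and handle properties are hypothetical, since the question doesn't say how nodes are keyed.

```cypher
// Sketch only: $personId, $handle, and the `id` / `handle` properties are
// hypothetical; substitute whatever keys your nodes actually use.
MATCH (p:Person {id: $personId})
MERGE (t:Twitter_Account {handle: $handle})
MERGE (p)-[:TWEETS_TO]->(t)
// Setting a label that is already present is a no-op, so this is safe to repeat.
SET t:Connected_Account
```

If :TWEETS_TO relationships can also be deleted, you would need a matching step that removes the label once an account has no remaining :Person attached.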

Once that's in place, your query later on becomes:

MATCH (t1:Connected_Account)
WITH t1, size((t1)-[:FOLLOWS]->(:Connected_Account)) as following
RETURN t1, following 
ORDER BY following 
LIMIT 5

If there are only 500 :Connected_Account nodes, then this should drastically reduce your db hits and speed up your query.
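To check whether the db hits actually drop on your data, you can prefix the query with PROFILE, as in the question, and compare the totals. No expected numbers are given here because they depend on your graph.

```cypher
// Same query as above, prefixed with PROFILE to report db hits per operator.
PROFILE
MATCH (t1:Connected_Account)
WITH t1, size((t1)-[:FOLLOWS]->(:Connected_Account)) AS following
RETURN t1, following
ORDER BY following
LIMIT 5
```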

1
votes

You should only need to do a DB search for the Twitter_Account nodes (that are "owned" by any Person) once.

For example:

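// Find the owned accounts once. DISTINCT guards against double counting
// if the same account is linked to more than one Person.
MATCH (:Person)-[:TWEETS_TO]->(t1:Twitter_Account)
WITH COLLECT(DISTINCT t1) AS accts
UNWIND accts AS acct
// Keep only followed accounts that are themselves owned.
OPTIONAL MATCH (acct)-[:FOLLOWS]->(t2)
WHERE t2 IN accts
RETURN acct, COUNT(t2) AS following
ORDER BY following
LIMIT 5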

In this query, we find all the Twitter_Account nodes that are "owned" by a Person, and keep that collection in accts. We then UNWIND that collection to find how many owned accounts (t2) are followed by each owned account (acct). Finally, we return each owned acct and the number of owned accounts it follows. (If you only want to return owned accounts that follow at least 1 owned account, replace OPTIONAL MATCH with MATCH).
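For reference, the MATCH variant mentioned in the parenthetical might look like the sketch below. The DISTINCT in the COLLECT is an added precaution, on the assumption that one account could be linked to several :Person nodes; without it such an account would be unwound and counted more than once.

```cypher
// Variant: owned accounts that follow no owned account are dropped entirely,
// because MATCH (unlike OPTIONAL MATCH) produces no row for them.
MATCH (:Person)-[:TWEETS_TO]->(t1:Twitter_Account)
WITH COLLECT(DISTINCT t1) AS accts
UNWIND accts AS acct
MATCH (acct)-[:FOLLOWS]->(t2)
WHERE t2 IN accts
RETURN acct, COUNT(t2) AS following
ORDER BY following
LIMIT 5
```

Since ORDER BY following sorts ascending by default, both versions return the five accounts with the fewest owned followees; append DESC to get the most-following accounts instead.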