I have a neo4j database schema that looks like:

(a:Author)<-[r:HAS_AUTHOR]-(n:Article)-[rel:HAS_DESCRIPTOR]->(d:Descriptor)

I'd like to write a query showing the links between authors and descriptors, filtered to authors who have published more than once (count(r)>1) and to descriptors that occur in more than one article (count(rel)>1).

Here is the query that I wrote:

MATCH (a:Author)<-[r:HAS_AUTHOR]-(n:Article)-[rel:HAS_DESCRIPTOR]->(d:Descriptor)
WITH a,count(r) as cnt WHERE cnt>1
MATCH (a:Author)<-[r:HAS_AUTHOR]-(n:Article)-[rel:HAS_DESCRIPTOR]->(d:Descriptor)
WITH d,count(rel) as cnt1 WHERE cnt1>1
MATCH (a:Author)<-[r:HAS_AUTHOR]-(n:Article)-[rel:HAS_DESCRIPTOR]->(d:Descriptor)
RETURN * limit 100

It doesn't seem to do what I'm expecting. I'm still seeing Authors or Descriptors linked to a single article.

Note that the relationship counts should be considered only in the context of the query (i.e., with limit 100, every author should be linked to more than one article in the query output graph).

Is that the right way to write this query? Thanks

EDIT

I apologize for not being clear enough.

If I run a simple query showing all author--article--descriptor graphs, I get some of the scenarios shown in the images below.

In all images, yellow nodes are articles, green are authors and pink are descriptors.

Scenario 1: An article that is the only one mentioning the descriptor. I'd like to filter out those descriptors that are mentioned in only one article.

[image]

Scenario 2: A descriptor mentioned by more than one article but whose authors have not published any other articles. I'd like to filter out those authors that have published only one article.

[image]

These two filters should apply at the sub-graph level. For example: if I filter down to a particular descriptor type, then the two conditions (author and descriptor with more than one article) should be fulfilled in this new sub-graph.

The first query that was proposed generates graphs like the one in the image below:

MATCH (a:Author)
WHERE size((a)<-[:HAS_AUTHOR]-()) > 1
MATCH (a)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(d:Descriptor)
WITH a, d, collect(n) as articles
WHERE size(articles) > 1
RETURN a, d, articles

The collect(n) as articles for each a, d pair forces the author to have published twice on the same descriptor, which is not desirable. I'd like an author who has published papers on two different descriptors to appear as well.

[image]

The second query that was proposed generates graphs like the one in the image below:

MATCH (d:Descriptor)
WHERE size((d)<-[:HAS_DESCRIPTOR]-()) > 1
WITH collect(d) as descriptors
MATCH (a:Author)
WHERE size((a)<-[:HAS_AUTHOR]-()) > 1
MATCH (a)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(d)
WHERE d in descriptors
RETURN a, n, d

Note that I added a filter on descriptor type so that the query could run, and I'm not sure whether that affects the filtering condition. The result still shows descriptors and authors linked to a single article.

[image]

How many of each of these node types are there in the graph? Also, for descriptors that occurred in more than one article (count(rel)>1), do you mean across all articles, or for more than one article considering articles per author? - InverseFalcon
@InverseFalcon there are 10k author nodes, 26k article nodes and 1.3k descriptors. I meant more than one article within the query sub-graph (i.e., any descriptor in that sub-graph should be linked to more than one article). - yoann

1 Answer

The first optimization is filtering for :Authors that have published more than once. All this requires is a degree check on :HAS_AUTHOR relationships from the author, which is cheap since a node knows the types and counts of the relationships attached to it. You can use the size() function on the pattern to get this: WHERE size((author)<-[:HAS_AUTHOR]-()) > 1.
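
As a minimal sketch of that degree check on its own, the query below lists each qualifying author with the number of attached :HAS_AUTHOR relationships. The second form rests on an assumption about your version: on Neo4j 5.x, where size() over a pattern is no longer accepted, a COUNT { } subquery expression expresses the same check.

```cypher
// Neo4j 3.x/4.x: size() over a pattern expression counts matching relationships.
MATCH (a:Author)
WHERE size((a)<-[:HAS_AUTHOR]-()) > 1
RETURN a, size((a)<-[:HAS_AUTHOR]-()) AS articleCount
ORDER BY articleCount DESC
LIMIT 25;

// Neo4j 5.x (assumed version): the same check with a COUNT subquery expression.
MATCH (a:Author)
WHERE COUNT { (a)<-[:HAS_AUTHOR]-() } > 1
RETURN a, COUNT { (a)<-[:HAS_AUTHOR]-() } AS articleCount
ORDER BY articleCount DESC
LIMIT 25;
```

Both forms count relationships rather than distinct articles, so they agree with "published more than once" only if an article is never linked to the same author twice.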

Next, to get the patterns involving descriptors that occur in more than one article, we need to aggregate the articles by author and descriptor, keeping only rows where there is more than one article.

Try this out:

MATCH (a:Author)
WHERE size((a)<-[:HAS_AUTHOR]-()) > 1
MATCH (a)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(d:Descriptor)
WITH a, d, collect(n) as articles
WHERE size(articles) > 1
RETURN a, d, articles

This returns rows featuring the author, the descriptor, and the collection of articles (more than one) by that author with the given descriptor.

EDIT

Looks like you want to filter for :Descriptors that have been mentioned more than once total, regardless of author, and not per the subgraph we're forming in the query.

In that case, it may be best to pre-match to these and filter, then collect, and use that collection for some set operations as we expand out the subgraph.

MATCH (d:Descriptor)
WHERE size((d)<-[:HAS_DESCRIPTOR]-()) > 1
WITH collect(d) as descriptors
MATCH (a:Author)
WHERE size((a)<-[:HAS_AUTHOR]-()) > 1
MATCH (a)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(d)
WHERE d in descriptors
RETURN a, n, d
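
If the counts need to hold within the matched subgraph rather than across the whole database, one possible approach is to aggregate inside the pattern and carry the surviving nodes forward as collections. The following is a sketch under that assumption, not something tested against your data. The aliases authorArticles, descriptorArticles, authors and descriptors are names introduced here for illustration.

```cypher
// Step 1: authors with more than one article that carries at least one descriptor.
MATCH (a:Author)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(:Descriptor)
WITH a, count(DISTINCT n) AS authorArticles
WHERE authorArticles > 1
WITH collect(a) AS authors

// Step 2: descriptors linked to more than one article written by those authors.
MATCH (a:Author)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(d:Descriptor)
WHERE a IN authors
WITH authors, d, count(DISTINCT n) AS descriptorArticles
WHERE descriptorArticles > 1
WITH authors, collect(d) AS descriptors

// Step 3: expand the subgraph restricted to the surviving authors and descriptors.
MATCH (a:Author)<-[:HAS_AUTHOR]-(n:Article)-[:HAS_DESCRIPTOR]->(d:Descriptor)
WHERE a IN authors AND d IN descriptors
RETURN a, n, d
LIMIT 100
```

This filters once in each direction. Removing descriptors in step 2 can leave an author with a single remaining article, so a strict guarantee would need the two filters repeated until nothing changes. LIMIT 100 truncates rows as well, which can drop a node's visible links below two in the rendered graph. Any extra condition, such as a descriptor type filter, would have to go in all three MATCH clauses so the counts are taken over the same subgraph.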