3
votes

I have 3 nodes created in graph database, origin airport and destination airport and carrier. They are related by a property named 'canceled_by'.

   MATCH (origin:origin_airport {name: row.ORIGIN}),
   (destination:dest_airport {name: row.DEST}),
   (carrier:Carrier {name: row.UNIQUE_CARRIER})
   CREATE (origin)-[:cancelled_by {cancellation: row.count}]->(carrier)
   CREATE (origin)-[:cancelled_by {cancellation: row.count}]->(destination)
   CREATE (origin)-[:operated_by {carrier: row.UNIQUE_CARRIER}]->(carrier)

cancelled_by holds the value number of times particular carrier is cancelled. My input file will be of below format:

ORIGIN  UNIQUE_CARRIER  DEST    Cancelled
 ABE    DL                ATL    1
 ABE    EV                ATL    1
 ABE    EV                DTW    3
 ABE    EV                ORD    3
 ABQ    DL                DFW    2
 ABQ    B6                JFK    2

Here I need to calculate the cancellation percentage of each carrier. I am expecting result as below:

UNIQUE_CARRIER  DEST    Percentage_Cancelled
    DL                   25%
    EV                   58.33%
    B6                   16.66%

Example: Total number of cancellation = 12
No of cancellation for DL = 3
Percentage of cancellation for DL = (3/12)*100 = 25%

Below query gives the sum of cancellation for each carrier:

 MATCH ()-[ca:cancelled_by]->(c:Carrier)
 RETURN c.name  AS Carrier, 
 SUM(toFloat(ca.cancellation)) As sum
 ORDER BY sum DESC
 LIMIT 10

I tried the below query for calculating percentage:

  MATCH ()-[ca:cancelled_by]->(c:Carrier)
  WITH SUM(toFloat(ca.cancellation)) As total
  MATCH ()-[ca:cancelled_by]->(c:Carrier)
  RETURN c.name  AS Carrier, 
  (toFloat(ca.cancellation)/total)*100 AS percent
  ORDER BY percent DESC
  LIMIT 10

But it is not calculating percentage by grouping and instead calculating the percentage individually.

     Carrier    sum
        DL    0.36862408915559364
        DL    0.34290612944706383
        DL    0.3171881697385341

How to calculate percentage based on group_by using cypher queries in Neo4j?

2
yup. will be signing up. In the mean time, trying some queries on my own to get used to it. You have any suggestion for above issue?Sriram Chandramouli

2 Answers

6
votes

You forgot the summation by every carrier when grouping, and not necessarily always use a cast to float - just when the last calculation multiplied by the floating-point number.

MATCH ()-[ca:cancelled_by]->(:Carrier)
  WITH SUM(ca.cancellation) As total
  MATCH ()-[ca:cancelled_by]->(c:Carrier)
RETURN c.name AS Carrier, 
       100.0 * SUM(ca.cancellation) / total AS percent
  ORDER BY percent DESC
  LIMIT 10
0
votes

Hi You could try using the R dplyr package. Use the chaining operation %>% along with the functions group_by, summarize and transmute. The group_byand summarize will give you the sum of cancelled within each group. Use the transmute function to get relative frequencies.