1
votes

I am working on a project related to flight dataset. I have a dataframe of below format: It has flight number, carrier name, origin, destination, carrier delay, weather delay, nas delay, security delay and aircraft delay details in minutes.

FL_NUM  CARRIER ORIGIN  DEST    carr_del    weather_del   nas_del   sec_del   aircraft_del

   1     AA      JFK    LAX        0             0            0         0     0
   1     AS      DCA    SEA        0             0            0         0     0
   1     B6      JFK    FLL        12            0            12        0     0
   1     HA      LAX    HNL        405           0            5         0     0
   1     VX      SFO    DCA        24            20           50        0     0
   1     WN      ATL    MDW         0             0            0        0     0
   1     WN      DAL    HOU         27            0            0        0     0

I have formed the relationship as below in Neo4j using cypher queries:

   MERGE (origin:origin_airport {name: row.ORIGIN})
   MERGE (destination:dest_airport {name: row.DEST})
   MERGE (carrier:Carrier {name: row.UNIQUE_CARRIER})
   MERGE (flight:Flight {name: row.FL_NUM})
   MERGE (flight)-[:from {flnum: row.FL_NUM}]->(origin)
   MERGE (flight)-[:to {flnum: row.FL_NUM}]->(destination)
   MERGE (flight)-[:operated_by {carrier: row.UNIQUE_CARRIER}]->(carrier)
   MERGE (origin)-[r:delayed_by]->(destination)
   SET  r.carr_delay=row.carr_delay, r.weather_delay=row.weather_delay, 
   r.nas_delay=row.nas_delay, r.sec_delay=row.sec_delay,
   r.aircraft_delay=row.aircraft_delay
   MERGE (flight)-[r1:delayed_by]->(origin)
   SET  r1.carr_delay=row.carr_delay, r1.weather_delay=row.weather_delay, 
   r1.nas_delay=row.nas_delay, r1.sec_delay=row.sec_delay,
   r1.aircraft_delay=row.aircraft_delay
   ")

The relations are:

1) Flight number linked to origin airport(ORIGIN)
2) Flight number linked to destination airport(DEST)
3) Flight number linked to Unique carrier
4) Origin airport linked by delay to destination airport.
Delay parameter holds the value of carrier delay, weather delay, nas delay,   
security and late aircraft delay
5) Flight linked by delay to origin airport
Here again, delay parameter holds the value of carrier delay, weather delay,  
nas delay, security and late aircraft delay

Here I am looking to answer the question Top 10 carriers - leading type of delay.

I am using the below code to get top 10 carriers with respect to flight.

MATCH (f:Flight)-[:operated_by]->(c:Carrier)
WITH c, COUNT(f) AS flights
RETURN c.name,flights
ORDER BY flights DESC
LIMIT 10

I need to take it to the next step and calculate maximum delays associated with each carrier. Here, I have the delay values specified in minutes, my query needs to calculate which delay is having higher value and return the name of the delay for that particular carrier.

From the example, if you could notice for HA, carr_del has higher value and hence the output should be like:

  Carrier        Cause of delay
   HA                Carrier delay
   VX                nas delay

Is it possible to achieve using cypher queries in Neo4j? Or should I need to change the structure of relationship?.

If the above result is complex, is it possible to get top carriers with respect to any specific delay atleast, like say carrier delay?. Here carrier delay has the value for all the carriers, and it should return carriers based on top values. I know it starts somewhat like below, but not sure how to end.

    MATCH (c)<-[:operated_by]-(:Flight)-[r1:DELAYED_BY]

Could someone help me?

1

1 Answers

1
votes

1) I think that you have an error in the model (you keep redundant data, and lose the flight information, perform a specific carrier.) It should be something like this:

enter image description here

MERGE (carrier:Carrier {name: row.UNIQUE_CARRIER})
MERGE (flight:Flight {name: row.FL_NUM})
MERGE (destination:Airport {name: row.DEST})
MERGE (origin:Airport {name: row.ORIGIN})
MERGE (origin)-[:from]->(flight)-[:to]->(destination)
MERGE (flight)-[:flight_details]->
// Stores information about the flight, perform a specific carrier
      (:FlightByCarrierDetails {
        name: 'Detail of ' + flight.name + ' by ' + carrier.name, 
        carr_del: row.carr_del, weather_del: row.weather_del, 
        nas_del: row.nas_del, sec_del: row.sec_del, aircraft_del: row.aircraft_del})
      -[:operated_by]->(carrier)

2) Then your first query is:

MATCH (f:Flight)
      -[:flight_details]->(:FlightByCarrierDetails)
      -[:operated_by]->(c:Carrier)
RETURN c.name as `Carrier name`, COUNT(f) AS flights
ORDER BY flights DESC LIMIT 10

3) And search frequent reason for the delay is:

MATCH (f:Flight)
      -[:flight_details]->(d:FlightByCarrierDetails)
      -[:operated_by]-(c:Carrier)
WITH c,
     // reasons of delay
     {carr: SUM(d.carr_del), weather: SUM(d.weather_del), 
      nas: SUM(d.nas_del), sec: SUM(d.sec_del), 
      aircraft: SUM(d.aircraft_del)} as rD
UNWIND [rD.carr, rD.weather, rD.nas, rD.sec, rD.aircraft] as delay
WITH c, rD, max(delay) as mD
RETURN c.name as `Carrier name`,  
       REDUCE ( acc=0, r in keys(rD) | acc + rD[r] ) as `Total delay`,
       FILTER(r in keys(rD) WHERE rD[r]>=mD) as `Cause of delay`
ORDER BY `Total delay` DESC