
I have two servers: one runs Neo4j to store graph data, and the other runs an ETL job that loads data into Neo4j every minute.

My current solution uses a for loop to run one transaction per incoming item (based on py2neo), but it is very slow. I have also tried saving a temporary CSV file on the Neo4j server and loading it with Cypher's LOAD CSV, which improves performance a lot, but I don't know how to load a CSV from a remote server.

So what I want to know is whether there is a way to load a dict, list, or pandas DataFrame into Neo4j from a Python script as a batch import, the way LOAD CSV does. I am new to Neo4j; thanks very much for any help.


1 Answer


If you want to load a CSV from a remote server, you need to run SimpleHTTPServer or something similar that serves the file over HTTP. Then you can simply use

LOAD CSV FROM "http://192.x.x.x/myfile.csv" as row
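
As a rough sketch of the file-hosting step: SimpleHTTPServer is the Python 2 module name, and in Python 3 the equivalent lives in http.server. The helper name serve_directory, the default port 8000, and the example directory are assumptions made for illustration, not part of the original setup.

```python
import functools
import http.server
import threading


def serve_directory(directory, host="0.0.0.0", port=8000):
    """Serve the files in `directory` over HTTP from a background thread.

    Hypothetical helper: the name, host and port are illustrative assumptions.
    """
    # SimpleHTTPRequestHandler accepts a `directory` argument (Python 3.7+).
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # Daemon thread so the server does not keep the ETL process alive.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server  # call server.shutdown() to stop serving
```

On the ETL machine this might be used as serve_directory("/path/to/csv/dir"), after which the LOAD CSV statement points at that machine's address and port. The URL has to be reachable from the Neo4j server, since the database is what fetches the file.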

Alternatively, you can import your data from a pandas DataFrame. I have created a simple script that calculates a linear regression gradient and saves it back to Neo4j:

from neo4j.v1 import GraphDatabase  # import path of the 1.x Python driver
import pandas as pd
import numpy as np

driver = GraphDatabase.driver("bolt://192.168.x.x:7687", auth=("neo4j", "neo4j"))
session = driver.session()

def weekly_count_gradient(data):
    # data is the result of a session.run() call; build a DataFrame from its records
    df = pd.DataFrame([r.values() for r in data], columns=data.keys())
    df["week"] = df.start.apply(lambda x: pd.to_datetime(x).week if pd.notnull(x) else None)
    df["year"] = df.start.apply(lambda x: pd.to_datetime(x).year if pd.notnull(x) else None)
    # count the rows per company for each (week, year)
    group = df.groupby(["week", "year", "company"]).start.count().reset_index()
    for name in group["company"].unique():
        company = group[group["company"] == name]
        if company.shape[0] >= 5:  # only fit companies with at least 5 weekly counts
            # weeks from any year other than 2016 are offset by 52
            x = np.array([i[1] if i[0] == 2016 else i[1] + 52 for i in company[["year", "week"]].values])
            y = company["start"].values
            fit = np.polyfit(x, y, deg=1)  # fit[0] is the slope of the fitted line
            # pass the values as query parameters; float() turns the NumPy scalar into a plain Python float
            update = session.run(
                "MATCH (a:Company{code:{code}}) SET a.weekly_count_gradient = toFLOAT({gradient}) RETURN a.code",
                {"code": name, "gradient": float(fit[0])})

The key here is that you run a query with parameters, and the parameters can come from anywhere (a list, a dict, or a pandas DataFrame).
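
To get closer to the batch behaviour of LOAD CSV, one option is to pass a whole list of dicts as a single parameter and expand it with UNWIND, so each call writes many rows instead of one. The following is a minimal sketch under some assumptions: the names chunked, load_rows and UPSERT_QUERY and the batch size of 1000 are hypothetical, the Company label and property names are borrowed from the script above, and the query uses the $rows parameter syntax (older servers that expect the {code} style shown above would write {rows} instead). The session argument is a driver session like the one created earlier.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical query: one MERGE per row of the list passed in as `rows`.
UPSERT_QUERY = (
    "UNWIND $rows AS row "
    "MERGE (c:Company {code: row.code}) "
    "SET c.weekly_count_gradient = toFloat(row.gradient)"
)


def load_rows(session, rows, batch_size=1000):
    """Send `rows` (a list of dicts) to Neo4j in batches; return the row count sent."""
    sent = 0
    for batch in chunked(rows, batch_size):
        # The whole batch travels as a single query parameter.
        session.run(UPSERT_QUERY, {"rows": batch})
        sent += len(batch)
    return sent
```

A pandas DataFrame can be turned into such a list with df.to_dict("records"); depending on the versions involved, NumPy values may first need converting to plain Python types, as with float() in the script above.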