1 vote

Updated question:

I'm new to Neo4j and I'm trying to create a movie rating database. I have a file with 100,000 records (2.3 MB), each representing a user rating a movie; the file looks like this (the dataset is from MovieLens):

[screenshot: dataset file]

I'm using py2neo, and I create the database with this code:

import pandas as pd
from datetime import datetime
from py2neo import Graph

# Connection details are placeholders; adjust the URI and credentials as needed
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

data = pd.read_csv('ratings_small.csv')

def create_bipartite_graph(data):
    # Index creation so the MERGE lookups can use an index
    graph.run('''
        CREATE INDEX user_index IF NOT EXISTS FOR (u:User) ON (u.UserId)
    ''')
    graph.run('''
        CREATE INDEX movie_index IF NOT EXISTS FOR (m:Movie) ON (m.MovieId)
    ''')
    for index, row in data.iterrows():
        userID = int(row['userId'])
        movieID = int(row['movieId'])
        rating = row['rating']
        date = datetime.fromtimestamp(row['timestamp']).strftime("%m/%d/%Y, %H:%M:%S")

        # Create the nodes and the rating relationship (one query per row)
        graph.run('''
            MERGE (u:User {UserId: $uID})
            MERGE (m:Movie {MovieId: $mID})
            CREATE (u)-[:RATED_MOVIE {rating: $r, date: $d}]->(m)
        ''', parameters={'uID': userID, 'mID': movieID, 'r': rating, 'd': date})

The problem is that, even with that small dataset, it takes more than 2 hours to create the graph. Any advice on how to decrease the database creation time considerably?

Using LOAD CSV:

This is the query that I execute in the Neo4j Browser:

EXPLAIN LOAD CSV WITH HEADERS FROM 'file:///ratings_small.csv' AS line FIELDTERMINATOR ','
MERGE (m:Movie {MovieId: toInteger(line.movieId)})
MERGE (u:User {UserId: toInteger(line.userId)})
CREATE (u)-[:RATED_MOVIE {rating: toFloat(line.rating)}]->(m)

And this is the profile plan: [screenshot: profile plan]


2 Answers

1 vote

Every graph.run call represents not only a full round trip to the server, but also a completely separate transaction. Working like this, with one such call inside every iteration of the loop, is therefore incredibly inefficient.

I suggest instead taking a look at the bulk data operations API: https://py2neo.readthedocs.io/en/stable/bulk/index.html
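As a rough illustration of that approach, here is a minimal sketch for this dataset using py2neo's bulk functions (merge_nodes and create_relationships). It assumes the ratings are already in a pandas DataFrame as in your code; the connection URI, credentials, and batch size are placeholders, not values from the question:

import pandas as pd
from datetime import datetime
from py2neo import Graph
from py2neo.bulk import merge_nodes, create_relationships

# Placeholder connection details
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
data = pd.read_csv('ratings_small.csv')

# Merge all users and all movies in two batched calls instead of one query per row
user_rows = [[int(uid)] for uid in data['userId'].unique()]
movie_rows = [[int(mid)] for mid in data['movieId'].unique()]
merge_nodes(graph.auto(), user_rows, ("User", "UserId"), keys=["UserId"])
merge_nodes(graph.auto(), movie_rows, ("Movie", "MovieId"), keys=["MovieId"])

# Build (start, properties, end) triples, then create relationships in chunks
rel_rows = [
    (
        int(r.userId),
        {
            "rating": float(r.rating),
            "date": datetime.fromtimestamp(r.timestamp).strftime("%m/%d/%Y, %H:%M:%S"),
        },
        int(r.movieId),
    )
    for r in data.itertuples(index=False)
]
batch_size = 10000
for i in range(0, len(rel_rows), batch_size):
    create_relationships(
        graph.auto(),
        rel_rows[i:i + batch_size],
        "RATED_MOVIE",
        start_node_key=("User", "UserId"),
        end_node_key=("Movie", "MovieId"),
    )

Each bulk call sends one batched query, so the 100,000 ratings go through in a handful of round trips instead of 100,000 separate transactions.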

0 votes

You are starting with a CSV file, right? You could use LOAD CSV to bring it into Neo4j. Use your Python code to put the CSV into the Neo4j import directory, then run this query:

LOAD CSV WITH HEADERS FROM 'file:///ratings_small.csv' AS line FIELDTERMINATOR ','
MERGE (m:Movie {MovieId: toInteger(line.movieId)})
MERGE (u:User {UserId: toInteger(line.userId)})
CREATE (u)-[:RATED_MOVIE {rating: toFloat(line.rating)}]->(m)

You can add this at the start of the query if you want to commit in batches of 5,000 rows:

USING PERIODIC COMMIT 5000 ...

This should run much faster!
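For the step of copying the CSV into the Neo4j import directory, a minimal Python sketch; the import path below is an assumption and depends on how Neo4j is installed:

import shutil

# The import directory is installation-specific; this path is a placeholder
neo4j_import_dir = "/var/lib/neo4j/import"
shutil.copy("ratings_small.csv", neo4j_import_dir)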