I am using Neo4j with py2neo to analyze Twitter data. I'm new to all of these, so the question might be pretty basic, but I could not find the answer in any of the documentation. I have two CSV files, one with 100 followers and the other with about 22,000 tweets. For each tweet I have information such as whether it is a reply to another tweet and which other users are mentioned in it.

I want to add followers and tweets as nodes, then use the reply_to and mentions_user fields of the tweets to add relationships between tweets (reply_to) and between tweets and users (mentions).

Adding the nodes works well with batch. However, when I iterate through all tweets using py2neo to add the relationships, I get OutOfMemoryError: Java heap space.

I'm trying to iterate through the tweets like this:

for tweet in graph.find("Tweet"):
    ...  # add the reply_to / mentions relationships for this tweet

My questions are: a) Is there another way in py2neo to iterate through (a lot of) nodes? b) A little broader: I read in the py2neo documentation that it is better to use Cypher transactions than batch. Should I do that, and could it also help with a)?

Thanks in advance for any help! KMM

1 Answer

There are certainly ways to load bulk data efficiently, but this particular method (finding all nodes with a particular label) does not take advantage of the graph structure of the database and therefore won't scale well.
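One way to avoid pulling every Tweet node into the client is to drive the relationship creation from the CSV rows instead and send them to the server in small batches. The following is only a rough sketch under assumptions: the `id` column, the `id` node property, the `REPLY_TO` relationship type, the batch size and the sample data are all made up for illustration, and the `$rows` parameter syntax assumes a reasonably recent Neo4j (older versions used `{rows}`).

```python
import csv
import io

# Hypothetical names: the "id" column/property and the REPLY_TO type are
# assumptions; only the "Tweet" label and "reply_to" field come from the question.
REPLY_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (t:Tweet {id: row.id}) "
    "MATCH (parent:Tweet {id: row.reply_to}) "
    "MERGE (t)-[:REPLY_TO]->(parent)"
)


def reply_batches(csv_file, batch_size=500):
    """Yield lists of {'id', 'reply_to'} dicts, one list per batch."""
    batch = []
    for row in csv.DictReader(csv_file):
        if not row.get("reply_to"):
            continue  # not a reply, so there is nothing to link
        batch.append({"id": row["id"], "reply_to": row["reply_to"]})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Tiny in-memory stand-in for the real tweets CSV file.
sample = io.StringIO("id,reply_to,mentions_user\n1,,\n2,1,\n3,1,alice\n4,2,\n")
batches = list(reply_batches(sample, batch_size=2))

# Each batch would then be sent as one statement, for example with
# graph.run(REPLY_QUERY, rows=batch) in py2neo v3 or later (the call
# differs between py2neo versions, so check the one you have installed).
```

With this approach only one batch of rows is held in memory at a time, and the matching happens on the server. The values read from the CSV are strings, so they only match if the node property was stored as a string too.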

If this is a one-off, you can of course increase the Java heap size and you may get away with it. But your best bet is probably to look into the LOAD CSV operation: http://neo4j.com/docs/stable/query-load-csv.html
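As a rough idea of what the LOAD CSV route could look like for the reply_to relationships, here is a minimal sketch that only builds the Cypher statement. The file URL, the `id` column and property, and the `REPLY_TO` relationship type are assumptions for illustration; adapt them to the actual CSV headers and data model.

```python
def load_csv_reply_query(csv_url):
    """Build a LOAD CSV statement linking each tweet to the tweet it replies to."""
    # Hypothetical names: row.id, the id property and REPLY_TO are assumptions.
    return (
        "LOAD CSV WITH HEADERS FROM '" + csv_url + "' AS row\n"
        "WITH row WHERE row.reply_to IS NOT NULL\n"
        "MATCH (t:Tweet {id: row.id})\n"
        "MATCH (parent:Tweet {id: row.reply_to})\n"
        "MERGE (t)-[:REPLY_TO]->(parent)"
    )


# Assumed location: a file:// URL that the Neo4j server itself can read.
query = load_csv_reply_query("file:///tweets.csv")
print(query)

# The statement can be pasted into the Neo4j browser, or sent from py2neo
# (for example graph.run(query) in v3 or later; older releases use a
# different call, so check the installed version).
```

Because the server reads the file itself, nothing has to be iterated over in Python. An index on the property used in the MATCH clauses should keep the lookups fast.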