
I have a very large e-commerce order dataset (including product details). I have just started exploring Neo4j and want to load the data into a graph database so that I can compute product relationships and patterns with graph algorithms. My CSV file has the following fields:

CUSTOMER_UNIQUE_ID (Customer Code)
ORDER_ID (Order Code)
ORDER_DATE (Order date)
CLIENT_TYPE (Ordered via Mobile / App / Desktop)
PARENT_SKU (Product ID)
LEV1 (Category Level 1)
LEV2 (Category Level 2)
LEV3 (Category Level 3)

To load the data I am using the following Cypher code:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "FILE:///E:/Data/2015/Nov/MBA/order_item_MBA.csv" AS line
MERGE (product:Product {parent_sku: line.PARENT_SKU})
ON CREATE SET product.parent_sku = line.PARENT_SKU,
              product.lev1 = line.LEV1,
              product.lev2 = line.LEV2,
              product.lev3 = line.LEV3

It takes 13 minutes just to run the script above on 50K records (a 5 MB file). Am I going wrong somewhere? I am planning to load around 30M records, which comes to approximately 20M+ nodes and 100M+ edges. I want to build a product-customer graph, with edges based on the products each customer bought.

Please don't re-set the parent_sku in `ON CREATE`. - Michael Hunger
In general, for your use case I would turn the categories into a proper category tree instead. - Michael Hunger
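
One possible reading of the category-tree suggestion, as a rough sketch only: each category level becomes its own node, linked to its parent, and the product points at its most specific category. The `Category` label, the `name` and `level` properties, and the `CHILD_OF` and `IN_CATEGORY` relationship types are hypothetical names chosen for illustration; they do not come from the comment. The sketch also assumes every row has all three levels filled in, since MERGE fails on a null property value.

```cypher
// Hypothetical schema: (:Product)-[:IN_CATEGORY]->(:Category)-[:CHILD_OF]->(:Category)
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "FILE:///E:/Data/2015/Nov/MBA/order_item_MBA.csv" AS line
// Top-level category, identified by its name and level
MERGE (c1:Category {name: line.LEV1, level: 1})
// Lower levels are merged relative to their parent, so the same name
// under two different parents yields two distinct nodes
MERGE (c1)<-[:CHILD_OF]-(c2:Category {name: line.LEV2, level: 2})
MERGE (c2)<-[:CHILD_OF]-(c3:Category {name: line.LEV3, level: 3})
// The product keeps only its key and links to the deepest category
MERGE (p:Product {parent_sku: line.PARENT_SKU})
MERGE (p)-[:IN_CATEGORY]->(c3);
```

With a layout like this, the category values are stored once per category rather than once per product, and products in the same category can be reached by traversal.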

1 Answer


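If you MERGE on a node, you should have an index on the property you merge on; without one, every MERGE has to scan all nodes with that label to check whether a match already exists (http://neo4j.com/docs/stable/query-schema-index.html):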

CREATE INDEX ON :Product(parent_sku)

Ideally, create a uniqueness constraint for this property on this label instead. This automatically adds a very fast index (http://neo4j.com/docs/stable/query-constraints.html):

CREATE CONSTRAINT ON (node:Product) ASSERT node.parent_sku IS UNIQUE

This should speed up your import a lot.
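
Put together with the comment about not re-setting the key in `ON CREATE`, the import might look roughly like the sketch below. It reuses the file path and property names from the question and assumes the same Neo4j version and syntax used there; newer Neo4j releases use a different constraint syntax. The constraint has to be created as its own statement before the import runs.

```cypher
// Step 1: run once, on its own, before importing.
// The uniqueness constraint also creates the index that MERGE will use.
CREATE CONSTRAINT ON (node:Product) ASSERT node.parent_sku IS UNIQUE;

// Step 2: the import from the question, merging only on the key.
// parent_sku is already set by the MERGE pattern, so ON CREATE
// only fills in the remaining properties.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "FILE:///E:/Data/2015/Nov/MBA/order_item_MBA.csv" AS line
MERGE (product:Product {parent_sku: line.PARENT_SKU})
ON CREATE SET product.lev1 = line.LEV1,
              product.lev2 = line.LEV2,
              product.lev3 = line.LEV3;
```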