I am trying to use neo4j-admin import to populate a Neo4j database from CSV input data. According to the documentation, escaping quotation marks with \" is not supported, but my input contains these and other formatting anomalies. As a result, neo4j-admin import fails on the input CSV:

> neo4j-admin import --mode=csv --id-type=INTEGER \
>    --high-io=true \
>    --ignore-missing-nodes=true \
>    --ignore-duplicate-nodes=true \
>    --nodes:user="import/headers_users.csv,import/users.csv"
Neo4j version: 3.5.11
Importing the contents of these files into /data/databases/graph.db:
Nodes:
  :user
  /var/lib/neo4j/import/headers_users.csv
  /var/lib/neo4j/import/users.csv

Available resources:
  Total machine memory: 15.58 GB
  Free machine memory: 598.36 MB
  Max heap memory : 17.78 GB
  Processors: 8
  Configured max memory: -2120992358.00 B
  High-IO: true


IMPORT FAILED in 97ms. 
Data statistics is not available.
Peak memory usage: 0.00 B
Error in input data
Caused by:ERROR in input
  data source: BufferedCharSeeker[source:/var/lib/neo4j/import/users.csv, position:91935, line:866]
  in field: company:string:3
  for header: [user_id:ID(user), login:string, company:string, created_at:string, type:string, fake:string, deleted:string, long:string, lat:string, country_code:string, state:string, city:string, location:string]
  raw field value: yyeshua
  original error: At /var/lib/neo4j/import/users.csv @ position 91935 -  there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Universidad Pedagógica Nacional \"F'

My question is whether it is possible to skip or ignore poorly formatted rows of the CSV file for which neo4j-admin import throws an error. No such option seems to be available in the docs. I understand that solutions exist using LOAD CSV and that CSVs ought to be preprocessed prior to import. Note that I am able to import the CSV successfully when I fix the formatting issues.

Why would you want to ignore poorly formatted CSV data? The Neo4j DB would be incomplete. - cybersam
In my use case, I can tolerate the omission of a small subset of the nodes and/or relationships. - sboysel
You need to double a " to escape it in a CSV. Can you repair your CSV file by replacing all \" with ""? - logisima
Yes, I can replace \" with "", but my question is whether I can simply skip rows containing \" when using neo4j-admin import. - sboysel

1 Answer

Perhaps it's worth describing the differences between the bulk importer and LOAD CSV.

LOAD CSV does a transactional load of your data into the database - this means you get all of the ACID goodness, etc. The side effect of this is that it's not the fastest way to load data.

The bulk importer assumes that the data is in a database-ready format: that you've dealt with duplicates, done any processing needed to get it into the right form, and so on. It will just pull the data as is and write it into the database as specified. This is not a transactional load of the data, and because it assumes the data being loaded is already 'database ready', it is ingested very quickly indeed.

There are other options for importing data, but generally, if you need to do some sort of row skipping or correction on import, you don't really want to be doing it via the offline bulk importer. I would suggest you either do some form of pre-processing on your CSV prior to using neo4j-admin import, or look at one of the other import options available where you can dictate how to handle any poorly formatted rows.
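
As a rough illustration of the pre-processing route, here is a minimal Python sketch that drops rows a strict CSV parser rejects before the file is handed to neo4j-admin import. It rests on a few assumptions: the function name `filter_rows` is made up for this example, every record sits on a single line (no multi-line fields), and Python's `csv` module in strict mode is only an approximation of the importer's own parser, so it may not agree with it on every edge case.

```python
import csv


def filter_rows(lines, expected_fields):
    """Split single-line CSV records into (kept, rejected) lists.

    Hypothetical helper: a record is kept only if Python's strict CSV
    parser accepts it and it has exactly `expected_fields` columns.
    """
    kept, rejected = [], []
    for line in lines:
        try:
            # strict=True raises csv.Error on input such as "abc \"F",
            # where characters follow a closing quote.
            rows = list(csv.reader([line], strict=True))
        except csv.Error:
            rejected.append(line)
            continue
        if len(rows) == 1 and len(rows[0]) == expected_fields:
            kept.append(line)
        else:
            rejected.append(line)
    return kept, rejected
```

To use it, you would read users.csv line by line, call `filter_rows(lines, 13)` (13 being the number of columns in the header shown in the question), write the kept lines to a new file, and point `--nodes` at that file instead. Writing the rejected lines to a separate file lets you check how much data is being dropped.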