3
votes

I have a delimited file as the input source for ingesting data into MarkLogic using Content Pump (mlcp) from Unix. No column in the file is unique throughout, so none can serve as the URI. Since duplicate URIs are not possible, records that share a URI are skipped or overwritten. The available options are: -delimited_uri_id *my_column_name*, -output_uri_prefix *my_prefix_string*, -output_uri_suffix *my_suffix_string*, -output_uri_replace pattern,'string'

The command for mlcp is:

bin/mlcp.sh import -host localhost -port 8042 -username name -password password -input_file_path hdfs://path/to/file -delimiter '|' -delimited_uri_id column_name -input_file_type delimited_text -mode distributed

The problem is that if I modify the above command to include:

-output_uri_prefix $(date +%s%N)

It takes the time at which the command was executed (in nanoseconds) and prefixes every URI with it. But that doesn't solve my problem, since the same value is repeated for every record. The same would happen with the other available options. What can be done to construct a unique URI for each record so that all records are ingested?


1 Answer

1
votes

One way or another it is up to you to provide unique ids. For a delimited file the easiest answer might be to add a new column and populate it with a unique id, generated however you like.
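As a rough sketch of the "add a new column" approach, the following shell function prepends an id column to a pipe-delimited file before it is handed to mlcp. The function name `add_row_id` and the column name `row_id` are made up for illustration. It assumes the first line of the file is a header row, that no field contains an embedded newline, and that the file can be preprocessed locally before being placed where mlcp reads it.

```shell
#!/bin/sh
# add_row_id (hypothetical helper): prepend a unique "row_id" column
# to pipe-delimited text read from stdin.
# Usage: add_row_id PREFIX < input.txt > output.txt
add_row_id() {
    awk -v prefix="$1" 'BEGIN { FS = OFS = "|" }
        # Header row: name the new column.
        NR == 1 { print "row_id", $0; next }
        # Data rows: the prefix plus the 1-based record number.
        { print prefix "-" (NR - 1), $0 }
    '
}
```

For example, `add_row_id "$(date +%s%N)" < input.txt > with_ids.txt` gives every record a different id within the run, and the timestamp prefix keeps ids from colliding across runs. The import can then use `-delimited_uri_id row_id` in place of the non-unique column.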

Or you could use the DelimitedDataLoader from RecordLoader (http://marklogic.github.io/recordloader/) with the special option ID_NAME=#AUTO. But keep in mind that ID_NAME=#AUTO makes ingestion single-threaded.