
After some experimenting with MarkLogic 8 and MarkLogic Content Pump (mlcp), I'm running into an issue importing data into a MarkLogic database. I'm trying to run an mlcp import operation to load data from a set of CSV files with input settings like this:

-input_file_path content/csv/  
-input_file_pattern ".*\.csv"  
-input_file_type delimited_text

Furthermore, I'm trying out some additional settings to customize the import to my needs. One setting I experimented with is -transform_module, which applies a custom JavaScript-based transform module to do some additional transformation during load, like this:

-transform_module /transform/customTransform.sjs
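The module follows the standard mlcp JavaScript transform signature: a function that receives each document plus its metadata and returns the (possibly modified) content. Stripped down to a skeleton (my actual transformation logic is omitted), it looks like this:

// /transform/customTransform.sjs
// mlcp calls this once per document; content.value holds the document,
// context carries insert metadata such as collections and permissions.
function transform(content, context) {
  // ... custom transformation of content.value goes here ...
  return content;
}
exports.transform = transform;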

When I run the mlcp import command with these settings, the documents are loaded correctly by mlcp and the transform is executed as expected.

Another setting I tried out is -filename_as_collection, which assigns each imported document to a collection named after the file the document originated from. I ran some tests and verified that the collections were assigned correctly with this setting.
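To verify, I ran a quick check in Query Console along these lines (this assumes the URI lexicon is enabled on the database):

'use strict';
// List a few loaded URIs together with the collections assigned to each
const result = [];
for (const uri of fn.subsequence(cts.uris(), 1, 5)) {
  result.push({
    uri: uri,
    collections: xdmp.documentGetCollections(uri).toArray()
  });
}
result;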

So -transform_module and -filename_as_collection each work as expected on their own, but when I try to apply both in one import operation, I get the following error message in the command window:

15/03/25 11:01:51 ERROR contentpump.MultithreadedMapper: com.marklogic.contentpump.ContentWithFileNameWritable cannot be cast to org.apache.hadoop.io.Text
java.lang.ClassCastException: com.marklogic.contentpump.ContentWithFileNameWritable cannot be cast to org.apache.hadoop.io.Text
at com.marklogic.contentpump.utilities.TransformHelper.getTransformInsertQry(TransformHelper.java:163)
at com.marklogic.contentpump.TransformWriter.write(TransformWriter.java:97)
at com.marklogic.contentpump.TransformWriter.write(TransformWriter.java:46)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106)
at com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:46)
at com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:32)
at com.marklogic.contentpump.BaseMapper.runThreadSafe(BaseMapper.java:51)
at com.marklogic.contentpump.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:376)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Here's the full command I'm executing:
mlcp import -input_file_path content/csv/ -input_file_pattern ".*\.csv" -input_file_type delimited_text -delimiter ";" -delimited_root_name rootname -namespace http://marklogic.com/somenamespace -transform_module /transform/customTransform.sjs -filename_as_collection

I'm running MarkLogic 8.0-1.1 developer edition and mlcp 1.3-1, both on a single Windows 8.1 machine.


1 Answer


This sounds like a bug in mlcp. Customers who run into it should contact MarkLogic Support.
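Until it's fixed, one possible workaround is to drop -filename_as_collection and assign the collection from inside the transform instead, since an mlcp transform may modify context.collections. A sketch, assuming -generate_uri true is added so that each document URI embeds the source file path (the regex below is illustrative, not tested against your URIs):

// customTransform.sjs -- fold the collection assignment into the transform.
// Assumes URIs like /content/csv/data.csv-0-1 produced by -generate_uri true.
function transform(content, context) {
  var match = /([^\/\\]+\.csv)/i.exec(content.uri);
  if (match) {
    // Name the collection after the originating file,
    // as -filename_as_collection would have done.
    context.collections = [match[1]];
  }
  return content;
}
exports.transform = transform;

You would merge this collection logic into your existing transform rather than replacing it, so both the custom transformation and the collection assignment happen in a single pass.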