I have the following sample data incoming in a CSV file:
Identifier Key,Name,Address,City,State,ZIP
WELD-424,Jane Doe,123 Main St,Whereverville,CA,90210
MOWN-175,John Doe,555 Broadway Ave,New York,NY,10010
The processor flow I came up with so far is:
- GetFile
- UpdateAttribute to set the avro.schema property with the schema text
- PutMongoRecord uses CSVReader to load the records into the database
What would the Avro schema look like for this? Here is my best guess (based on the two fields I care about):
{
  "type" : "record",
  "namespace" : "TheNameSpace",
  "name" : "MySchema",
  "fields" : [
    { "name" : "Identifier Key" , "type" : ["string"]},
    { "name" : "Name" , "type" : ["string", "null"]}
  ]
}
Specifying "Identifier Key" above gives an error because the name contains a space. Other fields, such as "Name", load fine.
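For context, the Avro specification requires names to start with a letter or underscore and to contain only letters, digits, and underscores after that, which is why the space is rejected. Below is a minimal Python sketch of that rule, outside NiFi. The helper names `is_valid_avro_name` and `sanitize` are hypothetical, and replacing illegal characters with underscores is just one possible convention, not something NiFi or Avro does for you.

```python
import re

# Avro naming rule: first character [A-Za-z_], then any of [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_avro_name(name: str) -> bool:
    """Return True if `name` is a legal Avro record or field name."""
    return AVRO_NAME.fullmatch(name) is not None


def sanitize(name: str) -> str:
    """Hypothetical helper: turn an arbitrary CSV header into a legal name.

    Assumption: illegal characters become underscores, and a leading
    underscore is added if the first character is not a letter or underscore.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = "_" + cleaned
    return cleaned


if __name__ == "__main__":
    for header in ["Identifier Key", "Name", "_id"]:
        print(header, is_valid_avro_name(header), sanitize(header))
```

With this check, "Name" and "_id" pass, while "Identifier Key" fails until the space is replaced.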
Some challenges I am facing:
- How do you rename fields? Does that need to be done in another processor block outside of the ConvertRecord processor and schema ecosystem? This seems like a common scenario because you will want fields to have the same name coming from many different sources.
- Avro does not like field names with spaces in them (so going from "Identifier Key" -> "_id" is going to be problematic).
- There does not seem to be a way to rename a field during a read and write operation. I thought the aliases capability would help (for example: going from "Name" -> "fullName").
- How do you make a single field (i.e. Identifier Key) all lowercase before importing to MongoDB?
I also tried using the ConvertRecord processor to convert the CSV to JSON first, so that it could be imported into MongoDB as JSON. The output would need to look something like the following (with the identifier key value all lowercase), but after ConvertRecord runs, the Identifier Key field comes out null:
{"_id": "weld-424", "fullName": "Jane Doe", "updated": {"$date":"2018-11-01T04:00:00.000Z"}, "created": {"$date":"2018-11-01T04:00:00.000Z"}}
{"_id": "mown-175", "fullName": "John Doe", "updated": {"$date":"2018-11-01T04:00:00.000Z"}, "created": {"$date":"2018-11-01T04:00:00.000Z"}}
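To make the target mapping concrete, here is a rough Python sketch of the transformation being asked for, done outside NiFi with only the standard library. It is not a NiFi solution. It assumes the CSV header is spelled "Identifier Key", the `RENAMES` table and the `to_documents` function are hypothetical names, and the timestamp is a placeholder copied from the example output, since where the updated and created values come from is not specified.

```python
import csv
import io
import json

# Sample input; header assumed to be spelled "Identifier Key".
SAMPLE = """Identifier Key,Name,Address,City,State,ZIP
WELD-424,Jane Doe,123 Main St,Whereverville,CA,90210
MOWN-175,John Doe,555 Broadway Ave,New York,NY,10010
"""

# Hypothetical mapping from CSV header to target field name.
RENAMES = {"Identifier Key": "_id", "Name": "fullName"}


def to_documents(csv_text, timestamp="2018-11-01T04:00:00.000Z"):
    """Rename the two fields of interest and lowercase the identifier."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep only the mapped columns, under their new names.
        doc = {target: row[source] for source, target in RENAMES.items()}
        doc["_id"] = doc["_id"].lower()
        # Placeholder timestamps in MongoDB extended JSON form.
        doc["updated"] = {"$date": timestamp}
        doc["created"] = {"$date": timestamp}
        docs.append(doc)
    return docs


if __name__ == "__main__":
    for doc in to_documents(SAMPLE):
        print(json.dumps(doc))
```

The question is how to get the equivalent rename and lowercase steps inside the NiFi record readers and writers.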