I am trying to ingest data into Druid from a Hive ORC-compressed table stored in HDFS. Any pointers on this would be very helpful.
1 Answer
Assuming you already have Druid and YARN/MapReduce set up, you can launch an index_hadoop task that will do what you ask.
There is a druid-orc-extensions module that lets Druid read ORC files. I don't think it comes with the standard release, so you'll have to get it somehow (we built it from source).
(extension list: http://druid.io/docs/latest/development/extensions.html)
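Once the built extension sits in your Druid extensions directory, loading it is the usual loadList entry in common.runtime.properties on the nodes that run indexing — a minimal sketch, assuming the default directory layout and extension name:

# common.runtime.properties on the nodes that run indexing tasks
druid.extensions.loadList=["druid-orc-extensions"]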
Here is an example that ingests a bunch of ORC files and appends an interval to a datasource; POST it to the Overlord at http://overlord:8090/druid/indexer/v1/task.
(doc: http://druid.io/docs/latest/ingestion/batch-ingestion.html)
You may have to adjust it depending on your distribution; I remember we had issues on Hortonworks with some classes not found (classpathPrefix helps adjust the MapReduce classpath — see the sketch just below).
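As a rough sketch of where classpathPrefix goes — per the batch-ingestion doc it sits at the top level of the task, next to "type" and "spec"; the path here is purely illustrative, point it at wherever your distribution's client jars live:

{
  "type": "index_hadoop",
  "classpathPrefix": "/usr/hdp/current/hadoop-client/*",
  "spec": { ... }
}

The full task spec: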
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat",
        "dataGranularity": "hour",
        "inputPath": "/apps/hive/warehouse/table1",
        "filePattern": ".*",
        "pathFormat": "'partition='yyyy-MM-dd'T'HH"
      }
    },
    "dataSchema": {
      "dataSource": "cube_indexed_from_orc",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "nano"
          },
          "dimensionsSpec": {
            "dimensions": ["cola", "colb", "colc"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        },
        "typeString": "struct<timestamp:bigint,cola:bigint,colb:string,colc:string,cold:bigint>"
      },
      "metricsSpec": [{
        "type": "count",
        "name": "count"
      }],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2017-06-14T00:00:00.000Z/2017-06-15T00:00:00.000Z"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "leaveIntermediate": false,
      "forceExtendableShardSpecs": "true"
    }
  }
}
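Assuming you save the spec as orc_task.json (the file name is arbitrary), submitting it is a plain HTTP POST:

curl -X POST -H 'Content-Type: application/json' \
  -d @orc_task.json \
  http://overlord:8090/druid/indexer/v1/task

The Overlord answers with a task id; you can watch progress in its console or poll /druid/indexer/v1/task/<id>/status.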