Several extract jobs in Datameer (a rapid ETL/BI tool that sits on top of Hadoop) read data out of Salesforce objects. The largest extract is 1.4 GB (Task object) and the smallest is 96 MB (Account object). Datameer uses a REST API based connector: a SOQL query is supplied to the connector and records are fetched accordingly (https://documentation.datameer.com/documentation/display/DAS60/Salesforce).
Datameer compiles the job and hands execution over to the execution framework (Tez). There are no job-specific configurations.
All the Salesforce extract jobs run with 1 map task.
However, other extract jobs in Datameer that read data from flat files (50-200 MB) on an SFTP server use between 3 and 5 map tasks.
About SOQL batch sizes (https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_changing_batch_size.htm): SOQL pulls a maximum of 2,000 records per batch.
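That 2,000-record cap matters because the Salesforce REST query endpoint pages through results with a single server-side cursor: each response carries done and nextRecordsUrl fields, and the client must follow them sequentially. Below is a minimal sketch of that pattern (the instance URL, session token, and SOQL text are placeholders; a real client would use a proper JSON parser):

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class SoqlPager {
    // Placeholder instance URL and session token -- not real values.
    private static final String INSTANCE = "https://na1.salesforce.com";
    private static final String TOKEN = "00D...SESSION_TOKEN";

    public static void main(String[] args) throws IOException {
        String soql = "SELECT Id, Subject, Status FROM Task"; // example query
        String next = "/services/data/v39.0/query/?q="
                + URLEncoder.encode(soql, "UTF-8");

        // Each response holds at most one server-side batch (<= 2000 records).
        // The reader follows nextRecordsUrl sequentially: one cursor, so there
        // is no natural way to hand disjoint ranges to parallel mappers.
        while (next != null) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(INSTANCE + next).openConnection();
            conn.setRequestProperty("Authorization", "Bearer " + TOKEN);
            String body = readAll(conn.getInputStream());
            // A real client would JSON-parse 'body'; this sketch only shows
            // the control flow: stop when "done" is true, else follow the URL.
            next = extract(body, "nextRecordsUrl"); // null once "done":true
        }
    }

    private static String readAll(InputStream in) {
        try (Scanner s = new Scanner(in, StandardCharsets.UTF_8.name())
                .useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    // Naive field extraction to keep the sketch dependency-free.
    private static String extract(String json, String field) {
        int i = json.indexOf("\"" + field + "\"");
        if (i < 0) return null;
        int start = json.indexOf('"', json.indexOf(':', i) + 1) + 1;
        return json.substring(start, json.indexOf('"', start));
    }
}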
My questions:
- Considering that the flat-file extracts run with multiple map tasks, is the issue related to the SOQL batch size, which pulls only 2,000 records per request, thereby resulting in the allocation of only 1 mapper?
- How does a MapReduce program determine the total size of the input extract when dealing with a source like Salesforce, or, for that matter, any cloud-based database? (A sketch of the InputFormat mechanics follows this list.)
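On the second question: for file-based input, FileInputFormat.getSplits() computes splits from the known file lengths and the block size, but a REST source exposes no byte length, so the connector's own InputFormat decides the split count, and a connector wrapping a single query cursor can really only report one split. A minimal sketch of what such a single-split InputFormat looks like (class names are illustrative, not Datameer's actual implementation):

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative only: a source with no known total size and a single
// sequential cursor can only report one split, hence one map task.
public abstract class RestSourceInputFormat<K, V> extends InputFormat<K, V> {

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        // Unlike FileInputFormat, there is no file length to divide into
        // block-sized ranges, so the whole extract is one logical split.
        return Collections.singletonList((InputSplit) new WholeQuerySplit());
    }

    @Override
    public abstract RecordReader<K, V> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException;

    // Trivial split: unknown length, no locality hints. A production split
    // would also implement Writable so the framework can ship it to tasks.
    public static class WholeQuerySplit extends InputSplit {
        @Override public long getLength()        { return 0; }
        @Override public String[] getLocations() { return new String[0]; }
    }
}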
Environment information: Hortonworks 2.7.1
Cores per data node = 8
RAM per data node = 64 GB
Number of data nodes = 6
Block size: 128 MB
Input Split info:
mapreduce.input.fileinputformat.split.maxsize=5368709120 (5 GB)
mapreduce.input.fileinputformat.split.minsize=16777216 (16 MB)
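For comparison, these two settings only matter where Hadoop can see a byte length. A plain file is split with FileInputFormat's rule max(minSize, min(maxSize, blockSize)), as this small check illustrates (the 200 MB example file is hypothetical):

public class SplitSizeCheck {
    public static void main(String[] args) {
        // FileInputFormat's rule: max(minSize, min(maxSize, blockSize))
        long blockSize = 128L * 1024 * 1024;       // dfs block size, 128 MB
        long minSize   = 16L * 1024 * 1024;        // ...split.minsize (16 MB)
        long maxSize   = 5L * 1024 * 1024 * 1024;  // ...split.maxsize (5 GB)
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println(splitSize / (1024 * 1024) + " MB per split");
        // A hypothetical 200 MB HDFS file would yield ceil(200/128) = 2
        // splits. This math needs a known byte length, which a SOQL cursor
        // never exposes -- so these settings cannot help the Salesforce jobs.
    }
}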
Execution Framework: Tez
Memory sizes:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1228m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx819m -Dhdp.version=${hdp.version}</value>
</property>
Compression is enabled:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.type=BLOCK
mapreduce.map.output.compress=true
mapred.map.output.compression.type=BLOCK