
Several extract jobs in Datameer (a rapid ETL/BI tool that sits on top of Hadoop) read data out of Salesforce objects. The largest extract is 1.4 GB (Task object) and the smallest is 96 MB (Account object). Datameer uses a REST API based connector: a SOQL query is supplied to the connector and records are fetched accordingly (https://documentation.datameer.com/documentation/display/DAS60/Salesforce).

Datameer compiles the job and hands execution over to the execution framework (Tez). There are no job-specific configurations.

All the Salesforce extract jobs run with 1 map task.

However, other extract jobs in Datameer that read data from flat files (50-200 MB) on an SFTP server use between 3 and 5 map tasks.

About SOQL batch size (https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_changing_batch_size.htm): SOQL returns a maximum of 2,000 records per batch.
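For context, the request the connector issues is roughly of the following shape. This is only a sketch: the instance URL, API version and access token are placeholders, and the Sforce-Query-Options header, which lets a client ask for at most the 2,000-record batch ceiling, comes from the Salesforce REST API documentation rather than from Datameer's connector code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch of a single SOQL query call over the REST API.
// instanceUrl, API version and accessToken are placeholders.
public class SoqlBatchExample {
    public static void main(String[] args) throws Exception {
        String instanceUrl = "https://yourInstance.salesforce.com";   // placeholder
        String accessToken = "00D...";                                // placeholder OAuth token
        String soql = URLEncoder.encode("SELECT Id, Subject FROM Task", "UTF-8");

        URL url = new URL(instanceUrl + "/services/data/v39.0/query/?q=" + soql);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Authorization", "Bearer " + accessToken);
        // Ask for the maximum batch size; Salesforce treats this as a hint
        // and never returns more than 2,000 records per batch.
        conn.setRequestProperty("Sforce-Query-Options", "batchSize=2000");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON with "records" and, if more remain, "nextRecordsUrl"
            }
        }
    }
}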

My questions:

  • Considering that the flat-file extracts run with multiple map tasks, does the issue correspond to the SOQL batch size, which only pulls 2,000 records per request, hence resulting in the allocation of only 1 mapper?
  • How does a MapReduce program determine the total size of the input extract when dealing with a source like Salesforce, or for that matter a cloud-based database? (See the sketch after this list.)
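Datameer's connector code is not public, so the following is only a conceptual sketch of why a web-service source tends to end up with a single mapper: a connector-style InputFormat has no file length or block metadata to divide, and the REST result set has to be read sequentially, so getSplits() typically returns exactly one split. The class and split names here are hypothetical.

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical web-service-backed InputFormat. There is no file/block metadata
// to divide and the REST result set must be paged through sequentially, so
// getSplits() returns exactly one split, hence exactly one map task.
public abstract class WebServiceInputFormat<K, V> extends InputFormat<K, V> {

    /** One logical split covering the whole SOQL result set. */
    public static class SoqlQuerySplit extends InputSplit {
        @Override
        public long getLength() { return 0; }                     // size unknown up front
        @Override
        public String[] getLocations() { return new String[0]; }  // no data locality
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        return Collections.<InputSplit>singletonList(new SoqlQuerySplit());
    }

    // createRecordReader() would page through the results with repeated REST
    // calls (max 2,000 records per batch) inside that single map task.
}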

Environment Information: Hortonworks 2.7.1

Cores per data node = 8

RAM per data node = 64 GB

Number of data nodes = 6

Block size = 128 MB

Input Split info:

mapreduce.input.fileinputformat.split.maxsize=5368709120 (5 GB)

mapreduce.input.fileinputformat.split.minsize=16777216 (16 MB)
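For file-based inputs (such as the SFTP flat-file extracts once they land on HDFS), Hadoop's FileInputFormat derives the split size by clamping the block size between the configured minimum and maximum split sizes; with the values above the effective split size stays at 128 MB. Below is a quick check of that rule using the settings from this cluster; the 1.4 GB figure is only used to show what would happen if the extract were an ordinary HDFS file.

// FileInputFormat's split-size rule:
//   splitSize = max(minSize, min(maxSize, blockSize))
public class SplitSizeCheck {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long minSize   = 16L * 1024 * 1024;        // mapreduce.input.fileinputformat.split.minsize = 16 MB
        long maxSize   = 5L * 1024 * 1024 * 1024;  // mapreduce.input.fileinputformat.split.maxsize = 5 GB
        long blockSize = 128L * 1024 * 1024;       // dfs.blocksize = 128 MB

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println("Effective split size: " + splitSize / (1024 * 1024) + " MB");  // 128 MB

        long fileLength = 1_400L * 1024 * 1024;    // a 1.4 GB extract, if it were an HDFS file
        long splits = (long) Math.ceil((double) fileLength / splitSize);
        System.out.println("Splits for a 1.4 GB file: " + splits);  // ~11
    }
}

A Salesforce extract never reaches this calculation, because there is no file length or block list to feed into it; the connector's single split bypasses the formula entirely.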

Execution Framework: Tez

Memory Sizes:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1536</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1228m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1638m</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx819m -Dhdp.version=${hdp.version}</value>
</property>

Compression is enabled:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

mapreduce.output.fileoutputformat.compress=true

mapreduce.output.fileoutputformat.compress.type=BLOCK

mapreduce.map.output.compress=true

mapred.map.output.compression.type=BLOCK

1 Answer


The issue was raised with Datameer support, who provided the following response.

Root Cause Analysis:

“There is a limitation of only 1 mapper in use. Primarily, it is used for a web service that doesn't benefit from creating more than one split. This could be because the service doesn't support splitting or the data is small enough that the job won't benefit from splitting.”

Background:

Datameer uses a Salesforce connector which in turn uses REST API calls that can fetch a maximum of 2,000 records in a single request. REST API calls are synchronous and must return their results within a 5-second limit.
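To make the single-mapper behaviour concrete: each REST query response carries at most one batch of records plus a nextRecordsUrl cursor, and that cursor can only be followed forward, one call after another. The sketch below shows that sequential walk; it is an illustration under the assumption of the standard REST query response shape (records, done, nextRecordsUrl), not Datameer's actual implementation, and the instance URL and token are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Walks the REST query cursor batch by batch. Because the cursor can only be
// followed forward, the whole extract is inherently one sequential stream,
// which matches the one-mapper behaviour described above.
public class SoqlCursorWalk {
    static String fetch(String instanceUrl, String path, String token) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(instanceUrl + path).openConnection();
        conn.setRequestProperty("Authorization", "Bearer " + token);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) body.append(line);
        }
        return body.toString();
    }

    /** Very naive cursor extraction for the sketch; a real client would use a JSON parser. */
    static String extractNextRecordsUrl(String json) {
        int key = json.indexOf("\"nextRecordsUrl\"");
        if (key < 0) return null;                       // "done": true, no more batches
        int start = json.indexOf('"', json.indexOf(':', key) + 1) + 1;
        int end = json.indexOf('"', start);
        return json.substring(start, end);
    }

    public static void main(String[] args) throws Exception {
        String instanceUrl = "https://yourInstance.salesforce.com";            // placeholder
        String token = "00D...";                                               // placeholder
        String next = "/services/data/v39.0/query/?q=SELECT+Id+FROM+Task";     // first batch

        while (next != null) {
            String json = fetch(instanceUrl, next, token);
            // ... parse and emit the "records" array here ...
            next = extractNextRecordsUrl(json);   // follow the cursor until "done"
        }
    }
}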