0
votes

When you create a new AWS EMR cluster through the AWS Management Console you're able to provide JSON Software Configurations. You can put the JSON file in an S3 bucket and point the Software Configurations to the S3 bucket via the following field,

enter image description here

I need to do this through the AWS Python SDK Boto3 library but I don't see where to do it at in the available fields in their example,

response = client.run_job_flow(
    Name='string',
    LogUri='string',
    AdditionalInfo='string',
    AmiVersion='string',
    ReleaseLabel='string',
    Instances={
        'MasterInstanceType': 'string',
        'SlaveInstanceType': 'string',
        'InstanceCount': 123,
        'InstanceGroups': [
            {
                'Name': 'string',
                'Market': 'ON_DEMAND'|'SPOT',
                'InstanceRole': 'MASTER'|'CORE'|'TASK',
                'BidPrice': 'string',
                'InstanceType': 'string',
                'InstanceCount': 123,
                'Configurations': [
                    {
                        'Classification': 'string',
                        'Configurations': {'... recursive ...'},
                        'Properties': {
                            'string': 'string'
                        }
                    },
                ],
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [
                        {
                            'VolumeSpecification': {
                                'VolumeType': 'string',
                                'Iops': 123,
                                'SizeInGB': 123
                            },
                            'VolumesPerInstance': 123
                        },
                    ],
                    'EbsOptimized': True|False
                },
                'AutoScalingPolicy': {
                    'Constraints': {
                        'MinCapacity': 123,
                        'MaxCapacity': 123
                    },
                    'Rules': [
                        {
                            'Name': 'string',
                            'Description': 'string',
                            'Action': {
                                'Market': 'ON_DEMAND'|'SPOT',
                                'SimpleScalingPolicyConfiguration': {
                                    'AdjustmentType': 'CHANGE_IN_CAPACITY'|'PERCENT_CHANGE_IN_CAPACITY'|'EXACT_CAPACITY',
                                    'ScalingAdjustment': 123,
                                    'CoolDown': 123
                                }
                            },
                            'Trigger': {
                                'CloudWatchAlarmDefinition': {
                                    'ComparisonOperator': 'GREATER_THAN_OR_EQUAL'|'GREATER_THAN'|'LESS_THAN'|'LESS_THAN_OR_EQUAL',
                                    'EvaluationPeriods': 123,
                                    'MetricName': 'string',
                                    'Namespace': 'string',
                                    'Period': 123,
                                    'Statistic': 'SAMPLE_COUNT'|'AVERAGE'|'SUM'|'MINIMUM'|'MAXIMUM',
                                    'Threshold': 123.0,
                                    'Unit': 'NONE'|'SECONDS'|'MICRO_SECONDS'|'MILLI_SECONDS'|'BYTES'|'KILO_BYTES'|'MEGA_BYTES'|'GIGA_BYTES'|'TERA_BYTES'|'BITS'|'KILO_BITS'|'MEGA_BITS'|'GIGA_BITS'|'TERA_BITS'|'PERCENT'|'COUNT'|'BYTES_PER_SECOND'|'KILO_BYTES_PER_SECOND'|'MEGA_BYTES_PER_SECOND'|'GIGA_BYTES_PER_SECOND'|'TERA_BYTES_PER_SECOND'|'BITS_PER_SECOND'|'KILO_BITS_PER_SECOND'|'MEGA_BITS_PER_SECOND'|'GIGA_BITS_PER_SECOND'|'TERA_BITS_PER_SECOND'|'COUNT_PER_SECOND',
                                    'Dimensions': [
                                        {
                                            'Key': 'string',
                                            'Value': 'string'
                                        },
                                    ]
                                }
                            }
                        },
                    ]
                }
            },
        ],
        'InstanceFleets': [
            {
                'Name': 'string',
                'InstanceFleetType': 'MASTER'|'CORE'|'TASK',
                'TargetOnDemandCapacity': 123,
                'TargetSpotCapacity': 123,
                'InstanceTypeConfigs': [
                    {
                        'InstanceType': 'string',
                        'WeightedCapacity': 123,
                        'BidPrice': 'string',
                        'BidPriceAsPercentageOfOnDemandPrice': 123.0,
                        'EbsConfiguration': {
                            'EbsBlockDeviceConfigs': [
                                {
                                    'VolumeSpecification': {
                                        'VolumeType': 'string',
                                        'Iops': 123,
                                        'SizeInGB': 123
                                    },
                                    'VolumesPerInstance': 123
                                },
                            ],
                            'EbsOptimized': True|False
                        },
                        'Configurations': [
                            {
                                'Classification': 'string',
                                'Configurations': {'... recursive ...'},
                                'Properties': {
                                    'string': 'string'
                                }
                            },
                        ]
                    },
                ],
                'LaunchSpecifications': {
                    'SpotSpecification': {
                        'TimeoutDurationMinutes': 123,
                        'TimeoutAction': 'SWITCH_TO_ON_DEMAND'|'TERMINATE_CLUSTER',
                        'BlockDurationMinutes': 123
                    }
                }
            },
        ],
        'Ec2KeyName': 'string',
        'Placement': {
            'AvailabilityZone': 'string',
            'AvailabilityZones': [
                'string',
            ]
        },
        'KeepJobFlowAliveWhenNoSteps': True|False,
        'TerminationProtected': True|False,
        'HadoopVersion': 'string',
        'Ec2SubnetId': 'string',
        'Ec2SubnetIds': [
            'string',
        ],
        'EmrManagedMasterSecurityGroup': 'string',
        'EmrManagedSlaveSecurityGroup': 'string',
        'ServiceAccessSecurityGroup': 'string',
        'AdditionalMasterSecurityGroups': [
            'string',
        ],
        'AdditionalSlaveSecurityGroups': [
            'string',
        ]
    },
    Steps=[
        {
            'Name': 'string',
            'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'|'CANCEL_AND_WAIT'|'CONTINUE',
            'HadoopJarStep': {
                'Properties': [
                    {
                        'Key': 'string',
                        'Value': 'string'
                    },
                ],
                'Jar': 'string',
                'MainClass': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ],
    BootstrapActions=[
        {
            'Name': 'string',
            'ScriptBootstrapAction': {
                'Path': 'string',
                'Args': [
                    'string',
                ]
            }
        },
    ],
    SupportedProducts=[
        'string',
    ],
    NewSupportedProducts=[
        {
            'Name': 'string',
            'Args': [
                'string',
            ]
        },
    ],
    Applications=[
        {
            'Name': 'string',
            'Version': 'string',
            'Args': [
                'string',
            ],
            'AdditionalInfo': {
                'string': 'string'
            }
        },
    ],
    Configurations=[
        {
            'Classification': 'string',
            'Configurations': {'... recursive ...'},
            'Properties': {
                'string': 'string'
            }
        },
    ],
    VisibleToAllUsers=True|False,
    JobFlowRole='string',
    ServiceRole='string',
    Tags=[
        {
            'Key': 'string',
            'Value': 'string'
        },
    ],
    SecurityConfiguration='string',
    AutoScalingRole='string',
    ScaleDownBehavior='TERMINATE_AT_INSTANCE_HOUR'|'TERMINATE_AT_TASK_COMPLETION',
    CustomAmiId='string',
    EbsRootVolumeSize=123,
    RepoUpgradeOnBoot='SECURITY'|'NONE',
    KerberosAttributes={
        'Realm': 'string',
        'KdcAdminPassword': 'string',
        'CrossRealmTrustPrincipalPassword': 'string',
        'ADDomainJoinUser': 'string',
        'ADDomainJoinPassword': 'string'
    }
)

How can I provide an S3 bucket location that has the Software Configuration JSON file for creating an EMR cluster through the Boto3 library?

2

2 Answers

1
votes

The Configuring Applications - Amazon EMR documentation says:

Supplying a Configuration in the Console

To supply a configuration, you navigate to the Create cluster page and choose Edit software settings. You can then enter the configuration directly (in JSON or using shorthand syntax demonstrated in shadow text) in the console or provide an Amazon S3 URI for a file with a JSON Configurations object.

That seems to be the capability you showed in your question.

The documentation then shows how you can do it via the CLI:

aws emr create-cluster --use-default-roles --release-label emr-5.14.0 --instance-type m4.large --instance-count 2 --applications Name=Hive --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

This maps to the Configurations options in the JSON you show above:

                    'Configurations': [
                        {
                            'Classification': 'string',
                            'Configurations': {'... recursive ...'},
                            'Properties': {
                                'string': 'string'
                            }
                        },
                    ]

Configurations: A configuration classification that applies when provisioning cluster instances, which can include configurations for applications and software that run on the cluster.

It would contain settings such as:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.security.groups.cache.secs": "250"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.tasktracker.map.tasks.maximum": "2",
      "mapreduce.map.sort.spill.percent": "0.90",
      "mapreduce.tasktracker.reduce.tasks.maximum": "5"
    }
  }
]

Short answer: Configurations

1
votes

Right now the boto3 SDK can't directly import the configuration settings from s3 for you as part of the run_job_flow() function. You would need to setup an S3 client in boto3, download the data as an S3 object and then update the Configuration List part of your EMR dictionary with the JSON data in your S3 file.

An example of how to download a json file from S3 and then load it into memory as a Python Dict can be found over here - Reading an JSON file from S3 using Python boto3