0
votes

I am trying to connect and attach an AWS EMR cluster (emr-5.29.0) to a Jupyter notebook that I am working on from my local Windows machine. I have started a cluster with Hive 2.3.6, Pig 0.17.0, Hue 4.4.0, Livy 0.6.0, and Spark 2.4.4, and the subnets are public. I found that this can be done with Azure HDInsight, so I was hoping something similar could be done with EMR. The issue I am having is passing the correct values in the config.json file. How should I attach an EMR cluster?

I could work with the EMR notebooks native to AWS, but thought I would go the develop-locally route and have hit a roadblock.

{
    "kernel_python_credentials" : {
      "username": "{IAM ACCESS KEY ID}",  # not sure about the username for the cluster
      "password": "{IAM SECRET ACCESS KEY}",  # I use PuTTY to SSH into the cluster with the .pem key, so again not sure about the password for the cluster
      "url": "ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com",  # as per the AWS blog: when Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy
      "auth": "None"
    },

    "kernel_scala_credentials" : {
      "username": "{IAM ACCESS KEY ID}",
      "password": "{IAM SECRET ACCESS KEY}",
      "url": "{Master public DNS}",
      "auth": "None"
    },
    "kernel_r_credentials": {
      "username": "{}",
      "password": "{}",
      "url": "{}"
    },
    ...
}
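
(Side note: JSON doesn't actually allow # comments; the annotations above are just my notes. Also, Livy serves its REST API on port 8998 of the master node, so one way to sanity-check whatever ends up in url is to hit that API directly. A minimal Python sketch, assuming the master's security group allows inbound 8998 from your IP; the hostname is a placeholder:)

    import requests

    # Livy's REST endpoint on the EMR master node (placeholder hostname).
    livy_url = "http://ec2-xx-xxx-x-xxx.us-west-2.compute.amazonaws.com:8998"

    # GET /sessions lists active Livy sessions; a fresh cluster returns
    # {"from": 0, "total": 0, "sessions": []}
    resp = requests.get(livy_url + "/sessions", timeout=10)
    print(resp.status_code, resp.json())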

Update 1/4/2021

On 1/4, I got sparkmagic to work with my local Jupyter notebook. I used these documents as references (ref-1, ref-2 & ref-3) to set up local port forwarding (avoid using sudo if possible).

 sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com
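
(sudo shouldn't actually be required here, since the local port 8998 is above 1024. If you'd rather manage the tunnel from Python than keep a shell open, a rough equivalent using the third-party sshtunnel package, an assumption on my part rather than something EMR ships, would be:)

    import os
    from sshtunnel import SSHTunnelForwarder

    # Forward localhost:8998 to port 8998 on the EMR master node over SSH.
    # 'hadoop' is the default SSH user on EMR; the hostname is a placeholder.
    tunnel = SSHTunnelForwarder(
        "ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com",
        ssh_username="hadoop",
        ssh_pkey=os.path.expanduser("~/aws-key/my-pem-file.pem"),
        remote_bind_address=("localhost", 8998),
        local_bind_address=("127.0.0.1", 8998),
    )
    tunnel.start()  # tunnel stays up until stop() is called
    # ... use http://localhost:8998 from sparkmagic while this runs ...
    # tunnel.stop()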

Configuration details
Release label: emr-5.32.0
Hadoop distribution: Amazon 2.10.1
Applications: Hive 2.3.7, Livy 0.7.0, JupyterHub 1.1.0, Spark 2.4.7, Zeppelin 0.8.2

Updated config file

{
    "kernel_python_credentials" : {
      "username": "",
      "password": "",
      "url": "http://localhost:8998"
    },
  
    "kernel_scala_credentials" : {
      "username": "",
      "password": "",
      "url": "http://localhost:8998",
      "auth": "None"
    },
    "kernel_r_credentials": {
      "username": "",
      "password": "",
      "url": "http://localhost:8998"
    },
  
    "logging_config": {
      "version": 1,
      "formatters": {
        "magicsFormatter": { 
          "format": "%(asctime)s\t%(levelname)s\t%(message)s",
          "datefmt": ""
        }
      },
      "handlers": {
        "magicsHandler": { 
          "class": "hdijupyterutils.filehandler.MagicsFileHandler",
          "formatter": "magicsFormatter",
          "home_path": "~/.sparkmagic"
        }
      },
      "loggers": {
        "magicsLogger": { 
          "handlers": ["magicsHandler"],
          "level": "DEBUG",
          "propagate": 0
        }
      }
    },
    "authenticators": {
      "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
      "None": "sparkmagic.auth.customauth.Authenticator", 
      "Basic_Access": "sparkmagic.auth.basic.Basic"
    },
  
    "wait_for_idle_timeout_seconds": 15,
    "livy_session_startup_timeout_seconds": 60,
  
    "fatal_error_suggestion": "The code failed because of a fatal error:\n\t{}.\n\nSome things to try:\na) Make sure Spark has enough available resources for Jupyter to create a Spark context.\nb) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.\nc) Restart the kernel.",
  
    "ignore_ssl_errors": false,
  
    "session_configs": {
      "driverMemory": "1000M",
      "executorCores": 2
    },
  
    "use_auto_viz": true,
    "coerce_dataframe": true,
    "max_results_sql": 2500,
    "pyspark_dataframe_encoding": "utf-8",
    
    "heartbeat_refresh_seconds": 5,
    "livy_server_heartbeat_timeout_seconds": 60,
    "heartbeat_retry_seconds": 1,
  
    "server_extension_default_kernel_name": "pysparkkernel",
    "custom_headers": {},
    
    "retry_policy": "configurable",
    "retry_seconds_to_sleep_list": [0.2, 0.5, 1, 3, 5],
    "configurable_retry_policy_max_retries": 8
  }
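
(With this file at ~/.sparkmagic/config.json, the sparkmagic wrapper kernels (PySpark, Spark, SparkR) pick the endpoint up automatically. From a plain Python kernel you can instead drive it via the magics; a minimal sketch:)

    # Run in a Jupyter cell on a plain IPython (Python) kernel:
    %load_ext sparkmagic.magics

    # Opens the session-management widget, where you create a Livy session
    # against the endpoint configured above (http://localhost:8998).
    %manage_spark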

Second update 1/9

Back to square one. I keep getting the error below and have spent days trying to debug. Not sure what I did previously to get things going. I also checked my security group config and it looks fine: SSH open on port 22.

An error was encountered:
Error sending http request and maximum retry encountered.
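
(For anyone hitting the same message: sparkmagic raises this when it can't reach the Livy URL at all, so the first thing worth ruling out is a dead tunnel. A quick standard-library check that something is listening on the forwarded port:)

    import socket

    # If the SSH tunnel is up, something is listening on localhost:8998.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)
    reachable = sock.connect_ex(("127.0.0.1", 8998)) == 0
    sock.close()
    print("tunnel is up" if reachable else "nothing on 8998 - restart the tunnel")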
The Microsoft document mentioned here is for attaching an HDInsight cluster to a local Jupyter notebook. I would recommend checking the AWS documents on whether you can attach an EMR cluster to a local notebook. You can refer to stackoverflow.com/questions/44800857/… and christo-lagali.medium.com/… – Subash
It is possible to attach a local notebook to a remote EMR cluster. towardsdatascience.com/… – Ahmed

1 Answer

0
votes

Created a local port forward (SSH tunnel) to the Livy server on port 8998 and it works like magic.

sudo ssh -i ~/aws-key/my-pem-file.pem -N -L 8998:ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com:8998 hadoop@ec2-xx-xxx-xxx-xxx.us-west-2.compute.amazonaws.com

I did not change my config.json file from the 1/4 update.
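
(A quick way to confirm the tunnel end-to-end before opening the notebook is to ask Livy for its session list through the forwarded port; on a fresh cluster it should come back empty. A sketch:)

    import requests

    # Through the tunnel, localhost:8998 is the master node's Livy server.
    resp = requests.get("http://localhost:8998/sessions", timeout=10)
    print(resp.status_code)  # expect 200
    print(resp.json())       # e.g. {"from": 0, "total": 0, "sessions": []}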