5
votes

I am trying to install PySpark on Google Colab using the code given below, but I am getting the following error:

tar: spark-2.3.2-bin-hadoop2.7.tgz: Cannot open: No such file or directory

tar: Error is not recoverable: exiting now

This code ran successfully once, but it has been throwing this error since the notebook restarted. I have even tried running it from a different Google account, but I get the same error.

(Also, is there any way to avoid reinstalling PySpark every time the notebook restarts?)

code:

--------------------------------------------------------------------------------------------------------------------------------

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q http://apache.osuosl.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz

The following line seems to cause the problem, as it cannot find the downloaded file.

!tar xvf spark-2.3.2-bin-hadoop2.7.tgz

I have also tried the following two lines (instead of the two lines above), suggested in a Medium blog post, but with no better result.

!wget -q http://mirror.its.dal.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

!tar xvf spark-2.4.0-bin-hadoop2.7.tgz

!pip install -q findspark

-------------------------------------------------------------------------------------------------------------------------------

Any ideas on how to resolve this error and install PySpark on Colab?


5 Answers

11
votes

I am running PySpark on Colab by just using

!pip install pyspark

and it works fine.
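For a quick sanity check (a minimal sketch of my own, assuming only the pip-installed PySpark and no separate Spark download), a local session can be started directly:

from pyspark.sql import SparkSession

# "colab-check" is an arbitrary app name; local[*] uses all available cores
spark = SparkSession.builder.master("local[*]").appName("colab-check").getOrCreate()
spark.range(5).show()  # prints ids 0..4 if the install works
spark.stop()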

2
votes

Date: 6-09-2020


Step 1: Install PySpark on Google Colab

!pip install pyspark

Step 2: Handling pandas and Spark DataFrames inside a Spark session

!pip install pyarrow

PyArrow facilitates data exchange between many components, for example reading a Parquet file with Python (pandas) and converting it to a Spark DataFrame, Falcon Data Visualization, or Cassandra, without worrying about the conversion (a short sketch of the pandas round trip follows the steps below).

Step 3: Create a Spark session

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').getOrCreate()

Done ⭐
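As an optional illustration of Step 2, here is a minimal sketch of the pandas-to-Spark round trip that pyarrow accelerates; the arrow.enabled setting and the toy column names are my own assumptions for a Spark build with Arrow support:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # Arrow-based conversion toggle (Spark 2.3+)

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})  # toy pandas frame
sdf = spark.createDataFrame(pdf)  # pandas -> Spark DataFrame
back = sdf.toPandas()             # Spark DataFrame -> pandas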

1
votes

You are getting this error because spark-2.3.2-bin-hadoop2.7 has been replaced by a newer version on the official site and the mirror sites.

Go to either of these paths and get the latest version:

  1. http://apache.osuosl.org/spark/
  2. https://www-us.apache.org/dist/spark/

Replace the Spark build version and you are done; everything will work smoothly.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf /content/spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark
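To round this out, a minimal sketch of the findspark step that usually follows these cells; the JAVA_HOME and SPARK_HOME paths are assumptions matching the openjdk-8 package and the 2.4.3 tarball downloaded above:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"    # default path for the openjdk-8 package
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"  # where the tarball was unpacked

import findspark
findspark.init()  # makes the unpacked Spark build importable

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()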
1
votes

I had tried to install it in the same way, but even after checking the Spark versions carefully I was getting the same error. Running the code below worked for me!

!pip install pyspark
!pip install pyarrow
!pip install -q findspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('HelloWorld').getOrCreate()
0
votes

I have used the setup below to run PySpark and sparkdl on Google Colab.

# Installing spark 
!apt-get install openjdk-8-jre
!apt-get install scala
!pip install py4j
!wget -q https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark

# Installing databricks packages
!wget -q https://github.com/databricks/spark-deep-learning/archive/refs/tags/v1.5.0.zip 
!unzip v1.5.0.zip
!mv spark-deep-learning-1.5.0 databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11

# Clearing unnecessary space
!rm -r *.tgz *.zip sample_data
!ls

# Setting up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"

SUBMIT_ARGS = "--packages databricks:spark-deep-learning:1.5.0-spark2.4-s_2.11 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

# Importing and initializing Spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local[*]").getOrCreate()
spark = SparkSession.builder.appName("Test Setup").getOrCreate()
sc = spark.sparkContext
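A quick check (not part of the original setup) that the session and context are usable:

print(spark.version)                    # should report 2.4.8 for the build downloaded above
print(sc.parallelize(range(5)).sum())   # trivial RDD job; prints 10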