Has anyone been able to integrate Dataproc, Datalab, and the Source code repo? As many of us have seen, when you install Datalab through a Dataproc init action, it does not create the source code repo. I am trying to achieve a full end-to-end solution where a user logs into a Datalab notebook, interacts with Dataproc through PySpark, and checks the notebooks in to the Source code repo. I have not been able to do this with the init action, as I pointed out above. I also tried creating the Dataproc cluster first and then installing Datalab separately (this time it does create the source repo); however, I can't run any Spark code from that Datalab notebook. Can someone please give me some pointers on how to achieve this? Any and all help is appreciated.
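For reference, the init-action route I mean is the standard cluster creation with the Datalab init action from the public dataproc-initialization-actions bucket; something along these lines, with the cluster name and zone as placeholders:

gcloud dataproc clusters create my-cluster \
    --zone us-central1-a \
    --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh

This is the setup that gives me a working PySpark kernel but no source code repo.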
Code in Datalab
from pyspark.sql import HiveContext

# 'sc' is the SparkContext that Datalab pre-creates in the notebook.
hc = HiveContext(sc)

# Sanity check: list the Hive databases.
hc.sql("""show databases""").show()

# Register an external Parquet table backed by a GCS bucket.
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-2019016-ds-team/datasets/invoices'""")

hc.sql("""select * from invoices limit 10""").show()
Error
Py4JJavaError: An error occurred while calling o55.sql.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$or
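The ClassNotFoundException suggests that the Spark used by the standalone Datalab install doesn't have the GCS connector jar on its classpath (the init-action install picks it up from the Dataproc image, which is presumably why Spark works there). One workaround I have been trying is to build the session myself and point it at the connector jar; a minimal sketch, assuming the jar sits at the usual Dataproc location (the path and app name are my guesses, not verified):

from pyspark.sql import SparkSession

# Assumed location of the GCS connector jar on the VM; check with
# ls /usr/lib/hadoop/lib/gcs-connector*.jar and adjust as needed.
gcs_jar = "/usr/lib/hadoop/lib/gcs-connector.jar"

# If Datalab already created a SparkContext, stop it first (sc.stop()),
# because spark.jars only takes effect when a new session is built.
spark = (SparkSession.builder
         .appName("datalab-gcs-check")
         .config("spark.jars", gcs_jar)
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()

I haven't confirmed this end to end, so pointers on the proper way to wire the standalone Datalab to the cluster's Spark would still be welcome.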