We have a CSV file stored in an ADO (Azure DevOps) Git repository. I have an Azure Databricks cluster running, and in the workspace I have Python code that reads this CSV file and transforms it into a Spark DataFrame. But every time the file changes, I have to manually download it from ADO Git and upload it to the Databricks workspace. I use the following command to verify that the file has been uploaded:
dbutils.fs.ls("/FileStore/tables")
It lists my file. I then use the following Python code to load this CSV into a Spark DataFrame:
file_location = "/FileStore/tables/MyFile.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
So a manual step is needed every time the file in the ADO Git repository changes. Is there a Python function that lets me point directly at the copy of the file on the master branch of the ADO Git repository?
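To make the question concrete, here is a rough sketch of the kind of thing I have in mind. It assumes the file can be fetched over HTTPS through the Azure DevOps Git "Items" REST endpoint using a personal access token (PAT). The organization, project and repository names, the API version, the `Accept` header and the helper function names are all placeholders or assumptions on my part, not something I have confirmed works in my setup.

```python
import base64
import urllib.parse
import urllib.request


def build_items_url(organization, project, repository, path, branch="master"):
    """Build an Azure DevOps Git Items URL for one file on a branch.

    All arguments are placeholders; "7.0" is an assumed API version.
    """
    base = "https://dev.azure.com/{}/{}/_apis/git/repositories/{}/items".format(
        urllib.parse.quote(organization),
        urllib.parse.quote(project),
        urllib.parse.quote(repository),
    )
    query = urllib.parse.urlencode({
        "path": path,
        "versionDescriptor.version": branch,
        "versionDescriptor.versionType": "branch",
        "api-version": "7.0",
    })
    return base + "?" + query


def basic_auth_header(personal_access_token):
    """A PAT is sent as HTTP Basic auth with an empty user name."""
    raw = (":" + personal_access_token).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def fetch_file_text(url, personal_access_token):
    """Download the file body as text (assumes a UTF-8 encoded CSV)."""
    request = urllib.request.Request(url, headers={
        "Authorization": basic_auth_header(personal_access_token),
        "Accept": "text/plain",  # assumed: ask for raw content, not JSON metadata
    })
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Intended use inside a Databricks notebook (not run here; needs network,
# a valid PAT, and the notebook-provided dbutils object):
# url = build_items_url("my-org", "my-project", "my-repo", "/MyFile.csv")
# csv_text = fetch_file_text(url, pat)
# dbutils.fs.put("/FileStore/tables/MyFile.csv", csv_text, True)
```

If something like this is viable, the existing spark.read code could stay unchanged, since the file would still land at /FileStore/tables/MyFile.csv. Note that dbutils.fs.put writes a string in one call, so this would only suit a reasonably small CSV.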