We have a CSV file stored in an ADO (Azure DevOps) Git repository. I have an Azure Databricks cluster running, and in the workspace I have Python code that reads this CSV file and transforms it into a Spark DataFrame. But every time the file changes, I have to manually download it from ADO Git and upload it to the Databricks workspace. I use the following command to verify that the file has been uploaded:

dbutils.fs.ls("/FileStore/tables")

It lists my file. I then use the following Python code to load this CSV into a Spark DataFrame:

file_location = "/FileStore/tables/MyFile.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

So there is a manual step involved every time the file in the ADO Git repository changes. Is there a Python function I can use to point directly at the copy of the file in the master branch of the ADO Git repository?

1 Answer

You have two options, depending on which is simpler for you:

  1. Use the Azure DevOps Python API to access the file (called an item in the API) inside the Git tree. Because the file is downloaded only to the driver node, you then need to use dbutils.fs.cp to copy it from the driver node's local disk into /FileStore/tables.
  2. Set up a build pipeline in your Git repository that is triggered only by commits that change this specific file, and have it use the Databricks CLI (the databricks fs cp ... command) to copy the file directly into DBFS. Here is an example that does not do exactly what you want, but it could serve as inspiration.
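
As a rough illustration of the first option, here is a minimal sketch. It calls the Azure DevOps REST endpoint for Git items directly with the standard library rather than going through the azure-devops Python package, and it assumes authentication with a personal access token (PAT). The organization, project, repository, file path, secret scope, and the helper names build_item_url, pat_auth_header, and download_file are all hypothetical placeholders, not something from the question.

```python
import base64
import urllib.parse
import urllib.request


def build_item_url(organization, project, repository, path, branch="master"):
    """Build the Git 'Items - Get' REST URL for one file on a branch."""
    query = urllib.parse.urlencode(
        {
            "path": path,
            "versionDescriptor.version": branch,
            "versionDescriptor.versionType": "branch",
            "$format": "octetStream",  # ask for the raw file bytes
            "api-version": "7.0",      # assumed API version
        },
        safe="/$",
    )
    base = "https://dev.azure.com/{}/{}/_apis/git/repositories/{}/items".format(
        urllib.parse.quote(organization),
        urllib.parse.quote(project),
        urllib.parse.quote(repository),
    )
    return base + "?" + query


def pat_auth_header(pat):
    """Basic auth header for a PAT: empty user name, PAT as the password."""
    token = base64.b64encode((":" + pat).encode("utf-8")).decode("ascii")
    return {"Authorization": "Basic " + token}


def download_file(url, pat, local_path):
    """Download the file to the driver node's local disk."""
    request = urllib.request.Request(url, headers=pat_auth_header(pat))
    with urllib.request.urlopen(request) as response, open(local_path, "wb") as out:
        out.write(response.read())
    return local_path


# In a Databricks notebook (all names below are placeholders):
# pat = dbutils.secrets.get(scope="my-scope", key="ado-pat")
# url = build_item_url("my-org", "my-project", "my-repo", "/data/MyFile.csv")
# local = download_file(url, pat, "/tmp/MyFile.csv")
# dbutils.fs.cp("file:" + local, "dbfs:/FileStore/tables/MyFile.csv")
```

Running the commented lines at the top of the notebook, before the spark.read call from the question, would refresh /FileStore/tables/MyFile.csv from the master branch on every run, so the manual download and upload step is no longer needed.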