6
votes

I am trying to read an Excel file (.xlsx extension) from Azure Blob Storage in my Azure Data Factory dataset, and it throws the following error:

Error found when processing 'Csv/Tsv Format Text' source 'Filename.xlsx' with row number 3: found more columns than expected column count: 1.

What are the right column and row delimiters for Excel files to be read in Azure Data Factory?

4 Answers

1
votes

Azure Data Factory V2 has recently released an update to support parsing Excel (.xls and .xlsx) files on its existing connectors.

Currently, the connectors supporting Excel files are:

  • Amazon S3
  • Azure Blob
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2
  • Azure File Storage
  • File System
  • FTP
  • Google Cloud Storage
  • HDFS
  • HTTP
  • SFTP

More details can be found here: https://docs.microsoft.com/en-us/azure/data-factory/format-excel
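For reference, an Excel dataset in ADF is defined roughly like the JSON below (structure per the linked docs; the linked service and container names are placeholders):

```json
{
    "name": "ExcelDataset",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "AzureBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",
                "fileName": "Filename.xlsx"
            },
            "sheetName": "Sheet1",
            "firstRowAsHeader": true
        }
    }
}
```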

8
votes

Excel files have a proprietary format and are not simple delimited files. As indicated here, Azure Data Factory does not have a direct option to import Excel files, eg you cannot create a Linked Service to an Excel file and read it easily. Your options are:

  1. Export or convert the data to flat files, eg .csv, tab-delimited or pipe-delimited, before transfer to the cloud, as these are much easier to read than Excel files. This is your simplest option, although it obviously requires a change in process.
  2. Try shredding the XML - an .xlsx file is a ZIP archive of XML parts, so you can create a custom task to open the Excel file as XML and extract your data, as suggested here.
  3. SSIS packages are now supported in Azure Data Factory (via the Execute SSIS Package activity) and have better support for Excel files, eg an Excel Connection Manager. So it may be an option to create an SSIS package to deal with the Excel file and host it in ADF v2. Warning! I have not tested this; I am only speculating that it is possible. There is also the overhead of creating an Azure-SSIS Integration Runtime (IR) to run SSIS in ADF v2.
  4. Try some other custom activity, eg there is a custom U-SQL extractor for shredding XML on GitHub here.
  5. Try reading the Excel file using Databricks; some examples here, although spinning up a Spark cluster to read a few Excel files does seem like overkill. This might be a good option if Spark is already in your architecture.
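To illustrate the XML-shredding idea in option 2: an .xlsx file is just a ZIP archive of XML parts, so it can be opened with nothing but the standard library. The sketch below is deliberately minimal - it only reads inline cell values from the first worksheet; real workbooks also keep text in a shared-strings part, which this ignores.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used inside .xlsx worksheet parts
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def read_inline_sheet(xlsx_bytes):
    """Extract raw cell values from xl/worksheets/sheet1.xml inside an
    .xlsx archive. Minimal sketch: handles inline <v> values only, not
    shared strings, dates, or formulas."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as z:
        xml_data = z.read("xl/worksheets/sheet1.xml")
    root = ET.fromstring(xml_data)
    rows = []
    for row in root.iter(NS + "row"):
        rows.append([c.findtext(NS + "v") for c in row.iter(NS + "c")])
    return rows
```

A custom activity could call something like this to pull the data out and re-emit it as a flat file that ADF can parse.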

Let us know how you get on.

0
votes

Azure Data Factory does not have a direct option to upload an Excel file; however, it can be uploaded through linked services with some configuration. Follow the steps below to do so:

  1. Create a linked service to the source using the appropriate protocol (FTP, file share, SFTP, etc.).
  2. Create a linked service to your cloud storage.
  3. Add a Copy Data activity, then define a file-system source dataset and a sink dataset using your previously defined linked services.
  4. In both the source and sink datasets, tick the Binary file option.
  5. Publish and execute your pipeline.

0
votes

I would like to elaborate a bit more on option 3, SSIS packages. I tested the solution and it sort of worked. First of all, the SSIS IR is very expensive, and you will want to use HTTP POST calls to start and stop it while it is not needed.

Secondly, SSIS does not support processing blob files out of the box. There are some ways around it: you can use third-party software (I didn't try that), or explore other options.

In my case, I used a trick: I deployed a package that downloads the file locally, processes it, and then deletes it. Since the SSIS IR is a VM, it did the same on the IR after deployment. It was a trick rather than a solution, and it was not 100% stable. Another issue: numeric fields picked up insignificant errors during processing, eg uploading 25 resulted in 25.0000000001 on the database side. No idea why; I did not have enough time to investigate further.
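On the precision point: trailing digits like 25.0000000001 look like ordinary binary floating-point noise picked up somewhere in the pipeline. A defensive workaround (my own sketch, not part of the tested SSIS solution) is to quantize each value to a fixed scale before it reaches the database:

```python
from decimal import Decimal, ROUND_HALF_UP

def clean_numeric(value, places=4):
    """Strip float noise such as 25.0000000001 by rounding to a fixed
    number of decimal places before loading the value into the database."""
    exponent = Decimal(1).scaleb(-places)  # eg Decimal('0.0001') for places=4
    return Decimal(str(value)).quantize(exponent, rounding=ROUND_HALF_UP)

print(clean_numeric(25.0000000001))  # 25.0000
```

Matching `places` to the scale of the target DECIMAL column keeps the loaded values exact.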