We have an Azure Data Factory (ADF) pipeline whose first activity is a Databricks (DB) notebook that polls a DBFS-mounted file system for new files (usually a 1-day delta, based on the "added" metadata field). We then do some filtering on that file list and pass it to a ForEach to begin the actual data cleaning / insertion pipeline. This works fine for the daily delta updates, but for a full ingest of all the historical data we run into an error from Data Factory.
We pass the filtered file list out of the first notebook as JSON via dbutils.notebook.exit(file_list_dict), where file_list_dict is a Python dictionary holding the filtered paths as an array under a single JSON key, like this:
{"file_list": [{"path": dbfs_filepath, "type": "File"}, ... ]
For the full ingestion, ADF throws an error that the JSON passed back by a Databricks notebook can't exceed 20 MB (because it would contain thousands of file paths) and fails the pipeline. I've tried writing the JSON to a file instead and having the ForEach operator loop over that (see the sketch below), but I can't find the right way to do it. The ForEach documentation only talks about items coming from pipeline activities, which seems to be out of the question here since all of our steps are essentially Databricks notebooks. I've also tried making an ADF dataset out of the JSON file I wrote to the file system and looping over it with a Lookup activity, but that also only supports 5,000 rows.
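The file-writing attempt amounts to something like the following (paths and key names are illustrative, not the exact ones in our pipeline):

    import json

    # Dump the full file list to a JSON file on the DBFS mount instead of
    # returning it through dbutils.notebook.exit
    output_path = "/dbfs/mnt/ingest/full_ingest_file_list.json"
    with open(output_path, "w") as f:
        json.dump(file_list_dict, f)

    # Only return the (small) pointer to that file to ADF
    dbutils.notebook.exit(json.dumps({"file_list_path": "dbfs:/mnt/ingest/full_ingest_file_list.json"}))

The part I'm missing is how to get the ForEach to iterate over the contents of that file rather than over an activity's output.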
Is there a simple way to make ForEach loop over the file rows that I'm just not seeing?
Pipeline schematic:
<DB file poll notebook & filter> -> <ForEach Operator for file in filelist> -> <run pipeline for individual files>