Question
I think it's pretty common issue, hope there are solutions/approaches we can reuse.
We're building data lake in Azure ADLS gen2, having unidirectional data flow: Nifi/ADF -> ADLS -> ETL/Spark/Databricks -> Data Warehouse -> Power BI
. Some ETL inputs should be loaded/updated by responsible business users on weekly/monthly basis.
Could you please suggest/improve solutions for business users to upload ETL inputs with meeting the requirements below?
Requirements
- User friendly interface.
- Parsing/validation. We must to make sure that data fit expected format during uploading.
- Audit logging. We must have ability to track who and when loaded what. In case of some delays, notifications should be sent.
- Simplicity of implementation and compliance with current architecture.
Requirements achievements are estimated from 1 (very bad practice approach) to 5 (100% clean, easy to implement solution).
Possible solutions
Upload files from Power Apps to Data Warehouse.
Flow:Business users -> Power Apps -> Data Warehouse & Stored Procedures -> ADLS -> Spark -> Data Warehouse -> Power BI
.- 1 requirement = 5. Very user friendly interface built with Power Apps.
- 2 requirement = 2. Poor implementation of validations/transformations in SQL stored procedures. All other application code is written in Spark.
- 3 requirement = 3-5. Not sure how to implemenent this yet.
- 4th requirement = 2. Data flow becomes bidirectional means
DW -> ADLS -> DW
. Harder to reason about and orchestrate.
Use Spark/Databricks over ADLS events/triggers.
Flow:Business users -> Microsoft Storage Explorer app -> ADLS gen2 -> Azure Blob Storage trigger -> Azure Function -> Spark parsing/validation job -> ADLS gen2
- 1st requirement = 3-4. Uploading through Storage Explorer is very user friendly, the only issue is that the mechanics for notifying user about success/failure can be done through email and may not be very clear.
- 2nd requirement = 5. I like parsing/validation happen on ETL side, not on Data Warehouse stored procedures.
- 3rd requirement = 1-3. Isn't clear hot to achieve that currently. Expecting it'll be worse than with Power Apps.
- 4th requirement = 4. Unidirectional process, data aren't moved from DW to Data Lake. 4 not 5, because it isn't very clear for business user that success/failure notification will come through email. Also little bit bigger complexity of implementation.