I have experience building ETL solutions based on Azure Data Factory and Azure Data Lake Analytics (U-SQL).

But it seems like Microsoft has started pushing Azure Databricks instead.

Is U-SQL dying? I have not seen any news about new features since July.

The upcoming project is pretty simple. We have about 0.5 TB of small JSON files stored in Azure Data Lake Storage. They need to be transformed into flat tables and joined in some way.

So my question is: what should we choose for the new project, ADF + U-SQL or ADF + Databricks?

Thanks for the links above. I'm asking more about whether U-SQL is dying or not (in light of Databricks). – Alex S
I have no reason to believe it is dying. There are still features planned, in progress, and under review. Just because new features are not announced on a monthly basis does not mean it is dying, IMHO, especially not when it is maturing well. Also, if you don't mind me asking, what makes you think Databricks is being forced upon you? I think U-SQL is far easier for people from a Microsoft background, whereas Databricks has a lower learning curve for people already working on that stack in a non-managed situation. – Peter Bons

1 Answer

Spark's programming model for data engineering/transformation is fundamentally more flexible and extensible than U-SQL.

For small, simple projects you wouldn't notice the difference, and I'd recommend you go with whatever you are familiar with. For complex projects, and/or ones where you expect significant flux in requirements, I would strongly recommend Spark using one of the supported languages (Scala, Java, Python or R) rather than SparkSQL. The reason is that Spark's domain-specific language (DSL) for data transformations makes it easy to do the equivalent of SQL code generation, which is the trick all BI/analytics/warehousing tools use under the covers to manage complexity. It allows logic, configuration and customization to be organized and managed in ways that are impossible or impractical when dealing with SQL, which, we should not forget, is a 40+ year-old language.
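To make that concrete, here is a minimal PySpark sketch (the paths, column names and `flatten` helper are hypothetical, not from your project): because transformations are ordinary functions and values in the host language, logic that would otherwise require generating SQL strings can be expressed as plain, reusable code.

```python
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-join-sketch").getOrCreate()

def flatten(df: DataFrame, mapping: dict) -> DataFrame:
    """Project (possibly nested) JSON fields into a flat table,
    driven by a plain dict instead of generated SQL text."""
    return df.select([F.col(src).alias(dst) for dst, src in mapping.items()])

users = flatten(
    spark.read.json("adl://store.azuredatalakestore.net/users/*.json"),
    {"user_id": "id", "country": "address.country"},
)
events = flatten(
    spark.read.json("adl://store.azuredatalakestore.net/events/*.json"),
    {"user_id": "user.id", "event_type": "type"},
)

# The flattened tables compose with joins/aggregations like any other value.
report = users.join(events, "user_id").groupBy("country", "event_type").count()
```

The same `flatten` helper can be reused across every feed in the project, with the mappings kept in configuration, which is exactly the kind of abstraction that gets awkward in pure SQL.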

For an extreme example of the level of abstraction that's possible with Spark, you might enjoy https://databricks.com/session/the-smart-data-warehouse-goal-based-data-production

I would also recommend Spark if you are dealing with dirty/untrusted data (the JSON in your case) and you'd like a highly controlled/custom ingestion process. In that case, you might benefit from some of the ideas in the spark-records library for bulletproof data processing: https://databricks.com/session/bulletproof-jobs-patterns-for-large-scale-spark-processing
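This is not the spark-records API itself, just a sketch of one controlled-ingestion mechanism Spark supports out of the box: in PERMISSIVE mode the JSON reader routes unparseable rows into a designated corrupt-record column instead of failing the whole job, so bad records can be quarantined and inspected. The path and schema below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("dirty-json-sketch").getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
    StructField("_corrupt_record", StringType()),  # receives malformed rows
])

# cache() matters: Spark disallows queries that reference only the internal
# corrupt-record column directly on an uncached raw-file scan.
raw = (spark.read
       .schema(schema)
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .json("adl://store.azuredatalakestore.net/incoming/*.json")
       .cache())

good = raw.filter(raw["_corrupt_record"].isNull()).drop("_corrupt_record")
bad = raw.filter(raw["_corrupt_record"].isNotNull())  # quarantine for review
```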

When it comes to using Spark, especially for new users, Databricks provides the best managed environment. We've been a customer for years managing petabytes of very complex data. People on our team who come from SQL backgrounds and are not software developers use SparkSQL in Databricks notebooks but they benefit from the tooling/abstractions the data engineering and data science teams create for them.

Good luck with your project!