1
votes

I am processing data in U-SQL but not getting expected results. Here is what I am doing:

1- Select data from ADL table partitions and assign it to @data1

2- Aggregate data using Group BY and assign it to @data2

3- Truncate partitions

4- Insert data(produced in step 2) into the same table

5- Use @data2 and generate a unique GUID for every record using user
defined function and assign it to @data2
        //UDF Code
        public static Guid GetNewGuid ()
        {
        return Guid.NewGuid ();
        }

6- Select few columns from @data2 and assign it to @data3

Strangely GUIDs in @data2 and @data3 are totally different.

If I perform some joins with other datasets and change schema in Step 5 and then generate unique GUIDs then I get same GUIDS at last step. It looks like some script optimization is happening in the backend that is creating this problem.

Could you please let me know what is wrong happening in above workflow? Or if some sort of optimization is happening in the backend then how to learn how script optimization works.

Update: In this question, my focus is to learn why something calculated on one step is automatically changed in next step.

2
See here for a reply from Azure Data Lake Team. Guid is not something you should be using with U-SQL to uniquely identify records because they might be "recalculated - to due vertex retries, performance optimizations, etc."wBob
Thanks for this link. I will consider this for unique ID generation but for this question I have a different focus that I updated above.Jamil

2 Answers

2
votes

Answering the updated focus - why something calculated on one step is automatically changed in the next step

wBob's excerpt answers the question completely IMO, but maybe a broader context will help.

Semantically, the answer is simply

  • Because you've violated a usage requirement of the language (determinism), the resulting behavior is undefined.

Undefined means anything can happen, so you cannot have expectations - of consistent values or otherwise. Any discussion of the actual behavior seen (different Guids) is implementation detail.

Semantic deep dive
Steps are an illusion.

  • Statements in an imperative language like C# are a sequence of exact instructions (how).

  • Statements in a functional language like U-Sql are a list of requirements, describing outputs in terms of inputs (what). The U-Sql optimizer has complete implementation flexibility in meeting those requirements. Rowsets are logical constructs to organize user requirements, they need not actually exist at implementation time; while the logic corresponding to a rowset may be split, merged, skipped, repeated, etc in implementation.

So, for example, a perfectly legal implementation of steps 5, 6 under the deterministic requirement:

@data3 = SELECT <FewCols>, GetNewGuid() AS NewGuid FROM @data2;
@data2 = SELECT *, GetNewGuid() AS NewGuid FROM @data2;
0
votes

Welcome to the lake Nabeel. Looking forward to more of your answers!

In addition, the behavior with non-deterministic functions is a bit complicated by the fact that for some functions, we make them deterministic (e.g., DateTime.Now()) if they appear in U-SQL but outside of a C# code block (e.g., a user-defined function), but they stay non-deterministic inside the C# code block.

The recommended way to create keys is a technique called Skolemization, where you use a deterministic key generation based on the identifying properties. That way you stay deterministic.