1
votes

EDIT: I have been looking into the required results. Let me explain better: What I try to reach is to create a row for each YearMonth for each SomeData, between it's start and end date.

So for example SomeData "88888888888888888881", with startDate "2005-12-06 00:00:00.000" and EndDate 2006-03-13 00:00:000". I want rows to go like:

88888888888888888881, 200512
88888888888888888881, 200601
88888888888888888881, 200602
88888888888888888881, 200603

I am aware this might "explode" the result into a huge file.

Below my post:

I am trying to re-write something in U-SQL that we did before in T-SQL.

The problem is that U-SQL does not allow for a between to happen during the join.

The T-SQL JOIN looks something like this:

SELECT rf.SomeData AS SomeData,
       rd.YearMonth AS YearMonth,
      (rf.SomeData + '-' + rd.YearMonth.ToString()) AS MonthlyKey,
       rf.SomeKey AS SomeKey
FROM MyTable rf
    INNER JOIN dbo.DimDate rd
    ON rd.Date >= rf.StartDate
    AND rd.Date <= (CASE WHEN rf.EndDate IS NULL THEN GETDATE() ELSE rf.EndDate END)

In U-SQL I started like this, but how should I write the JOIN now?:

@EditedTable =
    SELECT rf.SomeData AS SomeData,
           rd.YearMonth AS YearMonth,
           (rf.SomeData + "-" + rd.YearMonth.ToString()) AS MonthlyKey,
           rf.SomeKey AS SomeKey
    FROM @MyTable AS rf
         INNER JOIN
             @date AS rd
         ON 

It is important that we get all the data between start and end date, and create a monthly key so that "SomeData" can be joined with another table later on.

I have tried using a cross join but when running it, it get stuck at 80% and never seems to end. It keeps writing GB's in one vertex. Besides, I am not actually sure this will deliver the same results.

@EditedTableCROSS =
    SELECT rfj.SomeData AS SomeData,
           rfj.StartDate AS StartDate,
           rfj.EndDate AS EndDate,
           (rfj.SomeData + "-" + dtj.YearMonth.ToString()) AS MonthlyKey, 
           rfj.SomeKey AS SomeKey

        FROM
    (
        SELECT SomeData AS SomeData,
               StartDate AS StartDate,
               EndDate AS EndDate,
               SomeKey AS SomeKey
        FROM @TableA
        WHERE SomeData != ""
    ) AS rfj

    CROSS JOIN

    (
    SELECT DISTINCT
           dt.Date AS Date,
           dt.YearMonth AS YearMonth,
           dt.Month AS Month,
           rf.StartDate AS StartDate
    FROM @date AS dt INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date
    WHERE rf.StartDate >= dt.Date AND
          dt.Date <= DateTime.Now
          ) AS dtj

    WHERE rfj.StartDate <= dtj.Date AND
          rfj.EndDate >= dtj.Date;

The problem with the code above is that "INNER JOIN @TableA AS rf ON rf.StartDate == dt.Date" is not on an unique key, certain dates occur a lot more than once. So I doubt this is the way....

Please share your ideas?

EDIT: People asked for sample data, End date could contain:

2006-03-13 10:27:13.000
2016-03-02 18:48:11.000
2016-03-02 18:42:57.000
NULL
2013-09-12 09:19:05.000
NULL
2016-03-02 18:59:37.000
NULL
NULL

StartDate:

2005-12-06 00:00:00.000
2011-03-29 20:57:51.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2007-11-01 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2011-02-28 00:00:00.000
2008-01-17 00:00:00.000

DimDate contains dates from 2000 to 2018 at day level.

SomeDate and SomeKey would look something like:

88888888888888888881
88888888888888888882
88888888888888888883
88888888888888888884
88888888888888888885
88888888888888888886
88888888888888888887
88888888888888888888
88888888888888888889
2
In the TSQL you can use isnull(rf.EndDate, getdate()) - paparazzo
Sample data, expected results please? - wBob
@wBob I added the sample data. - Richard
@Paparazzi Thanks for the tip. - Richard
Does wBob's answer address your question? As to why you may get a problem with your CROSS JOIN: Please make sure that you can partition the join in some way. Otherwise all the data will be piped into a single vertex and given that your join will most likely explode the data set, may run out of time (5h for the vertex) or in general be very inefficient. - Michael Rys

2 Answers

4
votes

The reason U-SQL does not support BETWEEN in the predicate is that there is no scale-out join algorithm that works for non-equijoins. Even if we syntactically would allow it, it would still go into a CROSS JOIN in the plan.

What you want to do is to get a join that can be partitioned. One way to do it is if you can have an equality join on a partitioning key and then the cross join within that partition.

In your case I am however thinking you do not really need a join. I think what you want to do is to generate a row per day between the start and end date.

I would do this with a CROSS APPLY EXPLODE that has no scale limit. Here is an example:

@MyTable = 
  SELECT * 
  FROM (VALUES
        (81,81,(DateTime?) DateTime.Parse("2005-12-06 00:00:00.000"),(DateTime?) DateTime.Parse("2006-03-13 10:27:13.000")),
        (82,82,(DateTime?) DateTime.Parse("2011-03-29 20:57:51.000"),(DateTime?) DateTime.Parse("2016-03-02 18:48:11.000")),
        (83,83,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:42:57.000")),
        (84,84,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) null),
        (85,85,(DateTime?) DateTime.Parse("2007-11-01 00:00:00.000"),(DateTime?) DateTime.Parse("2013-09-12 09:19:05.000")),
        (86,86,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
        (87,87,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) DateTime.Parse("2016-03-02 18:59:37.000")),
        (88,88,(DateTime?) DateTime.Parse("2011-02-28 00:00:00.000"),(DateTime?) null),
        (89,89,(DateTime?) DateTime.Parse("2008-01-17 00:00:00.000"),(DateTime?) null)
    ) AS T(SomeKey, SomeData, StartDate, EndDate);

@res = 
  SELECT SomeKey, SomeData, StartDate, EndDate, DailyDate 
  FROM @MyTable 
       CROSS APPLY EXPLODE 
         (Enumerable.Range(0, 
               1 + (EndDate == (DateTime?) null ? DateTime.Now 
                                                : EndDate.Value).Subtract(StartDate.Value).Days)
           .Select(offset => StartDate.Value.AddDays(offset))
          ) AS T(DailyDate);

OUTPUT @res
TO "/output/test.csv"
USING Outputters.Csv(outputHeader : true);

This is a typical example of where asking the question scenario-based instead of asking for a translation would have been easier to answer :).

3
votes

I got this script to work with some sample data I generated.

@dateDim =
    EXTRACT xdate DateTime,
            yearMonth string
    FROM "/input/dbo.DimDate.tsv"
    USING Extractors.Tsv();

@data =
    EXTRACT 
            someKey int,
            someData string,
            startDate DateTime,
            endDate DateTime?
    FROM "/input/dbo.MyTable.tsv"
    USING Extractors.Tsv();

/*
// Use U-SQL ISNULL conditional operator which is ?
@working =
    SELECT COUNT( * ) AS records
    FROM
    (
        SELECT *
        FROM @dateDim AS dd
             CROSS JOIN
                 @data AS d
        WHERE dd.xdate BETWEEN d.startDate AND (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
    ) AS x;
*/

@working =
    SELECT COUNT( * ) AS records
    FROM
    (
        SELECT *
        FROM @dateDim AS dd
             CROSS JOIN
                 @data AS d
        WHERE dd.xdate >= d.startDate
          AND dd.xdate <= (d.endDate == (DateTime?)null ? DateTime.Now : d.endDate)
) AS x;