0
votes

I'm stumped with a requirement and can't seem to get past it (I've tried to scour the internet, and I'm unable to fix it myself, given my limited experience with Power Query). I could accomplish this using VBA, but the number of records, that need to be processed are shy of half a million records!

So here's the simplified task at hand. There are two columns - "Substring" and "Main String", as shown below [Input]:

Substring   Main String
AB|CDE      ABCDEF
ABC|DE|GH   ABCDEFGHI
A|BC|X|YZ   ABCDYZ

The Substring column has mini substrings separated by a delimiter (|). These Mini Substrings may or may not be found within the Main string. The output at the end should look like this [Expected Output]:

Substring   Main String   MSS1  ML1   MSS2  ML2   MSS3  ML3   MSS4  ML4
AB|CDE      ABCDEF        CDE   3     AB    2     
ABC|DE|GH   ABCDEFGHI     ABC   3     DE    2     GH    2     
A|BC|X|YZ   ABCDYZ        BC    2     YZ    2     A     1     X     0

(Note: MSS and ML in the column headers stand for Mini Substring and Match Length respectively)

The Mini Substrings can vary in number, so the row that has the maximum number of Mini Substrings would define the number of MSS and ML columns. In this case, the third row has four of them, so there were four MSS and ML columns each.

Not just that, the sequence of these Mini Substrings has to be such that those with highest matching lengths should be placed on the leftmost MSS and ML pair of columns. And then those with the least matching lengths should go on the rightmost pair of MSS and ML columns. And everything in between should progress in that order.

I got until the stage where the current output looks like this [Interim Output]:

Substring   Main String   MSS1  MSS2  MSS3  MSS4
AB|CDE      ABCDEF        AB    CDE
ABC|DE|GH   ABCDEFGHI     ABC   DE    GH
A|BC|X|YZ   ABCDYZ        A     BC    X     YZ

Dynamically inserting the ML columns just after the respective MSS columns is the challenge I'm unable to get past by. And I'm sure sorting them horizontally by the rank of matching lengths (i.e. lengths of those mini substrings that found a match in the main string) would also be another challenge. Here's the Power Query that I could come up with that generated the [Interim Output].

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Sub String", type text}, {"Main String", type text}}),
    #"Duplicated Column" = Table.DuplicateColumn(#"Changed Type", "Sub String", "For Split"),
    #"Added Custom" = Table.AddColumn(#"Duplicated Column", "Split Count", each List.Count(Text.Split([Sub String],"|"))),
    #"Reordered Columns" = Table.ReorderColumns(#"Added Custom",{"Split Count", "Sub String", "Main String", "For Split"}),
    #"Max No Of Splittable Columns" = List.Max(#"Reordered Columns"[Split Count]),
    #"List Of MSS Columns" = List.Transform({1..#"Max No Of Splittable Columns"}, each "MSS"&Text.From(_)),
    #"Split Columns By Delimiter" = Table.SplitColumn(#"Reordered Columns","For Split",Splitter.SplitTextByDelimiter("|"), #"List Of MSS Columns"),
    #"Removed Columns" = Table.RemoveColumns(#"Split Columns By Delimiter",{"Split Count"})
in
    #"Removed Columns"

Could you please guide me out of this? You can easily tell I am new to Power Query, so I very much appreciate any help that I can get. Thank you!

1

1 Answers

1
votes

This code should work:

let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    temp = Table.AddColumn(Source, "temp", each
[a = List.Select(Text.Split([Substring], "|"), (sub)=>Text.Contains([Main String], sub)),
b = List.Transform(a, each Text.From(Text.Length(_))),
c = Table.AddColumn(Table.AddIndexColumn(Table.FromList(a), "i", 1, 1), "key", each "MSS" & Text.From([i])),
d = Table.AddColumn(Table.AddIndexColumn(Table.FromList(b), "i", 1, 1), "key", each "ML" & Text.From([i])),
e = Table.Sort(Table.Combine({c,d}),{"i"}),
f = Table.Pivot(e[[key],[Column1]], List.Distinct(e[key]), "key", "Column1"),
count = List.Count(a)][[f],[count]]),
    expand = Table.ExpandRecordColumn(temp, "temp", {"count", "f"}),
    final = Table.ExpandTableColumn(expand, "f", List.Buffer(Table.ColumnNames(Table.Sort(expand,{"count", 1}){0}[f])))
in
    final

first

In the decreasing order of lengths:

let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    temp = Table.AddColumn(Source, "temp", each
[a = List.Select(Text.Split([Substring], "|"), (sub)=>Text.Contains([Main String], sub)),
b = Table.TransformColumnTypes(Table.Sort(Table.AddColumn(Table.FromList(a), "len", each Text.Length([Column1])), {{"len", 1},{"Column1",0}}), {"len", type text}),
c = Table.AddColumn(Table.AddIndexColumn(Table.FromList(b[Column1]), "i", 1, 1), "key", each "MSS" & Text.From([i])),
d = Table.AddColumn(Table.AddIndexColumn(Table.FromList(b[len]), "i", 1, 1), "key", each "ML" & Text.From([i])),
e = Table.Sort(Table.Combine({c,d}),{"i"}),
f = Table.Pivot(e[[key],[Column1]], List.Distinct(e[key]), "key", "Column1"),
count = List.Count(a)][[f],[count]]),
    expand = Table.ExpandRecordColumn(temp, "temp", {"count", "f"}),
    final = Table.ExpandTableColumn(expand, "f", List.Buffer(Table.ColumnNames(Table.Sort(expand,{"count", 1}){0}[f])))
in
    final

second