1
votes

I have a problem in power query where my data is coming from a report that is split into pages and some of the pages skew the data to different columns. I think there may be an error based solution, but I would like it to be more redundant and not rely on text vs. number error correction. Mainly because sometimes the data that could be alphabetic in some instances, can be numeric in others. I've prepared a data set that has randomly generated replacements for names and codes. I also had to butcher the data a little to give examples of the different shifts, and to account for records split from different pages.

https://drive.google.com/file/d/0B2qUbAWJXgfyNlByV2RHODJzQjA/view?usp=sharing

There are 12 records in the data set that will eventually contain one row per record. 1st page is the Raw data stripped from the source document. These are Check History records (masked) that need to be moved to a single row per record with separate columns for four specific areas:

[Names, Dates, Check numbers, etc][Earnings][Deductions][Taxes]

Record Info including Names, Dates, Record ID Numbers, and amounts is the fist thing extracted and formatted from the raw data. The steps I applied in NameData and CheckData will show how those records are extracted and formatted, also some of the skewed data in this section was easy to reconcile with merge functions and conditional columns.

Each individual Pay Item (An earning code, Deduction Code, Or Tax Code) is formatted then pivoted to it's own column. You can see an example of this maneuver in the Earnings Query. The PayItemReference query is some basic filters I use as a starting point to My Pay Items. You can see in that Query that the codes will shift from column to column, with Text and Numbers mixed. There can be spaces between the codes and their values, or there can be no space, it can also shift columns completely.

I am working on consolidating codes and their values to regular columns, then I can merge, unpivot, pivot etc to get to the final formatting. I have tried using conditional columns and errors, but there are always small issues with either on the original data set. I just need some fresh eyes and new approaches to the data.

1
You need to explain in exact detail how your data can look like and how it should be split into columns. In your example I don't see any shifts at all as all rows have 7 positions. Also for numbers it may be relevant if there are any of ,.- or currency symbols and the culture of the data may also be relevant (e.g. for dates 01/02/2017 is February 1 in the UK and January 2nd in the US).MarcelBeug
I hope my edits make the data a little more presentable. all numbers are usually decimal numbers as they can either represent $ currency or Decimal hours records depending on context.CRSPLK
This is not enough. Without clear specifications how to determine which part of the input should be put in which column, with all possible alternative shifts, we won't be able to help you.MarcelBeug
It is hard to describe the original data set, I apologize, I also cannot share the data as it is HIPAA regulated. Let me see what I can do to get you a more clear example, along with what I intend to do with it.CRSPLK
@CRSPLK Too bad you didn't use @ to address you comment to me - I saw it just today. Anyway, hope my solution helps. It may require some fine-tuning though.Eugene

1 Answers

2
votes

This was a challenging task.

First it is good idea to split table back into pages, since column structure for each page is probably unique. Thus I form list of tables, each table for one page. Then I have to process each page: extract column names, add summary information for each row, filter not needed rows, and set column names. This is done for each table in the list by using custom function ConvertTable. Afterwards you just combine resulting tables.

Here:

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    AddRowNum = Table.AddColumn(Table.AddIndexColumn(Source, "Index", 1, 1), "RowNum", each Number.Mod([Index]-1, 52)+1, type number),
    CountTables = {1..(Number.RoundUp(Table.RowCount(AddRowNum)/52, 0))},
    ListTables = List.Transform(CountTables, (ListItem)=>Table.SelectRows(AddRowNum, each [Index] > 52 * (ListItem - 1) and [Index] <= 52 * ListItem)),

    ConvertTable = (tbl as table) as table =>
    let
        hdr1 = Table.Transpose(Table.FillDown(Table.Transpose(Table.FromRecords({tbl{6}})), {"Column1"})),
        hdr2 = Table.FromRecords({tbl{7}}),
        ColNames = Table.Transpose(Table.SelectColumns(Table.FirstN(Table.AddColumn(Table.Transpose(Table.Combine({hdr1, hdr2})), "ColumnName", each [Column1] & ": " & [Column2]), 19), {"ColumnName"})),
        AddPayDate = Table.AddColumn(tbl, "Pay Date", each if [RowNum] > 8 and Text.Trim(tbl{[RowNum]-2}[Column9]) = "Pay Date" then [Column9] else null, type date),
        AddPeriodEndDate = Table.AddColumn(AddPayDate, "Period End Date", each if [RowNum] > 8 and Text.Trim(tbl{[RowNum]-2}[Column12]) = "Period End Date" then [Column12] else null, type date),
        AddJobCode = Table.AddColumn(AddPeriodEndDate, "Job Code", each if [RowNum] > 8 and Text.Trim(tbl{[RowNum]-2}[Column14]) = "Job Code" then [Column14] else null, Int64.Type),
        AddCheckInfo = Table.AddColumn(AddJobCode, "Check Info", each if [RowNum] > 8 and Text.Trim([Column1]) = "Check Printed:" then Table.Transpose(Table.SelectRows(Table.Transpose(Table.FromRecords({_})), each [Column1] <> null)) else null),
        ExpandedCheckInfo = Table.ExpandTableColumn(AddCheckInfo, "Check Info", {"Column4", "Column6", "Column8"}, {"Check Amount", "Direct Deposit", "Net"}),
        FillUp = Table.FillUp(ExpandedCheckInfo, {"Column3", "Check Amount", "Direct Deposit", "Net"})//Table.AddColumn(AddJobCode, "tmp2", each if [RowNum] < 9 then "" else (if Text.Trim([Column1]) = "Check Printed:" then (if [Column3] = null then -1 else [Column3]) else null), type text), {"tmp2"}),
        FillDown = Table.FillDown(FillUp, {"Column1", "Column5", "Pay Date", "Period End Date", "Job Code"}),
        AddCheckEEIDfixed = Table.AddColumn(FillDown, "Check:EEID.fixed", each Text.From([Column5]) & ":" & Text.From([Column3]), type text),
        FilteredExtraRows = Table.SelectRows(AddCheckEEIDfixed, each [RowNum] > 8 and Text.Trim([Column1]) <> "Check Printed:" and Text.Trim([Column7]) <> "PerControl" and Text.Trim(tbl{[RowNum]-2}[Column7]) <> "PerControl" and [#"Check:EEID.fixed"] <> null),
        DemotedHeaders = Table.DemoteHeaders(FilteredExtraRows),
        GetColumnNames1 = Table.Combine({Table.FromRecords({DemotedHeaders{0}}), ColNames}),
        GetColumnNames2 = Table.PromoteHeaders(Table.FillDown(GetColumnNames1, Table.ColumnNames(GetColumnNames1))),
        SetColumnNames = Table.PromoteHeaders(Table.Combine({GetColumnNames2, FilteredExtraRows}))
    in
        SetColumnNames,

    ConvertedList = List.Transform(ListTables, (t) => ConvertTable(t)),
    GetWholeTable = Table.Combine(ConvertedList)
in
    GetWholeTable