0
votes

I am trying to cleaning up some data in a huge dataset. One column holds values for the Sales aamount. Example values could be like those:

Clean Data:

Sales Potential

230

120

300

However, at some points there appear something like this:

Dirty Data

0, 0, 0, 0, 0

4, 0, 0, 0

0, 0, 480

0, 200, 0

In the first case of the dirty data the cell shoul only contains a zero: 0 In all other cases I would like to extract, if there is any non-zero number, this number and replace the cell with this value or add a new cleanded-column.

So the dirty data cleaned up:

Cleaned Data:

0

4

480

200

My approach was using RegExpressions in R as I am loading the data into Power-BI using Power-Query. I tried to find a pattern where I extract the value I am looking for and place it in a new column. However, my resutls looks like nothing. Is there maybe a much simpler approach to achieve this in R?

Code so far: library(stringr) OutputRegEx <- data.frame(MyDataset) Splitter = function(x) substr(str_extract(x,'[1-9]'),1,7) OutputRegEx[["RegExAuswertung"]] <- apply(OutputRegEx[43],1, function(x) Splitter(x) )

3

3 Answers

1
votes

In Powerquery, insert a custom column with below formula

=List.Max(List.Transform(Text.Split(Text.From([Sales Potential]),","), each Number.FromText(_)))

The formula splits everything on commas, puts into a list, converts the list from text into numbers, then takes the maximum number from the list.

1
votes

This R solution seems to do what you want:

SalesPotential <- c("0, 0, 0, 0, 0", "4, 0, 0, 0","0, 0, 480","0, 200, 0")

library(stringr)    
str_extract(gsub(",", "", SalesPotential), "(?=(0\\s){4})\\d+|[1-9]+(0{1,})?")
[1] "0"   "4"   "480" "200"

Using gsubthis solution first removes the commas in gsub(",", "", SalesPotential) submitting this edited vector to str_extract. It then goes on to define two patterns, one for values where there are no other numbers but 0, another for values that start with non-0digits and may have one or more 0s at the end.

If you want to have clean numbers, convert to numeric:

as.numeric(str_extract(gsub(",", "", SalesPotential), "(?=(0\\s){4})\\d+|[1-9]+(0{1,})?"))
[1]   0   4 480 200
0
votes

Well, you can achieve the desired result in Power Query itself either by using M-formula language or, by using the GUI itself.

Let me tell you the simplest approach.

If I am correct then, the column has some cleaned numbers and some comma delimited numbers.

So what you do is

Split the column by comma for each occurance.

So, you will get (n+1)-number of columns if the maximum no. Of comman in any cell is "n"

Now, you have to create a conditional column that checks for numbers greater than zero in all these columns and gives the output.

Bhmy doing so,you will get non-zero numbers in that calculated column for dirty data and the same number for the cleaned data.

After doing that you can delete all those comma delimited columns and keep the conditional column only.

Now the formula should be as follows :

if delcol1 <> 0 then delcol1 elseif delcol2 <> 0 then delcol2 elseif.......

delcol2 <> 0 then delcoln

This is the easiest way out of the probelm that I can think of.

However, there are other alternatives also for getting the same answer.