
I want to do the following:

  1. Get the min and max of every measurement column within the same label (range of rows)
  2. Define a range of interesting values (e.g. maximum * 0.6 up to maximum)
  3. Check, for every measurement column within the same label, whether each value lies in this interval (=True) or not (=False)
  4. Replace values outside this interval with NaN

I have a df like this:

import pandas as pd

df = pd.DataFrame({'Measurement_1': [1, 2, 7, 9, 34, 67, 50], 'Measurement_2': [3, 4, 30, 22, 55, 13, 45], 'Label': ['cable', 'cable', 'cable', 'wood', 'wood', 'wood', 'wood']})

I'm getting the maximum and minimum value for every measurement and every label as a separate df, and I define maximum_set and minimum_set for the subsequent conditions like this:

df_sort = df.groupby('Label')
measurement_values = df.columns.tolist()
measurement_values.remove('Label')

labels = df['Label'].unique().tolist()

df_sort = df_sort[measurement_values]

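# min/max per label for every measurement column; the aggregation
# already covers all groups, so no loop over the groups is needed
minimum = df_sort.min()
maximum = df_sort.max()
maximum_set = maximum
minimum_set = maximum * 0.6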

Now I want to iterate over the original df to get a "measurement_filtered" column with a boolean value for every measurement and every label (e.g. check whether each Measurement_1 value lies between minimum_set and maximum_set). I'm doing this with the following loop:

for measurement in measurement_values:
    for label in labels:
        df[measurement + '_filtered'] = df.loc[df['Label'] == label, measurement].between(minimum_set.loc[label][measurement], maximum_set.loc[label][measurement], inclusive='both')

This loop gives the correct boolean entries for every measurement and label. The problem is that each pass of the for loop with .loc overwrites the results of the previous iterations. So the final _filtered columns only include the last label iteration (in this example 'wood'); the rest of each _filtered column is empty (NaN). I don't know how to fix this problem.


1 Answer


Instead of a for loop, we can use a faster, vectorized approach. Here is the annotated code:

# select the measurement cols    
cols = df.filter(like='Measure')

# groupby label and find the max value per grp
max_ = cols.groupby(df['Label']).transform('max')

# Create a boolean condition
cond = cols.ge(max_ * 0.6) & cols.le(max_)

# Mask the values with NaN where the above condition is False.
# assign returns a new dataframe; bind it to a name to keep the result
df.assign(**cols.where(cond))

   Measurement_1  Measurement_2  Label
0            NaN            NaN  cable
1            NaN            NaN  cable
2            7.0           30.0  cable
3            NaN            NaN   wood
4            NaN           55.0   wood
5           67.0            NaN   wood
6           50.0           45.0   wood
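
The question also asks for a boolean "_filtered" column per measurement (step 3). A minimal sketch of one way to get it from the same condition, assuming the sample data from the question; the names flags and result are illustrative and not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({
    'Measurement_1': [1, 2, 7, 9, 34, 67, 50],
    'Measurement_2': [3, 4, 30, 22, 55, 13, 45],
    'Label': ['cable', 'cable', 'cable', 'wood', 'wood', 'wood', 'wood'],
})

# same steps as above: measurement columns, per-label max, interval check
cols = df.filter(like='Measure')
max_ = cols.groupby(df['Label']).transform('max')
cond = cols.ge(max_ * 0.6) & cols.le(max_)

# cond is already a boolean dataframe aligned with df, so renaming its
# columns gives one "<measurement>_filtered" flag column per measurement
flags = cond.add_suffix('_filtered')

# join on the index; the original measurement values stay untouched
result = df.join(flags)
print(result)
```

Because the flags are computed for all labels at once, nothing is overwritten between labels, which is the problem the nested loop in the question runs into.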