0
votes

I have two inputs in a dataframe, and I need to create an output that depends on both inputs (same row, different columns), but also on its previous value (same column, previous row).

This dataframe command will create an example of what I need:

df=pd.DataFrame([[0,0,0], [0,1,0], [0,0,0], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0], [0,1,0], [1,1,1], [1,1,1], [0,1,1], [0,1,1], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0]], columns=['input_1', 'input_2', 'output'])

The rules are simple:

  • If input_1 is 1, output is 1 (input_1 is a trigger function)
  • output will remain as 1 as long as input_2 is also 1. (input_2 works kind of like a memory function)
  • For all the others, output will be 0

The rows are in chronological order: the output of row 0 influences the output of row 1, the output of row 1 influences the output of row 2, and so on. So output depends on input_1 and input_2, but also on its own previous value.

I could code it looping through the dataframe, computing and assigning values using iloc, but it is painfully slow. I need to run this through many thousands of rows for tens of thousands of dataframes, so I am looking for the most efficient way to do it (preferably vectorization). It can be with numpy or other library/method that you know.
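For reference, the loop-based baseline could look roughly like the sketch below. This is not the exact code from the question: the function name `latch_loop` is made up for illustration, the initial output is assumed to be 0, and it iterates over plain lists rather than using iloc.

```python
import pandas as pd

def latch_loop(input_1, input_2):
    """Apply the trigger/memory rules row by row (hypothetical helper)."""
    out = []
    prev = 0  # assumption: the output before the first row is 0
    for trig, mem in zip(input_1, input_2):
        # The trigger sets the output; otherwise it survives only while memory is 1.
        prev = 1 if trig == 1 or (mem == 1 and prev == 1) else 0
        out.append(prev)
    return out

# First eight rows of the example inputs from the question.
df = pd.DataFrame(
    [[0, 0], [0, 1], [0, 0], [1, 1], [0, 1], [0, 1], [0, 0], [0, 1]],
    columns=['input_1', 'input_2'],
)
df['output'] = latch_loop(df['input_1'].tolist(), df['input_2'].tolist())
print(df)
```

Each row needs the previous result, which is why a straightforward column-wise expression does not work here.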

I searched and found some questions about vectorization and row-looping, but I still don't see how to use those techniques. Example questions: How to iterate over rows in a DataFrame in Pandas?. Also this one, What is the most efficient way to loop through dataframes with pandas?

I appreciate your help

2
Rule number 4 conflicts with the original statement. You say that the output depends on inputs #1 and #2 and on its previous value. In rule #4 you say that the output also depends on the previous value of input #2. Please specify which of the statements is correct. - Sergey
Thanks for the clarification. In this case, rule #4 is unnecessary. Try writing the conditions as a three-bit number (input #1, input #2, prev. output). Let's translate your rules into the language of numbers. Rule #1: output will be 1 if the number is greater than 3 (combinations 100 101 110 111). Rule #2: output will be 1 if the number is 3 (011). Rule #3: output is zero in the remaining cases, that is, if the number is less than 3 (combinations 000 001 010). Rule #4 says that if the number is 2 (010), then the output will be 0, but we already know this from rule #3. - Sergey
Hi @Sergey. You have a point in translating the rules into numbers. However, "previous output" belongs to a different row, and that is exactly what I want to overcome in order to vectorize the solution. I don't want to put it in a horizontal rule, because that is not what I have. I will delete rule #4, as it repeats something already said. Thanks!!! - xiaxio
Hi @xiaxio, did I understand correctly that you initially have only two columns, with zero as the initial output? - Sergey
Hi @Sergey. I have 'input_1' and 'input_2' columns. I need to generate the 'output' column. I cannot use it as you did in your answer. - xiaxio
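The three-bit encoding discussed in the comments above can be written out as a small truth table. This is only an illustrative sketch; `next_output` is a hypothetical name, and the bit order (input #1, input #2, previous output) follows Sergey's comment.

```python
def next_output(input_1, input_2, prev_output):
    """Return the new output for one row, using the three-bit encoding."""
    code = (input_1 << 2) | (input_2 << 1) | prev_output
    # Codes 3..7 give 1 (rules #1 and #2); codes 0..2 give 0 (rule #3).
    return 1 if code >= 3 else 0

# Enumerate all eight combinations as 'abc' -> output.
table = {
    format(code, '03b'): next_output((code >> 2) & 1, (code >> 1) & 1, code & 1)
    for code in range(8)
}
for bits, out in table.items():
    print(bits, '->', out)
```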

2 Answers

3
votes

If I understand you right, you want to know how to compute column output. You can do for example:

df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)

Prints:

    input_1  input_2  output  output_2
0         0        0       0         0
1         0        1       0         0
2         0        0       0         0
3         1        1       1         1
4         0        1       1         1
5         0        1       1         1
6         0        0       0         0
7         0        1       0         0
8         0        1       0         0
9         1        1       1         1
10        1        1       1         1
11        0        1       1         1
12        0        1       1         1
13        1        1       1         1
14        0        1       1         1
15        0        1       1         1
16        0        0       0         0
17        0        1       0         0
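
To see why the one-liner works, it may help to split it into steps. The sketch below uses the first eight rows of the example; the intermediate names (`total`, `gaps`, `filled`) are only for illustration. Note that a row with input_1 = 1 and input_2 = 0 also sums to 1, so this assumes that combination does not occur in the data, and that the first row does not sum to 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 0], [0, 1], [0, 0], [1, 1], [0, 1], [0, 1], [0, 0], [0, 1]],
    columns=['input_1', 'input_2'],
)

# 0 means "reset", 2 means "set", 1 means "keep the previous output".
total = df['input_1'] + df['input_2']

# Mark the "keep" rows as missing so they can be filled from the row above.
gaps = total.replace(1, np.nan)

# Forward fill carries the last set/reset value into the "keep" rows.
filled = gaps.ffill()

# Map the "set" marker 2 back to an output of 1.
result = filled.replace(2, 1).astype(int)
print(result.tolist())
```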
1
votes

As you explained in the discussion above, we have just two inputs, loaded into a pandas dataframe:

df=pd.DataFrame([[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])

We have to create the outputs using the following rules:

#1 if input_1 is one, the output is one
#2 if both inputs are zero, the output is zero
#3 if input_1 is zero and input_2 is one, the output holds its previous value
#4 the initial output value is zero

To generate the outputs we can:

  1. copy input_1 to the output
  2. overwrite the output with its previous value if input_1 is zero and input_2 is one

Because of rule #4, the first output does not need to be updated: after step 1 it is already correct.

df['output'] = df.input_1

for idx, row in df.iterrows():
   if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
       # use .loc instead of chained indexing, which may not write back to df
       df.loc[idx, 'output'] = df.loc[idx-1, 'output']

print(df)

The output is:

>>> print(df)
    input_1  input_2  output
0         0        0       0
1         0        1       0
2         0        0       0
3         1        1       1
4         0        1       1
5         0        1       1
6         0        0       0
7         0        1       0
8         0        1       0
9         1        1       1
10        1        1       1
11        0        1       1
12        0        1       1
13        1        1       1
14        0        1       1
15        0        1       1
16        0        0       0
17        0        1       0

UPDATE1

A faster way to do it is a modification of the formula proposed by @Andrej:

df['output_2'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)

Without the modification, his formula produces the wrong output for the input combination [1, 0]: it holds the previous output instead of setting it to 1.
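
If you would rather stay in NumPy, the same forward-fill idea can be sketched with np.maximum.accumulate. This is a swapped-in technique, not part of the formula above; the function name `latch_output` is hypothetical, and an output of 0 before the first row is assumed, so a leading row with input_1 = 0 and input_2 = 1 yields 0.

```python
import numpy as np

def latch_output(input_1, input_2):
    """Vectorized set/hold/reset latch (illustrative sketch)."""
    a = np.asarray(input_1).astype(bool)
    b = np.asarray(input_2).astype(bool)
    # "Hold" rows keep the previous output: input_1 == 0 and input_2 == 1.
    hold = ~a & b
    # Every other row decides its own output, so it keeps its own index.
    idx = np.where(hold, -1, np.arange(len(a)))
    # Running maximum = index of the most recent deciding row (-1 if none yet).
    last = np.maximum.accumulate(idx)
    # A deciding row outputs 1 exactly when input_1 is 1; no deciding row yet means 0.
    out = np.where(last >= 0, a[np.maximum(last, 0)], False)
    return out.astype(int)

inputs = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0, 1], [0, 0], [0, 1]])
print(latch_output(inputs[:, 0], inputs[:, 1]).tolist())
```

Usage would be something like df['output'] = latch_output(df['input_1'], df['input_2']).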

UPDATE2

This is just to compare the results:

df=pd.DataFrame([[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])

df['output'] = df.input_1
for idx, row in df.iterrows():
   if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
       # use .loc instead of chained indexing, which may not write back to df
       df.loc[idx, 'output'] = df.loc[idx-1, 'output']

df['output_1'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)

The result is:

>>> print(df)
    input_1  input_2  output  output_1  output_2
0         0        0       0         0         0
1         1        0       1         1         0
2         0        1       1         1         0
3         1        1       1         1         1
4         0        1       1         1         1
5         0        1       1         1         1
6         0        0       0         0         0
7         0        1       0         0         0
8         0        1       0         0         0
9         1        1       1         1         1
10        1        1       1         1         1
11        0        1       1         1         1
12        0        1       1         1         1
13        1        1       1         1         1
14        0        1       1         1         1
15        0        1       1         1         1
16        0        0       0         0         0
17        0        1       0         0         0
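
A single hand-made table only covers a few cases, so a randomized check may be more convincing. The sketch below compares a plain loop against the weighted formula on random 0/1 inputs. The helper names are made up, and the `.fillna(0)` step is my own addition (an assumption that the output before the first row is 0) so that a dataframe starting with [0, 1] does not leave a NaN behind.

```python
import numpy as np
import pandas as pd

def loop_reference(df):
    """Row-by-row reference implementation of the rules."""
    out, prev = [], 0
    for a, b in zip(df['input_1'].to_numpy(), df['input_2'].to_numpy()):
        if a == 1:
            prev = 1      # trigger sets the output
        elif b == 0:
            prev = 0      # both inputs zero resets it
        out.append(prev)  # otherwise the previous value is held
    return out

def weighted_ffill(df):
    """Weighted-sum formula, with fillna(0) added for a leading hold row."""
    code = df['input_1'] + df['input_2'] * 2
    filled = code.replace(2, np.nan).ffill().fillna(0)
    return filled.replace(3, 1).astype(int).tolist()

rng = np.random.default_rng(0)
mismatches = 0
for _ in range(200):
    n = int(rng.integers(1, 50))
    data = rng.integers(0, 2, size=(n, 2))
    df = pd.DataFrame(data, columns=['input_1', 'input_2'])
    if loop_reference(df) != weighted_ffill(df):
        mismatches += 1
print('mismatching dataframes:', mismatches)
```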