0
votes

My dataset is a sales transactions history of an online store consists of 541000+ records. The head looks like this:

   InvoiceNo StockCode                          Description  Quantity          InvoiceDate  UnitPrice  CustomerID  TotalSales
0     536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2018-11-29 08:26:00       2.55       17850       15.30
1     536365     71053                  WHITE METAL LANTERN         6  2018-11-29 08:26:00       3.39       17850       20.34
2     536365    84406B       CREAM CUPID HEARTS COAT HANGER         8  2018-11-29 08:26:00       2.75       17850       22.00
3     536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2018-11-29 08:26:00       3.39       17850       20.34
4     536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2018-11-29 08:26:00       3.39       17850       20.34

Upon data exploration, I found out that there are cancelled invoices with matching original invoice, but the original InvoiceNo and cancelled InvoiceNo are not the same. Cancelled invoices are indicated by InvoiceNo starting with C.

  InvoiceNo  StockCode                     Description  Quantity          InvoiceDate  UnitPrice  CustomerID  TotalSales
0    541431      23166  MEDIUM CERAMIC TOP STORAGE JAR     74215  2019-01-16 10:01:00       1.04       12346     77183.6
1   C541433      23166  MEDIUM CERAMIC TOP STORAGE JAR    -74215  2019-01-16 10:17:00       1.04       12346    -77183.6
2    581483      23843     PAPER CRAFT , LITTLE BIRDIE     80995  2019-12-07 09:15:00       2.08       16446    168469.6
3   C581484      23843     PAPER CRAFT , LITTLE BIRDIE    -80995  2019-12-07 09:27:00       2.08       16446   -168469.6

Now before I perform any analysis, I want to remove those invoices that are cancelled, meaning both the original invoice and its corresponding cancelled invoice.

Since the cancellation InvoiceNo is not the same InvoiceNo as the original order, and due to the limitation of my dataset, the only way to find the matching order and cancelled order that I can think of, is to generate records were the columns StockCode, Description, UnitPrice, CustomerID and absolute Quantity are duplicated. Some orders are cancelled after a few days so InvoiceDate is not included. However when I run the code below, it returns 244747 records.

df2 = df_clean[df_clean.duplicated(['StockCode', 'Description', 'UnitPrice', 
                                               'CustomerID', 'Quantity_ABS'], keep=False)]

The reason is that, each invoice could contain multiple lines of StockCode and a customer orders the same StockCode with the same quantity across the coverage of the dataset. There are only 3679 cancellation invoices so that means the original invoice matching the cancelled invoice should also be about the same number.

Now I sorted my df2 by CustomerID then by StockCode, then by Quantity and set inplace=True for easy tracking and viewing.

df2.sort_values(by=['CustomerID', 'StockCode', 'Quantity'], inplace=True)

What I want to do is for every customerID, delete only the records that has a cancelled invoice (InvoiceNo that starts with C) as well as the matching order.

For example, this how it shows if I display all the records for CustomerID 15749:

df2[df2['CustomerID'] == '15749']
  InvoiceNo StockCode                         Description  Quantity          InvoiceDate  UnitPrice  CustomerID  TotalSales
0   C550456     21108  FAIRY CAKE FLANNEL ASSORTED COLOUR     -3114  2019-04-16 13:08:00       2.10       15749     -6539.4
1    540815     21108  FAIRY CAKE FLANNEL ASSORTED COLOUR      3114  2019-01-09 12:55:00       2.10       15749      6539.4
2    550461     21108  FAIRY CAKE FLANNEL ASSORTED COLOUR      3114  2019-04-16 13:20:00       2.10       15749      6539.4
3   C550456     21175         GIN + TONIC DIET METAL SIGN     -2000  2019-04-16 13:08:00       1.85       15749     -3700.0
4    540815     21175         GIN + TONIC DIET METAL SIGN      2000  2019-01-09 12:55:00       1.85       15749      3700.0
5   C550456     48185                  DOORMAT FAIRY CAKE      -670  2019-04-16 13:08:00       6.75       15749     -4522.5
6    540818     48185                  DOORMAT FAIRY CAKE       670  2019-01-09 12:57:00       6.75       15749      4522.5
7   C550456    85123A  WHITE HANGING HEART T-LIGHT HOLDER     -1930  2019-04-16 13:08:00       2.55       15749     -4921.5
8    540815    85123A  WHITE HANGING HEART T-LIGHT HOLDER      1930  2019-01-09 12:55:00       2.55       15749      4921.5

So here, I want to delete the records in pairs: the InvoiceNo that starts with C and the corresponding sale because they offset each other. So But InvoiceNo 550461 will remain as it is a valid sale (or not cancelled).

Another example CustomerID 16013:

  InvoiceNo StockCode                         Description  Quantity          InvoiceDate  UnitPrice  CustomerID  TotalSales
0   C539329     20733              GOLD MINI TAPE MEASURE      -300  2018-12-15 09:37:00       0.72       16013      -216.0
1    539109     20733              GOLD MINI TAPE MEASURE       300  2018-12-14 11:07:00       0.72       16013       216.0
2    539331     20733              GOLD MINI TAPE MEASURE       300  2018-12-15 09:39:00       0.72       16013       216.0
3    546521     21471        STRAWBERRY RAFFIA FOOD COVER        48  2019-03-12 12:34:00       2.95       16013       141.6
4    551618     21471        STRAWBERRY RAFFIA FOOD COVER        48  2019-05-01 11:42:00       2.95       16013       141.6
5    580763     79321                       CHILLI LIGHTS        96  2019-12-04 10:36:00       4.95       16013       475.2
6   C539329    85123A  WHITE HANGING HEART T-LIGHT HOLDER      -500  2018-12-15 09:37:00       2.55       16013     -1275.0
7    543675    85123A  WHITE HANGING HEART T-LIGHT HOLDER       150  2019-02-09 11:31:00       2.55       16013       382.5
8    555075    85123A  WHITE HANGING HEART T-LIGHT HOLDER       150  2019-05-29 12:05:00       2.55       16013       382.5
9    539109    85123A  WHITE HANGING HEART T-LIGHT HOLDER       500  2018-12-14 11:07:00       2.55       16013      1275.0

Here, only Invoice C539329/539109 and C539329/539109 will be deleted.

I can't seem to figure out how to do this programmatically and efficiently. Although I can do this manually in Excel, it will take too much time and is not allowed for the project.

Please advise if this can be done in python and how. Also if you have any other ideas on how to solve this other than what I have already done, please advise. It is highly appeciated.

Thank you in advance.