My dataset is a sales transactions history of an online store consists of 541000+ records. The head looks like this:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID TotalSales
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2018-11-29 08:26:00 2.55 17850 15.30
1 536365 71053 WHITE METAL LANTERN 6 2018-11-29 08:26:00 3.39 17850 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2018-11-29 08:26:00 2.75 17850 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2018-11-29 08:26:00 3.39 17850 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2018-11-29 08:26:00 3.39 17850 20.34
Upon data exploration, I found out that there are cancelled invoices with matching original invoice, but the original InvoiceNo
and cancelled InvoiceNo
are not the same. Cancelled invoices are indicated by InvoiceNo
starting with C
.
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID TotalSales
0 541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 74215 2019-01-16 10:01:00 1.04 12346 77183.6
1 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 2019-01-16 10:17:00 1.04 12346 -77183.6
2 581483 23843 PAPER CRAFT , LITTLE BIRDIE 80995 2019-12-07 09:15:00 2.08 16446 168469.6
3 C581484 23843 PAPER CRAFT , LITTLE BIRDIE -80995 2019-12-07 09:27:00 2.08 16446 -168469.6
Now before I perform any analysis, I want to remove those invoices that are cancelled, meaning both the original invoice and its corresponding cancelled invoice.
Since the cancellation InvoiceNo
is not the same InvoiceNo
as the original order, and due to the limitation of my dataset, the only way to find the matching order and cancelled order that I can think of, is to generate records were the columns StockCode, Description, UnitPrice, CustomerID and absolute Quantity
are duplicated. Some orders are cancelled after a few days so InvoiceDate
is not included. However when I run the code below, it returns 244747 records.
df2 = df_clean[df_clean.duplicated(['StockCode', 'Description', 'UnitPrice',
'CustomerID', 'Quantity_ABS'], keep=False)]
The reason is that, each invoice could contain multiple lines of StockCode and a customer orders the same StockCode with the same quantity across the coverage of the dataset. There are only 3679 cancellation invoices so that means the original invoice matching the cancelled invoice should also be about the same number.
Now I sorted my df2 by CustomerID
then by StockCode
, then by Quantity
and set inplace=True
for easy tracking and viewing.
df2.sort_values(by=['CustomerID', 'StockCode', 'Quantity'], inplace=True)
What I want to do is for every customerID
, delete only the records that has a cancelled invoice (InvoiceNo that starts with C) as well as the matching order.
For example, this how it shows if I display all the records for CustomerID 15749:
df2[df2['CustomerID'] == '15749']
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID TotalSales
0 C550456 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR -3114 2019-04-16 13:08:00 2.10 15749 -6539.4
1 540815 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2019-01-09 12:55:00 2.10 15749 6539.4
2 550461 21108 FAIRY CAKE FLANNEL ASSORTED COLOUR 3114 2019-04-16 13:20:00 2.10 15749 6539.4
3 C550456 21175 GIN + TONIC DIET METAL SIGN -2000 2019-04-16 13:08:00 1.85 15749 -3700.0
4 540815 21175 GIN + TONIC DIET METAL SIGN 2000 2019-01-09 12:55:00 1.85 15749 3700.0
5 C550456 48185 DOORMAT FAIRY CAKE -670 2019-04-16 13:08:00 6.75 15749 -4522.5
6 540818 48185 DOORMAT FAIRY CAKE 670 2019-01-09 12:57:00 6.75 15749 4522.5
7 C550456 85123A WHITE HANGING HEART T-LIGHT HOLDER -1930 2019-04-16 13:08:00 2.55 15749 -4921.5
8 540815 85123A WHITE HANGING HEART T-LIGHT HOLDER 1930 2019-01-09 12:55:00 2.55 15749 4921.5
So here, I want to delete the records in pairs: the InvoiceNo
that starts with C
and the corresponding sale because they offset each other. So But InvoiceNo 550461
will remain as it is a valid sale (or not cancelled).
Another example CustomerID 16013
:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID TotalSales
0 C539329 20733 GOLD MINI TAPE MEASURE -300 2018-12-15 09:37:00 0.72 16013 -216.0
1 539109 20733 GOLD MINI TAPE MEASURE 300 2018-12-14 11:07:00 0.72 16013 216.0
2 539331 20733 GOLD MINI TAPE MEASURE 300 2018-12-15 09:39:00 0.72 16013 216.0
3 546521 21471 STRAWBERRY RAFFIA FOOD COVER 48 2019-03-12 12:34:00 2.95 16013 141.6
4 551618 21471 STRAWBERRY RAFFIA FOOD COVER 48 2019-05-01 11:42:00 2.95 16013 141.6
5 580763 79321 CHILLI LIGHTS 96 2019-12-04 10:36:00 4.95 16013 475.2
6 C539329 85123A WHITE HANGING HEART T-LIGHT HOLDER -500 2018-12-15 09:37:00 2.55 16013 -1275.0
7 543675 85123A WHITE HANGING HEART T-LIGHT HOLDER 150 2019-02-09 11:31:00 2.55 16013 382.5
8 555075 85123A WHITE HANGING HEART T-LIGHT HOLDER 150 2019-05-29 12:05:00 2.55 16013 382.5
9 539109 85123A WHITE HANGING HEART T-LIGHT HOLDER 500 2018-12-14 11:07:00 2.55 16013 1275.0
Here, only Invoice C539329/539109
and C539329/539109
will be deleted.
I can't seem to figure out how to do this programmatically and efficiently. Although I can do this manually in Excel, it will take too much time and is not allowed for the project.
Please advise if this can be done in python and how. Also if you have any other ideas on how to solve this other than what I have already done, please advise. It is highly appeciated.
Thank you in advance.