I have a JSON file that looks like this:
{
"deviceAction": "",
"userID": "001",
"name": "FAILED",
"sourceUserName": "",
"@timestamp": ""
},
{
"deviceAction": "",
"userID": "001",
"name": "FAILED",
"sourceUserName": "",
"@timestamp": ""
},
{
"deviceAction": "",
"userID": "002",
"name": "FAILED",
"sourceUserName": "",
"@timestamp": ""
}
This file has over 40k objects and most of them are duplicates.
What I am trying to do is keep only the unique values, so the output would be similar to this:
{
"deviceAction": "",
"userID": "001",
"name": "FAILED",
"sourceUserName": "",
"@timestamp": ""
},
{
"deviceAction": "",
"userID": "002",
"name": "FAILED",
"sourceUserName": "",
"@timestamp": ""
}
Using pandas, I tried this code:
import pandas as pd

# load the JSON file into a DataFrame
data3 = pd.read_json("./sept-count/all_data.json")
# sort by userID, then keep only the first row for each userID value
data3.sort_values("userID", inplace=True)
data_unique = data3.drop_duplicates(subset=["userID"], keep="first")
I saved the result to a new file, but when I open it and look for duplicates, I can still see a lot of duplicates in the userID column.
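For completeness, this is roughly how I save and re-check the result (a minimal sketch; the output file name is just a placeholder, and I save to CSV because that is what my output further below looks like):
# save the deduplicated frame (placeholder path) and re-check the userID column
data_unique.to_csv("./sept-count/unique_data.csv", index=True)
check = pd.read_csv("./sept-count/unique_data.csv")
print(check["userID"].duplicated().sum())  # I expected this to be 0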
Since each object in this JSON file has multiple keys, how can I make sure that I keep only unique values based on the userID column?
Thank you so much for any help you can provide me with.
Reproducible example:
Some of the objects in my JSON file look like this:
{
"@timestamp": "2021-08-11T05:05:26.529Z",
"name": "Logon Attempt",
"message": "Successful Logon for user (0000001)\"",
"userID": "0000001",
"destinationUserName": "0000001"
},
{
"@timestamp": "2021-08-11T05:15:11.872Z",
"name": "User Log Off",
"message": "User (0000001)\" has logged out",
"userID": "0000001",
"destinationUserName": "0000001"
},
{
"@timestamp": "2021-08-11T05:16:11.916Z",
"name": "Logon Attempt",
"message": "Successful Logon for user (0000001)\"",
"userID": "0000001",
"destinationUserName": "0000001"
},
{
"sourceUserName": "0000001",
"@timestamp": "2021-08-11T06:44:35.696Z",
"name": "TCP_DENIED",
"deviceAction": "PROXIED",
"userID": "0000001"
}
In the original file (all_data.json) I have 2,834 occurrences of this userID; in the pandas-processed file I have 234.
Now this seems to be working, but I believe the drop is applied across all the columns in my JSON, which returns a wrong result.
What I would like is to drop duplicates based only on the userID, so that if that userID is also present in another field, the row is still dropped. This way I will have only one row for that userID in the whole file.
In my output file, this is the result I am getting:
153495,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-22T03:30:02.468Z,,,,,
33124,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-26T12:46:08.761Z,,,,,
153525,DENIED, 000001,TCP_DENIED, 000001,2021-09-22T03:51:56.296Z,,,,,
153628,DENIED, 000001,TCP_DENIED, 000001,2021-09-22T04:57:28.485Z,,,,,
153474,DENIED, 000001,TCP_DENIED, 000001,2021-09-22T02:50:54.086Z,,,,,
149538,DENIED, 000001,TCP_DENIED, 000001,2021-09-21T19:34:40.084Z,,,,,
153496,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-22T03:30:02.466Z,,,,,
153371,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-22T01:51:37.868Z,,,,,
The first occurrence of 000001 is the userID and the second is the sourceUserName.
Based on that output, I would like to see only the first row. I hope the example is clear enough.
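Conceptually, this is the logic I am after (a rough sketch of my intent, not code I already have working; the column names are the ones from my samples, it assumes all three columns exist after read_json, and it strips whitespace because the IDs in my output seem to have a leading space):
import pandas as pd

data3 = pd.read_json("./sept-count/all_data.json")
# take the ID from whichever column it appears in:
# prefer userID, otherwise fall back to sourceUserName / destinationUserName
id_cols = ["userID", "sourceUserName", "destinationUserName"]
row_id = data3[id_cols].bfill(axis=1).iloc[:, 0].astype(str).str.strip()
# keep only the very first row seen for each ID, whatever the other columns contain
data_unique = data3[~row_id.duplicated(keep="first")]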