
I have a JSON file that looks like this:

{
        "deviceAction": "",
        "userID": "001",
        "name": "FAILED",
        "sourceUserName": "",
        "@timestamp": ""
    },
    {
        "deviceAction": "",
        "userID": "001",
        "name": "FAILED",
        "sourceUserName": "",
        "@timestamp": ""
    },
    {
        "deviceAction": "",
        "userID": "002",
        "name": "FAILED",
        "sourceUserName": "",
        "@timestamp": ""
    }

This file has over 40k objects, and most of them are duplicates.

What I am trying to do is keep only the unique values, so an output similar to this:

{
        "deviceAction": "",
        "userID": "001",
        "name": "FAILED",
        "sourceUserName": "",
        "@timestamp": ""
    },
    {
        "deviceAction": "",
        "userID": "002",
        "name": "FAILED",
        "sourceUserName": "",
        "@timestamp": ""
    }

Using pandas, I used this code:

import pandas as pd

data3 = pd.read_json("./sept-count/all_data.json")
data3.sort_values("userID", inplace=True)
# keep only the first row for each userID
data_unique = data3.drop_duplicates(subset=['userID'], keep='first')

I saved the result to a new file, but when I open it and look for duplicates, I can still see a lot of duplicates in the userID column.
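For reference, this is roughly how I save the result and check for leftover duplicates (the output path is just an example):

# Write the deduplicated frame back out as a JSON array of records
data_unique.to_json("./sept-count/all_data_unique.json", orient="records", indent=4)

# Count how many rows still share a userID with an earlier row
print(data_unique["userID"].duplicated().sum())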

Since each object in this JSON file has multiple keys, how can I make sure I keep only unique values based on the userID column?

Thank you so much for any help you can provide me with.

Reproducible example:

Here are some of the objects from my JSON file:

{
        "@timestamp": "2021-08-11T05:05:26.529Z",
        "name": "Logon Attempt",
        "message": "Successful Logon for user (0000001)\"",
        "userID": "0000001",
        "destinationUserName": "0000001"
    },
{
        "@timestamp": "2021-08-11T05:15:11.872Z",
        "name": "User Log Off",
        "message": "User (0000001)\" has logged out",
        "userID": "0000001",
        "destinationUserName": "0000001"
    },
{
        "@timestamp": "2021-08-11T05:16:11.916Z",
        "name": "Logon Attempt",
        "message": "Successful Logon for user (0000001)\"",
        "userID": "0000001",
        "destinationUserName": "0000001"
    },
{
        "sourceUserName": "0000001",
        "@timestamp": "2021-08-11T06:44:35.696Z",
        "name": "TCP_DENIED",
        "deviceAction": "PROXIED",
        "userID": "0000001"
    }

In the original file (all_data.json) I have 2834 occurrences of this userID; in the pandas-processed file I have 234.

Now this seems to be working, but I believe the drop applies to all the columns in my JSON, which returns a wrong result.

What I would like is to drop the duplicates based only on the userID, so that if a userID is also present in another field, that row is dropped as well. This way each userID will appear only once in the whole file.

In my output file, this is the result I am getting:

153495,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-22T03:30:02.468Z,,,,,
33124,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-26T12:46:08.761Z,,,,,
153525,DENIED, 000001,TCP_DENIED, 000001,2021-09-22T03:51:56.296Z,,,,,
153628,DENIED, 000001,TCP_DENIED, 000001,2021-09-22T04:57:28.485Z,,,,,
153474,DENIED, 000001,TCP_DENIED, 000001,2021-09-22T02:50:54.086Z,,,,,
149538,DENIED, 000001,TCP_DENIED, 000001,2021-09-21T19:34:40.084Z,,,,,
153496,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-22T03:30:02.466Z,,,,,
153371,OBSERVED, 000001,TCP_NC_MISS, 000001,2021-09-22T01:51:37.868Z,,,,,

The first occurrence of 000001 is the userID and the second is the sourceUserName.

Based on that output, I would like to see only the first row. I hope the example is clear enough.

Please post a minimal reproducible example. - Luke
Done, I posted an example with some data. - Nayden Van

1 Answer


If you don't want any duplicates at all, you're going to have to set keep=False, like this:

data_unique = data3.drop_duplicates(subset=['userID'], keep=False)

Otherwise the first occurrence of each duplicated value will still be included in data_unique.
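As a quick illustration of the difference between keep='first' and keep=False, on a toy frame (not your data):

import pandas as pd

df = pd.DataFrame({"userID": ["001", "001", "002"], "name": ["FAILED"] * 3})

# keep='first' retains one row per userID -> rows for 001 and 002 remain
print(df.drop_duplicates(subset=["userID"], keep="first"))

# keep=False removes every row whose userID is duplicated -> only 002 remains
print(df.drop_duplicates(subset=["userID"], keep=False))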

From your updated description, it looks like you're trying to drop duplicates based on two columns, which can be achieved by doing:

data_unique = data3.drop_duplicates(subset=['userID', 'sourceUserName'], keep='first')
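If the goal from your update is one row per ID no matter which column the ID appears in, something along these lines might work. Treat it as a sketch: the ID columns are assumed from your examples, and the output file name is made up.

# Columns that may carry an ID value (assumed from the question)
id_cols = [c for c in ["userID", "sourceUserName", "destinationUserName"]
           if c in data3.columns]

seen = set()
keep_idx = []
for idx, row in data3.iterrows():
    # collect the non-empty ID values appearing anywhere in this row
    ids = {row[c] for c in id_cols if isinstance(row[c], str) and row[c] != ""}
    if ids & seen:          # an ID in this row was already seen -> skip it
        continue
    seen.update(ids)
    keep_idx.append(idx)

data_unique = data3.loc[keep_idx]
data_unique.to_json("all_data_unique.json", orient="records", indent=4)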