
I have a CSV file from the Kaggle Titanic competition with the following columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. I want to analyze the data in this file and check whether passengers traveling in a group had a better survival rate. For this I assume that all passengers in a group share the same Ticket value.

I loaded the CSV into MS Access and ran the following query to get the desired result set:

SELECT a.Ticket, a.PassengerId, a.Survived
FROM train a
WHERE 1 < (SELECT COUNT(*) FROM train b WHERE b.Ticket = a.Ticket)
ORDER BY a.Ticket

I am not able to reproduce the same result set in Pandas without writing a loop.
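For reference, here is a quick way to see how many passengers share each ticket in pandas. In practice the frame would come from `pd.read_csv('train.csv')` (the path is an assumption); a small inline sample, taken from the expected output in the comments below, is used here so the snippet runs on its own:

```python
import pandas as pd

# In practice: df = pd.read_csv('train.csv')
# Inline sample rows taken from the expected output in the comments.
df = pd.DataFrame({
    'Ticket': ['110152', '110152', '110152', '110465', '110465'],
    'PassengerId': [258, 505, 760, 111, 476],
    'Survived': [1, 1, 1, 0, 0],
})

# Group sizes per ticket -- analogous to the correlated COUNT(*) in the SQL.
group_sizes = df['Ticket'].value_counts()
print(group_sizes)
```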

It would help if you posted at least a few records of your expected result; that way we can validate our understanding of the question and answer. – Scott Boston
Here are a few records that I obtained by running SQL on the Access DB I created out of the CSV file. I verified the output of the Pandas script you wrote against the SQL output and they matched perfectly. – Sudipto Chatterjee

Ticket,PassengerId,Survived
"110152",258,1
"110152",505,1
"110152",760,1
"110413",263,0
"110413",559,1
"110413",586,1
"110465",111,0
"110465",476,0
"111361",330,1
"111361",524,1
"113505",167,1
"113505",357,1
"113572",62,1
"113572",830,1
"113760",391,1
"113760",436,1
"113760",764,1
"113760",803,1

1 Answer


Let's see if this matches:

df.groupby(['Ticket']).filter(lambda x: x.Ticket.count()>1)[['Ticket','PassengerId','Survived']]

Or with Jezrael's suggestion:

df.groupby(['Ticket']).filter(lambda x: len(x)>1)[['Ticket','PassengerId','Survived']]

I am using groupby on Ticket, then using filter to restrict the dataframe to those records where the count within the ticket group is greater than 1.
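As a minimal runnable sketch of this approach (sample rows taken from the expected output in the comments, plus one single-passenger ticket, '330911' with PassengerId 900, which is made up for illustration to show it gets filtered out):

```python
import pandas as pd

# Sample data from the question's expected output, plus one hypothetical
# single-passenger ticket ('330911') that should be excluded.
df = pd.DataFrame({
    'Ticket': ['110152', '110152', '110152', '110465', '110465', '330911'],
    'PassengerId': [258, 505, 760, 111, 476, 900],
    'Survived': [1, 1, 1, 0, 0, 0],
})

# Keep only rows whose ticket appears more than once.
result = df.groupby(['Ticket']).filter(lambda x: len(x) > 1)[
    ['Ticket', 'PassengerId', 'Survived']
]
print(result)
# The singleton ticket '330911' is dropped; the five grouped rows remain.
```

An equivalent approach that avoids calling a Python lambda per group is a boolean mask built with transform, e.g. df[df.groupby('Ticket')['Ticket'].transform('count') > 1], which is usually faster on large frames.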