I'm relatively new to R (I've taken a few classes), and I'm stuck on a task that seems trivial and has probably been done countless times, yet I can't find an answer. My dataset has a field "Name" that contains names, but each name includes the person's title, e.g. Mr., Mrs., Miss, etc.

I want to fill a new column "Title" (I've already created it) with the titles in numeric form, such as Mr. = 1, Mrs. = 2, Miss = 3, etc. The solutions I've found create new subsets of the data, which I don't want; I want to add the column to the current dataset. Thank you for any help.

Expected output:

Name                    Title
Jones, Mr. Frank        1
Jennings, Mrs. Joan     2
Hinker, Miss. Lisa      3
Brant, Mrs. Jane        2
Allin, Mr. Hank         1
Minks, Mr. Jeff         1
Naps, Mr. Tim           1

1 Answer

We can use gsub to extract the Mr/Mrs/Miss substring from the 'Name' column, convert it to a factor whose levels are the unique elements of that vector, and finally convert the factor to numeric.

The pattern has two alternatives separated by |. The first, ^[^,]+,\\s*, starts at the beginning of the string (^) and matches one or more characters that are not a comma ([^,]+), followed by a comma and zero or more whitespace characters (\\s*). The second, \\.[^.]+$, matches a literal period (\\.) followed by one or more characters that are not a period ([^.]+), up to the end of the string ($). gsub replaces every match with '' (the second argument), so only the title remains.

 v1 <- gsub('^[^,]+,\\s*|\\.[^.]+$', '', df1$Name)

 df1$Title <- as.numeric(factor(v1, levels=unique(v1)))
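
To see what each half of the pattern removes, the two alternatives can be applied one at a time with sub. This is only an illustrative sketch; the names nm, step1 and step2 are made up here and are not part of the answer's code.

```r
# Sample names copied from the question
nm <- c("Jones, Mr. Frank", "Jennings, Mrs. Joan", "Hinker, Miss. Lisa")

# First alternative: drop the surname, the comma and any whitespace after it
step1 <- sub('^[^,]+,\\s*', '', nm)
step1
# [1] "Mr. Frank"  "Mrs. Joan"  "Miss. Lisa"

# Second alternative: drop the period and everything after it
step2 <- sub('\\.[^.]+$', '', step1)
step2
# [1] "Mr"   "Mrs"  "Miss"
```

Applying gsub once with both alternatives joined by | gives the same result in a single call.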

NOTE: We can also specify the order of the levels explicitly, i.e. factor(v1, levels= c('Mr', 'Mrs', 'Miss')). In the example provided, unique(v1) happens to give the same order as the expected output, because the titles first appear in the order Mr, Mrs, Miss.
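
The difference matters when the rows come in another order. Below is a minimal sketch using a reordered three-row version of the example data (the object names df2, v2 and by_appearance are chosen here for illustration only):

```r
# Same names as in the question, but with "Miss" appearing first
df2 <- data.frame(Name = c("Hinker, Miss. Lisa", "Jones, Mr. Frank",
                           "Jennings, Mrs. Joan"),
                  stringsAsFactors = FALSE)

v2 <- gsub('^[^,]+,\\s*|\\.[^.]+$', '', df2$Name)

# Levels taken from order of appearance: Miss = 1, Mr = 2, Mrs = 3
by_appearance <- as.numeric(factor(v2, levels = unique(v2)))
by_appearance
# [1] 1 2 3

# Levels fixed explicitly: Mr = 1, Mrs = 2, Miss = 3, whatever the row order
df2$Title <- as.numeric(factor(v2, levels = c('Mr', 'Mrs', 'Miss')))
df2$Title
# [1] 3 1 2
```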

Alternatively, match returns the position of each element of v1 in unique(v1), which gives the same codes directly.

df1$Title <- match(v1, unique(v1))
df1
#                 Name Title
#1    Jones, Mr. Frank     1
#2 Jennings, Mrs. Joan     2
#3  Hinker, Miss. Lisa     3
#4    Brant, Mrs. Jane     2
#5     Allin, Mr. Hank     1
#6     Minks, Mr. Jeff     1
#7       Naps, Mr. Tim     1
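
match also works against a fixed table of titles instead of unique(v1). In that case any title missing from the table comes back as NA, which can be a useful check on real data. A small sketch follows; the vector known, the object names and the name "Smith, Dr. Ann" are hypothetical and not from the question.

```r
# Assumed coding: Mr = 1, Mrs = 2, Miss = 3
known <- c('Mr', 'Mrs', 'Miss')

# "Smith, Dr. Ann" is a made-up entry to show an unlisted title
nm2 <- c("Naps, Mr. Tim", "Smith, Dr. Ann", "Brant, Mrs. Jane")
titles <- gsub('^[^,]+,\\s*|\\.[^.]+$', '', nm2)

codes <- match(titles, known)
codes
# [1]  1 NA  2
```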

data

df1 <- structure(list(Name = c("Jones, Mr. Frank", "Jennings, Mrs. Joan", 
"Hinker, Miss. Lisa", "Brant, Mrs. Jane", "Allin, Mr. Hank", 
"Minks, Mr. Jeff", "Naps, Mr. Tim")), .Names = "Name", row.names = c(NA, 
-7L), class = "data.frame")