I'm trying to collect catalogue information based on a text search: search for a certain string in column Text, and put a description into a new column C_Organization.
Here is the sample data:
# load packages:
pacman::p_load("data.table",
"stringr")
# make sample data:
DE <- data.table(c("John", "Sussan", "Bill"),
c("Text contains MIT", "some text with Stanford University", "He graduated from Yale"))
colnames(DE) <- c("Name", "Text")
> DE
     Name                               Text
1:   John                  Text contains MIT
2: Sussan some text with Stanford University
3:   Bill             He graduated from Yale
Search for a certain string and make a new data.table with a new column:
mit <- DE[str_detect(DE$Text, "MIT"), .(Name, C_Organization = "MIT")]
yale <- DE[str_detect(DE$Text, "Yale"), .(Name, C_Organization = "Yale")]
stanford <- DE[str_detect(DE$Text, "Stanford"), .(Name, C_Organization = "Stanford")]
# bind them together:
combine_table <- rbind(mit, yale, stanford)
combine_table
     Name C_Organization
1:   John            MIT
2:   Bill           Yale
3: Sussan       Stanford
This pick-and-combine approach works fine, but it seems a bit tedious. Is it possible to do it in one step in data.table?
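To make "one step" concrete, here is a minimal sketch of one possible way to do it, not a definitive solution. It assumes a hypothetical vector orgs holding the strings to search for, and it keeps only the first match found in each row.

```r
library(data.table)
library(stringr)

# same sample data as above
DE <- data.table(
  Name = c("John", "Sussan", "Bill"),
  Text = c("Text contains MIT",
           "some text with Stanford University",
           "He graduated from Yale")
)

# hypothetical vector of organization names to search for
orgs <- c("MIT", "Yale", "Stanford")

# build a single regex "MIT|Yale|Stanford" and extract the first match per row
pattern <- paste(orgs, collapse = "|")
combine_table <- DE[, .(Name, C_Organization = str_extract(Text, pattern))]

# drop rows in which none of the strings was found
combine_table <- combine_table[!is.na(C_Organization)]
combine_table
```

Unlike the rbind() version, the rows stay in the original order of DE. If a text can mention several organizations, str_extract_all() would be needed instead, since str_extract() returns only the first match.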
Edit
Because of my limited data analysis skills and the unclean data, I need to clarify the question:
The real data is a little complicated:
(1) There are cases where a person belongs to more than one organization, like Jack, UC Berkeley, Bell lab.
(2) The same person from the same organization appears for different years, like Steven, MIT, 2011 and Steven, MIT, 2014.
I want to figure out:
(1) How many people are from each organization. If one person belongs to more than one organization, assign the organization that appears most often as his organization (i.e. by popularity). For example, for John, MIT, AMS, Bell lab: if MIT appears 30 times, AMS 12 times, and Bell lab 26 times, then MIT becomes his organization.
(2) How many people there are for each year. This is not directly related to my original question, but I don't want to throw away these records because I need them for later calculations.
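The two counts described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the records table, its Year column, and the sample rows are hypothetical, and "popularity" is taken to mean the number of records in which an organization appears.

```r
library(data.table)

# hypothetical long-format records: one row per (person, organization, year)
records <- data.table(
  Name = c("John", "John", "John", "Steven", "Steven", "Jack", "Jack"),
  C_Organization = c("MIT", "AMS", "Bell lab", "MIT", "MIT",
                     "UC Berkeley", "Bell lab"),
  Year = c(2011, 2011, 2011, 2011, 2014, 2012, 2012)
)

# assumption: popularity = number of records mentioning the organization
records[, popularity := .N, by = C_Organization]

# for each person keep the most popular organization
# (which.max() picks the first one in case of a tie)
main_org <- records[, .(C_Organization = C_Organization[which.max(popularity)]),
                    by = Name]

# (1) number of people per organization, after the assignment above
people_per_org <- main_org[, .(n_people = .N), by = C_Organization]

# (2) number of distinct people per year, using all records
people_per_year <- records[, .(n_people = uniqueN(Name)), by = Year]
```

Keeping the full records table for the per-year count means the repeated entries (such as Steven in 2011 and 2014) are not thrown away, while main_org holds exactly one organization per person.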
v = c("MIT","Yale","Stanford") and you want to retrieve all rows in DE having this in column text? – Colonel Beauvel
pacman:: call reduces reproducibility of the question. You could use sapply(c("pkg1","pkg2"), require) – jangorecki
require(pkg1), require(pkg2)... concisely. sapply is really a good idea, for it doesn't require an extra package like pacman. – Nick