0
votes

I'm trying to collect catalogue information based on text search. Search for a certain string in column Text, and put some description into a new column C_Organization.

Here is the sample data:

# load packages:
pacman::p_load("data.table",
               "stringr")

# make sample data:
DE <- data.table(c("John", "Sussan", "Bill"),
                 c("Text contains MIT", "some text with Stanford University", "He graduated from Yale"))

colnames(DE) <- c("Name", "Text")

> DE
     Name                               Text
1:   John                  Text contains MIT
2: Sussan some text with Stanford University
3:   Bill             He graduated from Yale

search for a certain string and make a new data.table with new column:

mit <- DE[str_detect(DE$Text, "MIT"), .(Name, C_Organization = "MIT")]
yale <- DE[str_detect(DE$Text, "Yale"), .(Name, C_Organization = "Yale")]
stanford <- DE[str_detect(DE$Text, "Stanford"), .(Name, C_Organization = "Stanford")]

# bind them together:
combine_table <- rbind(mit, yale, stanford)

combine_table

     Name C_Organization
1:   John            MIT
2:   Bill           Yale
3: Sussan       Stanford

This pick-and-combine approach works fine but it seems a little bit tedious. Is it possible to do it in one step in data.table?

Edit

Due to my poor data analysis skill and the unclean data, I need to make the question clear:

  1. The real data is a little complicated:

    (1) There are cases where a person from more than two organizations, like Jack, UC Berkeley, Bell lab. and

    (2) The same person of the same organization appears for different year, like Steven, MIT, 2011, Steven, MIT, 2014.

  2. I want to figure out:

    (1) How many people from each organization. If one person belongs to more than one organization, make the organization which appears most as his organization. (i.e. by popularity.) For example, John, MIT, AMS, Bell lab, if MIT appears 30 times, AMS 12 times, Bell lab 26 times. Then make MIT as his organization.

    (2) count how many people for each year. This is not directly realted to my original question, but for later calculation, I don't want to throw away these records.

1
you mean you have the vector v = c("MIT","Yale","Stanford") and you want to retrieve all rows in DE having this in column text ?Colonel Beauvel
Yes. Thanks for asking. You expressed my question in a much better way. I should have mentioned that this vector might be longer in the real data set.Nick
Your pacman:: call reduces reproducibility of the question. You could use sapply(c("pkg1","pkg2"), require)jangorecki
@jangorecki Thank you for letting me know this. I was allways trying to find out how to write require(pkg1), require(pkg2)... concisely. sapply is really a good idea, for it doesn't require an extra package like pacman.Nick
@Nick, in case there is several same school per Name, does the display below satisfies you?Colonel Beauvel

1 Answers

0
votes

An alternative solution which takes into account for several matches in one text, operates rowwise and binds the matches together:

uni <- c("MIT","Yale","Stanford")
DE[,idx:=.I][, c_org := paste(uni[str_detect(Text, uni)], collapse=","), idx]

this gives:

> DE
     Name                                   Text idx             c_org
1:   John                      Text contains MIT   1               MIT
2: Sussan     some text with Stanford University   2          Stanford
3:   Bill He graduated from Yale, MIT, Stanford.   3 MIT,Yale,Stanford
4:   Bill                              some text   4                  

The advantage of operating rowwise is evident when you have identical names in Name. When you do:

DE[, uni[str_detect(Text, uni)], Name]

you get not the correct result:

     Name       V1
1:   John      MIT
2: Sussan Stanford
3:   Bill      MIT
4:   Bill Stanford

=> you don't know which Bill you have in the fourth row. Moreover, Yale isn't included for the 'first' Bill (i.e. row 3 of the original dataset).


Used data:

DE <- structure(list(Name = c("John", "Sussan", "Bill", "Bill"), Text = c("Text contains MIT", "some text with Stanford University", "He graduated from Yale, MIT, Stanford.", "some text")), .Names = c("Name", "Text"), row.names = c(NA, -4L), class = c("data.table", "data.frame"))