1
votes

I have a dataset with a sample column from it shown below.

I need to delete similar countries names in each row (MAIN REQUEST)

then I need to create a column for each country (SUPPLEMENTARY REQUEST).

data<-read.table(text="
LocationCountry
United States, Belgium, France, Ireland, Netherlands, Netherlands, Netherlands, Sweden
Spain, Spain, Spain, Spain
Korea, Republic of
Korea, Republic of
Austria, Austria, Austria
United States, United States, United States, United States, United States, United States
Italy, Italy
Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of
India, Iran, Islamic Republic of
Spain, Spain, Spain, Spain
Korea, Republic of
Turkey, Turkey", header=T, sep="\n")

Any advice will be greatly appreciated

3
The first part is very easy. Second part is confusing. You mean create a column for each country in a different DataFrame?Amit
@Amit Thanks. I need it in the same dataset, if possible. Otherwise, I have a serial number for each row in my large dataset so I can merge if needed.Mohamed Rahouma
Some of the countries have , in between i.e. Korea, Republic ofakrun

3 Answers

2
votes

In base R, we can use strsplit to split into a list, get the unique elements and paste them back

data$LocationCountry <- sapply(strsplit(data$LocationCountry, ",\\s*"), 
       function(x) toString(unique(x)))

-output

data
#                                                LocationCountry
#1  United States, Belgium, France, Ireland, Netherlands, Sweden
#2                                                         Spain
#3                                            Korea, Republic of
#4                                            Korea, Republic of
#5                                                       Austria
#6                                                 United States
#7                                                         Italy
#8                                            Korea, Republic of
#9                              India, Iran, Islamic Republic of
#10                                                        Spain
#11                                           Korea, Republic of
#12                                                       Turkey

For the supplementary part, if we need to create binary columns for each element in the 'LocationCountry', then use the updated 'LocationCountry' column with unique names, split it, and apply the mtabulate

library(qdapTools)
cbind(data, mtabulate(strsplit(data$LocationCountry, ",\\s+")))

-output

                                             LocationCountry Austria Belgium France India Iran Ireland Islamic Republic of Italy
1  United States, Belgium, France, Ireland, Netherlands, Sweden       0       1      1     0    0       1                   0     0
2                                                         Spain       0       0      0     0    0       0                   0     0
3                                            Korea, Republic of       0       0      0     0    0       0                   0     0
4                                            Korea, Republic of       0       0      0     0    0       0                   0     0
5                                                       Austria       1       0      0     0    0       0                   0     0
6                                                 United States       0       0      0     0    0       0                   0     0
7                                                         Italy       0       0      0     0    0       0                   0     1
8                                            Korea, Republic of       0       0      0     0    0       0                   0     0
9                              India, Iran, Islamic Republic of       0       0      0     1    1       0                   1     0
10                                                        Spain       0       0      0     0    0       0                   0     0
11                                           Korea, Republic of       0       0      0     0    0       0                   0     0
12                                                       Turkey       0       0      0     0    0       0                   0     0
   Korea Netherlands Republic of Spain Sweden Turkey United States
1      0           1           0     0      1      0             1
2      0           0           0     1      0      0             0
3      1           0           1     0      0      0             0
4      1           0           1     0      0      0             0
5      0           0           0     0      0      0             0
6      0           0           0     0      0      0             1
7      0           0           0     0      0      0             0
8      1           0           1     0      0      0             0
9      0           0           0     0      0      0             0
10     0           0           0     1      0      0             0
11     1           0           1     0      0      0             0
12     0           0           0     0      0      1             0
1
votes

You can try this:

library(dplyr)
library(tidyr)
#Code1
data %>%
  mutate(id=row_number()) %>%
  separate_rows(LocationCountry,sep=',') %>%
  mutate(LocationCountry=trimws(LocationCountry)) %>%
  group_by(id) %>%
  filter(!duplicated(LocationCountry)) %>%
  summarise(LocationCountry=paste0(LocationCountry,collapse = ', ')) %>%
  select(-id)

Output:

# A tibble: 12 x 1
   LocationCountry                                             
   <chr>                                                       
 1 United States, Belgium, France, Ireland, Netherlands, Sweden
 2 Spain                                                       
 3 Korea, Republic of                                          
 4 Korea, Republic of                                          
 5 Austria                                                     
 6 United States                                               
 7 Italy                                                       
 8 Korea, Republic of                                          
 9 India, Iran, Islamic Republic of                            
10 Spain                                                       
11 Korea, Republic of                                          
12 Turkey                                                      
1
votes

Ok the point raised by @akrun makes it a complicated that some countries have a comma between them. However here is my solution. Not much different than that of @akrun.

RemoveRedundants <- function(Row){
  Split_Countries <- unlist(strsplit(Row, ", "))
  Unique_Countries <- paste(unique(Split_Countries, fromLast = TRUE), collapse = ", ")
  return(Unique_Countries)
}

data$UniqueCountries <- apply(data,1,RemoveRedundants)
View(data)

Gives the following as the output enter image description here