Delete duplicate country names in each row in R

Question

I have a dataset; a sample column from it is shown below.

I need to delete the duplicated country names within each row (MAIN REQUEST),

then I need to create a column for each country (SUPPLEMENTARY REQUEST).

data<-read.table(text="
LocationCountry
United States, Belgium, France, Ireland, Netherlands, Netherlands, Netherlands, Sweden
Spain, Spain, Spain, Spain
Korea, Republic of
Korea, Republic of
Austria, Austria, Austria
United States, United States, United States, United States, United States, United States
Italy, Italy
Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of, Korea, Republic of
India, Iran, Islamic Republic of
Spain, Spain, Spain, Spain
Korea, Republic of
Turkey, Turkey", header=TRUE, sep="\n")

Any advice will be greatly appreciated.

The first part is very easy. Second part is confusing. You mean create a column for each country in a different DataFrame? — Amit
@Amit Thanks. I need it in the same dataset, if possible. Otherwise, I have a serial number for each row in my large dataset so I can merge if needed. — Mohamed Rahouma
Some of the countries have , in between i.e. Korea, Republic of — akrun

akrun · Accepted Answer · 2020-12-29T17:22:42

In base R, we can use strsplit to split each string into a list of pieces, take the unique elements of each, and paste them back together with toString.

data$LocationCountry <- sapply(strsplit(data$LocationCountry, ",\\s*"), 
       function(x) toString(unique(x)))

-output

data
#                                                LocationCountry
#1  United States, Belgium, France, Ireland, Netherlands, Sweden
#2                                                         Spain
#3                                            Korea, Republic of
#4                                            Korea, Republic of
#5                                                       Austria
#6                                                 United States
#7                                                         Italy
#8                                            Korea, Republic of
#9                              India, Iran, Islamic Republic of
#10                                                        Spain
#11                                           Korea, Republic of
#12                                                       Turkey
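To make the one-liner easier to follow, here is a minimal sketch that traces the same three steps on a single string. The variable names (x, pieces, kept, deduped) are only for illustration. It assumes the input is character; read.table returns strings as character by default since R 4.0.0, while in older versions the column may be a factor and would need as.character() first.

```r
# One shortened row from the question, used to trace each step.
x <- "United States, Belgium, Netherlands, Netherlands, Sweden"

# 1. Split on a comma followed by optional whitespace.
#    strsplit returns a list, so take the first (only) element.
pieces <- strsplit(x, ",\\s*")[[1]]

# 2. Keep the first occurrence of each piece, preserving the original order.
kept <- unique(pieces)

# 3. Collapse the pieces back into one string separated by ", ".
deduped <- toString(kept)

print(deduped)
# [1] "United States, Belgium, Netherlands, Sweden"
```

The sapply call in the answer simply applies steps 2 and 3 to every element of the list that strsplit returns for the whole column.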

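For the supplementary part, if we need binary columns for each element in 'LocationCountry', take the updated 'LocationCountry' column with unique names, split it, and apply mtabulate from qdapTools. Note that splitting on the comma also breaks names that contain one, so 'Korea, Republic of' ends up as two columns, 'Korea' and 'Republic of'.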

library(qdapTools)
cbind(data, mtabulate(strsplit(data$LocationCountry, ",\\s+")))

-output

                                                LocationCountry Austria Belgium France India Iran Ireland Islamic Republic of Italy
1  United States, Belgium, France, Ireland, Netherlands, Sweden       0       1      1     0    0       1                   0     0
2                                                         Spain       0       0      0     0    0       0                   0     0
3                                            Korea, Republic of       0       0      0     0    0       0                   0     0
4                                            Korea, Republic of       0       0      0     0    0       0                   0     0
5                                                       Austria       1       0      0     0    0       0                   0     0
6                                                 United States       0       0      0     0    0       0                   0     0
7                                                         Italy       0       0      0     0    0       0                   0     1
8                                            Korea, Republic of       0       0      0     0    0       0                   0     0
9                              India, Iran, Islamic Republic of       0       0      0     1    1       0                   1     0
10                                                        Spain       0       0      0     0    0       0                   0     0
11                                           Korea, Republic of       0       0      0     0    0       0                   0     0
12                                                       Turkey       0       0      0     0    0       0                   0     0
   Korea Netherlands Republic of Spain Sweden Turkey United States
1      0           1           0     0      1      0             1
2      0           0           0     1      0      0             0
3      1           0           1     0      0      0             0
4      1           0           1     0      0      0             0
5      0           0           0     0      0      0             0
6      0           0           0     0      0      0             1
7      0           0           0     0      0      0             0
8      1           0           1     0      0      0             0
9      0           0           0     0      0      0             0
10     0           0           0     1      0      0             0
11     1           0           1     0      0      0             0
12     0           0           0     0      0      1             0
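
In the output above, 'Korea' and 'Republic of' (and likewise 'Iran' and 'Islamic Republic of') come out as separate columns, as the comment on the question anticipated. If those should count as a single country, one possible workaround is to split only at commas that are not followed by such a suffix. The sketch below is not part of the original answer. It assumes that the only names containing a comma end in "Republic of" or "Islamic Republic of", which holds for the sample but may not hold for the full dataset, and it builds the indicator columns in base R instead of with mtabulate. The small data frame and the names pattern, parts, countries, indicators and result are illustrative.

```r
# Small stand-in for the question's data (same column name).
data <- data.frame(
  LocationCountry = c(
    "Korea, Republic of, Korea, Republic of",
    "India, Iran, Islamic Republic of",
    "Spain, Spain",
    "United States, Belgium"
  ),
  stringsAsFactors = FALSE
)

# Assumption: a comma belongs to a country name only when it is followed by
# "Republic of" or "Islamic Republic of". The negative lookahead (PCRE, hence
# perl = TRUE) refuses to split at those commas.
pattern <- ",(?!\\s*(?:Islamic )?Republic of)\\s*"
parts <- lapply(strsplit(data$LocationCountry, pattern, perl = TRUE), unique)

# One 0/1 column per distinct country name.
countries <- sort(unique(unlist(parts)))
indicators <- do.call(
  rbind,
  lapply(parts, function(p) as.integer(countries %in% p))
)
colnames(indicators) <- countries

# Append the indicator columns to the original data frame.
result <- cbind(data, indicators)
print(names(result))
```

With this splitter, the deduplicated text column can be rebuilt with sapply(parts, toString). If the real data contains other comma-containing names, the lookahead would need to be extended, or the names matched against a reference list of countries.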
