1
votes

I am trying to mark all fruit with a "1" if it is only supplied by one country or a "0" otherwise.

I have two tables of data:

Table 1:

Fruit - Each row has a different fruit in it e.g. Apple, Banana, Peach,etc...

Country - Each row has the fruits main country of supply in 2-digit iso format e.g. US, UK, NO, etc...

SourceUnique - This is the column I want to fill with "1" in rows with fruit that are only supplied by one country and "0" otherwise.

Table 2:

Country - Each row has the suppliers country in 2-digit iso format like the last table.

Supplies - Each row has a list of fruits that the supplier delivers e.g. row 1 is "Apple, Banana", row 2 is "Pineapple, Peach, Pear, Apple", etc...

Both tables are imported from CSV files then my code is as follows:

Table1$SourceUnique=rep(1,length(Table1$Country))

for(i in 1:length(Table1$Country)){
  for(k in 1:length(Table2$Country)){
    if(grepl(Table1$Fruit[i], Table2$Supplies[k])==TRUE && identical(Table1$Country[i], Table2$Country[k])==FALSE){
      Table1$SourceUnique[i]=0
    }
  }
}

I get no errors but the SourceUnique column does not fill correctly. I get 1's and 0's with some correct and others not. After lots of searching and messing around I have accepted that I have no idea and need help, so any advice or solutions would be fantastic.

Thanks.

Edit for more info:

Some fruits have many suppliers from the same country and Table2$Supplies is messy with other words in it annoyingly.

Example data:

Table1$Country <- c("UK","US","NO")
Table1$Fruit <- c("Apple","Banana","Pear")

Table2$Country <- c("UK","US","UK")
Table2$Supplies <- c("Apple,Pear","Banana,Pear","Banana and Apple")

Edit Again:

grepl and identical work in my code when I run them separately with numbers. I can't understand why they do not work in my loops... In theory my code loops through "Supplies", searches the two criteria and returns a 0 when both criteria are satisfied. It then moves on to the next i ("fruit") and repeats. Maybe the && is my problem? it seems correct from my knowledge.

An Excel solution would also work for my purpose but I am not experienced enough with Excel to know where to start with that.

2
Hi, welcome to SO. Please consider reading up on How to Ask and how to produce a reproducible example. It makes it easier for others to help you.Heroka
Sorry about that, I tried creating a small table of data with HTML but it didn't seem to work so I tried to be as descriptive as possible. My data is investors not fruit so I can't share the CSV files... Two tables with 3 random rows would work.GoBuildAngus
What is your expected output for the given example data ?user5249203
A new column in Table 1 with 3 rows which are 1, 0 and 0. Apple is 1 as it is only supplied by the UK. The other two are 0's as Bananas are supplied by the UK and US and Pears are supplied by Norway mainly but we can see that they are supplied by the US and the UK also.GoBuildAngus
Is it feasible to extract the relevant values from your real equivalent of Supplies? In your example data this could be done by splitting at the occurrence of the regex "\\s*(and|,)\\s*".Mikko Marttila

2 Answers

0
votes

Perhaps you can simplify the problem by counting the occurrences of each fruit in table 2 for each country:

for (i in Table1$Fruit){
  as.integer(rowSums(table(grepl(i,Table2$Supplies),Table2$Country))[2]==1)
}

This gives you 1 for those fruits that only occur once per country in table 2 and 0 otherwise.

0
votes

Assuming it's possible to construct a regular expression to extract the values of "fruit" from the Supplies column in your real data, here's a data manipulation approach to the problem.

# prepare your sample data
fruit <- suppliers <- list()

fruit$Fruit <- c("Apple","Banana","Pear")
fruit$Country <- c("UK","US","NO")
fruit <- data.frame(fruit)

suppliers$Country <- c("UK","US","UK")
suppliers$Supplies <- c("Apple,Pear","Banana,Pear","Banana and Apple")
suppliers <- data.frame(suppliers)

library(dplyr)
library(tidyr)  # version 0.5.0 or later

# data manipulation for the desired result
suppliers %>%
    # split values of Supplies into a new row at each occurance of sep
    separate_rows(Supplies, sep = "\\s*(and|,)\\s*") %>%
    group_by(Supplies) %>%
    # summarize which fruit are supplied from only one country
    summarize(SourceUnique = as.numeric(n_distinct(Country) == 1)) %>%
    left_join(fruit, ., by = c("Fruit" = "Supplies"))
#        Fruit Country SourceUnique
#     1  Apple      UK            1
#     2 Banana      US            0
#     3   Pear      NO            0

If speed is desired, the same could likely be formulated using data.tables which provide excellent performance for working with large data.