Calculating distance between two variables and generating new variable

Question

I would like to create a variable called spill which is given as the sum of the distances between vectors of each row multiplied by the stock value. For example, consider

firm  us  euro  asia  africa  stock  year
A     1    4     3      5      46    2001
A     2    0     1      3      889   2002
B     2    3     1      1      343   2001
B     0    2     1      3      43    2002
C     1    3     4      2      345   2001

I would like to create a vector which basically takes the distance between two firms at time t and generates the spill variable. For example, take for Firm A in the year 2001 it would be 0.204588 (which is the cosine distance between firm A and B at time t i.e, in 2001 (1,4,3,5) and (2,3,1,1) (i.e. similarity between the investments in us, euro, asia, africa) and then multiplied by 343, and then to calculate the distance between A and C in 2001 as .10528 * 345 , hence the spill variable is = 0.2045883 * 343+ 0.1052075 * 345 = 106.4704 for the year 2001 for firm A.

I want to get a table including spill like this

firm  us  euro  asia  africa  stock  year  spill
A     1    4     3      5      46    2001  106.4704 
A     2    0     1      3      889   2002    
B     2    3     1      1      343   2001    
B     0    2     1      3      43    2002    
C     1    3     4      2      345   2001

Can anyone please advise?

Here are the codes for stata[https://www.statalist.org/forums/forum/general-stata-discussion/general/1409182-calculating-distance-between-two-variables-and-generating-new-variable]. I have about 3,000 firms and 30 years. It runs well but very slowly.

dt <- data.frame(id=c("A","A","B","B","C"),us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))

you have not fully elaborated how to compute what you want. for example you said. Can you explain how you got 103.1 step by step? eg you said ` .2 * 343+.1 * 345 = 103.1 ` but where did you get .2 or .1 from? Also you have .10528 * 345 but where do you use this?? Your explanation is ambiguous. Probably thats the reason why no one has helped you so far. You need to be clear. Do not assume that people know what you want but rather explaine explicitly — Onyambu
I think they rounded the cosine distances respectively. So 0.204588 became the .2 and .10528 became the .1. — TaylorV
Expecting R users to read stata code and intuit your heuristics and approximations is unrealistic. Please expand the text IN THE QUESTION BODY and explicitly define your rules. Also use any code you have in R. That way we can get a better idea where to start. Remember SO is not a code server. You are supposed to show how far you have progressed and what errors or roadblocks you need help with. — IRTFM

Evan Friedland Evan Friedland · Accepted Answer · 2019-05-11T02:00:39

Given the minimal info on how to calculate the similarity distance I've used a formula from Find cosine similarity between two arrays which will return different numbers than yours but should give the same resulting info.

I split the data by year so we can compare the unique ids. I take those individual lists and use lapply to run a for loop comparing all possibilities.

dt <- data.frame(id=c("A","A","B","B","C"), us=c(1,2,2,0,1),euro=c(4,0,3,2,3),asia=c(3,1,1,1,4),africa=c(5,3,1,3,2),stock=c(46,889,343,43,345),year=c(2001,2002,2001,2002,2001))

geo <- c("us","euro","asia","africa")
s <- lapply(split(dt, dt$year), function(a) {
  n <- nrow(a)
  for(i in 1:n){
    csim <- rep(0, n) # reset results of cosine similarity *stock vector
    for(j in 1:n){
      x <- unlist(a[i,geo])
      y <- unlist(a[j,geo])
      csim[j] <- (1-(x %*% y / sqrt(x%*%x * y%*%y)))*a[j,"stock"]
    }
    a$spill[i] <- sum(csim)
  }
  a
})

do.call(rbind, s)
#       id us euro asia africa stock year     spill
#2001.1  A  1    4    3      5    46 2001 106.47039
#2001.3  B  2    3    1      1   343 2001  77.93231
#2001.5  C  1    3    4      2   345 2001  72.96357
#2002.2  A  2    0    1      3   889 2002  12.28571
#2002.4  B  0    2    1      3    43 2002 254.00000

Calculating distance between two variables and generating new variable

1 Answers