Problem: reducing a data set used in regression to several smaller sets in which the variables are dependent within each set but independent between sets. I have a large data set with 1000 dummy variables, of which only a few are 'positive' in each row, and memory limits my ability to build different models. So I'm trying to split the data set into subsets where there is linear dependency between the variables within a subset, but no dependency on the other subsets.

Small example:

M1 <- c(1L,0L,0L,0L,1L,1L,0L,0L,0L,0L,1L,1L,0L,0L,1L,0L)
dim(M1) <- c(4,4)

Here M1 can be split into two independent matrices (rows 1-2 with columns 1-2, and rows 3-4 with columns 3-4):

M2 <- c(1,0,1,1)
dim(M2) <- c(2,2)
M3 <- c(1,1,1,0)
dim(M3) <- c(2,2)

But changing M1 to

M1[3,2] <- 1

would make all rows dependent, so no split is possible.

Ideally, what I would like is a vector with one entry per row specifying which subset that row belongs to, so that a regression could be fitted on each subset. In the original case the result would be the vector:

R <- c(1,1,2,2)
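
To illustrate how such a vector would be used, here is a minimal sketch that splits M1 by R and keeps, for each group, only the columns that group actually uses. The name `subsets` is just an assumption for this example; for the small case it recovers the M2 and M3 blocks above.

```r
# Small example from above, built in one step.
M1 <- matrix(c(1L,0L,0L,0L, 1L,1L,0L,0L, 0L,0L,1L,1L, 0L,0L,1L,0L), nrow = 4)
R <- c(1, 1, 2, 2)

# One sub-matrix per group: take the group's rows, then drop every
# column that is all zero within those rows.
subsets <- lapply(split(seq_len(nrow(M1)), R), function(rows) {
  block <- M1[rows, , drop = FALSE]
  block[, colSums(block != 0) > 0, drop = FALSE]
})

subsets[["1"]]  # rows 1-2, columns 1-2
subsets[["2"]]  # rows 3-4, columns 3-4
```

Each element of `subsets` could then be passed to a separate model fit.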

The problem is related to the rank, but all the answers I have been able to find deal with reducing the dimension of the matrix, not with subsetting it into independent parts.

Is this somehow related to model.matrix()? - abhiieor
@abhiieor, yes, the problem is related. The input matrix is a model matrix, which I was trying to split into independent parts to reduce the size of the data sets before analysis. Unfortunately, I found that all rows were dependent, i.e. there was no subset of rows whose columns carried all the information for that subset while being unused by every other row. - Thorvall

1 Answer

Iterating through the matrix is one solution, implemented by the functions below (2-D matrices only). It is not pretty and does not exploit any matrix properties, but it is posted as one way to solve the problem:

`%ni%` <- Negate(`%in%`)
data <- hjlpmidMatrix  # the poster's own 0/1 matrix; substitute your own


# Rows that have a 1 in column `col`.
getRow <- function(data, col) {
  as.vector(which(data[, col] == 1))
}
# Columns that have a 1 in row `row`.
getCol <- function(data, row) {
  as.vector(which(data[row, ] == 1))
}


# Returns one group label per row of `data`, in row order.
splitmatrix <- function(data) {
  if (!is.matrix(data)) {
    stop("data must be a matrix")
  }
  if (ncol(data) < 1) {
    stop("no columns in data")
  }
  group <- rep(NA_integer_, nrow(data))
  i <- 0L

  # Each pass of the outer loop collects one group, starting from the
  # first row that has no label yet.
  while (anyNA(group)) {
    i <- i + 1L
    row <- which(is.na(group))[1]
    colIndex <- NULL
    # Alternate rows -> columns -> rows until nothing new is reached.
    while (length(row) > 0) {
      group[row] <- i
      col <- unique(unlist(lapply(row, getCol, data = data)))
      col <- col[col %ni% colIndex]
      colIndex <- c(colIndex, col)
      row <- unique(unlist(lapply(col, getRow, data = data)))
      row <- row[is.na(group[row])]
    }
  }
  group
}
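
As a more compact alternative that does use a matrix product, here is a minimal sketch under a few assumptions. It treats two rows as linked when they share at least one non-zero column, then repeatedly gives each row the smallest label among its linked rows until nothing changes. The name `row_blocks` is hypothetical, and the approach builds an nrow-by-nrow matrix, so it only suits data sets with a moderate number of rows.

```r
# Hypothetical helper, not part of the code above.
row_blocks <- function(M) {
  B <- (M != 0) * 1            # numeric 0/1 copy of the matrix
  adj <- tcrossprod(B) > 0     # adj[i, j]: rows i and j share a column
  diag(adj) <- TRUE            # every row is linked to itself
  label <- seq_len(nrow(M))
  repeat {
    # each row takes the smallest label among the rows it is linked to
    new_label <- vapply(seq_len(nrow(M)),
                        function(r) min(label[adj[r, ]]),
                        integer(1))
    if (all(new_label == label)) break
    label <- new_label
  }
  # renumber the labels as 1, 2, ... in order of first appearance
  match(label, unique(label))
}

# Small example from the question.
M1 <- matrix(c(1L,0L,0L,0L, 1L,1L,0L,0L, 0L,0L,1L,1L, 0L,0L,1L,0L), nrow = 4)
row_blocks(M1)   # 1 1 2 2

M1[3, 2] <- 1L   # links the two blocks
row_blocks(M1)   # 1 1 1 1
```

With this labelling a row that is entirely zero ends up in a group of its own.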