0
votes

I was wondering what would be the best method to go about doing this.

I have a csv table with over 500k entries that has 3 columns. I noticed that some rows have multiple start and end positions. I want to consolidate them into a single row for them and just take the minimum and maximum for their range.

I was wondering how can I go got about it. I was thinking of using R? I open to any suggestions. Thank you in advance.

A,10,200
A,250,350
B,5,220
B,230,260
C,1,100
D,20,50
E,1,10
F,11,90
F,100,200
F,210,350
etc

I hope the final results to look like:

A,10,350
B,5,260
C,1,100
D,20,50
E,1,10
F,11,350

2
You can remove python tagEpsi95

2 Answers

1
votes

In R, we can group by the first column and summarise with c_across to return the min and max based on the other columns

library(dplyr)    
df1 %>%
  group_by(col1) %>% 
  summarise(min = min(c_across(everything())), 
             max = max(c_across(everything())), .groups = 'drop')

-output

# A tibble: 6 x 3
#  col1    min   max
#  <chr> <int> <int>
#1 A        10   350
#2 B         5   260
#3 C         1   100
#4 D        20    50
#5 E         1    10
#6 F        11   350

data

df1 <- structure(list(col1 = c("A", "A", "B", "B", "C", "D", "E", "F", 
"F", "F"), col2 = c(10L, 250L, 5L, 230L, 1L, 20L, 1L, 11L, 100L, 
210L), col3 = c(200L, 350L, 220L, 260L, 100L, 50L, 10L, 90L, 
200L, 350L)), class = "data.frame", row.names = c(NA, -10L))
1
votes

This works by pivoting your two numeric columns into one column. This way you can take the min and max on one column, which is value.

library(tidyr)
library(dplyr)

df %>% 
  tidyr::pivot_longer(cols = c(B, C)) %>%
  dplyr::group_by(A) %>% 
  dplyr::summarize(min = min(value),
                   max = max(value))

  A       min   max
  <chr> <int> <int>
1 A        10   350
2 B         5   260
3 C         1   100
4 D        20    50
5 E         1    10
6 F        11   350

Data

lines <- "
A,B,C
A,10,200
A,250,350
B,5,220
B,230,260
C,1,100
D,20,50
E,1,10
F,11,90
F,100,200
F,210,350"

df <- read.table(text = lines, header = T, sep = ",")