chr1    26150023    26150023    ncRNA_exonic    
chr1    26162313    26162313    exonic      
chr1    26349533    26349535    exonic  
chr1    26357656    26357656    UTR5        
chr1    26487940    26487940    exonic  
chr1    26150023    26150023    ncRNA_exonic    
chr1    26162353    26162313    splicing        
chr1    26349533    26349535    exonic;splicing 
chr1    26357656    26357656    exonic      
chr1    26487940    26487940    UTR3    
chr1    26357656    26357656    intronic        
chr1    26487940    26487940    intergenic

I have a very large CSV file with dozens of columns and thousands of rows. I want to delete every row whose 4th column contains anything other than exonic, exonic;splicing, or splicing.

After deleting those rows, my file would look like this:

chr1    26162313    26162313    exonic      
chr1    26349533    26349535    exonic 
chr1    26487940    26487940    exonic  
chr1    26162353    26162313    splicing        
chr1    26349533    26349535    exonic;splicing 
chr1    26357656    26357656    exonic

I tried sed, but it also deletes rows I want to keep. For example, if UTR3 appears in the 10th column, that row is deleted too, which I don't want. I used this command:

sed -e '/upstream/d' -e '/downstream/d' -e '/intronic/d' -e '/intergenic/d' -e '/ncRNA_exonic/d' -e '/ncRNA_intronic/d' -e '/ncRNA_splicing/d' -e '/ncRNA_UTR5/d' -e '/UTR3/d' -e '/UTR5/d' input.csv > output.csv 

Is there any way I can get this to work?

Thanks in advance

1 Answer


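Your sed patterns match anywhere on the line, which is why UTR3 in the 10th column removes the row. Use awk with an anchored regex to test only the 4th column: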

awk '$4 ~ "^(exonic|exonic;splicing|splicing)$"' file

Output:

chr1    26162313    26162313    exonic      
chr1    26349533    26349535    exonic  
chr1    26487940    26487940    exonic  
chr1    26162353    26162313    splicing        
chr1    26349533    26349535    exonic;splicing 
chr1    26357656    26357656    exonic
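
If the real file is tab-delimited and some columns can be empty, default whitespace splitting may shift the field numbers. Here is a minimal sketch that sets the separator explicitly. It assumes tabs as the delimiter and no padding spaces inside column 4; the file names and the f6..f10 placeholder values are made up for illustration.

```shell
# Hypothetical sample: tab-separated, 10 columns, column 5 left empty on purpose.
tmpdir=$(mktemp -d)
input="$tmpdir/input.tsv"
output="$tmpdir/output.tsv"

{
  printf 'chr1\t26150023\t26150023\tncRNA_exonic\t\tf6\tf7\tf8\tf9\tf10\n'
  printf 'chr1\t26162313\t26162313\texonic\t\tf6\tf7\tf8\tf9\tUTR3\n'
  printf 'chr1\t26162353\t26162313\tsplicing\t\tf6\tf7\tf8\tf9\tf10\n'
  printf 'chr1\t26349533\t26349535\texonic;splicing\t\tf6\tf7\tf8\tf9\tf10\n'
  printf 'chr1\t26487940\t26487940\tUTR3\t\tf6\tf7\tf8\tf9\texonic\n'
} > "$input"

# -F'\t' splits on tabs only, so an empty column still counts as a field.
# The anchors ^ and $ make the test an exact match on column 4 alone.
awk -F'\t' '$4 ~ /^(exonic|exonic;splicing|splicing)$/' "$input" > "$output"

cat "$output"
```

The row with exonic in column 4 and UTR3 in column 10 is kept, while the row with UTR3 in column 4 and exonic in column 10 is dropped, because only the 4th field is tested.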