4
votes

Let's say I have a file with several million lines, organized like this:

@1:N:0:ABC
XYZ

@1:N:0:ABC
ABC

I am trying to write a one-line grep/sed/awk command that returns both lines if the sequence at the end of the first line (the part after the last colon, ABC in this example) is found in the second line.

When I try to use grep -A1 -P with a pattern like '(?<=:)[A-Z]{3}' and pipe the matches onward, I get stuck. I think my creativity is failing me here.

3
So the expected output from your example would be nothing? - Tom Fenech
Well in this case... yes. But there are a few thousand in there where the first sequence is found in the second line. - Ryan Ward
I don't really have any control over that. This is Illumina data from a shotgun genome. Technically, I am trying to clean it up by identifying these reads and cutting out the adapter sequence. - Ryan Ward
You have total control over the example that you post in the question, though! - Tom Fenech
Gotcha, cleaned it up a little. - Ryan Ward

3 Answers

6
votes

With awk

$ awk -F: 'NF==1 && $0 ~ s{print p ORS $0} {s=$NF; p=$0}' ip.txt
@1:N:0:ABC
ABC
  • -F: use : as delimiter, makes it easy to get last column
  • s=$NF; p=$0 save last column value and entire line for printing later
  • NF==1 if line doesn't contain :
  • $0 ~ s if line contains the last column data saved previously
    • if search data can contain regex meta characters, use index($0,s) instead to search literally
  • note that this code assumes the input has a line containing : followed by a line that does not contain :
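
As a rough sketch of the index() suggestion above: the temporary file and its contents are made up for illustration, with a sequence that happens to contain a regex metacharacter.

```shell
# Hypothetical sample data: "A.C" contains a regex metacharacter (.)
sample=$(mktemp)
printf '%s\n' '@1:N:0:A.C' 'ABC' '' '@1:N:0:A.C' 'TTA.CGG' > "$sample"

# index($0, s) is a literal substring test, so "A.C" does not match "ABC"
literal_matches=$(awk -F: 'NF==1 && index($0, s) {print p ORS $0} {s=$NF; p=$0}' "$sample")
printf '%s\n' "$literal_matches"

rm -f "$sample"
```

Only the second pair should be printed. With the regex test $0 ~ s, the first pair would be reported too, because . matches any character.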


With GNU sed (might work with other versions too, syntax might differ though)

$ sed -nE '/:/{N; /.*:(.*)\n.*\1/p}' ip.txt
@1:N:0:ABC
ABC
  • /:/ if line contains :
  • N add next line to pattern space
  • /.*:(.*)\n.*\1/ capture string after last : and check if it is present in next line

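Again, this assumes input like that shown in the question. It won't work for cases like the following, because N pulls the second header line into the pattern space together with the first: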

@1:N:0:ABC
@1:N:0:XYZ
XYZ
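
A small demonstration of this limitation, assuming GNU sed and a made-up temporary file holding the three lines above. The sed command is expected to print nothing, while the awk command from earlier in this answer still reports the pair, since it only remembers the most recent line:

```shell
# Hypothetical sample: two header lines in a row, then a sequence line
sample=$(mktemp)
printf '%s\n' '@1:N:0:ABC' '@1:N:0:XYZ' 'XYZ' > "$sample"

# sed joins the two header lines, so XYZ is never compared with its own header
sed_out=$(sed -nE '/:/{N; /.*:(.*)\n.*\1/p}' "$sample")

# awk keeps only the last saved header, so the XYZ pair is still found
awk_out=$(awk -F: 'NF==1 && $0 ~ s{print p ORS $0} {s=$NF; p=$0}' "$sample")

printf 'sed: [%s]\n' "$sed_out"
printf 'awk:\n%s\n' "$awk_out"

rm -f "$sample"
```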
3
votes

This might work for you (GNU sed):

sed -n 'N;/.*:\(.*\)\n.*\1/p;D' file

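The -n option makes sed behave like grep: nothing is printed unless explicitly requested. N reads a second line into the pattern space, and both lines are printed if they meet the requirements. D then deletes the first line and the cycle repeats with the second line as the new first line.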
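
The same command spelled out with comments, as a sketch run against a made-up copy of the question's sample (GNU sed assumed, as in the answer; the temporary file is hypothetical):

```shell
# Hypothetical sample mirroring the question, with a blank line between records
sample=$(mktemp)
printf '%s\n' '@1:N:0:ABC' 'XYZ' '' '@1:N:0:ABC' 'ABC' > "$sample"

window_out=$(sed -n '
  # append the next input line, so the pattern space holds two lines
  N
  # print both lines if the text after a colon on line 1 recurs on line 2
  /.*:\(.*\)\n.*\1/p
  # delete line 1 and restart the cycle with line 2 as the new first line
  D
' "$sample")
printf '%s\n' "$window_out"

rm -f "$sample"
```

Because the window slides one line at a time, every pair of adjacent lines is tested, including pairs that start with a sequence line or a blank line; those simply fail the match.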

3
votes

If your actual Input_file is the same as the example shown, then the following may help too.

awk -v FS="[: \n]" -v RS="" '$(NF-1)==$NF'  Input_file

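EDIT: Adding one more solution, as per Sundeep's suggestion. It uses index() so that the record is printed when the first sequence occurs anywhere in the second line, rather than only when the two are equal.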

awk -v FS='[:\n]' -v RS= 'index($NF, $(NF-1))' Input_file
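
To see how the two commands differ, here is a minimal sketch on made-up data, assuming records are separated by blank lines as in the question (the temporary file and the TTABCGG read are hypothetical):

```shell
# Hypothetical sample: the second read contains ABC but is not equal to it
sample=$(mktemp)
printf '%s\n' '@1:N:0:ABC' 'XYZ' '' '@1:N:0:ABC' 'TTABCGG' > "$sample"

# RS= (paragraph mode) makes each blank-line-separated block one record;
# the last field is the read, the field before it is the header sequence
equal_out=$(awk -v FS='[:\n]' -v RS= '$(NF-1)==$NF' "$sample")
contains_out=$(awk -v FS='[:\n]' -v RS= 'index($NF, $(NF-1))' "$sample")

printf 'equality:    [%s]\n' "$equal_out"
printf 'containment:\n%s\n' "$contains_out"

rm -f "$sample"
```

The equality test should print nothing here, while the index() version prints the second record.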