How can I print 2 lines if the second line contains the same match as the first line?

Question

Let's say I have a file with several million lines, organized like this:

@1:N:0:ABC
XYZ

@1:N:0:ABC
ABC

I am trying to write a one-line grep/sed/awk command that prints both lines when the sequence from the first line (such as NCCGGAGA, or ABC in the example above) is found in the second line.

When I try grep -A1 -P with a pattern like '(?<=:)[A-Z]{3}' and pipe the matches onward, I get stuck. I think my creativity is failing me here.

Well in this case... yes. But there are a few thousand in there where the first sequence is found in the second line. — Ryan Ward
I don't really have any control over that. This is Illumina data from a shotgun genome. Technically, I am trying to clean it up by identifying these reads and cutting out the adapter sequence. — Ryan Ward
You have total control over the example that you post in the question, though! — Tom Fenech

Sundeep · Accepted Answer · 2018-04-09T08:35:25

With awk

$ awk -F: 'NF==1 && $0 ~ s{print p ORS $0} {s=$NF; p=$0}' ip.txt
@1:N:0:ABC
ABC

-F: use : as the field delimiter, which makes it easy to get the last column
s=$NF; p=$0 save the last column and the entire line for use when the next line is read
NF==1 true if the line doesn't contain :
$0 ~ s true if the line matches the previously saved last column, treated as a regex
- if the search data can contain regex metacharacters, use index($0,s) instead to search literally
Note that this code assumes the input file has a line containing : followed by a line that doesn't contain :
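
The literal-match variant mentioned above could look like the following sketch. The sample data, the temporary file, and the extra s != "" guard are assumptions added for illustration; the guard only avoids relying on how a particular awk treats index() with an empty search string.

```shell
# Hypothetical sample: the second record's tag "A.C" contains a regex metacharacter
tmp=$(mktemp)
printf '%s\n' '@1:N:0:ABC' 'XYZ' '@1:N:0:A.C' 'ABC' '@1:N:0:ABC' 'TTABCTT' > "$tmp"

# index($0, s) is nonzero only when s occurs literally in the current line
out=$(awk -F: 'NF==1 && s != "" && index($0, s) {print p ORS $0} {s=$NF; p=$0}' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

With this input only the last pair (@1:N:0:ABC and TTABCTT) is printed. Using $0 ~ s instead would also report the A.C / ABC pair, because . matches any character in a regex.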

With GNU sed (might work with other versions too, syntax might differ though)

$ sed -nE '/:/{N; /.*:(.*)\n.*\1/p}' ip.txt
@1:N:0:ABC
ABC

/:/ if line contains :
N add next line to pattern space
/.*:(.*)\n.*\1/ capture string after last : and check if it is present in next line

Again, this assumes input like that shown in the question. It won't work for cases like

@1:N:0:ABC
@1:N:0:XYZ
XYZ
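
One possible way to cope with consecutive header lines is a sliding two-line window built from N and D, sketched below. This is not part of the original answer; it assumes GNU sed (for -E with back-references and \n inside bracket expressions), and the temporary file and sample data are made up for illustration.

```shell
# Hypothetical sample: two header lines in a row, then a sequence line
tmp=$(mktemp)
printf '%s\n' '@1:N:0:ABC' '@1:N:0:XYZ' 'XYZ' > "$tmp"

# $!N  append the next line, unless already on the last line
# p    print the pair if line 1 ends in :TAG and line 2 has no : but contains TAG
# D    drop the first line of the pair and restart the cycle with the second line
out=$(sed -nE '$!N; /^[^\n]*:([^\n:]+)\n[^\n:]*\1[^\n:]*$/p; D' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Because D keeps the second line in the pattern space instead of discarding it, every adjacent pair of lines gets tested, so the @1:N:0:XYZ / XYZ pair is found here.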
