sed/awk between two patterns in a file: pattern 1 set by a variable from lines of a second file; pattern 2 designated by a specified character

1

votes

I have two files. One file contains a pattern that I want to match in a second file. I want to use that pattern to print between that pattern (included) up to a specified character (not included) and then concatenate into a single output file.

For instance,

File_1:

a
c
d

and File_2:

>a
MEEL
>b
MLPK
>c
MEHL
>d
MLWL
>e
MTNH

I have been using variations of this loop:

while read $id;
     do 
       sed -n "/>$id/,/>/{//!p;}" File_2;
done < File_1

hoping to obtain something like the following output:

>a
MEEL
>c
MEHL
>d
MLWL

But I have had no such luck. I have played around with grep/fgrep, awk, and sed, and between the three I cannot seem to get the right (or any) output. Would someone kindly point me in the right direction?

bash awk sed grep

You seem to be processing a FASTA file, is that right? If so, please add the fasta tag. – tripleee

5

votes

Try:

$ awk -F'>' 'FNR==NR{a[$1]; next}  NF==2{f=$2 in a} f'  file1 file2
>a
MEEL
>c
MEHL
>d
MLWL

How it works

-F'>'

This sets the field separator to >.

FNR==NR{a[$1]; next}

While reading the first file, this creates a key in array a for every line of file1, then skips to the next line.

NF==2{f=$2 in a}

For every line in file2 that has two fields (that is, a header line starting with >), this sets variable f to true if the second field is a key in a, or to false if it is not.

f

If f is true, print the line.
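
To see how the flag f carries over to records that span more than one line, here is a minimal, self-contained sketch. The temporary directory, the file names inside it, and the extra sequence line MKVA are assumptions made up for illustration; only the awk program itself comes from the answer.

```bash
#!/usr/bin/env bash
# Hypothetical sample data: the record for "c" has two sequence lines.
tmp=$(mktemp -d)
printf '%s\n' a c d > "$tmp/file1"
printf '%s\n' '>a' MEEL '>b' MLPK '>c' MEHL MKVA '>d' MLWL '>e' MTNH > "$tmp/file2"

# Same awk program as above: f is updated only on header lines,
# so it stays set for every sequence line that follows a wanted header.
result=$(awk -F'>' 'FNR==NR{a[$1]; next}  NF==2{f=$2 in a} f' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because f keeps its value until the next header line, both MEHL and MKVA are printed for record c, while the records for b and e are skipped entirely.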

2

votes

A plain (GNU) sed solution. Files are read only once. It is assumed that the characters in File_1 do not need to be quoted in a sed expression.

pat=$(sed ':a; $!{N;ba;}; y/\n/|/' File_1)
sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" File_2

Explanation: The first call to sed generates a regular expression to be used in the second call to sed and stores it in the variable pat. The aim is to avoid rereading the entire File_2 for each line of File_1. It just "slurps" File_1 and replaces the newline characters with | characters, so the sample File_1 becomes the string a|c|d. The regular expression a|c|d matches if at least one of the alternatives (a, c, d for this example) matches; alternation with | requires the extended regular expressions enabled by -E.

The second sed expression, ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}", could be converted to pseudo code like this:

begin:
    read next line (from File_2) or quit on end-of-file
label_a:
    if line begins with `>` followed by one of the alternatives in `pat` then
label_b:
        print the line
        read next line (from File_2) or quit on end-of-file
        if line begins with `>` goto label_a else goto label_b
    else goto begin
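
The two sed calls can be tried on throwaway data as follows. This is only a sketch and assumes GNU sed; the temporary files, the extra >ab record, and the anchored variant with a trailing $ are illustrative additions, not part of the answer.

```bash
#!/usr/bin/env bash
# Hypothetical sample data, including a header (>ab) that starts with a wanted id.
tmp=$(mktemp -d)
printf '%s\n' a c > "$tmp/File_1"
printf '%s\n' '>a' MEEL '>ab' MQQQ '>b' MLPK '>c' MEHL > "$tmp/File_2"

# Slurp File_1 and join its lines with | to form the alternation.
pat=$(sed ':a; $!{N;ba;}; y/\n/|/' "$tmp/File_1")

# Expression from the answer: a header matches if it starts with one of the ids.
prefix=$(sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" "$tmp/File_2")

# Assumed variant: the trailing \$ requires the whole header to equal an id.
exact=$(sed -E -n ":a; /^>($pat)\$/{:b; p; n; /^>/ba; bb}" "$tmp/File_2")

printf 'pat: %s\n' "$pat"
printf 'prefix match:\n%s\n' "$prefix"
printf 'exact match:\n%s\n' "$exact"
rm -rf "$tmp"
```

With this data the unanchored expression also selects the >ab record, because >ab begins with >a; the anchored variant selects only >a and >c. Which behaviour is wanted depends on how the headers in the real File_2 are written.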

1

votes

Let me try to explain why your approach does not work well:

You need to say while read id instead of while read $id.
The sed command />$id/,/>/{//!p;} will exclude the lines which start with >.
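
The first point can be checked directly in bash. In this minimal sketch the here-strings and the variable names broken and fixed are only for illustration: because id is unset, read $id expands to a bare read, which stores the line in REPLY and leaves id empty.

```bash
#!/usr/bin/env bash
unset id REPLY

# $id expands to nothing, so bash runs a bare `read` and fills REPLY instead.
read $id <<< 'a'
broken="id='${id-}' REPLY='${REPLY-}'"

# Passing the variable name (no $) makes read assign to id.
read id <<< 'a'
fixed="id='${id}'"

printf '%s\n' "$broken" "$fixed"
```

With id empty, the sed address in the original loop becomes />/,/>/, which is no longer tied to any particular identifier.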

Then you might want to say something like:

while read id; do
    sed -n "/^>$id/{N;p}" File_2
done < File_1

Output:

>a
MEEL
>c
MEHL
>d
MLWL

But the code above is inefficient because it reads File_2 once for every id in File_1.
Please try the elegant solution by John1024 instead.

0

votes

If ed is available, and since the shell is already involved:

#!/usr/bin/env bash

mapfile -t to_match < file1.txt

ed -s file2.txt <<-EOF
  g/\(^>[${to_match[*]}]\)/;/^>/-1p
  q
EOF

This runs ed only once, not once for every pattern taken from file1. For example, if file1 contains a to z, ed will not run 26 times.
Requires bash4+ because of mapfile.

How it works

mapfile -t to_match < file1.txt saves each line of file1.txt as an element of an array named to_match.
ed -s file2.txt points ed at file2.txt; the -s flag suppresses diagnostics, including the byte count ed would otherwise print after loading the file (the same number that wc -c file reports).
<<-EOF starts a here document (shell syntax).

g/\(^>[${to_match[*]}]\)/;/^>/-1p

g means global: run the command that follows on every line in the file that matches the pattern.
\( \) is a capture group; the parentheses need escaping because ed only supports BRE (basic regular expressions).
^> matches a > at the start of the line; ^ is the start-of-line anchor.
[ ] is a bracket expression; it matches any single character listed inside it, in this case the expanded array "${to_match[*]}".
; separates two addresses; with nothing before it, the range starts at the current (matched) line.
/^>/ finds the next line with a leading >.
-1 steps back one line from that match, so the range ends just before the next header.
p prints the addressed lines.
q quits ed.
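
To see what ed actually receives, the command line can be expanded without running ed at all. This is a small sketch for bash 4+; the temporary file and the variable name cmd are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for file1.txt with the sample ids.
tmp=$(mktemp)
printf '%s\n' a c d > "$tmp"

mapfile -t to_match < "$tmp"

# "${to_match[*]}" joins the array elements with spaces (the default IFS),
# so the bracket expression ends up as [a c d].
cmd="g/\(^>[${to_match[*]}]\)/;/^>/-1p"

printf '%s\n' "$cmd"
rm -f "$tmp"
```

The bracket expression [a c d] matches exactly one character (here a, c, d, or a space), so this form suits single-character ids like those in the sample; longer ids would need a different pattern.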
