1
votes

I have two files. The first contains patterns that I want to match in the second. For each pattern, I want to print from the matching line (included) up to the next occurrence of a specified character (not included), and then concatenate the results into a single output file.

For instance,

File_1:

a
c
d

and File_2:

>a
MEEL
>b
MLPK
>c
MEHL
>d
MLWL
>e
MTNH

I have been using variations of this loop:

while read $id;
     do 
       sed -n "/>$id/,/>/{//!p;}" File_2;
done < File_1

hoping to obtain something like the following output:

>a
MEEL
>c
MEHL
>d
MLWL

But I have had no such luck. I have played around with grep/fgrep, awk, and sed, and between the three I cannot seem to get the right (or any) output. Would someone kindly point me in the right direction?

4
You seem to be processing a FASTA file, is that right? If so, please add the fasta tag. - tripleee

4 Answers

5
votes

Try:

$ awk -F'>' 'FNR==NR{a[$1]; next}  NF==2{f=$2 in a} f'  file1 file2
>a
MEEL
>c
MEHL
>d
MLWL

How it works

  • -F'>'

    This sets the field separator to >.

  • FNR==NR{a[$1]; next}

    While reading in the first file, this creates a key in array a for every line in file1.

  • NF==2{f=$2 in a}

    For every line in file2 that has two fields (that is, every header line containing >), this sets variable f to true if the second field is a key in a, or to false if it is not.

  • f

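    If f is true, print the line. Because f keeps its value until the next header line, the sequence lines that follow a selected header are printed as well.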
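
As a minimal sketch of the steps above, the same awk program can be written out with one commented rule per line. The function name extract_records, the variable names wanted and keep, and the scratch directory are assumptions made for this illustration; they are not part of the original answer.

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' a c d > "$workdir/file1"
printf '%s\n' '>a' MEEL '>b' MLPK '>c' MEHL '>d' MLWL '>e' MTNH > "$workdir/file2"

# extract_records IDS FASTA (hypothetical helper):
# print every record in FASTA whose header ID is listed in IDS.
extract_records() {
    awk -F'>' '
        FNR == NR { wanted[$1]; next }      # first file: remember each ID as an array key
        NF == 2   { keep = ($2 in wanted) } # header line: decide whether this record is wanted
        keep                                # print the line while the current record is wanted
    ' "$1" "$2"
}

extract_records "$workdir/file1" "$workdir/file2"
```

Since the flag only changes on header lines, this form should also cope with sequences that span several lines.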

2
votes

A plain (GNU) sed solution. Each file is read only once. It is assumed that the lines in File_1 contain no characters that would need to be escaped in a sed regular expression.

pat=$(sed ':a; $!{N;ba;}; y/\n/|/' File_1)
sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" File_2

Explanation: The first call to sed generates a regular expression to be used in the second call to sed and stores it in the variable pat. The aim is to avoid rereading the whole of File_2 for each line of File_1. It simply "slurps" File_1 and replaces the newline characters with | characters, so the sample File_1 becomes the string a|c|d. The regular expression a|c|d matches if at least one of the alternatives (a, c, d in this example) matches; alternation with | is extended regular expression syntax, which the -E option enables.

The second sed expression, ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}", could be converted to pseudo code like this:

begin:
    read next line (from File_2) or quit on end-of-file
label_a:
    if line begins with `>` followed by one of the alternatives in `pat` then
label_b:
        print the line
        read next line (from File_2) or quit on end-of-file
        if line begins with `>` goto label_a else goto label_b
    else goto begin
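
The answer assumes that the IDs need no escaping. As a hedged sketch of how that assumption might be relaxed, the pattern can be built by escaping regex metacharacters first and joining the lines afterwards. The helper names build_pat and extract_records are hypothetical, paste is used here in place of the slurping sed call, and GNU sed is assumed as in the answer.

```shell
#!/bin/sh
# Assumes GNU sed (for -E and the one-line label syntax used in the answer).
workdir=$(mktemp -d)
printf '%s\n' a c d > "$workdir/File_1"
printf '%s\n' '>a' MEEL '>b' MLPK '>c' MEHL '>d' MLWL '>e' MTNH > "$workdir/File_2"

# build_pat IDS (hypothetical helper): put a backslash in front of ERE
# metacharacters (and the / delimiter) in each ID, then join the IDs with "|".
build_pat() {
    sed 's#[][\.|$(){}?+*^/]#\\&#g' "$1" | paste -sd'|' -
}

# extract_records IDS FASTA (hypothetical helper): the sed program from the
# answer, fed the escaped pattern.
extract_records() {
    pat=$(build_pat "$1")
    sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" "$2"
}

build_pat "$workdir/File_1"                          # shows the generated pattern
extract_records "$workdir/File_1" "$workdir/File_2"  # prints the selected records
```

For the sample File_1 the generated pattern is still a|c|d; the escaping only matters for IDs that contain characters such as . or |, which are common in real FASTA headers.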
1
votes

Let me try to explain why your approach does not work well:

  • You need to say while read id instead of while read $id.
  • The sed command />$id/,/>/{//!p;} will exclude the lines which start with >, so the headers you want are never printed.
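
The first point can be checked in isolation. The sketch below is only an illustration; the function names broken_read and fixed_read are made up for it. With id unset, read $id expands to a bare read, so id never receives the line.

```shell
#!/bin/sh
# broken_read: what the loop in the question does. With id unset, "read $id"
# expands to a bare "read", so id is never assigned (bash stores the line in
# REPLY; some other shells report an error, which is silenced here).
broken_read() {
    unset id
    printf 'a\n' | { read $id 2>/dev/null; printf 'id=[%s]\n' "$id"; }
}

# fixed_read: pass the variable name, not its value.
fixed_read() {
    unset id
    printf 'a\n' | { read -r id; printf 'id=[%s]\n' "$id"; }
}

broken_read   # prints id=[]
fixed_read    # prints id=[a]
```

With an empty id, the sed address in the question becomes />/,/>/, which no longer depends on File_1 at all.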

Then you might want to say something like:

while read id; do
    sed -n "/^>$id/{N;p}" File_2
done < File_1

Output:

>a
MEEL
>c
MEHL
>d
MLWL

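But the code above is inefficient because it reads File_2 once for every ID in File_1. It also assumes that each sequence occupies exactly one line, since N appends only the single line that follows the header.
Please try the elegant solution by John1024 instead.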

0
votes

If ed is available, and since the shell is involved:

#!/usr/bin/env bash

mapfile -t to_match < file1.txt

ed -s file2.txt <<-EOF
  g/\(^>[${to_match[*]}]\)/;/^>/-1p
  q
EOF
  • It runs ed only once, not once for every pattern from file1 that matches. For example, if file1 contains a to z, ed will not run 26 times.

  • Requires bash 4+ because of mapfile.


How it works

  • mapfile -t to_match < file1.txt saves each line of file1 as an element of an array named to_match.

  • ed -s file2.txt opens file2 in ed; the -s flag suppresses diagnostics such as the byte count that ed otherwise prints when it loads a file (the same number wc -c file reports).

  • <<-EOF starts a here document (shell syntax) that supplies the ed commands.

g/\(^>[${to_match[*]}]\)/;/^>/-1p
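  • g means global: run the command that follows on every line in the file that matches the pattern.

  • \( \) is a capture group; the parentheses need escaping because ed only supports BRE (basic regular expressions).

  • ^> matches a > at the start of a line; ^ is the anchor for the start of the line.

  • [ ] is a bracket expression, which matches any single character listed inside it; in this case the list is the expansion of the array, ${to_match[*]}.

  • ; separates the two addresses of a range. The first address is the line g matched (the current line), and the second is searched for starting from there.

  • /^>/ matches the next line with a leading >.

  • -1 steps back one line from that match, so the range stops just before the next header.

  • p prints the lines in the addressed range.

  • q quits ed.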
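
To see what ed actually receives, the here document can be printed with cat instead of being piped to ed. This is only a sketch; the function name show_ed_script and the temporary ID file are assumptions for the illustration, and bash 4+ is assumed because of mapfile.

```shell
#!/usr/bin/env bash
# show_ed_script IDS (hypothetical helper): print the commands the here
# document would hand to ed, without running ed.
show_ed_script() {
    local to_match
    mapfile -t to_match < "$1"
    cat <<EOF
g/\(^>[${to_match[*]}]\)/;/^>/-1p
q
EOF
}

ids=$(mktemp)
printf '%s\n' a c d > "$ids"
show_ed_script "$ids"
# prints:
#   g/\(^>[a c d]\)/;/^>/-1p
#   q
```

Because the IDs end up inside a bracket expression, each one is matched as a single character (and the joining space becomes part of the set), so this approach fits single-character IDs like those in the sample; a multi-character ID would be treated as a set of characters rather than as a word.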