Extracting Specific Fasta record/s from a Multi-fasta File

When working with multi-FASTA files, you often need to extract the few sequences whose records contain a keyword of interest. One fast way to do this is with awk.

For example:

Input file: hg19_genome.fa


We would like to extract the sequence for Chr2 from hg19_genome.fa. Use the following command:

$ awk 'BEGIN {RS=">"} /Chr2/ {print ">"$0}' hg19_genome.fa



Note that the search keyword (here 'Chr2') is matched anywhere in the record, so it doesn't have to equal the whole header. If you use 'MT' instead, you will get the third and fourth entries, since 'MT' is a substring of both of those FASTA records.
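If the substring behaviour is too loose, one option is to compare only the first word of the header line. The following is a minimal sketch, not part of the original command: the toy file, the record names and the variable `id` are made up for illustration.

```shell
#!/bin/sh
# Toy input; these record names are invented for the demo.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>Chr2 demo
ACGT
>Chr21 demo
GGCC
>Chr22 demo
TTAA
EOF

# Substring search, as in the command above: /Chr2/ also hits Chr21 and Chr22.
loose_hits=$(awk 'BEGIN {RS=">"} /Chr2/ {print ">"$0}' "$demo_fa" | grep -c '^>')

# Exact match on the ID: FS="\n" makes $1 the header line, and split() on " "
# takes its first word, which is compared to the wanted ID as a plain string.
exact=$(awk -v id="Chr2" 'BEGIN {RS=">"; FS="\n"}
        { split($1, h, " ") }
        h[1] == id {printf ">%s", $0}' "$demo_fa")

echo "substring search matched $loose_hits records"
printf '%s\n' "$exact"
rm -f "$demo_fa"
```

Here the substring search returns all three records, while the exact comparison returns only the Chr2 record.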

Now let's break down the command, so that we don't have to memorize it and can adapt it for use in a variety of other places.

  • awk: the main command (really a small but very powerful programming language).
  • ' ... ': every bit of awk code is written inside these single quotes.
  • BEGIN: tells awk to execute the code in the immediately following curly brackets once, before any input is read.
  • {RS=">"}: sets the record separator. Every sequence in the file starts with a ">" sign, so splitting on it turns each FASTA record into one awk record.
  • /Chr2/: the keyword to search for in the entire record (header and sequence).
  • {print ">"$0}: here $0 is the current record (from "Chr2" through the entire sequence, up to the next ">"). We add ">" at the beginning to restore the standard identifier, which is not included in $0 because awk drops the separator.
  • hg19_genome.fa: the input multi-FASTA file that we have used.
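To see these pieces at work, here is a small self-contained sketch on a toy file. The file contents are invented for the demo, and `printf` is swapped in for `print` only to avoid the extra blank line that `print` adds after each record (the record already ends with a newline).

```shell
#!/bin/sh
# Hypothetical toy input; the record names are made up for this demo.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>Chr1 demo
ACGTACGT
>Chr2 demo
GGCCGGCC
TTAA
>Chr3 demo
NNNN
EOF

# Same pattern as the article's command, with printf instead of print.
chr2=$(awk 'BEGIN {RS=">"} /Chr2/ {printf ">%s", $0}' "$demo_fa")
printf '%s\n' "$chr2"

# Count the records as awk sees them. The empty text before the first ">"
# counts as a record too, so three sequences give four records.
n_records=$(awk 'BEGIN {RS=">"} END {print NR}' "$demo_fa")
echo "records seen by awk: $n_records"
rm -f "$demo_fa"
```

The empty first record is harmless here because it can never match /Chr2/, but it is worth remembering when adapting the command, for example when counting sequences with NR.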

Suppose we are interested in more than one keyword. Two possibilities arise:

You want BOTH keywords present:
awk 'BEGIN {RS=">"} /Chr2/ && /MT/ {print ">"$0}' hg19_genome.fa

You want EITHER keyword present:
awk 'BEGIN {RS=">"} /Chr2|MT/ {print ">"$0}' hg19_genome.fa
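The two cases can be checked on a toy file, as in the sketch below. The record names are invented, only the header lines are kept to shorten the output, and the third command (one keyword but NOT the other, using `!`) is an extra variation that the original does not cover.

```shell
#!/bin/sh
# Toy input; headers are invented so that each case selects different records.
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>Chr2 nuclear
ACGT
>Chr2 MT-like
GGCC
>ChrM MT
TTAA
>Chr3 nuclear
CCCC
EOF

# BOTH keywords must occur somewhere in the record.
both=$(awk 'BEGIN {RS=">"} /Chr2/ && /MT/ {print ">"$0}' "$demo_fa" | grep '^>')

# EITHER keyword may occur (regex alternation).
either=$(awk 'BEGIN {RS=">"} /Chr2|MT/ {print ">"$0}' "$demo_fa" | grep '^>')

# Extra variation: the first keyword but NOT the second.
first_only=$(awk 'BEGIN {RS=">"} /Chr2/ && !/MT/ {print ">"$0}' "$demo_fa" | grep '^>')

printf 'both:\n%s\neither:\n%s\nfirst only:\n%s\n' "$both" "$either" "$first_only"
rm -f "$demo_fa"
```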

Note: If you are using Windows, you can download and install 'gawk' and use a similar command. In the zsh shell you might need to escape the | (pipe) character.

I am sure many of you have your own ways of doing the same thing. If you think yours is worth sharing, the comment box is all yours.

Happy Coding !!

4 thoughts on “Extracting Specific Fasta record/s from a Multi-fasta File”

    • Hello Tanvir,
      Thanks for stopping by.

      To come to your question: I am sure there are better ways to do this task, but what I can think of is running the awk command from the post inside a bash "for" loop.

      For example:
      Let's say we have a file, keywords.txt, with all the keywords to search for in a bigger sequence file, hg19.txt.

      What could work is,

      for i in $(cat keywords.txt); do
      awk -v key="$i" 'BEGIN {RS=">"} match($0, key) {print ">"$0}' hg19.txt >> results.txt ; done

      You will get your desired sequences in results.txt
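
      The loop rereads the sequence file once per keyword and writes a record twice if it matches two keywords. A single-pass variant is sketched below under a few assumptions: the toy files are invented, keywords are matched as literal substrings with index() rather than as regular expressions with match(), and RS='>' is given as an operand so that it only takes effect for the second file.

```shell
#!/bin/sh
# Toy inputs in a scratch directory; all names here are made up for the demo.
work=$(mktemp -d)
printf 'Chr2\nMT\n' > "$work/keywords.txt"
cat > "$work/demo.fa" <<'EOF'
>Chr1 demo
AAAA
>Chr2 demo
CCCC
>ChrM MT demo
GGGG
EOF

# First file (NR == FNR): read one keyword per line into the keys array.
# Second file: RS is switched to ">" before it is opened, so each record is
# one FASTA entry; print it once if it contains any keyword.
# Assumes keywords.txt has at least one line.
awk 'NR == FNR { if ($0 != "") keys[$0]; next }
     { for (k in keys) if (index($0, k)) { printf ">%s", $0; break } }
    ' "$work/keywords.txt" RS='>' "$work/demo.fa" > "$work/results.txt"

hits=$(grep -c '^>' "$work/results.txt")
headers=$(grep '^>' "$work/results.txt")
echo "matched $hits records"
printf '%s\n' "$headers"
rm -rf "$work"
```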
      Good luck!!

  1. Hi Amol! I tried your code with a small number of sequences and keywords, and it works well. Thanks. But when I used it on a much larger dataset, it didn't give what I expected. I am not sure whether I used it correctly or whether it needs a rigid input format.

  2. Hello,

    I'm not so familiar with awk code. What if I need to extract all the sequences from a multi-FASTA file and keep their names?

    I found similar code which separates the sequences based on a regular expression match and then writes each one to a sequentially numbered file:

    awk '/^>/{s=++d".fasta"} {print > s}'

    I would like similar code that keeps the original name of each sequence from the multi-FASTA file, instead of writing to sequentially numbered files.

    Could you please help?

    Best regards
