Bioinformatician’s Pocket Reference !!

It is amusing how the brain of a bioinformatician works! Spending days learning a new programming language feels like much more fun than a 5-minute chat with the neighbors (unless under special circumstances!) in our own mother tongue. Today every bioinformatician keeps more than a few languages and core IT toolkits on their plate. It has become mandatory to be able to mold different code snippets into our own custom workflows, so keeping syntax at our fingertips is essential. Although Google is the quickest way to solve a syntax problem, it is not a bad idea to keep reference sheets on our smartphones, or to stick some printed sheets on the back of the door, the old-fashioned way!!

1) Apache

2) Awk/Gawk

3) C

4) C++

5) Debian

6) Git

7) HTML

8) Java

9) LaTeX

10) Mathematica

11) Matlab

12) MySQL

13) Perl

14) PHP

15) Python

16) R

17) Screen

18) Ubuntu

19) UNIX

20) Vim

These are handpicked reference sheets, and you may encounter various other versions of them on the Internet. If you find a reference sheet that is worth sharing, feel free to paste the link below.

Finally, I sincerely thank the authors who put their effort into designing these informative reference sheets and made them available to us.


Docear: For Scientific Literature Management

In research we encounter numerous useful articles, and this plethora of literature demands methodical management. While I was struggling to find an efficient way to put all my articles into perspective, I came across ‘Docear’, which is described as a free and open source academic literature suite. After experiencing the simplicity and benefits of using Docear in academic projects, I was pleasantly surprised and equally tempted to write about it. The most appealing features of Docear include:
  1. Simple user interface
  2. Ability to create mind maps and link documents directly
  3. Seamless integration of PDFs along with their headings and custom annotations
  4. Powerful search
  5. Reference management (including support for MS Word)
  6. Works on Windows, Mac and Unix
A more detailed list of features is available on the official website. The website also offers tutorials and snapshots, which are very straightforward. I find this open-source project promising, and I hope it receives timely updates in the future.

 

Many kudos to the Docear team for developing this truly resourceful software suite.

Creating ‘Swap’ [‘Virtual’] Memory on Linux/Unix Operating System

Here’s some help for when you have too little RAM/memory and are trying to run memory-intensive steps, like indexing the human genome reference or doing other NGS-related processing. The way to do it is to create a ‘swap file’, as follows:

1) Check disk/drive usage:

df

2) Create the space. This step takes a long time if you request a large amount of space. In this example, a 512 MB file is created under root (/), since the block size (bs) is 1024 bytes and the count is 512k blocks:

sudo dd if=/dev/zero of=/swapfile bs=1024 count=512k
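The size of the file is simply bs multiplied by count. As a rough sketch, the arithmetic can be checked in the shell before running dd. The variable names here are hypothetical, and the sketch assumes GNU dd, where the k suffix on count means a multiple of 1024:

```shell
# Hypothetical variable names, for illustration only.
size_mib=512      # desired swap size in MiB
bs=1024           # block size in bytes, matching bs=1024 above

# Number of blocks dd has to write: total bytes divided by block size.
count=$(( size_mib * 1024 * 1024 / bs ))
echo "count=$count"

# The same count expressed in units of 1024 blocks.
echo "count=$(( count / 1024 ))k"
```

For a larger swap file only size_mib needs to change; for example, size_mib=4096 would give count=4096k.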

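3) Format the file as swap space, then switch it on (if mkswap warns about insecure permissions, run sudo chmod 600 /swapfile first):

sudo mkswap /swapfile
sudo swapon /swapfile

*This will only last until you restart the operating system, so it is useful as a temporary measure. To make the swap file permanent, do the following: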

4) Open the following file:

sudo nano /etc/fstab

Paste in the following:

/swapfile    none     swap     sw     0     0
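If you would rather not edit the file by hand, the same line can be appended from the shell. The following is only a sketch: it works on a scratch file created with mktemp, the function name add_swap_entry is made up for illustration, and on a real system the target would be /etc/fstab (which needs root privileges).

```shell
# Scratch file standing in for /etc/fstab (assumption: demo only).
fstab=$(mktemp)
entry='/swapfile    none     swap     sw     0     0'

# Hypothetical helper: append the entry only if /swapfile is not listed yet.
add_swap_entry() {
    grep -q '^/swapfile[[:space:]]' "$1" || printf '%s\n' "$entry" >> "$1"
}

add_swap_entry "$fstab"
add_swap_entry "$fstab"   # the second call changes nothing

# Count the /swapfile lines now present in the scratch file.
grep -c '^/swapfile' "$fstab"
```

Checking before appending keeps the entry from being duplicated if the step is repeated.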

 

Article by: Dr. Kevin Blighe

Extracting Specific Fasta record/s from a Multi-fasta File

While dealing with multi-fasta files, it is often necessary to extract a few fasta sequences that contain the keyword(s) of interest. One fast way to do this is with awk.

For example:

Input file: hg19_genome.fa

    >Chr1
    ATCTGCTGCTCGGGCTGCTCTAT…
    >Chr2
    GTACGTCGTAGGACATGCATCG…
    >MT1
    TACGATCGATCAGCTCAGCATC…
    >MT2
    CGCCATGGATCAGCTACATGTA…

We would like to extract the sequence for Chr2 from hg19_genome.fa. Use the following command:

$ awk 'BEGIN {RS=">"} /Chr2/ {print ">"$0}' hg19_genome.fa

Output:

    >Chr2
    GTACGTCGTAGGACATGCATCG…

Note that the search keyword (here ‘Chr2’) doesn’t have to be an exact match. If you use ‘MT’ instead, you will get the third and fourth entries, since ‘MT’ is a substring of both the third and fourth fasta records.
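To try this without a full genome file, here is a small self-contained sketch. It writes a toy multi-fasta file with made-up sequences to a temporary file (the variable names are hypothetical) and runs the same awk command against it:

```shell
# Build a tiny multi-fasta file with made-up sequences.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>Chr1
ATCTGCTGCTCGGGCTGCTCTAT
>Chr2
GTACGTCGTAGGACATGCATCG
>MT1
TACGATCGATCAGCTCAGCATC
>MT2
CGCCATGGATCAGCTACATGTA
EOF

# Extract the record whose text contains "Chr2".
chr2=$(awk 'BEGIN {RS=">"} /Chr2/ {print ">"$0}' "$fasta")
echo "$chr2"

# "MT" is a substring of both MT1 and MT2, so both records come back.
mt_headers=$(awk 'BEGIN {RS=">"} /MT/ {print ">"$0}' "$fasta" | grep -c '^>')
echo "records matching MT: $mt_headers"
```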

Now let’s break down the command, so that we don’t have to memorize it and can mold it for use in a variety of other places.

  • awk: This is the main command (or rather, a very powerful programming language).
  • ' ': We write every bit of awk code inside these single quotes.
  • BEGIN: This tells awk to execute the code in the immediately following curly brackets once, at the beginning, before any input is read.
  • {RS=">"}: Record separator. (If we look at the file, we can see that every sequence starts with a ">" sign. This lets us separate two fasta records.)
  • /Chr2/: Keyword to search for in the entire record.
  • {print ">"$0}: Here $0 is the current record (from "Chr2" through the entire sequence, up to the next ">"). We add ">" at the beginning to restore the standard identifier prefix, which is not included in $0.
  • hg19_genome.fa: This is the input multi-fasta file that we have used.
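As one example of molding the command, the sketch below restricts the match to the header line so that ‘Chr1’ does not also pull in ‘Chr10’. This variation is not part of the original command: it assumes that setting FS to a newline makes $1 the header line, that the header holds nothing but the identifier, and it sets ORS to an empty string so that print does not add an extra blank line. The file contents are made up.

```shell
# Toy file with two similar identifiers (made-up sequences).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>Chr1
ATCTGCTGCT
>Chr10
GTACGTCGTA
EOF

# Substring match on the whole record: /Chr1/ also matches Chr10.
loose=$(awk 'BEGIN {RS=">"} /Chr1/ {print ">"$0}' "$fasta" | grep -c '^>')
echo "substring match returns $loose records"

# Exact match on the header line only ($1 when fields are split on newlines).
strict=$(awk 'BEGIN {RS=">"; FS="\n"; ORS=""} $1 == "Chr1" {print ">"$0}' "$fasta")
echo "$strict"
```

If the headers carry a description after the identifier, the exact comparison would have to be loosened, for instance to a pattern anchored at the start of $1.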

Now, suppose we are interested in more than one keyword. Two possibilities arise:

You want BOTH keywords present:
awk 'BEGIN {RS=">"} /Chr2/ && /MT/ {print ">"$0}' hg19_genome.fa

You want EITHER keyword present:
awk 'BEGIN {RS=">"} /Chr2|MT/ {print ">"$0}' hg19_genome.fa
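A quick way to see the difference between the two is to run both on a toy file. In this sketch the records are made up, and the third header (Chr2_MT_fragment) is a hypothetical name chosen only because it contains both keywords:

```shell
# Made-up records; only the third header contains both keywords.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>Chr1
ATCTGCTGCT
>Chr2
GTACGTCGTA
>Chr2_MT_fragment
TACGATCGAT
>MT1
CGCCATGGAT
EOF

# AND: the record must contain both Chr2 and MT. Keep only the header lines.
both=$(awk 'BEGIN {RS=">"} /Chr2/ && /MT/ {print ">"$0}' "$fasta" | grep '^>')

# OR: the record may contain either Chr2 or MT. Keep only the header lines.
either=$(awk 'BEGIN {RS=">"} /Chr2|MT/ {print ">"$0}' "$fasta" | grep '^>')

echo "both:";   echo "$both"
echo "either:"; echo "$either"
```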

Note: If you are using Windows, you can download and install ‘gawk’ and use a similar command. If the pattern is not enclosed in single quotes, shells such as zsh will need the | (pipe) escaped.

I am sure many of you have your own flavors of doing the same. If you think one is worth sharing, the comment box is all yours.

Happy Coding !!