Bioinformatician’s Pocket Reference !!

It is amusing how the brain of a bioinformatician works! Spending days learning a new programming language feels like more fun than a five-minute chat with the neighbors in our own mother tongue (unless under special circumstances!). Today every bioinformatician keeps more than a few languages and core IT toolkits on their plate. Being able to mold different code snippets into custom workflows has become mandatory, so keeping syntax at our fingertips is essential. Although Google is the quickest way to solve a syntax problem, it is not a bad idea to keep reference sheets on our smartphones, or to stick a few printed sheets on the back of the door, the old-fashioned way!

1) Apache

2) Awk/Gawk

3) C

4) C++

5) Debian

6) Git

7) HTML

8) Java

  9) LaTeX

10) Mathematica

11) Matlab

12) MySQL

13) Perl

14) PHP

15) Python

16) R

17) Screen

18) Ubuntu

19) UNIX

20) Vim

These are handpicked reference sheets, and you may encounter various other versions of them on the Internet. If you find a reference sheet that is worth sharing, feel free to paste the link below.

Finally, I sincerely thank the authors who put their effort into designing these informative reference sheets and making them available to us.

Gene Ontology (GO) Enrichment Analysis in Novel Transcriptomes using BiNGO!!

A major hurdle when dealing with differentially expressed transcripts in novel organisms is Gene Ontology (GO) enrichment analysis and the visual interpretation of its results.

To date, several open-source applications are available to extract the GO terms corresponding to protein/nucleotide sequences (a detailed list can be accessed here; the best I have used for de novo transcripts is InterProScan) and to perform enrichment analysis (a detailed list is here). Most of these enrichment tools work like a charm for model organisms, but only a handful of them support custom annotations. One such tool is BiNGO (Biological Networks Gene Ontology tool), an open-source Java plug-in for Cytoscape. BiNGO can be used either on a list of genes, or interactively on subgraphs of biological networks visualized in Cytoscape. BiNGO maps the predominant functional themes of the tested gene set on the GO hierarchy.

In order to use BiNGO for novel organisms, one needs to provide a custom annotation file (CAF). In principle, the CAF contains the gene/transcript-to-GO relationships, one relationship per line, e.g.

XLOC_000001=0005515
XLOC_000001=0008270
XLOC_000001=0016491

XLOC_000002=0055114
XLOC_000003=0016491
.
.
XLOC_999999=9999999

The left value is the transcript name and the right value is the GO category (without the ‘GO:’ prefix) obtained using InterProScan or a similar tool.

The first line of the CAF should always be:

(species=Custom_species)(type=Biological Process)(curator=GO)
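The file layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_custom_annotation and the dictionary input are hypothetical, the mapping reuses the example transcripts shown earlier, and the line format simply mirrors the example in this post rather than any official BiNGO specification.

```python
def build_custom_annotation(go_terms_by_transcript,
                            species="Custom_species",
                            ontology="Biological Process"):
    """Return the text of a custom annotation file in the layout shown above.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Header line, in the form quoted in the text.
    lines = [f"(species={species})(type={ontology})(curator=GO)"]
    for transcript, go_terms in go_terms_by_transcript.items():
        for term in go_terms:
            # Drop the 'GO:' prefix if the annotation tool emitted it.
            go_id = term[3:] if term.startswith("GO:") else term
            lines.append(f"{transcript}={go_id}")
    return "\n".join(lines) + "\n"


# Transcript-to-GO mapping taken from the example lines in the text.
example_mapping = {
    "XLOC_000001": ["GO:0005515", "GO:0008270", "GO:0016491"],
    "XLOC_000002": ["GO:0055114"],
    "XLOC_000003": ["GO:0016491"],
}

annotation_text = build_custom_annotation(example_mapping)
print(annotation_text, end="")
```

Writing annotation_text to a text file then gives something that can be selected through the “Custom” option.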

You can change the species name from “Custom_species” to something else. Once the CAF (saved here as GAF.txt) is complete for all the annotated transcripts, it can be used in place of “Select organism/annotation” by choosing the “Custom” option (as shown in the figure below).

Additionally, one can switch to a newer ontology (OBO) file downloaded from the geneontology.org download page. After providing the gene list of interest and choosing the appropriate options, hit the “Start BiNGO” button to start the analysis.

Cytoscape together with BiNGO offers several downstream network grooming options, which you may find useful. For more on this, visit the BiNGO and Cytoscape user guides. Hope this helps in your endeavor.

Docear: For Scientific Literature Management

In research we encounter numerous useful articles, and this plethora of literature demands methodical management. While I was struggling to find an efficient way to put all my articles into perspective, I came across ‘Docear’, which is described as a free and open-source academic literature suite. After experiencing the simplicity and benefits of using Docear in academic projects, I was pleasantly surprised and equally tempted to write about it. The most appealing features of Docear include:
  1. Simple user-interface
  2. Ability to help create mind maps and link documents directly
  3. Seamless integration of PDFs along with their headings and custom annotations
  4. Powerful search
  5. Reference management (including support for MS Word)
  6. Works on Windows, Mac and Unix
A more detailed list of features can be accessed from the official website. Additionally, the website is equipped with tutorials and snapshots, which are very straightforward. I find this open-source project promising and I hope it will receive timely updates in future.

 

Many kudos to the Docear team for developing this truly resourceful software suite.

Creating ‘Swap’ [‘Virtual’] Memory on Linux/Unix Operating System

Here’s some help for when you have too little RAM/memory and are trying to do memory-intensive steps, like indexing the human genome reference or doing other NGS-related processing. The way to do it is to create a ‘swap file’, as follows:

1) Check disk/drive usage:

df

2) Create the space. This step takes a long time if you select a large amount of space. In this example, 512MB is created under root (/), given that the block size (bs) is 1024 bytes and the count is 512k (512 × 1024) blocks:

sudo dd if=/dev/zero of=/swapfile bs=1024 count=512k

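3) Format it as swap space and switch it on (mkswap only prepares the file; swapon activates it):

sudo mkswap /swapfile
sudo swapon /swapfile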

*Swap enabled this way only lasts until you restart the operating system, so it is useful as a temporary measure. To make the swapfile permanent, do the following:

4) Open the following file:

sudo nano /etc/fstab

Paste in the following:

/swapfile    none     swap     sw     0     0
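For swap sizes other than the 512MB used in step 2, the count value follows from simple arithmetic: size in bytes divided by block size. The sketch below is only a convenience calculator under stated assumptions: the function names are hypothetical, “MB” is taken to mean 1024 × 1024 bytes as in the example above, and the generated command is printed, not executed.

```python
def dd_block_count(swap_size_mib, block_size_bytes=1024):
    """Number of dd blocks needed for a swap file of the given size."""
    total_bytes = swap_size_mib * 1024 * 1024
    if total_bytes % block_size_bytes != 0:
        raise ValueError("swap size must be a whole number of blocks")
    return total_bytes // block_size_bytes


def dd_command(swap_size_mib, swap_path="/swapfile", block_size_bytes=1024):
    """Build (but do not run) the dd command line used in step 2."""
    count = dd_block_count(swap_size_mib, block_size_bytes)
    return (f"sudo dd if=/dev/zero of={swap_path} "
            f"bs={block_size_bytes} count={count}")


# 512 MiB with 1024-byte blocks is 512 * 1024 blocks, i.e. the '512k' above.
print(dd_block_count(512))
# Command line for a 2 GiB swap file.
print(dd_command(2048))
```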

 

Article by: Dr. Kevin Blighe

Flat files to Databases: For better Speed, Integration and Sharing

In an ordinary dictionary, a word can be sought in two different ways:
  1. Use the index and locate your word of choice, or,
  2. Start with the first word and keep going, one by one until you get there.
Obviously, the first way is the smart way. But when it comes to real-time organised data, most of us prefer the second way, choosing to read (line by line) and write flat files even when the task is repetitive. Relational Database Management Systems (RDBMS) queried through SQL, such as MySQL, OpenSQL, SQLite or PostgreSQL, are well suited for such tasks, yet they are under-used by many bioinformaticians.

 

The use of databases can be intimidating without formal training in database management, but this picture has changed to a great extent with the advent of Object-Relational Mapping (ORM) frameworks. ORMs provide language-specific, object-oriented access to databases, bringing database handling into the comfort zone of the object-oriented language of the user’s choice. For example, in order to access a sequence in the database, one can execute,
required_sequence=protein_sequences.find('P22725')
which issues an SQL command at the back-end:
SELECT * FROM protein_sequences WHERE id='P22725'
Another hassle of database handling is server setup and maintenance. This can be reduced to a great extent by adopting a flexible, server-less and fully embeddable database engine, such as SQLite or BerkeleyDB. The remaining operations of creating, modifying and deleting databases, tables and rows are well taken care of by ORMs. The most popular ORMs include SQLObject (Python), DBIx::Class (Perl) and Hibernate (Java), which are open source and easy to implement.
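To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. It is not SQLObject or any other real ORM: the ProteinSequences class is a hypothetical, hand-rolled stand-in that only shows how a find() call can wrap the SELECT statement quoted above, and the stored sequence is a made-up placeholder, not the real sequence for this accession.

```python
import sqlite3


class ProteinSequences:
    """Tiny hand-rolled stand-in for an ORM table class (hypothetical)."""

    def __init__(self, connection):
        self.connection = connection
        # The primary key gives the table an index on 'id', so lookups
        # do not have to scan every row.
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS protein_sequences "
            "(id TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
        )

    def add(self, accession, sequence):
        self.connection.execute(
            "INSERT INTO protein_sequences (id, sequence) VALUES (?, ?)",
            (accession, sequence),
        )

    def find(self, accession):
        # Equivalent of: SELECT * FROM protein_sequences WHERE id='P22725'
        cursor = self.connection.execute(
            "SELECT id, sequence FROM protein_sequences WHERE id = ?",
            (accession,),
        )
        return cursor.fetchone()  # None when the accession is absent


connection = sqlite3.connect(":memory:")  # server-less, nothing to set up
protein_sequences = ProteinSequences(connection)
protein_sequences.add("P22725", "MKTAYIAK")  # placeholder sequence

required_sequence = protein_sequences.find("P22725")
print(required_sequence)
```

Replacing ":memory:" with a file name would keep the data on disk between runs, still without any server to maintain.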

In the modern era, data is integrated from multiple sources and in complex fashions. This vast amount of information needs to be extracted in a reasonable way and channeled into manageable and biologically meaningful outcomes for medical applications. A database system offers efficient handling of the data and, at the same time, easy access via web applications, making it well suited for scientific data sharing.
    

Python Modules: Expand your reach in Bioinformatics! (Part#2: Hybrid Programming)

A classic question in bioinformatics is: which programming language is the best for a bioinformatician? Discussions like this never end with a conclusive answer. Interestingly, people treat this question as a piece of cake and jump at it with whatever they have in their hands! The result is a nice rainbow of choices, right from “C” to “PHP”!

Each programming language has its own perks and disadvantages. For example, “C” executes incredibly fast, but it is code-intensive even for a simple program. Python and Perl, on the other hand, make the same program code-light but with a mediocre speed of execution. Apart from these performance issues, every language is blessed with a varying degree of third-party modules/libraries.

Python provides interfaces to many system calls and libraries, giving direct access to the shell of an operating system (modules like os and subprocess let you call Unix commands directly from the Python interpreter). Python is also usable as an extension language for applications written in other languages that need easy-to-use scripting or automation interfaces. More than 15 coding projects have been started to establish platforms where Python can be integrated with other programming languages like C, Java, Perl, PHP, R, Fortran etc.
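As a small illustration of calling a Unix command from Python, the sketch below pipes a toy multi-fasta string into grep and counts the header lines. It assumes a Unix-like system with grep on the PATH and Python 3.7 or newer; the two-record input is made up for the example.

```python
import subprocess

# Toy multi-fasta content (made up for this example).
fasta_text = ">Chr1\nATCG\n>Chr2\nGTAC\n"

# Equivalent of: grep -c '^>' , with fasta_text supplied on standard input.
result = subprocess.run(
    ["grep", "-c", "^>"],
    input=fasta_text,
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError if grep exits with an error
)

record_count = int(result.stdout.strip())
print(record_count)
```

Passing the command as a list avoids shell quoting issues, since no shell is involved in running it.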

These hybrid platforms are either available as Python modules, which can be imported just like general modules (numpy, math, random etc.), or accessible from a parent language (e.g. Jython, Python implemented in Java).

A detailed list of these hybrid platforms is accessible from here.

Some fascinating platforms I couldn’t resist mentioning here are:

  • elmer: Elmer allows developers to write code in Python and execute it in C or Tcl.
  • JPype: JPype allows Python programs to fully access Java class libraries.
  • PyPerlish: Allows the usage of Perl idioms in Python.
  • RPy: Simple and efficient access to R from Python.

It is interesting to note that every platform mentioned here was somebody’s dream: shifting to a new language might deliver exciting new features, but at the same time it takes away what you loved the most about the previous one. The following are the words of the creator of PyPerlish:

I’ve used perl for several years, and been very impressed with its ease of use. When you need to do something new, chances are there is an idiom which lets you do it in a few keystrokes. I didn’t want to lose that in moving to python. Somehow I wanted to get the benefits of perl’s idioms with the robust scalability and maintainability of python. So the idea is to emulate perl idioms, no matter how we implement the python code under the covers.”       — Harry George

 

Python Modules: Expand your reach in Bioinformatics! (Part#1: Phyloinformatics)

Python is becoming increasingly popular among bioinformaticians, not just due to its simple yet powerful structure but also due to the third-party modules that add domain-specific advantages. This series is dedicated to compiling such modules for each domain.

In this section, the most popular Python modules in phyloinformatics are introduced.
  

“ETE is a python programming toolkit that assists in the automated manipulation, analysis and visualization of phylogenetic and other type of trees. It provides a wide range of tree handling methods, node annotation features, programmatic access to the phylomeDB database, and automatic orthology and paralogy prediction methods. In addition, an interactive tree visualization program, as well as a highly customizable tree drawing engine, is included.”    — ETE website

ETE examples: Tree with multiple sequence alignment, Bar chart and Pie chart

ETE is very well documented and pretty easy to use. Traversing the tree in different directions (from root to leaves, and leaves to root), adding or removing custom features on an individual node of the tree, creating graphics-rich plots, integrating multiple sequence alignments, evolutionary hypothesis testing and much more can be easily achieved with this module.
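The two traversal directions mentioned above can be pictured with a library-free sketch. This is deliberately not the ETE API: the Node class, the function names and the toy tree are all hypothetical, and plain recursion stands in for what a tree toolkit would provide out of the box.

```python
class Node:
    """Minimal tree node (hypothetical, not an ETE class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def root_to_leaves(node):
    """Preorder traversal: visit a node before its descendants."""
    yield node.name
    for child in node.children:
        yield from root_to_leaves(child)


def leaves_to_root(node):
    """Postorder traversal: visit all descendants before the node itself."""
    for child in node.children:
        yield from leaves_to_root(child)
    yield node.name


# Toy tree, equivalent to the Newick string ((A,B)AB,C)root;
tree = Node("root", [Node("AB", [Node("A"), Node("B")]), Node("C")])

print(list(root_to_leaves(tree)))
print(list(leaves_to_root(tree)))
```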

“DendroPy is a Python library for phylogenetic computing. It provides classes and functions for the simulation, processing, and manipulation of phylogenetic trees and character matrices, and supports the reading and writing of phylogenetic data in a range of formats, such as NEXUS, NEWICK, NeXML, Phylip, FASTA etc. Application scripts for performing some useful phylogenetic operations, such as data conversion and tree posterior distribution summarization, are also distributed and installed as part of the libary. DendroPy can thus function as a stand-alone library, a component of more complex multi-library phyloinformatic pipelines, or as a scripting “glue” that assembles and drives such pipelines.”    — DendroPy Website

Compared to ETE, DendroPy is more focused on the computational aspects of phyloinformatics, which include simulation of birth-death process trees, population genetic trees, coalescent trees etc. DendroPy also allows calculation of general tree statistics like tree length, node age, probability under the coalescent model, tree distances etc. Unlike ETE, DendroPy also supports a variety of character matrices (DNA, RNA, proteins, any continuous/discrete-value data), and it allows Phylogenetic Independent Contrasts (PIC) analysis (as described by Felsenstein 1985) given a tree and a continuous character matrix.
CAUTION: The current release (3.2.0) does not support Python 3.0

The Bio.Phylo module was introduced in BioPython 1.54. This module is simple but covers all the necessary functionality, including parsing/writing various tree formats, displaying trees in different color palettes, search and traversal methods, clade/node-specific information extraction/modification etc. Bio.Phylo also allows integration of third-party applications like PAML for phylogenetic analysis by maximum likelihood. Likewise, BioPython wrappers are also available for PhyML, RAxML and FastTree.
All three modules are well documented, and none replaces the others given their functional differences. There are also a couple of other modules which are highly function-specific and might just fit your requirement list. These are,
    • P4: a python package for phylogenetics
      • For maximum likelihood and Bayesian phylogenetic analysis on molecular sequences
    • Mavric: a python toolkit for phylogenetics
      • Fully interactive editing of phylogenetic trees

Your eLearning Directory of Bioinformatics Essentials!

In the previous section we discussed “What makes you a Good Bioinformatician?” and concluded that one needs to take a slice from each of the three major pillars of bioinformatics, i.e. biology, mathematics and computer science. One needs to gain expertise in any two of them and learn the fundamentals of the third. We also discussed that there is a range of options to choose from within each of the three. Here, I have hand-picked a list of online resources that can provide a PRIMER on some of these options.

Note: This page will be updated from time to time! You might want to save this page for the current listings or bookmark it for further updates to this catalogue.

Biology

Mathematics

Computer Sciences

Cross Platform Courses

This catalog is intentionally kept small, in spite of dozens of other publicly available courses. The intention is NOT to trigger Decision Fatigue, a very common pitfall that arises when the mind is subjected to too many options (as very well described by Rolf Dobelli), but to facilitate the selection.

If you want to give it a try 🙂 visit:
http://www.coursera.org
http://ocw.mit.edu/courses
http://nptel.iitm.ac.in/
http://videolectures.net/
If you find an alternative or better course for the given topics, please use the comment box below. I will update this list whenever possible.

What Makes You a Good Bioinformatician?

Does knowing contemporary data analysis tools and keeping a set of persuasive Unix commands at your fingertips make you a good bioinformatician? Or is it your stunning programming skills, which transform you into a big tool production house, that make you a good bioinformatician?

I recently performed extensive literature mining and considered the opinions of pioneering researchers and scientists in the field of bioinformatics on this topic.

Disclaimer:

  1. If you are an established researcher looking at this from a professional angle, then this article is not for you (although we are going to discuss the foundation here). I would be happy to redirect you to “10 Steps to Success in Bioinformatics”.
  2. If you do not fall into category 1, then you are on the right page.

While mining, I came across various raging discussions about what bioinformaticians are meant to do. Some people were adamant that experts from various fields should collaborate instead of one person doing everything poorly! It is crucial to understand that a bioinformatician differs from a mathematician/statistician, and also from a biologist, in the ability to establish an interface for research and to channel requirements and findings in both directions.

A disparity between a biologist, a mathematician and a bioinformatician!

One thing I observed is that the definition of bioinformatics is not universally agreed upon. Generally speaking, we define it as the development and application of information technology for understanding the problems in biology, most commonly molecular biology (but increasingly in other areas of biology as well). As such, it deals with methods for systematic storage, retrieval and analysis of biological data including but not limited to nucleic acid and protein sequences, their structures, functional roles, regulatory pathways and biophysical interactions.

Some researchers interpret bioinformatics more narrowly and include only those issues dealing with high throughput sequencing data along with the  integration and the analysis of OMICs. A few construe bioinformatics more broadly and include all areas of computational biology, including population modelling and molecular simulations. Others construe bioinformatics as the development of necessary tools and databases for the analysis of biological data to draw meaningful conclusions.

In spite of the diversity of bioinformatics domains, their foundation can be summed up in three major pillars of education:

  1. Biology
  2. Mathematics and Statistics
  3. Computer Science

It is ideal but at the same time ambitious to be able to grasp all the three pillars of bioinformatics. A good bioinformatician would possess a thorough knowledge of any two pillars but AT THE SAME time should also be aware of the fundamentals of the third one.

If you have trouble deciding, then let your INTEREST pick your two pillars, based on which you will develop a bioinformatics flavor, right from sequence analysis to molecular dynamic simulations.

These pillars are themselves very diverse once we dive into them. For example, biology comprises molecular biology, biochemistry, evolution, ecology, behaviour etc. Similarly, in mathematics and statistics you have a range of options, right from probability theory to analytical combinatorics. On the computer science side you have options from programming languages to learning systems.

You can browse through some of these streams and learn more about them here at “Your eLearning Directory of Bioinformatics essentials”.

One becomes a bioinformatics domain expert based on the slice he/she takes from the stack. You may choose molecular biology, differentials and integrals, and Matlab or equivalent, and be a good bioinformatician in the domain of molecular dynamics and simulations. Some people might argue that this is a very naive theory and that one needs to know much more than that. Well, naturally there is no upper limit. If you keep your knowledge stagnant for long, you will soon realise that you have stopped growing. Given the pace at which advancements are happening in biology and technology, it has become vital to upgrade our knowledge, skills and techniques, but at the same time it is absolutely essential to stick to your stream, because success comes from experience and experience comes from dedicated effort.

Your integrity is like a tip of an iceberg. The same tip, from outside may appear smaller than a frozen lake but when it comes to summer, the tip is all that remains since it has a strong foundation which no ordinary summer can melt.

Extracting Specific Fasta record/s from a Multi-fasta File

While dealing with multi-fasta files, it is often necessary to extract a few fasta sequences that contain the keyword/s of interest. One fast way to do this is with awk.

For example:

Input file: hg19_genome.fa

    >Chr1
    ATCTGCTGCTCGGGCTGCTCTAT…
    >Chr2
    GTACGTCGTAGGACATGCATCG…
    >MT1
    TACGATCGATCAGCTCAGCATC…
    >MT2
    CGCCATGGATCAGCTACATGTA…

We would like to extract the sequence for Chr2 from hg19_genome.fa. Use the following command:

$ awk 'BEGIN {RS=">"} /Chr2/ {print ">"$0}' hg19_genome.fa

Output:

    >Chr2
    GTACGTCGTAGGACATGCATCG…

Note that the search keyword (here ‘Chr2’) doesn’t have to be an exact match. If you use ‘MT’ instead, you will get the third and fourth entries, since ‘MT’ is a sub-string of the third and fourth fasta records.

Now let’s break down the command so that we don’t have to memorize it and can mold it for use in a variety of other places.

  • awk: This is the main command (or rather, a very powerful programming language)
  • '…': We write every bit of awk code inside these single quotes
  • BEGIN: This tells awk to execute the code in the immediately following curly brackets once, at the beginning.
  • {RS=">"}: Record separator (if we look at the file, we can observe that every sequence starts with a “>” sign; this helps us separate two fasta records)
  • /Chr2/: Keyword to search for in the entire record
  • {print ">"$0}: Here $0 is the current record (from “Chr2” through the entire sequence, up to the next “>”). We add “>” at the beginning just to restore the standard identifier prefix, which is not included in $0.
  • hg19_genome.fa: This is the input multi-fasta file that we have used.

Now, suppose we are interested in more than one keyword. Two possibilities arise:

You want BOTH keywords present:
awk 'BEGIN {RS=">"} /Chr2/ && /MT/ {print ">"$0}' hg19_genome.fa

You want EITHER keyword present:
awk 'BEGIN {RS=">"} /Chr2|MT/ {print ">"$0}' hg19_genome.fa

Note: If you are using Windows, you can download and install ‘gawk’ and use a similar command. In the zsh shell you might need to use an escape character for | (pipe).
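For readers who would rather stay in Python, the same record-based filtering can be sketched as follows. This is only an approximation of the awk one-liner under stated assumptions: the function name is hypothetical, the input is a small in-memory string with shortened, made-up sequences instead of hg19_genome.fa, and, like the awk version, the pattern is matched against the whole record, not just the header line.

```python
import re


def extract_fasta_records(fasta_text, pattern):
    """Return the fasta records that match the regular expression `pattern`.

    Hypothetical helper mirroring the awk command: records are split
    on '>' and the pattern is searched in the entire record.
    """
    matches = []
    for record in fasta_text.split(">"):
        # The text before the first '>' is empty and is skipped here.
        if record and re.search(pattern, record):
            matches.append(">" + record)  # put back the '>' lost in the split
    return matches


# Shortened stand-in for the multi-fasta file used above.
example = (
    ">Chr1\nATCTGCTG\n"
    ">Chr2\nGTACGTCG\n"
    ">MT1\nTACGATCG\n"
    ">MT2\nCGCCATGG\n"
)

# Single keyword, as in /Chr2/
print("".join(extract_fasta_records(example, "Chr2")), end="")

# EITHER keyword, as in /Chr2|MT/
either = extract_fasta_records(example, "Chr2|MT")
print(len(either))
```

For a real file, the text could be obtained with open("hg19_genome.fa").read(), although for a whole genome a streaming parser would be kinder to memory.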

I am sure many of you might have different flavors to do the same. If you think it is worth sharing then the comment box is all yours.

Happy Coding !!