Practical Bioinformatics: the halting thought process of a working bioinformatician

2 Jul 2012

Things I’m working on today

Amazon AWS

I'm using Amazon AWS for so many things these days.  I use EC2 to host a MIRA Assembler image; it'll assemble anything, fast and cheap.  Today's task is just an EC2 instance for users to RDP into to deploy Acrobat X.  I know, it's not very BNFO, but if you can run Acrobat on it, you can run ClustalX or MAST on it.

Apache Solr

I don't know much about Solr yet, except that it's Apache's search engine.  It's available as a Tomcat servlet, but I'm trying to spin it up using the Blaze Appliance by initcron.  I'm very much into appliances these days.  Even on my desktop machine (a 6-core Phenom II), the hardware virtualization extensions let VirtualBox run guests without a big performance penalty.  I like the idea of Amazon's CloudSearch, but I'm building such a small index that it's probably not worth paying for.  Solr looks like a good way to index and search material on the NFS and CIFS shares in the lab.

FreeNAS 8.0.4-p3

I do occasionally sully my hands with non-cloud storage.  My current in-lab NAS is a Western Digital ShareSpace.  It's a 4 TB unit, which is plenty of space even in RAID 5, but the throughput is not fast at all: it maxes out around 12 MB/s on reads over NFS and about 4 MB/s on writes.  Since I had an older IBM RAID server with 12 SAS drives in it, I figured I'd use FreeNAS on it to replace the ShareSpace.  It works great, ZFS is an excellent filesystem, and the snapshots mesh nicely with Amazon S3.
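
For the curious, the snapshot half of that is only a couple of commands.  The dataset and bucket names below are invented, and s3cmd is just one of several ways to push the archive up to S3:

zfs snapshot tank/lab@2012-07-02                      # an instant, point-in-time copy of the dataset
zfs send tank/lab@2012-07-02 | gzip > /tmp/lab-2012-07-02.zfs.gz
s3cmd put /tmp/lab-2012-07-02.zfs.gz s3://my-lab-backups/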

The problem I encountered was in moving the data between the two systems.  I mounted the old and new filesystems, then tried to copy the contents over:

cp --force --preserve --recursive /mnt/old/* /mnt/new/

I kept getting prompts to confirm each overwrite, even though I specified --force.  It turns out that, at least in CentOS 6.2, 'cp' is aliased to 'cp -i' in an attempt to keep you from clobbering things unintentionally.  The easy fix?

\cp --force --preserve --recursive /mnt/old/* /mnt/new/

The backslash prefix bypasses the alias without resorting to 'unalias'.  Now things copy smoothly, and the throughput is amazing: I've seen write speeds of up to 50 MB/s, and reads are faster still.
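
If you'd rather see what you're fighting first, any of these gets you the same result; the unalias only lasts for your current shell:

alias cp                                                          # shows the culprit: alias cp='cp -i'
command cp --force --preserve --recursive /mnt/old/* /mnt/new/    # 'command' also skips the alias
unalias cp                                                        # or drop the alias for this session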

2 Jul 2012

Wow, it’s been a while.

Since I last updated, I've finished out my Master's, started full-time work at my current employer, and had a little girl.  Hopefully, I can be more productive now.

Thanks for your patience.

13 Feb 2011

The Record Separator

I teach people Perl.  And I enjoy it very much, since it makes me think about Perl from a beginner's perspective.  Imagine you didn't have the advantage of CPAN and BioPerl.  As a bioinformatician, you accumulate a lot of FASTA files, which come in the format:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLP

You have to find a way to read this file in your Perl script.  You know the usual suspects: open, filehandles, and the while(<$fh>) construction.  But Perl's default behavior is to handle input one line at a time, leading to loops that sometimes have to deal with the results of the last iteration before doing new work.  Modules and BioPerl are widely available to just read the damn thing in, but remember, you're a new Perl programmer.

Enter the record separator, $/.  It lets you change that default line-by-line behavior to something else.  If Perl is reading line by line, that means it's breaking the input up on "\n" newline characters.  But it doesn't have to be that way.  Try this statement in your declarations:

local $/ = "\n>";

Suddenly, your input is broken up into whatever sits between > symbols that begin a line.  Instead of feeding in sequences line by line, you're reading them in one whole sequence at a time.  You can use your Perl arsenal to handle whatever operations you need on each sequence, and deal with them in a simple, procedural way.

And because you used local rather than a plain assignment, the change is scoped: when the enclosing block (or the script itself) ends, $/ goes back to its normal value.

Some words of caution: if you make this declaration near the top of your script, it changes some things you can't see, most notably the behavior of chomp.  chomp removes the current value of $/ from the end of a string, so it now strips a trailing "\n>" rather than a newline.  And every record after the first arrives without its leading >, because that > was swallowed as part of the previous record's separator.  Be aware of both.
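
Putting it all together, here's a minimal sketch of a FASTA reader built on nothing but $/; splitting each record on newlines to separate the header from the sequence is just one way to do it:

#!/usr/bin/perl
use strict;
use warnings;

my $filename = shift;
open(my $fh, '<', $filename) or die $!;

{
    local $/ = "\n>";                    ## records are now whole FASTA entries
    while (my $record = <$fh>) {
        chomp $record;                   ## strips a trailing "\n>", not a newline
        $record =~ s/^>//;               ## only the first record keeps its leading >
        my ($header, @seq_lines) = split /\n/, $record;
        my $sequence = join '', @seq_lines;
        print "$header: ", length($sequence), " residues\n";
    }
}                                        ## the block ends, and $/ is "\n" again

close $fh;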

Further Reading:

http://www.perl.com/pub/2004/06/18/variables.html

http://perldoc.perl.org/perlvar.html

Filed under: Perl
27 Dec 2010

GC Content

Probably the first actual bioinformatic task a person does is figuring out the GC content of a given sequence.  Why the GC content?

  • Guanine and Cytosine are two of the four nucleotides, and in double-stranded DNA they always pair with each other.
  • When paired, they share three hydrogen bonds instead of the two that Adenine and Thymine share.
  • The higher the proportion of Gs and Cs in a sequence, the more heat it takes to pull the two strands apart.

It's a useful thing to know, since you can make some guesses about the likely environment and characteristics of a sequence even if you don't know what it is.  So how do you figure GC content?

For a short sequence, you can use an online calculator, like the one from your buddies at the Dana-Farber Cancer Institute.  Quick, dirty, free, and not really that useful unless you're checking PCR primers.  (But you use a modern tool to generate those, right?)

For a longer sequence, the most rewarding method is probably to learn a little Perl.  If you don't have Perl, get Perl.  You probably want Perl 5.10 or 5.12, unless you can provide a compelling reason otherwise.  I'll assume your attempt to install Perl was successful and that you're now editing an empty file.  Into that empty file, insert the following:

#!/usr/bin/perl
use strict;
use warnings;

my $filename = shift;
my $count_nt = 0;    ## The total number of nucleotides
my $count_gc = 0;    ## The number of Gs and Cs
my $gc;              ## The actual GC content

open(my $fh, '<', $filename) or die $!;
while (<$fh>) {
    chomp;                                   ## don't count the newline as a nucleotide
    my @letters = split //, $_;
    foreach (@letters) {
        if ($_ =~ /[GCgc]/) { $count_gc++; }
        $count_nt++;
    }
}
close $fh;

$gc = $count_gc / $count_nt;
print "The GC content is $gc.\n";

Save this file as get_gc.pl, and make another file called genome.txt filled with a few lines of G, C, A, and T.  Then execute the following command:

perl get_gc.pl genome.txt

And that's it: you'll get back a number that is the proportion of G and C within the artificial genome you've created.  It's a data point in itself, or you can run the same basic code on several genomes and make a chart or histogram.  You can figure out the distribution of GC content over a genus, or maybe the GC content of all of GenBank.  (Why the hell not, right?)
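
And if you have a directory full of them, a one-line shell loop runs the script over each file (the genomes/*.txt path is just an example):

for f in genomes/*.txt; do echo -n "$f: "; perl get_gc.pl "$f"; done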

GC content is one of many different tools we have to examine the genome of an organism.  There are many other ways to do this, and I don't pretend it's the most idiomatic or compact way to make it happen.  I do hope that it's readable and semi-useful.  As we continue on, we'll get more into the formats you're likely to encounter and the techniques I've used to handle them.

Filed under: BNFO101
26 Dec 2010

The Central Dogma

So a lot of what I do depends on some basic understanding of molecular biology.  While an in-depth knowledge is desirable, this is one of those explanations you can probably just refer back to when confused about terms.  It does require some terminology, but it's not too bad.

The Central Dogma of Molecular Biology is a big idea.  So big, in fact, that Nobel Prizes have been awarded on the topic, and all of modern biology depends on it.  It's a theory, meaning it's an internally consistent, cohesive set of explanations we use to account for a wide variety of phenomena.  It's also a law, meaning we base a field of study on it, and the broad truth of the matter is no longer actively debated.  We still fight each other tooth and nail about the little details, but the consensus of the scientific community is this:

DNA -> RNA -> Protein

DNA is transcribed into RNA, which is translated into protein.

Since DNA and RNA are nucleic acids, they store data very well but don't perform work very well.  Proteins are strong and functional, but very hard to replicate.  Modern organisms and most viruses use DNA as the source of genetic information.  They transcribe that DNA into an RNA template, which either goes off into the cell to be used as-is or is translated into a protein that can perform sophisticated functions.

There are lots of exceptions to the rules, like reverse transcriptases, which turn an RNA template into DNA, or Prions, which are self-replicating proteins.  (Kind of.)  The exceptions, though, are a tiny tiny fraction of the overall diversity.  The overwhelming majority of everything that has ever lived follows the procession of DNA -> RNA -> Protein.

This process is highly conserved, meaning it operates in much the same way whether you're talking about a virus in a sulphurous hot spring or a cat.  This is one of the major arguments for evolution being the paradigm through which we view all of biology.  So far, everything we've ever found uses the same genetic code, with a bit of tweaking by domain.  The DNA is arranged into three-letter groups called codons, which are transcribed into a complementary RNA sequence.  Each codon determines which amino acid is added to a growing chain of amino acids, and that chain becomes a protein when complete.  There are start codons and stop codons that initiate and halt translation.  One interesting artifact is that while the start codons can vary slightly between domains of life, the stop codons do not.  A stop is a stop is a stop, and they're mundane enough that in my line of work we don't really differentiate between them much.
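
To make that flow concrete, here's a toy Perl sketch.  The sequence is invented, the codon table carries only the handful of entries the example needs, and the T-to-U swap is a shortcut for transcription rather than the real template-strand mechanism:

#!/usr/bin/perl
use strict;
use warnings;

## A tiny slice of the standard codon table; a real script would carry all 64 codons.
my %codon_table = (
    AUG => 'M',                              ## the usual start codon, also methionine
    UUU => 'F', GGA => 'G', CAU => 'H',
    UAA => '*', UAG => '*', UGA => '*',      ## the three stop codons
);

my $dna = 'ATGTTTGGACATTAA';                 ## coding strand of a made-up gene
(my $mrna = $dna) =~ tr/T/U/;                ## "transcription": swap T for U

my $protein = '';
for (my $i = 0; $i + 2 < length $mrna; $i += 3) {
    my $codon = substr($mrna, $i, 3);
    my $aa    = $codon_table{$codon} // 'X'; ## X for anything missing from the toy table
    last if $aa eq '*';                      ## a stop is a stop is a stop
    $protein .= $aa;
}
print "$dna -> $mrna -> $protein\n";         ## ATGTTTGGACATTAA -> AUGUUUGGACAUUAA -> MFGH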

Most of my day-to-day work is examining genetic code as it exists in an organism and seeing if there is biological meaning I can pull out of the string of letters.  There are only four nucleotides, and an alphabet with only four letters is, at first blush, not very interesting.  Parsing out the meaning hidden in those letters is the science of Genomics, one of the most active and exciting fields of study within Bioinformatics.  I hope to help explain some of the tools we use to do this work, and to illuminate the theory that underpins our conclusions.  Look for the BNFO101 category of posts for more background information.

Filed under: BNFO101
24 Dec 2010

Who am I? / Bio-what?

My name is Joe.  I'm a 30-year-old Bioinformatics (BNFO) graduate student at a large state university.  I'm very active within my academic program, and I am (or have been) a teaching assistant for several of our undergraduate classes.  I've worked with our freshman Phage Discovery Lab, a Howard Hughes Medical Institute-funded, year-long freshman class that lets new scientists isolate a new soil virus and perform the bioinformatics that illuminate a novel genome.

I also work with our senior seminar course, providing Perl and BNFO tool assistance and helping our students think about biological problems programmatically.  Some of that work is remedial, making up for programming classes that were just skimmed through, and some of it is evolutionary, trying to bring new techniques like Modern Perl, cloud computing, and web-enabled bioinformatics into a classroom environment.  Sometimes it works out great; sometimes I fall on my face.

A lot of what I do is explain what my field is to people who may not be familiar.  My pat answer is that "we beg, borrow, or steal any technique we can to make sense of biological data."  When pressed, I talk about the need for biological knowledge, implemented with computational and informatic techniques.  And if met with quizzical looks, I try to relate it back to the Human Genome Project, and other high-profile efforts.

Mostly, I try to do good work.  I'm still learning about a lot of different aspects of this work, and there's a lot I don't know.  I hope this blog is a useful exercise in that process.