Practical Bioinformatics: the halting thought process of a working bioinformatician


GC Content

Probably the first actual bioinformatic task a person does is figuring out the GC content of a given sequence.  Why the GC content?

  • Guanine and cytosine are two of the four DNA bases, and they always pair with each other.
  • When paired on a double-stranded molecule of DNA, they share three hydrogen bonds, instead of the two that adenine and thymine share.
  • The higher the proportion of Gs and Cs in a sequence, the higher its resistance to heat.

So it's a useful thing to know, since you can make some guesses about likely environment and characteristics of the sequence, even if you don't know what it is.  So how can you figure GC content?

For a short sequence, you can use an online calculator, like this one, from your buddies at the Dana-Farber Cancer Institute in Boston.  Quick, dirty, free, and not really that useful, unless you're checking PCR primers.  (But you use a modern tool to generate them, right?)

For a longer sequence, the most rewarding method is likely to be learning a little bit of Perl.  If you don't have Perl, get Perl.  You probably want Perl 5.10 or 5.12, unless you can provide a compelling reason.  I'll assume your attempt to install Perl was successful, and you're now editing an empty file.  Into that empty file, you'll insert the following:

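use strict;
use warnings;

my $filename = shift or die "Usage: perl $0 <sequence file>\n";
my $count_nt = 0; ## The total number of nucleotides
my $count_gc = 0; ## The number of Gs and Cs
my $gc; ## The actual GC content

open(my $fh, '<', $filename) or die "Can't open $filename: $!";
while (<$fh>) {
    chomp; ## Drop the newline so it isn't counted
    my @letters = split(//, $_);
    foreach (@letters) {
        next unless $_ =~ /[ACGT]/i; ## Skip anything that isn't a nucleotide
        $count_nt++;
        if ($_ =~ /[GC]/i) { $count_gc++; }
    }
}
close($fh);

die "No nucleotides found in $filename.\n" unless $count_nt;
$gc = $count_gc / $count_nt;
print "The GC content is $gc.\n";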

Save this file as get_gc.pl, and make another file called genome.txt that is filled with a few lines of G, C, A, and T.  Execute the following command:

perl get_gc.pl genome.txt

And that's it: you'll get a number back that is the proportion of G and C within the artificial genome you've created.  It's a data point in itself, or you can run the same basic code on several genomes, and make a chart or histogram.  You can figure out the distribution of GC content over a genus, or maybe the GC content of all of GenBank.  (Why the hell not, right?)
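
If you want to try the several-genomes idea, here's a minimal sketch of how it might look.  Treat it as one possible approach, not the definitive one.  The gc_content subroutine name and the tab-separated output are my own choices for illustration, and it assumes each file is nothing but lines of sequence letters (no headers, no annotations).  It also swaps the letter-by-letter loop for Perl's tr operator, which counts matching characters without changing the line.

```perl
use strict;
use warnings;

# Return the proportion of G and C in one plain-text sequence file,
# or undef if the file contains no A, C, G, or T at all.
# (gc_content is a hypothetical helper name, not part of any library.)
sub gc_content {
    my ($filename) = @_;
    my ($count_nt, $count_gc) = (0, 0);
    open(my $fh, '<', $filename) or die "Can't open $filename: $!";
    while (my $line = <$fh>) {
        # tr with an empty replacement list counts characters in place
        $count_nt += ($line =~ tr/ACGTacgt//);
        $count_gc += ($line =~ tr/GCgc//);
    }
    close($fh);
    return $count_nt ? $count_gc / $count_nt : undef;
}

# One line of output per file named on the command line: name, tab, GC proportion
for my $filename (@ARGV) {
    my $gc = gc_content($filename);
    printf "%s\t%s\n", $filename, defined $gc ? sprintf('%.4f', $gc) : 'NA';
}
```

Saved under a name of your choosing (say, gc_many.pl, which is just a placeholder), you'd run it as perl gc_many.pl genome1.txt genome2.txt and paste the resulting columns into whatever you use to draw histograms.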

GC content is one of many different tools we have to examine the genome of an organism.  There are many other ways to calculate it, and I don't pretend this script is the most idiomatic or compact way to make it happen.  I do hope that it's readable and semi-useful.  As we continue on, we'll get more into the formats you're likely to encounter and the techniques I've used to handle them.

Filed under: BNFO101

The Central Dogma

So a lot of what I do depends on some basic understanding of molecular biology.  While an in-depth knowledge is desirable, this is one of those explanations you can probably just refer back to when confused about terms.  It does require some terminology, but it's not too bad.

The Central Dogma of Molecular Biology is a big idea.  So big, in fact, that Nobel prizes have been awarded for work on it, and all of modern biology depends on it.  It's a theory, meaning it's an internally consistent, cohesive set of explanations we use to account for a wide variety of phenomena.  It's also treated as a law, meaning we base a field of study on it, and the broad truth of the topic is no longer actively debated.  We still fight each other tooth and nail about little details, but the consensus of the scientific community is this:

DNA -> RNA -> Protein

DNA is transcribed into RNA, which is translated into protein.

Since DNA and RNA are nucleic acids, they store data very well, but don't perform work very well.  Proteins are strong and functional, but very hard to replicate.  Modern organisms and many viruses use DNA as the source of genetic information.  They transcribe that DNA into an RNA template, which either goes off into the cell to be used as-is, or is translated into a protein, which can perform sophisticated functions.

There are lots of exceptions to the rules, like reverse transcriptases, which turn an RNA template into DNA, or prions, which are self-replicating proteins.  (Kind of.)  The exceptions, though, are a tiny fraction of the overall diversity.  The overwhelming majority of everything that has ever lived follows the procession of DNA -> RNA -> Protein.

This process is highly conserved, meaning it operates much the same whether you're talking about a virus in a sulphurous hot spring, or a cat.  This is one of the major arguments we use to explain why evolution is the paradigm through which we view all biology.  So far, everything we've ever found uses essentially the same genetic code, with a bit of tweaking by domain.  The DNA is arranged into 3-letter groups called codons, which are transcribed into a complementary RNA sequence.  Each codon in that sequence determines which amino acid is added to a growing chain of amino acids, which becomes a protein when complete.  There are start codons and stop codons that initiate and halt translation.  One interesting artifact is that while the start codons can vary slightly between domains of life, the stop codons almost never do.  A stop is a stop is a stop, and they're mundane enough that in my line of work, we don't really even differentiate between them that much.
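
To make the codon business a little more concrete, here's a rough sketch of DNA -> RNA -> Protein in Perl.  It's an illustration under a few assumptions, not how any real tool does it: the transcribe and translate subroutine names are mine, the table is the standard genetic code only, the input is treated as the coding strand (so transcription is just swapping T for U), and translation starts at the first AUG it finds and stops at the first stop codon.

```perl
use strict;
use warnings;

# Build the standard genetic code table.  The amino acid string lists
# all 64 codons in U, C, A, G order for each position; '*' marks a stop.
my @bases = qw(U C A G);
my @amino = split //,
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = shift @amino;
        }
    }
}

# Transcribe a coding-strand DNA string into RNA (T becomes U).
sub transcribe {
    my ($dna) = @_;
    (my $rna = uc $dna) =~ tr/T/U/;
    return $rna;
}

# Translate RNA into one-letter amino acids, from the first AUG
# up to (but not including) the first stop codon in that frame.
sub translate {
    my ($rna) = @_;
    my $start = index($rna, 'AUG');
    return '' if $start < 0;    # no start codon, no protein
    my $protein = '';
    for (my $i = $start; $i + 3 <= length $rna; $i += 3) {
        my $aa = $codon_table{ substr($rna, $i, 3) } // 'X';  # X = unknown codon
        last if $aa eq '*';
        $protein .= $aa;
    }
    return $protein;
}

my $dna = 'ATGGCCAAATGA';       # made-up example: start, Ala, Lys, stop
my $rna = transcribe($dna);
print "RNA: $rna\n";
print "Protein: ", translate($rna), "\n";
```

Notice that all three stop codons collapse into the same '*' in the table, which is about as much attention as they usually get.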

Most of my day to day work is examination of genetic code as it exists in an organism, and seeing if there is biological meaning I can pull out of the string of letters.  There are only four nucleotides, and an alphabet with only four letters is at first blush not very interesting.  Parsing out the meaning hidden in the letters is the science of Genomics, one of the most current and exciting fields of study within Bioinformatics.  I hope to help explain some of the tools we use to do this work, and illuminate the theory that underpins our conclusions.  Look for the BNFO101 Category of posts for more background information.

Filed under: BNFO101