Practical Bioinformatics: the halting thought process of a working bioinformatician


GC Content

Probably the first actual bioinformatic task a person does is figuring out the GC content of a given sequence.  Why the GC content?

  • Guanine (G) and cytosine (C) are two of the four DNA bases, and they always pair with each other.
  • When paired on a double-stranded molecule of DNA, they share three hydrogen bonds, instead of the two shared by an A-T pair.
  • The higher the proportion of Gs and Cs in a sequence, the higher its resistance to heat.

So it's a useful thing to know, since you can make some guesses about likely environment and characteristics of the sequence, even if you don't know what it is.  So how can you figure GC content?

For a short sequence, you can use an online calculator, like this, from your buddies at the Dana-Farber Cancer Institute.  Quick, dirty, free, and not really that useful, unless you're checking PCR primers.  (But you use a modern tool to generate them, right?)
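To put a rough number on the heat-resistance point for something primer-sized, here's a small sketch using the Wallace rule, a common rule of thumb for short oligos (roughly 14 to 20 bases): count 2 degrees C for every A or T and 4 degrees C for every G or C.  The sub name wallace_tm and the two example sequences are made up for illustration, and this isn't necessarily what the calculator above does.

```perl
use strict;
use warnings;

## Rough melting temperature (degrees C) of a short oligo by the
## Wallace rule: 2 per A or T, 4 per G or C.
## wallace_tm is a hypothetical name, used here for illustration only.
sub wallace_tm {
    my ($seq) = @_;
    my $at = ($seq =~ tr/ATat//); ## tr/// with an empty replacement just counts
    my $gc = ($seq =~ tr/GCgc//);
    return 2 * $at + 4 * $gc;
}

print wallace_tm("ATATATATATATATAT"), "\n"; ## 16 bases, all A/T: prints 32
print wallace_tm("GCGCGCGCGCGCGCGC"), "\n"; ## 16 bases, all G/C: prints 64
```

Same length, double the estimate: that's the three-hydrogen-bond effect in action.  Treat it as a ballpark only; it's not meant for long sequences.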

For a longer sequence, the most rewarding method is likely to be learning a little bit of Perl.  If you don't have Perl, get Perl.  You probably want Perl 5.10 or 5.12, unless you can provide a compelling reason.  I'll assume your attempt to install Perl was successful, and you're now editing an empty file.  Into that empty file, you'll insert the following:

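use strict;
use warnings;
my $filename = shift or die "Usage: perl get_gc.pl <sequence file>\n";
my $count_nt = 0; ## The total number of nucleotides
my $count_gc = 0; ## The number of Gs and Cs
my $gc; ## The actual GC content
open(my $fh, '<', $filename) or die $!;
while(<$fh>) {
    chomp; ## Drop the newline so it isn't counted as a nucleotide
    my @letters = split(//,$_);
    foreach(@letters) {
        $count_nt++; ## Every letter counts toward the total
        if($_ =~ /G|g|C|c/) {$count_gc++;}
    }
}
close($fh);
die "No nucleotides found in $filename\n" unless $count_nt;
$gc = $count_gc/$count_nt;
print "The GC content is $gc.\n";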

Save this file as get_gc.pl, and make another file called genome.txt that is filled with a few lines of G, C, A, and T.  Execute the following command:

perl get_gc.pl genome.txt

And that's it: you'll get a number back that is the proportion of G and C within the artificial genome you've created.  It's a data point in itself, or you can run the same basic code on several genomes, and make a chart or histogram.  You can figure out the distribution of GC content over a genus, or maybe the entire GC content of GenBank.  (Why the hell not, right?)
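If you want to try the several-genomes idea, here's one way it might look.  This is a sketch, not gospel: it assumes each file is plain sequence text like genome.txt (no headers or other formatting), it only counts A, C, G, and T (so Ns and other ambiguity codes are ignored), and the sub name gc_of_file is my own invention.  It counts with tr///, which is a more compact way to tally characters than splitting each line into letters.

```perl
use strict;
use warnings;

## Return the GC proportion of one plain-text sequence file,
## or undef if the file contains no A, C, G, or T at all.
## gc_of_file is a hypothetical helper name, not from any library.
sub gc_of_file {
    my ($filename) = @_;
    my ($count_nt, $count_gc) = (0, 0);
    open(my $fh, '<', $filename) or die "Can't open $filename: $!";
    while (my $line = <$fh>) {
        $count_nt += ($line =~ tr/ACGTacgt//); ## Only real nucleotides
        $count_gc += ($line =~ tr/GCgc//);     ## Just the Gs and Cs
    }
    close($fh);
    return $count_nt ? $count_gc / $count_nt : undef;
}

## Print one tab-separated line per file named on the command line.
foreach my $filename (@ARGV) {
    my $gc = gc_of_file($filename);
    print join("\t", $filename, defined $gc ? sprintf("%.4f", $gc) : "NA"), "\n";
}
```

Saved as, say, gc_table.pl (another made-up name), you'd run it as perl gc_table.pl genome1.txt genome2.txt and get a two-column table you can paste into whatever you use to draw histograms.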

GC content is one of many different tools we have to examine the genome of an organism.  There are many other ways to calculate it, and I don't pretend this script is the most idiomatic or compact way to make it happen.  I do hope that it's readable and semi-useful.  As we continue on, we'll get more into the formats you're likely to encounter and the techniques I've used to handle them.

Filed under: BNFO101