The Record Separator
I teach people Perl. And I enjoy it very much, since it makes me think about Perl from a beginner's perspective. Imagine you didn't have the advantage of CPAN and BioPerl. As a bioinformatician, you accumulate a lot of FASTA files, which come in the format:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLP
You have to find a way to read in this file in your Perl script. You know the usual suspects, open, filehandles, and the while(<$fh>) construction. But the default behavior of Perl is to handle things one line at a time, leading to loops that sometimes have to deal with the results of the last iteration before doing new work. Modules and BioPerl are widely available to just read the damn thing in, but remember, you’re a new Perl programmer.
Enter the record separator, ‘$/’ The record separator allows you to change that default line-by-line behavior to something else. If it’s reading in line by line, that means Perl is breaking up the input by “\n” newline characters. But it doesn’t have to be that way. Try using this statement in your declarations:
local $/ = ‘\n>’;
Suddenly, your input is broken up into whatever’s between the > symbols, when they start a new line. Now, instead of feeding in sequences line-by-line, you’re reading them in one sequence at a time. You can use your Perl arsenal to handle any operations you need on the sequence, and deal with them in a simple, procedural way.
At the end of your script, everything is returned to normal.
Some words of caution: If you declare this at the beginning of your file, it changes some things you can’t see, namely the behavior of chomp. Now, chomp will remove a leading > from your lines, so be aware of it.
Further Reading:
GC Content
Probably the first actual bioinformatic task a person does is figuring out the GC content of a given sequence. Why the GC content?
- Guanine and Cytosine are two of the nucleotides, and they're always paired.
- When paired on a double-stranded molecule of DNA, they share three hydrogen bonds, instead of the usual two.
- The higher the proportion of Gs and Cs in a sequence, the higher its resistance to heat.
So it's a useful thing to know, since you can make some guesses about likely environment and characteristics of the sequence, even if you don't know what it is. So how can you figure GC content?
For a short sequence, you can use an online calculator, like this, from your buddies at the Dana-Farber Cancer Institute at MIT. Quick, dirty, free, and not really that useful, unless you're checking PCR primers. (But you use a modern tool to generate them, right?)
For a longer sequence, the most rewarding method is likely to be learning a little bit of Perl. If you don't have Perl, get Perl. You probably want Perl 5.10 or 5.12, unless you can provide a compelling reason. I'll assume your attempt to install Perl was successful, and you're now editing an empty file. Into that empty file, you'll insert the following:
#!/usr/bin/perluse strict;use warnings;my $filename = shift;my $count_nt; ## The total number of nucleotidesmy $count_gc; ## The number of Gs and Csmy $gc; ## The actual GC contentopen(my $fh, '<', $filename) or die $!;while(<$fh>) {my @letters = split(//,$_);foreach(@letters) {if($_ =~ /G|g|C|c/) {$count_gc++;}$count_nt++;}}$gc = $count_gc/$count_nt;print "The GC content is $gc.\n";
Save this file as get_gc,pl, and make another file called genome.txt that is filled with a few lines of G, C, A, and T. Execute the following command:
perl get_gc.pl genome.txt
And that's it, you'll get a number back that is the proportion of G and C within the artificial genome you've created. It's a data point in itself, or you can run the same basic code on several genomes, and make a chart or histogram. You can figure out the distribution of GC content over a genus, or maybe the entire GC content of Genbank. (Why the hell not, right?)
GC content is one of many different tools we have to examine the genome of an organism. There are many other ways to do this, and I don't pretend it's the most idiomatic or compact way to make it happen. I do hope that it's readable and semi-useful. As we continue on, we'll get more into the formats you're likely to encounter and the techniques I've used to handle them.
The Central Dogma
So a lot of what I do depends on some basic understanding of molecular biology. While an in-depth knowledge is desirable, this is one of those explanations you can probably just refer back to when confused about terms. It does require some terminology, but it's not too bad.
The Central Dogma of Molecular Biology is a big idea. So big, in fact, that Nobel prizes have been awarded on its topic, and all of modern biology depends on it. It's a theory, meaning it's an internally-consistent, cohesive set of explanations we use to explain a wide variety of phenomena. It's also a law, meaning we base a field of study on it, and the broad truth of the topic is no longer actively debated. We still fight each other tooth and nail about little details, but the consensus of the scientific community is this:
DNA is transcribed into RNA is translated into Proteins.
Since DNA and RNA are nucleic acids, they store data very well, but don't perform work very well. Proteins are strong and functional, but very hard to replicate. Modern organisms and most viruses use DNA as the source of genetic information. They transcribe that DNA into an RNA template, which either goes off into the cell to be used as-is, or it can be translated into a protein, which can perform sophisticated functions.
There are lots of exceptions to the rules, like reverse transcriptases, which turn an RNA template into DNA, or Prions, which are self-replicating proteins. (Kind of.) The exceptions, though, are a tiny tiny fraction of the overall diversity. The overwhelming majority of everything that has ever lived follows the procession of DNA -> RNA -> Protein.
This process is highly conserved, meaning it operates much the same whether you're talking about a virus in a sulphurous hot spring, or a cat. This is one of the major arguments we use to explain why evolution is the paradigm we view all biology with. So far, everything we've ever found uses the same genetic code, with a bit of tweaking by domain. The DNA is arranged into 3-letter groups called codons, which are transcribed into a complementary RNA sequence. That sequence determines which amino acid is added to a growing chain of amino acids, which become a protein when complete. There are start codons and stop codons that initiate and halt translation. One interesting artifact is that while the start codons can vary slightly between domains of life, the stop codons do not. A stop is a stop is a stop, and they're mundane enough that in my line of work, we don't really even differentiate between them that much.
Most of my day to day work is examination of genetic code as it exists in an organism, and seeing if there is biological meaning I can pull out of the string of letters. There are only four nucleotides, and an alphabet with only four letters is at first blush not very interesting. Parsing out the meaning hidden in the letters is the science of Genomics, one of the most current and exciting fields of study within Bioinformatics. I hope to help explain some of the tools we use to do this work, and illuminate the theory that underpins our conclusions. Look for the BNFO101 Category of posts for more background information.
Who am I? / Bio-what?
My name is Joe. I'm a 30-year old Bioinformatics (BNFO) graduate student at a large state university. I'm very active within my academic program, and am/have been a teaching assistant for several of our undergraduate classes. I've worked with our freshman Phage Discovery Lab, a Howard Hughes Medical Institute-funded year-long freshman class, which lets new scientists isolate a new soil virus, and perform bioinformatics that illuminate a novel genome.
I also work with our senior seminar course, providing Perl and BNFO tool assistance, and helping our students to think about biological problems programatically. Some of that work is remedial, making up for programming classes that were just skimmed through, and some of it is evolutionary, trying to implement new techniques like Modern Perl, cloud computing, and web-enabled bioinformatics in a classroom environment. Sometimes it works out great, sometimes I fall on my face.
A lot of what I do is explain what my field is to people who may not be familiar. My pat answer is that "we beg, borrow, or steal any technique we can to make sense of biological data." When pressed, I talk about the need for biological knowledge, implemented with computational and informatic techniques. And if met with quizzical looks, I try to relate it back to the Human Genome Project, and other high-profile efforts.
Mostly, I try to do good work. I'm still learning about a lot of different aspects of this work, and there's a lot I don't know. I hope this blog is a useful exercise in that process.
