Protein and Nucleic Acid Alignments and Motif/Homology Searches using GCG

April 2000

By James Carothers

DISCLAIMER: There are essentially an unlimited number of ways to perform sequence alignments and motif/homology searches. Although this protocol will explain basic aspects of the GCG implementations of the most popular algorithms, you should be aware that other (sometimes much better!) ones do exist.

Strategic Planning

It is important to keep in mind that there is no one "right" alignment. Motifs are essentially what you want them to be (a single nucleotide could constitute a motif, for example). Homology is a somewhat subjective term. Choice of algorithm and algorithm parameters must be considered in terms of the desired answer (I will try to give examples so this will become clearer). Therefore, you should first decide how you want to use the data; for secondary structure prediction, pseudo-phylogenetic analysis, etc. Then, try to choose the tool that will give the best result as you've defined it. I give brief descriptions of the theoretical underpinnings of some of the GCG tools below. The local GCG manual pages have more information (http://helix.mgh.harvard.edu). Biological Sequence Analysis is an excellent text and is highly recommended for anyone wanting more thorough explanations.

Basic Protocol

1. Getting your sequences into GCG

You should first FTP your sequences to your genetics account using either the Fetch program for Macintosh, WSFTP for PC, or the Unix FTP commands from the genetics command line itself.

To FTP files using the Unix FTP commands, log into genetics as usual. Make a directory for your sequences files by typing mkdir [directory name]. (To remove this directory later, type rmdir [directory name]. Type cd [directory name] to change to the specified directory. Type FTP [hostname] to connect to the remote host where the desired sequences reside (e.g. ftp bpf.med.harvard.edu will connect to the BioPolymer Facility server). After logging in according to the prompts for username and password, use the ls command to list the contents of the current directory. cd to the directory containing your files. Type mget [filename] to transfer the sequence file to your current genetics directory. You can transfer all the files by typing mget *. When you are done transfering the files, typing quit will close the connection to the remote FTP server.

Once you have the files in a genetics directory, type GCG at a genetics > prompt. This loads the Genetics Computing Group package. Typing seqlab will open the GCG graphical interface. The first screen will contain the "MAIN LIST". Eventually it will have a listing of your sequences (after they have been saved as .rsf files). Now, however, it will probably be blank. Under the mode tab, select "Editor". This is main GCG interface for working with sequences. Load your sequences by selecting import under the file heading. Choose the appropriate directory and file names as usual, click OK. If you have problems loading your sequences (a typical error message is that the file contains no sequence information), this means that either the file name or the file format is unreadable. First check to see if the file name contains any spaces; Unix systems cannot handle these very well. If so, go back to the genetics> prompt by either closing seqlab or by opening another terminal window. cd to the appropriate directory and try renaming the file with the command mv '[old file name with spaces]' to [new_file_name_with_no_ spaces]. Notice the ' ' surrounding the old file name. The easiest modification is to replace all the spaces with underscores _. Return to seqlab and try opening the sequence file again. If it still doesn't open, the files are probably not in a readable format. seqlab is somewhat finicky in this regard. It is relatively easy to alter this with a text editor (simpletext on the Mac, vi or emacs on a Unix machine) by including a "carrot" followed by a sequence title at the beginning of the first line (e.g. >sequence #1). This is the FASTA format and is seqlab-friendly. If you are analyzing a large number of sequences you may want to automate renaming and FASTA formatting. Jonathan Urbach and I (James) have written a couple of short perl scripts that will do both (see below).

2. Analyzing your sequences

The seqlab interface, while preferable to working from the command line, does not always respond intuitively. You should play around with it to familiarize yourself with its idiosyncracies before you try to do something time sensitive.

The function heading contains all of the pertinent algorithms. To run any of them you can either select the sequences you want to analyze by highlighting or by choosing to assay all of them once the particular function window is opened.

For pairwise alignments there are two choices; Gap which performs a global alignment and Best-fit which can do local alignments.

Gap uses the Needle-Wunsch algorithm which is a dynamic programming scheme. It seeks to maximize the number of matches and minimize the number of gaps. Accordingly, the gap creation and extension penalties are the most important parameters. Because it is a global alignment it works best with a pair of sequences which are identical, or nearly identical in length (e.g. the results of a doped selection). Differences in length can be partially accommodated by reducing the gap extension penalty (lower numbers mean less penalty).

Note that the scoring matrices used for all of these programs may not be as accurate for in vitro selected proteins and nucleic acids as they are for "real" ones. This is because the substitution probabilities (of a G for an A or the replacement of Ile for Leu, for example) garnered from biological data may not be accurate under non-biological selection pressures. This would seem to be especially true of structured RNA molecules. You wouldn't want to penalize a G to A substitution where the pairing partner is a U as much as when the pairing partner is a C. The good news is that it may not make a very big difference AND you can change the values if you really want (though it's not trivial to do so!). The bad news is, you probably can't tell if it matters unless you test it.

Another consideration is the manner in which significance is evaluated. Many of these algorithms use a randomization scheme. That is, they randomize portions of one sequence and re-align it. It then compares this fake alignment with the real one. If your sequence contains a lot of repeats you may want to increase the amount of randomization (careful though, raising the number of RANs greatly increases computational time).

Best-fit is probably a better way of perfoming a pairwise alignment where the sequences have different lengths. Best-fit is an implementation of the Smith-Waterman algorithm. It seeks to maximize the number of matches by inserting gaps to find regions of local similarity. It can be used to identify common motifs among selected molecules. That is, it can identify regions of conservation amidst long stretches of sequence not subject to constraint. In fact, I find that Best-fit can be used more easily to identify nucleic acid motifs than dedicated motif-finders such as MEME.

As with Gap, the gap creation and extension penalties should be considered carefully. It is also subject to the same scoring matrix issues. Significance is scored in the same way so adjust the RAN parameter as necessary.

Despite the name, ProfileGap uses the Smith-Waterman algorithm and is thus more like Best-fit than it is Gap. It can be used to align a sequence with a previously-defined set. This is useful if you want to quickly assess how closely related a newly-identified molecule is to a particular family of sequences (i.e. the extent of similarity = homology?). To make a profile from a set of aligned sequences use the ProfileMake program.

Pileup is perhaps the most useful, yet potentially misused, program in the GCG package. It employs the method of Feng and Doolittle to generate a multiple sequence alignment using progressive, pairwise alignments. It starts with an aligned pair and then expands the cluster until all of the sequences have been compared with the preexisting cluster. The pairwise alignments are performed using the Needleman-Wunsch method (thus it is Gap-like). It works best if the sequences are of uniform length. The default settings do not penalize gap creation at the ends so it is a little bit permissive with regard to varied sequence lengths. Note that the dendrogram it creates is not a phylogenetic tree- it merely represents the clustering order used in the final alignment.

Pileup can be used with a relatively high gap creation penalty (say 4 or 5) if you want to try to divide sequences into families (if you're careful, the dendrogram can be used for this). As an aside, I believe that gap extension penalties should always be set to 1 or 0 due to the preponderance of insertions and deletions in in vitro selected molecules. On the other hand, if you are analyzing the results of a doped selection you may want an aesthetically pleasing alignment (that is one that has clean vertical spacing) so you can identify co-variation and/or spots of mutation. Setting the gap creation penalty lower, to 2, 1 or even 0 will facilitate this.

MEME uses the method of Bailey and Elkin to identify likely motifs. Roughly, a Bayesian probability model is employed to predict the likelihood of a common sequence element being found amongst un-aligned sequences. I have not found MEME to be very useful; for small molecules with small expected motifs it is difficult to get statistically meaningful results. Another problem is that it's hard to know, a priori, what size words to use in the search. Believe it or not, the best way to identify motifs is probably to use Best-fit or even Pileup to do an alignment and then peruse the sequences manually (and confirm them experimentally...).

Background reading

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis:

Probabilistic models of proteins and nucleic acids. Cambridge Press:

Cambridge and New York.

 

Perl Scripts for renaming and formatting files for seqlab

makenicenames can be entered using a text editor. Running it with the command perl makenicenames will execute the script and will remove the spaces from all the file names in the current directory.

#!/usr/bin/perl

#written by Jonathan Urbach, April 2000

#removes the spaces from all files in the current directory

chomp(@filenames=`ls`);

foreach $fn (@filenames)

{

($on=$fn) =~ s/ /_/g ;

print ("$fn ---> $on\n");

rename ($fn, $on);

}

makefasta can be entered with a text editor. It will add a ">[filename]" to the beginning of ALL files in a specified directory. WARNING makefasta modifies all the files within the specified directory so be sure it only contains the desired sequences. Use the command perl makefasta. At the prompt for directory enter [target directory]/.

#!/usr/bin/perl

#written by James Carothers, April 2000

#makes into fasta format for seqlab

#it uses the filename as the sequence name

#specify the directory containing only non-FASTA formatted sequence files

print "\n enter name of directory containing (desired!) files: \n";

$dir = <STDIN>;

chomp $dir;

chomp(@filenames=`ls $dir`);

foreach $filename (@filenames)

{

$filenameb = "$dir$filename";

print "\n $filenameb \n";

open (INFO, $filenameb);

@sequence = <INFO>;

print "\n @sequence \n";

close (INFO);

open (INFO, ">$filenameb");

print INFO ">$filename \n";

print INFO "@sequence\n";

close (INFO);

}