Designing Nucleic Acid Pools for In Vitro Selection

Jonathan Urbach

February 4, 1998


A great deal of effort goes into designing a pool for in vitro selection before any oligonucleotide synthesis begins. The main constraint is that the pool must contain the structural elements required for the selection, such as an aptamer or primer binding sites. The pool must also amplify efficiently by PCR, so that PCR itself exerts as little selective pressure as possible. Finally, the pool must be technologically and financially feasible to construct.

The Promoter Region

All pools used for RNA selections and for peptide selections contain a T7 promoter. While it is in theory possible to use another type of promoter, T7 is the most practical. A minimal T7 promoter sequence is:

   -17            -1
5'-TAATACGACTCACTATA-3'   T7 promoter

This, when annealed to an antisense DNA strand, is enough to allow transcription by T7 RNA polymerase.(1) This represents the beginning of the 5' primer used to construct pools for in vitro selection. Immediately after this sequence are the first bases of the 5' constant region. Two or more bases, especially G and C, in the -18 and -19 positions can be beneficial for transcription because they stabilize the 5' end of the promoter region duplex.

Transcription is optimal when initiated with a stretch of purines. The first two nucleotides, in the +1 and +2 positions, are the most critical (there is no 0 position; +1 follows -1 in this case). GG transcribes the best of all starting sequences. GC and GA are about twofold less efficient.(2) The presence of U (or T in the DNA code) should be avoided in the first 6 or so bases.
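The rules above are simple enough to check mechanically. The following Python sketch is only an illustration: the function name check_transcription_start and the fixed window of 6 bases (standing in for "6 or so") are assumptions made here, not part of any published tool.

```python
# Minimal T7 promoter, positions -17 to -1 (from the text above).
T7_PROMOTER = "TAATACGACTCACTATA"

def check_transcription_start(primer, window=6):
    """Return a list of warnings about the transcribed start of a 5' primer.

    primer is DNA written 5'->3'. window is an assumed cutoff for the
    "first 6 or so bases" in which T (U in the RNA) should be avoided.
    """
    primer = primer.upper()
    idx = primer.find(T7_PROMOTER)
    if idx < 0:
        return ["no minimal T7 promoter found"]
    start = primer[idx + len(T7_PROMOTER):]  # bases from +1 onward
    if len(start) < 2:
        return ["fewer than two bases after the promoter"]
    warnings = []
    first_two = start[:2]
    if first_two in ("GC", "GA"):
        warnings.append(f"starts with {first_two}: about twofold less efficient than GG")
    elif first_two != "GG":
        warnings.append(f"starts with {first_two}: not GG, GC or GA")
    if "T" in start[:window]:
        warnings.append(f"T (U in the RNA) within the first {window} transcribed bases")
    return warnings

if __name__ == "__main__":
    # GC clamp at -19/-18, promoter, then a purine-rich start.
    print(check_transcription_start("GCTAATACGACTCACTATAGGGAGACAAG"))
```

An empty list means that none of the checks fired; it does not guarantee good transcription.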

Since long pools are often constructed by ligation of smaller pool fragments, it is sometimes useful to have a restriction site within the 3' and 5' constant regions. These are generally Ava I (C|YCGRG), Ban I (G|GYRCC), and Sty I (C|CWWGG).(3) These enzymes are useful because their degenerate recognition sequences allow asymmetric sites with easily ligated four-base overhangs, and because they are relatively inexpensive, about $0.02/unit. Asymmetric restriction sites are useful because they minimize dimerization by self-ligation.
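To see which concrete sites give asymmetric overhangs, one can expand each degenerate recognition sequence and keep the versions whose four-base overhang is not its own reverse complement. The Python sketch below assumes that each enzyme cuts the bottom strand at the mirror-image position of the top-strand cut marked "|"; the function names are illustrative only.

```python
from itertools import product

# Single-letter codes used in the text: R = A or G, Y = C or T, W = A or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "W": "AT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def asymmetric_sites(pattern):
    """List (site, overhang) pairs whose overhang is not self-complementary.

    pattern is a recognition sequence with the top-strand cut marked by '|',
    for example 'C|YCGRG'. Assumes a symmetric cut on the bottom strand.
    """
    cut = pattern.index("|")
    bases = pattern.replace("|", "")
    results = []
    for combo in product(*(IUPAC[b] for b in bases)):
        site = "".join(combo)
        overhang = site[cut:len(site) - cut]
        if overhang != revcomp(overhang):
            results.append((site, overhang))
    return results

if __name__ == "__main__":
    for name, pattern in [("Ava I", "C|YCGRG"), ("Ban I", "G|GYRCC"), ("Sty I", "C|CWWGG")]:
        print(name, asymmetric_sites(pattern))
```

For Ava I, for example, this keeps CCCGAG and CTCGGG and rejects the palindromic CCCGGG and CTCGAG, whose cut ends could ligate to themselves.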

It can be useful to have some identifying pool feature within or flanking the random region of a pool. Since it is not part of the primer, this serves as a marker for the pool, an indicator of its origin. In situations where contamination of one pool with another is suspected, having an internal identifier in the pools is helpful. Such an identifier can be a restriction site or some other stretch of nonrandom bases.

Often, after an initial selection, it is desirable to redesign the pool and reselect. Redesign may include deleting, inserting or substituting regions. Restriction sites may also be included in pools when the need for reselection is anticipated.

Other major types of structural elements that appear in pools are preengineered functional regions. These can be preselected aptamers for use as substrate binding domains(4),(5) or as starting points for other related functions(6); stem-loop structures derived from naturally occurring ribozymes(7); or paired regions intended to facilitate some function(8).

Minimizing Secondary Structure and Mispriming

Generally after all the pool construction constraints are considered, the pool designer is left to select the remainder of the 5' and 3' constant regions. The sequence of the constant regions can be chosen to minimize secondary structure and mispairings that can interfere with proper PCR amplification.

A useful tool for designing pools is the codon matrix, a 4x4x4 matrix containing all 64 possible nucleotide triplets. In designing a pool, one can start with the basic required sequences. These include the T7 promoter region, the restriction sites and any other design elements to be incorporated into the pool. Starting with the first triplet and advancing one base at a time, cross out every codon in the pool, and circle all the complements. Then start adding bases one at a time to the constant regions, avoiding as you go those that generate a triplet that has already been used. For longer sequences, it becomes impossible to avoid repeating codons. Still, it is advisable to minimize repeats. In general, homologous regions of 4 bases or fewer are not a great concern.
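The codon matrix bookkeeping can also be done by a short program. The Python sketch below makes two assumptions that the text does not spell out: "complement" is taken to mean the reverse complement of a triplet (the sequence that could pair with it), and candidate bases are tried in the fixed order G, A, C, T, which is an arbitrary choice. The names mark_matrix and extend are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def triplets(seq):
    """All overlapping triplets, advancing one base at a time."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def mark_matrix(required):
    """Return (crossed, circled): triplets used by the required sequences
    and their reverse complements (assumed meaning of "complements")."""
    crossed = set()
    for seq in required:
        crossed.update(triplets(seq.upper()))
    circled = {revcomp(t) for t in crossed}
    return crossed, circled

def extend(seq, n, crossed, circled, preference="GACT"):
    """Greedily append up to n bases so that each new triplet is neither
    crossed out nor circled. Stops early if every base would repeat one."""
    seq = seq.upper()
    crossed, circled = set(crossed), set(circled)  # work on copies
    for _ in range(n):
        for base in preference:
            t = (seq + base)[-3:]
            if len(t) == 3 and (t in crossed or t in circled):
                continue  # this triplet is already used; try the next base
            seq += base
            if len(t) == 3:
                crossed.add(t)
                circled.add(revcomp(t))
            break
        else:
            break  # all four bases repeat a triplet: repeats are unavoidable here
    return seq

if __name__ == "__main__":
    seed = "TAATACGACTCACTATAGG"  # T7 promoter plus a GG start
    crossed, circled = mark_matrix([seed])
    print(extend(seed, 12, crossed, circled))
```

A greedy search like this can stop before n bases have been added. At that point the designer has to accept a repeat, as noted above, or back up and try a different base.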

Homology between the 3' ends of primers and internal pool sequences is especially deleterious because it can lead to mispriming, which not only produces deletions but also consumes primer, reducing the yield of full-length PCR product. This situation may occur when restriction sites in pools are repeated due to pool construction. The result is the emergence of PCR products of less than full length. These shorter PCR artifacts are generally amplified more efficiently than larger PCR products and can therefore be a problem. They may be minimized by choosing primers that overlap only minimally with the repeated restriction site. Reverse transcription and PCR conditions may also be chosen to reduce this problem. Reverse transcription at 42°C instead of 37°C is advisable. During PCR, annealing of primers at 60°C instead of 50°C can also be helpful.
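A quick screen for this kind of 3' homology is to search both strands of the pool for the last few bases of each primer. The Python sketch below is a rough illustration: the tail length of 6 is an assumed cutoff, exact matching ignores partial homology, and the sequences in the example are invented for the demonstration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def find_all(haystack, needle):
    """Start positions (0-based) of every occurrence, overlaps included."""
    hits = []
    i = haystack.find(needle)
    while i != -1:
        hits.append(i)
        i = haystack.find(needle, i + 1)
    return hits

def three_prime_hits(primer, pool, tail=6):
    """Where do the last `tail` bases of the primer match the pool?

    'forward' lists matches in the pool sequence as written, 'reverse'
    lists matches in its reverse complement. A well-behaved primer has
    exactly one hit in total, at its intended site.
    """
    end = primer.upper()[-tail:]
    top = pool.upper()
    return {"forward": find_all(top, end), "reverse": find_all(revcomp(top), end)}

if __name__ == "__main__":
    # Invented example: a Ban I site (GGCACC) occurs at the primer's
    # 3' end and again inside the pool.
    pool = "GGAACACTGGCACC" + "AAAAAAAA" + "GGCACC" + "TTTTCTCTCT"
    print(three_prime_hits("GGAACACTGGCACC", pool))  # primer ends on the repeated site
    print(three_prime_hits("GGAACACTGG", pool))      # primer overlaps the site minimally
```

In this example the first primer matches at two positions and the shortened primer at only one, which illustrates the advice to choose primers that overlap only minimally with a repeated restriction site.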


Using mfold to Minimize Pool Secondary Structure

Another useful tool for optimizing pool structure is the program mfold, which is part of the GCG sequence analysis package. One may enter a sequence into mfold and output folded structures. If, for example, the lowest energy folded structure is a 4 base stem containing a wobble, there is no problem. If there are 3 low energy structures, all with 7 base stems, which are not part of any predesigned structure, this may be problematic. Changing a single base to remove a stem may bring about a whole new structure that had not been anticipated. It is good to reconfirm that a stem has been removed by running mfold on every new sequence.
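Before running mfold, a crude pre-screen can flag candidate constant regions that contain long perfectly paired stems. The Python sketch below is not a substitute for mfold: it does no energy calculation, ignores wobble pairs and mismatches, and only reports exact Watson-Crick stems. The defaults of a 7 base stem and a 3 base minimum loop are assumptions chosen to echo the example above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def find_stems(seq, min_stem=7, min_loop=3):
    """Return (i, j, arm) for each stretch of min_stem bases starting at i
    whose exact reverse complement starts at j, at least min_loop bases
    downstream of the first arm. Positions are 0-based. U is treated as T."""
    seq = seq.upper().replace("U", "T")
    stems = []
    for i in range(len(seq) - 2 * min_stem - min_loop + 1):
        arm = seq[i:i + min_stem]
        target = revcomp(arm)
        j = seq.find(target, i + min_stem + min_loop)
        while j != -1:
            stems.append((i, j, arm))
            j = seq.find(target, j + 1)
    return stems

if __name__ == "__main__":
    # An 8 base hairpin with a 4 base loop, reported as two overlapping 7-mers.
    print(find_stems("GGGAGACCTTTTGGTCTCCC"))
```

A sequence that passes this screen can still fold; the screen only removes the most obvious offenders before the real folding run.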

Using mfold on genetics is relatively simple. The GCG package may be initialized on genetics by typing:


A sequence may be entered by running seqed:
	seqed filename.seq

The program starts in a comment mode. After comments are entered, ctrl<D> will start the sequence entry mode of the program (figure 2). After the sequence is entered, ctrl<D> then wq will save the sequence and quit the program.

To use mfold, type:

	mfold filename.seq

The program will launch and ask at which base you want to begin folding and at which base you wish to stop:

	> mfold important.seq
	Begin (* 1 *) ?
	End (*   119 *) ?
	What should I call the energy matrix output file (* important.mfold *) ?
	Folding .......................
	CPU time: 00.97
	Output file: important.mfold

To display folded structures, type

	plotfold -showseq filename.mfold <return>

You will then be prompted for loop size and different kinds of output formats:

	> plotfold -showseq important.mfold
	PlotFold displays the optimal and suboptimal secondary structures
	for an RNA molecule predicted by MFold.
	Process set to plot with LASERWRITER attached to
	using the psd graphic interface.
	Maximum size of interior loop = 30
	Maximum lopsidedness of an interior loop = 30
	Do you want to display:
	SURVEY OF OPTIMAL AND SUBOPTIMAL FOLDINGS
	           A) energy dotplot
	           B) p-num plot
	SAMPLING OF OPTIMAL AND SUBOPTIMAL FOLDINGS
	           C) circles
	           D) domes
	           E) mountains
	           F) squiggles
	           G) text output
	           H) connect file output
	Please choose one (* A *):

Choose F or G and you will be prompted for a few more parameters:

	Energy of optimal structure = -4.8
	Plot structures at what energy increment (* 2.0 *) ?
	Up to how many structures do you want to plot (* 25 *) ?
	What window size (* 3 *) ?
	What should I call the structure output file (* important.fld *) ?
	Structures plotted: 1

Typing more important.fld will display the foldings in a text format. If instead of G (text output) you chose F (squiggles), you will output a graphical image of the structure that can be printed on a PostScript laser printer or viewed with a program like macgs. The structure in figure 3 is an example of a "squiggles" output.

More details on how to use the GCG Package are available from the department computer personnel.

An Alternative Method of Generating Constant Region Sequence

Kourosh Salehi suggests using a BASIC program to generate random sequences, then using mfold to select sequences with minimal secondary structure as described in the previous section. An example of his program, written for QBASIC on the PC, follows:

	750 randomize timer
	810 print "How many "; : input m
	800 print "Enter length "; : input L
	820 print "Print (Y or any key)"; : input P$
	840 for h=1 to m
	850 G=0: A=0: T=0: C=0: GC=0
	890 OL$=""
	900 for b=1 to L
	1010 R=RND*4: R=R+1
	1020 R=INT(R)
	1030 if R=1 then N$ = "G": G=G+1
	1040 if R=2 then N$ = "A": A=A+1
	1050 if R=3 then N$ = "T": T=T+1
	1060 if R=4 then N$ = "C": C=C+1
	1070 OL$=OL$ + N$
	1100 next b
	1120 GC = (G+C)/(G+C+A+T)
	1130 print OL$, GC
	1140 if P$ = "Y" then lprint OL$, GC
	1150 if P$ = "Y" then lprint
	1200 next h
	2000 REM Program written by KSA, 1997

This program does not work on the Macintosh as written. The following version works with Chipmunk Basic for the Macintosh:

	750 randomize timer
	800 print "Enter length "; : input l
	810 print "How many "; : input m
	820 print "Save output to a file? (y or any key)"; : input p$
	830 if p$ = "y" then open "SFPutFile" for output as #1
	840 for h = 1 to m
	850 g = 0 : a = 0 : t = 0 : c = 0 : gc = 0
	890 ol$ = ""
	900 for b = 1 to l
	1010 r = rnd(4)
	1030 if r = 0 then n$ = "G" : g = g+1
	1040 if r = 1 then n$ = "A" : a = a+1
	1050 if r = 2 then n$ = "T" : t = t+1
	1060 if r = 3 then n$ = "C" : c = c+1
	1070 ol$ = ol$+n$
	1100 next b
	1120 gc = (g+c)/(g+c+a+t)
	1130 print ol$,gc
	1140 if p$ = "y" then print #1,ol$,gc
	1200 next h
	1300 close #1
	2000 rem Program written by KSA, 1997
	2001 rem Ported to Chipmunk Basic by JMU, 1998
	2010 end


At this point, the pool is ready to be synthesized. Although not all structural problems can be foreseen, many can, and these can be avoided by careful planning of the sequence.

(1) J. F. Milligan and O. C. Uhlenbeck, Methods in Enzymology, 180, 51 (1989).
(2) Ibid.
(3) Single letter codes: R= A or G, Y= C or T, W= A or T
(4) J. R. Lorsch and J. W. Szostak, Nature, 371, 31 (1994).
(5) A. J. Hager and J. W. Szostak, Chemistry and Biology, 4, 607-617 (1997).
(6) M. Famulok, J. Am. Chem. Soc., 116, 1698 (1994).
(7) D. P. Bartel, Science, 261, 1411 (1993).
(8) P. A. Lohse and J. W. Szostak, Nature, 381, 442 (1996).