Protein pool design

Tony Keefe

4/12/98


Statistical appearance of different amino acids

Obviously you will want a range of amino acids in your pool. Using a mixture of all four bases will ensure that all twenty amino acids have some probability of appearing in each position. Whatever mixture you use, you will want to consider the statistical mixture of amino acids that will result and whether this is suitable for your purposes.
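The statistical mixture that results from a given codon scheme can be worked out by enumerating the codons it allows. The following is a minimal Python sketch, assuming the standard genetic code and equal incorporation of every base in a mixture. The names `residue_percentages`, `MIX` and `CODON_TABLE` are illustrative only, and DNA letters (T rather than U) are used.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the DNA alphabet; bases ordered T, C, A, G; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Mixture codes used in the text (N, S, Y, V); single bases map to themselves.
MIX = {"A": "A", "C": "C", "G": "G", "T": "T",
       "N": "ACGT", "S": "CG", "Y": "CT", "V": "ACG"}


def residue_percentages(scheme):
    """Return {one-letter residue: percent} for a three-letter scheme such as 'NNS'.

    Assumption: every base in a mixture is incorporated with equal probability.
    """
    codons = ["".join(c) for c in product(*(MIX[x] for x in scheme))]
    counts = Counter(CODON_TABLE[c] for c in codons)
    return {aa: 100.0 * n / len(codons) for aa, n in counts.items()}


for scheme in ("NNN", "NNS", "NNY"):
    freqs = residue_percentages(scheme)
    n_amino_acids = len([aa for aa in freqs if aa != "*"])
    print(scheme, n_amino_acids, "amino acids,",
          "stops: %.3f%%" % freqs.get("*", 0.0))
```

Real synthesis mixtures are rarely perfectly equimolar, so the numbers from such a calculation are best treated as a first estimate.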

Frequency of stop codon appearance

Using a mixture of all four bases will also introduce stop codons into your pool. This will in turn lower the proportion of expressed protein that is able to fuse to its mRNA. By altering the proportions of A, T, G and C in the synthesis mixture the frequency of stop codons can be reduced, though this will influence the proportions of all of the amino acids.
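The effect of changing the base proportions on the stop codon frequency can be estimated directly, since there are only three stop codons. The sketch below assumes that the bases at the three codon positions are incorporated independently. The function names are illustrative, and the "reduced T" mixture is a hypothetical example rather than a recommended recipe.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_frequency(pos1, pos2, pos3):
    """Probability that a random codon is a stop codon.

    Each argument maps a base (A, C, G, T) to its proportion at that codon
    position; bases that are absent from a mapping count as zero.
    """
    return sum(pos1.get(c[0], 0.0) * pos2.get(c[1], 0.0) * pos3.get(c[2], 0.0)
               for c in STOP_CODONS)


def stop_free_fraction(p_stop, n_codons):
    """Fraction of sequences with no stop in n_codons independent random codons."""
    return (1.0 - p_stop) ** n_codons


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical mixture with less T in the first codon position (illustrative only).
low_t = {"A": 0.30, "C": 0.30, "G": 0.30, "T": 0.10}

for label, first in (("equimolar", equal), ("reduced T at position 1", low_t)):
    p = stop_frequency(first, equal, equal)
    print(label, "stop frequency per codon: %.4f" % p,
          "stop-free over 56 codons: %.3f" % stop_free_fraction(p, 56))
```

The 56 codons used in the example correspond to the number of randomized codons in the pool described in the appendices.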

Hydrophobicity

When considering what proportions of amino acids are suitable for a particular purpose, one of the factors considered should be hydrophobicity. While attached to the mRNA the protein is unlikely to precipitate, but if the ultimate aim is to discover a protein which can act independently of its mRNA, then hydrophobicity should be considered. Counting the proportion of hydrophobic amino acids in your pool and comparing it to averaged natural proteins is one approach. I took the hydrophobic amino acids to be Phe, Met, Ile, Leu, Val, Cys, Trp, Ala, Thr, Gly and Ser (11 in total), those which have a positive free energy of transfer from an α-helix in a membrane interior into water. These amino acids appear 57.9% of the time in an average of 356 proteins (Nishikawa and Ooi, J. Biochem. 91, 1821-4, (1982)) and I used this as the goal for my pool. More sophisticated approaches would consider particular classes of proteins related to the target system, and would use a better measure of hydrophobicity, such as the average energy it would take to move the hydrophobic amino acids in the pool from a hydrophobic environment into water.
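The counting approach amounts to summing the percentages of the eleven residues listed above. A minimal Python sketch follows, using the "Protein" and "Pool" compositions tabulated in Appendix 1; the dictionary and function names are illustrative.

```python
HYDROPHOBIC = {"Phe", "Met", "Ile", "Leu", "Val", "Cys",
               "Trp", "Ala", "Thr", "Gly", "Ser"}

# Percent composition, copied from the "Protein" and "Pool" columns of Appendix 1.
AVERAGE_PROTEIN = {
    "Ala": 8.8, "Arg": 4.4, "Asn": 5.25, "Asp": 5.25, "Cys": 1.4,
    "Gln": 5.3, "Glu": 5.3, "Gly": 8.1, "His": 2.1, "Ile": 5.0,
    "Leu": 8.1, "Lys": 6.5, "Met": 1.9, "Phe": 3.8, "Pro": 4.7,
    "Ser": 6.8, "Thr": 5.9, "Trp": 1.1, "Tyr": 3.3, "Val": 7.0,
}
POOL = {
    "Ala": 6.250, "Arg": 14.063, "Asn": 4.688, "Asp": 4.688, "Cys": 3.125,
    "Gln": 1.563, "Glu": 1.563, "Gly": 9.375, "His": 4.688, "Ile": 4.688,
    "Leu": 6.250, "Lys": 1.563, "Met": 1.563, "Phe": 3.125, "Pro": 6.250,
    "Ser": 7.813, "Thr": 6.250, "Trp": 3.125, "Tyr": 3.125, "Val": 6.250,
}


def hydrophobic_percent(composition):
    """Sum the percentages of the residues classed as hydrophobic."""
    return sum(pct for aa, pct in composition.items() if aa in HYDROPHOBIC)


print("average protein: %.1f%%" % hydrophobic_percent(AVERAGE_PROTEIN))
print("pool:            %.1f%%" % hydrophobic_percent(POOL))
```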

Overall charge and number of charges

Another factor to consider is the average charge of the pool and the average number of charges. An approximate average total charge can be obtained by subtracting the average number of Asp and Glu residues from the average number of Lys and Arg residues. More sophisticated approaches would consider the intended pH and other ionizable amino acids such as His. Having some total charge will encourage dissolution in water, and it would probably be best if the sign of the total average charge on the pool was the opposite of the sign of the charge of the intended substrate. The number of charged residues (which can be calculated by adding the numbers of Asp, Glu, Lys and Arg) is important since charged groups are often found in catalytic centers.
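Both quantities follow from four numbers. The sketch below applies the approximation described above to the pool composition given in Appendix 1; it ignores pH and His, and the names are illustrative.

```python
# Percentages of the charged residues, from the "Pool" column of Appendix 1.
POOL_PERCENT = {"Arg": 14.063, "Lys": 1.563, "Asp": 4.688, "Glu": 1.563}


def charge_summary(percent):
    """Return (net charge per residue, charged residues per residue).

    Approximation: Arg and Lys count as +1, Asp and Glu as -1, all else as 0.
    """
    positive = percent["Arg"] + percent["Lys"]
    negative = percent["Asp"] + percent["Glu"]
    return (positive - negative) / 100.0, (positive + negative) / 100.0


net, charged = charge_summary(POOL_PERCENT)
print("net charge per residue:       %+.4f" % net)
print("charged residues per residue: %.4f" % charged)
```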

Frequently used codons end in G or C

Statistical studies of sequenced genomes have shown that, for the majority of amino acids for which more than one codon exists, certain codons appear more frequently than others. The more popular codons tend to end in G or C, which may relate to the extra efficiency that results from having three hydrogen bonds in the wobble position of the tRNA-mRNA complex. Consequently you may wish to design your pool so that G and C are the only nucleotides in the third position of each codon.

Periodicity

Some mixtures of bases result in the total omission of stop codons; VNN, for example (V is a mixture of A, G and C), does not code for any stop codons. Unfortunately such approaches always result in the loss of some of the 20 amino acids; VNN does not code for Cys, Phe, Trp or Tyr, for example. Mixing different such codons together can give a pool which contains all twenty amino acids but no stop codons. This, however, necessarily introduces some periodicity into the pool (assuming that the appearance of the different codons is itself periodic). Periodicity is not in itself bad; it can result in structure: alternating hydrophobic and hydrophilic amino acids will give beta-sheets, while alternating pairs of hydrophobic and hydrophilic amino acids will give alpha-helices. However, periodicity is something to think about.
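Which amino acids a stop-free scheme loses, and whether a combination of schemes recovers them, can be checked by enumeration. The sketch below assumes the standard genetic code and uses DNA letters; `encoded_residues` is an illustrative name. The combination tested (VNS, NNC and NGG) is the one used for the pool described in Appendix 1.

```python
from itertools import product

# Standard genetic code in the DNA alphabet; bases ordered T, C, A, G; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
MIX = {"A": "A", "C": "C", "G": "G", "T": "T",
       "N": "ACGT", "S": "CG", "V": "ACG"}
ALL_AMINO_ACIDS = set(AMINO) - {"*"}


def encoded_residues(scheme):
    """Set of one-letter residues ('*' = stop) that a three-letter scheme can encode."""
    return {CODON_TABLE["".join(c)] for c in product(*(MIX[x] for x in scheme))}


vnn = encoded_residues("VNN")
print("VNN encodes a stop:", "*" in vnn)
print("VNN is missing:", sorted(ALL_AMINO_ACIDS - vnn))

# Union of the three stop-free schemes used for the pool in Appendix 1.
combined = set().union(*(encoded_residues(s) for s in ("VNS", "NNC", "NGG")))
print("combined schemes encode a stop:", "*" in combined)
print("combined schemes cover all 20:", combined == ALL_AMINO_ACIDS)
```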

Promoter

You will need some kind of transcription promoter at the 5' end of the pool. This can be added or changed by PCR. I used the T7 promoter sequence TAA TAC GAC TCA CTA TA.

Enhancer

You will need some kind of translation enhancer before the initiating methionine codon. I used the truncated 5'-UTR TMV sequence GGG ACA ATT ACT ATT TAC AAT TAC A.

AUG then G

The open reading frame should start with an initiating codon; AUG is the obvious one to use. In natural proteins the second codon often starts with a G, so that is what I used. Alternatively, the second residue is often Val in natural proteins, and this could also be used in pools.

Restriction sites for construction of pool from multiple pieces

If your pool is anything other than very short you will need to synthesize more than one piece of DNA and join these together with restriction enzymes. This will result in short sections of fixed amino acids in the middle of the open reading frame. Check which fixed amino acids will result and try to get innocuous ones; insert an extra fixed base or two if necessary, or change restriction enzyme. Note, however, that only Ban I, Ban II, Sty I, Ava I and some others should be used, since these are the restriction enzymes which act upon non-palindromic (asymmetric) sites.

May need a cysteine in each piece for thiopropyl sepharose separation.

May need extra fixed methionines to increase the 35S-methionine signal.

Frameshift

Insertions and deletions are much more common than most people suppose. The first protein pool that I constructed (from two pieces, each approximately 130 nucleotides long) was cloned, and only 1 out of 8 clones had no frameshifted region; many had more than one frameshift. You should consider the effect of frameshifts upon your pool. Appendix 1 shows various statistics relating to my pool, while Appendices 2 and 3 show the effect of frameshifts upon the same statistics. The effects are pronounced. The best way of avoiding the huge loss of diversity that these frameshifts cause is to select only those pieces of the pool that are not frameshifted, before amplifying them by PCR and ligating them together. Loss of diversity in each fragment can easily be compensated for by an increase in combinatorial diversity at the ligation stage. Design features which allow the direct selection of non-frameshifted sequences include His tags, protease sequences, or any other property that results from the protein sequence, since frameshifts will destroy such properties. Each piece will need to be ligated to an oligo bearing puromycin at its 3' end and translated to give a protein-nucleic acid fusion. Selected molecules can then be amplified using RT-PCR and ligated to give a non-frameshifted whole pool.

Consider that others may wish to construct other pools from your pieces and/or use your pool for different purposes.

Consider that some sequences are immune, or nearly so, to frameshift

NNN NNN NNN is immune, though fixed parts will still be frameshifted.

YXY XYX YXY XYX has a periodicity of two, which is changed from ABAB to BABA on frameshift (X and Y are any mixture of the four bases; A and B are the codons YXY and XYX).

XXY YXX YYX XYY XXY YXX YYX XYY has a periodicity of four, which is changed from ABCDABCD to DABCDABC on frameshift (X and Y are any mixture of the four bases; A, B, C and D are the codons XXY, YXX, YYX and XYY).

In general any sequence (ABCD...)n can be used to generate a frameshift-independent pool, so long as the number of codons within the repeating unit is not exactly divisible by 3.
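The behaviour of such periodic designs can be checked by reading the same pattern in all three frames. The sketch below is one way to do this, under the assumption that the design is specified as a repeating nucleotide-level pattern (for example "YX" or "XXYY", as in the two examples above); the function names are illustrative. A design counts as frameshift-independent here if reading one or two nucleotides out of frame gives a cyclic rotation of the in-frame codon pattern.

```python
def codon_frames(nucleotide_unit, shift=0, n_codons=12):
    """Codons read from the repeated nucleotide pattern, starting at offset `shift`."""
    needed = shift + 3 * n_codons
    repeats = needed // len(nucleotide_unit) + 1
    s = nucleotide_unit * repeats
    return [s[i:i + 3] for i in range(shift, needed, 3)]


def is_frameshift_independent(nucleotide_unit):
    """True if both out-of-frame readings are rotations of the in-frame codon pattern."""
    period = len(nucleotide_unit)  # the codon pattern repeats within `period` codons
    in_frame = codon_frames(nucleotide_unit, 0, 2 * period)
    rotations = [in_frame[k:k + period] for k in range(period)]
    return all(codon_frames(nucleotide_unit, shift, period) in rotations
               for shift in (1, 2))


print(codon_frames("XXYY", 0, 4))   # in frame:      A B C D
print(codon_frames("XXYY", 1, 4))   # out of frame:  D A B C
for unit in ("N", "YX", "XXYY", "XYYYXYXXY"):
    print(unit, is_frameshift_independent(unit))
```

The last pattern in the example, nine nucleotides long and therefore three codons per repeat, is a hypothetical counter-example.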

Flexible linker

It may be wise to include a structureless, flexible linker region in the amino acids which will be closest to the nucleic acid part of the fusion molecule; these are coded at the 3' end of the open reading frame. This should help to discourage the protein region of the fusion molecule from associating with the nucleic acid part. (SerGly)n has been used in similar systems in the past. It is probably a good idea to avoid repeating sequences in the nucleotides that code for this linker, as this could result in misaligned primers.

Appendix 1

Frequency of appearance (%) of amino acids resulting from various "codons", an "average protein" and my protein pool.

      NNN     NNS     NNY     NYN     VNN     NGG     NNC     VNS     Protein*  Pool
Ala   6.25    6.25    6.25    12.5    8.333   0       6.25    8.333   8.8       6.250
Arg   9.375   9.375   6.25    0       12.5    50      6.25    12.5    4.4       14.063
Asn   3.125   3.125   6.25    0       4.167   0       6.25    4.167   5.25?     4.688
Asp   3.125   3.125   6.25    0       4.167   0       6.25    4.167   5.25?     4.688
Cys   3.125   3.125   6.25    0       0       0       6.25    0       1.4       3.125
Gln   3.125   3.125   0       0       4.167   0       0       4.167   5.3?      1.563
Glu   3.125   3.125   0       0       4.167   0       0       4.167   5.3?      1.563
Gly   6.25    6.25    6.25    0       8.333   25      6.25    8.333   8.1       9.375
His   3.125   3.125   6.25    0       4.167   0       6.25    4.167   2.1       4.688
Ile   4.688   3.125   6.25    9.375   6.25    0       6.25    4.167   5.0       4.688
Leu   9.375   9.375   6.25    18.75   8.333   0       6.25    8.333   8.1       6.250
Lys   3.125   3.125   0       0       4.167   0       0       4.167   6.5       1.563
Met   1.563   3.125   0       3.125   2.083   0       0       4.167   1.9       1.563
Phe   3.125   3.125   6.25    6.25    0       0       6.25    0       3.8       3.125
Pro   6.25    6.25    6.25    12.5    8.333   0       6.25    8.333   4.7       6.250
Ser   9.375   9.375   12.5    12.5    4.167   0       12.5    4.167   6.8       7.813
Thr   6.25    6.25    6.25    12.5    8.333   0       6.25    8.333   5.9       6.250
Trp   1.563   3.125   0       0       0       25      0       0       1.1       3.125
Tyr   3.125   3.125   6.25    0       0       0       6.25    0       3.3       3.125
Val   6.25    6.25    6.25    12.5    8.333   0       6.25    8.333   7.0       6.250
UAA   1.563   0       0       0       0       0       0       0       0         0
UAG   1.563   3.125   0       0       0       0       0       0       0         0
UGA   1.563   0       0       0       0       0       0       0       0         0

HPh¥  60.7    61.3    62.5    87.5    54.2    50      62.5    54.2    57.9      57.8
¥ % hydrophobic residues (HPh). Hydrophobic residues are taken as Phe, Met, Ile, Leu, Val, Cys, Trp, Ala, Thr, Gly and Ser (11 in total), those which have a positive free energy of transfer from an α-helix in a membrane interior into water.

S = G, C; N = A, U, G, C; Y = C, U; V = A, G, C

? In the reported data the percentages of (Glu and Gln) and of (Asp and Asn) were each combined; here an approximate value of half of the combined value is given.

* Average of 356 proteins (Nishikawa and Ooi, J. Biochem. 91, 1821-4, (1982)).

Pool: 50% NNC, 37.5% VNS, 12.5% NGG.

Net positive charge is 0.0937 per amino acid in the pool, calculated as (Arg + Lys) - (Glu + Asp).
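The Pool column is a weighted average of the NNC, VNS and NGG columns. A minimal Python sketch of that calculation follows, assuming the standard genetic code, equimolar bases within each mixture, and DNA letters; `mixture_percentages` is an illustrative name.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the DNA alphabet; bases ordered T, C, A, G; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
MIX = {"A": "A", "C": "C", "G": "G", "T": "T",
       "N": "ACGT", "S": "CG", "V": "ACG"}

# Proportions of each codon scheme among the randomized codons of the pool.
POOL_MIX = {"NNC": 0.5, "VNS": 0.375, "NGG": 0.125}


def mixture_percentages(weights):
    """Percent of each residue for a weighted mixture of codon schemes."""
    totals = defaultdict(float)
    for scheme, weight in weights.items():
        codons = ["".join(c) for c in product(*(MIX[x] for x in scheme))]
        for codon in codons:
            totals[CODON_TABLE[codon]] += 100.0 * weight / len(codons)
    return dict(totals)


pool = mixture_percentages(POOL_MIX)
for aa in sorted(pool):
    print(aa, "%.3f" % pool[aa])
```

The output uses one-letter residue codes; a stop would appear as "*" if any scheme in the mixture allowed one.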
 
 
Appendix 2

Frameshift: one insertion.

Pool 1ADK4C

TAA TAC GAC TCA CTA TAG GGA CAA TTA CTA TTT ACA ATT ACA
ATG GNS NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS GCC AAG GNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG ATG TGC TCT GGA TCT TCT GGA TCT

VNS becomes CVN 21 times, which has no STOPs

NNC becomes SNN 22 times and GNN 6 times, which have no STOPs

NGG becomes CNG 7 times, which has no STOPs

GCC AAG GNS becomes SGC CAA GGN (gly/arg, gln, gly)

ATG TGC TCT GGA TCT TCT GGA TCT becomes GAT GTG CTC TGG ATC TTC TGG ATC (asp, val, leu, trp, ile, phe, trp, ile)
 
 
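      CVN     SNN     GNN     CNG     Entirely frameshifted pool
%     37.5    39.3    10.7    12.5
Ala   0       12.5    25      0       7.59
Arg   33.33   12.5    0       25      20.54
Asn   0       0       0       0       0
Asp   0       6.25    12.5    0       3.79
Cys   0       0       0       0       0
Gln   16.67   6.25    0       25      11.83
Glu   0       6.25    12.5    0       3.79
Gly   0       12.5    25      0       7.59
His   16.67   6.25    0       0       8.71
Ile   0       0       0       0       0
Leu   0       12.5    0       25      8.04
Lys   0       0       0       0       0
Met   0       0       0       0       0
Phe   0       0       0       0       0
Pro   33.33   12.5    0       25      20.54
Ser   0       0       0       0       0
Thr   0       0       0       0       0
Trp   0       0       0       0       0
Tyr   0       0       0       0       0
Val   0       12.5    25      0       7.59
UAA   0       0       0       0       0
UAG   0       0       0       0       0
UGA   0       0       0       0       0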
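The pattern counts quoted in this appendix can be reproduced by re-reading the open reading frame one position out of register. The sketch below models a single insertion upstream of the random region, so that each new codon consists of the last base of the previous codon followed by the first two bases of the current one. The compact way of writing the sequence and the function name are illustrative.

```python
from collections import Counter

# Open reading frame of the pool, written compactly as repeated blocks.
ORF = ("ATG GNS "
       + "NNC VNS NNC VNS NNC VNS NNC NGG " * 3 + "NNC VNS NNC VNS "
       + "GCC AAG GNS "
       + "NNC VNS NNC NGG " + "NNC VNS NNC VNS NNC VNS NNC NGG " * 3
       + "ATG TGC TCT GGA TCT TCT GGA TCT").split()

RANDOM_CODONS = {"NNC", "VNS", "NGG"}


def patterns_after_insertion(orf):
    """Count (original codon, frameshifted pattern) pairs for the randomized codons."""
    counts = Counter()
    for previous, current in zip(orf, orf[1:]):
        if current in RANDOM_CODONS:
            counts[(current, previous[2] + current[:2])] += 1
    return counts


counts = patterns_after_insertion(ORF)
for (original, shifted), n in sorted(counts.items()):
    print(original, "becomes", shifted, n, "times")
```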
Appendix 3

Frameshift: one deletion.

Pool 1ADK4C

TAA TAC GAC TCA CTA TAG GGA CAA TTA CTA TTT ACA ATT ACA
ATG GNS NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS GCC AAG GNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG ATG TGC TCT GGA TCT TCT GGA TCT

VNS becomes NSN 20 times and NSG 1 time; NSN has STOPs 3.125% of the time

NNC becomes NCV 21 times and NCN 7 times, which have no STOPs

NGG becomes GGN 6 times and GGA 1 time, which have no STOPs

GCC AAG GNS becomes CCA AGG NSN (pro, arg, ala/arg/cys/gly/pro/ser/thr/trp/STOP, see below)

NSN has STOPs 3.125% of the time

ATG TGC TCT GGA TCT TCT GGA TCT becomes TGT GCT CTG GAT CTT CTG GAT CT (cys, ala, leu, asp, leu, leu, asp)

 
 
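      NSN     NSG     NCV     NCN     GGN     GGA     Entirely frameshifted pool
%     35.7    1.79    37.5    12.5    10.7    1.79
Ala   12.5    12.5    25      25      0       0       17.19
Arg   18.75   25      0       0       0       0       7.14
Asn   0       0       0       0       0       0       0
Asp   0       0       0       0       0       0       0
Cys   6.25    0       0       0       0       0       2.23
Gln   0       0       0       0       0       0       0
Glu   0       0       0       0       0       0       0
Gly   12.5    12.5    0       0       100     100     17.18
His   0       0       0       0       0       0       0
Ile   0       0       0       0       0       0       0
Leu   0       0       0       0       0       0       0
Lys   0       0       0       0       0       0       0
Met   0       0       0       0       0       0       0
Phe   0       0       0       0       0       0       0
Pro   12.5    12.5    25      25      0       0       17.19
Ser   18.75   12.5    25      25      0       0       19.42
Thr   12.5    12.5    25      25      0       0       17.19
Trp   3.125   12.5    0       0       0       0       1.34
Tyr   0       0       0       0       0       0       0
Val   0       0       0       0       0       0       0
UAA   0       0       0       0       0       0       0
UAG   0       0       0       0       0       0       0
UGA   3.125   0       0       0       0       0       1.12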
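The stop codon frequencies in this table can be checked by expanding each frameshifted pattern and counting how many of its codons are stops. The sketch below takes the pattern counts from the list earlier in this appendix and assumes equimolar bases within each mixture; the names are illustrative.

```python
from itertools import product

MIX = {"A": "A", "C": "C", "G": "G", "T": "T",
       "N": "ACGT", "S": "CG", "V": "ACG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Frameshifted patterns and the number of times each occurs (one deletion).
PATTERN_COUNTS = {"NSN": 20, "NSG": 1, "NCV": 21, "NCN": 7, "GGN": 6, "GGA": 1}


def stop_percent(pattern):
    """Percentage of the codons allowed by a three-letter pattern that are stops."""
    codons = ["".join(c) for c in product(*(MIX[x] for x in pattern))]
    return 100.0 * sum(c in STOP_CODONS for c in codons) / len(codons)


total_codons = sum(PATTERN_COUNTS.values())
overall = sum(n * stop_percent(p) for p, n in PATTERN_COUNTS.items()) / total_codons

for pattern in PATTERN_COUNTS:
    print(pattern, "%.3f%% stops" % stop_percent(pattern))
print("entirely frameshifted pool: %.2f%% stops" % overall)
```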
 
 

 
 

Appendix 4

Various facts relating to my pool:

Pool ADK 1A

One fragment from which the pool was constructed

GGG ACA ATT ACT ATT TAC AAT TAC AAT GGN SNN CVN SNN CVN SNN CVN SNN CNG GNN CVN SNN CVN SNN CVN SNN CNG GNN CVN SNN CVN SNN CVN SNN CNG GNN CVN SNN CVN SGC CAA GGT CTG CTC AAT GAT

Total = 135 nucleotides long

Elements, in order: primer [GGG followed by the truncated TMV translation enhancer]; methionine for initiation; GNS, to have G as the nucleotide after the initiation codon; random region (V = A, G, C; S = C, G; N = A, T, G, C); restriction enzyme site; primer (including Met, Cys, no STOP).

Primer ADK 1A5'

TAA TAC GAC TCA CTA TAG GGA CAA TTA CTA TTT ACA ATT ACA ATG G

T7 promoter followed by GGG, followed by the translation enhancer, ATG, G; 46 nucleotides long, Tm 69.4°C (59.9°C underlined).

Primer ADK 1A3'

ATC ATT GAG CAG ACC TTG GC

Complementary to Cys, Met, restriction enzyme site, no STOP; 20 nucleotides long, Tm 58.5°C.

Pool ADK 1B

Second fragment from which the pool was constructed.

GAC TGC TCA GAT GTC CAA GGN SNN CVN SNN CNG GNN CVN SNN CVN SNN CVN SNN CNG GNN CVN SNN CVN SNN CVN SNN CNG GNN CVN SNN CVN SNN CVN SNN CNG GAT GTG CTC TGG ATC TTC TGG ATC T

Total = 130 nucleotides long

Elements, in order: primer [14 random nucleotides, restriction enzyme site]; random region (V = A, G, C; S = C, G; N = A, T, G, C); primer [methionine for detection, cysteine for purification, spacer (SerGlySerSerGlySer, not the most frequently used codons, to avoid too many Gs)].

Primer ADK 1B5'

GAC TGC TCA GAT GTC CAA GG

Complementary to restriction enzyme site, partially random, 20 nucleotides long, Tm 56.4°C.

Primer ADK 1B3'

AGA TCC AGA AGA TCC AGA GCA CAT CC

Complement of constant spacer region, Cys, Met, GG; 26 nucleotides long, Tm 66°C.

Underline = Primer sequence (or complementary sequence), Bold = Restriction enzyme site.
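That each 3' primer is the reverse complement of the constant 3' end of its fragment can be confirmed with a few lines of code. In the sketch below the primer and fragment-end sequences are copied from the listings above with the spaces removed; the variable and function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


PRIMER_1A_3 = "ATCATTGAGCAGACCTTGGC"
PRIMER_1B_3 = "AGATCCAGAAGATCCAGAGCACATCC"

# Constant 3' ends of the two fragments, copied from the fragment sequences.
FRAGMENT_1A_END = "GCCAAGGTCTGCTCAATGAT"
FRAGMENT_1B_END = "GGATGTGCTCTGGATCTTCTGGATCT"

for name, primer, end in (("1A", PRIMER_1A_3, FRAGMENT_1A_END),
                          ("1B", PRIMER_1B_3, FRAGMENT_1B_END)):
    print(name, len(primer), "nt, matches fragment end:",
          reverse_complement(primer) == end)
```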

Expressed peptide:

Initially there is 1 methionine.

Next is 1 partly-randomized amino acid GNS (one of Val, Ala, Asp, Glu, Gly), to have G as the first nucleotide after the initiation codon (Microb. Rev. 47, 1-45, (1983)).

First random region is 28 amino acids long.

Central constant region is 2 amino acids long: GCC (Ala), AAG (Lys).

Next is 1 partly-randomized amino acid GNS (one of Val,Ala, Asp, Glu, Gly).

Second random region is 28 amino acids long.

1 methionine for detection.

1 cysteine for purification.

Terminal constant region is 6 amino acids long (SerGlySerSerGlySer), a spacer from the nucleotides.

Expressed peptide is 69 amino acids long in total.

ATG GNS NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS GCC AAG GNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG ATG TGC TCT GGA TCT TCT GGA TCT

DNA Pool

TAA TAC GAC TCA CTA TAG GGA CAA TTA CTA TTT ACA ATT ACA
ATG GNS NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS GCC AAG GNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG NNC VNS NNC VNS NNC VNS NNC NGG ATG TGC TCT GGA TCT TCT GGA TCT

V = A, G, C; S = C, G; N = A, T, G, C

Appendix 5

International Union of Biochemistry (IUB) codes for mixtures of bases.

M = AC

R = AG

W = AT

S = CG

Y = CT

K = GT

V = AGC

H = ACT

D = AGT

B = CGT

N = ACGT
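When designing a mixture it is often convenient to go from a set of bases back to its single-letter code. The sketch below stores the list above as a dictionary and performs that reverse lookup; `IUB_CODES` and `code_for` are illustrative names.

```python
# Mixture codes from the list above (DNA letters).
IUB_CODES = {"M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
             "V": "AGC", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT"}


def code_for(bases):
    """Return the single-letter code for a set of bases, e.g. 'GC' -> 'S'.

    The order of the bases does not matter; a single base is returned unchanged.
    """
    wanted = frozenset(bases.upper())
    if len(wanted) == 1:
        return next(iter(wanted))
    for code, members in IUB_CODES.items():
        if frozenset(members) == wanted:
            return code
    raise ValueError("no code for %r" % bases)


for bases in ("GC", "ACG", "TC", "TGCA"):
    print(bases, "->", code_for(bases))
```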
 
Appendix 6 

A comparison of codon usage among four different species

Zhang, S., Zubay, G. and Goldman, E., Gene 105, 61-72 (1991).