Field of the invention
[0001] The present invention relates to a method for producing a polypeptide in a host cell,
wherein the nucleotide sequences encoding the polypeptide have been modified with
respect to their codon-usage, in particular the codon-pairs that are used, to obtain
improved expression of the nucleotide sequence encoding the polypeptide and/or improved
production of the polypeptide.
Background of the art
[0002] The present invention relates to improved methods for producing polypeptides. Numerous
approaches have been applied in generating strains for protein over-expression and/or
production. This includes, but is not limited to, making strains with multi-copies
of the gene encoding the protein of interest (POI) and applying strong promoter sequences.
[0004] More recently, in
WO 03/85114 a harmonization of codon use was described which takes into account the distribution
of all codons in genes of the host organism, assuming that these affect protein folding.
[0005] The availability of fully sequenced genomes of many organisms in recent years,
e.g. Bacillus subtilis (Kunst et al. 1997),
Bacillus amyloliquefaciens, Aspergillus niger (
Pel et al., 2007, Nat Biotech. 25: 221-231),
Kluyveromyces lactis, Saccharomyces cerevisiae (
http://www.yeastgenome.org/), various plant genomes, mouse, rat and human, has offered the possibility of analyzing
different aspects of the gene sequences themselves in relation to their natural expression
level (mRNA or protein level). A good example is codon usage (bias) analysis, and
subsequent single-codon optimization. Note that single-codon optimization is herein
understood to refer to codon optimization or codon harmonization techniques that focus
on the optimization of codons as single independent entities, in contrast to codon-pair
optimization, which is the topic of the current invention.
[0008] Gutman and Hatfield (1989, Proc. Natl. Acad. Sci USA 86:3699-3703) analyzed a larger set of sequences for all possible codon pairs for
E. coli and found that codon pairs are directionally biased. In addition, they observed that
highly underrepresented pairs are used almost twice as frequently as overrepresented
ones in highly expressed genes, whereas in poorly expressed genes overrepresented
pairs are used more frequently.
US 5,082,767 (Hatfield and Gutman, 1992) discloses a method for determining relative native codon pairing preferences in
an organism and altering codon pairing of a gene of interest in accordance with said
codon pairing preferences to change the translational kinetics of said gene in a predetermined
manner, with examples for
E. coli and
S. cerevisiae. However, in their method, Hatfield and Gutman only optimize individual pairs of adjacent
codons. Moreover, in their patent (
US 5,082,767), it is claimed to increase translational kinetics of at least a portion of a gene
by a modified sequence in which codon pairing is altered to increase the number of
codon pairs that, in comparison to random codon pair usage, are the more abundant
and yet more under-represented codon pairs in an organism. The present invention discloses
a method to increase translation by a modified sequence in which codon pairing is
altered to increase the number of codon pairs that, in comparison to random codon
pair usage, are the more over-represented codon pairs in an organism.
[0009] Moura et al. (2005, Genome Biology, 6:R28) analyzed the entire
S. cerevisiae ORFeome but did not find a statistically significant bias for about 47% of the codon
pairs. The respective values differed from one species to another, resulting in "codon
context maps" that can be regarded as "species-specific fingerprints" of the codon
pair usage.
[0010] Boycheva et al. (2003, Bioinformatics 19(8):987-998) identified two sets of codon pairs in
E. coli referred to as hypothetically attenuating and hypothetically non-attenuating by looking
for over- and under-represented codon pairs among genes with high and poor expression.
However, they neither propose a method to apply this finding nor give any experimental
proof for their hypothesis. Note that these groups are defined completely opposite
to the ones defined by Gutman and Hatfield (1989, 1992,
supra), who proposed a non-attenuating effect for highly underrepresented pairs in highly
expressed genes.
[0012] As for the implications of biases in codon pair utilization,
Irwin et al. (1995, J. Biol. Chem. 270:22801-22806) demonstrated in
E. coli that the rate of synthesis actually decreased substantially when replacing a highly
underrepresented codon pair by a highly overrepresented one and increased when exchanging
a slightly underrepresented codon pair for a more highly underrepresented. This is
quite remarkable as it is rather the opposite of what one would expect given the influence
of single codon bias on protein levels.
[0013] However, none of the above-cited art discloses how to optimize the codon-pair usage
of a full-length codon sequence taking account of the fact that by definition codon
pairs overlap and that therefore optimization of each individual codon pair affects
the bias of the overlapping up- and downstream codon pairs. Moreover, none of the
cited art discloses a method that combines optimization of both single codons as well
as codon pairs. Codon pair optimization taking into account said codon pair overlapping
and optional combination of said codon-pair optimization with single-codon optimization
would greatly improve expression of the nucleotide sequence encoding the polypeptide
of interest and/or improve production of said polypeptide.
[0014] There is thus still a need in the art for novel methods for optimization of coding
sequences for improving the production of a polypeptide in a host cell.
Summary of the invention
[0015] An object of the present invention is to provide a method for optimizing the coding
sequence for efficient gene transcription and protein translation. To that effect,
the invention provides a method of optimization of a nucleotide sequence encoding
a predetermined amino acid sequence, whereby the coding sequence is optimized for
expression in a predetermined host cell, the method comprising: (a) generating at
least one original coding sequence that codes for the predetermined amino acid sequence;
(b) generating at least one newly generated coding sequence from this at least one
original coding sequence by replacing in this at least one original coding sequence
one or more codons by a synonymous codon; (c) determining a fitness value of said
at least one original coding sequence and a fitness value of said at least one newly
generated coding sequence while using a fitness function that determines at least
one of single codon fitness and codon pair fitness for the predetermined host cell;
(d) choosing one or more selected coding sequence amongst said at least one original
coding sequence and said at least one newly generated coding sequence in accordance
with a predetermined selection criterion such that the higher is said fitness value,
the higher is a chance of being chosen; and (e) repeating actions b) through d) while
treating said one or more selected coding sequence as one or more original coding
sequence in actions b) through d) until a predetermined iteration stop criterion is
fulfilled.
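The iterative scheme of actions (a) through (e) can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the fitness function `toy_fitness`, the rank-weighted selection rule, the pool size and the fixed iteration count are all assumptions chosen only to make the loop concrete, and the standard genetic code is assumed for the host.

```python
import random
from itertools import product

# Standard genetic code (assumed for the host), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(codons):
    """Translate a list of codons into an amino acid string."""
    return "".join(CODON_TO_AA[c] for c in codons)


def toy_fitness(codons):
    """Hypothetical stand-in fitness: fraction of codons ending in G or C."""
    return sum(c[2] in "GC" for c in codons) / len(codons)


def optimize(protein, fitness=toy_fitness, pool_size=20, iterations=200, seed=0):
    rng = random.Random(seed)
    # (a) original coding sequences: random back-translations of the protein
    pool = [[rng.choice(SYNONYMS[aa]) for aa in protein] for _ in range(pool_size)]
    for _ in range(iterations):  # (e) stop criterion: fixed number of iterations
        # (b) new sequences: replace one codon by a synonymous codon
        children = []
        for seq in pool:
            child = list(seq)
            pos = rng.randrange(len(child))
            child[pos] = rng.choice(SYNONYMS[protein[pos]])
            children.append(child)
        # (c) determine fitness of original and newly generated sequences
        candidates = sorted(pool + children, key=fitness)
        # (d) selection: the higher the fitness rank, the higher the chance
        weights = list(range(1, len(candidates) + 1))
        pool = rng.choices(candidates, weights=weights, k=pool_size)
    return max(pool, key=fitness)


best = optimize("MKTAYIAKQR")
```

Because every replacement is synonymous, the returned sequence always encodes the predetermined amino acid sequence; only the fitness function needs to change to target single codon fitness, codon pair fitness, or both.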
[0016] In embodiments, the invention addresses aspects like single codon usage, codon harmonization,
dinucleotide usage and, related to that, codon-pair bias. The method can be performed
by a computer program running on a computer that uses a mathematical algorithm for
sequence analysis and sequence optimization that may be implemented in MATLAB (
http://www.mathworks.com/).
[0017] In addition to positive codon optimization (e.g. for modulation of gene expression
and protein production in a positive way), the invention also provides a method for
adapting codons towards "bad" codon pairs (i.e. negative codon-pair optimization).
The latter method is useful for control purposes as well as for modulating gene expression
in a negative way.
Brief description of the drawings
[0018] It is observed that the present invention will be illustrated with reference to several
figures which are only intended to illustrate the invention and not to limit its scope
which is defined by the annexed claims and its equivalents.
Figure 1 shows a computer arrangement on which the method of the invention can be
performed.
Figure 2 shows a flow chart of an embodiment of the invention.
Figure 3 shows a distribution of codon pair bias values for 3,721 sense:sense codon
pairs in different organisms. The numbers in the top right corner of each histogram
are the standard deviations for the observed distribution; the mean values (not shown)
are between -0.06 and -0.01 for all organisms.
Figure 4 shows the correlation in codon pair bias of various organisms. The correlation
coefficient is shown in the top right corner of each subplot.
Figure 5 shows a codon bias map for A. niger. The bias values range from -0.67 to 0.54, where in other organisms they might even
get slightly above ±0.9 (see also Figure 3). The highest intensities of black in
these diagrams represent values of 0.9 (Figures 5A and 5C for the positive values,
green in the original) and -0.9 (Figures 5B and 5D for the negative values, red in
the original). In Figures 5A and 5B the rows and columns are sorted according to the
alphabetical order of the codons. In Figures 5C and 5D the rows are sorted according
to the alphabetical order of the third position nucleotide as first sorting criterion
and the middle position nucleotide as second sorting criterion, and first position
nucleotide as third sorting criterion.
Figure 6 shows a codon bias map for B. subtilis. The bias values range from -0.97 to 0.87, where in other organisms they might even
get slightly above ±0.9 (see also Figure 3). The highest intensities of black in
these diagrams represent values of 0.9 (Figure 6A for the positive values, green in
the original) and -0.9 (Figure 6B for the negative values, red in the original).
Figure 7 shows a codon bias map for E. coli. The bias values range from -0.97 to 0.85, where in other organisms they might even
get slightly above ±0.9 (see also Figure 3). The highest intensities of black in
these diagrams represent values of 0.9 (Figure 7A for the positive values, green in
the original) and -0.9 (Figure 7B for the negative values, red in the original).
Figure 8 shows a codon bias map for 479 highly transcribed genes of A. niger, analogous to the previous Figures 5-7. The highest intensities of black in these
diagrams represent values of 0.9 (Figure 8A for the positive values, green in the
original) and -0.9 (Figure 8B for the negative values, red in the original). The maximum
bias value in this group is 0.91; the minimum is -1, i.e. some possible codon pairs
do not occur at all, although their individual codons and the encoded amino acid pair
do. This might be a result of the smaller size of 188,067 codon pairs, compared to
5,885,942 in the full genome. However, the main reason will be the real underrepresentation
of such pairs due to selection in highly expressed genes.
Figure 9 shows a scatter plot of bias in a group of 479 highly expressed genes (vertical
axis) versus the bias in all genes (horizontal) of A. niger. All 3,721 codon pairs not involving stop codons are shown. Colours from light grey
to black were assigned according to the absolute values of the z-scores in the overall
genome (i.e. light dots in the plot do not have a significant bias in all genes),
as were sizes according to the absolute z-scores in the highly expressed group (i.e.
very small dots do not have a significant bias there; here |z-score|<1.9). The solid
black line indicates where both bias values are equal; the dashed line shows the best
linear approximation of the actual correlation (identified by principal component
analysis); its slope is around 2.1.
Figure 10 shows fitness values of the 4,584 A. niger genes compared to the logarithm of their transcription levels. The correlation coefficient
is -0.62.
Figure 11 shows single codon vs. codon pair optimization. The wild type (fitsc(gFUA)=0.165, fitcp(gFUA)=0.033) does not fit on this plot (it would be far to the right and above). It is
clear that the cpi parameter determines a trade-off between single codon and codon pair fitness. The
optimal gene is always the one with the lowest values for fitsc and fitcp. Given the position of the dots, it is therefore not clear for which value of cpi the best gene could be obtained, since we do not know yet whether single codon usage
or codon pair usage is more important. However, the examples provide strong evidence
that codon pair fitness is very important in addition to single-codon fitness, which
means that cpi should be chosen at least >0.
Figure 12 shows two diagrams that show the sequence quality of the first 20 (out of
499) codons of the aforementioned FUA (see also Example 2). The black dots indicate
the desired codon ratios, whereas the x-marks show the actual ones (in the whole gene),
connected via a dashed line. Single codon fitness can then be interpreted as the average
of the lengths of these dashed lines (note that for codons where desired and actual
ratio are equal, as for example TGG (which has no synonymous codons) on position 4
and 5, this "length" is zero; note also that "length" can never be negative). The
black bars, in turn, show the weights of the pair formed by the two adjacent codons.
The black dots (in the middle, below the bars) indicate the minimum weight of any
codon pair that encodes the same dipeptide. The codon pair fitness is then the average
height of these bars (note that height as used here can well be negative).
Figure 13 depicts the convergence of fitcombi using the described genetic algorithm approach of the invention for optimization
of the amyB gene that results in SEQ ID NO. 6.
Figure 14 depicts, for reasons of explanation, a part of a single-codon distribution
diagram, like the one shown, for example, in Figure 15. The two graphs indicate the single-codon
usage for the two synonymous codons that code for phenylalanine: UUU (top) and UUC
(bottom). The X-axis and Y-axis of both graphs go from 0% to 100%. The grey histogram
is a codon-usage histogram, normalized for each amino acid (group of synonymous codons),
for a group of 250 highly expressed A. niger genes, where the genes are binned in groups having 0%, >0 - <10%, 10 - <20%, ...
, 90 - <100%, 100%. For example, 50% of the highly-expressed genes fall in the group
with 0% usage of the UUU codon, and consequently 100% usage of the UUC codon for coding
phenylalanine. The white bar gives the codon-usage of gene A (WT amyB in this case)
in similar bins as for the histogram; thus 100% in bin 20-<30% (20% with 3/15 codons
being UUU) for gene A, and consequently 100% in bin 80-<90% (80% with 12/15 being
UUC). The black bar gives the statistics for gene B (the single-codon optimized variant
for amyB in this case). In a similar way, one can create a matrix of 16 times 4 graphs,
showing statistics for all 64 codons, see for example Figure 15.
Figure 15 (parts 1 and 2) depicts the single-codon frequency for the single-codon
optimized amyB gene (black) versus the wild-type amyB gene (white). The grey histogram
depicts the statistics for 250 highly-expressed genes in A. niger. It is clear that certain codons, like the one for cysteine (UGU/UGC), histidine (CAU/CAC),
tyrosine (UAU/UAC) and others were subject to real improvements.
Figure 16 (parts 1 and 2) depicts the single-codon frequency for an amyB gene that
has been optimized with respect to both single-codon and codon-pairs (black) versus
the wild-type amyB gene (white). The grey histogram depicts the statistics for 250
highly-expressed genes in A. niger. It is clear that these graphs highly resemble the situation for the single-codon
optimized gene depicted in Figure 15.
Figure 17 depicts a part of the full diagram (Figure 18) with single-codon and codon
pair statistics for the WT amyB gene of A. niger. On the X-axis, one finds the subsequent codons in a gene starting at position 1 with
the start-codon ATG. The black dot '.' indicates the target single-codon ratio for
the codon at this position with respect to its synonymous codons. For ATG this is
1.0 (100%). The cross 'x' is the actual codon ratio in the shown gene; a dotted line
shows the difference between the target ratio and the actual ratio. The codon-pair
weight is a value between -1 and 1. The bar indicates the actual codon-pair weight
of the adjacent codons, while the pentagram indicates the weight of the optimal achievable
synonymous codon-pair (not taking into account the neighboring pairs). For example
the first bar is -0.23 which is the weight for 'ATG-GTC', second is 0.66 being the
weight for 'GTC -GCG'.
Figure 18 depicts the single codon and codon pair statistics for SEQ ID NO. 2 (WT
AmyB).
Figure 19 depicts the single codon and codon pair statistics for SEQ ID NO. 5 (single
codon-optimized AmyB).
Figure 20 depicts the single codon and codon pair statistics for SEQ ID NO. 6 (single
codon and codon pair optimized WT AmyB).
Figure 21 depicts a plasmid map of expression vector pGBFINFUA-1. Figure 21 also provides
a representative map for plasmid pGBFINFUA-2 and pGBFINFUA-3. All clones originate
from the pGBFIN-12 (described in WO99/32617) expression vector. Indicated are the glaA flanking regions relative to the variant sequences of the amyB promoter and the A. niger amyB cDNA sequence encoding alpha-amylase. The E. coli DNA can be removed by digestion with restriction enzyme NotI, prior to transformation of the A. niger strains.
Figure 22 depicts a schematic representation of integration through single homologous
recombination. The expression vector comprises the selectable amdS marker, and the
glaA promoter connected to the amyB gene. These features are flanked by homologous regions of the glaA locus (3' glaA and 3" glaA, respectively) to direct integration at the genomic glaA locus.
Figure 23 depicts alpha-amylase activity in culture broth for A. niger strains expressing three different constructs. Depicted is the alpha-amylase activity
in culture broth of A. niger strains expressing a native amyB construct, wherein (1) the translation initiation sequence and the translation termination
sequence were modified (pGBFINFUA-1), and (2) the translation initiation sequence,
the translation termination sequence and the single-codon usage were modified (pGBFINFUA-2),
and (3) the translation initiation sequence, the translation termination sequence
and the single-codon usage and codon-pair usage were modified (pGBFINFUA-3) according
to a method of the invention. Alpha-amylase activities are depicted in relative units
[AU], with the average of the 6 one-copy strains of the FUA1 group of 10 strains at
day 4 set at 100%. The ten transformants per group indicated are independently isolated
and cultivated transformants.
Figure 24 (A and B) depicts the single-codon frequency for the single-codon optimization
for Bacillus species. An explanation of the sub-graphs is given by Figure 14. The grey histogram presents
the codon distribution for the 50 highest expressed genes in B. subtilis, see text. The black bars indicate the target single-codon frequency.
Figure 25 depicts the single codon and codon pair statistics for SEQ ID NO. 14 (1/3),
SEQ ID NO. 17 (2/3) and SEQ ID NO. 14 (3/3), the sequences optimized using codon pair
+ single codon (1/3), single-codon (2/3), and negative codon-pair + single codon optimization
(3/3), respectively. See Figure 17 for an explanation of the graph.
Figure 26. E.coli/Bacillus shuttle vector pBHA-12. The multiple cloning sites (MCS) 1 and 2 are depicted.
Figure 27. An example of cloning of a gene in the E.coli/Bacillus shuttle vector pBHA-12. The Figure shows the cloned part A and B (grey arrows) of
the SEQ ID NO. 9. The cloning sites of the part 1A are depicted: NdeI and BamHI, for the part 1B SmaI and KpnI. The E. coli part was excised using PvuII.
Detailed description of the invention
[0019] In addition to single codon bias, other structures in the nucleotide sequence are
likely to influence protein expression as well, e.g. dinucleotides or repeats of certain
short nucleotide sequences (codon usage after all can be interpreted as a pattern
in tri-nucleotide sequences in line with the reading frame). This work presents a
method for identifying a preference for certain codon pairs, i.e. whether codons appear
in the gene as if they were selected according to the identified codon usage ratios,
but then distributed randomly in the gene (with respect to the amino acid sequence),
or whether some codons appear more often next to certain codons and less often next
to others.
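As an illustration of the idea in paragraph [0019], the following Python sketch compares observed codon pair counts with the counts expected if codons were chosen according to their single codon ratios and then distributed randomly with respect to the amino acid sequence. The formula bias = (observed - expected) / expected is an assumption made for this sketch; it is chosen because it yields -1 for a pair that never occurs, in line with the description of Figure 8. The function name `codon_pair_bias` and the toy input are hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumed), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(orf):
    """Split an in-frame nucleotide string into codons, dropping any remainder."""
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def codon_pair_bias(orfs):
    """Return {(codon1, codon2): bias} for every codon pair observed in orfs."""
    single, pairs = Counter(), Counter()
    for orf in orfs:
        codons = split_codons(orf)
        single.update(codons)
        pairs.update(zip(codons, codons[1:]))

    # Single codon ratio: share of a codon among its synonymous codons.
    aa_total = Counter()
    for codon, n in single.items():
        aa_total[CODON_TO_AA[codon]] += n
    ratio = {c: n / aa_total[CODON_TO_AA[c]] for c, n in single.items()}

    # Number of occurrences of each encoded amino acid pair.
    aa_pairs = Counter()
    for (c1, c2), n in pairs.items():
        aa_pairs[(CODON_TO_AA[c1], CODON_TO_AA[c2])] += n

    bias = {}
    for (c1, c2), observed in pairs.items():
        expected = ratio[c1] * ratio[c2] * aa_pairs[(CODON_TO_AA[c1], CODON_TO_AA[c2])]
        bias[(c1, c2)] = (observed - expected) / expected
    return bias


# Toy input: Ala-Ala-Ala-Ala encoded as GCT GCT GCC GCC.
example = codon_pair_bias(["GCTGCTGCCGCC"])
```

In the toy input each of the three observed pairs occurs once while 0.75 occurrences are expected, so each has a positive bias; the pair GCC-GCT is never observed and is therefore absent from the result (its bias under this formula would be -1).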

[0020] An analysis of codon pairs also covers other aspects, namely dinucleotide usage around
the reading frame borders and a possible preference for certain single nucleotides
next to a codon. The present invention discloses methods for generating a codon-pair
bias table for a given host organism whereby either all identified ORFs of sequenced
full genomes are used as input or selected groups of genes, e.g. highly expressed
genes. The present invention discloses a method wherein a codon-pair bias table thus
identified is subsequently applied for optimization of codon-pair distribution in
a gene of interest (GOI) for improving the expression of the corresponding protein
of interest (POI).
[0021] Single codon optimization offers a good starting point for improving expression levels
of proteins of interest. Whereas others tried to overcome drawbacks resulting from
the presence of rejected codons in the gene of interest by adaptation of the host
organism, inserting additional copies of tRNA genes for tRNAs with low abundance (
e.g. Stratagene BL-21 CodonPlus
™ competent cells, Novagen Rosetta
™ host strains, both
E. coli), the present inventors have focused on the adaptation of the genes of interest themselves.
Unwanted codons in a genetic sequence have been replaced by synonymous ones so that
the single codon distribution of the resulting sequence was as close as possible to
previously identified desired codon ratios.
[0022] This codon harmonization, however, still has a very large number of possible genes
that are equally "optimal" since the overall codon distribution in an optimized gene
is the selection criterion, so further desired properties of the codon sequence can
easily be taken into account, for example the absence of certain enzyme's restriction
sites or codon pairs known to cause frameshifts. One step further, one could optimize
codon pair usage to a limited extent. But when optimizing codon pairs of a gene, e.g.
towards the usage of the most abundant ones, the single codon usage of the resulting
sequence might not be close to the optimum, since there might be preferred codon pairs
consisting of underrepresented single codons, so a balance between single codon and
codon pair optimization must be found. The present invention discloses methods that
allow balancing both single codon and codon pair optimization. Codon pair optimization
taking into account codon pair overlapping and optional combination of said codon-pair
optimization with single-codon optimization greatly improve expression of the nucleotide
sequence encoding the polypeptide of interest and/or improve production of said polypeptide.
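To make the two quantities being balanced concrete, the sketch below follows the verbal description given for Figure 12: single codon fitness as the average distance between desired and actual codon ratios over all positions, and codon pair fitness as the average weight of all adjacent codon pairs, with lower values being better for both. The function names, the desired ratios and the pair weights are hypothetical illustration values, not data from the invention, and the sketch deliberately does not show how the two values are combined.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumed), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def actual_ratios(codons):
    """Share of each codon among its synonymous codons within this gene."""
    counts = Counter(codons)
    aa_counts = Counter(CODON_TO_AA[c] for c in codons)
    return {c: n / aa_counts[CODON_TO_AA[c]] for c, n in counts.items()}


def fit_sc(codons, desired):
    """Average |desired ratio - actual ratio| over all codon positions."""
    actual = actual_ratios(codons)
    return sum(abs(desired[c] - actual[c]) for c in codons) / len(codons)


def fit_cp(codons, weights):
    """Average weight of all adjacent codon pairs (may be negative)."""
    pairs = list(zip(codons, codons[1:]))
    return sum(weights[p] for p in pairs) / len(pairs)


# Hypothetical toy gene, desired ratios and codon pair weights.
gene = ["ATG", "TTT", "TTC", "TTC"]
desired = {"ATG": 1.0, "TTT": 0.2, "TTC": 0.8}
weights = {("ATG", "TTT"): -0.2, ("TTT", "TTC"): 0.4, ("TTC", "TTC"): -0.5}

sc = fit_sc(gene, desired)   # 0.1 for this toy gene
cp = fit_cp(gene, weights)   # -0.1 for this toy gene
```

Replacing a codon changes both values at once: it shifts the actual ratios of its synonymous group and alters up to two adjacent pair weights, which is why the two criteria cannot be optimized independently.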
[0023] In the context of this invention, a nucleotide coding sequence or coding sequence
is defined as a nucleotide sequence encoding a polypeptide. The boundaries of the
coding sequence are generally determined by the start codon (usually ATG in eukaryotes,
while it can be one of ATG, CTG, GTG, TTG in prokaryotes) located at the beginning
of the open reading frame at the 5' end of the mRNA and a stop codon (generally one
of TAA, TGA, TAG, although exceptions to this 'universal' coding exist) located just
downstream of the open reading frame at the 3' end of the mRNA. A coding sequence
can include, but is not limited to, DNA, cDNA, RNA, and recombinant nucleic acid (DNA,
cDNA, RNA) sequences (note that it is well known in the art that Uracil, U, replaces
the deoxynucleotide Thymine, T, in RNA). If the coding sequence is intended for expression
in a eukaryotic cell, a polyadenylation signal and transcription termination sequence
will usually be located 3' to the coding sequence. A coding sequence comprises a translational
initiator coding sequence, and optionally a signal sequence, and optionally one or
more intron sequences. Even though the terms "coding sequence" and "gene" strictly
do not refer to the same entity, both terms are frequently used interchangeably herein
and the skilled person will understand from the context whether the term refers to
a full gene or only its coding sequence.
Method and computer arrangement for single codon and/or codon pair adaptation
[0024] As for the single codon usage properties of highly expressed genes, a "manual" comparison
of single codon ratios in all genes and a group of highly expressed ones has led
to some "desired codon ratios" for the improvement of genes with respect to their
expression level.
[0025] Single codon adaptation of a gene can then be performed by: (1) calculating the actual
ratios in the gene, repeatedly picking a codon (e.g. randomly) whose desired ratio
is lower than the actual one and replacing it by a synonymous one whose actual ratio is too low;
(2) calculating the desired number of each codon using the "desired codon ratios",
making groups of synonymous codons, and repeatedly picking a codon (e.g. randomly)
from a synonymous group coding for the pre-specified amino acid, for each position
in the gene; or (3) making multiple variants using method (1) and/or (2) and, based on additional
selection criteria, picking the most relevant gene (e.g. wanted and unwanted restriction
sites and/or folding energy).
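Method (1) can be sketched in Python as follows. This is an illustrative reading of the paragraph above, not the inventors' code: the function name `adapt_single_codons`, the treatment of codons missing from `desired` as having a desired ratio of 0, and the half-step acceptance rule (a swap is made only when both codons deviate from their desired ratios by more than half the effect of one swap, which guarantees termination) are all assumptions. The standard genetic code is assumed.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code (assumed), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def adapt_single_codons(codons, desired, seed=0):
    """Swap over-used codons for under-used synonymous ones until none remain."""
    rng = random.Random(seed)
    codons = list(codons)
    # Amino acid counts never change under synonymous replacement.
    aa_counts = Counter(CODON_TO_AA[c] for c in codons)
    while True:
        counts = Counter(codons)

        def excess(codon):
            # Actual ratio minus desired ratio (positive means over-used).
            actual = counts[codon] / aa_counts[CODON_TO_AA[codon]]
            return actual - desired.get(codon, 0.0)

        moves = []
        for pos, codon in enumerate(codons):
            # One swap changes a ratio by 1/n; require more than half of that.
            half_step = 0.5 / aa_counts[CODON_TO_AA[codon]]
            if excess(codon) > half_step:
                for alt in SYNONYMS[CODON_TO_AA[codon]]:
                    if excess(alt) < -half_step:
                        moves.append((pos, alt))
        if not moves:
            return codons
        pos, alt = rng.choice(moves)  # pick a replacement at random
        codons[pos] = alt


# Toy gene: ten phenylalanine codons, 80% TTT, with a desired ratio of 20% TTT.
wild_type = ["TTT"] * 8 + ["TTC"] * 2
adapted = adapt_single_codons(wild_type, {"TTT": 0.2, "TTC": 0.8})
```

For the toy gene the procedure stops once two TTT and eight TTC codons remain. Which positions carry the TTT codons depends on the random choices, so several equally "optimal" variants exist and additional selection criteria can be applied among them.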
[0026] Yet this approach is not suitable for codon pair adaptation, firstly because visual
inspection of bias data for all codon pairs is out of the question in view of the
complexity and secondly because altering of one codon pair, which means replacing
at least one of the two participating codons, will also affect at least one of the
adjacent codon pairs, so "desired codon pair ratios" would be unachievable. Because
of the constraints implied by this, a deterministic approach was considered too complex
and not promising enough and a "genetic algorithm" approach was then chosen.
[0027] It is observed that the term "genetic algorithm" may be confusing in the sense that
it seems to relate to genetic engineering. However, a "genetic algorithm" is an approach
from computer science that is used to approximate solutions to multidimensional optimization
problems (
Michalewicz, Z., Genetic Algorithms + Data Structure = Evolution Programs, Springer
Verlag 1994;
David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, Reading MA, 1989; http://en.wikipedia.org/wiki/Genetic_algorithm). In the present invention, this
approach is used in solving the optimization problem of selecting the "best" possible
gene, i.e. coding sequence for a particular protein of interest. In this approach,
each position in the gene, i.e. each codon can be considered one dimension, with the
set of values being discrete and determined by the available synonymous codons.
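The size of this discrete search space can be estimated with a short Python sketch: each position contributes as many choices as its amino acid has synonymous codons, so the number of candidate coding sequences is the product over all positions. The function name `search_space_size` is hypothetical and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code (assumed), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons per amino acid (e.g. 1 for M, 2 for F, 6 for L).
N_SYNONYMS = Counter(CODON_TO_AA.values())


def search_space_size(protein):
    """Number of distinct coding sequences that encode the given protein."""
    return prod(N_SYNONYMS[aa] for aa in protein)


small = search_space_size("MFL")  # 1 * 2 * 6 = 12
```

Since the product grows exponentially with protein length, exhaustive enumeration quickly becomes impractical for a gene of several hundred codons, which motivates an approximate search such as a genetic algorithm.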
[0028] Generally, in a genetic algorithm, at first a set of possible "solutions" to the
problem is often generated randomly, or by variations on initial provided solutions
(although many other approaches exist). This set is called "population"; its
elements are "individuals" or "chromosomes", mostly represented by vectors (in the
mathematical sense) containing coordinates for each dimension. Since genetic algorithms
were modeled after processes involved in natural selection, much of the terminology
is borrowed from genetics. Genetic algorithms are (unlike in this case) mostly applied
in the field of computer science, but some examples of the application of genetic
algorithms to problems in biological science have also been presented, e.g. for protein secondary
structure prediction (
Armano et al. 2005 BMC Bioinformatics 1(6) Suppl. 4:S3);
in silico metabolic network optimization (
Patil et al. 2005 BMC Bioinformatics. 23(6):308); clustering gene expression data (
Di Gesu et al. 2005 BMC Bioinformatics.7(6):289).
[0029] In the present case, a vector contains codons. From that population, new individuals
are created by altering certain positions of an existing individual ("mutation") or
by combining a part (i.e. certain coordinates) of an individual with another part
(i.e. the coordinates for the other dimensions) from another individual ("crossover").
It is then examined how good these individuals are (since the new ones are also possible
solutions to the initial optimization problem) and the better ("fittest") of the individuals
are taken again as initial population for generating new individuals ("next generation";
e.g., the best 10, 20, 30, 40, 50, 60% are kept, but many other possibilities exist
for selecting a subset for offspring to obtain a convergence toward fitter individuals,
e.g. roulette wheel selection, see Michalewicz, Z, 1994). When allowing the best individual
from the initial population to be taken over to the next generation, it is ensured
that with every population the quality of the possible solutions gets better or at
least stays the same. It is then assumed that with a run of this algorithm for many
generations (= iterations; some hundred to several thousand, depending on the complexity
of the problem) one will get a solution close to the optimum. Genetic algorithms have
been investigated closely in computer science, including properties such as optimal
proportion of population size and number of generations, how to prevent the algorithm
from getting stuck in local optima etc., but this should not matter much here now.
For information on how to set these parameters for the actual optimization procedure,
see the description of the genetic algorithm implemented in MATLAB in Example 2.
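The operators described in this paragraph (mutation, crossover, and keeping the best individuals) can be sketched in Python as below. This is a minimal sketch and not the MATLAB implementation of Example 2: the function names, the toy fitness `gc3`, the 50% survivor fraction, the equal chance of crossover and mutation, and the population size are assumptions made for illustration. Proteins of at least two amino acids and the standard genetic code are assumed.

```python
import random
from itertools import product

# Standard genetic code (assumed), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def mutate(seq, rng):
    """Mutation: replace the codon at one random position by a synonym."""
    child = list(seq)
    pos = rng.randrange(len(child))
    child[pos] = rng.choice(SYNONYMS[CODON_TO_AA[child[pos]]])
    return child


def crossover(a, b, rng):
    """Crossover: head of one individual joined to the tail of another."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def next_generation(population, fitness, rng, keep=0.5):
    """Keep the fittest fraction unchanged and refill with offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(1, int(len(ranked) * keep))]
    offspring = []
    while len(survivors) + len(offspring) < len(population):
        if len(survivors) > 1 and rng.random() < 0.5:
            parent_a, parent_b = rng.sample(survivors, 2)
            offspring.append(crossover(parent_a, parent_b, rng))
        else:
            offspring.append(mutate(rng.choice(survivors), rng))
    return survivors + offspring


def gc3(seq):
    """Hypothetical toy fitness: fraction of codons ending in G or C."""
    return sum(c[2] in "GC" for c in seq) / len(seq)


rng = random.Random(1)
protein = "MKTAYIAKQR"
population = [[rng.choice(SYNONYMS[aa]) for aa in protein] for _ in range(10)]
history = []
for _ in range(30):
    population = next_generation(population, gc3, rng)
    history.append(max(gc3(seq) for seq in population))
```

Because the survivors are carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next, and because all individuals encode the same protein, both operators preserve the amino acid sequence.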
[0030] This will be explained in detail with reference to Figure 2. Figure 2 shows a flow
chart of a genetic algorithm for gene optimization. Such a genetic algorithm can be
performed on a suitably programmed computer, an example of which is shown in, and
will first be explained with reference to, Figure 1. Figure 1 shows an overview of a computer
arrangement that can be used to carry out the method according to the invention. The
arrangement comprises a processor 1 for carrying out arithmetic operations.
[0031] Note that genetic algorithms are generally non-deterministic as they involve randomized
steps (e.g. randomized selection criteria and/or randomized operator choice and/or
randomized generation of potential solutions); however, exceptions exist that perform
in a deterministic way. "Genetic algorithms" is a generic term for those algorithms
that deal with a group (called population) of potential solutions, which is driven
toward an optimal solution by screening and/or selection and/or removal and/or
(re)introduction of (newly) generated solutions, using one or multiple objectives. Considering
this definition, also methods described as evolutionary programming, evolutionary
algorithms, classic genetic algorithms, real-coded genetic algorithms, simulated annealing,
ant algorithms, and also Monte-Carlo and chemotaxis methods, belong to a similar class
of algorithms, as opposed to methods that are based on the convergence of a single potential
solution toward an optimal solution using a deterministic algorithm, like linear
programming and gradient algorithms. Furthermore, a skilled person will understand
from the context whether another original term refers to the same class of algorithms.
Moreover, although a genetic algorithm is the preferred method, we do not exclude
any other method than genetic algorithms for solving the single-codon and/or codon-pair
optimization problem as described within this invention.
[0032] The processor 1 is connected to a plurality of memory components, including a hard
disk 5, Read Only Memory (ROM) 7, Electrically Erasable Programmable Read Only Memory
(EEPROM) 9, and Random Access Memory (RAM) 11. Not all of these memory types need
necessarily be provided. Moreover, these memory components need not be located physically
close to the processor 1 but may be located remote from the processor 1.
[0033] The processor 1 is also connected to means for inputting instructions, data etc.
by a user, like a keyboard 13, and a mouse 15. Other input means, such as a touch
screen, a track ball and/or a voice converter, known to persons skilled in the art
may be provided too.
[0034] A reading unit 17 connected to the processor 1 is provided. The reading unit 17 is
arranged to read data from and possibly write data on a data carrier like a floppy
disk 19 or a CDROM 21. Other data carriers may be tapes, DVD, memory sticks etc. as
is known to persons skilled in the art.
[0035] The processor 1 is also connected to a printer 23 for printing output data on paper,
as well as to a display 3, for instance, a monitor or LCD (Liquid Crystal Display)
screen, or any other type of display known to persons skilled in the art.
[0036] The processor 1 may be connected to a communication network 27, for instance, the
Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area
Network (WAN), etc. by means of I/O means 25. The processor 1 may be arranged to communicate
with other communication arrangements through the network 27.
[0037] The data carrier 19, 21 may comprise a computer program product in the form of data
and instructions arranged to provide the processor with the capacity to perform a
method in accordance with the invention. However, such computer program product may,
alternatively, be downloaded via the telecommunication network 27.
[0038] The processor 1 may be implemented as stand alone system, or as a plurality of parallel
operating processors each arranged to carry out subtasks of a larger computer program,
or as one or more main processors with several sub-processors. Parts of the functionality
of the invention may even be carried out by remote processors communicating with processor
1 through the network 27.
[0039] Now the genetic algorithm of Figure 2 will be explained, as may be performed on processor
1 when it runs a computer program stored in its memory.
[0040] In action 32 the computer generates one or more genes that code for a predetermined
protein. This can be done by taking data to that effect from a table stored in the
memory of the computer. Such genes may e.g. be:
➢ ATG'GTT'GCA'TGG'TGG'TCT'...
➢ ATG'GTA'GCA'TGG'TGG'TCA'...
➢ ...
[0041] For the purpose of the algorithm, these generated genes are termed "original genes".
[0042] After action 32, the computer program performs one or more iteration loops by performing
actions 34-40 one or more times.
[0043] In action 34, the computer program generates new genes by replacing one or more of
the codons in the original gene(s) by synonymous codons such that the newly generated
gene(s) still code for the predetermined protein (crossover & mutation process). To
be able to do so, the memory of the computer stores a codon usage table which shows
which codons code for which amino acids. (Note that deviations from the "universal
code" exist and are taken into account if this is the case for the specified host
organisms, see for example
Laplaza et al., 2006, Enzyme and Microbial Technology, 38:741-747). Knowing the sequence of amino acids in the protein, the computer program can select
alternative codons from the table as are well known in the art. Using the example
of action 32, the newly generated genes may be (indicated in bold):
o ATG'GTT'GCA'TGG'TGG'TCT'...
o ATG'GTA'GCA'TGG'TGG'TCA'...
➢ ATG'GTT'GCA'TGG'TGG'TCA'...
o ATG'GTA'GCA'TGG'TGG'TCA'...
➢ ATG'GTA'GCC'TGG'TGG'TCA'...
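By way of non-limiting illustration, the synonymous replacement of action 34 can be sketched in Python. The sketch assumes the standard ("universal") genetic code and represents a gene as a list of codons; the names CODON_TO_AA, SYNONYMS, translate and mutate_synonymous are hypothetical and are not taken from the implementation of Example 2.

```python
import random
from itertools import product

# Standard genetic code (an assumption; some host organisms deviate from it).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid (or stop signal "*") they encode.
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(gene):
    """Translate a list of codons into a one-letter amino acid string."""
    return "".join(CODON_TO_AA[codon] for codon in gene)


def mutate_synonymous(gene, n_changes=1, rng=random):
    """Return a copy of `gene` in which up to `n_changes` randomly chosen
    codons are replaced by a different synonymous codon."""
    new_gene = list(gene)
    for _ in range(n_changes):
        pos = rng.randrange(len(new_gene))
        current = new_gene[pos]
        alternatives = [c for c in SYNONYMS[CODON_TO_AA[current]] if c != current]
        if alternatives:  # ATG (Met) and TGG (Trp) have no synonymous codon
            new_gene[pos] = rng.choice(alternatives)
    return new_gene


original = "ATG GTT GCA TGG TGG TCT".split()
variant = mutate_synonymous(original, n_changes=2, rng=random.Random(0))
```

Because replacements are drawn only from the synonyms of the codon being replaced, the variant encodes the same protein as the original gene by construction.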
[0044] In action 36, a quality value of all genes including the original and the newly generated
genes is determined by the computer program using a fitness function which determines
at least one of codon fitness and codon pair fitness. Examples of such fitness functions
will be explained in detail below in the section "Performing codon pair optimization".
[0045] In action 38, a number of genes showing a best fitness based on the fitness function
are selected for taking part in the "breeding process" (crossover and mutation), and
a number of genes showing worst fitness based on the fitness function are selected
for removal from the population. These numbers may be predetermined numbers or depend
on a predetermined amount of improvement of fitness. The selection of those genes
might be deterministic, but generally a stochastic process is followed in which the "fittest
genes" have a higher chance of being selected for breeding, and the least fit genes a higher
chance of deletion from the population. This method is called roulette-wheel selection.
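By way of non-limiting illustration, roulette-wheel selection for breeding can be sketched as follows. The sketch assumes that fitness is to be minimized, so each gene receives a slice of the wheel proportional to its distance from the worst fitness in the population. The function name roulette_select and the way fitness values are shifted are assumptions of this sketch, not features of the described implementation.

```python
import random


def roulette_select(population, fitness, k, rng=random):
    """Draw k individuals (with replacement) so that individuals with a
    lower fitness value, i.e. better ones, are more likely to be drawn."""
    worst = max(fitness)
    # Shift the values so the best gene gets the largest slice; the small
    # constant keeps the total weight positive even if all values are equal.
    slices = [worst - f + 1e-9 for f in fitness]
    return rng.choices(population, weights=slices, k=k)


genes = ["gene_a", "gene_b", "gene_c"]
fitness_values = [-0.9, -0.1, 0.5]  # hypothetical fitness values
parents = roulette_select(genes, fitness_values, k=1000, rng=random.Random(1))
```

Selection for deletion can use the same routine with the sign of the fitness values reversed.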
[0046] The resulting selected genes for breeding may e.g. be (non-selected genes are shown
with a deletion line):
■ ATG'GTT'GCA'TGG'TGG'TCT'...
■ ATG'GTT'GCA'TGG'TGG'TCA'...
[0047] In action 40, the computer program tests whether one or multiple termination criteria
are fulfilled. Often one of the termination criteria is a predetermined maximum number
of iterations. Alternative criteria are checking whether the fitness obtained by the selected
genes has improved by at least a minimum threshold value relative to the fitness
of the original genes, or checking whether the fitness obtained by the selected genes has
improved by at least a minimum threshold value relative to the fitness of the gene
that had the best fitness n iterations ago (preferably n is chosen in the range of 10 to 100).
If the overall termination criterion is not fulfilled, the computer program jumps back
to action 34 while treating the selected genes as "original genes".
[0048] If, in action 40, the computer program establishes that the improvement is below
the minimum threshold value, further iteration of the actions 34-38 does not make much
sense and the computer program continues with action 42.
[0049] It is to be understood that any other suitable iteration stop criterion, like the
number of iterations performed, can be used in action 40 to leave the iteration actions
34-40 and continue with action 42.
[0050] In action 42, the gene with the best fitness amongst all selected genes is selected
and presented to the user, e.g. via the monitor or via a printout by means of the printer 23.
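By way of non-limiting illustration, actions 32 to 42 can be combined into a compact Python sketch. Several simplifications are assumptions of this sketch: the standard genetic code is used, the fitness is the mean codon pair weight taken from a small hypothetical weight table, selection keeps the fittest genes deterministically rather than by roulette wheel, and the parameter names pop_size, max_iter, patience and min_gain are hypothetical.

```python
import random
from itertools import product

# Standard genetic code (an assumption; some host organisms deviate from it).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def mean_pair_weight(gene, weights):
    """Mean weight of adjacent codon pairs (lower is better); pairs that
    are missing from the table count as 0."""
    return sum(weights.get(p, 0.0) for p in zip(gene, gene[1:])) / (len(gene) - 1)


def optimize(protein, weights, pop_size=20, max_iter=200, patience=30,
             min_gain=1e-9, seed=0):
    rng = random.Random(seed)
    # Action 32: random reverse translations serve as the "original genes".
    population = [[rng.choice(SYNONYMS[aa]) for aa in protein]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(max_iter):  # action 40: maximum number of iterations
        # Action 34: crossover at a codon boundary plus one synonymous mutation.
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            cut = rng.randrange(1, len(protein))
            child = a[:cut] + b[cut:]
            pos = rng.randrange(len(child))
            child[pos] = rng.choice(SYNONYMS[protein[pos]])
            children.append(child)
        # Actions 36 and 38: score all genes and keep the fittest ones.
        population = sorted(population + children,
                            key=lambda g: mean_pair_weight(g, weights))[:pop_size]
        best_history.append(mean_pair_weight(population[0], weights))
        # Action 40: stop when the gain over `patience` iterations is too small.
        if (len(best_history) > patience
                and best_history[-patience - 1] - best_history[-1] < min_gain):
            break
    return population[0]  # action 42: gene with the best fitness


toy_weights = {("GTT", "GCA"): -1.0, ("TGG", "TCT"): -1.0, ("GTA", "GCC"): 1.0}
best_gene = optimize("MVAWWS", toy_weights)
```

The crossover point is chosen between codons of the list representation, so every child encodes the same protein as its parents.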
[0051] In the case of gene adaptation using a genetic algorithm, it has to be assured that
the crossover is always performed at a reading frame position, because otherwise the
resulting amino acid sequence might be changed when combining one nucleotide of one
and two nucleotides of another codon. For better convergence, a modified mutation
operator is proposed: for this mutation operator, only those synonymous codon replacements
are allowed that result in at least one of better single codon usage or better codon
pair usage.
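By way of non-limiting illustration, a crossover restricted to reading frame positions can be sketched for genes stored as nucleotide strings. The function name crossover_in_frame is hypothetical; the sketch assumes two parents of equal length that encode the same amino acid sequence.

```python
import random


def crossover_in_frame(parent_a, parent_b, rng=random):
    """Single-point crossover of two coding sequences of equal length.
    The cut is always placed at a multiple of three nucleotides, so no
    codon is assembled from parts of two different codons."""
    if len(parent_a) != len(parent_b) or len(parent_a) % 3 != 0:
        raise ValueError("parents must have equal length, a multiple of 3")
    n_codons = len(parent_a) // 3
    if n_codons < 2:
        raise ValueError("at least two codons are needed for a crossover")
    cut = 3 * rng.randrange(1, n_codons)  # codon boundary inside the gene
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


gene_a = "ATGGTTGCATGGTGGTCT"
gene_b = "ATGGTAGCCTGGTGGTCA"
child_1, child_2 = crossover_in_frame(gene_a, gene_b, rng=random.Random(3))
```

Since both parents encode the same protein and every codon of a child is copied whole from one parent, the children encode that protein as well.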
[0052] So an important question for codon pair optimization now is how to measure the quality
of the individuals. This so-called fitness function can be regarded as the central part
of the genetic algorithm, since it is the actual function to be optimized. In the
present invention, a preferred approach is to assign a real number (called weight)
to each codon pair and take the average of the weights in a gene as its "fitness",
thus resulting in a function to be minimized.
[0053] In the current description, the inventors describe the process of gene optimization
as a minimization problem. This is a rather arbitrary choice. Note that, if a
function f were to be maximized, one could as well look for the minimum of -f,
so this is no restriction to generality.
[0054] Hence, a method for determining codon pair weights has to be identified, where codon
pairs considered good for expression level have a low weight and pairs considered
bad a high one.
Identification of codon pair weights for gene adaptation
[0055] For identification of codon pair weights that relate to a higher transcription/expression
level, and which may serve as input for adaptation of codon pair usage, the following
methods may be applied, which are herein exemplified by A. niger, for which transcription
levels for most of the expressed genes are known, and by B. subtilis, for which data on
transcription levels were available as well as a set of 300 highly expressed genes.
[0056] In A. niger, where a complete ranking extracted from GeneChip data was available for the aforementioned
set of 4,584 actually expressed genes (see Example 1), the mean codon pair weights
of each gene (i.e. the equivalent of the $fit_{cp}(g)$ values) were calculated. Then the genes were sorted according to fitness values
(ascending order) and expression level (descending order). Since highly expressed
genes are supposed to have low codon pair fitness values, these two rankings would
be equal when using ideal codon pair weights, so a comparison of these two rankings
can give information about the quality of the weights used in the fitness function
(where slightly more attention was given to the "correct" ranking of the highly expressed
genes than to the ranking of the mediocre ones). Additionally, the correlation coefficient
(covariance divided by the standard deviation of each variable) between ranking and
average codon pair weights of the 4,584 genes was calculated.
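By way of non-limiting illustration, the correlation coefficient mentioned above (covariance divided by the standard deviation of each variable) can be computed as in the following sketch. The function name pearson and the four data points are hypothetical and serve only to show the calculation.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Correlation coefficient: covariance of xs and ys divided by the
    product of their (population) standard deviations."""
    mx, my = mean(xs), mean(ys)
    covariance = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return covariance / (pstdev(xs) * pstdev(ys))


# Hypothetical data: expression rank (1 = highest expression) and the mean
# codon pair weight of each gene.
expression_rank = [1, 2, 3, 4]
mean_pair_weight = [-0.4, -0.3, -0.2, -0.1]
r = pearson(expression_rank, mean_pair_weight)
```

With ideal weights, genes ranked higher in expression have lower mean weights, so the coefficient between rank and mean weight approaches 1.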
[0057] Several possible sets of weights may be used in the methods of the invention, including
one or more selected from the group consisting of: (i) bias values from the whole genome;
(ii) bias values from a group of highly expressed genes; (iii) bias with all the values
that do not have a certain minimum z-score set to zero (whereby the z-score is determined
as described in Example 1.1.4); (iv) bias values raised to the power of 2 or 3, 4,
5 or higher (to give highly preferred or rejected codons a lower/higher influence);
(v) z-scores themselves; (vi) difference of bias values/z-scores from the highly expressed
group and the full genome; and, (vii) combinations of one or more of (i) - (vi).
[0058] For the genetic algorithm, their negations have been used, since preferred codon
pairs had been arbitrarily identified with positive values, whereas the genetic algorithm
performs minimization. This applies to all the above-mentioned weights.
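[0059] A more preferred weight matrix may be obtained - as described above - by calculating
the codon pair "bias" in a highly expressed group using expected values calculated
based on the codon ratios of the whole genome. Let $r_{sc}^{all}(c_k)$ still denote the
single codon ratio of $c_k$ in the whole genome data set and $n_{obs}^{high}((c_i, c_j))$
the occurrences of a pair $(c_i, c_j)$ in the highly expressed group, then the calculation
of the "combined expected values" $n_{exp}^{combi}((c_i, c_j))$ corresponds to

$$n_{exp}^{combi}((c_i, c_j)) = r_{sc}^{all}(c_i) \cdot r_{sc}^{all}(c_j) \cdot \sum_{c_k \in syn(c_i),\; c_l \in syn(c_j)} n_{obs}^{high}((c_k, c_l))$$

and thus

$$w((c_i, c_j)) = \frac{n_{exp}^{combi}((c_i, c_j)) - n_{obs}^{high}((c_i, c_j))}{\max\left(n_{exp}^{combi}((c_i, c_j)),\; n_{obs}^{high}((c_i, c_j))\right)}$$

[0060] Where $w((c_i, c_j))$ is defined as a weight of a codon pair $(c_i, c_j)$ in a sequence
$g$ of codons, and $syn(c)$ denotes the set of codons synonymous with $c$. Note that since the optimization function will look for
a minimum average weight, the two terms of the numerator have been reversed compared
to the equation for the bias values, but this does not affect the correlation with
the expression levels other than that it changes the sign.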
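By way of non-limiting illustration, the weight calculation described above can be sketched in Python. The sketch assumes that the expected count of a pair is the product of the whole-genome ratios of its two codons and the number of times the corresponding amino acid pair is observed in the highly expressed group, and that the weight is the expected count minus the observed count, divided by the larger of the two. It further assumes that a single codon ratio is the share of a codon among its synonymous codons. The function name codon_pair_weights and the four-codon toy table are hypothetical.

```python
from collections import Counter


def codon_pair_weights(high_genes, all_genes, codon_to_aa):
    """Weights for the codon pairs observed in `high_genes`; genes are
    lists of codons, and negative weights mark preferred pairs."""
    # Single codon ratios in the whole data set, per amino acid.
    codon_counts = Counter(c for gene in all_genes for c in gene)
    aa_counts = Counter()
    for codon, n in codon_counts.items():
        aa_counts[codon_to_aa[codon]] += n
    ratio = {c: n / aa_counts[codon_to_aa[c]] for c, n in codon_counts.items()}

    # Observed codon pairs and amino acid pairs in the highly expressed group.
    pair_obs = Counter(p for gene in high_genes for p in zip(gene, gene[1:]))
    aa_pair_obs = Counter()
    for (ci, cj), n in pair_obs.items():
        aa_pair_obs[(codon_to_aa[ci], codon_to_aa[cj])] += n

    weights = {}
    for (ci, cj), n_obs in pair_obs.items():
        n_exp = ratio[ci] * ratio[cj] * aa_pair_obs[(codon_to_aa[ci], codon_to_aa[cj])]
        weights[(ci, cj)] = (n_exp - n_obs) / max(n_exp, n_obs)
    return weights


# Toy data covering two amino acids only.
toy_code = {"GTT": "V", "GTA": "V", "GCA": "A", "GCC": "A"}
genome = [["GTT", "GCA"], ["GTA", "GCC"], ["GTT", "GCC"], ["GTA", "GCA"]]
high = [["GTT", "GCA"], ["GTT", "GCA"], ["GTT", "GCA"], ["GTA", "GCC"]]
w = codon_pair_weights(high, genome, toy_code)
```

In this toy data every codon has a whole-genome ratio of 0.5 and the Val-Ala pair occurs four times in the highly expressed group, so each codon pair is expected once. The pair GTT-GCA is observed three times and receives the weight (1 - 3)/3, whereas GTA-GCC is observed once and receives a weight of zero. Pairs never observed in the highly expressed group are not returned by this sketch.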
[0061] Unlike all other weight sets tested, codon pairs involving codons that are more underrepresented
in the highly expressed group get a slight disadvantage here. Thus, these weights
are the only ones that also reflect the different single codon bias of the highly
expressed group and all genes. Using these weights carries the risk of rejecting some
codon pairs that actually have a positive bias in the highly expressed group, but
consist of (in the highly expressed group) rarely used codons. However, since our
desired single codon ratios are usually not identical to those in the group of genes
with high expression, but more "extreme" than these, single codon optimization would
replace these underrepresented codons anyway, so we can consider the weights described above
very convenient for codon pair optimization. Thus, although the codon pair weights
also reflect single codon bias to a limited extent, for the optimization, single codon
usage is regarded as a separate, additional issue.
Optimization of single codons and codon pairs using a genetic algorithm
[0062] In the method of the invention, preferably a computer arrangement programmed to perform
a genetic algorithm as described herein above is used to perform codon pair adaptation
or combined single codon and codon pair adaptation. Applying a
genetic algorithm for single codon adaptation is also possible and not excluded from
the invention, but here undesired codons can be replaced by synonymous codons without
constraints with respect to neighboring codons and therefore using a genetic algorithm
is not really necessary.
[0063] As for codon pairs, changing a single codon will usually alter the weight of two
codon pairs, and therefore codon pair optimization is heavily constrained because
a single codon change replacing an unwanted codon pair will always change another
codon pair, and this is not necessarily a change for the better, and correcting a
change for the worse in an adjacent codon pair will then again alter another pair,
and so on.
[0064] For the mutation operator, only those alterations of the codon sequence have been
allowed that did not change the encoded peptide sequence and that improved at least
one of single codon fitness and codon pair fitness, i.e. before changing a codon the
mutation operator looks for a synonymous codon that is either underrepresented (according
to the desired single codon ratios) or for which the two codon pairs it is involved
in have better weights. It is selected randomly which one of the two types of mutation
is performed. Performing the former "mutation" operator on every single codon is sufficient
for creating a single-codon-optimized gene without any use of the genetic algorithm.
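By way of non-limiting illustration, the codon pair variant of this mutation operator can be sketched as follows. The sketch replaces a codon only if a synonymous codon lowers the summed weight of the (at most two) codon pairs in which it takes part. As an assumption of this sketch, it always picks the best such synonym, which is a stricter variant of the operator described above. The function name improving_pair_mutation and the toy tables are hypothetical.

```python
def improving_pair_mutation(gene, pos, synonyms, weights):
    """Return a copy of `gene` (a list of codons) in which the codon at
    `pos` is replaced by the synonymous codon with the lowest local pair
    weight; the gene is returned unchanged if no synonym is an improvement."""
    def local_weight(codon):
        total = 0.0
        if pos > 0:  # pair formed with the preceding codon
            total += weights.get((gene[pos - 1], codon), 0.0)
        if pos < len(gene) - 1:  # pair formed with the following codon
            total += weights.get((codon, gene[pos + 1]), 0.0)
        return total

    best = min(synonyms[gene[pos]], key=local_weight)
    if local_weight(best) < local_weight(gene[pos]):
        return gene[:pos] + [best] + gene[pos + 1:]
    return list(gene)


# Toy tables: each codon maps to all codons for the same amino acid.
toy_synonyms = {"ATG": ["ATG"], "GTT": ["GTT", "GTA"], "GTA": ["GTT", "GTA"],
                "GCA": ["GCA"]}
toy_weights = {("ATG", "GTA"): 0.2, ("GTA", "GCA"): 0.3,
               ("ATG", "GTT"): -0.1, ("GTT", "GCA"): -0.4}
mutated = improving_pair_mutation(["ATG", "GTA", "GCA"], 1, toy_synonyms, toy_weights)
```

Because a replacement is accepted only when it lowers the local weight, the operator can never make the codon pair fitness of a gene worse.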
[0065] The quality of a gene is determined considering two aspects, namely single codon
"fitness" and codon pair "fitness". The latter is simply the average of the weights
$w((c(k), c(k+1)))$ of all codon pairs in a sequence $g$ of codons (or gene). I.e., when $g$
again symbolizes the sequence of codons, $|g|$ its length (in codons) and $c(k)$ its k-th codon:

$$fit_{cp}(g) = \frac{1}{|g|-1} \sum_{k=1}^{|g|-1} w((c(k), c(k+1)))$$
[0066] Single codon fitness is defined to be the difference of the actual codon ratios in
the gene and the target codon ratios, normalized for the number of occurrences of
every codon. Single codon ratios are defined and may be determined as described in
Example 1.1.2 herein. Let $r_{sc}^{target}(c_k)$ be the desired ratio (or frequency) of codon
$c_k$ and $r_{sc}^{g}(c_k)$ as before the actual ratio in the gene $g$, then the single codon
fitness is defined as

$$fit_{sc}(g) = \frac{1}{|g|} \sum_{k=1}^{|g|} \left| r_{sc}^{target}(c(k)) - r_{sc}^{g}(c(k)) \right|$$
[0067] Thus, $fit_{sc}$ can reach values in [0,1] with the optimal sequence being close to 0, whereas
$fit_{cp}$ is limited by the weights, which here are also in [-1,1].
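By way of non-limiting illustration, both fitness values can be computed as in the following sketch. It assumes that codon pair fitness is the mean weight over the adjacent codon pairs of a gene, and that single codon fitness is the mean, over all codon positions, of the absolute difference between target ratio and actual ratio, where a ratio is the share of a codon among its synonymous codons. The function names, the treatment of missing table entries as zero, and the toy data are assumptions of this sketch.

```python
from collections import Counter


def fit_cp(gene, weights):
    """Mean weight of the adjacent codon pairs of `gene` (a list of codons)."""
    if len(gene) < 2:
        raise ValueError("a gene needs at least two codons")
    pairs = zip(gene, gene[1:])
    return sum(weights.get(pair, 0.0) for pair in pairs) / (len(gene) - 1)


def fit_sc(gene, target_ratio, codon_to_aa):
    """Mean absolute deviation between target and actual codon ratios,
    evaluated at every codon position of `gene`."""
    codon_counts = Counter(gene)
    aa_counts = Counter(codon_to_aa[c] for c in gene)
    actual = {c: n / aa_counts[codon_to_aa[c]] for c, n in codon_counts.items()}
    return sum(abs(target_ratio.get(c, 0.0) - actual[c]) for c in gene) / len(gene)


toy_code = {"ATG": "M", "GTT": "V", "GTA": "V"}
toy_target = {"ATG": 1.0, "GTT": 1.0, "GTA": 0.0}
toy_weights = {("ATG", "GTT"): -0.5, ("GTT", "GTT"): 0.25}

gene = ["ATG", "GTT", "GTT", "GTA"]
cp_value = fit_cp(gene, toy_weights)          # (-0.5 + 0.25 + 0) / 3
sc_value = fit_sc(gene, toy_target, toy_code)  # (0 + 1/3 + 1/3 + 1/3) / 4
```

A gene that uses only the target codon for every amino acid has a single codon fitness of zero, in line with the range stated above.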
[0068] To optimize for both aspects, in an embodiment, a combined fitness function has been
introduced:

$$fit_{combi}(g) = \frac{fit_{cp}(g)}{cpi + fit_{sc}(g)}$$
[0069] Here, $cpi$, which stands for "codon pair importance", is a real value greater than zero and determines
which of the two fitness functions has more influence on the combined fitness. With
$cpi$ close to zero, the denominator approaches zero when $fit_{sc}(g)$ gets better (i.e. also
close to zero) and thus small changes in $fit_{sc}(g)$ influence $fit_{combi}(g)$ more than
small changes in $fit_{cp}(g)$, whereas with a high $cpi$ slight improvements in $fit_{cp}(g)$
may have a larger effect on $fit_{combi}(g)$ than medium improvements in $fit_{sc}(g)$. Note that
$fit_{combi}$ values that are obtained using different values of $cpi$ are not comparable
($cpi$ close to 0 might result in $fit_{combi}$ values close to -100, whereas $fit_{combi}$ is
usually between 0 and -1 for $cpi$ > 0.2).
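By way of non-limiting illustration, the influence of cpi can be shown numerically. The sketch assumes the combined fitness is the codon pair fitness divided by the sum of cpi and the single codon fitness, as the reference to the denominator above suggests. The function name fit_combi and the input values are hypothetical.

```python
def fit_combi(fit_cp_value, fit_sc_value, cpi=0.2):
    """Combined fitness: codon pair fitness divided by (cpi + single codon
    fitness). Lower values are better."""
    denominator = cpi + fit_sc_value
    if cpi < 0 or denominator <= 0:
        raise ValueError("cpi must be non-negative and the denominator positive")
    return fit_cp_value / denominator


# Same codon pair fitness, different cpi and single codon fitness values.
moderate = fit_combi(-0.2, 0.05, cpi=0.2)     # -0.2 / 0.25
extreme = fit_combi(-0.2, 0.001, cpi=0.001)   # -0.2 / 0.002
```

With cpi = 0.2 the combined value stays between 0 and -1, whereas with cpi close to zero and a near-optimal single codon fitness it reaches about -100, which is why values obtained with different cpi settings should not be compared.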
[0070] In an embodiment, a "penalty" is added if $g$ contains certain unwanted sequences,
e.g. restriction sites or sequences resulting
in undesired secondary structures in mRNA. This may be useful when constructing synthetic
genes, but in itself is unrelated to optimization of single codon and codon pair usage.
A modified fitness function becomes:

$$fit_{combi}(g) = \frac{fit_{cp}(g)}{cpi + fit_{sc}(g)} + P(g)$$

where $P(g)$ denotes a penalty function that creates a positive weight in case an unwanted sequence
structure is part of gene $g$.
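By way of non-limiting illustration, a penalty for unwanted restriction sites can be sketched as follows. The sketch assumes an additive penalty of a fixed cost per occurrence. The EcoRI (GAATTC) and BamHI (GGATCC) recognition sites are used only as examples, and the function names and the cost value are hypothetical.

```python
def penalty(sequence, unwanted=("GAATTC", "GGATCC"), cost=1.0):
    """Positive penalty proportional to the number of (non-overlapping)
    occurrences of unwanted subsequences in a nucleotide string."""
    return cost * sum(sequence.count(site) for site in unwanted)


def fit_with_penalty(fit_cp_value, fit_sc_value, sequence, cpi=0.2):
    """Combined fitness with the penalty added; lower values are better."""
    return fit_cp_value / (cpi + fit_sc_value) + penalty(sequence)


clean = fit_with_penalty(-0.2, 0.05, "ATGGCATAA")
with_site = fit_with_penalty(-0.2, 0.05, "ATGGAATTCTAA")  # contains GAATTC
```

Since the function is minimized, the positive penalty makes a gene containing an unwanted site less likely to be selected than an otherwise identical gene without it.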
[0071] It is to be understood that in the embodiments of the invention herein the nucleotide
and amino acid sequences may be theoretical sequences that exist only on e.g. paper
or another preferably computer readable data carrier, or they may exist as a tangibly,
physically created embodiment.
[0072] In a first aspect the invention therefore relates to a method of optimization of
a nucleotide coding sequence that codes for a predetermined amino acid sequence, whereby
the coding sequence is optimized for expression in a predetermined host cell. The
method preferably comprises the steps of: (a) generating at least one original coding
sequence that codes for the predetermined amino acid sequence; (b) generating at least
one newly generated coding sequence from this at least one original coding sequence
by replacing in this at least one original coding sequence one or more codons by a
synonymous codon; (c) determining a fitness value of said at least one original coding
sequence and a fitness value of said at least one newly generated coding sequence
while using a fitness function that determines at least one of single codon fitness
and codon pair fitness for the predetermined host cell; (d) choosing one or more selected
coding sequence amongst said at least one original gene and said at least one newly
generated coding sequence in accordance with a predetermined selection criterion such
that the higher is said fitness value, the higher is a chance of being chosen; and,
(e) repeating actions b) through d) while treating said one or more selected coding
sequence as one or more original coding sequence in actions b) through d) until a
predetermined iteration stop criterion is fulfilled.
[0073] According to an embodiment of the invention, the method preferably comprises the
steps of: (a) generating at least one original coding sequence that codes for the
predetermined amino acid sequence; (b) generating at least one newly generated coding
sequence from this at least one original coding sequence by replacing in this at least
one original coding sequence one or more codons by a synonymous codon; (c) determining
a fitness value of said at least one original coding sequence and a fitness value
of said at least one newly generated coding sequence while using a fitness function
that determines codon pair fitness for the predetermined host cell; (d) choosing one
or more selected coding sequence amongst said at least one original gene and said
at least one newly generated coding sequence in accordance with a predetermined selection
criterion such that the higher is said fitness value, the higher is a chance of being
chosen; and, (e) repeating actions b) through d) while treating said one or more selected
coding sequence as one or more original coding sequence in actions b) through d) until
a predetermined iteration stop criterion is fulfilled.
[0074] According to another embodiment of the invention, the method preferably comprises
the steps of: (a) generating at least one original coding sequence that codes for
the predetermined amino acid sequence; (b) generating at least one newly generated
coding sequence from this at least one original coding sequence by replacing in this
at least one original coding sequence one or more codons by a synonymous codon; (c)
determining a fitness value of said at least one original coding sequence and a fitness
value of said at least one newly generated coding sequence while using a fitness function
that comprises determining single codon fitness and codon pair fitness for the predetermined
host cell; (d) choosing one or more selected coding sequence amongst said at least
one original gene and said at least one newly generated coding sequence in accordance
with a predetermined selection criterion such that the higher is said fitness value,
the higher is a chance of being chosen; and, (e) repeating actions b) through d) while
treating said one or more selected coding sequence as one or more original coding
sequence in actions b) through d) until a predetermined iteration stop criterion is
fulfilled.
[0075] In the methods, preferably the predetermined selection criterion is such that said
one or more selected coding sequences have a best fitness value according to a predetermined
criterion. The methods according to the invention may further comprise, after action
e): selecting a best individual coding sequence amongst said one or more selected
coding sequences where said best individual coding sequence has a better fitness value
than other selected coding sequences.
[0076] In the methods of the invention, the said predetermined iteration stop criterion
preferably is at least one of: (a) testing whether at least one of said selected coding
sequences have a best fitness value above a predetermined threshold value; (b) testing
whether none of said selected coding sequences has a best fitness value below said
predetermined threshold value; (c) testing whether at least one of said selected coding
sequences has at least 30% of the codon pairs with associated positive codon pair
weights for the predetermined host cell in said original coding sequence being transformed
into codon pairs with associated negative weights; and, (d) testing whether at least
one of said selected coding sequences has at least 10, 20, 30, 40, 50, 60, 70, 80
or 90% of the codon pairs with associated positive weights above 0 for the predetermined
host cell in said original coding sequence being transformed into codon pairs with
associated weights below 0.
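By way of non-limiting illustration, stop criteria (c) and (d) can be evaluated with a sketch such as the following. It computes the fraction of codon pair positions that carry a positive weight in the original coding sequence and a negative weight in a candidate sequence. The function name fraction_converted, the toy weights and the treatment of unknown pairs as zero are assumptions of this sketch.

```python
def fraction_converted(original, candidate, weights):
    """Fraction of codon pair positions with a positive weight in
    `original` that have a negative weight in `candidate`. Both genes
    are lists of codons of equal length."""
    positive = [k for k, pair in enumerate(zip(original, original[1:]))
                if weights.get(pair, 0.0) > 0]
    if not positive:
        return 0.0
    converted = sum(
        1 for k in positive
        if weights.get((candidate[k], candidate[k + 1]), 0.0) < 0
    )
    return converted / len(positive)


toy_weights = {("ATG", "GTA"): 0.3, ("GTA", "GCC"): 0.5, ("GCC", "TGG"): 0.2,
               ("ATG", "GTT"): -0.1, ("GTT", "GCA"): -0.6, ("GCA", "TGG"): 0.1}
original = ["ATG", "GTA", "GCC", "TGG"]
candidate = ["ATG", "GTT", "GCA", "TGG"]
share = fraction_converted(original, candidate, toy_weights)
stop = share >= 0.30  # threshold of criterion (c)
```

In the toy example all three pairs of the original gene have positive weights and two of them become negative in the candidate, so the 30% threshold is met.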
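[0077] In the methods of the invention the fitness function preferably defines single codon
fitness by means of:

$$fit_{sc}(g) = \frac{1}{|g|} \sum_{k=1}^{|g|} \left| r_{sc}^{target}(c(k)) - r_{sc}^{g}(c(k)) \right|$$

where $g$ symbolizes a coding sequence, $|g|$ its length, $c(k)$ its k-th codon,
$r_{sc}^{target}(c(k))$ is a desired ratio of codon $c(k)$ (APPENDIX 2; CR vectors) and
$r_{sc}^{g}(c(k))$ an actual ratio in the nucleotide coding sequence $g$.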
[0078] In the methods of the invention the fitness function preferably defines codon pair
fitness by means of:

$$fit_{cp}(g) = \frac{1}{|g|-1} \sum_{k=1}^{|g|-1} w((c(k), c(k+1)))$$

where $w((c(k), c(k+1)))$ is a weight of a codon pair in a coding sequence $g$, $|g|$ is
the length of said nucleotide coding sequence and $c(k)$ is the k-th codon in said coding sequence.
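[0079] More preferably, in the methods of the invention the fitness function is defined
by means of:

$$fit_{combi}(g) = \frac{fit_{cp}(g)}{cpi + fit_{sc}(g)}$$

$$fit_{cp}(g) = \frac{1}{|g|-1} \sum_{k=1}^{|g|-1} w((c(k), c(k+1)))$$

$$fit_{sc}(g) = \frac{1}{|g|} \sum_{k=1}^{|g|} \left| r_{sc}^{target}(c(k)) - r_{sc}^{g}(c(k)) \right|$$

where $cpi$ is a real value greater than or equal to zero, $fit_{cp}(g)$ is a codon pair fitness function,
$fit_{sc}(g)$ is a single codon fitness function, $w((c(k), c(k+1)))$ is a weight of a codon
pair in a coding sequence $g$ (APPENDIX 3; CPW matrix), $|g|$ is the length of said coding sequence,
$c(k)$ is the k-th codon in said sequence of codons, $r_{sc}^{target}(c(k))$ is a desired ratio of codon
$c(k)$ and $r_{sc}^{g}(c(k))$ an actual ratio in the coding sequence $g$. Preferably
$cpi$ is between 0 and 10, more preferably between 0 and 0.5 and most preferably about
0.2.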
[0080] In the methods of the invention, the codon pair weights w (APPENDIX 3) may be taken
from a 64x64 codon pair matrix including stop codons. Note that the weights for stop:sense
pairs and stop:stop pairs are always zero. The codon pair weights w are preferably
calculated on the basis of a computer-based method, using as input at least one of:
(a) a genome sequence of the predetermined host cell for which at least 5, 10, 20
or 80% of the protein encoding nucleotide sequences are sequenced; (b) a genome sequence
of a related species to the predetermined host cell for which at least 5, 10, 20 or
80% of the protein encoding nucleotide sequences are sequenced; (c) a group of nucleotide
sequences consisting of at least 200 coding sequences of the predetermined host cell;
and, (d) a group of nucleotide sequences consisting of at least 200 coding sequences
of a species related to the predetermined host cell. A related species is herein understood
to refer to a species of which the nucleotide sequence of the small subunit ribosomal
RNA has at least 60, 70, 80, or 90% identity with the nucleotide sequence of the small
subunit ribosomal RNA of the predetermined host cell (
Wuyts et al., 2004, Nucleic Acids Res. 32: D101-D103).
[0081] The codon pair weights w need not be determined for all of the possible 61x64 codon
pairs including the termination signal as stop codon but may be determined for only
a fraction thereof, e.g. for at least 5%, 10%, 20%, 50%, and preferably 100% of the
possible 61x64 codon pairs including the termination signal as stop codon.
Selection of highly expressed genes
[0082] For calculation of the codon pair weight matrices and the single codon target ratio
vectors one can apply a set of nucleotide sequences from the specified host cell itself,
a set of nucleotide sequences from a related species, or a combination of both. The
set A of nucleotide sequences is called the 'reference set all'. Most preferably this
set contains the full set of open reading frames (ORFs) for an organism that is completely
sequenced (>95%).
[0083] In a preferred embodiment of the invention, a subset B is selected that is
enriched in highly expressed genes or genes coding for highly
expressed proteins. This set can be determined using measurements, and subsequent
ranking, such as mRNA hybridization using array technology, e.g. arrays from Affymetrix,
Nimblegen, Agilent or any other source, for the reference set A. Other measurements
can be RT-PCR, protein gels, MS-MS analysis, or any other measurement technique known
by the person skilled in the art. Besides making a ranking on the basis of measurements,
one can also apply bioinformatics tools, either to predict directly a group of highly
expressed genes, for example by selecting the most biased genes (Carbone et al, 2003),
or to select genes known to be highly expressed in a wide range of organisms. Among
these are genes encoding ribosomal proteins, glycolytic and TCA cycle genes involved in primary
metabolism, and genes involved in transcription and translation.
[0084] Preferably, the codon pair weights w are calculated on the basis of a computer-based
method, using as input the group of highly expressed genes in the predetermined host
cell. Highly expressed genes are herein understood to mean genes whose mRNAs can
be detected at a level of at least 10, preferably 20, more preferably 50, more preferably
100, more preferably 500 and most preferably at least 1,000 copies per cell. For example,
Gygi et al. measured ∼15,000 mRNA molecules per yeast cell. The abundance of specific mRNAs was
determined to be in the range of 0.1-470 per cell (
Gygi, S.P., Y. Rochon, B.R. Franza and R. Aebersold (1999). Correlation between protein
and mRNA abundance in yeast. Mol. Cel. Biol. 19(3):1720-30) or a factor 10 lower: 0.01-50 per cell (
Akashi, H. (2003). Translational selection and yeast proteome evolution. Genetics
164(4): 1291-1303).
[0085] Alternatively, the group of highly expressed genes in the predetermined host cell
may be the group comprising the 1000, 500, 400, 300, 200 or 100 most abundant mRNAs
or proteins. The skilled person will recognize that for calculation of single-codon
ratios the group of highly expressed genes may be small, since at most
only 64 target values are being specified. Here a reference set of highly expressed
genes might be as small as 1 gene, but generally one considers 1% of the genome size
a representative set of the highly expressed genes, see for example
Carbone, A. et al. (2003) (Codon adaptation index as a measure of dominating codon
bias. Bioinformatics. 19(16):2005-15). For the calculation of a codon-pair weight matrix, usually a set of 200-500 reference
genes suffices, which corresponds with 2-7% of a bacterial genome (3000-15000 genes).
[0086] Another possibility is to derive a subset of presumably highly expressed genes from
literature. For example, for
Bacillus subtilis, being a model organism, quite some literature on single-codon bias exists. A good
overview on the state-of-the-art for
B. subtilis is given by the work of Kanaya
et al. (1999). In our approach, see example 4, we group the data in a subset of highly-expressed
groups on the basis of mRNA levels measured by Affymetrix technology, and compare
these sequences with the whole set of genome ORFs. Other options that have been used
in literature are protein expression data, and functional categorical groups of (expected)
genes like ribosomal proteins, proteins involved in translation and transcription,
sporulation, energy metabolism, and the flagellar system (Kanaya
et al., 1999; Karlin and Mrazek, 2000).
[0087] Indeed one often finds, for example, high codon bias in the ribosomal proteins, as
well as in the other named groups. However, generally not all genes in the latter
groups show such behavior. Also, we do not know how ribosomal proteins react in low-growth
production conditions. Therefore, a straightforward measurement technique for deriving
a subset of highly expressed genes seems logical. Here one can choose transcriptomics
(TX) and/or proteomics (PX) data. For both there are pros and cons. TX gives a rather
complete picture for mRNA levels of genes in the full genome, while PX data might
be biased by overrepresentation of water-soluble proteins. TX data is a direct measure
for the available mRNA that is subject to translation, while protein is part of an
accumulation process in which turnover also plays an important role. Anyway, TX and
PX data are shown to correlate for the highly-expressed genes (Gygi et al, 1999).
Another interesting work is the prediction of highly-expressed (PHX) genes by deviation
from the average codon usage and similarity to ribosomal proteins, and those involved
in translation and transcription processing factors, and to chaperone degradation
proteins (Karlin and Mrazek, 2000). In particular for fast growing organisms, like
Bacillus,
E. coli, etc., major glycolytic genes and tricarboxylic acid cycle genes are found to belong
to the above group. The predictions of this method compare well with genes known to be
highly expressed according to mRNA and protein expression data.
[0088] The skilled person will appreciate that both the single codon weights and codon-pair
weights w may be determined for modified host cells that have been modified with respect
to the content and nature of their tRNA encoding genes, i.e. host cells comprising
additional copies of existing tRNA genes, new (exogenous) tRNA genes, including non-natural
tRNA genes, including genes encoding tRNAs that have been modified to include non-natural
amino-acids or other chemical compounds, as well as host cells in which one or more
tRNA genes have been inactivated or deleted.
[0089] In the method of the invention, the original coding nucleotide sequence that codes
for the predetermined amino acid sequence may be selected from: (a) a wild-type nucleotide
sequence that codes for the predetermined amino acid sequence; (b) a reverse translation
of the predetermined amino acid sequence whereby a codon for an amino acid position
in the predetermined amino acid sequence is randomly chosen from the synonymous codons
coding for the amino acid; and, (c) a reverse translation of the predetermined amino
acid sequence whereby a codon for an amino acid position in the predetermined amino
acid sequence is chosen in accordance with a single-codon bias for the predetermined
host cell or a species related to the host cell.
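By way of non-limiting illustration, options (b) and (c) can be sketched as a reverse translation that draws each codon either uniformly from the synonymous codons or in proportion to given single-codon ratios. The function name reverse_translate, the nested-dictionary layout of the ratio table and the toy ratios are assumptions of this sketch.

```python
import random


def reverse_translate(protein, codon_ratios, use_bias=True, rng=random):
    """Reverse translate a one-letter amino acid string into a list of codons.
    `codon_ratios` maps each amino acid to a dict {codon: ratio}. With
    use_bias=True codons are drawn in proportion to their ratio (option c),
    otherwise uniformly from the synonymous codons (option b)."""
    gene = []
    for aa in protein:
        codons = list(codon_ratios[aa])
        if use_bias:
            ratios = [codon_ratios[aa][c] for c in codons]
            gene.append(rng.choices(codons, weights=ratios, k=1)[0])
        else:
            gene.append(rng.choice(codons))
    return gene


# Hypothetical single-codon ratios for three amino acids.
toy_ratios = {"M": {"ATG": 1.0}, "V": {"GTT": 0.7, "GTA": 0.3}, "W": {"TGG": 1.0}}
biased_gene = reverse_translate("MVW", toy_ratios, rng=random.Random(0))
uniform_gene = reverse_translate("MVW", toy_ratios, use_bias=False, rng=random.Random(0))
```

Either result can serve as an original coding sequence for the subsequent optimization, as can the wild-type sequence of option (a).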
Host cells
[0090] In the methods of the invention the predetermined host may be any host cell or organism
that is suitable for the production of a polypeptide of interest by means of expression
of an optimized nucleotide coding sequence. The host cell may thus be a prokaryotic
or a eukaryotic host cell. The host cell may be a host cell that is suitable for culture
in liquid or on solid media. Alternatively, the host cell may be a cell that is part
of a multicellular tissue or a multicellular organism such as a (transgenic) plant,
animal or human.
[0091] The host cells may be microbial or non-microbial. Suitable non-microbial host cells
include e.g. mammalian host cells such as Hamster cells: CHO (Chinese hamster ovary),
BHK (Baby Hamster Kidney) cells, mouse cells (e.g. NS0), monkey cells such as COS
or Vero; human cells such as PER.C6™ or HEK-293 cells; or insect cells such as Drosophila S2 and Spodoptera Sf9 or Sf21
cells; or plant cells such as tobacco, tomato, potato, oilseed rape, cabbage, pea,
wheat, corn, rice,
Taxus species such as
Taxus brevifolia,
Arabidopsis species such as
Arabidopsis thaliana, and
Nicotiana species such as
Nicotiana tabacum. Such non-microbial cells are particularly suitable for the production of mammalian
or human proteins for use in mammalian or human therapy.
[0092] The host cell may also be microbial host cells such as bacterial or fungal cells.
Suitable bacterial host cells include both Gram-positive and Gram-negative bacteria.
Examples of suitable bacterial host cells include bacteria from the genera
Bacillus,
Actinomycetis,
Escherichia,
Streptomyces as well as lactic acid bacteria such as
Lactobacillus,
Streptococcus,
Lactococcus, Oenococcus, Leuconostoc, Pediococcus, Carnobacterium, Propionibacterium,
Enterococcus and
Bifidobacterium. Particularly preferred are
Bacillus subtilis,
Bacillus amyloliquefaciens, Bacillus licheniformis,
Escherichia coli, Streptomyces coelicolor, Streptomyces clavuligerus,
Lactobacillus plantarum and
Lactococcus lactis.
[0093] Alternatively, the host cell may be a eukaryotic microorganism such as a yeast or
a filamentous fungus. Preferred yeasts as host cells belong to the genera
Saccharomyces,
Kluyveromyces, Candida,
Pichia, Schizosaccharomyces, Hansenula, Kloeckera, Schwanniomyces, and
Yarrowia. Particularly preferred
yeast host cells include
Saccharomyces cerevisiae, and
Kluyveromyces lactis.
[0094] According to a more preferred embodiment, the host cell of the present invention
is a cell of a filamentous fungus. "Filamentous fungi" include all filamentous forms
of the subdivision Eumycota and Oomycota (as defined by Hawksworth et al., 1995, supra).
The filamentous fungi are characterized by a mycelial wall composed of chitin, cellulose,
glucan, chitosan, mannan, and other complex polysaccharides. Vegetative growth is
by hyphal elongation and carbon catabolism is obligately aerobic. Filamentous fungal
genera of which strains may be used as host cells in the present invention include,
but are not limited to, strains of the genera
Acremonium, Aspergillus, Aureobasidium, Cryptococcus, Filibasidium, Fusarium, Humicola,
Magnaporthe, Mucor, Myceliophthora, Neocallimastix,
Neurospora, Paecilomyces, Penicillium, Piromyces, Schizophyllum, Chrysosporium, Talaromyces,
Thermoascus, Thielavia, Tolypocladium, and
Trichoderma. Preferably, the host cell is a filamentous fungus belonging to a species selected from the group consisting
of
Aspergillus niger, Aspergillus oryzae, Aspergillus sojae, Trichoderma reesei and
Penicillium chrysogenum. Examples of suitable host strains include:
Aspergillus niger CBS 513.88 (
Pel et al., 2007, Nat Biotech. 25: 221-231),
Aspergillus oryzae ATCC 20423, IFO 4177, ATCC 1011, ATCC 9576, ATCC14488-14491, ATCC 11601, ATCC12892,
P. chrysogenum CBS 455.95,
Penicillium citrinum ATCC 38065,
Penicillium chrysogenum P2,
Acremonium chrysogenum ATCC 36225 or ATCC 48272,
Trichoderma reesei ATCC 26921 or ATCC 56765,
Aspergillus sojae ATCC11906,
Chrysosporium lucknowense ATCC44006 and derivatives thereof.
[0095] The host cell may be a wild type filamentous fungus host cell or a variant, a mutant
or a genetically modified filamentous fungus host cell. Such modified filamentous
fungal host cells include e.g. host cells with reduced protease levels, such as the
protease-deficient strain
Aspergillus oryzae JaL 125 (described in
WO 97/35956 or
EP 429 490); the tripeptidyl-aminopeptidases-deficient
A. niger strain as disclosed in
WO 96/14404, or host cells with reduced production of the protease transcriptional activator
(prtT; as described in
WO 01/68864,
US2004/0191864A1 and
WO 2006/040312); host strains like the
Aspergillus oryzae BECh2, wherein three TAKA amylase genes, two protease genes, as well as the ability
to form the metabolites cyclopiazonic acid and kojic acid have been inactivated (BECh2
is described in
WO 00/39322); filamentous fungal host cells comprising an elevated unfolded protein response
(UPR) compared to the wild type cell to enhance production abilities of a polypeptide
of interest (described in
US2004/0186070A1,
US2001/0034045A1,
WO01/72783A2 and
WO2005/123763); host cells with an oxalate deficient phenotype (described in
WO2004/070022A2 and
WO2000/50576); host cells with a reduced expression of an abundant endogenous polypeptide such
as a glucoamylase, neutral alpha-amylase A, neutral alpha-amylase B, alpha-1, 6-transglucosidase,
proteases, cellobiohydrolase and/or oxalic acid hydrolase (as may be obtained by genetic
modification according to the techniques described in
US2004/0191864A1); host cells with an increased efficiency of homologous recombination (having deficient
hdfA or
hdfB gene as described in
WO2005/095624); and host cells having any possible combination of these modifications.
[0096] In a method of the invention, the predetermined amino acid sequence may be an amino
acid sequence (of a polypeptide of interest) that is heterologous to said predetermined
host cell, or it may be an amino acid sequence (of a polypeptide of interest) that
is homologous to said predetermined host cell.
[0097] The term "heterologous" when used with respect to a nucleic acid (DNA or RNA) or
protein refers to a nucleic acid or protein that does not occur naturally as part
of the organism, cell, genome or DNA or RNA sequence in which it is present, or that
is found in a cell or location or locations in the genome or DNA or RNA sequence that
differ from that in which it is found in nature. Heterologous nucleic acids or proteins
are not endogenous to the cell into which they are introduced, but have been obtained
from another cell or synthetically or recombinantly produced. Generally, though not
necessarily, such nucleic acids encode proteins that are not normally produced by
the cell in which the nucleic acid is expressed. Any nucleic acid or protein that
one of skill in the art would recognize as heterologous or foreign to the cell in
which it is expressed is herein encompassed by the term heterologous nucleic acid
or protein. The term heterologous also applies to non-natural combinations of nucleic
acid or amino acid sequences, i.e. combinations where at least two of the combined
sequences are foreign with respect to each other.
[0098] The term "homologous" when used to indicate the relation between a given (recombinant)
nucleic acid or polypeptide molecule and a given host organism or host cell, is understood
to mean that in nature the nucleic acid or polypeptide molecule is produced by a host
cell or organism of the same species, preferably of the same variety or strain.
[0099] The predetermined amino acid sequence may be the sequence of any polypeptide of interest
having a commercial or industrial applicability or utility. Thus, the polypeptide
of interest may be an antibody or a portion thereof, an antigen, a clotting factor,
an enzyme, a hormone or a hormone variant, a receptor or portions thereof, a regulatory
protein, a structural protein, a reporter, or a transport protein, intracellular protein,
protein involved in secretion process, protein involved in folding process, chaperone,
peptide amino acid transporter, glycosylation factor, transcription factor. Preferably,
the polypeptide of interest is secreted into the extracellular environment of the
host cell by the classical secretion pathway, by a non-classical secretion pathway
or by an alternative secretion pathway (described in
WO 2006/040340). In case the polypeptide of interest is an enzyme it may e.g. be an oxidoreductase,
transferase, hydrolase, lyase, isomerase, ligase, catalase, cellulase, chitinase,
cutinase, deoxyribonuclease, dextranase, esterase. More preferred enzymes include
e.g. carbohydrases, e.g. cellulases such as endoglucanases, β-glucanases, cellobiohydrolases
or β-glucosidases, hemicellulases or pectinolytic enzymes such as xylanases, xylosidases,
mannanases, galactanases, galactosidases, pectin methyl esterases, pectin lyases,
pectate lyases, endopolygalacturonases, exopolygalacturonases, rhamnogalacturonases,
arabanases, arabinofuranosidases, arabinoxylan hydrolases, galacturonases, lyases,
or amylolytic enzymes; hydrolase, isomerase, or ligase, phosphatases such as phytases,
esterases such as lipases, proteolytic enzymes, oxidoreductases such as oxidases,
transferases, or isomerases, phytases, aminopeptidases, carboxypeptidases, endo-proteases,
metallo-proteases, serine-proteases, catalases, chitinases, cutinases, cyclodextrin
glycosyltransferases, deoxyribonucleases, alpha-galactosidases, beta-galactosidases,
glucoamylases, alpha-glucosidases, beta-glucosidases, haloperoxidases, invertases,
laccases, mannosidases, mutanases, peroxidases, phospholipases, polyphenoloxidases,
ribonucleases, transglutaminases, glucose oxidases, hexose oxidases, and monooxygenases.
Therapeutic proteins of interest include e.g. antibodies and fragments thereof,
human insulin and analogs thereof, human lactoferrin and analogs thereof, human growth
hormone, erythropoietin, tissue plasminogen activator (tPA) or insulinotropin. The
polypeptide may be involved in the synthesis of a metabolite, preferably citric acid.
Such polypeptides e.g. include: aconitate hydratase, aconitase hydroxylase, 6-phosphofructokinase,
citrate synthase, carboxyphosphonoenolpyruvate phosphonomutase, glycolate reductase,
glucose oxidase precursor goxC, nucleoside-diphosphate-sugar epimerase, glucose oxidase,
Manganese-superoxide-dismutase, citrate lyase, ubiquinone reductase, carrier proteins,
citrate transporter proteins, mitochondrial respiratory proteins and metal transporter
proteins.
Computer, program and data carrier
[0100] In a further aspect the invention relates to a computer comprising a processor and
memory, the processor being arranged to read from said memory and write into said
memory, the memory comprising data and instructions arranged to provide said processor
with the capacity to perform the method of the invention.
[0101] In another aspect the invention relates to a computer program product comprising
data and instructions and arranged to be loaded in a memory of a computer that also
comprises a processor, the processor being arranged to read from said memory and write
into said memory, the data and instructions being arranged to provide said processor
with the capacity to perform the method of the invention.
[0102] In yet another aspect the invention relates to a data carrier provided with a computer
program product as defined above.
Nucleic acid molecules
[0103] In a further aspect the invention relates to a nucleic acid molecule comprising a
coding sequence coding for a predetermined amino acid sequence. The coding sequence
preferably is a nucleotide sequence that does not resemble a naturally occurring coding
sequence. Rather the coding sequence in the nucleic acid molecule is a nucleotide
sequence that is not found in nature but is an artificial, i.e. an engineered, man-made
nucleotide sequence that was generated on the basis of the method for optimization
of single codon and/or codon pair bias for a predetermined host cell in accordance
with the methods defined herein and that was subsequently synthesized as a tangible
nucleic acid molecule. Preferably, the coding sequence has a fitsc(g) below 0.2, more
preferably below 0.1 and most preferably below 0.02 for a predetermined host cell. More
preferably, the coding sequence has a fitcp(g) below 0 for a predetermined host cell.
Most preferably, the coding sequence has a fitcp(g) below -0.1, or more preferably
below -0.2, for a predetermined host cell. Preferably, at least 60, 70, 75, 80 or 85%,
and most preferably at least 90%, of the codon pairs in an optimized gene g are codon
pairs with associated negative codon-pair weights for the specified host organism.
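The share of codon pairs carrying a negative weight can be computed as in the following Python sketch. The weight values are invented for illustration, and treating a pair that is absent from the table as having weight 0 is an assumption of this sketch, not part of the method as described.

```python
def codon_pairs(cds):
    """Return the adjacent (overlapping) codon pairs of a coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return list(zip(codons, codons[1:]))


def fraction_negative_pairs(cds, pair_weights):
    """Fraction of codon pairs in cds whose weight is below zero.

    pair_weights maps (codon, codon) tuples to codon-pair weights for the
    host; pairs missing from the mapping count as weight 0 (assumption).
    """
    pairs = codon_pairs(cds)
    if not pairs:
        return 0.0
    negative = sum(1 for pair in pairs if pair_weights.get(pair, 0.0) < 0)
    return negative / len(pairs)


# Hypothetical weights, for illustration only.
example_weights = {
    ("ATG", "GCT"): -0.3,
    ("GCT", "TCC"): 0.2,
    ("TCC", "TTC"): -0.1,
}
share = fraction_negative_pairs("ATGGCTTCCTTC", example_weights)
```

A gene with n codons contains n - 1 adjacent codon pairs, so the four-codon example yields three pairs, two of which have a negative weight.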
[0104] The predetermined amino acid sequence encoded by the coding sequence may be any polypeptide
of interest as herein defined above and also the predetermined host cell may be any
host cell as defined above herein.
[0105] In the nucleic acid molecule, the coding sequence preferably is operably linked to
an expression control sequence that is capable of directing expression of the coding
sequence in the predetermined host cell. In the context of the invention, a control
sequence is defined as a nucleotide sequence operatively associated to a coding sequence
when present together and which includes all components necessary or advantageous for
expression of the nucleotide sequence encoding the polypeptide to be produced. Each
control sequence may be native or foreign to the nucleotide sequence encoding the
polypeptide to be produced. Such control sequences may include, but are not limited
to, a leader sequence, a polyadenylation sequence, a propeptide sequence, a promoter,
a translational initiator sequence, a translational initiator coding sequence, a translational
terminator and a transcription terminator sequence. The control sequences
may be provided with linkers, e.g., for the purpose of introducing specific restriction
sites facilitating ligation of the control sequences with the coding region of the
nucleotide sequence encoding a polypeptide.
[0106] Expression control sequences will usually minimally comprise a promoter. As used
herein, the term "promoter" refers to a nucleic acid fragment that functions to control
the transcription of one or more genes, located upstream with respect to the direction
of transcription of the transcription initiation site of the gene, and is structurally
identified by the presence of a binding site for DNA-dependent RNA polymerase, transcription
initiation sites and any other DNA sequences, including, but not limited to transcription
factor binding sites, repressor and activator protein binding sites, and any other
sequences of nucleotides known to one of skill in the art to act directly or indirectly
to regulate the amount of transcription from the promoter. A "constitutive" promoter
is a promoter that is active under most environmental and developmental conditions.
An "inducible" promoter is a promoter that is active under environmental or developmental
regulation.
[0107] A DNA segment such as an expression control sequence is "operably linked" when it
is placed into a functional relationship with another DNA segment. For example, a
promoter or enhancer is operably linked to a coding sequence if it stimulates the
transcription of the sequence. DNA for a signal sequence is operably linked to DNA
encoding a polypeptide if it is expressed as a pre-protein that participates in the
secretion of the polypeptide. Generally, DNA sequences that are operably linked are
contiguous, and, in the case of a signal sequence, both contiguous and in reading
phase. However, enhancers need not be contiguous with the coding sequences whose transcription
they control. Linking is accomplished by ligation at convenient restriction sites
or at adapters, linkers, or PCR fragments by means known in the art.
[0108] The selection of an appropriate promoter sequence generally depends upon the host
cell selected for the expression of the DNA segment. Examples of suitable promoter
sequences include prokaryotic, and eukaryotic promoters well known in the art (see,
e.g.
Sambrook and Russell, 2001, "Molecular Cloning: A Laboratory Manual" (3rd edition),
Cold Spring Harbor Laboratory, Cold Spring Harbor Laboratory Press, New York). The transcriptional regulatory sequences typically include a heterologous enhancer
or promoter that is recognized by the host. The selection of an appropriate promoter
depends upon the host, but promoters such as the trp, lac and phage promoters, tRNA
promoters and glycolytic enzyme promoters are known and available (see, e.g. Sambrook
and Russell, 2001,
supra). Examples of preferred inducible promoters that can be used are starch-, copper- and
oleic acid-inducible promoters. Preferred promoters for filamentous fungal host cells
e.g. include the glucoamylase promoter of
A. niger or the TAKA amylase promoter of
A. oryzae and the promoters described in
WO2005/100573.
[0109] The nucleotide sequence of the invention may further comprise a signal sequence,
or rather a signal peptide-coding region. A signal sequence codes for an amino acid
sequence linked to the amino terminus of the polypeptide, which can direct the expressed
polypeptide into the cell's secretory pathway. Signal sequences usually contain a
hydrophobic core of about 4-15 amino acids, which is often immediately preceded by
a basic amino acid. At the carboxyl-terminal end of the signal peptide there is a
pair of small, uncharged amino acids separated by a single intervening amino acid
that defines the signal peptide cleavage site (von
Heijne, G. (1990) J. Membrane Biol. 115: 195-201). Despite their overall structural and functional similarities, native signal peptides
do not have a consensus sequence. Suitable signal peptide-coding regions may be obtained
from a glucoamylase or an amylase gene from an
Aspergillus species, a lipase or proteinase gene from a
Rhizomucor species, the gene for the alpha-factor from
Saccharomyces cerevisiae, an amylase or a protease gene from a
Bacillus species, or the calf pre-pro-chymosin gene. However, any signal peptide-coding region
capable of directing the expressed protein into the secretory pathway of a host cell
of choice may be used in the present invention. Preferred signal peptide coding regions
for filamentous fungus host cells are the signal peptide coding region obtained from
Aspergillus oryzae TAKA amylase gene (
EP 238 023),
Aspergillus niger neutral amylase gene,
Aspergillus niger glucoamylase, the
Rhizomucor miehei aspartic proteinase gene, the
Humicola lanuginosa cellulase gene,
Humicola insolens cellulase,
Humicola insolens cutinase, the
Candida antarctica lipase B gene or the
Rhizomucor miehei lipase gene and mutant, truncated, and hybrid signal sequences thereof. In a preferred
embodiment of the invention the nucleotide sequence encoding the signal sequence is
an integral part of the coding sequence that is optimized with respect to single codon
and/or codon pair bias for the predetermined host.
[0110] In the nucleic acid molecule of the invention, the coding sequence is further preferably
operably linked to a translational initiator sequence. In eukaryotes, the nucleotide
consensus sequence (6-12 nucleotides) before the initiator ATG-codon is often called
Kozak consensus sequence due to the initial work on this topic (
Kozak, M. (1987): an analysis of 5'-noncoding sequences from 699 vertebrate messenger
RNAs. Nucl. Acid Res. 15(20): 8125-47). The original Kozak consensus sequence CCCGCCGCCrCC(ATG)G, including a +4 nucleotide
derived by Kozak is associated with the initiation of translation in higher eukaryotes.
For prokaryote host cells the corresponding Shine-Dalgarno sequence (AGGAGG) is preferably
present in the 5'-untranslated region of prokaryotic mRNAs to serve as a translational
start site for ribosomes.
[0111] In the context of this invention, the term "translational initiator sequence" is
defined as the ten nucleotides immediately upstream of the initiator or start codon
of the open reading frame of a DNA sequence coding for a polypeptide. The initiator
or start codon encodes the amino acid methionine. The initiator codon is typically
ATG, but may also be any functional start codon such as GTG, TTG or CTG.
[0112] In a particularly preferred embodiment of the invention, the nucleic acid molecule
comprises a coding sequence coding for a predetermined amino acid sequence that is
to be expressed in a fungal host cell, i.e. the predetermined host cell is preferably
a fungus of which filamentous fungi are most preferred. Nucleic acid molecules comprising
coding sequences that are optimized for expression in fungi in accordance with the
invention may further comprise one or more of the following elements: 1) a fungal
consensus translational initiator sequence; 2) a fungal translational initiator coding
sequence; and 3) a fungal translational termination sequence.
[0113] A consensus fungal translational initiator sequence preferably is defined by the
following sequences: 5'-mwChkyCAmv-3', using ambiguity codes for nucleotides: m (A/C);
r (A/G); w (A/T); s (C/G); y (C/T); k (G/T); v (A/C/G); h (A/C/T); d (A/G/T); b (C/G/T);
n (A/C/G/T). According to a more preferred embodiment, the sequences are: 5'-mwChkyCAAA-3';
5'-mwChkyCACA-3' or 5'-mwChkyCAAG-3'. Most preferably the translational initiation
consensus sequence is 5'-CACCGTCAAA-3' or 5'-CGCAGTCAAG-3'.
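Whether a given ten-nucleotide upstream sequence conforms to the consensus can be checked with the ambiguity codes listed above, as in this minimal Python sketch. The function name `matches` is hypothetical; the code table is taken directly from the list of ambiguity codes in the preceding paragraph.

```python
# Ambiguity codes as listed above; upper-case letters denote fixed nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "m": "AC", "r": "AG", "w": "AT", "s": "CG", "y": "CT", "k": "GT",
    "v": "ACG", "h": "ACT", "d": "AGT", "b": "CGT", "n": "ACGT",
}


def matches(consensus, sequence):
    """Return True if sequence satisfies the consensus at every position."""
    if len(consensus) != len(sequence):
        return False
    return all(base in IUPAC[symbol]
               for symbol, base in zip(consensus, sequence.upper()))


# The general consensus and one of the more preferred, narrower forms.
general = "mwChkyCAmv"
narrower = "mwChkyCAAA"
ok_general = matches(general, "CACCGTCAAA")
ok_narrower = matches(narrower, "CACCGTCAAA")
```

The check is positional only: each symbol of the consensus is compared with the nucleotide at the same position, so sequences of a different length are rejected.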
[0114] In the context of this invention, the term "consensus translational initiator coding
sequence" is defined herein as the nine nucleotides immediately downstream of the
initiator codon of the open reading frame of a coding sequence (the initiator codon
is typically ATG, but may also be any functional start codon such as GTG). A preferred
fungal consensus translational initiator coding sequence has the following nucleotide
sequence: 5'-GCTnCCyyC-3', using ambiguity codes for nucleotides y (C/T) and n (A/C/G/T).
This leads to 16 variants for the translational initiator coding sequence of which
5'- GCT TCC TTC -3' is most preferred. Using a consensus translational initiator coding
sequence, the following amino acids are allowed at the amino acid positions mentioned:
alanine at +2, alanine, serine, proline, or threonine at +3, and phenylalanine, serine,
leucine or proline at +4 position in the polypeptide that is encoded. Preferably in
the present invention, the consensus translational initiator coding sequence is foreign
to the nucleic acid sequence encoding the polypeptide to be produced, but the consensus
translational initiator may be native to the fungal host cell.
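The count of 16 variants and the amino acids allowed at positions +2 to +4 follow directly from expanding the ambiguity codes, as this short Python sketch shows. It assumes the standard genetic code; the helper names `expand` and `translate` are introduced here for illustration.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Only the two ambiguity codes used in 5'-GCTnCCyyC-3' are needed here.
AMBIGUITY = {"n": "ACGT", "y": "CT"}


def expand(consensus):
    """List every concrete nucleotide sequence covered by the consensus."""
    choices = [AMBIGUITY.get(symbol, symbol) for symbol in consensus]
    return ["".join(combo) for combo in product(*choices)]


def translate(seq):
    """Translate a sequence whose length is a multiple of three."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


variants = expand("GCTnCCyyC")
# Amino acids (one-letter code) that can occur at positions +2, +3 and +4.
allowed = [sorted({translate(v)[i] for v in variants}) for i in range(3)]
```

The expansion gives 4 x 2 x 2 = 16 sequences, and the three codons encode A at +2, one of A, P, S or T at +3, and one of F, L, P or S at +4, in agreement with the amino acids listed above.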
[0115] In the context of this invention, the term "translational termination sequence" is
defined as the four nucleotides starting from the translational stop codon at the
3' end of the open reading frame or coding sequence. Preferred fungal translational
termination sequences include: 5'-TAAG-3', 5'-TAGA-3' and 5'-TAAA-3', of which 5'-TAAA-3'
is most preferred.
[0116] A coding sequence coding for a predetermined amino acid sequence that is to be expressed
in a fungal host cell is further preferably optimized with respect to single codon
frequency such that at least one, two, three, four or five original codons, more preferably
at least 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 25%, 50%, 75%, 80%, 85%, 90%, or 95% of
the original codons have been exchanged with a synonymous codon, the synonymous codon
encoding the same amino acid as the native codon and having a higher frequency in
the codon usage as defined in the Table A than the original codon.
Table A: Optimal filamentous fungal codon frequency for synonymous codons in %.
| | .T. | .C. | .A. | .G. | |
| T.. | Phe 0 | Ser 21 | Tyr 0 | Cys 0 | ..T |
| T.. | Phe 100 | Ser 44 | Tyr 100 | Cys 100 | ..C |
| T.. | Leu 0 | Ser 0 | Stop 100 | Stop 0 | ..A |
| T.. | Leu 13 | Ser 14 | Stop 0 | Trp 100 | ..G |
| C.. | Leu 17 | Pro 36 | His 0 | Arg 49 | ..T |
| C.. | Leu 38 | Pro 64 | His 100 | Arg 51 | ..C |
| C.. | Leu 0 | Pro 0 | Gln 0 | Arg 0 | ..A |
| C.. | Leu 32 | Pro 0 | Gln 100 | Arg 0 | ..G |
| A.. | Ile 27 | Thr 30 | Asn 0 | Ser 0 | ..T |
| A.. | Ile 73 | Thr 70 | Asn 100 | Ser 21 | ..C |
| A.. | Ile 0 | Thr 0 | Lys 0 | Arg 0 | ..A |
| A.. | Met 100 | Thr 0 | Lys 100 | Arg 0 | ..G |
| G.. | Val 27 | Ala 38 | Asp 36 | Gly 49 | ..T |
| G.. | Val 54 | Ala 51 | Asp 64 | Gly 35 | ..C |
| G.. | Val 0 | Ala 0 | Glu 26 | Gly 16 | ..A |
| G.. | Val 19 | Ala 11 | Glu 74 | Gly 0 | ..G |
[0117] An even more preferred coding sequence coding for a predetermined amino acid sequence
that is to be expressed in a fungal host cell is further preferably optimized with
respect to single codon frequency such that at least one, two, three, four or five
original codons, more preferably at least 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 25%,
50%, 75%, 80%, 85%, 90%, or 95% of the original codons have been exchanged with a
synonymous codon, the synonymous codon changing the codon frequency such that the
value of the absolute difference between the percentage for said codon in said frequency
and listed optimal percentage becomes smaller after modification, applying the following
list of optimal percentages: cysteine encoded by TGC (100%); phenylalanine by TTC
(100%); histidine by CAC (100%); lysine by AAG (100%); asparagine by AAC (100%); glutamine
by CAG (100%); tyrosine by TAC (100%); alanine by GCT (38.0%), GCC (50.7%), or GCG
(11.3%); aspartate by GAC (63.2%); glutamate by GAG (74.2%); glycine by GGT (49.0%),
GGC (35.9%), GGA (15.1%); isoleucine by ATT (26.7%), ATC (73.3%); leucine by TTG (12.7%),
CTT (17.4%), CTC (38.7%), CTG (31.2%); proline by CCT (35.6%), CCC (64.4%); arginine
by CGT (49.1%), CGC (50.9%); serine by TCT (20.8%), TCC (44.0%), TCG (14.4%), AGC
(20.8%); threonine by ACT (29.7%), ACC (70.3%) and/or valine by GTT (27.4%), GTC (54.5%),
GTG (18.1%); all other possible amino acid encoding codons (0%).
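The criterion of a smaller absolute difference after modification can be sketched in Python as follows. This is an illustrative reading, not a definitive implementation: it assumes that "said codon" refers to the replacement codon, that percentages are computed within the synonymous codon family of the coding sequence, and it includes the optimal percentages of only three amino acids from the list above. The function names are hypothetical.

```python
# Optimal percentages for three amino acids, copied from the list above;
# codons not listed there are taken as 0%.
OPTIMAL = {
    "Ile": {"ATT": 26.7, "ATC": 73.3, "ATA": 0.0},
    "Pro": {"CCT": 35.6, "CCC": 64.4, "CCA": 0.0, "CCG": 0.0},
    "Lys": {"AAG": 100.0, "AAA": 0.0},
}
CODON_TO_AA = {codon: aa for aa, table in OPTIMAL.items() for codon in table}


def split_codons(cds):
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def percentage(codons, codon):
    """Percentage of `codon` among all codons for the same amino acid."""
    aa = CODON_TO_AA[codon]
    family = [c for c in codons if CODON_TO_AA.get(c) == aa]
    return 100.0 * family.count(codon) / len(family) if family else 0.0


def swap_improves(cds, index, new_codon):
    """True if replacing the codon at `index` by the synonymous `new_codon`
    brings the percentage of `new_codon` closer to its optimal percentage."""
    if new_codon not in CODON_TO_AA:
        raise ValueError("codon not covered by this illustrative table")
    codons = split_codons(cds)
    old_codon = codons[index]
    if old_codon == new_codon or CODON_TO_AA.get(old_codon) != CODON_TO_AA[new_codon]:
        return False  # not a synonymous exchange
    target = OPTIMAL[CODON_TO_AA[new_codon]][new_codon]
    before = abs(percentage(codons, new_codon) - target)
    codons[index] = new_codon
    after = abs(percentage(codons, new_codon) - target)
    return after < before
```

For example, in a sequence with three ATC codons and one ATT codon, exchanging an ATC for ATT raises the ATT share from 25% to 50%, which moves it away from the optimal 26.7%, so that exchange would not satisfy the criterion.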
[0118] The above defined nucleic acid molecules comprising the coding sequences of the invention
(for expression in a predetermined host cell) may further comprise the elements that
are usually found in expression vectors such as a selectable marker, an origin of
replication and/or sequences that facilitate integration, preferably through homologous
recombination at a predetermined site in the genome. Such further elements are well
known in the art and need no further specification herein.
[0119] In a further aspect the invention pertains to a host cell comprising a nucleic acid
molecule as defined herein above. The host cell preferably is a host cell as herein
defined above.
[0120] In yet a further aspect the invention relates to a method for producing a polypeptide
having the predetermined amino acid sequence. The method preferably comprises culturing
a host cell comprising a nucleic acid molecule as defined herein above, under conditions
conducive to the expression of the polypeptide and, optionally, recovery of the polypeptide.
[0121] In again a further aspect the invention relates to a method for producing at least
one of an intracellular and an extracellular metabolite. The method comprises culturing
a host cell as defined herein above under conditions conducive to the production
of the metabolite. Preferably, in the host the polypeptide having the predetermined
amino acid sequence (that is encoded by the nucleic acid molecule as defined above)
is involved in the production of the metabolite. The metabolite (be it a primary or
secondary metabolite, or both; be it intra-, extracellular or both) may be any fermentation
product that may be produced in a fermentation process. Such fermentation products
e.g. include amino acids such as lysine, glutamic acid, leucine, threonine, tryptophan;
antibiotics, including e.g. ampicillin, bacitracin, cephalosporins, erythromycin,
monensin, penicillins, streptomycin, tetracyclines, tylosin, macrolides, and quinolones;
preferred antibiotics are cephalosporins and beta-lactams; lipids and fatty acids
including e.g. poly unsaturated fatty acids (PUFAs); alkanols such as ethanol, propanol
and butanol; polyols such as 1,3-propane-diol, butanediol, glycerol and xylitol; ketones
such as acetone; amines, diamines, ethylene; isoprenoids such as carotenoids, carotene,
astaxanthin, lycopene, lutein; acrylic acid, sterols such as cholesterol and ergosterol;
vitamins including e.g. the vitamins A, B2, B12, C, D, E and K, and organic acids including
e.g. glucaric acid, gluconic acid, glutaric acid, adipic acid, succinic acid, tartaric
acid, oxalic acid, acetic acid, lactic acid, formic acid, malic acid, maleic acid,
malonic acid, citric acid, fumaric acid, itaconic acid, levulinic acid, xylonic acid,
aconitic acid, ascorbic acid, kojic acid, and comeric acid; a preferred organic acid
is citric acid.
[0122] In this document and in its claims, the verb "to comprise" and its conjugations are
used in its non-limiting sense to mean that items following the word are included,
but items not specifically mentioned are not excluded. In addition, reference to an
element by the indefinite article "a" or "an" does not exclude the possibility that
more than one of the element is present, unless the context clearly requires that
there be one and only one of the elements. The indefinite article "a" or "an" thus
usually means "at least one".
Examples
1. Example 1: Analysis of codon pair bias
1.1 Material and methods
1.1.1 Data and software
[0123] Codon pair analysis may be performed on coding sequences (CDS) in whole genome sequence
data as well as on subsets derived from those (or on a partial genome sequence, like
for example cDNA/EST libraries, or even partial genome data from multiple genomes
from related organisms). The tools used in the present invention read these data using
FASTA files as input. The vast majority of all calculations have been performed in
MATLAB 7.01 (The MathWorks, Inc.,
www.mathworks.com), but for some detailed analyses of the obtained results Spotfire DecisionSite 8.0
(Spotfire, Inc., http://www.spotfire.com/products/decisionsite.cfm) was used.
[0124] For
A. niger, a FASTA file with predicted cDNA sequences for the full genome of CBS513.88 (
Pel et al., 2007, Nat Biotech. 25: 221-231) and a group of 479 highly expressed genes were used. Furthermore, since usually
less than half of the >14,000 genes in
A. niger are expressed at the same time under pilot-scale fermentation conditions, data from
24 GeneChips obtained using such conditions was used to extract a second set of genes
that includes only genes that are actually expressed within various experiments (taking
only genes with at least 18 'present' calls into account, using Affymetrix MAS5.0
array analysis software; this set comprised 4,584 genes) and to rank them according
to observed mRNA level (since no other data was available at that time), so a set
of (presumably) highly expressed genes of any size can be identified easily. This
second set was created to be able to rank the data according to their expression level.
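The selection and ranking described above can be sketched in Python as follows. The data layout, the gene identifiers and the use of the mean signal across arrays as the ranking key are assumptions made for illustration; the text only states that genes with at least 18 'present' calls were kept and ranked by observed mRNA level.

```python
def expressed_genes_ranked(present_calls, mrna_levels, min_present=18):
    """Keep genes with enough 'present' calls and rank them by mRNA level.

    present_calls: gene id -> list of booleans, one per array
    mrna_levels:   gene id -> list of signal values, one per array
    Ranking uses the mean signal across arrays (an assumption of this sketch).
    """
    kept = [gene for gene, calls in present_calls.items()
            if sum(calls) >= min_present]

    def mean_level(gene):
        values = mrna_levels[gene]
        return sum(values) / len(values)

    return sorted(kept, key=mean_level, reverse=True)


# Toy data with three arrays and hypothetical gene identifiers.
toy_calls = {
    "geneA": [True, True, True],
    "geneB": [True, False, False],
    "geneC": [True, True, False],
}
toy_levels = {
    "geneA": [10.0, 12.0, 11.0],
    "geneB": [500.0, 0.0, 0.0],
    "geneC": [40.0, 50.0, 30.0],
}
ranked = expressed_genes_ranked(toy_calls, toy_levels, min_present=2)
```

Taking the first n entries of the ranked list then gives a set of presumably highly expressed genes of any desired size.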
[0125] For this analysis we have used transcription levels of the genes. Alternatively one
can also apply quantitative protein expression data,
e.g. by two-dimensional gel electrophoresis of the proteins and subsequent identification
via mass spectrometry. However, generating protein expression data for large sets of proteins
is still quite time consuming in comparison with determination of mRNA levels (
e.g. using genechips). Therefore, what is done here is to study the effect of codon bias
on translation
before translation has actually happened.
Gygi et al. (Yeast. Mol. Cel. Biol. 19(3):1720-30) actually found a "correlation of protein and mRNA expression levels with codon bias"
in
E. coli, even though the correlation of mRNA and protein expression levels was rather rudimentary.
Hence, the term "expression level" will be used in this text when actually only
the effect on the transcription level has been determined.
[0126] For
Bacillus subtilis, an organism containing around 4,000 genes, a group of 300 highly expressed genes
was available and has been analyzed. See Table 1.1 for an overview of the basic properties
of the genomes of all organisms that have been taken into account in this study (however,
not all of them will be described in detail).
[0127] In every analysis, (putative) genes that included one or more stop codons at another
position than the end and sequences with a length not divisible by three (i.e. where
a frameshift might have occurred during sequencing) have been ignored. Also the first
five and the last five codons of every gene have not been taken into account because
these sites might be involved in protein binding and releasing efficiency and therefore
be subject to different selection pressures than the other parts of the sequence,
so codon and codon pair bias there might not be representative. ORFs (ORF = open reading
frame) shorter than 20 codons have also been omitted from the analysis. In Table 1.1
this is already taken into account.
Table 1.1 Nucleotide content (A, C, G, T) of several organisms, including number of ORFs and
genome size in megabase pairs (Mbp).
| name of organism | # of ORFs | Mbp | A | C | G | T |
| A. nidulans | 7,782 | 10.61 | 24% | 28% | 26% | 22% |
| A. niger | 13,962 | 18.41 | 24% | 27% | 26% | 22% |
| A. oryzae | 12,074 | 16.29 | 25% | 26% | 26% | 23% |
| B. amyloliquefaciens | 4,449 | 3.54 | 26% | 24% | 27% | 23% |
| B. subtilis | 4,104 | 3.66 | 30% | 20% | 24% | 26% |
| E. coli K12 | 4,289 | 4.09 | 24% | 25% | 27% | 24% |
| K. lactis | 5,336 | 7.52 | 32% | 19% | 21% | 28% |
| P. chrysogenum | 13,164 | 17.54 | 24% | 27% | 25% | 23% |
| S. cerevisiae | 6,449 | 9.01 | 33% | 19% | 20% | 28% |
| S. coelicolor | 7,894 | 7.62 | 14% | 37% | 35% | 13% |
| T. reesei | 8,331 | 11.45 | 23% | 30% | 28% | 20% |
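The sequence filter described in paragraph [0127] can be illustrated with a minimal Python sketch. It is not the toolbox code: the function name `usable_codons`, the use of the standard stop codons, counting the terminal stop codon towards the 20-codon minimum, and applying the length check before trimming are all assumptions made for illustration.

```python
# Hypothetical helper, not part of the described MATLAB toolbox.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def usable_codons(seq, trim=5, min_codons=20):
    """Return the codons of an ORF that enter the analysis, or None if the ORF is ignored."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        return None  # length not divisible by three (possible frameshift)
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if len(codons) < min_codons:
        return None  # ORF shorter than 20 codons
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return None  # stop codon at a position other than the end
    # Drop the first five and the last five codons.
    return codons[trim:len(codons) - trim]


example = usable_codons("ATG" + "GCT" * 20 + "TAA")
```

For the 22-codon example sequence, twelve codons remain after the first five and the last five have been removed.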
1.1.2 Expected occurrences of codon pairs
[0128] In order to analyze codon pair usage, first the occurrences of every single codon
and every codon pair have been counted. The number of occurrences of a codon pair is denoted
below by $n_{obs}((c_i,c_j))$, where obs stands for observed.
[0129] The double parentheses are necessary to indicate that the "observed number", i.e.
$n_{obs}$, is a function with just one argument, which itself is a pair (in this case a pair
of codons, i.e. $(c_i,c_j)$). The same applies to all functions on codon pairs defined below.
The indices i, j and also k can be 1 to 64, indicating the number of the codon in the internal
representation (according to their alphabetical order). $(c_i,c_j)$ denotes a codon pair with
$c_i$ being the left codon (i.e. the 5' triplet of the 6-nucleotide sequence) and $c_j$ the
right one (i.e. closer to the 3'-end). The number of occurrences $n_{sc}^{all}(c_k)$ has been
counted as well for every codon $c_k$ (where the subscript sc stands for single codon and the
superscript all indicates that the number refers to the full genome, as opposed to
$r_{sc}^{g}(c_k)$, which will be used to denote codon ratios in a single gene g; functions of
codon pairs like $n_{obs}((c_i,c_j))$ always refer to the number in the full genome or a
larger group of genes). Single codon ratios were then calculated as

$$r_{sc}^{all}(c_k) = \frac{n_{sc}^{all}(c_k)}{\sum_{c_l \in syn(c_k)} n_{sc}^{all}(c_l)}$$

where $syn(c_k)$ denotes the set of codons that encode the same amino acid as $c_k$ and are
thus synonymous to $c_k$. Thus, the value of the sum below the fraction bar equals the number
of occurrences of the amino acid encoded by $c_k$ in the whole proteome. (Note that in some
papers these ratios are also called frequencies. However, codon frequencies may also refer to
the number of occurrences of a codon divided by the total number of all codons.) See Appendix 1
for a concise list of the most important symbols and formulas used here.
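As an illustration of the single codon ratios, i.e. the number of occurrences of a codon divided by the number of occurrences of all codons synonymous to it, the following Python sketch may help. It assumes the standard genetic code; the names `CODON_TABLE` and `codon_ratios` are hypothetical and do not come from the toolbox.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumption), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_ratios(codon_counts):
    """Ratio of each codon among the codons encoding the same amino acid."""
    amino_acid_totals = Counter()
    for codon, n in codon_counts.items():
        amino_acid_totals[CODON_TABLE[codon]] += n
    return {codon: n / amino_acid_totals[CODON_TABLE[codon]]
            for codon, n in codon_counts.items()}


counts = Counter(["GCT", "GCT", "GCC", "ATG", "AAA", "AAG", "AAG", "AAG"])
ratios = codon_ratios(counts)
```

In this toy count, GCT is used for two of the three alanine codons, so its ratio is 2/3, while ATG, the only codon for methionine, has a ratio of 1.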
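[0130] To reveal whether certain alleged codon pair preferences are only the result of preferences
of the individual codons, it is necessary to calculate expected values for every codon
pair based on individual codon frequencies. These have been calculated using the formula

$$n_{exp}^{own}((c_i,c_j)) = r_{sc}^{all}(c_i) \cdot r_{sc}^{all}(c_j) \cdot \sum_{c_m \in syn(c_i)} \sum_{c_n \in syn(c_j)} n_{obs}((c_m,c_n))$$

where $r_{sc}^{all}(c_k)$ is the single codon ratio of codon $c_k$ in the full genome and
$syn(c_k)$ is the set of codons synonymous to $c_k$.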
[0131] The superscript
own is used to distinguish the values from those obtained using other methods mentioned
later. In the last factor of this equation, the actual numbers of occurrences of all
synonymous codon pairs are summed up. Thus, the expected amount of each codon pair
is the product of the individual codon usage ratios and the number of occurrences
of the respective amino acid pair.
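The expected value just described (product of the two single codon ratios and the number of occurrences of the respective amino acid pair) can be sketched in Python as follows. The sketch assumes the standard genetic code and uses a hypothetical function name, `expected_pairs`; the toy counts are invented for illustration only.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumption), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_OF = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expected_pairs(codon_counts, pair_counts):
    """Expected codon pair counts from single codon ratios and amino acid pair counts."""
    aa_totals = Counter()
    for codon, n in codon_counts.items():
        aa_totals[AA_OF[codon]] += n
    ratio = {c: n / aa_totals[AA_OF[c]] for c, n in codon_counts.items()}

    # Observed number of every amino acid pair (sum over synonymous codon pairs).
    aa_pair_totals = Counter()
    for (ci, cj), n in pair_counts.items():
        aa_pair_totals[(AA_OF[ci], AA_OF[cj])] += n

    return {(ci, cj): ratio[ci] * ratio[cj] * aa_pair_totals[(AA_OF[ci], AA_OF[cj])]
            for ci in ratio for cj in ratio}


# Invented toy counts: two alanine and two lysine codons.
codon_counts = {"GCT": 30, "GCC": 10, "AAA": 20, "AAG": 20}
pair_counts = {("GCT", "AAA"): 12, ("GCT", "AAG"): 3,
               ("GCC", "AAA"): 1, ("GCC", "AAG"): 4}
expected = expected_pairs(codon_counts, pair_counts)
```

Here the pair GCT-AAA is observed 12 times but expected 0.75 * 0.5 * 20 = 7.5 times, and the expected values of the four Ala-Lys codon pairs add up to the 20 observed Ala-Lys pairs.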
[0132] Gutman and Hatfield (1989, Proc. Natl. Acad. Sci USA 86:3699-3703) proposed another
method of calculating expected values. Their initial approach was to calculate the codon
frequencies (i.e. the number of occurrences of a codon in a gene g, denoted here
$n_{sc}^{g}(c_k)$, divided by the total number of codons in g, denoted |g|) for every gene
individually, and then multiply these values pairwise and with the number of codon pairs in
that sequence (which is |g|-1):

$$n_{exp}^{gh1}((c_i,c_j)) = \sum_{g} \frac{n_{sc}^{g}(c_i)}{|g|} \cdot \frac{n_{sc}^{g}(c_j)}{|g|} \cdot (|g|-1)$$
[0133] In this equation "
gh1" denotes Gutman and Hatfield method 1 (1989,
supra). This results in expected codon pair values for each gene (the part after the sum
operator in the equation above), which are then added up, resulting in final expected
values that are by definition adjusted for possible deviations in single codon usage
among different genes of the same genome, but do not take a possible bias in amino
acid pair usage into account. This means that if certain amino acids tend to be next
to each other more often than others, or, in other words, if the numbers of occurrences
of the amino acid pairs are not similar to what they would be in randomized sequences
with the same amino acid composition, the expected values would also be significantly
different in that codon pairs encoding rather rarely used amino acid pairs would have
too high expected values and those of more often used amino acid pairs too low ones.
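A minimal Python sketch of this per-gene calculation (codon frequencies multiplied pairwise and with the number of codon pairs |g|-1, then summed over all genes) is given below. The function name `expected_pairs_gh1` and the toy genes are hypothetical.

```python
from collections import Counter


def expected_pairs_gh1(genes):
    """Sum of per-gene expectations f(ci) * f(cj) * (|g| - 1); genes are lists of codons."""
    expected = Counter()
    for codons in genes:
        size = len(codons)
        if size < 2:
            continue  # a gene with fewer than two codons contains no codon pair
        freq = {codon: n / size for codon, n in Counter(codons).items()}
        for ci, fi in freq.items():
            for cj, fj in freq.items():
                expected[(ci, cj)] += fi * fj * (size - 1)
    return expected


genes = [["GCT", "GCT", "AAA", "AAA"], ["GCT", "AAA"]]
expected = expected_pairs_gh1(genes)
```

Because the codon frequencies of each gene sum to one, the expected values of all codon pairs add up to the total number of codon pairs (here 3 + 1 = 4). No amino acid pair counts enter the calculation, which is why amino acid pair bias is not accounted for by this method.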
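[0134] Gutman and Hatfield (1989, supra) also proposed a method of normalizing their expected
values for amino acid pair bias. To this end, they simply compared the expected number of
amino acid pairs according to their method with the observed ones and scaled the expected
values of all affected codon pairs accordingly to make the former match the latter:

$$n_{exp}^{gh2}((c_i,c_j)) = n_{exp}^{gh1}((c_i,c_j)) \cdot \frac{\sum_{c_m \in syn(c_i)} \sum_{c_n \in syn(c_j)} n_{obs}((c_m,c_n))}{\sum_{c_m \in syn(c_i)} \sum_{c_n \in syn(c_j)} n_{exp}^{gh1}((c_m,c_n))}$$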
[0135] In this equation "
gh2" denotes Gutman and Hatfield method 2 (1989,
supra).
1.1.3 Calculating codon pair bias
[0136] The actual codon pair bias $bias((c_i,c_j))$ should then result from the difference
between the expected and actual (observed) numbers of the codon pairs (where any of these
methods for the expected values can be used). The initial approach was to calculate it simply by

$$bias((c_i,c_j)) = \frac{n_{obs}((c_i,c_j)) - n_{exp}((c_i,c_j))}{n_{exp}((c_i,c_j))}$$
[0137] This way, the bias value would indicate how many percent more or less often than
expected the codon pair is actually used (if multiplied by 100%, that is). For amino
acid pairs not occurring in an analyzed set of genes, the bias value according to
the formula would be 0/0 for all corresponding codon pairs. In that case, it is defined
to be 0. The lower limit of the bias values would thus be -1, whereas there is no
clear upper limit. This was considered somewhat impractical, so instead

$$bias((c_i,c_j)) = \frac{n_{obs}((c_i,c_j)) - n_{exp}((c_i,c_j))}{\max(n_{obs}((c_i,c_j)),\, n_{exp}((c_i,c_j)))}$$

was used, where max(a,b) denotes the greater of the two values a and b, which always
results in a bias value in [-1,1). This means that the bias value can be -1, but not
+1. The former happens when a certain codon pair is not used at all to encode
an amino acid pair that really occurs; the value +1 cannot be reached because
$n_{exp}((c_i,c_j))$ would have to be 0 then, but this is only possible when
$n_{obs}((c_i,c_j))$ is 0, too.
[0138] The interpretation given above is still valid for bias values <0 (which means that
$n_{obs}((c_i,c_j)) < n_{exp}((c_i,c_j))$, so both formulas have the same result). If
$n_{obs}((c_i,c_j)) > n_{exp}((c_i,c_j))$, the bias values (which are >0 then) indicate how
many percent lower than the observed value the expected value is (i.e. in that case the
baseline is changed).
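The two bias definitions discussed above can be compared with a short Python sketch. The function names are hypothetical; the convention that 0/0 is defined as 0 follows the text.

```python
def bias_unbounded(n_obs, n_exp):
    """Initial approach: relative deviation from the expected value (no upper limit)."""
    if n_obs == 0 and n_exp == 0:
        return 0.0  # amino acid pair does not occur: 0/0 is defined as 0
    return (n_obs - n_exp) / n_exp


def bias_bounded(n_obs, n_exp):
    """Variant that divides by the greater of the observed and expected values."""
    if n_obs == 0 and n_exp == 0:
        return 0.0
    return (n_obs - n_exp) / max(n_obs, n_exp)


under = (bias_unbounded(5, 10), bias_bounded(5, 10))
over = (bias_unbounded(30, 10), bias_bounded(30, 10))
```

For an under-represented pair (5 observed, 10 expected) both definitions agree, whereas for an over-represented pair (30 observed, 10 expected) the first definition exceeds 1 and the second expresses the difference relative to the observed value.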
1.1.4 Statistical significance of the bias
[0139] Gutman and Hatfield (1989, supra) used a χ²-test to determine the statistical
significance of their results. This test is used to check how likely it is that certain
observed results occurred by chance under a specific hypothesis. When examining codon pairs,
this hypothesis would be that the codon pair usage is the result of a random selection of
every codon independently. To test this hypothesis, a χ²-value is calculated

$$\chi^2 = \sum_{(c_i,c_j) \in CP} \frac{\left(n_{obs}((c_i,c_j)) - n_{exp}((c_i,c_j))\right)^2}{n_{exp}((c_i,c_j))}$$

(with CP denoting the set of all codon pairs not including a stop codon). The number
of degrees of freedom is then 3720 (61*61-1). If codon pair selection were random,
one would expect the χ²-value to be around 3720 (equal to the number of degrees of freedom)
with a standard deviation equal to the square root of 2*degrees of freedom.
[0140] This way, the overall statistical significance of the observed bias can be tested.
However, one can also deduce the statistical significance of the bias of individual
codon pairs. As for the method of calculating expected values proposed earlier, the
number of occurrences of a codon pair is considered to be the result of a sequence
of independent yes/no experiments (yes: these two codons are selected for encoding
the respective amino acid pair; no: another codon pair is selected), so it follows
a binomial distribution, which can be approximated by a normal distribution if the
set of analyzed genes is sufficiently large. This is considered a good approximation
if n*p>4, where n stands for the number of experiments and p for the probability of
"yes", so that n*p is the expected value. Therefore, for every codon pair a standard
deviation can be calculated according to the formula

$$\sigma((c_i,c_j)) = \sqrt{n \cdot p \cdot (1-p)} = \sqrt{n_{exp}((c_i,c_j)) \cdot (1-p)}$$
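[0141] Then, the standard scores, also referred to as z-scores, can be calculated

$$z((c_i,c_j)) = \frac{n_{obs}((c_i,c_j)) - n_{exp}((c_i,c_j))}{\sigma((c_i,c_j))}$$

where $\sigma((c_i,c_j))$ is the standard deviation calculated for that codon pair.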
[0142] The absolute value of the z-score tells how many standard deviations away from the
expected value the actual (observed) value is. Assuming a normal distribution, approximately
95% of all observations should be within two standard deviations from the expected
value and >99% within three.
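The χ²-value and the z-scores can be sketched in Python as follows. The sketch assumes the binomial standard deviation sqrt(n_exp * (1 - p)) and takes p to be the product of the two single codon ratios; that choice of p and the function names are assumptions made for illustration.

```python
import math


def chi_square(observed, expected):
    """Sum of (n_obs - n_exp)^2 / n_exp over all codon pairs with a non-zero expectation."""
    return sum((observed.get(pair, 0) - n_exp) ** 2 / n_exp
               for pair, n_exp in expected.items() if n_exp > 0)


def z_score(n_obs, n_exp, p):
    """Standard score, assuming sigma = sqrt(n_exp * (1 - p)) from the binomial model."""
    sigma = math.sqrt(n_exp * (1.0 - p))
    return (n_obs - n_exp) / sigma


# Invented toy values: a pair observed 12 times, expected 7.5 times,
# with p = 0.75 * 0.5 (product of two single codon ratios; assumption).
z = z_score(12, 7.5, 0.75 * 0.5)
chi2 = chi_square({"a": 12, "b": 3}, {"a": 7.5, "b": 7.5})
```

With these toy numbers the observed value lies about 2.08 standard deviations above the expected value.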
1.2 Results
1.2.1 Existence of codon pair bias
[0143] Using the above methods we have found that significant codon pair biases exist. For
all investigated organisms, the χ²-test delivered χ²-values several times as high as the
number of degrees of freedom and thus also many standard deviations above the expected value.
As for the bias of individual codon pairs, the finding of Moura et al. that in yeast "about
47% of codon-pair contexts fall within the interval -3 to +3" standard deviations away from
the expected values (although they calculated the expected values in a different way) could
be confirmed; these standardized deviations correspond to the z-scores in our analysis.
Overall, there are significantly more codon pairs with rather high z-scores than there should
be if codon pair usage were random. See Table 1.2: with a random selection, which would result
approximately in a normal distribution, only about 5% of all codon pairs should have
a z-score greater than 2 or less than -2, but in the whole genome of the investigated
organisms this actually applies to roughly two thirds or more.
Table 1.2. Z-scores in different organisms (fraction of codon pairs whose absolute z-score
exceeds the given threshold)
| absolute z-score | >1 | >2 | >3 |
| normal distribution | 31.7% | 5.0% | 0.3% |
| A. nidulans | 86.1% | 73.7% | 60.4% |
| A. niger | 89.2% | 79.1% | 69.7% |
| A. oryzae | 88.4% | 76.7% | 65.1% |
| B. amyloliquefaciens | 88.1% | 76.4% | 64.0% |
| B. subtilis | 86.1% | 72.0% | 59.3% |
| E. coli K12 | 86.1% | 74.8% | 64.0% |
| K. lactis | 82.6% | 67.0% | 53.4% |
| P. chrysogenum | 89.3% | 79.1% | 69.0% |
| S. cerevisiae | 82.7% | 67.6% | 52.1% |
| S. coelicolor | 82.0% | 66.5% | 53.5% |
| T. reesei | 89.0% | 79.8% | 71.0% |
[0144] Note that these values are somewhat correlated with genome size (see Table 1.1 for
a comparison), i.e. organisms with larger genomes tend to have codon pairs with more
extreme z-scores. Especially when analyzing smaller groups of genes (e.g. 479 highly
expressed ones in
A. niger), the values are lower (for this example: 65.1%, 37.2% and 19.7%, respectively),
as smaller numbers of occurrences lead to higher standard deviations (compared to
the expected values) and thus to less statistical significance of the results. This
leads to the conclusion that codon pair usage is not the result of a random selection
of the codons according to the single codon ratios.
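For comparison with the observed fractions, the share of values expected beyond one, two and three standard deviations under a normal distribution can be computed with the complementary error function. This is a small Python check using only the standard library; the function name is hypothetical.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2.0))


tails = {k: two_sided_tail(k) for k in (1, 2, 3)}
```

The resulting fractions are about 31.7%, 4.6% and 0.3%, far below the fractions observed for the analyzed genomes.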
[0145] The distribution of the bias values themselves differs from one organism to another.
This can be explained with reference to Figure 3 which shows the distribution of codon
pair bias values for the 3,721 sense:sense codon pairs in different organisms. The
numbers in the top right corner of each histogram in Figure 3 are the standard deviations
for the observed distribution; the mean values (not shown) are between -0.06 and -0.01
for all organisms. In the histograms shown in Figure 3, one can see that out of the
ten tested organisms, the bacteria E. coli, B. subtilis, B. amyloliquefaciens and
S. coelicolor have the most extreme codon pair bias, whereas bias in the fungi
A. niger, A. oryzae, A. terreus, A. nidulans, P. chrysogenum and the yeasts
S. cerevisiae and K. lactis is less extreme.
[0146] Another interesting observation can be made when comparing codon pair bias of different
organisms. Bias values from related organisms show a higher correlation than those
from unrelated organisms. This is explained with reference to Figure 4. Figure 4 shows
correlation in codon pair bias of various organisms. A correlation coefficient is
shown in the top right corner of each subplot. In this analysis, the highest correlations
could be observed between
A. niger vs.
P. chrysogenum, and
A. niger vs.
A. oryzae, the lowest,
i.e. effectively no correlation could be observed between
B. subtilis and
S. coelicolor. Interestingly, no negative correlations have been observed. This means that although
organisms with a high GC-content (like S. coelicolor) mostly prefer those codons that are the
less used ones in AT-rich organisms (like S. cerevisiae or, although not extremely AT-rich,
B. subtilis), there are no two organisms where the preferred pairs of one organism are likely
to be rejected in the other and vice versa. This could mean that although the bias of
almost every single codon is organism-dependent, there are several codon pairs that
are preferred and/or rejected in almost every organism (e.g. because of their likelihood
of causing frameshifts or because the tRNAs involved have non-matching structures).
1.2.2 Patterns in codon pair bias
[0147] In order to visualize the observed codon pair bias, so-called maps can be drawn as
has been done by Moura et al. (2005) (they refer to these maps as "codon context maps").
This can be most easily explained with reference to colored images that consist of
colored rectangles for every codon pair, with the rows representing the first and
the columns representing the second codon of the pair. Red colors indicate a negative
and green ones a positive bias. White represents codon pairs that really have a bias
equal to 0 (which is the case for ATG-ATG, for example, since that is the only way to
encode the amino acid pair Met-Met) and pairs incorporating a stop codon.
[0148] However, colored images cannot be part of the disclosure of a patent application.
For black & white visualization, the image is split into two images in this example.
Figure 5A displays the positive codon pairs for A. niger, while Figure 5B displays the
negative codon pairs for A. niger (see also Appendix 3, Table C1). The more biased the
codon pair, the blacker the corresponding rectangle. The bias values here range from -0.67
to 0.54, whereas in other organisms they might even get slightly above +/-0.9 (see also
Figure 3). The highest intensities of black (originally green in the top image and originally
red in the bottom image) in these diagrams represent values of 0.9 and -0.9, respectively
(not reached here). Mostly, the absolute values of the maximum bias are slightly lower than
those of the minimum bias.
[0149] In addition, we refer to CPW matrix-tables in Appendix 3, which contain the numerical
values of the bias of the codon pairs and we refer to Figure 5 as a black and white
example of the colored image, whereby the skilled person may reconstruct a colored
version using the numerical values from the tables in Appendix 3.
[0150] The first approach to these codon pair maps was to have the rows and columns sorted
according to their alphabetical order (as this is the order of their internal representation).
What could be seen in that map was that the diagonals seemed to contain slightly more
green than red spots, which indicates that many codons have a preference for the same
codon as their neighbor. Furthermore, most neighboring columns were somewhat similar,
whereas neighboring rows mostly were not (colored map not shown; see Figures 5A and 5B and
Appendix 3, Table C1). However, most rows were similar to a row separated by three others,
i.e. there was some similarity of every fourth row.
[0151] Since the common property of every fourth row is the last nucleotide of the first
codon of the pairs, it is preferred to sort the rows according to the alphabetical
order of the third position as first sorting criterion and the middle position as
second. What can then be seen in the map for A. niger (Figures 5C and D, and Appendix 3,
Table C1) is that bias indeed seems to correlate mainly with the last nucleotide of the
first (5') and the first nucleotide of the second (3') codon, as most values of the
respective blocks of 16*16 codon pairs have the same color. For example, a general rule
that can be identified in Aspergillus is that codon pairs like xxT-Axx (x denoting any
nucleotide, indicating that the one at the respective position is not important for the
specified rule) are rejected (red block in the lower left corner), whereas the pattern
xxA-Txx characterizes preferred codon pairs (green block in the top right corner), again
indicating that codon pair bias is directional. However, not all bias can be explained just
with patterns in the two neighboring nucleotides in the "middle" of the codon pair. xxC-Axx
codon pairs, for example (see second block from top on the very left), are not generally
preferred or rejected, but there is a clear preference for pairs of the pattern xxC-AAx (note
the four green columns on the left of the block just mentioned). Bias can also depend
on non-neighboring nucleotides (e.g. the strong rejection of CxA-Gxx pairs in
B. subtilis; see Figures 6A and 6B and Appendix 3, Table C4). Unfortunately, codon pair bias
cannot always be attributed to such "simple" patterns (see for example the rather chaotic
map for E. coli in Figures 7A and B and Appendix 3, Table C5). Even when performing a cluster
analysis using Spotfire DecisionSite 8.0 (http://www.spotfire.com/products/decisionsite.cfm)
no general properties could be found (data not shown), i.e. the identified clusters
consisted mostly of unrelated codons (i.e. no common nucleotides at the same position).
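The row ordering described above (third position first, middle position second) and the resulting blocks of 16*16 codon pairs can be reproduced with a few lines of Python. Using the first position as a third sorting criterion and keeping the columns in plain alphabetical order are assumptions of this sketch; the function name `block` is hypothetical.

```python
from itertools import product

# All 64 codons in alphabetical order (the internal representation).
codons = ["".join(p) for p in product("ACGT", repeat=3)]

# Rows: sort by third position, then middle position (then first; assumption).
rows = sorted(codons, key=lambda c: (c[2], c[1], c[0]))
# Columns: plain alphabetical order, i.e. grouped by the first nucleotide.
columns = codons


def block(first_codon, second_codon):
    """(row block, column block) of a codon pair in the map, each in 0..3."""
    return rows.index(first_codon) // 16, columns.index(second_codon) // 16


lower_left = block("GGT", "ATG")   # pattern xxT-Axx
upper_right = block("GGA", "TTT")  # pattern xxA-Txx
```

With this ordering, every block of 16 rows shares the last nucleotide of the first codon and every block of 16 columns shares the first nucleotide of the second codon, so xxT-Axx pairs fall in the lower left block and xxA-Txx pairs in the top right block.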
1.2.3 Relation of bias and expression level
[0152] Looking at the bias map for the genes with high expression level (or better: presumably
high expression level, since they were identified by looking at transcription levels
only) of
A. niger (see Figure 8), the existence of larger groups, i.e. blocks in the diagram, is not
as obvious (or, in other words, simple rules as described above might not exist at
all). Yet since two thirds of all codon pairs occur 36 times or fewer in this group,
and because of the on average much lower z-scores as mentioned above, one can attribute
this to a large extent to random fluctuations.
[0153] Figure 9 shows a scatter plot of bias in a group of 479 highly expressed genes (vertical
axis) versus the bias in all genes (horizontal) of
A.
niger. All 3,721 codon pairs not involving stop codons are shown.
[0154] Shading from light gray to black was assigned according to the absolute values of
the z-scores in the overall genome (i.e. light dots in the plot do not have a significant
bias in all genes), as were sizes according to the absolute z-scores in the highly
expressed group, i.e. very small dots do not have a significant bias there (here |z-score|<1.9).
The solid black line indicates where both bias values are equal; the dashed black
line shows the best linear approximation of the actual correlation (identified by
principal component analysis); its slope is around 2.1.
[0155] When comparing the two bias values of each codon pair in the highly expressed group
and in the full genome (see the scatter plot in Figure 9), one can see that for most
pairs the bias in the highly transcribed group is more extreme, i.e. lower if it is
below 0 and higher if it is positive, but there are some pairs where the bias values
are quite different and even have a different sign. However, these are mostly codon
pairs with a small number of occurrences in the top group, and most pairs where the
bias is highly significant (dark, large dots) have similar biases in both groups
(i.e. they are close to the solid line that indicates where both bias values are equal).
[0156] No specific patterns regarding similar bias differences of codons that share two
of the three nucleotides could be found (neither for
A. niger nor for
B. subtilis), i.e. in plots of the bias difference analogous to the one above there were no larger
groups with similar bias difference.
1.3. Details of the identification of codon pair weights for gene adaptation
[0157] Codon pair weights for adaptation can now be determined according to the described
methods (Appendix 1: Codon pair weights - method one sequence group (or genome)):
- 1. based on the full set of genes;
- 2. based on a subset of 1., being identified as the fraction of highly expressed genes.
[0158] In addition, we started a search to identify codon pair weights that clearly relate
to a higher transcription level, which is required for an improved method for adaptation
of codon pair usage. The following methods have been applied. In A. niger, where a complete
ranking extracted from GeneChip data was available for the aforementioned
set of 4,584 actually expressed genes (see "Data" in "Materials and Methods"), the
mean codon pair weights of each gene (i.e. the equivalent of the fitcp(g) values)
were calculated. Then the genes were sorted according to fitness values (ascending
order) and expression level (descending order). Since highly expressed genes are supposed
to have low codon pair fitness values, these two rankings would be equal when using
ideal codon pair weights, so a comparison of these two rankings can give information
about the quality of the weights used in the fitness function (where slightly more
attention was given to the "correct" ranking of the highly expressed genes than to
the ranking of the mediocre ones). Additionally, the correlation coefficient (covariance
divided by the product of the standard deviations of the two variables) between ranking
and average codon pair weights of the 4,584 genes was calculated.
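The evaluation of a weight set can be sketched in Python as follows: the mean codon pair weight of each gene is computed and then correlated with the expression ranking. The function names, the treatment of pairs without a weight as 0, and the toy data are assumptions made for illustration.

```python
import math


def mean_pair_weight(codons, weights):
    """Mean weight of all adjacent codon pairs of a gene (pairs without a weight count as 0)."""
    pairs = list(zip(codons, codons[1:]))
    return sum(weights.get(pair, 0.0) for pair in pairs) / len(pairs)


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sd_x * sd_y)


# Invented toy weights (negative = preferred) and genes ranked by expression.
weights = {("GCT", "AAA"): -0.4, ("AAA", "GCT"): 0.2}
fitness = mean_pair_weight(["GCT", "AAA", "GCT"], weights)
ranks = [1, 2, 3]               # 1 = most highly expressed
mean_weights = [-0.3, -0.1, 0.2]
r = correlation(ranks, mean_weights)
```

With weights that are negative for preferred pairs, a good weight set gives low mean weights to the top-ranked genes, i.e. a clearly positive correlation between rank and mean weight.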
[0159] Several possible sets of weights have been examined, including
- i. bias values from the whole genome,
- ii. bias values of the highly expressed group,
- iii. bias with all the values that do not have a certain minimum z-score set to zero,
- iv. bias values raised to the power of 2 (and some other values) to give highly preferred
or rejected codons a lower/higher influence,
- v. combinations thereof,
- vi. z-scores themselves,
- vii. difference of bias values/z-scores from the highly expressed group and the full genome.
[0160] For the genetic algorithm (GA), their negations have been used, since preferred codon
pairs had been identified with positive values (rather arbitrarily), but the GA performs
minimization. This applies to all weights mentioned.
[0161] Out of these, the "best" weight matrix turned out to be a combination of items ii
to iv. However, an even better one could be obtained, as described above, by calculating
the codon pair "bias" in the highly expressed group using expected values calculated
based on the codon ratios of the whole genome. Figure 10 shows the correlation that
is observed.
[0162] Unlike all other weight sets tested, codon pairs involving codons that are more underrepresented
in the highly expressed group get a slight disadvantage here. Thus, these weights
are the only ones that also reflect the different single codon bias of the highly
expressed group and all genes. Using these weights carries the risk of rejecting some
codon pairs that actually have a positive bias in the highly expressed group, but
consist of (in the highly expressed group) rarely used codons. However, since our
desired single codon ratios are usually not identical to those in the group of genes
with high expression, but more "extreme" than these, single codon optimization would
replace these underrepresented codons anyway, so we can consider the weights described
above very convenient for codon pair optimization.
[0163] Concluding, a potentially improved codon pair weight matrix for gene adaptation has
been identified as described above. The equation is given in Appendix 1: Codon pair
weights - method highly expressed group with reference group (or genome).
1.4. Single codon and codon pair optimization in silico
1.4.1 Material and methods
[0164] The developed MATLAB toolbox for analyzing and optimizing genes consists of several
functions that have been organized in different directories according to their capacities.
In order to use them, it is therefore necessary to make all of them known to the MATLAB
environment. To do this, select "Set Path" from the File menu and then click "Add
with subfolders" and select the path where the toolbox is installed (usually called "Matlab-bio").
Also add the location of FASTA and other files that should be analyzed. All individual
MATLAB functions are briefly described in "contents.m" (type "help Matlab-bio" to
display this file in the MATLAB environment and use "help" followed by a function's
name to get detailed information about it). For gene optimization focusing on codon
pair usage, the two important functions are "fullanalysis" and "geneopt".
[0165] If the full genome of an organism you want to adapt a gene to is located in the file,
say, "Aniger_ORF.fasta" and the identifiers of its highly expressed genes are in
"an-high.txt", type "fullanalysis ('Aniger_ORF.fasta', 'an-high.txt', 'an') ;" and you will get
(i) a codon pair bias map for the full genome, (ii) a codon pair bias map for the
group of genes in the second file and (iii) several variables (i.e. sets of temporarily
stored data) in the MATLAB workspace for further use. The third
parameter of "fullanalysis" determines only how these variables are named and can
be omitted if only one genome is to be analyzed at the same time. Among the mentioned
variables are: (i) codon pair usage and bias data for the full genome (named "cpan" in this
example), (ii) the same for the special group of genes specified by the
second parameter (named "cpans") and (iii) a structure with target single codon ratios and
codon pair weights that can be used for the genetic algorithm. "fullanalysis ('Xyz_ORF.fasta');"
will only show the codon pair bias map and store the bias data for the respective genome.
[0166] Although the second parameter may be any file that includes gene identifiers (e.g.
a set of genes with low expression or genes with a certain common function), it is
always treated like a set of highly expressed genes regarding this (potential) parameter
(named "
optparamforan" in the example, which stands for
the optimization parameter for the specified organism). Note that the single codon ratios here are simply calculated

, which is an acceptable approximation. Target ratios might be as well identified
by other methods that include the details of the single codon distribution (see main
text) in order to further improve specification of desired ratios. In addition, target
ratios may be left empty when no specific bias is found, in order to give the codon-pair
algorithm more freedom in finding solutions with a higher codon-pair fitness. Several
of such pre-determined single-codon target vectors are given in Appendix 1, for various
host organisms.
[0167] To use pre-specified single-codon target ratios for the genetic algorithm, change
the field "cr" of the parameter by typing "optparamforan.cr = [", then paste the single
codon ratios (e.g. copied from an Excel sheet; note that they should be in alphabetical
order of the codons), type "];" if the ratios are available as a 64-element row or "]';" if
they are copied from a column and press enter (note the additional single quotation mark
or apostrophe following the closing bracket in the latter case). Unimportant codon
ratios, i.e. codons where no specific target ratio is desired, may be assigned the "value"
NaN (not a number) and they will be ignored when single codon fitness is calculated.
[0168] To exclude certain short sequences from the optimized gene, set the parameter "
rs" in the same way, where each sequence must be enclosed by single quotation marks
and all sequences together must be enclosed in braces,
e.g. (without the line break) "optparamforan.rs = {'CTGCAG' 'GCGGCGCC'};". Finally, the
field
cpi of the parameter might be changed to give single codon optimization or codon pair
optimization a higher importance in the combined fitness function (see the subsection
"performing codon pair optimization" in "results and discussion"). The default value
is 0.2. Set it to a lower value if the results of the experiments with codon pair
optimized genes reveal little improvement of codon pair optimized genes compared to
single codon optimized ones; in the opposite case, a higher
cpi might be better.
[0169] The actual optimization of the gene using the genetic algorithm can then be performed
using the function geneopt. The only parameters needed are the sequence to be optimized
and the structure containing codon pair weights, target ratios and restriction sites
as described above, so geneopt ('MUVARNEQST*', optparamforan); could for example be
used to optimize the given (rather short) protein sequence for high expression in
A. niger; the '*' is used to denote that the resulting genetic sequence should have a stop
codon at the end (however, as the optimal stop signal in A. niger is believed to be the
tetramer TAAA, this is not necessary). Note that the sequence
to be optimized must again be enclosed in single quotation marks; if the sequence
contains only the letters A, C, G, T or U and its length is a multiple of 3, it is
automatically regarded as a nucleotide sequence. The genetic algorithm then runs for 1000
generations with a population size of 200, of which 80 are kept for each new generation
(the 79 best and one randomly picked) and used to generate new individuals, where 40% of the
new individuals are generated using crossover and 60% using the mutation operator.
These default values turned out to be very convenient for the optimization, i.e. changes
in these parameters will only, if at all, lead to very slightly "better" genes, but
they can be changed as well, for example if significantly more or less calculation
time should be spent on the optimization (an average run of geneopt with a gene of
about 500 codons takes about 15 minutes on a 1.4 GHz Pentium M Processor). geneopt
(seq, optparamforan, [50 750 5 0 0.6]) will, for example, let the genetic algorithm
calculate 750 generations of a population where 50 individuals are kept for each new
generation and 250 are newly generated (5*50;
i.e. 300 individuals are examined in each generation), only the best (and no randomly
picked) individuals are kept and 60% of the recombinations are performed using the
crossover operator. For more details on how to specify these parameters, type help
geneopt and help geneticalgorithm.
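How such a parameter vector could drive a genetic algorithm is illustrated by the following toy Python sketch. It is not the geneopt implementation: it optimizes codon pair weights only, assumes the standard genetic code, uses one-point crossover and single-codon mutation as operators, and all function and parameter names (`optimize`, `keep`, `factor`, `n_random`, `crossover`) are hypothetical.

```python
import random
from itertools import product

# Standard genetic code (assumption), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
AA_OF = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYN = {}
for codon, aa in AA_OF.items():
    SYN.setdefault(aa, []).append(codon)


def fit_cp(codons, weights):
    """Mean codon pair weight of a gene; lower is better (minimization)."""
    pairs = zip(codons, codons[1:])
    return sum(weights.get(pair, 0.0) for pair in pairs) / (len(codons) - 1)


def optimize(protein, weights, keep=50, generations=750, factor=5,
             n_random=0, crossover=0.6, seed=0):
    """Toy genetic algorithm over synonymous codon choices (illustration only)."""
    rng = random.Random(seed)

    def score(individual):
        return fit_cp(individual, weights)

    parents = [[rng.choice(SYN[aa]) for aa in protein] for _ in range(keep)]
    for _ in range(generations):
        children = []
        for _ in range(factor * keep):
            child = list(rng.choice(parents))
            if rng.random() < crossover:
                # One-point crossover with a second parent.
                other = rng.choice(parents)
                cut = rng.randrange(1, len(protein))
                child[cut:] = other[cut:]
            else:
                # Mutation: redraw one codon among its synonymous codons.
                pos = rng.randrange(len(protein))
                child[pos] = rng.choice(SYN[protein[pos]])
            children.append(child)
        # Parents and children are examined together; the best are kept,
        # optionally together with some randomly picked individuals.
        pool = sorted(parents + children, key=score)
        n_best = keep - n_random
        parents = pool[:n_best] + rng.sample(pool[n_best:], n_random)
    return min(parents, key=score)


# Invented toy weights (negative = preferred pair) and a short protein.
weights = {("GCT", "AAA"): -1.0, ("AAA", "GCT"): -1.0}
best = optimize("MAKAK", weights, keep=10, generations=30)
```

In this sketch, keep=50 with factor=5 means that 250 new individuals are generated and 300 individuals are examined in each generation. Every candidate is built from synonymous codons only, so the encoded protein never changes during the optimization.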
[0170] Note that although the procedure of generating codon pair weights from analyzing
the corresponding FASTA files is shown and described here for
A.
niger and B. subtilis, just for these two organisms this is not necessary because these
calculations have already been performed for previous gene optimizations. For easier
use, the respective parameters for the genetic algorithms have been stored (type "load
gadata_for_an" or "load gadata_for_bs", respectively; note that the parameters there
are now simply called an_param and bs_param).
1.4.2 Results
[0171] Figure 11 shows fitness values of five optimized versions each for different values
of
cpi (see legend of the diagram in Figure 11). The protein is a fungal α-amylase (FUA;
also referred to as AmyB) that was optimized for the host
A.
niger (see Example 2). Additionally, the results of "pure" single codon optimization (black
dots on the right) and codon pair optimization are shown (group top left). The optimized
versions were obtained by running the genetic algorithm for around 1000 generations
with a population size of 400, which took about 17 minutes for each run on a 1.4 GHz
Pentium M. Note that pure single codon optimization and pure codon pair optimization
took only about 60% of that time.
[0172] In Figure 11 the wild type (
fitsc(
gfua)=0.165,
fitcp(
gfua)=0.033) does not fit on this plot (it would be far to the right and above). The optimal
gene is always the one with the lowest values for
fitsc and
fitcp. Given the position of the dots, it is therefore not clear for which value of
cpi the most improved gene could be obtained, since we do not know yet whether single
codon usage or codon pair usage is more important. However, a fair trade-off seems
to appear in case of
cpi = 0.2.
[0173] The improvement in single codon and codon pair usage can be visualized in so-called
sequence quality plots proposed in this work. Figure 12 illustrates two diagrams which
show the sequence quality of the first 20 (out of 499) codons of the aforementioned
FUA (see also Example 2).
[0174] Note that these sequence quality diagrams not only depend on the sequence itself,
but also on the set of weights and the desired single codon ratios and thus on the
organism. Note also that it is possible to define target single codon ratios as "don't
care" for those codons with low or no codon bias, i.e. the usage of a certain codon
is not considered positive or negative for expression compared to its synonymous codons.
In that case, only the blue x-mark is shown for the actual ratio of the respective
codon in the gene and that particular position is ignored when calculating single
codon fitness (see 1.4. Single codon and codon pair optimization
in silico).
1.5 Conclusions
[0175] A significant correlation of codon pair usage and transcription levels has been established
in a wide range of organisms. It was demonstrated that this bias cannot be explained
by dinucleotide bias around the reading frame site alone. Since possible explanations for
preference or rejection of certain codon pairs all focus on the translation, it should
be assumed that both are caused by natural selection acting at the same time on characteristics
that affect translation and other characteristics that affect transcription in order
to minimize the cell's efforts to produce enzymes or at least the more important of
them.
[0176] Optimizing codon pair usage in polypeptide coding sequences can thus be considered
for achieving improved overexpression, in addition to classic single-codon optimization
or single codon harmonization, where only single-codon frequencies are considered
for optimization. Codon pair adaptation and single codon adaptation of the same gene
interfere only slightly for the investigated fungal host class and the bacilli in
this example,
i.e. both can be performed at the same time and the result will have "better" single
codon usage and "better" codon pair usage than the wild-type gene, and any of the
two aspects can only be improved slightly when ignoring the other one.
[0177] To read the FASTA files and perform the analysis and optimization, user-friendly
MATLAB functions have been designed. New methods of visualizing codon pair bias and
codon pair usage of single genes have been introduced as well, see Example 2 and Example
4. The genetic algorithm designed for the optimization allows effective dealing with
the constraints imposed by interdependence of adjacent codon pairs while the specially
designed mutation operators that always improve one of the two aspects of sequence
quality (single codon and codon pair fitness) help to circumvent the inefficiency usually
accompanying genetic algorithms because of their trait of generating many bad possible
solutions in the recombination step after the first few generations.
[0178] The proper codon pair usage influences enzyme production, which will be shown experimentally
in the following examples. Codon pair optimized variants of three genes to be expressed
in
B. subtilis have been prepared, of which one each will be compared to a synthetic gene that has
adapted single codon usage only and another one to a synthetic gene that has gone
through the optimization process using the negation of the presumably positive weights,
but still been optimized for single codon usage the same way as before, see Example
4 and Example 5. This way, the notion of Irwin
et al. (1995) that underrepresented codons stimulate translation, which was rejected here,
will also be put to the test. For A.
niger, a codon pair optimized version of the aforementioned amyB will be tested and compared
to the wild-type and synthetic gene with single codon harmonization, see Examples
2 and 3.
2. Example 2: Use of a method of the invention for construction of improved DNA sequences
for improving production of the Aspergillus niger fungal amylase enzyme in A. niger.
[0179] Below, the method of the invention is applied to design novel nucleotide sequences
for the AmyB (FUA) gene of
A. niger, which are optimized in single codon and / or codon pair usage for improved expression
in
A. niger. This method can be applied the same way for the improvement of codon use of any nucleotide
sequence.
2.1 Introduction
[0180] A concept of single-codon optimization by means of codon-harmonization was previously
developed by the applicants of this invention and reported in the main text (see also
example 3). In this example we show how the method of the invention was applied to
design a gene that was optimized for both single codon and codon pair usage. In this
specific case weight matrices are applied that have been created by applying two subsets
of 2% and 4% of highly expressed genes of the full
A. niger genome that contains 14,000 genes. For the single-codon usage the algorithm has driven
the solution to a gene with synonymous codon-frequencies as defined by Table B.1 (=
column 3 of Table 2.1), while for the codon-pair usage it has optimized toward an
optimal set of codon-pairs with a high frequency of them having associated negative
weights (in Table C.2), being the codon-pairs that are overrepresented with respect
to their expected values in the set of 4% highly expressed genes. Note that in case
one does not have a defined list of highly expressed genes for a specified host, one
can also (i) apply the weight matrices of a similar host organism, for example the
P. chrysogenum matrices can be applied for
A. niger; or (ii) apply the full genome sequence data or a subset of it to derive good, but
less optimal weight matrices.
2.2 Materials and Methods
2.2.1 Wild-type amyB coding sequence encoding A. niger alpha-amylase AmyB
[0181] The DNA sequence of the
amyB gene encoding the alpha-amylase protein was disclosed in
J. Biochem. Mol. Biol. 37(4):429-438(2004) (Matsubara T., Ammar Y.B., Anindyawati
T., Yamamoto S., Ito K., Iizuka M., Minamiura N. "Molecular cloning and determination
of the nucleotide sequence of raw starch digesting alpha-amylase from Aspergillus
awamori KT-11.") and also can be retrieved from EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/index.html)
under accession number AB083159. The genomic sequence of the native
A. niger amyB gene is shown as SEQ ID NO.1. The corresponding coding or cDNA sequence of
amyB is shown as SEQ ID NO.2. The translated sequence of SEQ ID NO. 2 is assigned as
the SEQ ID NO. 3, representing the
A. niger alpha-amylase protein AmyB. This sequence has also a 100% similarity with the
A. oryzae alpha-amylase protein (
Wirsel S., Lachmund A., Wildhardt G., Ruttkowski E., "Three alpha-amylase genes of
Aspergillus oryzae exhibit identical intron-exon organization."; Mol. Microbiol. 3:3-14 (1989); UniProt accession nr. P10529, P11763 or Q00250). Optimization according to a method
of the invention has been performed on the amyB cDNA sequence.
2.3 Design procedure
[0182] The optimized coding nucleotide sequence SEQ ID NO 6 is the result of a run with
the described software method. The applied parameters were: population size = 200;
number of iterations = 1000;
cpi = 0.20, CPW matrix = "Table C.2. CPW: Aspergillus niger - highly expressed sequences"
and the CR matrix = "Table B.1 column 4: CR table ANS: Aspergillus niger - highly
expressed sequences". Moreover, a penalty value of +1 is added to
fitcombi for each occurrence of a
PstI (CTGCAG) or NotI (GCGGCCGC) site.
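The restriction-site penalty can be illustrated with a minimal Python sketch. It assumes that the penalty is simply +1 per occurrence of a forbidden site, uses the standard recognition sequences CTGCAG (PstI) and GCGGCCGC (NotI), and counts overlapping occurrences; the function name site_penalty is hypothetical and does not reflect the actual implementation of the software described here.

```python
import re

# Standard recognition sequences of the two enzymes whose sites are to be avoided.
FORBIDDEN_SITES = {"PstI": "CTGCAG", "NotI": "GCGGCCGC"}


def site_penalty(seq, sites=FORBIDDEN_SITES, penalty=1.0):
    """Return the total penalty: `penalty` for every occurrence of a forbidden site."""
    seq = seq.upper().replace("U", "T")
    total = 0.0
    for motif in sites.values():
        # A zero-width lookahead lets overlapping occurrences be counted as well.
        total += penalty * len(re.findall(f"(?={motif})", seq))
    return total


print(site_penalty("AACTGCAGTTGCGGCCGCAA"))  # one PstI site and one NotI site
```

Both recognition sequences are palindromic, so scanning one strand is sufficient in this sketch; a non-palindromic site would also require a scan of the reverse complement.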
[0183] Convergence of the solution toward a minimal value for
fitcombi is shown in Figure 13. The obtained objective values for SEQ ID NO. 6 are given in
Table 2.2, together with those for SEQ ID NO. 2 and SEQ ID NO. 5. Figure 14 explains
the single codon statistics for these genes as shown in Figures 15 and 16, and Table
2.1 gives the actual values for the codons in the three sequences. Figures 18-20 show
both single-codon and codon pair statistics for the three gene variants. This type
of graph is explained in detail in Figure 17 and its description. From these graphs
it is clear that single-codon statistics are highly similar for SEQ ID NO. 5 and SEQ
ID NO.6. However, the method of the invention leads to a gene with an improved number
of codon pairs with associated negative weights (
wcp(
g) ≤ 0), 93% vs. 74%, and also a further reduction in
fitcp from -0.18 to -0.34 indicating a more optimal usage of codon pairs having more negative
weights associated with them.
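The two codon-pair statistics quoted above can be sketched in Python as follows. The sketch assumes that fitcp is the mean weight of all adjacent codon pairs in the gene, in line with the description in Example 1, and that the percentage is the share of pairs with a weight of at most zero. The weights in the example are invented toy values, not entries of Table C.2, and the function names are hypothetical.

```python
def split_codons(seq):
    """Split a coding sequence into its codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def codon_pair_stats(seq, weights):
    """Return (mean pair weight, fraction of pairs with weight <= 0).

    `weights` maps (codon, next_codon) tuples to codon-pair weights.
    """
    cs = split_codons(seq)
    pair_weights = [weights[(a, b)] for a, b in zip(cs, cs[1:])]
    if not pair_weights:
        raise ValueError("at least two codons are required")
    fit_cp = sum(pair_weights) / len(pair_weights)
    frac_non_positive = sum(w <= 0 for w in pair_weights) / len(pair_weights)
    return fit_cp, frac_non_positive


# Toy weights for illustration only (hypothetical values).
toy_weights = {("ATG", "GCC"): -0.5, ("GCC", "AAG"): 0.25, ("AAG", "TAA"): -0.5}
print(codon_pair_stats("ATGGCCAAGTAA", toy_weights))
```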
Table 2.1 Codon optimization for amyB.
| AA | Codon | Optimal codon distribution [%] | amyB w.t. [# codons] | amyB w.t. [% codons / AA] | amyB sc optimized [# codons] | amyB sc & cp optimized [# codons] |
| A | Ala_GCT | 38 | 5 | 11.9 | 16 | 18 |
| | Ala_GCC | 51 | 15 | 35.7 | 21 | 23 |
| | Ala_GCA | 0 | 12 | 28.6 | 0 | 0 |
| | Ala_GCG | 11 | 10 | 23.8 | 5 | 1 |
| C | Cys_TGT | 0 | 7 | 77.8 | 0 | 0 |
| | Cys_TGC | 100 | 2 | 22.2 | 9 | 9 |
| D | Asp_GAT | 36 | 20 | 47.6 | 15 | 15 |
| | Asp_GAC | 64 | 22 | 52.4 | 27 | 27 |
| E | Glu_GAA | 26 | 5 | 41.7 | 3 | 3 |
| | Glu_GAG | 74 | 7 | 58.3 | 9 | 9 |
| F | Phe_TTT | 0 | 3 | 20.0 | 0 | 0 |
| | Phe_TTC | 100 | 12 | 80.0 | 15 | 15 |
| G | Gly_GGT | 49 | 10 | 23.3 | 21 | 22 |
| | Gly_GGC | 35 | 18 | 41.9 | 15 | 15 |
| | Gly_GGA | 16 | 10 | 23.3 | 7 | 6 |
| | Gly_GGG | 0 | 5 | 11.6 | 0 | 0 |
| H | His_CAT | 0 | 3 | 42.9 | 0 | 0 |
| | His_CAC | 100 | 4 | 57.1 | 7 | 7 |
| I | Ile_ATT | 27 | 7 | 25.0 | 7 | 7 |
| | Ile_ATC | 73 | 19 | 67.9 | 21 | 21 |
| | Ile_ATA | 0 | 2 | 7.1 | 0 | 0 |
| K | Lys_AAA | 0 | 7 | 35.0 | 0 | 0 |
| | Lys_AAG | 100 | 13 | 65.0 | 20 | 20 |
| L | Leu_TTA | 0 | 1 | 2.7 | 0 | 0 |
| | Leu_TTG | 13 | 10 | 27.0 | 5 | 4 |
| | Leu_CTT | 17 | 4 | 10.8 | 6 | 7 |
| | Leu_CTC | 38 | 13 | 35.1 | 14 | 15 |
| | Leu_CTA | 0 | 3 | 8.1 | 0 | 0 |
| | Leu_CTG | 32 | 6 | 16.2 | 12 | 11 |
| M | Met_ATG | 100 | 10 | 100.0 | 10 | 10 |
| N | Asn_AAT | 0 | 3 | 11.5 | 0 | 0 |
| | Asn_AAC | 100 | 23 | 88.5 | 26 | 26 |
| P | Pro_CCT | 36 | 6 | 27.3 | 8 | 8 |
| | Pro_CCC | 64 | 8 | 36.4 | 14 | 14 |
| | Pro_CCA | 0 | 3 | 13.6 | 0 | 0 |
| | Pro_CCG | 0 | 5 | 22.7 | 0 | 0 |
| Q | Gln_CAA | 0 | 5 | 25.0 | 0 | 0 |
| | Gln_CAG | 100 | 15 | 75.0 | 20 | 20 |
| R | Arg_CGT | 49 | 1 | 10.0 | 5 | 5 |
| | Arg_CGC | 51 | 2 | 20.0 | 5 | 5 |
| | Arg_CGA | 0 | 2 | 20.0 | 0 | 0 |
| | Arg_CGG | 0 | 2 | 20.0 | 0 | 0 |
| | Arg_AGA | 0 | 0 | 0.0 | 0 | 0 |
| | Arg_AGG | 0 | 3 | 8.1 | 0 | 0 |
| S | Ser_TCT | 21 | 4 | 10.8 | 8 | 8 |
| | Ser_TCC | 44 | 9 | 24.3 | 16 | 17 |
| | Ser_TCA | 0 | 4 | 10.8 | 0 | 0 |
| | Ser_TCG | 14 | 10 | 27.0 | 5 | 4 |
| | Ser_AGT | 0 | 4 | 10.8 | 0 | 0 |
| | Ser_AGC | 21 | 6 | 16.2 | 8 | 8 |
| T | Thr_ACT | 30 | 9 | 22.5 | 12 | 12 |
| | Thr_ACC | 70 | 13 | 32.5 | 28 | 28 |
| | Thr_ACA | 0 | 10 | 25.0 | 0 | 0 |
| | Thr_ACG | 0 | 8 | 20.0 | 0 | 0 |
| V | Val_GTT | 27 | 5 | 16.1 | 8 | 9 |
| | Val_GTC | 54 | 12 | 38.7 | 17 | 17 |
| | Val_GTA | 0 | 4 | 12.9 | 0 | 0 |
| | Val_GTG | 19 | 10 | 32.3 | 6 | 5 |
| W | Trp_TGG | 100 | 12 | 100.0 | 12 | 12 |
| Y | Tyr_TAT | 0 | 11 | 31.4 | 0 | 0 |
| | Tyr_TAC | 100 | 24 | 68.6 | 35 | 35 |
Table 2.2 Codon optimization for amyB.
| Sequence | Type | fitsc | fitcp | wcp(g) ≤ 0 | fitcombi (cpi=0.2) |
| SEQ ID NO. 2 | WT | 0.1652 | 0.0329 | 37.3% | 0.090 |
| SEQ ID NO. 5 | sc optimized | 0.0046 | -0.1765 | 73.9% | -0.862 |
| SEQ ID NO. 6 | sc + cp optimized | 0.0109 | -0.3420 | 92.6% | -1.621 |
[0184] All three sequences listed in Table 2.2 are coding sequences of which the translated
sequence is assigned as SEQ ID NO.3.
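As a numerical check on Table 2.2, the combined fitness can be recomputed from the single-codon and codon-pair fitness values. The Python sketch below assumes the combination fitcombi = fitcp / (cpi + fitsc); this form is an assumption made here because it reproduces the tabulated values to within rounding, and the function name fit_combi is hypothetical.

```python
def fit_combi(fit_sc, fit_cp, cpi=0.2):
    """Combined fitness, assuming fitcombi = fitcp / (cpi + fitsc)."""
    return fit_cp / (cpi + fit_sc)


# (fitsc, fitcp) pairs as listed in Table 2.2
table_2_2 = {
    "SEQ ID NO. 2 (WT)": (0.1652, 0.0329),
    "SEQ ID NO. 5 (sc optimized)": (0.0046, -0.1765),
    "SEQ ID NO. 6 (sc + cp optimized)": (0.0109, -0.3420),
}

for name, (fit_sc, fit_cp) in table_2_2.items():
    print(f"{name}: {fit_combi(fit_sc, fit_cp):.3f}")
```

With this form, a smaller cpi gives more influence to the single-codon fitness in the denominator, whereas a larger cpi makes the combined value follow the codon-pair fitness more closely.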
3. Example 3: Testing of the method of the invention for construction of improved
DNA sequences for providing improved production of the Aspergillus niger fungal amylase enzyme in A. niger.
[0185] The method of the invention is below applied to the improvement of single codon and
codon pair use of the AmyB gene of A.
niger. This method can be applied the same way for the improvement of codon use and improved
expression of any nucleotide sequence.
3.1 Material and Methods
3.1.1 Strains
[0186] WT 1: This A.
niger strain is used as a wild-type strain. This strain is deposited at the CBS Institute
under the deposit number CBS 513.88.
[0187] WT 2: This A.
niger strain is a WT 1 strain comprising a deletion of the gene encoding glucoamylase (
glaA). WT 2 was constructed by using the "MARKER-GENE FREE" approach as described in
EP 0 635 574 B1. In this patent it is extensively described how to delete
glaA specific DNA sequences in the genome of CBS 513.88. The procedure resulted in a
MARKER-GENE FREE Δ
glaA recombinant
A. niger CBS 513.88 strain, possessing finally no foreign DNA sequences at all.
[0188] WT 3: This
A. niger strain is a WT 2 strain comprising a mutation which results in an oxalate deficient
A. niger strain. WT 3 was constructed by using the method as described in
EP1590444. In this patent application, it is extensively described how to screen for an oxalate
deficient
A. niger strain. Strain WT 3 was constructed according to the methods of Examples 1 and 2 of
EP1590444; strain WT 3 is mutant strain 22 of
EP1590444 (designated FINAL in
EP1590444).
[0189] WT 4: This
A. niger strain is a WT 3 strain comprising the deletion of three genes encoding alpha-amylases
(
amyB
, amyBI and
amyBII) in three subsequent steps. The construction of deletion vectors and genomic deletion
of these three genes has been described in detail in
WO2005095624. The vectors pDEL-AMYA, pDEL-AMYBI and pDEL-AMYBII, described in
WO2005095624, have been used according to the "MARKER-GENE FREE" approach as described in
EP 0 635 574 B1. The procedure described above resulted in an oxalate deficient, MARKER-GENE FREE
Δ
glaA, Δ
amyA, Δ
amyBI and ΔamyBII amylase-negative recombinant
A. niger CBS 513.88 strain, possessing finally no foreign DNA sequences at all. As such, WT
4 is more optimized for alpha-amylase expression compared to WT1.
3.1.2 A. niger shake flask fermentations
[0190] A. niger strains were pre-cultured in 20 ml pre-culture medium as described in the Examples:
"A. niger shake flask fermentations" section of
WO99/32617. After overnight growth, 10 ml of this culture was transferred to fermentation medium
1 (FM1) for alpha-amylase fermentations. Fermentation is performed in 500 ml flasks
with baffle with 100 ml fermentation broth at 34°C and 170 rpm for the number of days
indicated, generally as described in
WO99/32617.
[0191] This FM1 medium contains per liter: 52.570 g glucose, 8.5 g maltose, 25 g Caseinhydrolysate,
12.5 g Yeast extract, 1 g KH2PO4, 2 g K2SO4, 0.5 g MgSO4.7H2O, 0.03 g ZnCl2, 80.02
g CaCl2, 0.01 g MnSO4.4H2O, 0.3 g FeSO4.7H2O, 10 ml Pen-Strep (Invitrogen, cat. nr.
10378-016), 48 g MES, adjusted to pH 5.6 with 4 N H2SO4.
3.1.3 Fungal alpha-amylase activity
[0192] To determine the alpha-amylase activity in
A. niger culture broth, the Megazyme cereal alpha-amylase kit is used (Megazyme, CERALPHA
alpha amylase assay kit, catalogue. ref. K-CERA, year 2000-2001), according to the protocol
of the supplier. The measured activity is based on hydrolysis of non-reducing-end
blocked p-nitrophenyl maltoheptaoside in the presence of excess glucoamylase and α-glucosidase.
The amount of formed p-nitrophenol is a measure for alpha-amylase activity present
in a sample.
3.2 Construction of an Aspergillus expression construct for the wild-type amyB coding
sequence encoding A. niger alpha-amylase AmyB
[0193] The DNA sequence of the wild-type amyB gene has been described under 2.2.1. For
expression analysis in
Aspergillus species of
A. niger amyB constructs, the strong amyB promoter is applied for over-expression of the alpha
amylase enzyme in
A.
niger using pGBFIN-based expression constructs (as described in
WO99/32617). The translational initiation sequence of the
amyB promoter including ATG start codon of PamyB is 5'-GGCATTTATG ATG-3' or 5'-GAAGGCATTT
ATG-3', dependent on which ATG is selected as start codon. This translational initiation
sequence of PamyB has been modified into 5'-CACCGTCAAA ATG-3' in all subsequent
amyB expression constructs generated below.
[0194] Appropriate restriction sites were introduced at both ends to allow cloning in an
expression vector. The native
amyB gene contains a 'TGA' stop codon. In all
amyB constructs made below, the 5'-TGA-3' translational termination sequence was replaced
by 5'-TAAA-3' followed by the 5'-TTAATTAA-3' of the
PacI restriction site. At the 5'-end an
XhoI site was introduced and at the 3'-end a
PacI site. Therefore, a fragment comprising a modified genomic
amyB promoter and amyB cDNA sequence was completely synthesized, cloned and the sequence
was confirmed by sequence analysis.
[0195] This fragment comprising the alpha-amylase promoter with modified translational initiation
sequence and amyB cDNA sequence with modified translational termination sequence was
digested with
XhoI and
PacI and introduced in an
XhoI and
PacI digested pGBFIN-12 vector (construction and layout as described in
WO99/32617), generating pGBFINFUA-1 (Figure 21). The sequence of the introduced PCR fragment
was confirmed by sequence analysis and its sequence is presented in SEQ ID NO. 4.
3.3 Improvement of the single-codon usage for the alpha-amylase coding sequence amyB
for expression in A. niger
[0196] A method of single-codon optimization is applied below for the improvement of codon
use of the amyB gene of
A. niger. The nucleotide coding sequence of the native amyB is shown as SEQ ID NO.2.
[0197] The codon use of the native
amyB gene of
A. niger and the synthetic optimized variant are given in Table 2.1. For the native
and single-codon optimized synthetic
amyB gene, the exact numbers for each codon are given as well as the distribution per
amino acid. Additionally, the third column provides the proposed optimal distribution,
which is the target for optimization.
[0198] For the group 1 amino acids, there is only one possibility. Group 1 consists of methionine,
which is always encoded by ATG, and tryptophan, which is always encoded by TGG.
[0199] The group 2 amino acids are optimized toward the extreme frequencies of 0% or 100%,
so the strategy is clear. All codons for a group 2 AA are specifically
changed into the optimal variant of the two possible codons. More specifically, for
cysteine, TGT is replaced by TGC; for phenylalanine, TTT by TTC; for histidine,
CAT by CAC; for lysine, AAA by AAG; for asparagine, AAT by AAC; for glutamine, CAA
by CAG; for tyrosine, TAT by TAC.
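The group 2 replacement rule amounts to a fixed codon-to-codon lookup, as in the following minimal Python sketch. The mapping is taken from the paragraph above; the names GROUP2_PREFERRED and optimize_group2 are hypothetical.

```python
# Non-preferred codon -> preferred codon for the group 2 amino acids
# (Cys, Phe, His, Lys, Asn, Gln, Tyr), as listed in the text.
GROUP2_PREFERRED = {
    "TGT": "TGC", "TTT": "TTC", "CAT": "CAC", "AAA": "AAG",
    "AAT": "AAC", "CAA": "CAG", "TAT": "TAC",
}


def optimize_group2(cds):
    """Replace every non-preferred group 2 codon; leave all other codons untouched."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(GROUP2_PREFERRED.get(c, c) for c in codons)


print(optimize_group2("ATGTGTTTTAAATAA"))
```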
[0200] The group 3 amino acids can be encoded by several codons as indicated in Table 3.1;
each codon being present in a preferred codon frequency: for alanine GCT, GCC, GCA,
GCG; for aspartate, GAT, GAC; for glutamate, GAA, GAG; for glycine, GGT, GGC, GGA,
GGG; for isoleucine, ATT, ATC, ATA; for leucine, TTA, TTG, CTT, CTC, CTA, CTG; for
proline, CCT, CCC, CCA, CCG; for arginine, CGT, CGC, CGA, CGG, AGA, AGG; for serine,
TCT, TCC, TCA, TCG, AGT, AGC; for threonine, ACT, ACC, ACA, ACG; for valine, GTT,
GTC, GTA, GTG.
For the group 3 amino acids (AA) and their encoding codons, the calculation of the
optimal occurrence of each possible codon within a given coding sequence is performed
according to the following methodology:
- i. sum for each of the respective group 3 AA, the total number of residues encoded in
the given sequence, see column A1 (Table 3.1),
- ii. for each AA and codon encoding that AA, multiply the total number for that AA by the
optimal codon distribution in Table 2.1, resulting in a raw codon distribution, which
generally may contain decimal numbers, see column A2 (Table 3.2),
- iii. round off the values of the raw codon distribution (ii) by removing the decimal digits,
resulting in a rounded off codon distribution, see column A3 (Table 3.2),
- iv. sum for each of the AA, the total number of AA represented in the rounded off codon
distribution (iii), see column A4 (Table 3.1),
- v. calculate the total missing number of residues for each of the respective AA in the
rounded off codon distribution, by subtracting the total number of AA represented in the
rounded off codon distribution (iv) from the total number of residues encoded in
the given sequence (i), see column A5 (Table 3.1),
- vi. calculate for each codon, the decimal difference between the raw codon distribution (ii)
and the rounded off codon distribution (iii) by subtraction, see column A6 (Table 3.2),
- vii. multiply for each codon, the decimal difference (vi) and the optimal codon distribution
in Table 2.1, giving a weight value for each codon, see column A7 (Table 3.2),
- viii. for each of the respective AA, select for the amount of missing residues (v), the
respective amount of codons that have the highest weight value (vii), see column A8
(Table 3.2),
- ix. the final optimal codon distribution within a given sequence encoding
a polypeptide is calculated by summing the rounded off codon distribution (iii) and
the selected amount of missing residues (viii) for each codon, see column A9 (Table
3.2).
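Steps (i) to (ix) can be sketched for a single amino acid as follows. This is an illustrative Python sketch, not the software used by the applicants: the function name allocate_codons is hypothetical, the target distribution is passed as integer percentages so that the decimal parts are computed exactly, and ties between equal weights are broken by the order in which the codons are listed.

```python
def allocate_codons(total, target_pct):
    """Distribute `total` residues of one amino acid over its codons.

    total      -- number of residues of this amino acid in the sequence (step i)
    target_pct -- codon -> target frequency in whole percent, summing to 100
    """
    if sum(target_pct.values()) != 100:
        raise ValueError("target percentages must sum to 100")
    # Step ii: raw distribution, kept as integers scaled by 100 to avoid float error.
    scaled = {c: total * p for c, p in target_pct.items()}
    # Step iii: rounded off distribution (decimal digits removed).
    rounded = {c: s // 100 for c, s in scaled.items()}
    # Steps iv and v: number of residues still missing after rounding off.
    missing = total - sum(rounded.values())
    # Steps vi and vii: decimal difference times target frequency (both scaled by 100).
    weight = {c: (scaled[c] % 100) * target_pct[c] for c in target_pct}
    # Step viii: the codons with the highest weights each receive one extra residue.
    ranked = sorted(target_pct, key=lambda c: weight[c], reverse=True)
    final = dict(rounded)
    for codon in ranked[:missing]:
        final[codon] += 1
    # Step ix: rounded off distribution plus the selected extra residues.
    return final


leu_target = {"TTA": 0, "TTG": 13, "CTT": 17, "CTC": 38, "CTA": 0, "CTG": 32}
print(allocate_codons(37, leu_target))
```

For leucine (37 residues) the sketch gives 5 TTG, 6 CTT, 14 CTC and 12 CTG codons, which agrees with the leucine rows of column A9 in Table 3.2.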
Table 3.1
| AA(i) | I | A1 | A4 | A5 |
| Ala | 1 | 42 | 40 | 2 |
| Asp | 2 | 42 | 41 | 1 |
| Glu | 3 | 12 | 11 | 1 |
| Gly | 4 | 43 | 42 | 1 |
| Ile | 5 | 28 | 27 | 1 |
| Leu | 6 | 37 | 35 | 2 |
| Pro | 7 | 22 | 21 | 1 |
| Arg | 8 | 10 | 9 | 1 |
| Ser | 9 | 37 | 35 | 2 |
| Thr | 10 | 40 | 40 | 0 |
| Val | 11 | 31 | 29 | 2 |
Table 3.2
| Codon | A2 | A3 | A6 | A7 | A8 | A9 |
| Ala_GCT | 15.96 | 15 | 0.96 | 0.365 | 1 | 16 |
| Ala_GCC | 21.42 | 21 | 0.42 | 0.014 | 1 | 21 |
| Ala_GCA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Ala_GCG | 4.62 | 4 | 0.62 | 0.068 | 0 | 5 |
| Asp_GAT | 15.12 | 15 | 0.12 | 0.043 | 0 | 15 |
| Asp_GAC | 26.88 | 26 | 0.88 | 0.563 | 1 | 27 |
| Glu_GAA | 3.12 | 3 | 0.12 | 0.031 | 0 | 3 |
| Glu_GAG | 8.88 | 8 | 0.88 | 0.651 | 1 | 9 |
| Gly_GGT | 21.07 | 21 | 0.07 | 0.034 | 0 | 21 |
| Gly_GGC | 15.05 | 15 | 0.05 | 0.018 | 0 | 15 |
| Gly_GGA | 6.88 | 6 | 0.88 | 0.141 | 1 | 7 |
| Gly_GGG | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Ile_ATT | 7.56 | 7 | 0.56 | 0.151 | 0 | 7 |
| Ile_ATC | 20.44 | 20 | 0.44 | 0.321 | 1 | 21 |
| Ile_ATA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Leu_TTA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Leu_TTG | 4.81 | 4 | 0.81 | 0.105 | 1 | 5 |
| Leu_CTT | 6.29 | 6 | 0.29 | 0.049 | 0 | 6 |
| Leu_CTC | 14.06 | 14 | 0.06 | 0.023 | 0 | 14 |
| Leu_CTA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Leu_CTG | 11.84 | 11 | 0.84 | 0.269 | 1 | 12 |
| Pro_CCT | 7.92 | 7 | 0.92 | 0.331 | 1 | 8 |
| Pro_CCC | 14.08 | 14 | 0.08 | 0.051 | 0 | 14 |
| Pro_CCA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Pro_CCG | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Arg_CGT | 4.9 | 4 | 0.9 | 0.441 | 1 | 5 |
| Arg_CGC | 5.1 | 5 | 0.1 | 0.051 | 0 | 5 |
| Arg_CGA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Arg_CGG | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Arg_AGA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Arg_AGG | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Ser_TCT | 7.77 | 7 | 0.77 | 0.162 | 1 | 8 |
| Ser_TCC | 16.28 | 16 | 0.28 | 0.123 | 0 | 16 |
| Ser_TCA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Ser_TCG | 5.18 | 5 | 0.18 | 0.025 | 0 | 5 |
| Ser_AGT | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Ser_AGC | 7.77 | 7 | 0.77 | 0.162 | 1 | 8 |
| Thr_ACT | 12 | 12 | 0 | 0.000 | 0 | 12 |
| Thr_ACC | 28 | 28 | 0 | 0.000 | 0 | 28 |
| Thr_ACA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Thr_ACG | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Val_GTT | 8.37 | 8 | 0.37 | 0.100 | 0 | 8 |
| Val_GTC | 16.74 | 16 | 0.74 | 0.400 | 1 | 17 |
| Val_GTA | 0 | 0 | 0 | 0.000 | 0 | 0 |
| Val_GTG | 5.89 | 5 | 0.89 | 0.169 | 1 | 6 |
[0201] Subsequently, a completely new nucleotide coding sequence was created by random distribution
of the proposed number of synonymous codons (Table 2.1) for each amino acid in the
original amyB peptide. The synthetic
amyB sequence, resulting from the process described above, is indicated in SEQ ID NO.5.
Secondary structures in the modified coding sequence were checked using the Clone
Manager 7 program (Sci. Ed. Central: Scientific & Educational software, version 7.02)
for possible occurrence of harmful secondary structures.
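The random distribution of a fixed number of synonymous codons over the positions of a protein can be sketched in Python as follows. The function name distribute_codons, the seed parameter and the small example protein are hypothetical; the sketch only illustrates the principle and does not include the secondary-structure check.

```python
import random


def distribute_codons(protein, codon_counts, seed=None):
    """Back-translate `protein`, using each codon exactly as often as requested.

    codon_counts -- amino acid letter -> {codon: number of times to use it}
    """
    rng = random.Random(seed)
    pools = {}
    for aa, counts in codon_counts.items():
        # One pool per amino acid, holding every codon as often as it should occur.
        pool = [codon for codon, n in counts.items() for _ in range(n)]
        rng.shuffle(pool)
        pools[aa] = pool
    out = []
    for aa in protein:
        pool = pools.get(aa)
        if not pool:
            raise ValueError(f"no codon left for amino acid {aa!r}")
        out.append(pool.pop())
    if any(pools.values()):
        raise ValueError("codon counts exceed the number of residues")
    return "".join(out)


counts = {"M": {"ATG": 1}, "D": {"GAT": 1, "GAC": 2}, "L": {"CTC": 1}}
print(distribute_codons("MDLDD", counts, seed=1))
```

Because the codons of each amino acid are shuffled independently, the resulting sequence always encodes the same protein and always has the requested codon counts; only the positions of the synonymous codons vary between runs.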
3.4 Optimization of the coding sequence according to the combined single-codon and
codon-pair method of the invention for the alpha-amylase coding sequence amyB for expression in A. niger
[0202] A method of the invention is applied for the improvement of the coding sequence of
the amyB gene of
A. niger. The optimized
amyB sequence, resulting from the process described in Example 2, is indicated in SEQ
ID NO.6. Secondary structures in the modified coding sequence were checked using the
Clone Manager 7 program (Sci. Ed. Central: Scientific & Educational software, version
7.02) for possible occurrence of harmful secondary structures.
3.5 Construction of modified amyB expression vectors for expressing A. niger alpha-amylase AmyB encoded by coding sequences described in examples 3.2 and 3.3
[0203] The DNA sequence of the
XhoI -
PacI fragment of pGBFINFUA-1 (Figure 21) is shown as SEQ ID NO. 4 and comprises the amyB
promoter and wild-type amyB cDNA sequence with a modified translational initiation
sequence and modified translation stop sequence. The DNA sequence comprising a variant
of the translational initiation sequence of the alpha-amylase promoter combined with
a codon optimized coding sequence for the alpha-amylase encoding amyB gene, as described
in Example 1.2, is shown as SEQ ID NO. 7. The DNA sequence comprising a variant of
the translational initiation sequence of the alpha-amylase promoter combined with
an optimized coding sequence according to the combined single-codon and codon-pair method
of the invention for the alpha-amylase encoding amyB gene, as described in Example
3.3, is shown as SEQ ID NO. 8.
[0204] For cloning these modified sequence variants in an expression vector, the two synthetic
gene fragments were digested with
XhoI and
PacI and introduced in the large fragment of an
XhoI and
PacI digested pGBFINFUA-1 vector (Figure 21), generating variant expression vectors.
After checking the integration of the correct fragment, the variant expression constructs
were named pGBFINFUA-2 and pGBFINFUA-3, as described below in Table 3.3.
Table 3.3: Modified expression constructs for alpha-amylase expression in
A. niger
| Plasmid name | SEQ ID NO | Translation initiation sequence | Coding sequence | Translation stop sequence |
| pGBFINFUA-1 | 4 | Modified (CACCGTCAAA ATG) | w.t. | Modified (TAAATA) |
| pGBFINFUA-2 | 7 | Modified (CACCGTCAAA ATG) | Single-codon optimized | Modified (TAAATA) |
| pGBFINFUA-3 | 8 | Modified (CACCGTCAAA ATG) | Modified according invention | Modified (TAAATA) |
[0205] The translated sequences of the amyB coding sequences of plasmid pGBFINFUA-1 to pGBFINFUA-3
are according to the amino acid sequence as depicted in SEQ ID NO 3, representing
the wild-type
A. niger alpha-amylase enzyme.
3.6 Expression in A. niger of modified pGBFINFUA- expression constructs of A. niger alpha-amylase
[0206] The pGBFINFUA-1, -2 and -3 expression constructs, prepared as described above, were
introduced in
A. niger by transformation as described below and according to the strategy depicted in Figure
22.
[0207] In order to introduce the three pGBFINFUA-1, -2 and -3 vectors (Table 3.3) in WT
4, a transformation and subsequent selection of transformants was carried out as described
in
WO98/46772 and
WO99/32617. In brief, linear DNA of the pGBFINFUA- constructs was isolated and used to transform
A. niger. Transformants were selected on acetamide media and colony purified according to standard
procedures. Colonies were diagnosed for integration at the glaA locus and for copy
number using PCR. Ten independent transformants of each of the pGBFINFUA-1, -2 and
-3 constructs with similar estimated copy numbers (low copy: 1-3) were selected and
numbered using the name of the transforming plasmid, as for example FUA-1-1 (for the
first pGBFINFUA-1 transformant) and FUA-3-1 (for the first pGBFINFUA-3 transformant),
respectively.
[0208] The selected FUA-strains and
A. niger WT 4 were used to perform shake flask experiments in 100 ml of the medium and under
conditions as described above. After 3 and 4 days of fermentation, samples were taken.
[0209] The production of alpha-amylase enzyme was measured in all three different
A. niger FUA-transformants. As can be learned from Figure 23, optimization of the coding sequence
according to the method of the invention shows a greater improvement in expression of
AmyB than the other method tested, single-codon optimization. These figures
have been summarized in Table 3.4 below.
Table 3.4. Relative average alpha-amylase activities of transformants with wild-type
construct compared to those with modified
amyB coding sequences (as concluded from Figure 23).
| Strain type | SEQ ID NO | Coding sequence | Alpha-amylase activity |
| FUA-1 | 4 | w.t. | 100% |
| FUA-2 | 7 | Single-codon optimized | 200% |
| FUA-3 | 8 | Modified according invention | 400% |
[0210] These results indicate clearly that the method of the invention can be applied to
improve protein expression in a host, even when the expression construct and host
already have several other optimizations, such as a strong promoter, an improved
translation initiation sequence, an improved translation stop sequence, an optimal
single-codon usage and / or an improved host for protein expression.
4. Example 4: Design of improved DNA sequences for expression of three heterologous
enzymes in Bacillus species: Bacillus subtilis and Bacillus amyloliquefaciens.
4.1. Introduction
[0211] Example 4 describes the experiment design and application of a method of the invention
described in this patent for (improved) expression of heterologous proteins in both
Bacillus species, more specifically in this example
Bacillus subtilis and
Bacillus amyloliquefaciens. A preferred expression host is
Bacillus amyloliquefaciens.
[0213] In this example, the full sequence of
B. subtilis was chosen as the basis for calculating single-codon frequencies and codon-pair weights.
Comparison of GC-content and tRNAs provided a similar picture for the
Bacillus species mentioned (
vide supra). This is an indication that the same statistics are applicable for other related
Bacillus species. Moreover, from example 1 (see also Figure 4), it was already clear that
related species show similar codon-pair frequencies.
[0214] In Figure 4 (see also example 1), a codon-pair comparison plot, based on full genome
statistics for
B. subtilis vs. B. amyloliquefaciens can be found. A good correlation between both data sets is observed. Moreover, it
seems that
B. amyloliquefaciens is more versatile, since there is a subgroup of codon-pair combinations that is well
accepted in
B. amyloliquefaciens, while it has highly negative values for
B. subtilis; the opposite is not observed.
4.2. Experiment design
[0215] Three protein sequences were selected for expression in both
Bacillus subtilis and
[0216] Bacillus amyloliquefaciens:
Protein 1: Xylose (glucose) isomerase xylA (EC 5.3.1.5) from Bacillus stearothermophilus;
Protein 2: Xylose (glucose) isomerase xylA (EC 5.3.1.5) from Streptomyces olivochromogenes;
Protein 3: L-arabinose isomerase (EC 5.3.1.4) from Thermoanaerobacter mathranii.
Table 4.1 Overview of gene constructs; Protein 2 was chosen to further explore the codon-pair
concept in a broader sense.
| | Gene | Protein | Single codon-optimization | Single codon & positive codon-pair optimization | Single codon & negative codon-pair optimization |
| Protein 1 | | SEQ ID NO. 9 | SEQ ID NO. 16 | SEQ ID NO. 13 | |
| Protein 2 | | SEQ ID NO. 10 | SEQ ID NO. 17 | SEQ ID NO. 14 | SEQ ID NO. 18 |
| Protein 3 | SEQ ID NO. 11 | SEQ ID NO. 12 | | SEQ ID NO. 15 | |
[0217] Table 4.1 provides an overview of the methods applied to the 3 genes described above.
For Protein 1, Protein 2 and Protein 3, the codon-pair optimization of the method
of the invention is applied in addition to the single codon optimization developed
before.
[0218] As a control, the effects of single-codon optimization alone and of negative codon-pair
optimization were tested experimentally by including 2 additional constructs for Protein 2.
One variant (SEQ ID NO. 18) was deliberately 'optimized' toward bad codon pairs (i.e.
negative codon-pair optimization), and a second variant (SEQ ID NO. 17) received
single-codon optimization only. Protein 2 was chosen because
Streptomyces species show a highly different codon-pair bias, see Example 1 and Figure 4.
[0219] All designed
B. amyloliquefaciens genes avoided the occurrence of
NdeI (CATATG) and
BamHI (GGATCC) restriction sites. Additionally, they contained a single restriction site
for removing the
E. coli part of the cloning vector pBHA12.
4.3. Single codon optimization
[0220] Single-codon optimized variants for Protein 1 and Protein 2 were designed using the
method described in Example 3.3 for single-codon optimization, resulting in SEQ ID
NO. 16 and SEQ ID NO. 17, respectively. The applied single-codon distribution table
(Table 4.2) was determined using the 50 most highly expressed genes as determined
by 24 Affymetrix GeneChips for
B. subtilis 168 using 6 independent fermentation time-series. All GeneChips were normalized with
respect to their arithmetic mean. The expression list excludes those genes that were
deliberately overexpressed in strain engineering, since their measured expression
level cannot be correlated with their codon usage.
[0221] The single-codon distribution in Table 4.2 was determined by visual inspection
of codon frequency histograms of the 50, 100, 200 and 400 most highly expressed sequences and
of all
B. subtilis sequences. Where the most highly expressed genes showed a clear trend toward either
0% or 100%, the codon was assigned 0% or 100%, respectively. For the remaining,
unassigned codons, the average usage was calculated and normalized over
the set of synonymous codons, leaving out the assigned codons. The resulting target
single-codon frequencies are given in Table 4.2, column 3.
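The normalization step described above can be sketched as follows. This is a minimal illustration, not the software used for the design: the function name and the average-usage numbers in the usage lines are hypothetical, chosen only so that the alanine result agrees with the Ala rows of Table 4.2.

```python
def target_distribution(avg_usage, assigned):
    """Target percentages for one set of synonymous codons.

    avg_usage: codon -> average usage (any non-negative scale)
    assigned:  codon -> 0.0 or 100.0, fixed beforehand by visual inspection
    Unassigned codons share whatever percentage the assigned ones leave over,
    in proportion to their average usage.
    """
    remaining = 100.0 - sum(assigned.values())
    free = {c: u for c, u in avg_usage.items() if c not in assigned}
    free_sum = sum(free.values())
    target = dict(assigned)
    for codon, usage in free.items():
        target[codon] = remaining * (usage / free_sum) if free_sum else 0.0
    return target


# Hypothetical average usage for alanine; GCC and GCG were assigned 0%.
ala = target_distribution(
    {"GCT": 0.30, "GCC": 0.15, "GCA": 0.30, "GCG": 0.25},
    {"GCC": 0.0, "GCG": 0.0},
)
print(ala)
```

With these inputs GCT and GCA each receive half of the remaining 100%, as in Table 4.2.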
Table 4.2 Codon-usage distribution for synthetic gene design on the basis of the 50
most highly expressed genes and visual inspection of single-codon usage histograms,
e.g. Figure 24. 'Don't care' terms can be applied during codon-pair optimization to leave
the choice for those codons free, thus not taking single-codon optimization into account
for these codons.
| Amino acid | Codon | Single codon distribution (%) | Don't care = 0, care = 1 |
| A | Ala_GCT | 50 | 0 |
| | Ala_GCC | 0 | 1 |
| | Ala_GCA | 50 | 0 |
| | Ala_GCG | 0 | 1 |
| C | Cys_TGT | 51 | 0 |
| | Cys_TGC | 49 | 0 |
| D | Asp_GAT | 63 | 1 |
| | Asp_GAC | 37 | 1 |
| E | Glu_GAA | 100 | 1 |
| | Glu_GAG | 0 | 1 |
| F | Phe_TTT | 55 | 0 |
| | Phe_TTC | 45 | 0 |
| G | Gly_GGT | 31 | 1 |
| | Gly_GGC | 34 | 1 |
| | Gly_GGA | 35 | 1 |
| | Gly_GGG | 0 | 1 |
| H | His_CAT | 71 | 0 |
| | His_CAC | 29 | 0 |
| I | Ile_ATT | 60 | 0 |
| | Ile_ATC | 40 | 0 |
| | Ile_ATA | 0 | 1 |
| K | Lys_AAA | 100 | 1 |
| | Lys_AAG | 0 | 1 |
| L | Leu_TTA | 39 | 0 |
| | Leu_TTG | 24 | 0 |
| | Leu_CTT | 37 | 0 |
| | Leu_CTC | 0 | 1 |
| | Leu_CTA | 0 | 1 |
| | Leu_CTG | 0 | 1 |
| M | Met_ATG | 100 | 1 |
| N | Asn_AAT | 45 | 0 |
| | Asn_AAC | 55 | 0 |
| P | Pro_CCT | 35 | 0 |
| | Pro_CCC | 0 | 1 |
| | Pro_CCA | 22 | 0 |
| | Pro_CCG | 43 | 0 |
| Q | Gln_CAA | 100 | 1 |
| | Gln_CAG | 0 | 1 |
| R | Arg_CGT | 38 | 0 |
| | Arg_CGC | 34 | 0 |
| | Arg_CGA | 0 | 1 |
| | Arg_CGG | 0 | 1 |
| | Arg_AGA | 28 | 0 |
| | Arg_AGG | 0 | 1 |
| S | Ser_TCT | 34 | 0 |
| | Ser_TCC | 0 | 1 |
| | Ser_TCA | 34 | 0 |
| | Ser_TCG | 0 | 1 |
| | Ser_AGT | 0 | 1 |
| | Ser_AGC | 32 | 0 |
| T | Thr_ACT | 33 | 0 |
| | Thr_ACC | 0 | 1 |
| | Thr_ACA | 46 | 0 |
| | Thr_ACG | 22 | 1 |
| V | Val_GTT | 47 | 1 |
| | Val_GTC | 0 | 1 |
| | Val_GTA | 23 | 1 |
| | Val_GTG | 30 | 1 |
| W | Trp_TGG | 100 | 1 |
| Y | Tyr_TAT | 62 | 0 |
| | Tyr_TAC | 38 | 0 |
| | Stop_TGA | 0 | 1 |
| | Stop_TAG | 0 | 1 |
| | Stop_TAA | 100 | 1 |
4.4. Codon pair optimization
[0222] Codon-pair optimization was performed according to the method of the invention. The
optimized coding nucleotide sequences SEQ ID NO. 13-15 are the result of a run with
the described software method. The applied parameters were: population size = 200;
number of iterations = 1000;
cpi = 0.20; CPW matrix = "Table C.4. CPW:
Bacillus subtilis - highly expressed sequences"; CR matrix = "Table B.1 column 5: CR table BAS:
Bacillus subtilis - highly expressed sequences" (also in Table 4.2); and 'don't care' elements as in Table
4.2. Moreover, a penalty value of +1 is added to
fitcombi for each occurrence of an
NdeI (CATATG) or
BamHI (GGATCC) restriction site.
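The restriction-site penalty can be illustrated with a short sketch. It assumes only what the paragraph states (fitcombi is minimized, and each occurrence of a forbidden site adds +1); the function names and the example sequence are hypothetical and are not taken from the described software.

```python
# Recognition sequences to be avoided. Both are palindromic, so scanning
# one strand is sufficient. GGATCC is the standard BamHI recognition sequence.
SITES = {"NdeI": "CATATG", "BamHI": "GGATCC"}


def count_sites(seq, site):
    """Count occurrences of `site` in `seq`, including overlapping ones."""
    seq, site = seq.upper(), site.upper()
    return sum(1 for i in range(len(seq) - len(site) + 1) if seq.startswith(site, i))


def penalized_fitness(fit_combi, seq, sites=SITES, penalty=1.0):
    """Add `penalty` to the (minimized) objective for every forbidden site."""
    hits = sum(count_sites(seq, s) for s in sites.values())
    return fit_combi + penalty * hits


# Hypothetical candidate carrying one NdeI and one BamHI site: penalty of +2.
print(penalized_fitness(-1.4, "AAACATATGAAAGGATCCTT"))
```

Because the optimization minimizes fitcombi, a candidate that still contains a forbidden site is ranked behind an otherwise comparable candidate without one.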
[0223] The optimized coding nucleotide sequence SEQ ID NO. 18 is the result of a run with
the described software method. The applied parameters were: population size = 200;
number of iterations = 1000;
cpi = 0.20; CPW matrix = -1 times "Table C.4. CPW:
Bacillus subtilis - highly expressed sequences" (for obtaining codon-pair optimization toward bad codon
pairs); CR matrix = "Table B.1 column 5: CR table BAS:
Bacillus subtilis - highly expressed sequences" (also in Table 4.2); and 'don't care' elements as in Table
4.2. Moreover, a penalty value of +1 is added to
fitcombi for each occurrence of an
NdeI (CATATG) or
BamHI (GGATCC) restriction site.
[0224] 'Don't care' elements in Table 4.2 are chosen for those codons that do not show codon
bias. This was done by visual inspection of the single-codon bias graph, see 4.3.
The usage of such elements provides additional freedom to the codon-pair part of the
optimization.
[0225] All optimizations converged toward a minimal value for
fitcombi. The obtained objective values for SEQ ID NO. 13-15 and SEQ ID NO. 18 are given in Table
4.3, together with those for SEQ ID NO. 11, SEQ ID NO. 16 and SEQ ID NO. 17. From
these data it is clear that single-codon statistics are highly similar for SEQ ID NO.
16 and SEQ ID NO. 17 in comparison with SEQ ID NO. 14 and SEQ ID NO. 15. However,
the method of the invention leads to a gene with an increased number of codon pairs
with associated negative weights, indicating a more optimal usage of codon pairs,
see Table 4.3.
[0226] 'Optimizing' by maximization of
fitcp leads to a gene with an increased number of codon pairs with associated positive
weights, so that a bad influence on translation characteristics is expected.
For SEQ ID NO. 18, the fraction of codon pairs with wcp(g) ≤ 0 is 24% vs. 85% for SEQ ID NO. 14, and
fitcombi increased from -1.43 to 1.20 (fitcp from -0.292 to 0.257), see Table 4.3.
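The effect of multiplying the CPW matrix by -1 can be made concrete with a toy calculation. The sketch below assumes that fitcp is the mean codon-pair weight over all adjacent codon pairs of a gene and that the reported percentage is the share of pairs with a weight ≤ 0; the weights and the sequence are invented for illustration and do not come from Table C.4.

```python
def codon_pairs(seq):
    """Adjacent codon pairs of an in-frame coding sequence."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return list(zip(codons, codons[1:]))


def fit_cp(seq, weights):
    """Mean codon-pair weight; pairs missing from `weights` count as 0."""
    pairs = codon_pairs(seq)
    return sum(weights.get(p, 0.0) for p in pairs) / len(pairs)


def share_nonpositive(seq, weights):
    """Fraction of codon pairs whose weight is <= 0."""
    pairs = codon_pairs(seq)
    return sum(weights.get(p, 0.0) <= 0 for p in pairs) / len(pairs)


# Hypothetical weights: negative = favourable pair, positive = unfavourable.
cpw = {("GAA", "AAA"): -0.4, ("AAA", "GAA"): 0.2}
negated = {pair: -w for pair, w in cpw.items()}  # "-1 times" the CPW matrix

gene = "GAAAAAGAAAAA"  # codons GAA AAA GAA AAA
print(fit_cp(gene, cpw), share_nonpositive(gene, cpw))
print(fit_cp(gene, negated), share_nonpositive(gene, negated))
```

Scoring the same gene against the negated matrix flips the sign of fitcp, so an optimizer that minimizes the objective is driven toward the pairs the original matrix marks as unfavourable.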
Table 4.3 Codon optimization; objective fitness values for genes for expression in
B. subtilis and
B. amyloliquefaciens.
| Sequence | Type | fitsc | fitcp | wcp(g) ≤ 0 | fitcombi (cpi=0.2) |
| SEQ ID NO. 11 | WT | 0.078 | 0.097 | 41.1% | 0.350 |
| SEQ ID NO. 13 | sc + cp optimized | 0.004 | -0.293 | 89.1% | -1.439 |
| SEQ ID NO. 14 | sc + cp optimized | 0.004 | -0.292 | 84.8% | -1.431 |
| SEQ ID NO. 15 | sc + cp optimized | 0.003 | -0.303 | 89.2% | -1.493 |
| SEQ ID NO. 16 | sc optimized | 0.002 | -0.023 | 56.9% | -0.114 |
| SEQ ID NO. 17 | sc optimized | 0.003 | 0.087 | 44.3% | 0.428 |
| SEQ ID NO. 18 | sc + negative cp optimized | 0.015 | 0.257 | 23.5% | 1.196 |
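The relation between the three objective columns of Table 4.3 can be checked numerically. The sketch below assumes the combined objective has the form fitcombi = fitcp / (cpi + fitsc) with cpi = 0.2; this form is an assumption of the example, but it reproduces every tabulated fitcombi value to within the rounding of the three-decimal inputs.

```python
# SEQ ID NO -> (fitsc, fitcp, tabulated fitcombi), copied from Table 4.3.
TABLE_4_3 = {
    11: (0.078, 0.097, 0.350),
    13: (0.004, -0.293, -1.439),
    14: (0.004, -0.292, -1.431),
    15: (0.003, -0.303, -1.493),
    16: (0.002, -0.023, -0.114),
    17: (0.003, 0.087, 0.428),
    18: (0.015, 0.257, 1.196),
}


def fit_combi(fit_sc, fit_cp, cpi=0.2):
    """Assumed combined objective: codon-pair fitness scaled by cpi + fitsc."""
    return fit_cp / (cpi + fit_sc)


for seq_id, (fit_sc, fit_cp, tabulated) in TABLE_4_3.items():
    computed = fit_combi(fit_sc, fit_cp)
    print(f"SEQ ID NO. {seq_id}: computed {computed:.3f}, tabulated {tabulated:.3f}")
```

Under this form a small cpi lets fitcp dominate the combined value, while a poor single-codon fit (large fitsc) shrinks its magnitude.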
5. Example 5: Testing the method of the invention for expression of three heterologous
enzymes in Bacillus subtilis and Bacillus amyloliquefaciens.
5.1 Introduction
[0227] Example 5 describes the experiment and the results of the expression of 3 heterologous
genes, and sequence variants of these, in both
Bacillus subtilis and
Bacillus amyloliquefaciens host cells. Variants are made according to the method of the invention, as described
in Example 4.
5.2 Materials and Methods
5.2.1 Bacillus growth media
[0228] 2xTY (per L): tryptone peptone 16 g, yeast extract Difco 10 g, NaCl 5 g.
5.2.2 Transformation of B. subtilis
[0230] 2x Spizizen medium: 28 g K2HPO4; 12 g KH2PO4; 4 g (NH4)2SO4; 2.3 g Na3-citrate.2H2O;
0.4 g MgSO4.7H2O; H2O to 900 ml and adjust to pH 7.0-7.4 with 4N NaOH. Add H2O to 1 liter.
[0231] Autoclave 20 minutes at 120°C.
[0232] 1x Spizizen-plus medium: to 50 ml 2x Spizizen medium add 50 ml milliQ water, 1 ml 50% glucose
and 100 µl casamino acids (20 µg/ml final concentration).
[0233] A single
Bacillus colony (or an aliquot from a deep freeze vessel) from a non-selective 2xTY agar plate
was inoculated in 10 ml 2xTY broth in a 100 ml shake flask. Cells were grown overnight
in an incubator shaker at 37°C and approx. 250 rpm. The OD was measured at 600 nm and the
culture was diluted with 1x Spizizen-plus medium to OD600 ≈ 0.1. Cells were grown at 37°C
and 250-300 rpm until the culture OD600 was 0.4-0.6. The culture was diluted 1:1 with
1x Spizizen medium supplemented with 0.5% glucose (starvation medium) and incubated for
90 min at 37°C and 250-300 rpm. The culture was centrifuged at 4500 rpm in a tabletop
centrifuge for 10 minutes. 90% of the supernatant was removed and the pellet was
resuspended in the remaining volume. DNA (1-5 µg in a maximum of 20 µl) was mixed with
0.5 ml competent cells in a universal tube and incubated for 1 hour at 37°C in a rotary
shaking water bath under firm shaking (≈5/6). Cells were plated (20 to 200 µl) on
selective 2xTY agar plates containing 25 µg/ml kanamycin and incubated overnight at 37°C.
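The dilution of the overnight culture to OD600 ≈ 0.1 is simple proportional arithmetic, sketched below. The function name, the 20 ml final volume and the measured OD of 4.0 are hypothetical (the protocol does not state them), and the calculation assumes that OD600 scales linearly with cell density over the range used.

```python
def dilution_volumes(od_measured, od_target=0.1, final_volume_ml=20.0):
    """Volumes (ml) of culture and fresh medium that give `od_target`.

    Assumes OD600 is proportional to cell density: C1 * V1 = C2 * V2.
    """
    if od_measured <= od_target:
        raise ValueError("culture is already at or below the target OD")
    culture_ml = final_volume_ml * od_target / od_measured
    return culture_ml, final_volume_ml - culture_ml


# Hypothetical overnight culture with OD600 = 4.0, diluted into 20 ml total.
culture_ml, medium_ml = dilution_volumes(4.0)
print(f"{culture_ml:.2f} ml culture + {medium_ml:.2f} ml 1x Spizizen-plus medium")
```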
5.2.3 Preparation of cell-free extract
[0234] The pellet obtained from 1 ml culture was resuspended in buffer A containing 10 mM
Tris-HCl (pH 7.5), 10 mM EDTA, 50 mM NaCl, 1 mg/ml lysozyme and protease inhibitors
(Complete EDTA-free protease inhibitor cocktail, Roche). The resuspended pellets were
incubated for 30 min at 37°C for protoplast formation and subsequently sonicated as follows:
30 sec, 10 amplitude microns (3 cycles), with 15 sec cooling between cycles. After
sonication, cell debris was spun down by centrifugation (10 min, 13000 rpm at 4°C),
and the clear lysates were used for further analysis.
5.2.4 Selection of glucose isomerase and L-arabinose isomerase encoding genes and
design of synthetic genes for expression in Bacillus amyloliquefaciens and Bacillus subtilis
[0235] The three enzymes selected are:
- 1. Bacillus stearothermophilus xylose isomerase (P54272 Swissprot); protein sequence SEQ ID NO. 9,
- 2. Streptomyces olivochromogenes xylose isomerase (P15587 Swissprot); protein SEQ ID NO. 10,
- 3. Thermoanaerobacter mathranii L-arabinose isomerase (AJ 582623.1 EMBL, and also US2003/012971A1), protein SEQ ID NO. 11, nucleotide SEQ ID NO. 12.
[0236] As seen above, the selected enzymes have different microbial origins. With the aim
to overproduce these enzymes in
Bacillus subtilis or
Bacillus amyloliquefaciens we have optimized the nucleotide sequence for each protein in such a way that it
is suitable for expression in
Bacillus species, see Example 4.
[0237] We have optimized the nucleotide sequences that encode the above-mentioned enzymes.
The sequences are listed in the sequence list under SEQ ID NO. 13 (
Bacillus stearothermophilus glucose (xylose) isomerase), SEQ ID NO. 14 (
Streptomyces olivochromogenes glucose (xylose) isomerase) and SEQ ID NO. 15 (
Thermoanaerobacter mathranii L-arabinose isomerase). As controls, variants with single-codon optimization
but without codon-pair optimization (SEQ ID NO. 16-17) and one variant with single-codon
optimization plus "negative codon-pair optimization" (SEQ ID NO. 18) were generated, see Example
4 and Table 4.1.
5.3 Cloning of the glucose isomerase and L-arabinose isomerase encoding genes in the
E.coli/Bacillus shuttle vector and transformation to Bacilli
[0238] For the expression of the selected genes in
Bacilli we have used the pBHA12
E.coli/
Bacillus shuttle vector (Figure 26). This vector is essentially derived from the expression
vector pBHA-1 (
EP 340878) in which a promoter derived from the
amyQ gene of
Bacillus amyloliquefaciens replaced the
HpaII promoter. The pBHA12 plasmid contains two multiple cloning sites (Figure 26). All
selected and optimized genes were made synthetically (DNA 2.0, Menlo Park, CA, U.S.A.)
as two fragments (A and B). The A fragment corresponding to the 5' end of the gene
was cloned behind the
amyQ promoter. Both fragments have been extended with specific restriction endonuclease
sites in order to allow direct cloning in the multiple cloning sites 1 and 2 (see
Figure 27). The 3' end of the fragment A and 5' end of the fragment B overlap by a
unique restriction endonuclease site that allows excision of the
E. coli part of the vector and back ligation prior to the transformation of
Bacillus subtilis (CBS 363.94). During the procedure of cloning and transformation of
B. subtilis,
E. coli was used as an intermediate host. The two-step cloning approach in pBHA12 was chosen
in order to avoid possible problems during cloning and propagation of the expression
vectors in
E. coli. In Table 5.1 the restriction enzyme recognition sites added to fragments A and B
are listed as well as the unique restriction site that allows back ligation and as
such reconstruction of an entire and functional gene. All the 5' ends of the A fragments
contain an
NdeI site (recognition sequence CATATG) that allows cloning of genes as a fragment starting
exactly at their respective start codon (ATG).
Table 5.1. The summary of the restriction endonuclease (RE) cloning sites that have
been added to the gene fragments to facilitate the cloning in pBHA12.
| Gene/RE | Fragment A 5' end | Fragment A 3' end | Fragment B 5' end | Fragment B 3' end | Unique RE site (position in the gene) |
| B. stearothermophilus GI | NdeI | BamHI | SmaI | KpnI | PvuII (496 bp) |
| S. olivochromogenes GI | NdeI | MluI | EcoRV | KpnI | ClaI (372 bp) |
| T. mathranii ARAA | NdeI | MluI | SacI | KpnI | ClaI (708 bp) |
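The two-fragment design, in which the 3' end of fragment A and the 5' end of fragment B overlap by a unique restriction site, can be checked in silico before synthesis. The sketch below is a hypothetical helper, not part of the described workflow; the toy gene is invented, and ATCGAT is the ClaI recognition sequence.

```python
def split_at_unique_site(gene, site):
    """Split `gene` into fragments A and B that overlap by `site`.

    Raises ValueError unless `site` occurs exactly once, since only a unique
    site allows the gene to be restored by excision and back ligation.
    """
    gene, site = gene.upper(), site.upper()
    hits = [i for i in range(len(gene) - len(site) + 1) if gene.startswith(site, i)]
    if len(hits) != 1:
        raise ValueError(f"site occurs {len(hits)} times, expected exactly 1")
    start = hits[0]
    fragment_a = gene[:start + len(site)]  # 5' part, ends with the site
    fragment_b = gene[start:]              # 3' part, starts with the site
    return fragment_a, fragment_b


# Toy gene with a single ClaI site (ATCGAT).
frag_a, frag_b = split_at_unique_site("ATGAAAATCGATGGGTAA", "ATCGAT")
print(frag_a, frag_b)
```

Joining fragment A to fragment B with the shared site counted once restores the original gene, which mirrors the back ligation step.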
[0239] The A and B fragments of 5 genes have been cloned in two steps in the MCS1 and 2,
respectively, as shown for the SEQ ID NO. 13 in Figure 27, using the standard molecular
biology methods (
Sambrook & Russell, Molecular Cloning: A Laboratory Manual, 3rd Ed., CSHL Press, Cold
Spring Harbor, NY, 2001; and
Ausubel et al., Current Protocols in Molecular Biology, Wiley InterScience, NY, 1995). The transformation was performed in
E. coli TOP10 (Invitrogen), or in INV110 (Invitrogen) when methylation-sensitive
restriction endonucleases were to be used in a further step. Several
E. coli ampicillin-resistant transformants for each expression construct were isolated using
the mini or midi plasmid isolation kits (Macherey-Nagel and Sigma, respectively).
The correct ligation of the corresponding A and B fragments in the pBHA12 vector was
confirmed by restriction analysis. In the next step the pBHA12 plasmids that contained
the A and B fragments of the genes were digested with the unique restriction endonuclease
(see Table 5.1) to excise the
E. coli part of the vector. The
Bacillus part of the vector that contained the interrupted gene was isolated from the agarose
gel using gel extraction kit (Macherey-Nagel) and back ligated. The ligation mixture
was transformed to
B. subtilis CBS 363.94 strain by competent cell transformation. Several
B. subtilis kanamycin resistant transformants for each expression construct were isolated using
the mini or midi plasmid isolation kits (Macherey-Nagel and Sigma, respectively).
The expression constructs were checked by restriction analysis for the correct pattern
after the excision of the
E. coli part and the back ligation of the
Bacillus part of the pBHA12 vector. For each construct three
B. subtilis transformants were selected for analysis of the cell free extract.
5.4 Detection of overproduced enzymes in Bacilli
[0240] Three
B. subtilis transformants and three
B. amyloliquefaciens transformants for each construct were used to analyze the cell free extract for the
presence of the corresponding protein - glucose or L-arabinose isomerase. The 2xTY
fermentation media were used to grow the strains. Samples (1ml) were taken at 24 hours
of fermentation (in shake flask) and the cell free extract was prepared including
protease inhibitors in the extraction buffer. 13 µl of the cell free extract were
analyzed on SDS-PAGE (Invitrogen). For several transformants a clear band corresponding
to the expected Mw of the overexpressed protein was detected. A visual comparison
of the bands is given in Table 5.2. The codon-pair method of the invention clearly improved
protein production for
Bacillus stearothermophilus xylose isomerase,
Streptomyces olivochromogenes xylose isomerase and
Thermoanaerobacter mathranii L-arabinose isomerase in comparison with either the WT reference
gene or the single-codon optimized variants. Moreover, when negative codon-pair optimization
was applied together with single-codon optimization, no product was detected.
Table 5.2 Overexpression of three heterologous genes in
Bacilli. WT: wild type; sc: single-codon optimization; cp: codon-pair optimization;
cp-: negative codon-pair optimization.
| | B. subtilis WT | B. subtilis sc | B. subtilis sc & cp | B. subtilis sc & cp- | B. amyloliquefaciens WT | B. amyloliquefaciens sc | B. amyloliquefaciens sc & cp | B. amyloliquefaciens sc & cp- |
| Bacillus stearothermophilus xylose isomerase (SEQ ID NO. 16, 13) | | + | +++ | | | + | +++ | |
| Streptomyces olivochromogenes xylose isomerase (SEQ ID NO. 17, 14, 18) | | + | ++ | 0 | | + | ++ | 0 |
| Thermoanaerobacter mathranii L-arabinose isomerase (SEQ ID NO. 12, 15) | 0/+ | | ++ | | 0 | | ++ | |
REFERENCES
[0246] Hatfield, G.W. & Gutman, G.A. (1992). Codon pair utilization. United States Patent
No. 5,082,767
[0255] Punt, P.J., van Biezen, N., Conesa, A., Albers, A., Mangnus, J. & van den Hondel,
C. (2005). Filamentous fungi as cell factories for heterologous protein production.
Trends Biotechnol. 20(5):200-206