Blaise-g committed on
Commit
460ca19
1 Parent(s): ef9d398

Create new file

examples/evolutionary_molecular_biology_paper.txt ADDED
@@ -0,0 +1 @@
+ evolutionary molecular biology is mostly concerned with the forces affecting individual genes. however, observations of variable proportions of guanine and cytosine in different species and in different genomic regions of vertebrates have prompted the analysis of forces that may affect the evolution of complete genomes. one particular hypothesis concerns adaptation to high temperatures, proposing that high gc content results from selection favouring g:c pairs over less stable a:t pairs. against initial expectations, there seems to be no direct relationship between the gc content of prokaryotic protein-coding genes and optimal growth temperature. similarly, in the case of vertebrates, it was argued convincingly that the 'isochore' structure of high- and low-gc regions is not due to selection, but reflects varying fixation biases of gc over at pairs in the presence of recombination. a clear picture of selection at work emerges only in the study of structured rnas. the ribosomal rnas and transfer rnas of prokaryotes living at high temperatures contain a much larger gc-fraction in their stem regions compared to homologs from prokaryotes living at more moderate temperatures, likely because g-c pairs are more stable to thermal fluctuations than a-u pairs. a similar effect is seen in vertebrates: the ribosomal rna of endothermic animals has a higher gc-content compared to that of ectothermic vertebrates. thus, rnas that require a specific three-dimensional structure to perform their function appear to be under selection for increased thermostability in cellular environments with elevated temperatures, consistent with the thermal adaptation hypothesis. however, a higher gc-content in structural rnas of thermophiles and hyperthermophiles may also have arisen through reasons unrelated to environmental temperatures, e.g., random genetic drift or mutational biases.
closely related species often have similar nucleotide composition and similar habitats simply due to their descent from a common ancestor; a statistically significant relationship between gc content and temperature across species might thus reflect nothing more than a close phylogenetic relationship of these species. this is not the case: even after controlling for phylogenetic relationships, the gc content of structural rna remains strongly correlated with optimal growth temperature. thus, genomic effects of thermal adaptation appear to exist at the structural but not the sequence level. just like structural rnas, proteins need to retain their three-dimensional structure in the presence of thermal fluctuations. it hence appears likely that the proteins of thermophilic organisms show corresponding signs of thermal adaptation. several studies indeed report a correlation between amino acid usage and optimal growth temperature of bacteria; however, these studies are based on amino acid usage patterns and not directly on protein thermostability. two further analyses are based directly on large datasets of compositional comparisons that took protein structure into account. in a careful study of biophysical properties of a subset of proteins, glyakina et al. confirmed that those amino acids that lead to stronger electrostatic interactions in protein surfaces are enriched among thermophiles, while certain amino acids that tend to de-stabilise proteins are depleted. in another large scale study of the surfaces of hyperthermophilic proteins, claverie et al. found solvent accessible charged residues to be strongly overrepresented, concluding that the resulting measure of cvp-bias was "the sole criterion that is able to clearly discriminate hyperthermophilic from mesothermophilic microorganisms on a global genomic basis". 
the measures of amino acid composition derived in the two studies are strongly correlated, as they aim to measure the same phenomenon; they differ only in the treatment of three amino acids. thus, amino acid sequence composition is correlated with temperature. however, just as for the gc content of structural rnas, these correlations could simply be due to the close phylogenetic relationships of some thermophiles and hyperthermophiles. using the comparative phylogenetic method, we show here that patterns of amino acid usage between thermophiles, hyperthermophiles and mesophiles are indeed strongly affected by phylogenetic relationships. consequently, previous results from direct sequence comparisons are partly misleading. reassuringly, the two measures of amino acid bias that are derived from studies taking into account the known structure of protein subsets are strongly correlated with optimal growth temperature when extended to complete prokaryotic proteomes, even after controlling for phylogenetic non-independence. can similar effects of thermal adaptation be seen in higher eukaryotes? the proteins of mammals and birds, which are endothermic species, operate at a species-dependent constant temperature of 35-42° celsius. this temperature is significantly higher than the average temperature in fish or reptiles, which are ectothermic species. thus, the same trends observed in prokaryotes may also operate on vertebrate proteins: we hypothesize that compared to ectothermic vertebrates, endothermic animals have proteins with an amino acid composition biased in the same direction as in thermophilic prokaryotes. physiological constraints on multi-cellular animals mean that they cannot live at the temperatures in which prokaryotic thermophiles thrive, and thus we expect their amino acid compositions to be less biased. however, the relationship between amino acid composition and thermal stability is approximately linear between 7°c and 103°c.
thus, if thermal adaptation indeed occurred in endothermic animals, it appears likely that the same amino acids as in thermophilic prokaryotes are involved, even if the relevant temperature differences in eukaryotes are substantially smaller than in prokaryotes. here, we test this prediction using data from fully sequenced endothermic and fully sequenced ectothermic vertebrates. we first demonstrate that the erk measure of biased amino acid composition shows a strong correlation with optimal growth temperature when applied to genome-scale prokaryotic data, even after controlling for phylogenetic relatedness. we then proceed to show that the same measures indicate a weak but statistically significant adaptation of protein thermostability to elevated body temperature also in endothermic vertebrates.

genome-wide bias in amino acid composition of thermophilic prokaryotes

based on careful structural alignments of proteins, glyakina et al. showed that among the external residues of proteins from thermophilic prokaryotes, three amino acids are enriched, while seven amino acids are depleted compared to mesophilic prokaryotes. this effect is quantified by the combined proportion erk = e + r + k - d - n - q - t - s - h - a. erk is elevated for the exterior regions of proteins from thermophiles compared to mesophiles. it has not yet been tested if this measure, which was developed from an analysis of external residues, is correlated with the optimal growth temperature of individual species when applied to the full amino acid sequences of complete proteomes. applying it to complete amino acid sequences dilutes the signal from the surfaces, but is not expected to lead to any systematic biases. to test this, we used a large set of whole genome sequence data and optimal growth temperatures that was previously compiled by zeldovich et al. this data set contains species, of which are hyperthermophiles, are thermophiles, and are mesophiles.
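as a minimal sketch of how the erk proportion defined above could be computed from raw sequences, the python snippet below counts the three enriched and the seven depleted amino acids and divides their difference by the number of standard residues. the function names (`erk`, `proteome_erk`), the decision to ignore non-standard residue codes, and the pooling of all proteins of a proteome into one composition are assumptions made for illustration; they are not details taken from the original analysis.

```python
from collections import Counter

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
ENRICHED = set("ERK")      # enter erk with a plus sign
DEPLETED = set("DNQTSHA")  # enter erk with a minus sign


def erk_from_counts(counts):
    """erk = (E + R + K - D - N - Q - T - S - H - A) as a proportion of all residues."""
    # assumption: only standard residues count towards the total
    total = sum(n for aa, n in counts.items() if aa in STANDARD)
    if total == 0:
        raise ValueError("no standard amino acids found")
    plus = sum(counts[aa] for aa in ENRICHED)
    minus = sum(counts[aa] for aa in DEPLETED)
    return (plus - minus) / total


def erk(sequence):
    """erk of a single protein sequence given in one-letter code."""
    return erk_from_counts(Counter(sequence.upper()))


def proteome_erk(sequences):
    """erk of the pooled amino acid composition of many proteins (assumed pooling)."""
    pooled = Counter()
    for seq in sequences:
        pooled.update(seq.upper())
    return erk_from_counts(pooled)
```

under these assumptions the value lies between -1 (only depleted residues) and +1 (only e, r and k); averaging per-protein values instead of pooling would be an equally plausible reading of a "mean erk" and can give slightly different numbers.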
in agreement with earlier observations on the surface regions of a subset of genes, we find a strong correlation between the mean erk of complete proteomes and optimal growth temperature. analogous results are obtained using cvp-bias, a very similar alternative measure of temperature-related amino acid usage. as evident from figure this correlation can mostly be attributed to strong differences between hyperthermophiles, thermophiles, and mesophiles. however, despite large variation in amino acid composition among mesophiles, we do still see a significant correlation of erk with optimal growth temperature among prokaryotes living at moderate temperatures. this is in agreement with a detailed study on the properties of six proteins from microorganisms living at temperatures ranging from 7°c to 103°c, which found that compositional features related to thermo-adaptation increase almost linearly with temperature.

amino acid usage patterns are strongly affected by phylogeny

an appreciable number of species in each of the hyperthermophile, thermophile, and mesophile categories are very closely related to each other. the corresponding data points in figure are thus not statistically independent, and simple correlation statistics as reported above may be misleading. we thus employed the comparative phylogenetic method, which calculates statistically independent contrasts; this eliminates correlations due to common descent. controlling for phylogenetic relatedness indeed leads to very different patterns of amino acid enrichment/depletion compared to simple correlations. in particular, the amino acids a, e, h, i, k, w, and y, which show significant positive or negative correlations with temperature in a naïve analysis, do not show any significant correlations after controlling for phylogenetic non-independence.
in contrast, c, m, n, p, and s, which do not show significant correlations with temperature in the naïve analysis, show significant correlations after including phylogeny into the statistical model. in the naïve analysis, there are amino acids which are correlated negatively with growth temperature, while amino acids are correlated positively with growth temperature. after controlling for phylogenetic non-independence, amino acids are correlated negatively with growth temperature, while only amino acids are enriched at high temperatures. thus, the temperature-related patterns seen for individual amino acids depend strongly on evolutionary history. however, we found that erk and cvp-bias, which were both derived including consideration of the protein structure, are still strongly correlated with temperature even after controlling for phylogeny. these results further underline the importance of structural rather than sequence properties in thermal adaptation. organisms living at different ambient temperatures may have different protein repertoires, and thus comparisons of complete genomes are potentially misleading. to circumvent this problem, we performed a complementary analysis restricted to groups of orthologous proteins. we collected amino acid sequence data from species each of hyperthermophiles, thermophiles, and mesophiles. using reciprocal best blast hits, we retained only proteins in each group that had orthologs in at least one of the other groups. comparing the remaining proteins among groups, it is again clear that erk(hyperthermophiles) > erk(thermophiles) > erk(mesophiles). thus, we confirmed that erk, even when applied to complete amino acid sequences on a genomic scale, is a useful predictor of temperature adaptation in prokaryotes. in the remainder of this paper, we use erk to test for a corresponding effect in vertebrates.
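the reciprocal best hit criterion used above to define orthologs can be sketched in a few lines of python. the sketch assumes that the blast searches have already been run in both directions and parsed into (query, subject, bit score) tuples; the function names and this input layout are hypothetical, and score ties are resolved by keeping the first hit encountered, which a real pipeline may handle differently.

```python
def best_hits(hits):
    """map every query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples (assumed input layout).
    """
    best = {}
    for query, subject, score in hits:
        # keep the first hit on ties, replace only on a strictly better score
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """return sorted (a, b) pairs in which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# toy example with made-up identifiers and scores
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0), ("a3", "b1", 50.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a2", 160.0), ("b2", "a1", 140.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

in the toy data, a3 finds b1 as its best hit, but b1 points back to a1, so a3 is not retained; only proteins that appear in some reciprocal pair would enter the comparison between groups.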
we repeated all analyses using the cvp-bias, in each case obtaining qualitatively very similar results; however, as erk and cvp-bias differ only in the treatment of three amino acids, these two measures are not statistically independent.

endothermic vertebrates have biased amino acid usage

just as in prokaryotes, the body temperatures of fish, amphibians, and reptiles are closely linked to ambient temperatures. consequently, the proteins of these ectothermic or 'cold-blooded' vertebrates usually operate below 30° celsius. in contrast, endothermic or 'warm-blooded' vertebrates have a thermoregulation system which keeps their body temperatures at a species-specific constant 35-42° celsius. does this relatively small difference in temperature between endothermic and ectothermic vertebrates result in a discernible selection pressure for increased thermal stability of proteins? if so, we expect to see compositional biases in the same direction as in prokaryotes, as the rules connecting amino acid usage and thermostability appear to apply across the complete temperature range encountered by life.

p-values are for comparison of endothermic to ectothermic vertebrates, treating each genomic average as a single data point. the two last columns list the proportion of at-rich amino acids and gc-rich amino acids.

to test this hypothesis of vertebrate thermo-adaptation, we obtained a total of protein sequences from sequenced species. this included four mammals: human, rat, mouse, and cow; one bird; one reptile; two amphibia; and three fish. analysing the combined amino acid composition of the complete proteomes, we indeed find a small but statistically highly significant shift in erk of endothermic compared to ectothermic vertebrates. as shown in figure there is a strong correlation between the compositional bias of amino acids and the temperature at which the proteins of the species typically act.
consistent with our hypothesis, this correlation is mostly due to a systematic difference between ectothermic and endothermic vertebrates. again, we confirmed this result by restricting the analysis to orthologous proteins. among the ectothermic species considered, anolis carolinensis is the closest relative to the endothermic animals and was thus chosen as the reference genome. we identified orthologous proteins in each of the other genomes as reciprocal best blast hits against anolis carolinensis. in pair-wise comparisons, all five endothermic species show a significantly higher average erk compared to orthologous proteins in anolis carolinensis, while this is not the case for any of our amphibia or fish. however, individual proteins in a single species are not truly independent data points, as species-specific compositional biases unrelated to temperature may exist. we thus performed an additional analysis, which treated the average erk across orthologs as a single data point for each of our species. erk is significantly higher for the mammal/bird group compared to the ectothermic group. just as in the prokaryotic analysis, treating closely related species as independent data points could be misleading: similar compositional biases might be due to common descent rather than common physiology. we thus repeated the genome-wide analysis of amino acid bias using the comparative phylogenetic method of independent contrasts. despite the small sample size, we still find a statistically significant correlation between amino acid bias and temperature after controlling for phylogenetic relatedness.

chicken have elevated erk compared to reptiles

of all ectothermic animal classes, reptiles - which are paraphyletic due to the exclusion of birds - are the closest living relatives to endothermic vertebrates.
thus, we wanted to confirm that the elevated erk values are indeed restricted to endothermic animals, by comparing the chicken genome to several hundred recently published protein segments of three further reptilia. based on best blast hits of the segments against the chicken genome, we constructed protein segment alignments between alligator mississippiensis and chicken, segment alignments between chrysemys picta and chicken, and segment alignments between anolis smaragdinus and chicken. erk in chicken protein segments is significantly higher than in each of the three reptilia species.

elevated erk is not due to biased gc content

the strongest known predictor of amino acid composition at the genomic scale is the gc content of the coding dna sequences. thus, it is conceivable that the biased amino acid composition in endothermic vertebrates is due to gc content variation between the genomes of endothermic and ectothermic vertebrates. however, for the co-orthologs studied here, there are no differences in the usage of at-rich or of gc-rich codons between endothermic and ectothermic genomes. to further exclude gc content as a confounding factor, we investigated aligned orthologous coding sequences of human and danio rerio in more detail. as expected, the human genes encoded proteins with significantly higher erk values than their danio orthologs. if these differences in erk could be fully explained by variation in gc content, we would not expect to see different erk values if we restrict our analysis to those aligned codons that have the same gc content in human and danio. contrary to this expectation, we still see higher erk in the human sequences on these gc-neutral codons. thus, the differences in amino acid composition cannot be simply explained by differences in gc content.

elevated erk is not due to purine loading

secondary structures of rna sequences are built by the formation of hydrogen bonds between purines and their complementary pyrimidines.
purine loading, i.e., the over-representation of purines in coding sequences, thus reduces the potential for self-interactions of the mrna. as self-interactions can interfere with translation, purine loading may be a selected molecular trait. purine loading is found in almost all prokaryotes, and is positively correlated with optimal growth temperature. as biased nucleotide composition can lead to biased amino acid composition, it is conceivable that the observed elevated erk levels in endothermic vertebrates may be a consequence of purine loading. to exclude purine loading as a confounding factor, we employed a strategy analogous to that used for gc content. when we restrict the alignments of the human - danio rerio orthologs to those codons with the same purine content, we still observe a significantly higher erk value in the human sequences. thus, the biased amino acid composition of proteins from endothermic vertebrates cannot be attributed to purine loading alone.

discussion

building upon earlier results on aligned structures of prokaryotic protein pairs, we show that genome-wide amino acid usage biases correlate strongly with the optimal growth temperature of bacteria. that erk and cvp measures are derived directly from physicochemical considerations strengthens the notion that it is indeed selection on thermostability which is responsible for this long-recognised trend. while the enrichment or depletion of individual amino acids in thermophilic species is strongly affected by phylogenetic non-independence, the overall biases measured by erk and cvp are robust. applying the same methodology to vertebrate species, we find that mammalian and bird proteomes show a weak but significant increase in erk and cvp-bias compared to ectothermic fish, amphibia and reptilia. this increase cannot simply be explained by biases in nucleotide composition, and remains statistically significant when controlling for phylogenetic non-independence.
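the nucleotide-composition controls referred to here, i.e., restricting aligned orthologous coding sequences to codons with equal gc content or equal purine content, can be illustrated with the following python sketch. it assumes two codon-aligned coding sequences of equal length in which alignment gaps are written as '---', uses the standard genetic code, and skips stop codons and codons with ambiguous bases; the function names and these conventions are assumptions for illustration and may differ from the procedure actually used.

```python
from itertools import product

# standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_bases(codon, bases):
    """number of positions in the codon occupied by any of the given bases."""
    return sum(codon.count(b) for b in bases)


def matched_codon_peptides(cds_a, cds_b, bases="GC"):
    """translate only those aligned codons whose content of `bases` is equal.

    bases="GC" gives gc-neutral codons, bases="AG" purine-neutral codons.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    pep_a, pep_b = [], []
    for i in range(0, len(cds_a), 3):
        codon_a, codon_b = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        aa_a, aa_b = CODON_TABLE.get(codon_a), CODON_TABLE.get(codon_b)
        # skip gaps and ambiguous codons (not in the table) and stop codons
        if aa_a is None or aa_b is None or "*" in (aa_a, aa_b):
            continue
        if count_bases(codon_a, bases) == count_bases(codon_b, bases):
            pep_a.append(aa_a)
            pep_b.append(aa_b)
    return "".join(pep_a), "".join(pep_b)


# toy alignment (made-up sequences)
seq_a = "GAAAAAGCT---TGG"
seq_b = "GATAAGGCCACGTGG"
gc_neutral = matched_codon_peptides(seq_a, seq_b, bases="GC")
purine_neutral = matched_codon_peptides(seq_a, seq_b, bases="AG")
```

the amino acid bias would then be computed separately on the two returned peptide strings; if a difference between the species persists on these composition-matched positions, it cannot be explained by the matched nucleotide property alone.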
while the examined dataset of genome sequences is necessarily small and not evenly sampled across vertebrates, we thus have strong evidence for a direct relationship between amino acid bias and the temperature at which vertebrate proteins operate. analogous to the situation in prokaryotes, our findings are most parsimoniously explained by selection for increased stability against thermal fluctuations in endothermic vertebrates. why then do we not see a correlation of amino acid usage bias with environmental temperature when considering only ectothermic vertebrates? apart from an issue of small sample size, this lack of a correlation may be due to the fact that ectothermic vertebrates can rapidly switch between habitats of different temperatures during evolution. this is evident, e.g., from the two xenopus species in our study, which thrive at 18- and 23-28°c, respectively. it should be pointed out that ectotherms are not necessarily cold-blooded, i.e., body temperature in some ectothermic species can reach temperatures as high as or higher than in endotherms. furthermore, internal temperature can vary between different body regions of an ectotherm, and can be above the outside temperature. however, the temperatures listed in table are 'optimal' temperatures for these species, and internal temperatures will indeed be close to these values. on average, body temperature in endotherms is higher than in ectotherms, and has likely remained stable since the last common ancestors of mammals and of birds. a shift towards stability-increasing amino acids in proteins of endothermic vertebrates mirrors similar effects seen for the nucleotide composition of structural rnas. while the effect for structural rnas appears to be much stronger, this may not be surprising: rna structures are formed by direct bonds between complementary bases, g-c bonds being more stable than a-u bonds. thus, thermostability of rnas is directly related to the gc fraction of sites involved in bond formation.
the effect of individual amino acids on the thermostability of proteins is much more subtle: the relevance of different physicochemical properties of amino acids depends on their three-dimensional context within the protein structure. the subtleness of this effect was already seen in prokaryotic proteins, where we found only a weak correlation of amino acid usage bias with optimal growth temperature among mesophiles. taken together, our results indicate weak but significant genome-wide positive selection on protein structure during the change from ectothermic to endothermic life styles in vertebrates. this molecular process may have been very similar to the adaptation of microorganisms that switch from mesophilic to thermophilic life styles, except that the temperature differences involved were much smaller.