G>T transversion VS. T>G transversion?



So I'm reading about how mutations in DNA can be caused by oxidative damage. An example of a product of oxidative damage is given: 8-oxo-7-hydrodeoxyguanosine.

My textbook says that this product frequently mispairs with A, resulting in a high level of G-->T transversions.

What I'm not understanding is why it's a G-->T transversion. If the 8-oxo-G is pairing with A, resulting in an A-G pair, isn't this a T-->G transversion? Or does the order not matter?


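Originally the position had a G:C pair. After the oxidative damage, it holds an oxoG:C pair. At the next replication, the strand carrying C templates a G, so the original G:C pair is restored as expected. The strand carrying oxoG, however, can pair with A as well as with C, so it may give rise to an oxoG:A pair. In a further round of replication, the new A-containing strand templates a T, giving a T:A pair, while the oxoG strand can again pair with A. Thus the G in the original strand is replaced by T in that lineage (i.e., it is a G-->T transversion). The mutation is named for the change at that position on the original strand, from the undamaged base to the final one, not for the transient oxoG:A mispair.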
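The two rounds of replication described above can be traced in a minimal Python sketch. The function name `replicate` and the all-or-nothing mispairing flag are illustrative assumptions, not a model of real polymerase behaviour.

```python
# Watson-Crick partners; "oxoG" stands for the damaged guanine (8-oxo-G).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def replicate(template_base, oxog_mispairs=True):
    """Return the base inserted opposite template_base in the new strand."""
    if template_base == "oxoG":
        # Assumption for illustration: oxoG either always mispairs with A
        # or always pairs correctly with C.
        return "A" if oxog_mispairs else "C"
    return COMPLEMENT[template_base]

# Start: a G:C pair whose G has been oxidised, i.e. (oxoG, C).
damaged_strand, partner_strand = "oxoG", "C"

# Round 1: each parental strand is copied separately.
round1 = [
    (damaged_strand, replicate(damaged_strand)),   # oxoG:A mispair
    (replicate(partner_strand), partner_strand),   # G:C, original pair restored
]

# Round 2: copy the new A-containing strand of the oxoG:A duplex.
new_a = round1[0][1]
round2_pair = (replicate(new_a), new_a)            # T:A

print(round1, round2_pair)
```

The position that originally held G on the damaged strand ends up holding T in one lineage, which is why the change is written G-->T.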


Genome-wide DNA mutations in Arabidopsis plants after multigenerational exposure to high temperatures

Elevated temperatures can cause physiological, biochemical, and molecular responses in plants that can greatly affect their growth and development. Mutations are the most fundamental force driving biological evolution. However, how long-term elevations in temperature influence the accumulation of mutations in plants remains unknown.

Results

Multigenerational exposure of Arabidopsis MA (mutation accumulation) lines and MA populations to extreme heat and moderate warming results in significantly increased mutation rates in single-nucleotide variants (SNVs) and small indels. We observe distinctive mutational spectra under extreme and moderately elevated temperatures, with significant increases in transition and transversion frequencies. Mutation occurs more frequently in intergenic regions, coding regions, and transposable elements in plants grown under elevated temperatures. At elevated temperatures, more mutations accumulate in genes associated with defense responses, DNA repair, and signaling. Notably, the distribution patterns of mutations among all progeny differ between MA populations and MA lines, suggesting that stronger selection effects occurred in populations. Methylation is observed more frequently at mutation sites, indicating its contribution to the mutation process at elevated temperatures. Mutations occurring within the same genome under elevated temperatures are significantly biased toward low gene density regions, special trinucleotides, tandem repeats, and adjacent simple repeats. Additionally, mutations found in all progeny overlap significantly with genetic variations reported in 1001 Genomes, suggesting non-uniform distribution of de novo mutations through the genome.

Conclusion

Collectively, our results suggest that elevated temperatures can accelerate the accumulation, and alter the molecular profiles, of DNA mutations in plants, thus providing significant insight into how environmental temperatures fuel plant evolution.


Introduction

Although thyroid carcinoma represents only 1% of all malignant diseases, it is the most common malignancy of the endocrine system, accounting for the majority of deaths from endocrine cancers (Mazzaferri, 1993; Farid et al., 1995). Thyroid carcinomas are histologically classified as papillary thyroid carcinoma (PTC), follicular thyroid carcinoma (FTC), anaplastic thyroid carcinoma (ATC), and medullary thyroid carcinoma (MTC), accounting for approximately 80, 15, 2, and 4% of all thyroid malignancies, respectively (Hundahl et al., 1998). PTC, FTC, and ATC are derived from follicular thyroid epithelial cells, whereas MTC is derived from calcitonin-secreting parafollicular C cells. MTC occurs in either a sporadic or a familial form (75–80 and 20–25% of cases, respectively) (Girelli et al., 1998; Massoll and Mazzaferri, 2004), and is a consistent feature of multiple endocrine neoplasia MEN2A, MEN2B, and familial non-MEN MTC (FMTC). Germline mutation of the RET proto-oncogene is found in the majority of patients with MEN2 and FMTC, and is considered the primary genetic cause of the disease (Marsh et al., 1997). RET somatic point mutations have also been reported in 40–50% of sporadic MTCs (Hansford and Mulligan, 2000; Elisei et al., 2004). Successful management of MTC depends heavily on early diagnosis and treatment. The current clinical strategies in the diagnosis and management of hereditary MTC rely on the routine plasma calcitonin assay and genetic testing for RET proto-oncogene mutations (Massoll and Mazzaferri, 2004). Plasma calcitonin is a sensitive and specific marker for the presence of MTC and provides an accurate estimate of tumour burden. Genetic testing for RET proto-oncogene mutations provides an accurate diagnosis of gene carriers among at-risk family members and offers the best chance of cure for MTC by prophylactic total thyroidectomy and lymphadenectomy (Massoll and Mazzaferri, 2004). It is estimated that the probability is 0.18% for a first-degree relative to inherit a mutant RET gene from an individual who has sporadic MTC without germline RET mutation (Massoll and Mazzaferri, 2004).

Germline mutations in nuclear genes that encode mitochondrial enzymes have been implicated in hereditary neoplasias (Eng et al., 2003). For example, germline heterozygous mutations in the autosomally encoded mitochondrial enzyme fumarate hydratase are associated with hereditary predisposition to papillary renal-cell carcinoma and leiomyomatosis (Tomlinson et al., 2002; Kiuru and Launonen, 2004), whereas homozygous fumarate hydratase mutations cause neurodegeneration (Bourgeron et al., 1994). Germline mutations in the autosomally encoded mitochondrial enzyme succinate dehydrogenase (SDH) subunits SDHD, SDHC, and SDHB (mitochondrial respiratory chain complex II) are associated with hereditary predisposition to phaeochromocytoma and paraganglioma, both of which arise from chromaffin cells derived from neural crest precursors (Astuti et al., 2001; Benn et al., 2003). Although thyroid C-cells are derived from neural crest cells and both MTC and phaeochromocytoma are part of MEN2A, mtDNA mutation in MTC has not been reported in the literature. In the present study, we investigated 26 MTC specimens and one MTC cell line for mtDNA mutation by sequence analysis of the entire coding region of mtDNA.


Discussion

Based on a collection of eight studies that report fitnesses for replacement mutations, we have assessed the prospects for the hypothesis that the conservativeness of replacements via transition accounts for their increased frequency in evolution. Even small studies reveal predictable patterns of amino acid exchangeability, and most have sufficient power to distinguish a binary conservative-vs-radical distinction. However, the same studies typically do not show significant conservativeness of transitions. Overall, the chance of a transition mutation being more fit than a transversion is 53 % (95 % CI 50 to 56). This effect size is not large compared with that of most biochemical predictors, and is not large enough to explain the several-fold bias toward transition replacements observed in evolutionary studies.

The finding that the conservativeness of transitions is a rather weak effect increases the prospects for the alternative mutational explanation, in which the rate at which new alleles are introduced by transition mutations is several-fold higher than for transversions, and this bias predisposes evolutionary change to happen via transitions (for a general explanation, see Stoltzfus and Yampolsky 2009).

Although this idea may be familiar, it relates to a rather substantial and unresolved issue in evolutionary genetics, which is the extent to which evolution in nature happens in the “gene pool” regime supposed by the architects of the Modern Synthesis, in the kind of mutation-driven regime supposed by early mutationists and later molecular evolutionists, or something in between (McCandlish and Stoltzfus 2014). The idea of mutation and selection as opposing forces suggests that mutation bias will be influential only when selection is absent, thus hypotheses that invoke mutation bias are often interpreted as neutral models (as noted by Yampolsky and Stoltzfus 2001). Presumably, this is why researchers have pursued selective explanations for transition:transversion bias among amino acid changes, even while accepting a mutational explanation for noncoding changes in the same genes: the proteins are assumed to be “under selection” and thus not susceptible to mutation bias. However, this way of depicting mutation and selection as opposing forces is only justified under the special conditions of the gene pool regime. Outside of this regime, mutation and selection can both contribute to orientation or direction in evolution (Yampolsky and Stoltzfus 2001; McCandlish and Stoltzfus 2014).

The results presented here also prompt the question of how it came to be so widely supposed that transitions are conservative. In a survey of the literature, we found that, when the alleged conservativeness of transitions is attributed to a source, the source is often Zhang (2000), or early works such as Fitch (1967), Grantham (1974), or Vogel and Kopun (1977). Grantham (1974) does not address this issue explicitly, but a genetic code-based calculation shows that the mean Grantham distance for transition-mediated replacements is lower than that for transversions, for example, as indicated in table 2 of Xia et al (1998). The study by Vogel and Kopun is often cited as evidence for the conservative transitions hypothesis, because they present a calculation that, for three different biochemical measures, suggests that transitions are more conservative.

These prior studies are inconclusive for two general reasons. The first is that none reports an effect size sufficient to account for the evolutionary bias. Indeed, Vogel and Kopun themselves favored a mutational explanation for the evolutionary bias on the grounds that the effect size for conservativeness of transitions seemed to be too small (see hypothesis 3 on p. 179). Zhang’s (2000) analysis of three possible conservative:radical distinctions finds that the distinction based on Miyata et al (1979) yields the largest evolutionary effect size, which is a 2-fold effect, that is, radical replacements are roughly half as likely to accumulate, relative to null expectations. However, although the effect of conservativeness is 2-fold, the link reported between transitions and conservativeness is weak. According to Zhang (2000), the chance that a transition is conservative by Miyata’s measure is 35 %, compared with 33 % for transversions, a proportional difference of only 6 % (i.e., 2/33 = 0.06). Miyata-conservativeness may be a 2-fold evolutionary effect, but if transitions are only 6 % more Miyata-conservative than transversions, the overall bias will be far less than 2-fold.
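To see roughly how small the resulting bias would be, the arithmetic can be sketched in Python. The mixing model here is an assumption made for illustration: conservative replacements accumulate at a relative rate of 1 and radical ones at 0.5 (the 2-fold effect), and the function name `relative_rate` is hypothetical.

```python
# Figures quoted from Zhang (2000) in the text above.
p_cons_ti = 0.35   # chance a transition is Miyata-conservative
p_cons_tv = 0.33   # chance a transversion is Miyata-conservative

# Assumed for illustration: radical replacements accumulate half as often.
radical_penalty = 0.5

def relative_rate(p_conservative, penalty):
    """Average accumulation rate for a mix of conservative and radical changes."""
    return p_conservative * 1.0 + (1.0 - p_conservative) * penalty

# Proportional difference in conservativeness: 2/33, about 6 %.
proportional_difference = (p_cons_ti - p_cons_tv) / p_cons_tv

# Overall transition:transversion bias attributable to conservativeness.
bias = relative_rate(p_cons_ti, radical_penalty) / relative_rate(p_cons_tv, radical_penalty)

print(round(proportional_difference, 2), round(bias, 3))
```

Under these assumptions the ratio is 0.675/0.665, about 1.015, far below the several-fold bias seen in evolutionary studies.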

Second, none of these works escapes the kind of logical circularity pointed out by Di Giulio (2001), see also Yampolsky and Stoltzfus (2005), in which a measure of evolutionary tendencies is invoked to argue for effects of selection rather than mutation, ignoring the possibility that the pattern of evolution is itself influenced by mutational effects. This is an indirect (and thus presumably unintended) form of the Panglossian fallacy, that is, it is formally a fallacy of arguing that transitions are better simply because they happen more often, without inquiring into why they happen more often.

The circularity is not avoided by invoking biochemical factors. The popular composite indices of “biochemical” distance constructed by Grantham (1974) and Miyata et al (1979) are based on choosing biochemical factors that fit well with observed evolutionary patterns from earlier protein comparisons. Likewise, all three biochemical measures used by Vogel and Kopun (1977) are based on fitting to protein comparisons. The problem with this approach is suggested in figure 4, which shows the conservativeness of transitions for biochemical indices in the AAindex database (Kawashima and Kanehisa 2000). About 3/5 make transitions seem conservative, and the other 2/5 make them seem radical.

The advantage of transitions implied by various biochemical factors. The 245 biochemical factors from AAindex were used to compute a pairwise similarity measure for amino acids indicating their biochemical similarity, then these measures were used to assess whether transitions are more conservative than transversions. AUC is the chance that a replacement due to a transition has a higher similarity score than a randomly chosen transversion (where the random sampling of transitions and transversions is based on the pool of actual mutants from the eight studies). The resulting distribution indicates that transitions are more conservative according to about 3/5 of biochemical factors (AUC > 0.5), and less conservative according to the other 2/5 of factors (AUC < 0.5).
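The AUC defined in this caption can be computed by comparing every transition score with every transversion score. The sketch below is a minimal Python version with made-up scores. Counting ties as one half is a common convention assumed here, not something stated in the text.

```python
def auc(transition_scores, transversion_scores):
    """Chance that a random transition outscores a random transversion.

    Ties contribute one half (an assumed convention).
    """
    wins = 0.0
    for ti in transition_scores:
        for tv in transversion_scores:
            if ti > tv:
                wins += 1.0
            elif ti == tv:
                wins += 0.5
    return wins / (len(transition_scores) * len(transversion_scores))

# Hypothetical similarity scores, for illustration only.
ti_scores = [0.9, 0.6, 0.4]
tv_scores = [0.7, 0.5, 0.4, 0.2]

print(auc(ti_scores, tv_scores))
```

A value above 0.5 means the index makes transitions look more conservative. A value below 0.5 means it makes them look more radical.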

As figure 2 indicates, this is not because biochemical indices are generally poor predictors of exchangeability. Instead, among many moderately powerful predictors, there are ones that make transitions seem favorable, and others that make transversions seem favorable. Thus, converting evolutionary patterns into biochemical descriptors before reapplying them to the analysis of evolutionary patterns does not allow one to escape a logical circularity: some biochemical factors can be invoked to rationalize the conservativeness of transitions, whereas others can be invoked to rationalize the conservativeness of transversions.


Results

Cox1 reference database

We created a reference database of 624 aligned, partial, 351-bp-long cox1 sequences. The database included a mixture of sequences from our target species (378 sequences from 149 species), companion species (226 sequences from 139 species), and model species (20 sequences from 20 species). Target species were intended to trace cross-contamination among samples. Companion species were introduced as negative controls. Model species were introduced to search for contamination by standard laboratory organisms. In our reference database, 31 of our target species were not represented at all, 98 were represented by a single cox1 sequence, and six were represented by more than ten cox1 sequences, implying that our ability to detect the occurrence of a given species in a given sample varied among species.

Patterns of between-species contamination

Short sequence reads from each of 446 samples (individuals) from 116 species were aligned to our reference cox1 database using BWA. The number of hits to each reference sequence was recorded and divided by the number of millions of reads of the considered sample. For each sample, we calculated the prevalence of cox1 hits to a reference sequence from the expected species, and the prevalence of cox1 hits to a reference sequence from an unexpected species – that is, a species differing from the expected species by >5% of cox1 divergence. Hits to a species different but <5% divergent from the expected one were not counted.
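The prevalence measure and the 5% divergence rule can be sketched as follows. This is an illustrative Python fragment, not the authors' pipeline. The function names and the example species pair are assumptions, and divergence is expressed as a fraction.

```python
def prevalence(hits, total_reads):
    """Number of cox1 hits per million reads of the sample."""
    return hits / (total_reads / 1_000_000)

def classify_hit(hit_species, expected_species, divergence, threshold=0.05):
    """Label a hit as 'expected', 'unexpected', or 'ignored'.

    divergence is the cox1 divergence between the two species (0..1).
    Hits to a different species that is too close to the expected one
    are not counted, following the rule described in the text.
    """
    if hit_species == expected_species:
        return "expected"
    if divergence > threshold:
        return "unexpected"
    return "ignored"

# Hypothetical sample: 20 million reads, 50 hits to a divergent species.
print(prevalence(50, 20_000_000))
print(classify_hit("Mytilus edulis", "Mytilus trossulus", 0.12))
```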

Figure 2 shows an overview of the contamination pattern in this large-scale data set. Figure 2a shows the across-samples distribution of the prevalence of expected (gray) vs. unexpected (red) cox1 reads, while Fig. 2b plots these two variables against each other. The across-samples median prevalence of expected cox1 reads was 674 cox1 reads per million. The prevalence of expected cox1 reads was sometimes low: it was <10 per million in 86 samples, and zero in 52 samples, of which 13 were from a species that was represented in our reference cox1 database. This is quite surprising, given that cox1 is generally considered a highly expressed gene. This result might be explained by insufficient/inappropriate species representation in the reference database for these particular samples. It might also be that in some taxa mitochondrial transcripts lack a polyA tail (or use it as a degradation signal, as in plants [39]) and were therefore excluded at the reverse transcription stage in our protocol.

Overall pattern of between-species contamination. a Among-sample distribution of the prevalence of reads mapping to a cox1 reference from the expected (gray) or an unexpected (red) species. Prevalence is defined as the number of cox1 reads per million reads. b Relationship between the prevalence of cox1 reads mapping to the expected (x-axis) vs. an unexpected (y-axis) species, again per million reads. Each dot represents a sample. Plain line: ratio of unexpected to expected cox1 reads is one. Dotted lines: ratio of unexpected to expected cox1 reads is 0.1 (respectively, 0.01). Samples from species not represented in our cox1 reference database are not shown

We found at least one hit to an unexpected species in 353 of the 446 samples. The prevalence of unexpected cox1 hits was >50 per million in 22 samples, and >500 per million in seven samples. One species, woodlouse Armadillidium vulgare, was particularly affected by unexpected hits – six individuals out of ten showed >50 per million unexpected hits. Twelve samples for which the prevalence of expected hits was >100 per million had a ratio of unexpected to expected hits >0.1, and two samples, GA24O (earthworm Allolobophora chlorotica L1) and GA17L (brine shrimp Artemia tibetiana), had a ratio >1.0. In summary, expected cox1 reads clearly dominated but contaminant reads were common and reached a high prevalence in a substantial number of samples.

The vast majority (99.54%) of the 385,597 unexpected cox1 reads originated from target species. Only 0.11% of the unexpected hits were assigned to a companion species, and 0.35% to a model species. The low prevalence of companion species was expected and confirmed that unexpected cox1 hits result almost exclusively from contamination. Regarding model species, we detected human cox1 reads in ten samples from nine distinct species, but always at very low prevalence – the total number of reads hitting a human cox1 sequence was 92. Mus musculus and Bos taurus were more prevalent in terms of total reads (507 and 447, respectively), but concerned a smaller number of samples (five and three) and species (three and three, respectively).

Among the 446 analyzed samples, 353 included at least one read mapping to an unexpected species – that is, showed evidence for between-species contamination. Of these, 205 were contaminated by at least two species, and we detected up to eight contaminant species in samples GA08R (Glanville fritillary Melitaea cinxia) and GA34L (mosquito Culex hortensis). Summing contaminant species across samples, we found that the data set had been affected by at least 782 distinct events of between-species contamination. This is an underestimate, due to the incompleteness of our reference database, our inability to detect contamination between closely related species, and the possibility of multiple contamination events of a given sample by a given species. The number of expected cox1 reads, unexpected cox1 reads, and contaminant species per sample are available in Additional file 2: Table S2. Conversely, 94 of the 180 species we processed in this project did contaminate at least one sample from another species. Among these, four species contaminated more than 15 distinct samples, and one, king penguin Aptenodytes patagonicus, contaminated samples from as many as 11 distinct species (Additional file 4: Figure S1). We found that the mean prevalence of expected cox1 reads of a species was significantly correlated with the number of individuals it contaminated (r = 0.35, p < 10^-3) and with the total number of contaminant reads it contributed (r = 0.45, p < 10^-4, log-transformed number of contaminant reads).

Dubious samples

Two samples resulted in unexpected patterns. Sample GA36K, assigned to species Mytilus trossulus (bay mussel), yielded a single cox1 read that mapped to a M. trossulus reference, but >18,000 cox1 reads that mapped to a sequence from either M. edulis or M. galloprovincialis, two interbreeding species of European mussels (Fig. 2b, top left dot). By contrast, 99% of cox1 reads from the other M. trossulus sample that we analyzed, GA36L, mapped to a M. trossulus reference. The GA36K sample was collected in Seattle, WA, USA, a state in which invasive populations of European mussels are documented [40, 41]. Sample GA36K therefore probably results from an identification error, or reflects M. galloprovincialis/edulis mtDNA introgression into M. trossulus.

Similarly, sample GA08F, assigned to Glanville fritillary Melitaea cinxia (Lepidoptera), did not yield a single cox1 read that mapped to a M. cinxia reference, but >26,000 cox1 reads that mapped to a reference from the Spanish fritillary Euphydryas desfontainii. This species is quite divergent from M. cinxia, both morphologically and molecularly (cox1 divergence >25%), so mtDNA introgression and misidentification appear unlikely in this case. According to our records, the GA08F sample came from Aland, Finland, a place where E. desfontainii does not occur. We did, however, sample E. desfontainii, together with M. cinxia, in Morocco. The problem, therefore, probably resulted from sample mislabeling. The GA08F sample very likely belongs to E. desfontainii and was mistaken for an M. cinxia individual in our published analyses. We checked, however, that our main results are robust to these problems (see final paragraph of the “Results” section).

Analysis of laboratory metadata

We created a between-species contamination matrix M in which cell m_ij contained zero in the absence of evidence for contamination of species j by species i, one in case of the detected contamination of species j by species i, and missing data if species i and j were <5% divergent cox1-wise, such that contamination detection was assumed to be unreliable. Here, a single read from any individual of species i hitting a reference sequence from species j was considered sufficient to attest to an event of contamination of i by j. Requiring at least ten unexpected reads, instead of just one, yielded qualitatively similar results. The 39 samples from species not represented in our reference cox1 database were here disregarded, so that sample size was 407 in this analysis. The total number of ones in M was 362, and the total number of pairs of species sufficiently divergent such that contamination detection was possible was 27,251, so that the proportion of species pairs for which an event of contamination was detected was p = 0.0133.
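As a rough illustration of how p follows from M, the Python sketch below counts ones among the informative (non-missing) cells. The toy matrix and the use of `None` for missing data are assumptions made for the example.

```python
def contamination_proportion(matrix):
    """Proportion of informative cells (not None) that equal one."""
    ones = 0
    informative = 0
    for row in matrix:
        for cell in row:
            if cell is None:      # pair too closely related: missing data
                continue
            informative += 1
            ones += cell
    return ones / informative

# Toy 3-species matrix; the diagonal is missing by construction.
toy_matrix = [
    [None, 1, 0],
    [0, None, 0],
    [1, 0, None],
]
print(contamination_proportion(toy_matrix))

# Figures reported in the text: 362 ones among 27,251 informative pairs.
p_reported = 362 / 27251
print(round(p_reported, 4))
```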

We focused on five predictors of the probability for two species to be connected by contamination, namely lab_overlap, same_technician, same_shipment, same_flowcell, and same_lane. To calculate the lab_overlap variable, we first defined the processing period of any given species as the period from date of entry into our laboratory to date of last shipment to a sequencing center. For any given pair of species, lab_overlap was defined as the length, in days, of the intersection between the processing periods of the two species. The same_technician variable was a Boolean variable set to one if at least one sample of each of the two considered species was treated by the same person in our laboratory, and to zero otherwise. Similarly, the same_shipment, same_flowcell, and same_lane variables indicated whether at least one sample of each of the two considered species had been shipped on the same day to the same sequencing center, or sequenced on the same flowcell/same lane, respectively.
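Two of these predictors can be sketched in Python. The processing periods and technician names below are hypothetical. Measuring overlap as the plain difference between dates, without counting both end days, is an assumption.

```python
from datetime import date

def lab_overlap(period_a, period_b):
    """Length in days of the intersection of two (start, end) periods."""
    start = max(period_a[0], period_b[0])
    end = min(period_a[1], period_b[1])
    return max((end - start).days, 0)   # zero when the periods are disjoint

def same_technician(technicians_a, technicians_b):
    """1 if at least one person handled samples of both species, else 0."""
    return 1 if set(technicians_a) & set(technicians_b) else 0

# Hypothetical processing periods for two species.
species_a = (date(2010, 1, 1), date(2010, 5, 26))
species_b = (date(2010, 3, 1), date(2010, 12, 31))

print(lab_overlap(species_a, species_b))
print(same_technician(["tech1", "tech2"], ["tech2"]))
```

The Boolean variables same_shipment, same_flowcell, and same_lane could be computed in the same way as same_technician, from sets of shipment dates, flowcell identifiers, or lane identifiers.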

We calculated the average value of these variables across all pairs of species for which an event of contamination was attested (Fig. 3, red vertical bars), and compared these to null distributions obtained by shuffling zeros and ones in the contamination matrix (Fig. 3, white histograms, 1000 replicates). More precisely, each cell of a randomized matrix was assigned one with probability p, or zero with probability (1 − p), with missing data being left unchanged, where p = 0.0133 was the overall probability of contamination (see above). We detected a strong and significant effect of each of the five variables: compared to the average species pair, species contaminating each other tended to have a longer period of overlap in our laboratory, to be handled by the same technician, and to be sent the same day and sequenced on the same flowcell. The effect of sequencing center-associated variables was particularly strong. For instance, the probability for two species that were shipped together to be connected by an event of contamination was 0.13, that is, more than ten times the unconditional probability. The same_lane pattern was very similar to same_flowcell and is not shown in Fig. 3.
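The randomization procedure can be sketched as follows. This is a minimal Python version under stated assumptions: `None` marks missing data, `predictor` is a matrix of values such as lab_overlap, and all function names are hypothetical.

```python
import random

def randomize(matrix, p, rng):
    """Assign one with probability p to each informative cell; keep missing data."""
    return [[None if cell is None else (1 if rng.random() < p else 0)
             for cell in row] for row in matrix]

def mean_over_contaminated(matrix, predictor):
    """Average predictor value over the cells of matrix that equal one."""
    values = [predictor[i][j]
              for i, row in enumerate(matrix)
              for j, cell in enumerate(row) if cell == 1]
    return sum(values) / len(values) if values else None

def null_distribution(matrix, predictor, p, replicates=1000, seed=0):
    """Statistic recomputed on randomized matrices (empty replicates skipped)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(replicates):
        stat = mean_over_contaminated(randomize(matrix, p, rng), predictor)
        if stat is not None:
            stats.append(stat)
    return stats

# Toy data for illustration only.
toy_matrix = [[None, 1, 0], [0, None, 0], [1, 0, None]]
toy_predictor = [[0, 10, 0], [0, 0, 0], [30, 0, 0]]

observed = mean_over_contaminated(toy_matrix, toy_predictor)
null = null_distribution(toy_matrix, toy_predictor, p=0.5, replicates=1000)
print(observed, len(null))
```

The observed statistic would then be compared with the spread of the null values, for instance by counting how many randomized values reach or exceed it.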

Effect of laboratory metadata on the probability of between-species contamination. Four statistics are shown: lab_overlap (top left), same_technician (top right), same_shipment (bottom left), same_flowcell (bottom right). x-axis: average value of each statistic. Vertical red line: actual data set. y-axis: number of randomized data sets (out of 1000). White histograms: expected distribution assuming random probability of contamination. Blue histograms: expected distribution assuming that contamination is dependent on same_shipment. Green histograms: expected distribution assuming that contamination is dependent on lab_overlap and same_technician.

The five analyzed variables were significantly correlated with each other. We tried to disentangle their effects, and particularly distinguish the influence of our laboratory from that of sequencing centers. To this aim, we compared the observed value of lab_overlap and same_technician to null distributions obtained by reshuffling M in a way that controls for the effects of same_shipment (Fig. 3, top, blue histograms). In this analysis, each (i, j) cell of a randomized matrix was assigned one with probability p_ij, or zero with probability (1 − p_ij), again leaving missing data unchanged, where p_ij was the probability of contamination knowing same_shipment(i, j). These were obtained by calculating the proportion of ones in M conditional on values 0 or 1 for same_shipment. Similarly, the null distributions of same_shipment and same_flowcell conditional on lab_overlap and same_technician were generated (Fig. 3, bottom, green histograms). The effects of the five variables were still significant in these control analyses: a laboratory effect was detected when controlling for sequencing center-associated variables and a sequencing center effect was detected when controlling for laboratory-associated variables.
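The conditional randomization can be illustrated with a short Python sketch. It is a toy version under stated assumptions: `None` marks missing data, `same_shipment` is a 0/1 matrix, and the function names are hypothetical.

```python
import random

def conditional_probabilities(matrix, same_shipment):
    """Proportion of ones in matrix, separately for same_shipment = 0 and 1."""
    counts = {0: [0, 0], 1: [0, 0]}           # value -> [ones, informative cells]
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if cell is None:
                continue
            key = same_shipment[i][j]
            counts[key][0] += cell
            counts[key][1] += 1
    return {key: (ones / total if total else 0.0)
            for key, (ones, total) in counts.items()}

def randomize_conditional(matrix, same_shipment, rng):
    """Draw each informative cell with its own probability given same_shipment."""
    probs = conditional_probabilities(matrix, same_shipment)
    return [[None if cell is None
             else (1 if rng.random() < probs[same_shipment[i][j]] else 0)
             for j, cell in enumerate(row)]
            for i, row in enumerate(matrix)]

# Toy data for illustration only.
toy_matrix = [[None, 1, 0], [0, None, 0], [1, 0, None]]
toy_shipment = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]

print(conditional_probabilities(toy_matrix, toy_shipment))
print(randomize_conditional(toy_matrix, toy_shipment, random.Random(0)))
```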

To analyze this effect more deeply, we created two synthetic variables summarizing the effect of laboratory (LAB) and sequencing center (CENTER), respectively. The LAB variable was positive when same_technician was true and lab_overlap was >200 days, but negative otherwise. The CENTER variable was negative for pairs of species shipped on distinct dates, but positive otherwise. Regarding species pairs that were sent together, we distinguished pairs sequenced on distinct flowcells (CENTER+), the same flowcell but distinct lanes (CENTER++), and the same lane (CENTER+++). In this analysis we focused on the 97 species for which information on shipment dates, flowcell, and lane numbers was available for all individuals. As far as species sent on distinct dates were concerned (CENTER-), the contamination probability was very low regardless of LAB (Table 1, first line). This seems to be incompatible with the hypothesis of a substantial level of contamination in our laboratory. In contrast, the probability that two species shipped on the same day were connected by an event of contamination was as high as 0.2, and further increased in case of shared flowcell and shared lane (Table 1, lines 2 to 4), reaching values >0.5.
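The two synthetic variables can be written as simple classification rules. The Python sketch below follows the definitions in the paragraph above. The function names and label strings are illustrative assumptions.

```python
def lab_class(same_technician, lab_overlap_days):
    """LAB+ when handled by the same technician with more than 200 days of overlap."""
    return "LAB+" if same_technician and lab_overlap_days > 200 else "LAB-"

def center_class(same_shipment, same_flowcell, same_lane):
    """CENTER level for a pair of species, from shipment, flowcell, and lane sharing."""
    if not same_shipment:
        return "CENTER-"          # shipped on distinct dates
    if same_lane:
        return "CENTER+++"        # same lane
    if same_flowcell:
        return "CENTER++"         # same flowcell, distinct lanes
    return "CENTER+"              # same shipment, distinct flowcells

print(lab_class(True, 250), center_class(True, True, False))
```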

Surprisingly, we detected a strong and significant interaction between the LAB and CENTER variables (Table 1). Two species being shipped the same day (CENTER+), overlapping in our laboratory, and being handled by the same technician (LAB+) substantially increased the probability of contamination. We suggest that this is an induced effect resulting from the fact that tubes in shipped boxes were ordered by technician, so that samples processed by the same technician in our laboratory were presumably more likely to be processed together by sequencing centers, and therefore to contaminate each other. To test this hypothesis, we subsampled species in such a way that a single species per technician per shipment was kept, so that no induced effect of same_shipment on same_technician was possible. We found eight events of contamination between the 24 species of the subsample. There was still a significant effect of same_shipment on contamination probability in this subsample, but no effect of lab_overlap or same_technician was detected (Additional file 5: Figure S2), suggesting that the LAB effect conditional on CENTER+ reported in Table 1 was an induced effect. These analyses therefore indicate that the vast majority of the events of between-species contamination we detected occurred in sequencing centers. The results were qualitatively unchanged when a 10% threshold was used, instead of 5%, for the minimal cox1 divergence between contaminant and contaminated species (Additional file 6: Table S3).

Laboratory contamination: detailed analysis

Eight events of contamination were detected between species that were not shipped on the same date. Of these, four involved Glanville fritillary M. cinxia. This is the one species in our data set that included samples for which data on shipment date are missing (GA08B to GA08F, Additional file 2: Table S2). The three species that contaminated or were contaminated by M. cinxia but lacked an attested shared shipment date with M. cinxia – Iberian hare Lepus granatensis, mountain hare L. timidus and ascidian Ciona intestinalis A – were shipped the same day, May 26, 2010. It seems therefore possible, not to say probable, that samples GA08B to GA08F were actually sent out for sequencing on May 26, 2010, and that contamination occurred in the sequencing center in this case, too.

Besides these four cases, one detected event of contamination between species not shipped on the same date involved gorgonian Eunicella cavolini and European blue mussel M. galloprovincialis. E. cavolini, however, shares a shipment date (January 23, 2013) with M. edulis, the other species of European mussel, which hybridizes with M. galloprovincialis – the two species have very similar haplotypes in our reference cox1 database. A closer inspection of the data revealed that the single E. cavolini sample, GA31L, affected by contamination from M. galloprovincialis is the single E. cavolini sample that was shipped on January 23, 2013. Eight cox1 reads from this sample mapped to a M. edulis reference and two mapped to a M. galloprovincialis reference. In conclusion, only three events of between-species contamination out of 782 can be unambiguously assigned to our laboratory: contamination of European pond turtle Emys orbicularis by ascidian Ciona intestinalis A and of seahorses Hippocampus hippocampus and H. guttulatus by each other.

Within-species contamination

The above analyses suggest that there was substantial contamination in this project, and that it primarily involved samples that were shipped together. This is worrisome because samples from distinct individuals of the same species, between which contamination is most problematic and difficult to detect, were typically sent together. To quantify the amount of within-species contamination, we examined how often alleles segregating in the sample appeared as the minor state (“errors”) at homozygous genotypes. Focusing first on homo-quartets (i.e., positions at which the read count for the major state was >40 and the read count for the minor state equaled 1) that occurred at monoallelic positions, we determined P, the error matrix in the absence of contamination. This was done separately for each of the 39 species of the sample in which at least four individuals were sequenced. Note that in this study we did not use strand information, so we could not distinguish between X → Y and X* → Y* errors, where X* is the complement of base X.

Error matrices revealed two main features. First, the A → C or T → G errors were often more frequent than the other three transversion-type errors, namely A → T or T → A, C → G or G → C, and C → A or G → T. The ratio of A → C or T → G to other transversion-type errors varied between 0.29 and 0.79 among species (correcting for base composition), when a ratio of 0.67 would be expected under random error. This is consistent with documented error biases of the Illumina technology [42, 43]. Second, transition-type errors, C → T or G → A and T → C or A → G, were typically more numerous than expected. The ratio of transition-type to transversion-type errors varied from 0.47 to 1.14 among species (correcting for base composition, median = 0.79), when the expected ratio would be 0.5 under random error, and <0.5 according to [43]. Knowing that DNA polymerases typically generate more transition-type than transversion-type errors, this result suggests that a fraction of the sequencing errors affecting our data was introduced prior to sequencing, presumably at the PCR step during library construction.
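The error classes discussed above can be made concrete with a short sketch. It is only an illustration, not the authors' pipeline: the function names are hypothetical, and the strand-collapsing convention (reporting every class with a reference base of A or C) is an assumption chosen to match the pairs listed in the text, such as "A → C or T → G". It also reproduces the 0.5 transition-to-transversion ratio expected under random error.

```python
from itertools import permutations

PURINES = {"A", "G"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base change as 'transition' or 'transversion'."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"

def unstranded_class(ref: str, alt: str) -> tuple:
    """Collapse X->Y with X*->Y* (assumed convention: ref reported as A or C)."""
    if ref in "AC":
        return (ref, alt)
    return (COMPLEMENT[ref], COMPLEMENT[alt])

# All 12 ordered substitutions between distinct bases.
all_subs = list(permutations("ACGT", 2))
n_ts = sum(substitution_type(r, a) == "transition" for r, a in all_subs)
n_tv = len(all_subs) - n_ts
random_ts_tv_ratio = n_ts / n_tv  # 4 transitions over 8 transversions

# Without strand information only 6 classes remain distinguishable.
classes = {unstranded_class(r, a) for r, a in all_subs}
```

Under this convention T → G and A → C fall in the same class, as do G → T and C → A, which is why the text always quotes such errors in pairs.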

We then considered homo-quartets occurring at biallelic positions, where two alleles segregate at substantial frequency. Here, we only considered the 33 species in which at least 50 such homo-quartets were found. We asked whether the minor state at such homo-quartets tended to correspond with the other segregating allele more often than expected based on P. We found that the relative prevalence of the other segregating allele was above its expected value in all 33 species. The index of allele leakage, λ, varied from 0.19 to 8.5, when λ = 0 would be expected in the absence of contamination. This analysis therefore indicates that within-species contamination is widespread in our dataset and probably affects all the sequenced species.

We investigated the influence of laboratory metadata, and particularly the date of shipment to sequencing centers, on the prevalence of within-species contamination. To this end, we focused on the 12 species of our data set in which not all samples were shipped the same day – that is, most often at two distinct dates, and up to four dates in the blue tit Parus caeruleus. In these species, we measured λ', the index of allele leakage between samples sent on different dates. This was achieved by only considering homo-quartets occurring at positions that were biallelic across the whole sample of individuals, but monoallelic in the subsample of individuals shipped the same day as the focal individual (Additional file 7: Figure S3). This analysis could not be performed species by species due to the small number of relevant homo-quartets per species. We therefore pooled homo-quartets across the 12 species, still accounting for species-specific error matrices P, and obtained an index of allele leakage between samples sent on different dates of λ' = 0.59. This is about half the index calculated as above, that is, irrespective of shipment date, which for these 12 pooled species was λ = 1.21, demonstrating an effect of same_shipment on the prevalence of within-species contamination.

Contamination-aware SNP calling

To assess the robustness of our published results to the problem of within-species contamination, we re-called SNPs and genotypes using a modified method accounting for allele leakage between individuals. Compared to our original SNP-calling method, a parameter γ was added, which represents the probability that a read originates from another individual of the sample. Three arbitrary values of γ were used: 0.05, 0.1, and 0.2. Contamination-aware SNP calling was applied to the 39 species of our sample in which at least four individuals were available. Classical population genomic statistics were calculated from this data set using the same pipeline as in [18]. To save computational time, SNP calling was applied to reduced data sets consisting of exactly one million positions per species, instead of the 1.8–27 million positions in full data sets.

We found that the number of called SNPs and the estimate of πS, the genetic diversity at synonymous positions, decreased with increasing γ (Fig. 4a). This was expected: contamination spuriously increases heterozygosity by moving alleles around. The relative bias was substantial – the median ratio of corrected to uncorrected πS was 0.90 when γ was 0.1, and 0.81 when γ was 0.2. The relative bias, however, was fairly constant across species, and much smaller than the between-species differences in πS, suggesting that our published comparative analyses of πS across species [17, 19, 21, 22] are robust to within-species contamination. We checked that the correlations reported by Romiguier et al. [21] between πS and species life history traits were still valid after controlling for contamination. We found that the correlation coefficient between log-transformed πS and log-transformed longevity was very similar in all four analyses, that is, between −0.517 and −0.524, the most negative coefficient being obtained when γ = 0.1. Similarly, the relationship between log-transformed πS and log-transformed propagule size [21] was very robust to changes in γ (correlation coefficient between −0.772 and −0.758, minimal value when γ = 0).

Fig. 4 Robustness of population genomic estimates to contamination-aware single-nucleotide polymorphism (SNP) calling. a Synonymous diversity πS. b Ratio of non-synonymous to synonymous diversity, πNS. c FIT. d Tajima’s D, synonymous SNPs only. Each dot represents a species. x-axis: estimates obtained assuming no contamination. y-axis: estimates obtained from contamination-aware SNP calling. Black dots: γ = 0.05; blue dots: γ = 0.1; red dots: γ = 0.2.

The ratio of non-synonymous to synonymous diversity, πNS, was only slightly modified when we controlled for contamination (Fig. 4b), the median relative bias being close to 0.96 for all three positive values of γ. The synonymous (Fig. 4d) and non-synonymous Tajima’s D, a statistic measuring the departure of the distribution of minor allele frequency from the standard coalescent, were also only moderately affected. These two results suggest that published inferences based on πNS and site-frequency spectra [18, 27] are presumably robust enough to within-species contamination.

The FIT statistic measures the excess of individual homozygosity compared to Hardy-Weinberg expectations. A positive FIT is expected in cases of inbreeding and/or population substructure. Figure 4c shows that our FIT estimate is particularly sensitive to contamination issues. Controlling for contamination resulted in a substantial increase in FIT in all the analyzed species, reflecting the fact that within-species contamination tends to increase individual heterozygosity. In our uncorrected analysis (γ = 0), a negative estimate of the genome-average FIT was obtained in nine species [21]. This is an unexpected result, given that processes leading to heterozygote excess, such as balancing selection, are presumably limited to a small fraction of the genome [44]. In our contamination-aware analyses, a negative FIT was obtained in just four, two, and one species when γ was set to 0.05, 0.1, and 0.2, respectively, suggesting that within-species contamination might explain, at least partly, our previously unexpected report of negative estimates of FIT [21]. Harvest ant Messor barbarus was not included in this analysis because the genome-average FIT is very negative in this species as a consequence of its peculiar mating system, such that worker individuals are highly heterozygous [45].
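As a rough illustration of why spurious heterozygosity pushes FIT downward, the sketch below uses the textbook form F = 1 − H_obs / H_exp. This is an assumption: the estimator actually used in the study is not given in this section, and the function name and the numbers are purely illustrative.

```python
def f_it(h_obs: float, h_exp: float) -> float:
    """Textbook fixation index: 1 - observed / expected heterozygosity."""
    if h_exp <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - h_obs / h_exp

# Illustrative values only (not taken from the study).
clean = f_it(h_obs=0.015, h_exp=0.020)         # deficit of heterozygotes, F > 0
contaminated = f_it(h_obs=0.022, h_exp=0.020)  # inflated heterozygosity, F < 0
```

In this toy case, inflating the observed individual heterozygosity is enough to turn a positive index negative, which is the direction of bias described in the paragraph above.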

We have not commented on FIT estimates in our published analyses, with the exception of [19], in which the lack of detectable population substructure (i.e., low FIT) in the giant Galapagos tortoise Chelonoidis nigra provided evidence against the definition of as many as 12 species in this taxon [46]. This result was here corroborated: C. nigra is one of the two species still showing a slightly negative FIT estimate after correction for contamination. We have, however, published a couple of analyses assessing the prevalence of hybridization and gene flow between diverged species or populations [20, 28, 30]. These results should be confirmed by reproducing the analyses using contamination-corrected data.

We compared for each species the likelihoods of the four considered values of γ. The maximally likely γ, which we called γ*, was 0 in ten species, 0.05 in 15 species, 0.1 in five species, and 0.2 in nine species. We detected a strong effect of species diversity on γ* : the median πS was 0.034 among species for which γ* was 0, but 0.003 among species for which γ* was 0.2. This was unexpected and probably reflects the existence of factors that confound contamination detection (see section 3 of the Discussion "Modeling contamination").

Finally, we reproduced the analyses of Romiguier et al. (2014) [21], accounting for the dubious GA36K and GA08F samples. The published relationships between genetic diversity and species life history traits were robust to the exclusion of M. trossulus and M. cinxia: the correlation coefficient between πS and propagule size was almost unchanged compared to the uncorrected analysis (0.766 vs. 0.771), whereas the correlation coefficient between πS and longevity was slightly increased (0.594 vs. 0.569), as was the case for correlations between the πNS ratio and life history traits. We recalculated population genomics statistics in M. cinxia after excluding individual GA08F, that is, based on just nine individuals instead of ten. Excluding GA08F resulted in a substantial decrease in genome-average πS (0.025 vs. 0.034), πN (0.0027 vs. 0.0032), and FIT (0.38 vs. 0.52). Correlation coefficients with life history traits, however, were hardly affected by this correction.


Author information

Affiliations

Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA

Andrew V. Anzalone, Peyton B. Randolph, Jessie R. Davis, Alexander A. Sousa, Luke W. Koblan, Jonathan M. Levy, Peter J. Chen, Christopher Wilson, Gregory A. Newby, Aditya Raguram & David R. Liu

Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA

Andrew V. Anzalone, Peyton B. Randolph, Jessie R. Davis, Alexander A. Sousa, Luke W. Koblan, Jonathan M. Levy, Peter J. Chen, Christopher Wilson, Gregory A. Newby, Aditya Raguram & David R. Liu

Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA

Andrew V. Anzalone, Peyton B. Randolph, Jessie R. Davis, Alexander A. Sousa, Luke W. Koblan, Jonathan M. Levy, Peter J. Chen, Christopher Wilson, Gregory A. Newby, Aditya Raguram & David R. Liu

Contributions

A.V.A. designed the research, performed experiments, analysed data, and wrote the manuscript. P.B.R., J.R.D., A.A.S., and G.A.N. performed human cell experiments and analysed data. L.W.K. and J.M.L. performed neuron experiments. P.J.C. and C.W. performed and analysed RNA-seq experiments. A.R. analysed ClinVar data. D.R.L. designed and supervised the research and wrote the manuscript.


We continue our study of the properties of trinucleotide circular codes [1–5], trinucleotide comma-free codes [1, 6], strong trinucleotide circular codes [7], and the common trinucleotide circular code $X_0$ identified in genes [8] (see also the recent statistical analysis in [9]), which could be a translation code [10]. A trinucleotide is a word of three letters (triletter) on the genetic alphabet $\{A, C, G, T\}$. The set of 64 trinucleotides is a code (called the genetic code), more precisely a uniform code but not a circular code (see Remark 2). Over the past 50 years, codes, comma-free codes, and circular codes have been studied as mathematical objects in theoretical biology, mainly to understand the structure and origin of the genetic code as well as the reading frame (construction) of genes, for example, [11–13]. To give an intuitive meaning to these notions, codes are written on a straight line while comma-free codes and circular codes are written on a circle, but in both cases unique decipherability is required. Circular codes form only some subsets of the 64-trinucleotide set, while comma-free codes are even more constrained subsets of circular codes [1].

Before the discovery of the genetic code, Crick et al. [11] proposed a maximum comma-free code of 20 trinucleotides for coding the 20 amino acids. This comma-free code turned out to be invalid (see, e.g., [14]). In 1996, a maximum circular code $X_0$ of 20 trinucleotides was identified statistically in a large gene population of eukaryotes and also in a large gene population of prokaryotes [8]: (1) $X_0 = \{AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC\}$. This code $X_0$ has remarkable mathematical properties, as it is a $C^3$ self-complementary maximum circular code (see below). Since 1996, its properties have been studied in detail by different authors, for example, [9, 15–21]. Transition and transversions I and II are classical molecular evolution processes, for example, [22]. Using an algorithm based on the necklace, we perform here a comprehensive computer analysis of these three evolution processes in the code $X_0$. Some new results are identified with the code $X_0$ by computer analysis, in particular: (i) transversion I on the 2nd position of any subset of trinucleotides of $X_0$ generates trinucleotide circular codes which are always $C^3$, and (ii) transversion II on the three positions of any subset of trinucleotides of $X_0$ yields no trinucleotide circular codes.

The classical notions of language theory and codes can be found in [23, 24]. Let $\mathcal{A}_4 = \{A, C, G, T\}$ denote the genetic alphabet, lexicographically ordered by $A < C < G < T$. The set of words (nonempty words, resp.) on $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^*$ ($\mathcal{A}_4^+$, resp.). The set of the 16 words of length 2 (dinucleotides or diletters) on $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^2 = \{AA, AC, \ldots, TT\}$. The set of the 64 words of length 3 (trinucleotides or triletters) on $\mathcal{A}_4$ is denoted by $\mathcal{A}_4^3 = \{AAA, AAC, \ldots, TTT\}$.

A subset $X \subset \mathcal{A}_4^+$ is a code on $\mathcal{A}_4$ if, for each $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$, $n, m \geq 1$, the condition $x_1 \cdots x_n = x'_1 \cdots x'_m$ implies $n = m$ and $x_i = x'_i$ for $i = 1, \ldots, n$.

Any nonempty subset of $\mathcal{A}_4^3$ is a code, called here a trinucleotide code.

A trinucleotide code $X \subset \mathcal{A}_4^3$ is circular if, for each $x_1, \ldots, x_n, x'_1, \ldots, x'_m \in X$, $n, m \geq 1$, $p \in \mathcal{A}_4^*$, $s \in \mathcal{A}_4^+$, the conditions $s x_2 \cdots x_n p = x'_1 \cdots x'_m$ and $x_1 = ps$ imply $n = m$, $p = \varepsilon$ (empty word), and $x_i = x'_i$ for $i = 1, \ldots, n$.

A trinucleotide circular code is noted $C$.

$\mathcal{A}_4^3$ is not a trinucleotide circular code.

Let $l_1, l_2, \ldots, l_{n-1}, l_n$ be letters in $\mathcal{A}_4$, $d_1, d_2, \ldots, d_{n-1}, d_n$ diletters in $\mathcal{A}_4^2$, and $n$ an integer satisfying $n \geq 2$.

We say that the ordered sequence $l_1, d_1, l_2, d_2, \ldots, d_{n-1}, l_n, d_n, l_{n+1}$ is an $(n+1)$LDCN (Letter Diletter Continued Necklace) for a subset $X \subset \mathcal{A}_4^3$ if (2) $l_1 d_1, l_2 d_2, \ldots, l_n d_n \in X$ and $d_1 l_2, d_2 l_3, \ldots, d_{n-1} l_n, d_n l_{n+1} \in X$.

Only a few trinucleotide codes are circular. Two propositions based on the necklace concept make it possible to determine whether a trinucleotide code is circular [2, 18].

$X$ is a trinucleotide circular code

We say that the ordered sequence $l_1, d_1, l_2, d_2, \ldots, d_{n-1}, l_n, d_n, l_{n+1}$ is an $(n+1)$LDCCN (Letter Diletter Continued Closed Necklace) for a subset $X \subset \mathcal{A}_4^3$ if (3) $l_1 d_1, l_2 d_2, \ldots, l_n d_n \in X$ and $d_1 l_2, d_2 l_3, \ldots, d_{n-1} l_n, d_n l_1 \in X$.

$X$ is a trinucleotide circular code
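The closed-necklace condition (3) can be read as a cycle that alternates between letters and diletters: $l\,d \in X$ gives a step from letter $l$ to diletter $d$, and $d\,l' \in X$ gives a step from $d$ to letter $l'$. The sketch below tests circularity by checking that this letter/diletter graph has no cycle. This is an assumption on our part: it relies on a graph-acyclicity criterion known from the circular-code literature, not on the algorithm of the source, which is not detailed. The names `is_circular`, `X0`, and `ALL_TRINUCLEOTIDES` are illustrative.

```python
from itertools import product

# The code X0 of Eq. (1).
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}
ALL_TRINUCLEOTIDES = {"".join(p) for p in product("ACGT", repeat=3)}

def is_circular(code) -> bool:
    """Return True if the letter/diletter graph of `code` has no cycle."""
    graph = {}
    for w in code:
        graph.setdefault(w[0], set()).add(w[1:])  # letter -> diletter (l d in X)
        graph.setdefault(w[:2], set()).add(w[2])  # diletter -> letter (d l in X)

    WHITE, GREY, BLACK = 0, 1, 2
    state = {}

    def has_cycle(node) -> bool:
        # Depth-first search; reaching a GREY node means a closed necklace.
        state[node] = GREY
        for nxt in graph.get(node, ()):
            s = state.get(nxt, WHITE)
            if s == GREY or (s == WHITE and has_cycle(nxt)):
                return True
        state[node] = BLACK
        return False

    return not any(
        state.get(n, WHITE) == WHITE and has_cycle(n) for n in list(graph)
    )
```

With this test, `X0` passes, whereas the full set of 64 trinucleotides fails immediately (for instance `AAA` yields the cycle A → AA → A), in line with the remark that $\mathcal{A}_4^3$ is not a circular code.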

A trinucleotide circular code $X \subset \mathcal{A}_4^3$ is maximal if, for each $x \in \mathcal{A}_4^3$, $x \notin X$, $X \cup \{x\}$ is not a trinucleotide circular code.

A trinucleotide circular code containing exactly $k$ elements is called a $k$-trinucleotide circular code.

A 20-trinucleotide circular code is maximum, as no trinucleotide circular code can contain more than 20 words.

A maximum trinucleotide circular code is noted $MC$.

A 20-trinucleotide circular code is both maximal and maximum.

We recall two classical genetic maps: complementary and circular permutation.

The complementary genetic map $\mathcal{C} : \mathcal{A}_4^+ \to \mathcal{A}_4^+$ is defined by (4) $\mathcal{C}(A) = T$, $\mathcal{C}(C) = G$, $\mathcal{C}(G) = C$, $\mathcal{C}(T) = A$ and for all $u, v \in \mathcal{A}_4^+$ by (5) $\mathcal{C}(uv) = \mathcal{C}(v)\,\mathcal{C}(u)$.

For example, $\mathcal{C}(ACG) = CGT$. This map $\mathcal{C}$ is associated with the property of the complementary and antiparallel double helix (one DNA strand chemically oriented in a $5'$-$3'$ direction and the other DNA strand in the opposite $3'$-$5'$ direction).

The complementary map $\mathcal{C}$ on a trinucleotide $x$ is naturally extended to a trinucleotide code $X$ as follows: (6) $\mathcal{C}(X) = \{y \mid x, y \in \mathcal{A}_4^3,\ x \in X,\ y = \mathcal{C}(x)\}$.

The circular permutation genetic map $\mathcal{P} : \mathcal{A}_4^3 \to \mathcal{A}_4^3$ permutes circularly a trinucleotide $l_1 l_2 l_3$, $l_1, l_2, l_3 \in \mathcal{A}_4$, as follows: (7) $\mathcal{P}(l_1 l_2 l_3) = l_2 l_3 l_1$.

The circular permutation map $\mathcal{P}$ on a trinucleotide $x$ is naturally extended to a trinucleotide code $X$ as follows: (8) $\mathcal{P}(X) = \{y \mid x, y \in \mathcal{A}_4^3,\ x \in X,\ y = \mathcal{P}(x)\}$.

The $k$th iterate of $\mathcal{P}$ is denoted by $\mathcal{P}^k$.

The trinucleotide codes $\mathcal{P}(X)$ and $\mathcal{P}^2(X)$ are the conjugated classes of the trinucleotide code $X$.

A trinucleotide circular code $X$ is self-complementary if, for each $x \in X$, $\mathcal{C}(x) \in X$.

A self-complementary trinucleotide circular code is noted $SC$.

A $k$-trinucleotide circular code for $k$ odd cannot be self-complementary.

A trinucleotide circular code $X$ is $C^3$ if $X$, $\mathcal{P}(X)$, and $\mathcal{P}^2(X)$ are trinucleotide circular codes.

A $C^3$ trinucleotide circular code $X$ is noted $C^3$.

A trinucleotide circular code $X$ is $C^3$ self-complementary maximum if $X$ is maximum, $X = \mathcal{C}(X)$ (self-complementary), and $\mathcal{P}(X)$ and $\mathcal{P}^2(X)$ are trinucleotide circular codes satisfying $\mathcal{C}(\mathcal{P}(X)) = \mathcal{P}^2(X)$.

A $C^3$ self-complementary maximum circular code is noted $MSC^3$.

The set $X_0$ of 20 trinucleotides identified in the gene populations of both eukaryotes and prokaryotes is a $C^3$ self-complementary maximum circular code $MSC^3$ [8]; that is, $X_0$ is maximum, $X_0 = \mathcal{C}(X_0)$, $\mathcal{P}(X_0) = X_1$ and $\mathcal{P}^2(X_0) = X_2$ are trinucleotide circular codes, and $\mathcal{C}(X_1) = X_2$.
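The two genetic maps and the set-level properties stated for $X_0$ can be checked directly. The sketch below implements Eqs. (4) to (8) in a straightforward way; the function names are illustrative, and the circularity of $X_1$ and $X_2$ is not tested here, only the complementarity relations.

```python
# The code X0 of Eq. (1).
X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

_COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def complement(word: str) -> str:
    """Map C of Eqs. (4)-(5): complement each letter and reverse the word."""
    return word.translate(_COMPLEMENT_TABLE)[::-1]

def permute(tri: str) -> str:
    """Map P of Eq. (7): l1 l2 l3 -> l2 l3 l1."""
    return tri[1:] + tri[0]

def complement_code(code) -> set:
    """Eq. (6): apply C to every trinucleotide of the code."""
    return {complement(x) for x in code}

def permute_code(code, k: int = 1) -> set:
    """Eq. (8) iterated k times: apply P^k to every trinucleotide."""
    result = set(code)
    for _ in range(k):
        result = {permute(x) for x in result}
    return result

X1 = permute_code(X0)       # P(X0)
X2 = permute_code(X0, k=2)  # P^2(X0)
```

Here `complement("ACG")` gives `"CGT"` as in the example of the text, `complement_code(X0) == X0` expresses self-complementarity, and `complement_code(X1) == X2` is the relation $\mathcal{C}(X_1) = X_2$.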

We recall three classical evolution genetic maps: transition and transversions I and II, for example, [22], and extend their definitions to the positions of a trinucleotide.

The transition evolution genetic map $\mathcal{T} : \mathcal{A}_4^+ \to \mathcal{A}_4^+$ is defined by (9) $\mathcal{T}(A) = G$, $\mathcal{T}(C) = T$, $\mathcal{T}(G) = A$, $\mathcal{T}(T) = C$.

The transition map $\mathcal{T}$ on a letter $l$ can be applied in different positions of a trinucleotide $x = l_1 l_2 l_3$: $\mathcal{T}_i$, $i \in \{1, 2, 3\}$, is the transition on the position $i$ of $x$; $\mathcal{T}_{i,j}$, $i, j \in \{1, 2, 3\}$ with $i < j$, is the transition on the two positions $i$ and $j$ of $x$; and $\mathcal{T}_{1,2,3}$ is the transition on the three positions of $x$.

$\mathcal{T}_1(ACG) = GCG$, $\mathcal{T}_2(ACG) = ATG$, $\mathcal{T}_3(ACG) = ACA$, $\mathcal{T}_{1,2}(ACG) = GTG$, $\mathcal{T}_{1,3}(ACG) = GCA$, $\mathcal{T}_{2,3}(ACG) = ATA$, and $\mathcal{T}_{1,2,3}(ACG) = GTA$.

The transition maps $\mathcal{T}_i$, $\mathcal{T}_{i,j}$, $\mathcal{T}_{1,2,3}$ on a trinucleotide $x$ are also extended to a trinucleotide code $X$, in a similar way to the genetic maps $\mathcal{C}$ and $\mathcal{P}$.

The transversion I evolution genetic map $\mathcal{V}_{\mathrm{I}} : \mathcal{A}_4^+ \to \mathcal{A}_4^+$ is defined by (10) $\mathcal{V}_{\mathrm{I}}(A) = T$, $\mathcal{V}_{\mathrm{I}}(C) = G$, $\mathcal{V}_{\mathrm{I}}(G) = C$, $\mathcal{V}_{\mathrm{I}}(T) = A$.

The transversion I map $\mathcal{V}_{\mathrm{I}}$ on a letter $l$ can also be applied in different positions of a trinucleotide $x = l_1 l_2 l_3$: $\mathcal{V}_{\mathrm{I}_i}$, $i \in \{1, 2, 3\}$, is the transversion I on the position $i$ of $x$; $\mathcal{V}_{\mathrm{I}_{i,j}}$, $i, j \in \{1, 2, 3\}$ with $i < j$, is the transversion I on the two positions $i$ and $j$ of $x$; and $\mathcal{V}_{\mathrm{I}_{1,2,3}}$ is the transversion I on the three positions of $x$.

$\mathcal{V}_{\mathrm{I}_1}(ACG) = TCG$, $\mathcal{V}_{\mathrm{I}_2}(ACG) = AGG$, $\mathcal{V}_{\mathrm{I}_3}(ACG) = ACC$, $\mathcal{V}_{\mathrm{I}_{1,2}}(ACG) = TGG$, $\mathcal{V}_{\mathrm{I}_{1,3}}(ACG) = TCC$, $\mathcal{V}_{\mathrm{I}_{2,3}}(ACG) = AGC$, and $\mathcal{V}_{\mathrm{I}_{1,2,3}}(ACG) = TGC$.

The transversion I maps $\mathcal{V}_{\mathrm{I}_i}$, $\mathcal{V}_{\mathrm{I}_{i,j}}$, $\mathcal{V}_{\mathrm{I}_{1,2,3}}$ on a trinucleotide $x$ are also extended to a trinucleotide code $X$, in a similar way to the genetic maps $\mathcal{C}$ and $\mathcal{P}$.

The transversion II evolution genetic map $\mathcal{V}_{\mathrm{II}} : \mathcal{A}_4^+ \to \mathcal{A}_4^+$ is defined by (11) $\mathcal{V}_{\mathrm{II}}(A) = C$, $\mathcal{V}_{\mathrm{II}}(C) = A$, $\mathcal{V}_{\mathrm{II}}(G) = T$, $\mathcal{V}_{\mathrm{II}}(T) = G$.

The transversion II map $\mathcal{V}_{\mathrm{II}}$ on a letter $l$ can also be applied in different positions of a trinucleotide $x = l_1 l_2 l_3$: $\mathcal{V}_{\mathrm{II}_i}$, $i \in \{1, 2, 3\}$, is the transversion II on the position $i$ of $x$; $\mathcal{V}_{\mathrm{II}_{i,j}}$, $i, j \in \{1, 2, 3\}$ with $i < j$, is the transversion II on the two positions $i$ and $j$ of $x$; and $\mathcal{V}_{\mathrm{II}_{1,2,3}}$ is the transversion II on the three positions of $x$.

$\mathcal{V}_{\mathrm{II}_1}(ACG) = CCG$, $\mathcal{V}_{\mathrm{II}_2}(ACG) = AAG$, $\mathcal{V}_{\mathrm{II}_3}(ACG) = ACT$, $\mathcal{V}_{\mathrm{II}_{1,2}}(ACG) = CAG$, $\mathcal{V}_{\mathrm{II}_{1,3}}(ACG) = CCT$, $\mathcal{V}_{\mathrm{II}_{2,3}}(ACG) = AAT$, and $\mathcal{V}_{\mathrm{II}_{1,2,3}}(ACG) = CAT$.

The transversion II maps $\mathcal{V}_{\mathrm{II}_i}$, $\mathcal{V}_{\mathrm{II}_{i,j}}$, $\mathcal{V}_{\mathrm{II}_{1,2,3}}$ on a trinucleotide $x$ are also extended to a trinucleotide code $X$, in a similar way to the genetic maps $\mathcal{C}$ and $\mathcal{P}$.
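The three letter maps of Eqs. (9) to (11) and their position-wise versions can be expressed with a single helper. The sketch below is illustrative (the names `apply_map`, `TRANSITION`, and so on are not from the source) and reproduces the worked examples on $ACG$ given in the text.

```python
# Letter-level maps of Eqs. (9), (10), and (11).
TRANSITION = {"A": "G", "C": "T", "G": "A", "T": "C"}
TRANSVERSION_I = {"A": "T", "C": "G", "G": "C", "T": "A"}
TRANSVERSION_II = {"A": "C", "C": "A", "G": "T", "T": "G"}

def apply_map(letter_map: dict, tri: str, positions) -> str:
    """Apply `letter_map` at the given 1-based positions of trinucleotide `tri`."""
    wanted = set(positions)
    return "".join(
        letter_map[base] if i in wanted else base
        for i, base in enumerate(tri, start=1)
    )

def apply_map_to_code(letter_map: dict, code, positions) -> set:
    """Extend a position-wise map to every trinucleotide of a code."""
    return {apply_map(letter_map, x, positions) for x in code}

# A few of the examples from the text.
t_1 = apply_map(TRANSITION, "ACG", [1])               # T_1(ACG)
v1_23 = apply_map(TRANSVERSION_I, "ACG", [2, 3])      # V_I on positions 2,3
v2_123 = apply_map(TRANSVERSION_II, "ACG", [1, 2, 3]) # V_II on positions 1,2,3
```

Passing the positions as a list keeps one function for the single-position, two-position, and three-position variants.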

The evolution genetic maps in $l$ trinucleotides of a trinucleotide circular code are defined by $\mathcal{T}(l)$ for transition, $\mathcal{V}_{\mathrm{I}}(l)$ for transversion I, and $\mathcal{V}_{\mathrm{II}}(l)$ for transversion II.

An evolution genetic map, that is, $\mathcal{T}(l)$, $\mathcal{V}_{\mathrm{I}}(l)$, or $\mathcal{V}_{\mathrm{II}}(l)$, in $l$ trinucleotides of the common trinucleotide circular code $X_0$ leads to $S(l) = \binom{20}{l}$ trinucleotide codes which are potentially circular. Table 1 gives these numbers $S(l)$.

Number $S(l) = \binom{20}{l}$ of trinucleotide codes after an evolution genetic map (transition $\mathcal{T}(l)$, transversion I $\mathcal{V}_{\mathrm{I}}(l)$, and transversion II $\mathcal{V}_{\mathrm{II}}(l)$) in $l$ trinucleotides of the common trinucleotide circular code $X_0$.
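Since the values of Table 1 are not reproduced here, the numbers $S(l) = \binom{20}{l}$ can be regenerated with a few lines. The function name is illustrative.

```python
from math import comb

def n_candidate_codes(l: int, n: int = 20) -> int:
    """S(l): number of ways to choose the l trinucleotides that are mutated."""
    return comb(n, l)

# One row per value of l, as in Table 1.
table_1 = {l: n_candidate_codes(l) for l in range(1, 21)}
for l, s in table_1.items():
    print(f"l = {l:2d}   S(l) = {s}")
```

The search space is largest at $l = 10$ and, summed over $l = 1, \ldots, 20$, contains $2^{20} - 1$ candidate codes per map and per choice of positions, which is small enough for exhaustive computer analysis.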

Using Proposition 6, which allows testing whether a trinucleotide code is circular (algorithm not detailed; see, e.g., [2]), computer analyses of a large number of trinucleotide codes identify here new properties of the common trinucleotide circular code $X_0$ observed in genes under evolution by transition and transversion.

3.1. Transition Map

3.1.1. Transition Map $\mathcal{T}_i$

Result 1 (Table 2).

For $l = 1, \ldots, 20$, (12) $c(\mathcal{T}_1(l)) = c(\mathcal{T}_3(l))$, $\mathrm{mc}(\mathcal{T}_1(l)) = \mathrm{mc}(\mathcal{T}_3(l))$, $\mathrm{sc}(\mathcal{T}_1(l)) = \mathrm{sc}(\mathcal{T}_3(l))$, $c^3(\mathcal{T}_1(l)) = c^3(\mathcal{T}_3(l))$, $\mathrm{msc}^3(\mathcal{T}_1(l)) = \mathrm{msc}^3(\mathcal{T}_3(l))$. As expected, the lists of trinucleotide circular codes $C$ associated with $c(\mathcal{T}_1(l))$ and $c(\mathcal{T}_3(l))$ are different for $l = 1, \ldots, 13$ (not shown). No trinucleotide code is circular after a certain number of transitions $\mathcal{T}_i$ in the trinucleotides of the common trinucleotide circular code $X_0$. Precisely, for $l = 14, \ldots, 20$, (13) $c(\mathcal{T}_1(l)) = c(\mathcal{T}_3(l)) = 0$, and for $l = 10, \ldots, 20$, (14) $c(\mathcal{T}_2(l)) = 0$. The transition $\mathcal{T}_i$ generates a maximum number of trinucleotide circular codes $C$ for (15) $\max\{c(\mathcal{T}_i(l)),\ i = 1, 2, 3,\ l = 1, \ldots, 20\} = c(\mathcal{T}_1(7)) = c(\mathcal{T}_3(7)) = 1436$ and a maximum number of $C^3$ self-complementary maximum circular codes $MSC^3$ for (16) $\max\{\mathrm{msc}^3(\mathcal{T}_i(l)),\ i = 1, 2, 3,\ l = 1, \ldots, 20\} = \mathrm{msc}^3(\mathcal{T}_1(6)) = \mathrm{msc}^3(\mathcal{T}_3(6)) = 20$.
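The counting behind Result 1 can be sketched as an exhaustive enumeration. Two assumptions are made here and are not confirmed by the source: the mutated code is taken to be $(X_0 \setminus S) \cup \mathcal{T}_i(S)$ for each subset $S$ of $l$ trinucleotides, and circularity is tested by acyclicity of the letter/diletter graph rather than by the authors' necklace algorithm. All names are illustrative, and counts obtained this way should be compared against Table 2 before being relied on.

```python
from itertools import combinations

X0 = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}
TRANSITION = {"A": "G", "C": "T", "G": "A", "T": "C"}

def is_circular(code) -> bool:
    """Assumed criterion: the letter/diletter graph of the code is acyclic."""
    edges = set()
    for w in code:
        edges.add((w[0], w[1:]))  # letter -> diletter
        edges.add((w[:2], w[2]))  # diletter -> letter
    while edges:
        sources = {a for a, _ in edges}
        kept = {(a, b) for a, b in edges if b in sources}
        if kept == edges:  # no edge ends in a sink any more: a cycle remains
            return False
        edges = kept       # drop edges that lead into sinks and repeat
    return True

def mutate(letter_map, tri, positions):
    """Apply `letter_map` at the given 1-based positions of `tri`."""
    return "".join(
        letter_map[b] if i in positions else b for i, b in enumerate(tri, 1)
    )

def count_circular(letter_map, positions, l, code=X0) -> int:
    """Number of circular codes obtained by mutating l trinucleotides of `code`."""
    positions = set(positions)
    count = 0
    for subset in combinations(sorted(code), l):
        mutated = {mutate(letter_map, x, positions) for x in subset}
        count += is_circular((set(code) - set(subset)) | mutated)
    return count
```

For small $l$ this runs instantly, and for example `count_circular(TRANSITION, [1], l)` and `count_circular(TRANSITION, [3], l)` agree, consistent with Eq. (12). Larger values such as $l = 7$ (77520 subsets) remain cheap and can be compared with the 1436 reported in Eq. (15), bearing in mind the assumptions above.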

Transition map $\mathcal{T}_i(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{T}_i(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{T}_i(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{T}_i(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{T}_i(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{T}_i(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

For $l = 1, \ldots, 20$, (17) $c(\mathcal{T}_{1,2}(l)) = c(\mathcal{T}_{2,3}(l))$, $\mathrm{mc}(\mathcal{T}_{1,2}(l)) = \mathrm{mc}(\mathcal{T}_{2,3}(l))$, $\mathrm{sc}(\mathcal{T}_{1,2}(l)) = \mathrm{sc}(\mathcal{T}_{2,3}(l))$, $c^3(\mathcal{T}_{1,2}(l)) = c^3(\mathcal{T}_{2,3}(l))$, $\mathrm{msc}^3(\mathcal{T}_{1,2}(l)) = \mathrm{msc}^3(\mathcal{T}_{2,3}(l))$. The lists of trinucleotide circular codes $C$ associated with $c(\mathcal{T}_{1,2}(l))$ and $c(\mathcal{T}_{2,3}(l))$ are different for $l = 1, \ldots, 14$ (not shown). No trinucleotide code is circular after a certain number of transitions $\mathcal{T}_{i,j}$ in the trinucleotides of the common trinucleotide circular code $X_0$. Precisely, for $l = 15, \ldots, 20$, (18) $c(\mathcal{T}_{1,2}(l)) = c(\mathcal{T}_{2,3}(l)) = 0$, and for $l = 12, \ldots, 20$, (19) $c(\mathcal{T}_{1,3}(l)) = 0$. The transition $\mathcal{T}_{i,j}$ generates a maximum number of trinucleotide circular codes $C$ for (20) $\max\{c(\mathcal{T}_{i,j}(l)),\ i, j = 1, 2, 3,\ i < j,\ l = 1, \ldots, 20\} = c(\mathcal{T}_{1,3}(6)) = 598$ and a maximum number of $C^3$ self-complementary maximum circular codes $MSC^3$ for (21) $\max\{\mathrm{msc}^3(\mathcal{T}_{i,j}(l)),\ i, j = 1, 2, 3,\ i < j,\ l = 1, \ldots, 20\} = \mathrm{msc}^3(\mathcal{T}_{1,3}(4)) = \mathrm{msc}^3(\mathcal{T}_{1,3}(6)) = 10$. The numbers $c^3(\mathcal{T}_{1,2}(l)) = c^3(\mathcal{T}_{2,3}(l))$ of circular codes $C^3$ have a particular growth function: (22) $c^3(\mathcal{T}_{1,2}(14)) = c^3(\mathcal{T}_{2,3}(14)) = 1$, $c^3(\mathcal{T}_{1,2}(l)) = c^3(\mathcal{T}_{2,3}(l)) = 0$ for $l = 10, \ldots, 13$.

Transition map $\mathcal{T}_{i,j}(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{T}_{i,j}(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{T}_{i,j}(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{T}_{i,j}(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{T}_{i,j}(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{T}_{i,j}(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

The transition $\mathcal{T}_{1,2,3}$ always generates trinucleotide circular codes. Indeed, for $l = 1, \ldots, 20$, (23) $c(\mathcal{T}_{1,2,3}(l)) > 0$. The lists of trinucleotide circular codes $C$ associated with $c(\mathcal{T}_{1,2,3}(l))$ and $c(\mathcal{T}_{1,2,3}(20 - l))$ are different for $l = 1, \ldots, 9$ (not shown). The transition $\mathcal{T}_{1,2,3}$ generates a maximum number of trinucleotide circular codes $C$ for (24) $\max\{c(\mathcal{T}_{1,2,3}(l)),\ l = 1, \ldots, 20\} = c(\mathcal{T}_{1,2,3}(8)) = c(\mathcal{T}_{1,2,3}(12)) = 72$ and a maximum number of $C^3$ self-complementary maximum circular codes $MSC^3$ for (25) $\max\{\mathrm{msc}^3(\mathcal{T}_{1,2,3}(l)),\ l = 1, \ldots, 20\} = \mathrm{msc}^3(\mathcal{T}_{1,2,3}(4)) = \mathrm{msc}^3(\mathcal{T}_{1,2,3}(16)) = 6$.

Transition map $\mathcal{T}_{1,2,3}(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{T}_{1,2,3}(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{T}_{1,2,3}(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{T}_{1,2,3}(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{T}_{1,2,3}(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{T}_{1,2,3}(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

For l = 1 , … , 20 (26) c ( 𝒱 I 1 ( l ) ) = c ( 𝒱 I 3 ( l ) ) , mc ( 𝒱 I 1 ( l ) ) = mc ( 𝒱 I 3 ( l ) ) , sc ( 𝒱 I 1 ( l ) ) = sc ( 𝒱 I 3 ( l ) ) , c 3 ( 𝒱 I 1 ( l ) ) = c 3 ( 𝒱 I 3 ( l ) ) , ms c 3 ( 𝒱 I 1 ( l ) ) = ms c 3 ( 𝒱 I 3 ( l ) ) . The lists of trinucleotide circular codes C associated with c ( 𝒱 I 1 ( l ) ) and c ( 𝒱 I 3 ( l ) ) are different for l = 1 , … , 9 (not shown). No trinucleotide code is circular after a certain number of transversions I 𝒱 I i in the trinucleotides of the common trinucleotide circular code X 0 . Precisely, for l = 10 , … , 20 (27) c ( 𝒱 I 1 ( l ) ) = c ( 𝒱 I 3 ( l ) ) = 0 and for l = 18,19,20 (28) c ( 𝒱 I 2 ( l ) ) = 0 . The transversion I 𝒱 I i generates a maximum number of trinucleotide circular codes C for (29) max < c ( 𝒱 I i ( l ) ) ,   i = 1,2 , 3 ,   l = 1 , … , 20 >    = c ( 𝒱 I 2 ( 9 ) ) = 24310 and a maximum number of C 3 self-complementary maximum circular codes MS C 3 for (30) max < ms c 3 ( 𝒱 I i ( l ) ) ,   i = 1,2 , 3 ,   l = 1 , … , 20 >    = ms c 3 ( 𝒱 I 2 ( 8 ) ) = 70 . A remarkable code property only found with transversion I 𝒱 I 2 is, for l = 1 , … , 20 , (31) c ( 𝒱 I 2 ( l ) ) = c 3 ( 𝒱 I 2 ( l ) ) , and furthermore, after a detailed computer analysis, the lists of trinucleotide circular codes C and C 3 associated with c ( 𝒱 I 2 ( l ) ) and c 3 ( 𝒱 I 2 ( l ) ) , respectively, are identical for l = 1 , … , 17 .

Transversion I map 𝒱 I i ( l ) in l trinucleotides of the common trinucleotide circular code X 0 . Number c ( 𝒱 I i ( l ) ) of circular codes C , number mc ( 𝒱 I i ( l ) ) of maximum circular codes MC , number sc ( 𝒱 I i ( l ) ) of self-complementary circular codes SC , number c 3 ( 𝒱 I i ( l ) ) of circular codes C 3 , and number ms c 3 ( 𝒱 I i ( l ) ) of C 3 self-complementary maximum circular codes MS C 3 .

For l = 1 , … , 20 (32) c ( 𝒱 I 1,2 ( l ) ) = c ( 𝒱 I 2,3 ( l ) ) , mc ( 𝒱 I 1,2 ( l ) ) = mc ( 𝒱 I 2,3 ( l ) ) , sc ( 𝒱 I 1,2 ( l ) ) = sc ( 𝒱 I 2,3 ( l ) ) , c 3 ( 𝒱 I 1,2 ( l ) ) = c 3 ( 𝒱 I 2,3 ( l ) ) , ms c 3 ( 𝒱 I 1,2 ( l ) ) = ms c 3 ( 𝒱 I 2,3 ( l ) ) . The lists of trinucleotide circular codes C associated with c ( 𝒱 I 1,2 ( l ) ) and c ( 𝒱 I 2,3 ( l ) ) are different for l = 1 , … , 12 (not shown). No trinucleotide code is circular after a certain number of transversions I 𝒱 I i , j in the trinucleotides of the common trinucleotide circular code X 0 . Precisely, for l = 13 , … , 20 (33) c ( 𝒱 I 1,2 ( l ) ) = c ( 𝒱 I 2,3 ( l ) ) = 0 and for l = 19,20 (34) c ( 𝒱 I 1,3 ( l ) ) = 0 . The transversion I 𝒱 I i , j generates a maximum number of trinucleotide circular codes C for (35) max < c ( 𝒱 I i , j ( l ) ) ,   i , j = 1,2 , 3 ,   i < j ,   l = 1 , … , 20 >    = c ( 𝒱 I 1,2 ( 6 ) ) = c ( 𝒱 I 2,3 ( 6 ) ) = 630 and a maximum number of C 3 self-complementary maximum circular codes MS C 3 for (36) max < ms c 3 ( 𝒱 I i , j ( l ) ) ,   i , j = 1,2 , 3 ,   i < j ,   l = 1 , … , 20 >    = ms c 3 ( 𝒱 I 1,2 ( 4 ) ) = ms c 3 ( 𝒱 I 2,3 ( 4 ) ) = 6 . The numbers sc ( 𝒱 I 1,3 ( l ) ) of self-complementary circular codes SC have a particular growth function (37) sc ( 𝒱 I 1,3 ( l ) ) = 1   for     l = 12,14,16,18 , sc ( 𝒱 I 1,3 ( l ) ) = 0   for     l = 8,10 . The numbers c 3 ( 𝒱 I 1,3 ( l ) ) of circular codes C 3 have a particular growth function (38) c 3 ( 𝒱 I 1,3 ( l ) ) = 1   for     l = 16,18 , c 3 ( 𝒱 I 1,3 ( 17 ) ) = 2 , c 3 ( 𝒱 I 1,3 ( l ) ) = 0   for     l = 7 , … , 15 .

Transversion I map $\mathcal{V}^{I}_{i,j}(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{V}^{I}_{i,j}(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{V}^{I}_{i,j}(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{V}^{I}_{i,j}(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{V}^{I}_{i,j}(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{V}^{I}_{i,j}(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

The transversion I $\mathcal{V}^{I}_{1,2,3}$ always generates trinucleotide circular codes. Indeed, for $l = 1, \dots, 20$,
$$c(\mathcal{V}^{I}_{1,2,3}(l)) > 0. \tag{39}$$
The lists of trinucleotide circular codes $C$ associated with $c(\mathcal{V}^{I}_{1,2,3}(l))$ and $c(\mathcal{V}^{I}_{1,2,3}(20 - l))$ are different for $l = 1, \dots, 9$ (not shown). The transversion I $\mathcal{V}^{I}_{1,2,3}$ generates a maximum number of trinucleotide circular codes $C$ for
$$\max\{\, c(\mathcal{V}^{I}_{1,2,3}(l)) : l = 1, \dots, 20 \,\} = c(\mathcal{V}^{I}_{1,2,3}(10)) = 66 \tag{40}$$
and a maximum number of $C^3$ self-complementary maximum circular codes $MSC^3$ for
$$\max\{\, \mathrm{msc}^3(\mathcal{V}^{I}_{1,2,3}(l)) : l = 1, \dots, 20 \,\} = \mathrm{msc}^3(\mathcal{V}^{I}_{1,2,3}(4)) = \mathrm{msc}^3(\mathcal{V}^{I}_{1,2,3}(16)) = 9. \tag{41}$$

Transversion I map $\mathcal{V}^{I}_{1,2,3}(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{V}^{I}_{1,2,3}(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{V}^{I}_{1,2,3}(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{V}^{I}_{1,2,3}(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{V}^{I}_{1,2,3}(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{V}^{I}_{1,2,3}(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

For $l = 1, \dots, 20$,
$$c(\mathcal{V}^{II}_{1}(l)) = c(\mathcal{V}^{II}_{3}(l)),\quad \mathrm{mc}(\mathcal{V}^{II}_{1}(l)) = \mathrm{mc}(\mathcal{V}^{II}_{3}(l)),\quad \mathrm{sc}(\mathcal{V}^{II}_{1}(l)) = \mathrm{sc}(\mathcal{V}^{II}_{3}(l)),\quad c^3(\mathcal{V}^{II}_{1}(l)) = c^3(\mathcal{V}^{II}_{3}(l)),\quad \mathrm{msc}^3(\mathcal{V}^{II}_{1}(l)) = \mathrm{msc}^3(\mathcal{V}^{II}_{3}(l)). \tag{42}$$
The lists of trinucleotide circular codes $C$ associated with $c(\mathcal{V}^{II}_{1}(l))$ and $c(\mathcal{V}^{II}_{3}(l))$ are different for $l = 1, \dots, 8$ (not shown). No trinucleotide code is circular after a certain number of transversions II $\mathcal{V}^{II}_{i}$ in the trinucleotides of the common trinucleotide circular code $X_0$. Precisely, for $l = 9, \dots, 20$,
$$c(\mathcal{V}^{II}_{1}(l)) = c(\mathcal{V}^{II}_{3}(l)) = 0 \tag{43}$$
and for $l = 12, \dots, 20$,
$$c(\mathcal{V}^{II}_{2}(l)) = 0. \tag{44}$$
The transversion II $\mathcal{V}^{II}_{i}$ generates a maximum number of trinucleotide circular codes $C$ for
$$\max\{\, c(\mathcal{V}^{II}_{i}(l)) : i = 1,2,3,\ l = 1, \dots, 20 \,\} = c(\mathcal{V}^{II}_{2}(5)) = 176 \tag{45}$$
and a maximum number of $C^3$ self-complementary maximum circular codes $MSC^3$ for
$$\max\{\, \mathrm{msc}^3(\mathcal{V}^{II}_{i}(l)) : i = 1,2,3,\ l = 1, \dots, 20 \,\} = \mathrm{msc}^3(\mathcal{V}^{II}_{2}(4)) = 6. \tag{46}$$

Transversion II map $\mathcal{V}^{II}_{i}(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{V}^{II}_{i}(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{V}^{II}_{i}(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{V}^{II}_{i}(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{V}^{II}_{i}(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{V}^{II}_{i}(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

For $l = 1, \dots, 20$,
$$c(\mathcal{V}^{II}_{1,2}(l)) = c(\mathcal{V}^{II}_{2,3}(l)),\quad \mathrm{mc}(\mathcal{V}^{II}_{1,2}(l)) = \mathrm{mc}(\mathcal{V}^{II}_{2,3}(l)),\quad \mathrm{sc}(\mathcal{V}^{II}_{1,2}(l)) = \mathrm{sc}(\mathcal{V}^{II}_{2,3}(l)),\quad c^3(\mathcal{V}^{II}_{1,2}(l)) = c^3(\mathcal{V}^{II}_{2,3}(l)),\quad \mathrm{msc}^3(\mathcal{V}^{II}_{1,2}(l)) = \mathrm{msc}^3(\mathcal{V}^{II}_{2,3}(l)). \tag{47}$$
The lists of trinucleotide circular codes $C$ associated with $c(\mathcal{V}^{II}_{1,2}(l))$ and $c(\mathcal{V}^{II}_{2,3}(l))$ are different for $l = 1, \dots, 5, 13, 14$ (not shown). The distribution of trinucleotide codes which are not circular under transversions II $\mathcal{V}^{II}_{i,j}$ in the trinucleotides of the common trinucleotide circular code $X_0$ is very unusual. Indeed, for $l = 6, \dots, 12, 15, \dots, 20$,
$$c(\mathcal{V}^{II}_{1,2}(l)) = c(\mathcal{V}^{II}_{2,3}(l)) = 0 \tag{48}$$
and for $l = 15, \dots, 20$,
$$c(\mathcal{V}^{II}_{1,3}(l)) = 0. \tag{49}$$
The transversion II $\mathcal{V}^{II}_{i,j}$ generates a maximum number of trinucleotide circular codes $C$ for
$$\max\{\, c(\mathcal{V}^{II}_{i,j}(l)) : i,j = 1,2,3,\ i < j,\ l = 1, \dots, 20 \,\} = c(\mathcal{V}^{II}_{1,3}(6)) = 662 \tag{50}$$
and a maximum number of $C^3$ self-complementary maximum circular codes $MSC^3$ for
$$\max\{\, \mathrm{msc}^3(\mathcal{V}^{II}_{i,j}(l)) : i,j = 1,2,3,\ i < j,\ l = 1, \dots, 20 \,\} = \mathrm{msc}^3(\mathcal{V}^{II}_{1,3}(4)) = 6. \tag{51}$$
The numbers $c(\mathcal{V}^{II}_{1,2}(l)) = c(\mathcal{V}^{II}_{2,3}(l))$ of circular codes $C$ have a particular growth function
$$c(\mathcal{V}^{II}_{1,2}(l)) = c(\mathcal{V}^{II}_{2,3}(l)) = 1 \ \text{for}\ l = 13, 14,\qquad c(\mathcal{V}^{II}_{1,2}(l)) = c(\mathcal{V}^{II}_{2,3}(l)) = 0 \ \text{for}\ l = 6, \dots, 12. \tag{52}$$

Transversion II map $\mathcal{V}^{II}_{i,j}(l)$ in $l$ trinucleotides of the common trinucleotide circular code $X_0$. Number $c(\mathcal{V}^{II}_{i,j}(l))$ of circular codes $C$, number $\mathrm{mc}(\mathcal{V}^{II}_{i,j}(l))$ of maximum circular codes $MC$, number $\mathrm{sc}(\mathcal{V}^{II}_{i,j}(l))$ of self-complementary circular codes $SC$, number $c^3(\mathcal{V}^{II}_{i,j}(l))$ of circular codes $C^3$, and number $\mathrm{msc}^3(\mathcal{V}^{II}_{i,j}(l))$ of $C^3$ self-complementary maximum circular codes $MSC^3$.

For $l = 1, \dots, 19$,
$$c(\mathcal{V}^{II}_{1,2,3}(l)) = 0 \tag{53}$$
and obviously, by letter invariance, $c(\mathcal{V}^{II}_{1,2,3}(20)) = 1$, as in Tables 4 and 7.

The common trinucleotide circular code $X_0$ can be partitioned according to the maps $\mathcal{V}^{II}_{1,2,3}$, $\mathcal{P}$, and $\mathcal{P}^2$ as shown in Table 10.
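The partition can be rebuilt with a minimal Python sketch. It rests on two assumptions that are not stated in this excerpt: that the common code X0 is the usual set of 20 trinucleotides, and that transversion II is the substitution A<->C, G<->T (the other transversion, A<->T, C<->G, being transversion I). The names `TRANSVERSION_II`, `rotations`, and `transversion_ii_partition` are illustrative only.

```python
# Common trinucleotide circular code X0 (assumed to be the usual 20-element set).
X0 = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})

# ASSUMPTION: transversion II swaps A<->C and G<->T.
TRANSVERSION_II = str.maketrans("ACGT", "CATG")


def rotations(trinucleotide):
    """Return the trinucleotide together with its two circular permutations."""
    return {trinucleotide[i:] + trinucleotide[:i] for i in range(3)}


def transversion_ii_partition(code=X0):
    """Group each x with the code word that is a rotation of its transversion II image."""
    classes = set()
    for x in code:
        y = x.translate(TRANSVERSION_II)   # transversion II on all three positions
        partners = rotations(y) & code     # code words that are circular permutations of y
        classes.add(frozenset({x} | partners))
    return classes


if __name__ == "__main__":
    for cls in sorted(sorted(c) for c in transversion_ii_partition()):
        print(cls)
```

Under these assumptions the sketch returns ten classes of two trinucleotides each, for example AAC with ACC and CAG with TAC.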

Let $P_i = \{x, x'\}$, $i \in \{1, \dots, 10\}$, be a class of the partition, composed of two trinucleotides $x, x' \in X_0$. For $l = 1$, any transversion II of a trinucleotide $x \in P_i$ generates a trinucleotide $y$ which is a permuted trinucleotide of the other trinucleotide $x' \in P_i$. So, any transversion II of a trinucleotide $x \in X_0$ leads to a trinucleotide code which is not circular, since it contains both $x'$ and its circular permutation $y$. For $2 \le l \le 19$, the proof needs a computer analysis of the necklace for the nontrivial cases when two transversions II occur with two trinucleotides in the same partition class.
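The cases $l = 1$ and $l = 20$ can be checked with a short Python sketch. It does not use the necklace analysis mentioned above; it swaps in the graph criterion of Fimmel, Michel and Strüngmann, under which a trinucleotide code is circular exactly when the directed graph with edges b1 -> b2b3 and b1b2 -> b3 (one pair per code word b1b2b3) has no cycle. The set X0 and the substitution A<->C, G<->T for transversion II are assumptions, as are all function names.

```python
X0 = frozenset({
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
})

# ASSUMPTION: transversion II swaps A<->C and G<->T.
TRANSVERSION_II = str.maketrans("ACGT", "CATG")


def apply_transversion_ii(code, subset):
    """Apply transversion II on all three positions of the trinucleotides in subset."""
    return {t.translate(TRANSVERSION_II) if t in subset else t for t in code}


def is_circular(code):
    """Graph criterion: the code is circular iff its associated graph is acyclic."""
    edges = {}
    for t in code:
        edges.setdefault(t[0], set()).add(t[1:])   # b1   -> b2b3
        edges.setdefault(t[:2], set()).add(t[2])   # b1b2 -> b3
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    indegree = {n: 0 for n in nodes}
    for targets in edges.values():
        for v in targets:
            indegree[v] += 1
    # Kahn's algorithm: the graph is acyclic iff every node can be removed.
    stack = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while stack:
        n = stack.pop()
        removed += 1
        for v in edges.get(n, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return removed == len(nodes)


if __name__ == "__main__":
    print("X0 circular:", is_circular(X0))
    print("circular codes for l = 1:",
          sum(is_circular(apply_transversion_ii(X0, {x})) for x in X0))
    print("l = 20 circular:", is_circular(apply_transversion_ii(X0, X0)))
```

The intermediate cases $2 \le l \le 19$ would require enumerating all subsets of X0, which this sketch does not attempt.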

Very surprisingly, for the three maps of transition, transversions I and II, $\mathcal{T}_{i}(l)$, $\mathcal{V}^{I}_{i}(l)$, and $\mathcal{V}^{II}_{i}(l)$, $i \in \{1,2,3\}$, $\mathcal{T}_{i,j}(l)$, $\mathcal{V}^{I}_{i,j}(l)$, and $\mathcal{V}^{II}_{i,j}(l)$, $i,j \in \{1,2,3\}$ with $i < j$, and $\mathcal{T}_{1,2,3}(l)$ (not for $\mathcal{V}^{I}_{1,2,3}(l)$ and $\mathcal{V}^{II}_{1,2,3}(l)$), the numbers $\mathrm{msc}^3$ of self-complementary maximum circular codes $MSC^3$ for the first even values of $l$ follow a series of binomial coefficients. For $\mathcal{T}_{i}(l)$, $\mathcal{V}^{I}_{i}(l)$, and $\mathcal{V}^{II}_{i}(l)$, $i \in \{1,2,3\}$, $\mathcal{T}_{1,2}(l)$, $\mathcal{T}_{2,3}(l)$, and $\mathcal{V}^{II}_{i,j}(l)$, $i,j \in \{1,2,3\}$ with $i < j$, the numbers $\mathrm{mc}$ of maximum circular codes $MC$ for the first even values of $l$ follow a series of binomial coefficients. For $\mathcal{T}_{1,2,3}(l)$, the numbers $c^3$ of circular codes $C^3$ for the values $l$ and $(20 - l)$ with $l = 1, \dots, 8$ follow a series of binomial coefficients. These binomial properties with some numbers of circular codes for the three maps of transition, transversions I and II have no combinatorial explanation so far.

A comprehensive computer analysis of transition and transversions I and II in the $C^3$ self-complementary maximum circular code $X_0$ shows some new results, in particular: (i) transversion I $\mathcal{V}^{I}_{2}(l)$ on the 2nd position of any subset of trinucleotides of $X_0$ generates trinucleotide circular codes which are always $C^3$, and (ii) transversion II $\mathcal{V}^{II}_{1,2,3}$ on the three positions of any proper subset of trinucleotides of $X_0$ yields no trinucleotide circular codes. In addition to the classical self-complementary (Definition 20) partition of $X_0$ known since 1996, a new partition of $X_0$ based on the transversion II map $\mathcal{V}^{II}_{1,2,3}$ (Definition 33) and the circular permutation maps $\mathcal{P}$ and $\mathcal{P}^2$ (Definition 18) is also identified here. These results extend our theory of circular code in genes to its evolution under transition and transversion.


Supporting Information

Accession numbers

The EMBL database (http://www.ebi.ac.uk/embl) accession numbers of the sequences used in this article are: for P. pedestris and Italopodisma sp. mitochondrial DNA (AF085501–AF085505) and Numts (AF085508–AF085524, AF085526–AF085538, AF085575–AF085578, AF085539–AF085545, AF085547–AF085550, AF085552–AF085574, EF088292–EF088294, EF088296–EF088309, EF088313, and EF088319–EF088323); for rDNA pseudogenes (AM183587, AM183588, AM183591–AM183594, AM183596–AM183608, AM183610–AM183613, AM183616–AM183624, and AM238436–AM238438); for Parapodisma mikado (AF085506); for Locusta migratoria (X80245); and for Drosophila sp. (AF012030–AF012035, AF012037–AF012052, U62715–U62731, U65653).




Methods

Arabidopsis transgenic lines

Point mutation frequencies were assayed using the transgenic A. thaliana (accession Columbia) lines 693, 699, 747 and M4. In lines 693, 699 and 747, stop codons were introduced in the GUS ORF at three different positions, 112G→T, 166G→T and 118A→T, respectively (Fig. 1a); these transgenic lines were provided by Igor Kovalchuk (University of Lethbridge, Canada) [32]. Line M4 was provided by Anna Depicker (Ghent University, Belgium) [33]. Line M4 carries a missense mutant of the GUS gene in which the base T at position 1390 is mutated to C (Fig. 1a). Somatic homologous recombinations were scored using lines 651 and R2L1, where the recombination substrates are inverted repeats of a truncated GUS gene (Fig. 1b). Line 651 (C24 ecotype) was a gift from Barbara Hohn (Friedrich Miescher Institute, Switzerland) [34]. We obtained line R2L1 (Columbia ecotype) [35] and line G10 (Columbia ecotype) [36] from Francois Belzile (University of Laval, Canada). Line G10 has a microsatellite (a stretch of 10 Gs) within the GUS ORF (Fig. 1c). The Tag1 line was provided by Nigel Crawford (University of California, California) [37]. All experiments were conducted on homozygous seedlings.

Bacterial strains

Infections were carried out using wild-type A. tumefaciens strain Ach5 (referred to as VOT in the main body of the paper) and its derivatives (Table 1), as well as with E. coli. The Agrobacterium strains Ach5 and LBA4404 (referred to as VXX in the main body of the paper) were provided by K. Veluthambi (Madurai Kamaraj University, India) [51]. LBA4404 is a derivative of Ach5 carrying the disarmed Ti plasmid pAL4404; hence, it has only the vir and ori regions of the Ti plasmid, but not the oncogenes and T-DNA [52]. We generated LBA4404 (pCAMBIA2300) (referred to as VXT) by electroporating LBA4404 with the binary vector pCAMBIA2300. We obtained the strain LBA4002 (referred to as XXX) from Paul J. Hooykaas (Leiden University, the Netherlands). LBA4002 is an Ach5 derivative without the Ti plasmid [53]. All Agrobacterium strains were grown on Luria-Bertani (LB) media at 28 °C. The strain LBA4404 (pCAMBIA2300) was grown on media containing 100 mg/l kanamycin. E. coli (strain DH5α) was grown in LB media at 37 °C. Heat-killed bacteria were obtained by heating the culture at 70 °C for 30 min [39], and the heat-killed forms were verified by plating them again on appropriate media. Prior to infection, the bacteria were cultured in liquid LB media with appropriate antibiotics, and the suspension (optical density at 600 nm of 0.6 to 0.9) was centrifuged at 1,100 × g for 10 min and washed three times with an equal volume of liquid germination media to eliminate traces of LB.

Plant growth conditions and method of infection

Arabidopsis seeds were surface sterilised with 500 μl of 70 % ethanol, rinsed with sterile water and treated with 0.5 % bleach for two minutes. Subsequently, the seeds were washed four times with sterile water. Infection of Arabidopsis seedlings was performed according to the protocol of Li et al. [54] with minimal modification. The seeds were plated on germination media (sterile Murashige and Skoog (MS) media with 3 % sucrose, pH 5.7). Seed germination was synchronised by keeping the MS plates with seeds in the dark, at 4 °C for 48 h. Then the plates were moved to a growth chamber (Percival, USA), having a uniform light intensity of 8000 lux units under a 16-h light/8-h dark cycle. The temperature was maintained at 22 °C throughout the experiments and the humidity was set at 80 %. After two days, the seedlings were rinsed in liquid germination media containing the bacteria and co-cultivated on germination media for the appropriate time period (0.5 h to 48 h). The control seedlings were treated similarly to the infected ones, except that the liquid medium was devoid of bacteria. The infected as well as control seedlings were then surface sterilised with liquid MS media containing 250 mg/l Cefotaxime and 0.05 % plant preservative mixture (Biogenuix Medsystem Pvt. Ltd., New Delhi, India). These seedlings were gently dropped onto germination media containing 250 mg/l Cefotaxime and 0.05 % plant preservative mixture using a wide-mouthed pipette. Uniform spacing between the seedlings was maintained for all the experiments. These plates were kept in the plant growth chamber in the same conditions mentioned above for two weeks and then used for GUS histochemical staining.

β-Glucuronidase (GUS) histochemical staining

GUS histochemical staining was performed according to the protocol of Jefferson [55]. The staining buffer (100 mM sodium phosphate buffer [pH 7.0]) contained 1 mM 5-bromo-4-chloro-3-indolyl glucuronide (X-Gluc) (Biosynth, Switzerland), 0.1 % Triton X-100 and 50 μg/ml kanamycin. GUS staining solution (10 ml) was added to six-well plates containing approximately 50 plants per well. The plates were vacuum-infiltrated for 10 min and incubated at 37 °C for 48 h. Subsequently, the plants were bleached with 70 % ethanol. The blue spots (Fig. 1e), which reflect mutation reversions, were counted using a light microscope (Leica KL300).

Estimating mutation frequencies

Each experiment was done in triplicate, and in total about 400–500 plants were used for each treatment. For studies involving FSM, about 200–250 plants were used. Mutation frequencies were expressed as the average number of spots observed per plant [33]. The mutation frequencies of the infected plants were compared with the spontaneous mutation frequencies of the control plants. The mutation frequency data sets were tested for normality [56] and equality of variances [57, 58], and were then subjected to one-way analysis of variance (ANOVA, α = 0.05) to determine significant effects (P < 0.05), if any. Duncan's multiple range test [58–60] was employed when ANOVA revealed significant differences (P < 0.05). All statistical analyses were performed using STATISTICA version 8 software (StatSoft Inc.). The data sets were plotted in Microsoft Excel (Microsoft Inc.).
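The two quantities named above, the mutation frequency (average number of spots per plant) and the one-way ANOVA F statistic, can be illustrated with a minimal Python sketch. It is not the STATISTICA workflow used in the study, it omits the normality, variance and Duncan tests, and the spot counts in the usage lines are invented purely for illustration. The function names are hypothetical.

```python
from statistics import mean


def mutation_frequency(spots_per_plant):
    """Average number of GUS spots per plant."""
    return sum(spots_per_plant) / len(spots_per_plant)


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of treatment groups."""
    k = len(groups)                          # number of treatments
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Made-up spot counts per plant, NOT data from the study.
    control = [0, 1, 0, 2, 1]
    infected = [2, 3, 1, 4, 2]
    print("control frequency:", mutation_frequency(control))
    print("infected frequency:", mutation_frequency(infected))
    print("F statistic:", one_way_anova_f([control, infected]))
```

The F statistic would still have to be compared against the F distribution with (k − 1, n − k) degrees of freedom to obtain a P value, which a statistics package handles.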

