How do researchers define the region a lead SNP encompasses?


As I understand it, a lead SNP captures the variance for all unmeasured SNPs in a region due to its low p-value and high linkage disequilibrium. However, the region size differs between papers (e.g. from ±500 kb to ±1 Mb). How are these sizes chosen, and why do they differ?

Edit to add examples: Xia et al. Transl Psychiatry. 2017 Aug "Genome-wide association analysis identifies common variants influencing infant brain volumes": 500 kb region

Polushina et al. Transl Psychiatry. 2017 Dec "Analysis of the joint effect of SNPs to identify independent loci and allelic heterogeneity in schizophrenia GWAS data": 1 Mb region
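To make the two conventions in the question concrete, here is a minimal sketch contrasting a fixed flanking window with a boundary set by linkage disequilibrium. The function names, the toy positions, and the r² cutoff are all hypothetical and are not taken from either cited paper; real analyses would read r² values from a reference panel.

```python
from typing import List, Optional, Tuple


def fixed_window(lead_pos: int, flank_bp: int) -> Tuple[int, int]:
    """Region = lead SNP position +/- a fixed flank, clipped at position 0."""
    return max(0, lead_pos - flank_bp), lead_pos + flank_bp


def ld_window(lead_pos: int,
              snps: List[Tuple[int, float]],
              r2_min: float) -> Tuple[int, int]:
    """Region spanned by the outermost SNPs whose r2 with the lead SNP
    is at least r2_min. `snps` holds (position, r2_with_lead) pairs."""
    linked = [pos for pos, r2 in snps if r2 >= r2_min]
    linked.append(lead_pos)  # the lead SNP always belongs to its own region
    return min(linked), max(linked)


# Hypothetical toy data: (position in bp, r2 with the lead SNP)
lead = 1_000_000
toy_snps = [(400_000, 0.10), (820_000, 0.65), (950_000, 0.92),
            (1_100_000, 0.85), (1_400_000, 0.30)]

print(fixed_window(lead, 500_000))   # (500000, 1500000)
print(ld_window(lead, toy_snps, 0.6))  # (820000, 1100000)
```

The fixed window is the same for every locus, while the LD-based region shrinks or grows with local LD, which is one plausible reason reported region sizes differ between studies.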

Genomic diversity of the human intestinal parasite Entamoeba histolytica

Entamoeba histolytica is a significant cause of disease worldwide. However, little is known about the genetic diversity of the parasite. We re-sequenced the genomes of ten laboratory cultured lines of the eukaryotic pathogen Entamoeba histolytica in order to develop a picture of genetic diversity across the genome.


The extreme nucleotide composition bias and repetitiveness of the E. histolytica genome provide a challenge for short-read mapping, yet we were able to define putative single nucleotide polymorphisms in a large portion of the genome. The results suggest a rather low level of single nucleotide diversity, although genes and gene families with putative roles in virulence are among the more polymorphic genes. We did observe large differences in coverage depth among genes, indicating differences in gene copy number between genomes. We found evidence indicating that recombination has occurred in the history of the sequenced genomes, suggesting that E. histolytica may reproduce sexually.


E. histolytica displays a relatively low level of nucleotide diversity across its genome. However, large differences in gene family content and gene copy number are seen among the sequenced genomes. The pattern of polymorphism indicates that E. histolytica reproduces sexually, or has done so in the past, which has previously been suggested but not proven.

Alu structure and amplification mechanism

The general structure of an Alu element is presented in Figure 1a. The body of the Alu element is about 280 bases in length, formed from two diverged dimers, ancestrally derived from the 7SL RNA gene, separated by a short A-rich region (reviewed in [11]). The 3' end of an Alu element has a longer A-rich region that plays a critical role in its amplification mechanism [12]. The entire Alu element is flanked by direct repeats of variable length that are formed by duplication of the sequences at the insertion site. Alu elements have an internal RNA polymerase III promoter that potentially initiates transcription at the beginning of the Alu and produces RNAs that are responsible for their amplification. However, Alu elements have no terminator for transcription and the transcripts terminate at nearby genomic locations using a TTTT terminator sequence.

The structure of an Alu element. (a) The top portion shows a genomic Alu element between two direct repeats formed at the site of insertion (red arrowheads). The Alu ends with a long A-run, often referred to as the A-tail, and it also has a smaller A-rich region (indicated by AA) separating the two halves of a diverged dimer structure. Alu elements have the internal components of a RNA polymerase III promoter (boxes A and B), but they do not encode a terminator for RNA polymerase III. They utilize whatever stretch of T nucleotides is found at various distances downstream of the Alu element to terminate transcription. A typical Alu transcript is shown below the genomic Alu, showing that it encompasses the entire Alu, including the A-tail, and has a 3' region that is unique for each locus. (b) The Alu RNA is thought to fold into separate structures for each monomer unit. The RNA has been shown to bind the 7SL RNA SRP9 and 14 heterodimer, as well as polyA-binding protein (PABP). It is thought that at least one other protein binds the duplex portion of the RNA structure. (c) In the target-primed reverse transcription mechanism, the Alu RNA (blue) brings the ORF2p to the genome where its endonuclease activity cleaves at a T-rich consensus sequence. The T-rich region primes reverse transcription by ORF2p on the 3' A-tail region of the Alu element. This creates a cDNA copy of the body of the Alu element. A nick occurs by an unknown mechanism on the second strand and second-strand synthesis is primed. The new Alu element is then flanked by short direct repeats that are duplicates of the DNA sequence between the first and second nicks.

Each RNA polymerase III-generated Alu RNA is unique in terms of: (i) accumulated mutations in the Alu element itself; (ii) the length and accumulated sequence heterogeneity in the encoded A-rich region at its 3' end; and (iii) the unique 3' end on each RNA transcribed from the adjacent genomic site. Those RNAs are then thought to assemble into ribonucleoprotein particles (Figure 1b) that involve the SRP9/14 heterodimer [13], polyA-binding protein (PABP) [14, 15] and at least one other unidentified protein that binds to the RNA structure [14, 15]. The SRP9/14 proteins and PABP are thought to help the Alu RNA associate with a ribosome, where it might become associated with ORF2 protein (ORF2p) being translated from L1 elements [2, 16, 17]. Alu RNAs then utilize the purloined ORF2p to copy themselves at a new genomic site using a process termed target-primed reverse transcription (Figure 1c; reviewed in [18, 19]).

Although Alu is dependent on the L1 ORF2p protein, Alu retrotransposition is not simply an extension of the L1 retrotransposition process. For instance, L1 depends on ORF1p and ORF2p, while Alu requires ORF2p only [2, 20, 21]. This may be one of the reasons why Alu causes several times as many diseases as L1 through insertion [22, 23] and has twice the copy number of L1 [1]. Because L1 elements have been shown to have a splice variant that makes only ORF2p [24], or that may express ORF2p from elements with a mutated ORF1, Alu might be able to amplify in cells that do not effectively amplify L1. In fact, although L1 transcription is high in the testis, almost all of the RNA is not full-length, mostly due to splicing [24]. This means that Alu may retrotranspose well in the testis, even though L1 retrotransposes poorly. Alu and L1s have several other differences. Following expression, Alu RNAs can retrotranspose rapidly, whereas L1 RNAs take almost 24 h longer [25]. Retrotransposition of Alu and L1 elements is also differentially influenced by different APOBEC3 proteins [26–28]. Alu elements encode the A-tail separately at each locus rather than through post-transcriptional polyadenylation, as with L1. Thus, Alu A-tails are prone to shrinkage and accumulation of mutations that can affect the amplification process from each particular locus (discussed below) [16].

Only a handful of the greater than 1 million genomic Alu elements can amplify [29, 30]. It seems highly likely that relatively few polymorphic elements in the population have high amplification capability that maintains Alu amplification within the population. There are many factors that contribute to the relative amplification activity of an Alu locus (Figure 2) [29, 31]. These include: (i) the influence of the primary genomic sequence on transcription; (ii) epigenetic influences on transcription; (iii) the length, and possibly the specific nature, of the 3' unique region of the Alu RNAs; (iv) the length and heterogeneity of the A-tail of the Alu; and (v) divergence of the body of the Alu element, which seems likely to influence RNA structure and probably relevant protein binding (Figure 1b).

Why so few Alu elements are active. Out of the more than 1 million Alu elements in the human genome, very few are capable of making copies, although many make transcripts. Upon insertion in a new locus, the factors that make a very active Alu element are the flanking sequences influencing the promoter, creating a short unique region. Active elements match the consensus Alu element fairly closely and they have a long and fairly perfect A-tail. Active elements degrade rapidly on an evolutionary time scale by A-tail shortening, heterogeneous base interruptions accumulating in the A-tail, and eventually by the accumulation of random mutations in the Alu element. At least some of these changes alter Alu activity through disruption of the various proteins binding to the RNA in the ribonucleoprotein (Figure 1b).

These mechanistic features all contribute to the observed paucity of actively amplifying 'master' or 'source' Alu elements in the human genome. The internal RNA polymerase III promoter is not strong unless it fortuitously lands near appropriate flanking sequences [32]. Furthermore, epigenetics seems to silence the majority of Alu transcripts. Thus, there are generally very low levels of RNA polymerase III transcribed Alu RNAs in a cell and it is transcribed by a number of dispersed loci, including many loci that are incapable of active retrotransposition [33]. Because the A-tail grows during the insertion process [2, 34], most new inserts have a sufficiently long A-tail for effective amplification. However, because each new insert lands in a different genomic environment, the new loci will vary tremendously in their transcription potential owing to the influences of flanking sequences [32] and epigenetics. In addition, the 3' flanking sequence will provide the RNA polymerase III terminator, and those with longer 3' unique regions will be poor at retrotransposition [29]. Following insertion, those elements that are initially capable of retrotransposition will gradually lose that capability by a series of sequence changes. The most rapid change will be that the long, relatively unstable A-tails will shrink rapidly [16], resulting in lower retrotransposition capability [12, 29]. In addition, the A-tails will rapidly accumulate mutations and often form variant microsatellite-like sequences at their ends that will also impair the activity [29]. Over the long run, the body of the Alu element will accumulate mutations [31], first CpG mutations, and then other random mutations, which will alter the promoter, RNA folding, and/or interactions with cellular proteins, leaving relatively few of the older Alu elements capable of retrotransposition. The sum of all of these factors contributes to the lack of activity of most Alu elements.

The prison scans

Psychopaths make up between 15 and 25 percent of the male North American prison population

A study from King's College in 2012 found that violent male offenders who met the diagnosis for psychopathy displayed significantly reduced gray matter volumes in the anterior rostral prefrontal cortex and temporal poles. This striking, and specific, structural abnormality in the part of the brain associated with empathy and feelings of guilt, points to a clear neurological difference between regular violent offenders and genuine psychopaths.

A straightforward lack of empathy isn't enough to make someone a full-blown psychopath though. Several MRI studies have shown a more complex combination of neurological activities is occurring inside the brain of a psychopath.

A 2013 study took MRI scans of 121 prison inmates split into three groups: rated as highly, moderately or weakly psychopathic. The inmates were shown images displaying physical pain and then asked to imagine that accident happening to themselves or others. The highly psychopathic subjects displayed a pronounced empathic response to the thought of pain when imagined to themselves. Brain activity across several regions involved in pain empathy was identified as heightened, including the anterior insula, the anterior midcingulate cortex, somatosensory cortex, and the right amygdala.

It was clear psychopaths understood and empathized with the concept of pain when inflicted upon themselves. When asked to imagine that same pain inflicted upon others those psychopathic subjects displayed a very different response. Not only did those empathic areas of the brain fail to activate, but increased activity was seen in another area of the brain, the ventral striatum.

The ventral striatum is a fascinating part of the brain, known to manage reward processing, motivation and decision-making. This particular study suggested that psychopaths could actually enjoy imagining pain being inflicted upon others.

But how this actually motivates a violent or antisocial action turns out to be a little more complex than simply deriving pleasure from other people being hurt.

After all, not all those that display psychopathic characteristics turn out to be violent criminals. Dr James Fallon can attest to that. So what else is going on inside the brain to cause a psychopath to make an antisocial decision?

One study from 2016 discovered no difference in excitability of the ventral striatum between criminal and non-criminal psychopaths when undertaking a reward game. However, a significant difference between the two groups was identified in the connectivity from the ventral striatum and another brain region called the dorsomedial prefrontal cortex.

This area of the brain is known to manage cognitive control of behavior, performance adjustment, impulse control and general self-inhibition. In highly psychopathic criminals an abnormally high connectivity was identified between the reward-signaling ventral striatum and the behavior-controlling dorsomedial prefrontal cortex.

"These observations raise the hypothesis that psychopathic criminals might exhibit a failure to adjust performance due to aberrant impact of reward expectation," write the scientists behind this 2016 study.

As well as over-valuing the reward signals from the ventral striatum, a recent Harvard study found that people with psychopathy are unable to accurately evaluate the future consequences of their actions.

This MRI study examined 49 prison inmates and discovered a weak connection between the ventral striatum and the ventral medial prefrontal cortex in those inmates with high psychopathic tendencies. Senior author of the Harvard study, Josh Buckholtz, describes this part of the prefrontal cortex as vital for "mental time-travel" – that ability to evaluate the future outcomes of an action relative to the more immediate rewards.

The effect identified in the study was so pronounced that the researchers could accurately predict how often an individual inmate had been convicted of crimes relative to the strength of the connection between the striatum and prefrontal cortex. So the stronger the connection, the more the reward signals were dominating all aspects of a decision.

Buckholtz sees this as a "particular kind of brain wiring dysfunction" that results in bad decision making, regardless of psychopathy.


CFA5 and CFA19 confer risk for histiocytic sarcoma

A GWAS including 177 FCR histiocytic sarcoma cases and 132 FCR controls (Table 1) was performed using 108 084 SNPs. Principal component analysis revealed stratification between FCRs of European vs. North American origin (Fig 1). GWAS was performed in GEMMA [18] using a kinship matrix and linear mixed model to correct for population structure, with a genomic inflation factor (λ) of 0.97. A single association exceeding Bonferroni significance (4.26 × 10⁻⁷) was identified on CFA5 with Pwald = 4.83 × 10⁻⁹ (Fig 2A and S1 Table). The top 27 markers are in high linkage disequilibrium (LD r² ≥ 0.8) and span a 4.3 Mb region around the lead SNP (CFA5:33001550; Fig 2B). A shared haplotype was identified among 90% of cases, with recombination events defining a narrower 1.2 Mb FCR risk haplotype at CFA5:32389061–33633274 (Fig 2C and S2 Table), which harbors over 40 genes. The risk haplotype is also present in 64% of control dogs. There is considerable LD among markers at this locus (r² ≥ 0.6; 28–37 Mb) with the broader GWAS signal extending to 28 Mb, and many cases continue to share a common haplotype throughout the region.

Principal components 1 (13.6% variance) and 2 (6.6% variance) are plotted on the x and y-axes, respectively. The European and North American FCRs (n = 309) form subpopulations with cases and controls distributed throughout both groups.

A) Manhattan plot of -log10P-values (y-axis) for 108084 Illumina SNPs plotted against chromosome position in CanFam3.1 (x-axis). The Bonferroni threshold is plotted on the y-axis in gray (-log10P = 6.33). B) Regional Manhattan plot of the CFA5 association with SNPs color-coded according to pairwise LD (r²) with the lead SNP. C) Length of risk haplotype sharing among cases (purple) and controls (orange) is plotted on the x-axis with the percentage of dogs sharing on the y-axis. Continuous loss of haplotype sharing is tracked in darker purple/orange, while the lighter shades mark points at which some individuals re-gain the common risk haplotype.

The CFA5 risk haplotype is present in the heterozygous state in 53% of cases and 43% of controls. To determine whether additional loci differentiate these groups, we performed a GWAS using only cases and controls heterozygous for the CFA5 risk haplotype (94 vs. 43; Fig 3 and Table 1), thereby neutralizing the effect of the CFA5 locus. To reduce the possibility that our control group contained dogs who could eventually develop histiocytic sarcoma, we applied a more stringent minimum age at collection for controls (11 years), discarding samples from dogs in the lowest age quartile. This provided further separation between controls and cases, as 75% of cases were diagnosed at <10 years of age, while preserving as much power as possible for the GWAS (S1 Fig and S3 Table). A single locus at 52 Mb on CFA19 (Pwald = 2.25 × 10⁻⁷) exceeded Bonferroni significance (4.67 × 10⁻⁷) and was confirmed after permutations (Ppermutations = 0.014). This approach produced a more robust association compared to that which includes CFA5 genotypes as a covariate in the total GWAS cohort (CFA19:52487724, Pwald = 4.25 × 10⁻⁵; S4 Table). A 741 kb critical interval is demarcated by the flanking SNPs in highest LD with the lead SNP (CFA19:52487724, r² ≥ 0.6), encompassing just three genes. Ninety-nine of the 177 cases in the total GWAS cohort (n = 309) had periarticular tumors and 77 had tumors in other locations at the time of diagnosis. The risk allele at the CFA19 locus was more common among periarticular cases (PFisher = 0.015, OR = 2.78, 95% CI = 1.21–6.37).

All cases and controls were heterozygous for the CFA5 risk haplotype. Manhattan plot of -log10P-values (y-axis) for 107102 SNPs by chromosome position in CanFam3.1 (x-axis) is shown at the top. The Bonferroni and 5% permutations thresholds are plotted as gray and red lines, respectively. QQ plot with genomic inflation factor (λ) and regional Manhattan plot of CFA19 locus, showing pairwise LD (r²) relative to the lead SNP, are below, with genes in the region plotted at the bottom.
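The Bonferroni thresholds quoted for the two GWAS can be checked with a short calculation. This sketch assumes a family-wise alpha of 0.05 divided by the number of SNPs tested; the alpha value and the function name are assumptions, as the excerpt does not state how the threshold was derived.

```python
import math


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction.
    alpha = 0.05 is an assumed family-wise error rate."""
    return alpha / n_tests


# SNP counts reported for the full-cohort and CFA5-heterozygote analyses
for n_snps in (108_084, 107_102):
    threshold = bonferroni_threshold(n_snps)
    print(n_snps, f"{threshold:.2e}", round(-math.log10(threshold), 2))
```

Under this assumption, 107102 SNPs give about 4.67 × 10⁻⁷, matching the value reported for the CFA19 analysis, and 108084 SNPs give about 4.63 × 10⁻⁷, which corresponds to the -log10P = 6.33 line in the Manhattan plot legend. It does not reproduce the 4.26 × 10⁻⁷ quoted in the text for the first analysis, so that figure may rest on a different test count.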

In the total cohort (n = 309), SNP genotypes from all autosomes explain 27% ± 14% (PLRT = 0.0034) of the risk for developing histiocytic sarcoma. The CFA5:25-40Mb locus alone explains 22% ± 13% (PLRT = 1.15 × 10⁻⁵), while CFA19:50.5-53Mb explains 8% ± 5% (PLRT = 2.05 × 10⁻⁴). Together, these loci account for 35–37% ± 13% of the phenotypic variance (PLRT = 1.44 × 10⁻⁸). When considering CFA5 and CFA19 genotypes in combination, 39% of dogs who are heterozygous at both loci are cases, whereas 80% of dogs heterozygous at CFA5 and homozygous at CFA19 are cases (Table 2). Thus, when CFA19 data are included, we observe greatly improved separation of cases and controls relative to analysis with CFA5 genotypes alone.

Multiple hematopoietic malignancies are associated with CFA5 locus

The CFA5 region colocalizes with previously identified associations for two common hematological malignancies in golden retrievers: hemangiosarcoma (29 Mb) and B-cell lymphoma (34 Mb; Fig 4A) [19]. Although distinct cancers, histiocytic sarcoma, B-cell lymphoma, and hemangiosarcoma all arise from cells in the hematopoietic stem cell pathway: dendritic cells and macrophages, B lymphocytes, and hematopoietic precursor cells, respectively [4,8,20,21]. FCRs and golden retrievers are closely related breeds, sharing an immediate common ancestor among the retriever phylogenetic clade [16]. To search for shared risk haplotypes at this locus, we examined published genotypes [19] from golden retrievers diagnosed with hemangiosarcoma or B-cell lymphoma. Using the same haplotype analysis applied to FCRs (See Methods), we defined a 1.4 Mb B-cell lymphoma risk haplotype (CFA5:33001663–34362236) encompassing the lead golden retriever GWAS SNP (CFA5:34117726) for this cancer (S2 Table). This haplotype overlaps the FCR risk haplotype for 631 kb (Fig 4B), and the interval is strongly associated with hematopoietic cancer in both breeds, with a combined P-value of 4.17 × 10⁻¹⁰ compared to 3.43 × 10⁻⁷ and 2.00 × 10⁻⁴ in FCRs and golden retrievers alone, respectively. Direct overlap between golden retriever and FCR haplotypes was not observed at the 29 Mb hemangiosarcoma risk locus.

A) Regional Manhattan plot showing FCR histiocytic sarcoma GWAS SNPs in purple. Results of a combined golden retriever GWAS with 142 hemangiosarcoma, 41 B-cell lymphoma, and 172 controls are overlaid in gold. The peaks at 29 Mb and 33 Mb (CanFam3.1) in golden retrievers correspond to hemangiosarcoma and B-cell lymphoma risk, respectively. B) Regions harboring risk haplotypes identified independently in FCRs (purple) with histiocytic sarcoma and golden retrievers (gold) with B-cell lymphoma are plotted with genes in the region. The risk haplotypes overlap for a shared 631 kb span (bracket).
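The 631 kb shared span follows directly from the two haplotype coordinates given in the text. The sketch below intersects the FCR interval (CFA5:32389061–33633274) with the golden retriever interval (CFA5:33001663–34362236); the helper name is illustrative and not from the paper.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp) on the same chromosome


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the intersection of two closed intervals, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


fcr_haplotype = (32_389_061, 33_633_274)     # FCR histiocytic sarcoma risk haplotype
golden_haplotype = (33_001_663, 34_362_236)  # golden retriever B-cell lymphoma risk haplotype

shared = overlap(fcr_haplotype, golden_haplotype)
print(shared)                  # (33001663, 33633274)
print(shared[1] - shared[0])   # 631611 bp, i.e. about 631 kb
```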

RNA-seq and allele-specific expression

To investigate potential effects of the CFA5 risk haplotype on gene expression, RNA-seq data were generated from RNA isolated from 11 FCR whole blood samples (Table 1). Differential expression analysis was based on the risk haplotype, comparing four dogs who were homozygous for the risk haplotype vs. seven dogs who were heterozygous. The frequency of the risk allele in the FCR control population indicates the difficulty in finding homozygous non-risk individuals; however, this would clearly be beneficial in future expression studies. Forty-three genes and five non-coding RNAs demonstrated significant differential expression. The nearest gene to the CFA5 critical interval, NLRP1, was 1.7 Mb upstream, suggesting the risk locus may have distal effects (S5 Table). When comparing gene expression levels in individual samples to the average expression across controls (see Methods, [22]), seven genes and one lncRNA demonstrated significant individual expression (z-score ≥ |2.5|) among cases. After excluding one heterozygous dog who received chemotherapy one week prior to the blood draw, comparison of the four dogs homozygous for the CFA5 risk haplotype to the remaining six heterozygous dogs revealed an additional 17 genes or non-coding RNAs with significant differential and individual expression (S5 Table).
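As a rough illustration of the individual-expression criterion, the sketch below scores one sample against the control mean and flags it when the absolute z-score reaches 2.5. The exact procedure is described in the paper's Methods, which are not part of this excerpt, so the use of the sample standard deviation, the function names, and the expression values are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def expression_z_score(sample_value: float, control_values: Sequence[float]) -> float:
    """z-score of one sample relative to the control distribution.
    Uses the sample standard deviation (n - 1); this choice is an assumption."""
    return (sample_value - mean(control_values)) / stdev(control_values)


def is_outlier(sample_value: float, control_values: Sequence[float],
               cutoff: float = 2.5) -> bool:
    """Flag a sample whose |z| meets the cutoff (2.5, as quoted in the text)."""
    return abs(expression_z_score(sample_value, control_values)) >= cutoff


# Hypothetical log2(TPM) values for one gene in seven control dogs
controls = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.0]

print(is_outlier(5.5, controls))  # True: far above the control mean
print(is_outlier(5.1, controls))  # False: within the control spread
```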

Because RNA samples were only available from a small number of homozygous and heterozygous individuals and no dogs without risk alleles, the power to detect changes in gene regulation through differential expression analyses was limited. Allele-specific expression (ASE) analysis provides an alternative approach to investigate differential expression utilizing heterozygous individuals. ASE compares expression levels for two alleles at a given coding SNP within an individual; differences between them may result from cis-regulation by variants in non-coding regions. This controls for sources of error between individuals, like environmental, technical, or trans-regulatory effects [23,24]. We performed an ASE analysis for RNA samples isolated from blood for the seven FCRs heterozygous for the CFA5 risk haplotype. We examined genes within 500 kb on either side of the 631 kb shared risk haplotype, extending our search to include potential long-range enhancer-gene interactions [25]. Variants demonstrating significant ASE in two or more FCRs were identified in seven genes: CD68, MPDU1, CHD3, BORCS6, NDEL1, and PIK3R6 (Fig 5 and S6 Table). Both NudE Neurodevelopment Protein 1 Like 1 (NDEL1) and Phosphoinositide-3-kinase regulatory subunit 6 (PIK3R6) lie within the minimal 631 kb shared risk haplotype and contain variants demonstrating significant ASE in at least six of seven FCRs. NDEL1 functions in neuron migration and neurite outgrowth, microtubule organization, and cell signaling. It has been associated with neurodegenerative disease [26] and may play a role in glioblastoma [27]. PIK3R6 functions in leukocytes in the PI3K/Akt pathway, which is commonly dysregulated in cancer [28].

UCSC CanFam3.1 tracks at the 631 kb shared risk haplotype show the blood ChIP-seq regions for H3K4me1 and H3K4me3 for Bernese mountain dogs (pink) and the FCR (purple). ASE variants (black), 98 WGS variants meeting filtering criteria (gray), and the genes in the region are shown below (ENSCAF00000017382 = PIK3R6).

As in human cancers, our data show histiocytic sarcoma is not fully explained by one gene or locus. We next tested effects of the CFA19 risk locus on changes in gene expression. We performed differential expression analysis between four cases homozygous for CFA19 risk and four heterozygous unaffected controls. The CFA5 genotypes were matched between the two groups with one dog homozygous for the CFA5 risk haplotype and three heterozygous dogs in each group. Among the top differentially expressed genes was TNF alpha induced protein 6 (TNFAIP6; Padjusted = 0.024), which lies 37 kb downstream of the GWAS susceptibility critical interval. TNFAIP6 shows a 10.9-fold increased expression in histiocytic sarcoma cases homozygous for the CFA19 haplotype relative to heterozygous individuals. No other differentially expressed genes were proximal to the CFA19 critical interval (S7 Table). Three of the four cases demonstrate significantly increased individual expression (z-score = 2.84–7.61, equivalent to P < 0.01) relative to all controls (n = 7) at TNFAIP6 (S7 Table). Comparison of the log2(TPM) expression at this gene for the four cases vs. seven controls indicates significant differential expression (Wilcoxon P = 0.024); exclusion of the case who received chemotherapy increases the P-value to 0.067 (S2 Fig).

Variant filtering and ChIP-seq analysis

We next sought to identify potential pathogenic variants within the CFA5 631 kb risk haplotype. Using WGS from four FCRs, three cases and one control, we filtered for variants concordant with the risk haplotype (See Table 2). Because FCRs and golden retrievers diagnosed with hematopoietic cancer shared a 631 kb risk haplotype, we hypothesized that they may also share the pathogenic variant(s) on this haplotype. We thus included published WGS from four golden retrievers diagnosed with B-cell lymphoma (three heterozygous for the risk haplotype and one homozygous) for filtering (Table 1). A total of 284 variants matched the segregation pattern of the CFA5 risk haplotype in the four FCR and four golden retriever WGS. A conservative allele frequency threshold of 50% in 1090 genomes from 233 other breed dogs (S8 Table) was applied to eliminate variants common across many breeds, resulting in 218 variants, none of which were unique to FCRs and golden retrievers (S9 Table). No variants were predicted to impact protein sequence or splice sites. Visual inspection of the interval in Integrative Genomics Viewer [29] revealed no structural variants segregating with the risk haplotype. The CanFam3.1 reference genome contains six gaps, totaling approximately 3 kb, within the critical interval, which may mask variants relevant to histiocytic sarcoma susceptibility.

We next investigated potential regulatory variants, which are not fully annotated in the CanFam3.1 reference. To identify promoter and enhancer regions in canine cell types relevant to cancers investigated herein, ChIP-seq data from peripheral blood mononuclear cells from seven dogs were generated for two histone marks, H3K4me1 and H3K4me3, to identify canine promoters and enhancers (See Methods, S10 Table). Publicly available ATAC-seq data identifying open chromatin regions from multiple canine tissues, i.e. spleen, lymph node, and bone marrow [30], were combined with blood ChIP-seq data to define regulatory regions.

Of the selected 218 variants (AF<50%) within the 631 kb critical interval, 98 overlapped with ChIP-seq and/or ATAC-seq regions (Fig 5). As none of the variants were completely unique to the FCR or golden retriever, we considered whether there could be combinations of variants private to the breed. The 98 variants were phased, allowing us to generate haplotypes in retriever and spaniel breeds, the latter of which were included because the retriever and spaniel clades share a recent common ancestor [16], yet the spaniel is not at risk for histiocytic sarcoma. Thus, a comparison of haplotypes in the region between the breeds might highlight combinations of variants that are neutral polymorphisms versus those that are unique to the retriever and possibly pathogenic. However, no blocks of continuous variants were unique to FCRs and golden retrievers. This does not preclude more distal combinations of variants that may be unique to affected individuals; however, it is likely that the causal mutations are present in other breeds.

Transcription factor binding motif analysis and variant genotyping

To further explore candidate pathogenic variants, we selected regulatory regions from blood ChIP-seq data surrounding NDEL1 and PIK3R6, candidates from ASE analysis, to interrogate variants for possible transcription factor binding motif alterations. ATAC-seq regions overlapped with blood ChIP-seq and were thus included. Regulatory elements at PIK3R6 and PIK3R5 contained 92% of the 98 variants overlapping ChIP-seq within the 631 kb shared haplotype (Fig 5 and S9 Table). Five of the 98 WGS variants had significant scores in two transcription factor (TF) motif programs (See Methods), suggestive of a difference in binding affinity between the FCR risk and non-risk alleles (S11 Table); all were within PIK3R5/6 regions. An additional variant (CFA5:33528647), significant in one TF binding affinity program (FIMO, Padj = 0.0067) and demonstrating a difference in SP1 and KLF5 binding affinity between risk and non-risk variants in sTRAP (log(P) = 0.5), was chosen for Sanger sequencing because it is within a human PIK3R6 regulatory region in the GeneHancer promoter- and enhancer-gene interaction database where SP1 and KLF5 are reported to bind [31]. Of the six variants selected for genotyping, one lies within a 12 bp G repeat in a GC-rich region, and we were unable to obtain reliable genotypes for this variant in all dogs (CFA5:33531804). The remaining five variants were significantly associated with histiocytic sarcoma (S12 Table). We calculated Fisher’s exact P-values for matched genotypes across 79 case and 69 control FCRs (Table 3). Variants at CFA5:33531780 and 33576022 had the lowest P-values of 4.2 × 10⁻⁵ and 4.6 × 10⁻⁵, respectively (lead SNP P = 1.8 × 10⁻⁴, Table 3), and were located in ChIP-seq regulatory regions upstream of PIK3R6.

The 631 kb shared haplotype delineated here is associated with histiocytic sarcoma in FCRs and B-cell lymphoma in golden retrievers, and we hypothesize that it harbors one or more pathogenic variants contributing to susceptibility for both diseases in each breed. Although less frequently than hemangiosarcoma and B-cell lymphoma, which affect 20% and 6% of the breed respectively [32], golden retrievers also develop histiocytic sarcoma (7% of all tumors in the breed) [4,12,33]. Additional genotyping of the five candidate variants within this region reveals that they are present in 75% of FCRs with lymphoma (B-cell n = 3, T-cell n = 4, and unspecified subtype n = 13; Table 1), and are in complete LD with the lead histiocytic sarcoma GWAS SNP (r² = 1), consistent with our hypothesis. Three of the five variants were also in complete LD with this SNP in golden retrievers with B-cell lymphoma (n = 9) or histiocytic sarcoma (n = 21), i.e. CFA5:33576022, 33587141, and 33594214. The remaining two had r² values of 0.32 (CFA5:33528647) and 0.16 (CFA5:33531780; S12 Table), indicating that they are not on the risk haplotype in golden retrievers. In aggregate, these data provide strong support that one or more of the three variants located on the CFA5 risk haplotype are likely to confer susceptibility to histiocytic sarcoma and B-cell lymphoma in both retriever breeds.
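The r² values quoted above measure linkage disequilibrium between a candidate variant and the lead SNP. A minimal sketch of the standard calculation follows, assuming phased haplotypes coded as 0/1 alleles. The haplotype lists are hypothetical, and the study may have estimated r² from unphased genotypes with dedicated software.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic variants.

    `haplotypes` is a list of (allele_at_variant_A, allele_at_variant_B)
    pairs, each allele coded 0 (non-risk) or 1 (risk).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # risk-allele frequency at A
    p_b = sum(b for _, b in haplotypes) / n   # risk-allele frequency at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either variant is monomorphic")
    return d * d / denom


# Hypothetical haplotypes: the two variants always travel together.
complete_ld = [(1, 1)] * 6 + [(0, 0)] * 4
# Hypothetical haplotypes: alleles are combined at random.
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r_squared(complete_ld))
print(ld_r_squared(no_ld))
```

In the first hypothetical set the alleles are perfectly correlated, which corresponds to the complete LD (r² = 1) reported for the FCR variants. Low values such as 0.16 or 0.32 indicate that a variant is only weakly predictive of the lead SNP allele.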

Genetic factors affecting dilated cardiomyopathy

Here, we review advances in our understanding of the contribution of rare and common variants to DCM. We focus particularly on rare variants, given the growth in the number of variant genes implicated in DCM, and the challenges in interpreting these data. There have been fewer advances from common variant studies of DCM, and we summarize briefly two of the major DCM GWAS.

Rare variants

Rare genetic variants associated with DCM have been identified in genes involved with a range of diverse cellular structures and functions, and most notably with the sarcomere (Table 2). Inheritance of DCM is most commonly autosomal dominant, although autosomal recessive, X-linked, and mitochondrial inheritance have also been reported, particularly in pediatric populations [43]. Approximately 40% of familial DCM is thought to have a primary monogenic basis [5]. Higher estimates of sensitivity for genetic testing have been reported (from 46 to 73% in one study [44]), but these estimates are likely confounded by insufficient control for population variation in the genes studied. Although variants in over 50 genes have been linked to DCM, the evidence is most robust for a ‘core disease set’ encompassing the sarcomeric genes MYH7 (which encodes beta myosin heavy chain), TNNT2 (which encodes troponin T2), and TTN (encoding titin) and the gene LMNA encoding a nuclear envelope protein.

A recent large-scale analysis of rare genetic variation in cardiomyopathy cases compared with normal population variation has also provided insights into the genetics of DCM. The study tested for an excess of rare variants in 46 genes sequenced in up to 1315 DCM cases compared with over 60,000 ExAC reference samples. Truncating variants in TTN were the most common DCM rare variant (14.6%) [16]. There was modest, statistically significant enrichment in only six other genes (MYH7, LMNA, TNNT2, TPM1, DSP, and TCAP) (Table 2). Based on available data, RBM20 is also likely to prove significant (reviewed below) but was not included in the published analysis owing to poor coverage in the ExAC data. Furthermore, sequencing methods were not uniform, and not all genes were sequenced across the DCM cohorts included in the study. Even allowing for this, many genes that have previously been linked to DCM, including genes routinely sequenced in clinical practice such as MYBPC3 and MYH6, showed little or no excess burden in DCM compared with the reference population. The accompanying Atlas of Cardiac Genetic Variation web resource [16] summarizes these data and serves as a useful adjunct to facilitate the interpretation of rare variants in DCM.

Recent disease–gene associations in DCM

Over the past decade, 47 new genes have been categorized as linked with DCM in the Human Gene Mutation Database (HGMD). Many of these links have not been replicated outside of the original reports, and a comprehensive review of these is beyond the scope of this article. A few examples of novel associations are discussed below, selected for critical evaluation either owing to robust evidence, novelty, or clinical importance.

BAG3 encodes a heat-shock chaperone protein and was first linked to DCM in 2011 through the discovery of a large 8733-bp deletion in exon 4 in seven affected family members in a three-generation family, which was absent in 355 controls [45]. Subsequently, coding exons in BAG3 in 311 other unrelated DCM probands were sequenced, which identified seven rare variants (one frameshift, two nonsense, and four missense variants) that were absent from 355 controls. The authors were also able to recapitulate the DCM phenotype in a zebrafish bag3 knockdown model. In separate studies, BAG3 was linked to DCM through a GWAS, with the discovery of a non-synonymous SNP in the coding sequence of BAG3 in DCM cases compared with healthy controls, which is discussed further below (rs2234962, P = 1.1 × 10⁻¹³) [39]. The authors then performed targeted sequencing in a cohort of 168 unrelated DCM probands and identified six variants that were also detected in affected relatives, lending further support to the role of BAG3 as a disease-causing gene.

RBM20 encodes a spliceosome protein that regulates pre-mRNA splicing for many genes, including TTN [46], which is why variants in this gene could hold particular relevance for DCM, either in isolation or in compound heterozygosity with TTN [47]. RBM20 was initially associated with DCM through linkage analysis in two large families with DCM [48]. The authors sequenced all 14 RBM20 exons in each family member and identified a heterozygous missense mutation in exon 9 that co-segregated with disease in all affected individuals, and that was absent in unaffected relations and 480 ethnically matched controls. The authors went on to detect RBM20 missense mutations in exon 9 in six more families affected with DCM. Since the original link with DCM [48], subsequent studies found mutations both within and outside the original RBM20 hotspot in DCM probands, but the segregation data on these variants is limited and the control population was modest in size, meaning that population-level missense variation was not accounted for in these regions [49, 50]. The association of RBM20 and DCM appears most robust for variants in the original hotspot, and further curation is needed to understand the significance of variants in other regions.

The 1p36 deletion syndrome can be associated with cardiomyopathy, and the PRDM16 gene (which encodes a transcription factor) has been identified as a possible cardiomyopathy gene at this locus, linked with a syndromic cardiomyopathy as well as with adult-onset DCM (in 5 out of 131 individuals with four novel missense variants) [51]. However, although there might be a role for PRDM16 in cardiac development, its role as a cardiomyopathy gene has subsequently been questioned [52].

ZBTB17 is also encoded on chromosome 1, at the 1p36 locus. A study of cardiac myocytes and a mouse model of ZBTB17 deletion demonstrated that ZBTB17 is involved in cardiac myocyte hypertrophy and is essential for cell survival [53]. The authors also showed that ZBTB17 encodes a transcription factor (zinc-finger and BTB domain-containing protein 17) that binds the gene CSRP3, a Z-disc protein, mutations of which are found in both HCM and DCM. Given the association between CSRP3 and DCM (in a small cohort with limited segregation data [54], with no subsequent replication), and this new-found function of ZBTB17 in binding CSRP3, the authors hypothesized that ZBTB17 could be a novel gene implicated in DCM.

Many additional transcription factors have also been linked to DCM in recent years, such as GATA5 [55], TBX20 [56], TBX5 [57], GATA6 [58], GATA4 [59], and NKX2-5 [60]. Some of these genes are clearly linked to congenital heart disease phenotypes. However, many of the variants with claimed associations with DCM are missense variants that have been identified within one relatively small group of DCM patients, with variable segregation data. Further studies are required to confirm the link with DCM.

Desmosomal proteins, typically perturbed in arrhythmogenic right ventricular dysplasia/cardiomyopathy (ARVD/ARVC), have also been linked to DCM. The association has been most robust for DSP, which encodes desmoplakin, a desmosomal protein [61], with a strong excess of truncating variants in DSP in DCM [16]. However, some of the more recent associations of desmosomal protein gene variants have limited variant curation and segregation data, such as PKP2 [62] (which encodes plakophilin 2), and these associations are less clear. One such PKP2 variant (c.419C > T(p.(S140F)), previously linked to DCM has been shown not to be associated with heart failure phenotypes [63]. Therefore, of the desmosomal proteins, DSP variants have the most robust association with DCM.

Filamin-C (encoded by FLNC) is a Z-disc protein (Box 1) that provides sarcomeric stability. In recent work, two rare splicing variants in FLNC were detected through whole-exome sequencing in two Italian families and in one US family affected with DCM, with all variants co-segregating with disease [64]. Only one unaffected variant carrier was identified, but this individual declined further follow-up. These variants were absent from 1000 Genomes, NHLBI Go-ESP, and ExAC. The FLNC cardiomyopathy phenotype was not associated with skeletal muscle involvement in this cohort, but was associated with arrhythmias and sudden cardiac death. In the same study, a zebrafish knockdown model showed a phenotype of cardiac dysfunction, with defects in the Z-discs and sarcomere disorganization. Evaluation of FLNC variants in a large (n = 2877) cohort of patients with inherited cardiac diseases, including DCM, has shown that the phenotype of individuals with truncating variants in FLNC is notable for left ventricular dilation, systolic impairment, ventricular arrhythmias, cardiac fibrosis, and sudden cardiac death [65]. Further replication in DCM-specific cohorts is needed to validate this potentially prognostically important phenotypic association.

In summary, there have been many novel gene and variant associations with DCM. Although some appear robust and potentially clinically important (such as FLNC, BAG3, RBM20), others require further study (for example, variants in transcription factors). We encourage the reader to maintain critical review of variants outside of major disease genes and to utilize the variant interpretation aids we highlight in this article.

Truncating variants in titin

Truncating variants in the titin gene (TTN) represent the largest genetic cause of DCM, and, unlike many of the other genes related to DCM, a cardiologist is likely to encounter a DCM patient with one of these variants. However, as the interpretation of these variants is nuanced, we take the opportunity to discuss these variants in more detail. Variants in titin were first associated with DCM in 2002 through the study of two large multigenerational families affected with DCM [66]. In the first kindred, linkage analysis identified a disease gene locus [maximum logarithm of odds (LOD) score 5.0, penetrance of 70%]. In this study, TTN was chosen as a candidate gene owing to high levels of cardiac expression and its established role in muscle assembly and function. A 2-bp insertion was identified in exon 326 that resulted in a frameshift mutation generating a premature stop codon, and this mutation segregated with disease in family members. In the second kindred, a non-truncating TTN missense mutation in a highly conserved region was identified that also segregated with disease (Trp930Arg).
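For readers unfamiliar with LOD scores, the sketch below shows the textbook calculation for the simplest case: phase-known, fully informative meioses with complete penetrance. It is a simplification made for illustration. The linkage analysis described above modelled 70% penetrance, which this sketch does not, and the counts used are hypothetical.

```python
from math import log10


def lod_score(recombinants, meioses, theta):
    """log10 likelihood ratio of linkage at recombination fraction theta
    versus free recombination (theta = 0.5)."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    if theta == 0.0 and recombinants > 0:
        # Any observed recombinant rules out complete linkage.
        return float("-inf")
    non_recombinants = meioses - recombinants
    linked = theta ** recombinants * (1.0 - theta) ** non_recombinants
    unlinked = 0.5 ** meioses
    return log10(linked / unlinked)


def max_lod(recombinants, meioses, steps=50):
    """Grid search over theta in [0, 0.5]; returns (max LOD, best theta)."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max((lod_score(recombinants, meioses, t), t) for t in grid)


# Hypothetical family data: 1 recombinant among 10 informative meioses.
best_lod, best_theta = max_lod(recombinants=1, meioses=10)
print(f"max LOD = {best_lod:.2f} at theta = {best_theta:.2f}")
```

By convention, a maximum LOD score of 3 or more is usually taken as significant evidence of linkage, so a score of 5.0 represents strong support for the locus.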

More recently, next-generation sequencing technologies have made the study of the giant titin gene (comprising 363 exons) possible in large cohorts. This led to the discovery that truncating variants in TTN (TTNtv) are found in approximately 15% of unselected DCM cases and in up to 25% of end-stage DCM cases [67, 68]. As yet, there do not appear to be any clear genotype–phenotype correlations permitting the phenotypic differentiation of genetic DCM, although one recent study suggests a milder phenotype associated with TTNtv cardiomyopathy than with non-TTNtv cardiomyopathy [69]. However, the findings in this latter study were driven by a direct comparison with LMNA cardiomyopathy, which has a severe and malignant phenotype, and need to be interpreted with this in mind.

Variant interpretation is complicated by the fact that TTN undergoes extensive alternative splicing to produce different protein isoforms, meaning that not all exons are included in the final processed mRNA transcripts. Allowing for this process, which is quantified by assessing the percentage spliced in (PSI)—that is, the percentage of final cardiac transcripts that include a particular exon—appears to be important for distinguishing variants that are important for disease. Variants in exons that are included in the final transcript more than 90% of the time are most significant for human cardiomyopathy [68]. Insights from induced pluripotent stem cell (iPSC) work suggest that the mechanism underlying TTNtv DCM might involve haploinsufficiency [70] as opposed to a dominant-negative model. The importance of haploinsufficiency was highlighted further in two rat models of TTNtv and by using Ribo-seq (integrated RNA sequencing and ribosome profiling) analysis of human RNA samples, which demonstrated haploinsufficiency of the mutant allele [71].
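The PSI-based filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the exon names, transcript counts, and variant records are hypothetical, and the 90% cut-off is applied as a simple threshold, following the text.

```python
def percent_spliced_in(including_transcripts, excluding_transcripts):
    """Percentage of cardiac transcripts that include a given exon."""
    total = including_transcripts + excluding_transcripts
    if total == 0:
        raise ValueError("no transcripts observed for this exon")
    return 100.0 * including_transcripts / total


def high_psi_variants(variants, exon_psi, threshold=90.0):
    """Keep truncating variants that fall in exons with PSI above threshold."""
    return [v for v in variants if exon_psi.get(v["exon"], 0.0) > threshold]


# Hypothetical transcript counts per exon: (including, excluding).
transcript_counts = {
    "exon_A": (98, 2),    # constitutively expressed exon
    "exon_B": (15, 85),   # exon that is mostly spliced out in the heart
}
exon_psi = {
    exon: percent_spliced_in(inc, exc)
    for exon, (inc, exc) in transcript_counts.items()
}

# Hypothetical truncating variants annotated with the exon they affect.
variants = [
    {"id": "variant_1", "exon": "exon_A"},
    {"id": "variant_2", "exon": "exon_B"},
]

for variant in high_psi_variants(variants, exon_psi):
    print(variant["id"], exon_psi[variant["exon"]])
```

A truncating variant in an exon that is rarely included in cardiac transcripts affects only a small share of the protein produced, which is one reason why such variants are less likely to be relevant to disease.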

The finding of the importance of compound-heterozygous variants for severe phenotypes (for example, TTN and LMNA variants [72]) shows a potential for modifier genes or additive genetic effects in DCM. This concept was alluded to in a multi-center study of 639 patients with sporadic or familial DCM, with the finding of a 38% rate of compound mutations, and up to 44% when considering patients with TTNtv [44]. However, these findings must be interpreted with great caution as the ‘yield’ of DCM variants in this study was far higher than in any previous study, background population variation was not well accounted for, and there were no matched controls on the same sequencing platform.

Common variants

There have been two notable DCM-specific case-control GWA studies, and their results are summarized in Table 1 [39, 73]. In the first of these studies, two SNPs with significant association to disease were discovered and replicated [39]. One SNP was located within the coding sequence of BAG3 (rs2234962, P = 1.1 × 10⁻¹³), and the authors went on to identify rare variants in BAG3 in a separate cohort of patients with DCM, as previously outlined. This is an unusual example of a situation where common and rare variants in the same gene can be associated with sporadic and monogenic forms of the disease, respectively. The second SNP was located within an intron of transcription factor gene ZBTB17 (rs10927875, P = 3.6 × 10⁻⁷) [32]. ZBTB17 has since been postulated to be involved in cardiomyopathy in a mouse model, as discussed above [53]. However, the genomic region of this second locus contains many other genes, including heat-shock protein gene HSPB7, which has been linked to heart failure syndromes multiple times.

In the second GWAS of DCM, SNPs in the HSPB7 locus had weak association signals (rs1763610, P = 0.002 and rs4661346, P = 0.024) [73], but, in a separate association study of a subset of patients who featured in the replication stage of this GWAS, a stronger association was detected (rs1739843, P = 1.06 × 10⁻⁶) [41]. Taking these findings together with the findings of the sub-genome array studies of heart failure discussed above [38], a role for HSPB7 in both DCM and heart failure is suggested. Also, in the second of the GWA studies for DCM, the most significant associated SNP (rs9262636, P = 4.9 × 10⁻⁹) was an eQTL for genes encoding class I and class II major histocompatibility complex heavy chain receptors [73]. This suggests that DCM might arise in part as a result of a genetically driven inflammatory process.

In summary, these GWAS in DCM identify susceptibility variants in genes with broad cellular functions (heat-shock proteins and inflammatory pathway receptors). This breadth makes interpretation of these findings challenging. Below, we discuss the potential translational implications of these data, and of the other rare and common variant discoveries in DCM and systolic heart failure.


The international exploratory ‘What is Mental Health?’ survey sought the opinions of individuals, across multiple modes of inquiry, on what they perceived to be the core concepts of mental health. The survey found dissatisfaction with current definitions of mental health. There was no consensus among this group on a common definition. However, there was significant agreement among subcomponents of the definitions, specifically factors beyond the ‘ability to adapt and self-manage’, such as ‘diversity and community identity’ and creating distinct definitions, “one for individual and a parallel for community and society.” The Core Concepts of mental health that participants identified were predominantly centred on factors relating to the individual, and one's capacity and ability for choice in interacting with society. The concepts of agency, autonomy and control were commonly mentioned throughout the responses, specifically in regard to the individual's ability or capacity to effectively deal with and/or create change in his or her environment. Similarly, respondents pointed to the self as an essential component of mental health, signifying the subjective experience of the individual as fundamental to well-being, particularly in relationship to achieving one's valued goals. Respondents suggested that mentally healthy individuals are socially connected through meaningful participation in valued roles (ie, in family, work, worship, etc), but that mental health may involve being able to disconnect by choice, as opposed to being excluded (eg, having the capacity and ability to reject social, legal and theological practices). In contrast, Social and Environmental Factors reflected respondents’ emphasis on factors that are external to the individual and which can influence the core concepts of mental health. 
Many respondents reiterated the basic necessities for general health/mental health, similar to the foundations of Maslow's hierarchy of needs,24 and their recommendations for achieving social equity.

Descriptions of the core concepts of mental health were highly influenced by respondents’ Positionality and Paradigms/Theories/Models of reference, which often propelled the discourse of “What is mental health?” in opposing directions. The debate as to whether mental health and illness are distinct constructs, or points of reference on a continuum of being, was a common theme. Respondents were either adamant in asserting the distinction between the descriptive and prescriptive nature of the core concepts or ardent in integrating them, producing ideas such as describing mental health as a life free of poverty, discrimination, oppression, human rights violations and war. Respondents made repeated references to human rights, suggesting that a basic standard, analogous to a legal definition, is required, and that ‘a human rights, political, economic and ecosystem perspective’ should be included. Again, in the tradition of Hume's ‘ought–is’ distinction, several respondents cautioned that problems of living, such as ‘poverty, vices and social injustices…’ should not be defined ‘as medical problems’. The significance of this issue cannot be overstated: while we asked respondents what the core concepts of mental health are, overwhelmingly they answered in terms of what they should be. This finding is similar to other issues in public health policy that address instances of ‘conflating scientific evidence with moral argument’.15,22 Indeed, a primary criticism of the WHO definition of health is that its declaration of “complete physical, mental, and social wellbeing”6 is prescriptive rather than descriptive.15 Such a definition “contributes to the medicalization of society” and excludes most people, most of the time, and has little practical value “because ‘complete’ is neither operational nor measurable.”15

Accordingly, we propose a transdomain model of health (figure 4) to inform the development of a comprehensive definition for all aspects of health. This model builds on the three domains of health as described by WHO6,12 and Huber et al,15 and expands these definitions to include four specific overlapping areas and the empirical, moral and legal considerations discussed in the current study. First, all three domains of health should have a basic legal standard of functioning and adaptation. Our findings suggest that for physical health, a standard level of biological functioning and adaptation would include allostasis (ie, homeostatic maintenance in response to stress), whereas for mental health, a standard level of cognitive–emotional functioning and adaptation would include sense of coherence (ie, subjective experience of understanding and managing stressors), similar to Huber et al's15 proposal. However, for social health, a standard level of interpersonal functioning and adaptation would include interdependence (ie, mutual reliance on, and responsibility to, others within society), rather than Huber et al's15 focus on social participation (ie, balancing social and environmental challenges). Our results provide further insight into how these domains interact to affect overall quality of life. Integration of mental and physical health can be defined by level of autonomy (ie, the capacity for control over one's self), whereas integration of mental and social health can be defined by a sense of ‘us’ (ie, capacity for relating to others); the integration of physical and social health can be defined by control (ie, capacity for navigating social spaces). The highest degree of integration would be defined by agency, the ability to choose one's level of social participation (eg, to accept, reject or change social, legal or theological practices).
Such a transdomain model of health could be useful in developing cross-cultural definitions of physical, social and mental health that are both inclusive and empirically valid. For example, Valliant's25 seven models for conceptualizing mental health across cultures are all represented, to varying degrees, within the proposed transdomain model of health. The basic standard of functioning across domains which is proposed here is congruent with Valliant's25 criteria for mental health to be ‘conceptualised as above normal’ and defined in terms of ‘multiple human strengths rather than the absence of weaknesses’, including maturity, resilience, positive emotionality and subjective well-being. In addition, Valliant's25 conceptualisation of mental health as ‘high socio-emotional intelligence’ is also represented in the transdomain model's highest level of integration of the three areas for full individual autonomy. Finally, Valliant's25 cautions for defining positive mental health—being culturally sensitive, recognising that population averages do not equate to individual normalcy and that state and trait functioning may overlap, and contextualising mental health in terms of overall health—are all addressed within the transdomain model.

Transdomain Model of Health. This model builds on the three domains of health as described by WHO6,12 and Huber et al15 and expands these definitions to include four specific overlapping areas and the empirical, moral, and legal considerations discussed in the current study. There are three domains of health (ie, physical, mental, and social), each of which would be defined in terms of a basic (human rights) standard of functioning and adaptation. There are four dynamic areas of integration or synergy between domains and examples of how the core concepts of mental health could be used to define them.

Strengths and limitations of the current study

We are unaware of any study to date that has asked this research question to a group of international experts in the broad field of mental health. Although our survey sample was small (N=50), it was diverse with regard to place of origin and expertise; it was also further validated by participants (N=58) at a day-long conference on mental health through discussion, debate and written responses. The current study included global experts who dedicate their research and professional lives to advancing the standards of mental health. Of particular note was that little to no consensus among the selected group of experts on any particular definition was found. In fact, this was simultaneously a limitation and strength of the study: the small sample size limited the scope of the core concepts of mental health, but was sufficient to demonstrate that there are highly divergent definitions that are largely dependent on the respondents’ frame of reference. It is possible that saturation was not achieved with regard to the diversity of responses. Further, more than half of the survey respondents were from Canada, which may have influenced the preference towards the PHAC definition of mental health. Although there were advantages to using a snowball sampling method, another type of sampling method (eg, cluster sampling, stratified sampling) may have resulted in more varied responses to the survey items. The next logical step would be to survey experts in countries currently not represented and then ultimately survey members of the general public with regard to their conceptual and pragmatic understanding of mental health. One of the a priori objectives for the survey was to eventually create a consensus definition of mental health that could be used in public policy; this objective was not communicated in the survey, nor did we actually ask this question.
Our results indicate that finding consensus on a definition of mental health will require much more convergence in the frame of reference and common language describing components of mental health. Even we, as authors, have been challenged by consensus. For example, some of us wish to emphasise that future work should focus on developing an operational definition that can be applied across disciplines and cultures. Others among us suggest further exploring what purpose a definition of mental health would or should serve, and why. In contrast, others among us wish to emphasise the process of conceptualising mental health versus the outcome or application of such a definition. What we hoped would be a straightforward, simple question, designed to create consensus for a definition of mental health, ultimately demonstrated the nuanced but crucial epistemological and empirical influences on the understanding of mental health. Based on the results of the survey and conference, we present a preliminary model for conceptualising mental health. Our study provides evidence that if we are to try to come to a common consensus on a definition of mental health, we will need to understand the frame of reference of those involved and try to parse out the paradigms, positionality and the social/environmental factors that are offered from the core concepts we may seek to describe. Future work may also need to distinguish between the scientific evidence of mental health and the arguments for mental health. Similar debates in bioethics22,26–28 demonstrate the theoretical and practical limitations of science for proscribing human behaviour, especially with regard to individual freedom and social justice.

Decision letter

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "The meanings of "function" in biology and the problematic case of de novo evolution" for consideration by eLife. Your article has been reviewed by four peer reviewers, and the evaluation has been overseen by the eLife Features Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Ford Doolittle (Reviewer #1); Etienne Roux (Reviewer #3); Lauren Cagle (Reviewer #4).

We would like to invite you to submit a revised manuscript that addresses the points raised by the referees (please see below).

Regarding point 11 from reviewer #3: please address this point by revising your manuscript so as not to present the second half of your work as a test of a hypothesis or model (rather than, for example, performing further work on other abstracts in order to attempt to test your hypothesis).

Introduction. The first sentence of this manuscript makes a very strong claim – not just that scientists inevitably express results in words, but that they do so "in terms intended to persuade or impress". At least that's how the OED in my Mac defines "rhetorical" and it's what scientists try NOT to do, when describing observations at least. Perhaps we never succeed in being "objective", and of course we do want to present theories in the best light, but the goal of science is NOT to "intend to persuade" in presenting data. Do authors mean to claim we never succeed in getting rid of such an intent? The second paragraph reminds us, usefully, that words mean different things to different people, which I grant – just not that any use of words is necessarily "rhetorical".

Introduction. I'm a little unsure why references to me (Doolittle) include my initials or middle name (Ford) while others just have the last name only. Ford is not part of my surname.

Introduction. I think it is (or should be) more correct to say that ENCODE critics like me don't so much insist that everyone use the SE definition of function, as that they be more precise about the definition they ARE using. Many genomicists and molecular biologists seem to think that "function" is unproblematic. This might be so if all activities arise by selection, and I think that most genomicists and molecular biologists are indeed pan-adaptationists, although not realizing it! Authors do say something like this in the second paragraph on this page.

Introduction. I like the idea of thinking about the "function" of newly arising genes. I think I agree with them that the idea of a gene implies a function, but wonder whether one would say, of a newly arisen stretch of DNA with a promoter, open reading frame producing a substantially long protein with no activity whatever, and terminator, that it was not a gene? One could easily make such a thing in the lab. Would we not call it a "gene", or would we also require that it has a biological function?

Authors do problematize something like this, though having a de novo gene that causes cancer seems to be over-problematizing. I think many who would say "Gene X functions in the development of cancer [or even some disease that does not itself evolve]" would balk at saying that "The function of gene X is to cause cancer".

Not altogether sure what they mean by "hinder scientific development". They say (and I agree) that "function" and "matter" (as in "Does it matter?") are often conflated, but possibly any further definition of function will do.

Subsection “A model of function for de novo gene birth research”. Pittsburgh model seems very useful and appropriate. Was it just invented for this paper, or does it have a provenance? I think the former, but this should be made clear.

Subsection “The multiple meanings of function hinder scientific development in the field of de novo gene birth”. Only 20 abstracts? Seems a small number, and it is unclear to me why they did not do more. Should be made very clear that all these 20 describe de novo originating genes. And actually, why not abstracts of papers in genomics/molecular biology more generally? Does the Pittsburgh model not apply equally to all?

Subsection “The multiple meanings of function hinder scientific development in the field of de novogene birth. Not sure what "consensus" means. Did all authors get together and try to agree (which is what I think the term usually means) or did they just take the majority of independent opinions.

Subsection “Interpretation and recommendations”. Totally agree that when the word "function" is used, it should be qualified, or else it should not be used.

Overall, a nice paper and I very much like the "Pittsburgh model". As a suggestion for further work, this is fine. It could so easily have been applied to a much larger number of readers (think of grad students in different biological disciplines, for instance) that I wonder why it wasn't. Or indeed to abstracts in genomics/molecular biology more generally.

The paper seeks to establish a hierarchy of clarifications for the word "function", to help reduce confusion. The authors propose a 5-level classification of Expression (E), Capacities (C), Interactions (I), Physiological Implications (PI), and Evolutionary Implications (EI), and then seek to classify the use of the word "function" in 20 papers into one of the 5 classes, or "vague", if it is unclear which use is meant. The four authors independently assign one of the 6 labels to each instance, and (to their surprise, it seems) find that even the authors themselves cannot agree 88% of the time which of the classes should be used. The authors conclude that this disagreement demonstrates how important their classification scheme is, but I would argue it also demonstrates that such a scheme is very unlikely to catch on.

1) Table 1: The classification scheme is problematic. For example, one could argue that E, C, and I could be referred to as "biochemical activity", not "function", despite the unfortunate naming of the field of "Functional Genomics" (which mostly concerns itself with biochemical activity). This 'activity-vs-function' distinction is consistent with the much less frequent usage in Figure 1C of these three categories compared to the last two.

2) Table 1: Still on the classification scheme, one could also argue that the top contenders for the word "function", PI and EI, are already quite extensively distinguished in the literature under the names of Selected Effect (SE) and Causal Roles (CR). SE seems to be the same as EI, in which case a new term may not be needed. CR seems to be most closely aligned with PI.

3) Introduction: The paper frames itself in the context of ENCODE, and then shifts gears and focuses on the field of gene birth. The authors should take a broader perspective of the utilization of the word function in their Introduction, or take a much narrower perspective focusing on gene birth. The current Introduction distracts instead, and feels more like a bait-and-switch. If the authors choose to start with ENCODE in the Introduction, they should take on the task of classifying the 300-some papers by the ENCODE consortium instead.

4) Figure 1: The classification task undertaken is quite modest. Given the ambitious stated goal of the paper, a much larger number of papers should be utilized, to avoid the biases that can come from small datasets.

5) Figure 1: The authors should also provide the full classification by each author of each instance, to enable analysis of which terms are most confused, which are most ambiguous, which specific pairs are switched, etc. The authors should also carry out some of that analysis as well, in a larger sample.

6) Figure 1: The authors should engage third-party scientists (perhaps student volunteers from a graduate program or students from a class) to read the instructions and then classify the words, in order to get a larger sample.

7) Introduction: The first sentence of the third paragraph suggests a very narrow view of the debate surrounding the ENCODE project and the definition of function therein. This is probably not an area that the authors want to get into. In particular, searching for "ENCODE debacle" in Google Scholar returns only one paper, which has never been cited. Searching in Google returns only blog posts by Dan Graur and other angry bloggers, which is probably not the view that the authors want to align themselves with. A closer reading of the original ENCODE 2012 paper provides upfront a definition of "biochemical function" (which perhaps should be referred to as "biochemical activity", despite again the unfortunate naming of "functional assay" and "functional genomics"). It then uses that definition, and very clearly indicates that only a small fraction of the genome is under evolutionary selection. Even the criticisms of ENCODE have primarily cited the press articles written by news authors, not scientists, that claim that 80% of the genome is functional. Anyway, I would skip that whole section if the authors don't want to re-open a very large can of worms.

8) Introduction: The selected effect that gave rise to a trait or a genomic region may be quite different from the current functional roles of that trait or genomic region. Thus, equating function to "selected effect" may be inaccurate as well.

9) Introduction: "scientists cannot agree on the number of functional genes in the human genome" is a peculiar statement to accompany the Pertea and Jungreis papers. Briefly, Pertea claims to discover thousands of new genes, Jungreis claims that Pertea made specific mistakes resulting in exclusively false positives. Yes, there is debate, but the Pertea paper is not a reference for this statement. This sentence alone indicates a lot of nonchalance on the part of the authors about dismissing the state of broad fields that they should be much more cautious about, especially in a paper that seeks to bring rigor to the field.

10) Introduction, "practically zero". This is again a great oversimplification of a rich field of gene birth, and does not reflect well on how scholarly the authors should be.

11) Introduction: "gene". Why choose this word, rather than "functional"? The word "gene" itself has a long history of differing definitions and great debate. Once more, throwing this word around without much thought seems out of place for a paper that seeks to be scholarly.

12) Subsection “A model of function for de novo gene birth research”. The example seems contrived. Why conflate the concept of de novo gene birth with the fact that many disease-relevant mutations are not in selected elements? These should be two separate examples. Ex1: mutations that lead to disease but lie in non-conserved regions. Ex2: gene birth example.

13) Subsection “A model of function for de novo gene birth research”: Please reword to avoid the word "fantastic".

14) Subsection “A model of function for de novo gene birth research”: "Perturbation effect" should feature within this list. Perturbation by experimental intervention is one example. Perturbation by natural genetic variation is another. Perhaps both should be separate entries on their classification.

15) Subsection “A model of function for de novo gene birth research”: Naming this Pittsburgh after the affiliation of the last author seems inappropriate. Does every scientist in Pittsburgh agree? Why not "Carvunis", since she's the only author from Pittsburgh? Why not "our" model, and let others name it "the Carvunis model".

16) Table 1: Should this be a hierarchy? Why not a list of checkboxes/attributes? Different papers show evidence of one without evidence of the other.

17) Table 1: Order between C and E is unclear. Perhaps capacities precede expression.

18) Table 1: E: does mere presence of DNA make every DNA segment functional?

19) Table 1: Selfish elements defy this classification. Distinct SE and CR 'functions'. Similarly, "runaway selection" for traits that are detrimental to biological functions challenges this definition. Both should be discussed.

20) Page 7, second paragraph: Why not address "surplus meaning"? Please expand how it could be addressed. Roux 2014 also brings up many additional important points worth discussing at greater length.

21) Subsection “The multiple meanings of function hinder scientific development in the field of de novo gene birth”, "only 12% were unanimous":

- need heatmap of misclassifications, table of all data.

- which ones were most confused and why

- this result perhaps indicates that nomenclature would not have helped, may be either too ambiguous or overspecified

- an independent cohort of students / trainees / colleagues would be useful in evaluating approach

- more than 20 papers would greatly help.
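The tabulation requested in this comment (share of unanimous instances, and which pairs of labels are most often confused between coders) could be sketched roughly as follows. This is only an illustration: the function names and the four example codings are hypothetical placeholders, not the authors' data, and the six labels are those described in the review above (E, C, I, PI, EI, vague).

```python
from collections import Counter
from itertools import combinations


def unanimity_rate(codings):
    """Fraction of instances on which every coder assigned the same label."""
    unanimous = sum(1 for labels in codings if len(set(labels)) == 1)
    return unanimous / len(codings)


def pairwise_confusions(codings):
    """Count how often each unordered pair of distinct labels was
    assigned to the same instance by two different coders."""
    counts = Counter()
    for labels in codings:
        for a, b in combinations(labels, 2):
            if a != b:
                counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical codings: one tuple per instance, one label per coder.
example = [
    ("PI", "PI", "PI", "PI"),
    ("PI", "EI", "PI", "vague"),
    ("EI", "EI", "PI", "EI"),
    ("C", "I", "I", "PI"),
]

rate = unanimity_rate(example)
confusions = pairwise_confusions(example)
print(rate)
print(confusions.most_common())
```

The resulting counts could be laid out as a label-by-label matrix to give the heatmap of misclassifications asked for above.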

22) Subsection “The multiple meanings of function hinder scientific development in the field of de novo gene birth”. "again supporting our hypothesis". Perhaps also suggesting that the proposed classification scheme does not work?

23) Subsection “Interpretation and recommendations”: for "older" genes the SE and CR functions are more likely to differ, and thus focusing on gene birth may bias authors' perspective on one hand, and may also lead to non-representative results on the other hand.

The authors address an interesting question regarding the philosophical issue related to the meaning(s) of the word "function", and in particular the question of whether a selective process is required to legitimate the usage of the word "function". Though extensive philosophical debate has occurred within the last decades, little attention has been paid to the usage of the word by biologists themselves. The birth event of a de novo gene, understood as the event that occurs prior to any selective process, is an interesting limit case to see whether or not biologists use the word "function" to characterize some particular properties of these genes. However, as it is presented in the manuscript, the work exhibits several conceptual and methodological weaknesses.

A) The philosophical issue of the F-word:

1) As stated by the authors, the philosophical debate regarding function is about (a) what a function means and (b) how this meaning is legitimated. The (a) debate is between, as formulated by Cummins, the "How does it work?" question (causal role theories) and the "Why is it here?" question (teleological-etiological theories). In the first case, legitimation is grounded on the functional (or systemic) analysis of the biological system studied; in the latter, it is grounded, for the selected effect theories, on the existence of a past selective process. Hence, in these theories, the reference to evolutionary implications is not that the effect of a trait is adaptive, but that its presence is the consequence of a past process of adaptation. What legitimates a function is the past, not the current, fitness of a trait, because current fitness is not a legitimate response to the question "why an entity does what it does". Regarding this question, the de novo gene occurrence, seen as a transition from no-function to function, is an interesting case. Indeed, the de novo occurrence of a trait cannot be explained by a selective effect, even if it has physiological implications, and even if it is considered as adaptive (which means that it has "positive" physiological implications). But this critical point is not taken into account by the authors in their study, including in their definition of "evolutionary implications".

2) Regarding the teleological-etiological theories, in parallel with the selected-effect theories of function, a new approach has emerged in the last decade, the so-called organizational approach of function (OAP). For these theories, what a function means is, as for the selected effect theories, "Why is it here?", but the legitimation is not grounded on a selective process but on organizational properties, basically the existence of a causal loop between the causal role of a trait in the self-maintenance of the system, and the production and maintenance of a trait by the system. The application of this theory to the case of de novo gene birth is interesting, but there is no reference to this theory of function in the manuscript.

3) Taking into account the philosophical issue of the F-word, the emergence of de novo genes is problematic only for the supporters of the selected-effect theories; it is not for the advocates of the causal role theory and the OAP. Indeed, the main criticism made by the OAP against the selected effect theory is precisely that this theory cannot account for the emergence of function. The title is hence rather misleading, since it refers to a privileged theory of function. The word "evolution" is, in the same vein, also confusing. Why use it rather than emergence? Indeed, if the de novo emergence of gene "function" has anything specific regarding the F-word philosophical issue, it is that it is de novo, i.e., it has not (or not yet) been submitted to a past selective pressure. If the existence of de novo genes and their effects can be interpreted as a consequence of a selective pressure, they are no longer "de novo". The meaning of "evolution", used for de novo genes, should be made precise. If it precisely means the birth event of a new gene, this event occurs prior to any selection, because selection can act only by differential screening between already existing traits. In this meaning, "emergence" seems more adequate.

B) Semantic plurality of the F-word and conceptual confusion.

4) The authors claim that the existence of several meanings of "function" is a source of confusion among biologists, but there is no reported evidence of such confusion. Actually, the word "function" has been used in the biological sciences at least since the sixteenth century, without particular debate among biologists about what this word means. Whether there is a philosophical debate about the legitimacy of the meaning of a notion like "function", and whether this notion is a useful or a confusing conceptual tool, are two different questions. The authors should specify whether the object of their work is to contribute to the philosophical debate by clarifying the meaning of the concept as used by biologists, or to highlight the existence of real confusion or discrepancies in the usage of the word by biologists, specifically in the field of de novo gene research.

5) In their rhetorical analysis, the authors have considered that the meaning of "function" is similar when it is used as a noun ("assessment of function"), a verb ("MDF1 functions in two important pathways"), and an adjective ("functional"). However, some philosophers have pointed out that "a function", "to function", and "functional" have different meanings, expressed by different words in languages other than English. For example, in the cited thought experiment, scientists who may be reluctant to say that "the function of the pro-oncogenic gene is to generate cancer", but agree to say that "the function of the anti-oncogenic gene is to prevent cancer", would agree to say that both genes are "functional". The authors should provide the numbers of instances of the different words used and justify that there are no different meanings statistically associated with the noun, the verb and the adjective.

6) The existence of different meanings of a notion does not necessarily imply that the notion is conceptually confusing, if the different meanings are complementary and not contradictory. Actually, the fact that the authors have been able to attribute, according to the "Pittsburgh model of function", identified meanings to the usage of the term "function" can be interpreted as a proof that the word is not so confusing. Additionally, the presentation of the "Pittsburgh model of function" in Table 1 shows causal relationships between the different meanings, symbolized by arrows from the lowest line (expression) to the highest (evolutionary implication). It seems to me that these arrows are a "surplus meaning" added by the authors from their rhetorical analyses, rather than relationships identified by the rhetorical analysis itself, but, anyway, as it is presented, the "Pittsburgh model of function" is nothing more than the classical bottom-up view of causal relationships "from genes to function".

7) I do not understand on which argument the authors can say that the data presented in Figure 1 support the idea that the literature is hard to interpret. As said previously, if the different meanings are causally subordinated, the interpretation of the data does not seem so difficult, including in the fact that different meanings can be embedded in the term "function". Can the authors provide examples of misinterpretations?

8) Regarding the usage of the word "function", the main conclusion of the rhetorical analysis is that the large majority of the meanings are physiological implications and evolutionary implications, which represent almost all of the instances (40 of 42). The authors should discuss the implications of such a rather "integrative" or "high level" meaning of function for the philosophical debate about what a function means. The fact that, as stated by the authors, "evidently, neither technique gives direct insights into evolutionary implications" should also be discussed. Do the authors consider that this methodological difference impacts the legitimation of the meaning of function?

9) Methodologically, the fact that the literature has been studied only through abstracts, and not the full papers, is a serious limitation of the work. Why have the authors not worked on the full papers? In particular, the "vague" classification on abstracts may not be so vague in the full paper. The number of abstracts is also small, as is the number of coders.

10) Another problem is how the abstracts have been selected. A sample can be considered as representative only if it has been randomly selected, which is not the case in the study. At best, it can be considered as illustrative. Is the fact that the person who selected the abstracts is an "expert in the field" supposed to be, by itself, a guarantee of objectivity?

11) Another important methodological limitation is the fact that the sample of abstracts used to build the model, to validate it, and to apply it is basically the same; indeed, 17 of the 20 abstracts were used for model refinement, and the remaining 3 for its validation. After that, the same 20 abstracts were used for quantitative analysis and "hypothesis testing". I do not clearly see what the tested hypotheses are. But the fact that the same set of material and the same set of coders were used all along the process is for me a serious methodological limitation.

12) Taken together, these methodological considerations greatly impact the accuracy of the study. Independently from these methodological limitations, the question arises of how such a methodology brings novel insights into the F-word issue, and, in particular, of the real value of the so-called "Pittsburgh model of function". As said before, its presentation in Table 1 seems no more than the classical bottom-up "layer cake" view of stratified levels of organization, from molecules to organisms interacting with their environmental conditions (which is, by the way, seriously challenged by several philosophical studies). The authors should hence specify what is new in the so-called "model", either philosophically or conceptually.

General assessment: this manuscript is an important and thoroughly argued contribution to eLife. It is particularly notable how accessibly the manuscript is written for a multi-disciplinary audience. The arguments are timely, and the need for this work is justified in the opening section. The description of the findings and later description of the approach are detailed and explanatory, with interpretation provided as relevant. I highly recommend this piece for publication.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for submitting the revised version of "The meanings of 'function' in biology and the problematic case of de novo gene emergence". This version has been seen by two of the reviewers who reviewed the original version (Ford Doolittle – Reviewer #1; Etienne Roux – Reviewer #3), and their comments are below. I would like to invite you to submit a second revised version that addresses these comments, and also the additional comments listed below them.

I still wish that the authors had looked at more than 20 abstracts, but the writing is very clear and the intellectual clarification this promises is enough, in my view, to justify publication. And I'm sure that many in future will refer (if sometimes critically) to the "Pittsburgh Model of Function". So, I think this can be published as is, though I have these residual quibbles.

1) They say, of the top level in that model, that "Evolutionary Implications can refer to selection upon a trait driven by the object in the past, as in selected effect (Millikan, 1989), but it can equally describe novel adaptive effects, as one may expect in gene birth." Not sure what an "adaptive effect" is. If selection is a force and the effect is fitness-enhancing in environment E, then from the very instant that organism O finds itself in E, this "adaptive effect" is under selection, though possibly only purifying selection if it was previously fixed in another environment, where it had no, or another, effect on fitness. If selection is an outcome, then nothing is a selected effect unless actual mutations have actually been eliminated by purifying selection just yesterday, which seems silly. Any species emerging from a bottleneck has no functions, by that account.

2) The statement "For a genomic sequence to be labelled as a gene, it must by definition have a function." needs some further elaboration or justification (as in the response to reviewers).

3) And to say "The molecular objects of study are thus transitioning between a state without a function and a state with a function. They cannot have a function by the selected effect definition, since their existence cannot be explained by a past selection." seems a little problematic. If by a "new gene" one means a gene that was never detected in any other species, but it is fixed in the species in which it is detected, and that species has a reasonable population size, then it likely was selected for, in that species, even if we do not know why. And I suspect that most of the literature on de novo genes describes such situations, not situations in which the "new gene" is a rare polymorphism in the species in which it is described. So it's really only the parent of the new gene that is not a gene: at the very instant that a replicate of this non-gene parent starts to increase in frequency in its population because of a fitness advantage to its bearers, it has acquired 'function' and becomes a bona fide gene, I think. I don't see this as a gradual process or transition, which the authors in their next paragraph, seem to imply. Of course this could lead further into a discussion of whether selection is a force or an outcome or whatever, and what does fitness mean anyway, but I don't think authors need to go there to be clearer on this point.

The authors have greatly improved the quality of the manuscript and correctly respond to the majority of my comments. There are, however, two points for which the response given by the authors leaves me unsatisfied.

1) The first one is about the hierarchical relationships introduced by the authors in their model, and symbolized by ascendant arrows in Table 1. Since, as acknowledged by the authors, these relationships are a "surplus" meaning injected in, and not extracted from, the rhetorical analysis (so the model is not purely data-driven), the question remains of where this surplus meaning comes from, and why it is needed. As I have said in my initial comment, this view of hierarchically organized relationships is not new, and corresponds to the classical bottom-up causal relationships from lower (gene) to higher levels of organization. This is not a question of logic, but a question of causal relationship between different levels of properties, similar to the classical causal relationship usually expressed by arrows from gene to proteins to pathways to subcellular mechanisms to cells to tissues to organs, etc. (see for example Noble, 2006). Saying so, I do not mean that the model is irrelevant, but that, since the authors privilege a bottom-up view of causal relationships between properties expressed at different levels of organization, which is not consensual, they should be more explicit about the import of this causal pattern in their model (including bibliographical references) and its justification. Also, some expressions seem misleading or inappropriate. For example, I don't see what a "logical flow of genetic information" from lower to upper levels of organizational properties is, and top-down causal relationships also exist (typically, gene expression is a consequence of interactions). The notion of "spatiotemporal relationship between concepts" does not seem relevant to me. First, the spatiotemporal relationships to which the authors refer are between properties of biological entities, not between concepts, and, in the case of de novo genes, the spatiotemporal relationship between interactions and expression is primarily from interaction to expression, not the opposite, since the expression of a de novo gene cannot occur without prior interaction of the DNA segment with a lot of different proteins involved in a complex interaction network.

If we admit the relevance of the directional necessary-but-not-sufficient causal relationship from Expression to Evolutionary Implications, it can be logically inferred that each level of meaning of function is inclusive of all the meanings of lower levels. The authors should be more explicit about how this modifies a model of function, compared to a model without these hierarchical relationships (basically, they should explain why they have chosen to add this hierarchical relationship and how it improves the model). Since previous studies have already identified different notions of functions (see for example Wouters, 2003), the authors should refer to these previous studies in their discussion.

2) The second point is about the claim, which is also a section of the article, that "The multiple meanings of function hinder scientific communication in the field of de novo gene birth". As I have previously explained, I don't think that, from a biologist's viewpoint, the notion of function is confusing, or that it impairs scientific communication. However, the point is not that I disagree with the authors, but that the study of the authors provides no evidence for such a claim. The only factual evidence is that, in a set of four encoders, among whom one is an expert in the field, one non-consensual case remains among the 20 abstracts analyzed. Due to the numerous limitations of the study, both in its conception and methodology, these results are clearly unsubstantial; also, non-consensus and discrepancies between members can be interpreted as the inability of the so-called Pittsburgh model to account for the actual use of the concept by the researchers themselves. Due to all these limitations, I find such a claim, and the final recommendations, an overinterpretation of the results, which, in my opinion, should be removed, or at least reformulated in a less affirmative way.


Patient characteristics

Characteristics of the 132 CLL patients with typical 13q14 deletions as detected through SNP 6.0 profiling (identified within a prospectively enrolled and profiled cohort of 255 CLL patients) analyzed in this study are summarized in Table 1, stratified by 13q14 subtype and treatment status. Data for 123 patients with CLL without typical 13q14 deletions are also summarized (Table 1). Of the 132 patients with CLL with typical 13q14 deletions, 100 (76%) were untreated (UT) and 32 (24%) relapsed (T) at the time of study enrollment. Within the group of previously untreated patients, the distribution of important biomarkers across type I or II 13q14 lesions was well balanced: Rai stage 0, 45%/41%; Rai stages 1 or 2, 51%/53%; IgVH unmutated, 32%/29%; ZAP70 positive, 30%/26%; p53 exons 2 to 10 mutated, 11%/6%; del17p present, 7%/3%; and del11q present, 6%/12%. The median time from diagnosis to enrollment and from enrollment to data analysis for previously untreated patients is detailed in Table 1. All outcome analyses described later are based on SNP 6.0 array analysis and biomarker measurements that were conducted on patient samples procured at study enrollment, thus avoiding confounding effects of longitudinal biomarker instability. Outcome was calculated using either the CLL trial enrollment date or the diagnosis date as the reference dates, as indicated, to minimize the effect of lead time biases.

The pathologic anatomy of acquired subchromosomal genomic copy number changes spanning 13q14 in CLL as defined through SNP 6.0 array copy number profiling

We cataloged all aCNAs on chromosome 13 in our CLL cohort using visual inspection of simultaneous displays of dChipSNP-based copy number estimates (heatmaps) for CD19+ cells and paired CD3+/buccal DNA (Fig. 1). Overall, 51.7% (132 of 255) of CLL carried a classical 13q14 deletion as detected through SNP 6.0 arrays that included the genomic region recognized in the clinically used CLL-FISH panel (range of lesion sizes of 0.198–73.775 Mb; see Supplementary Table S1). In addition, rare atypical 13q deletions were identified. As previously described by lower-resolution SNP array platforms (Affymetrix SNP50k XbaI arrays), 13q14 deletions displayed substantial anatomic heterogeneity (18). Nonetheless, multiple distinct break clusters located close to the 13q14-resident miR 15a/16-1 loci and, alternatively, close to and inclusive of all or parts of the RB1 locus were identified. A telomeric cluster of breaks was located at approximately 50.2 to 50.7 Mb physical position, comprising the vast majority of the breaks of short 13q14 deletions. Of the short and relatively uniform 13q14 deletions, 54% (71 of 132) were between 0.678 and 1.944 Mb in length and only 2 lesions were identified that were shorter (CLL145 and CLL97, with lesion lengths of 0.198 and 0.425 Mb, respectively; refs. 29–31).

Genomic copy number heatmap display of chromosome 13q of 255 CLL cases ranked by the position of centromeric 13q14 deletion break points: Copy number heatmap displays for paired DNA samples based on SNP 6.0 array profiling were generated with dChipSNP. Left, CD3+ or buccal DNA; right, CLL CD19+ DNA. Blue indicates copy loss, and red indicates copy gain. Each column represents one patient.

Using a previously proposed classification schema for typical 13q14 deletions into types I and II (exclusive and inclusive of RB1, respectively), we detected 85 type I lesions and 54 type II lesions (including 7 CLL cases in which both lesion types were identified, existing on separate chromosomes) in this cohort. Furthermore, the frequency of 13q14 type I and type II lesions in the entire CLL cohort (N = 255) was 33% and 21%, respectively.
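As a quick consistency check, the lesion counts and frequencies quoted above can be reproduced from the figures given in the text alone; the variable names below are illustrative only.

```python
# Counts stated in the text.
type_i_lesions = 85
type_ii_lesions = 54
cases_with_both = 7    # cases carrying both lesion types, counted in each group
cohort_size = 255

# Cases with at least one typical 13q14 deletion (the overlap is counted once).
cases_with_deletion = type_i_lesions + type_ii_lesions - cases_with_both

# Frequencies in the whole cohort, rounded to whole percent.
pct_type_i = round(100 * type_i_lesions / cohort_size)
pct_type_ii = round(100 * type_ii_lesions / cohort_size)

print(cases_with_deletion, pct_type_i, pct_type_ii)  # 132 33 21
```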

Acquired uniparental disomy (aUPD) at 13q was identified in 7 CLL cases as previously described (18). In all these cases, the LOH region was very large (Supplementary Table S2). Six of these 7 cases also contained a region of copy loss at 13q14 (typical 13q14 deletions by FISH), whereas 1 case was associated with a homozygous miR16-1-5p mutation (see later).

A subset of CLL cases displays very low miR16-1/15a cluster expression

The 13q14-resident microRNAs miR16-1 and miR15a, located at 49.521 Mb physical position, are contributors to the biology of these lesions in CLL cells (18, 32–34). In our cohort of 132 CLL cases with typical 13q14 deletions, we identified 4 cases (CLL # 78, 113, 120, and 214; see Supplementary Fig. S1) with breaks occurring in proximity (within 12 kb) to the miR locus and with unresolved miR gene status, whereas 134 lesions unequivocally resulted in the removal of one and occasionally (∼10%) both copies of these miRs (see later; ref. 16).

We proceeded with an analysis of the expression of miR16-1-5p and miR15a-5p in 238 of 255 CLL cases with available intact RNA by extending normalized Q-PCR–based measurements to cases not previously reported (18). Measurements were normalized to the expression of 2 unrelated microRNAs, RNU43 and RNU49, through simple subtraction of the means of Ct values [mean ΔCt = mean Ct (miR15a or miR16-1) − mean Ct (RNU43 or RNU49); the R² correlation for miR measurements using either RNU43 or RNU49 was 0.867 (miR16-1 data) and 0.84 (miR15a data), respectively]. This resulted in largely negative ΔCt values for miR16-1 and positive values for miR15a; this is due to lower Ct values (higher expression) for miR16-1 relative to the reference microRNAs (Supplementary Table S1).
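The normalization described here amounts to a difference of mean Ct values. The following is a minimal sketch of that subtraction only; the function name and the Ct values are hypothetical illustrations and are not taken from the study data.

```python
from statistics import mean

def delta_ct(target_cts, reference_cts):
    """Mean Ct of the target miR minus mean Ct of the reference RNA.

    A negative value means the target crossed threshold earlier than the
    reference, i.e. the target is more highly expressed.
    """
    return mean(target_cts) - mean(reference_cts)

# Hypothetical replicate Ct values for one sample (illustration only).
mir16_cts = [22.0, 22.2]
rnu43_cts = [25.0, 25.2]

dct = delta_ct(mir16_cts, rnu43_cts)
print(f"delta Ct (miR16-1 vs RNU43): {dct:.2f}")
```

In this made-up sample the target has the lower Ct, so the ΔCt is negative, matching the direction reported for miR16-1.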

Subsequently, we ranked all CLL cases according to mean SNP 6.0 array-based chromosomal copy number estimates for an approximately 0.669 Mb chromosomal region between rs9535414 (immediately centromeric to the miR16-1/15a locus) and rs706593 (∼650 kb telomeric to the miR16-1/15a locus) encompassing 216 consecutive copy number measurements. Next, we plotted mean ΔCt values for miR15a and miR16-1 versus 13q14 genomic copy number measurements for all CLL cases (grouped by 13q14 deletion status) and calculated mean normalized expression values for miR16-1 and miR15a for the CLL cases with more extensive chromosomal loss (copy number estimates <1) and for the CLL cases with less extensive loss (copy number estimates ≥1) or no loss (cases without 13q14 deletions). Data are summarized in Fig. 2A–D, with mean miR expression levels indicated with red numbers.
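As a rough illustration of the grouping step, the sketch below averages per-probe copy number estimates over a region and splits cases at a mean of 1. The case labels, the probe values, and the use of only three probes per case (instead of 216) are assumptions made for brevity.

```python
from statistics import mean

def mean_copy_number(probe_estimates):
    """Average the per-probe copy number estimates across the region."""
    return mean(probe_estimates)

def has_extensive_loss(probe_estimates, cutoff=1.0):
    """True when the mean copy number estimate falls below the cutoff."""
    return mean_copy_number(probe_estimates) < cutoff

# Hypothetical cases with three probes each (the study used 216 probes).
cases = {
    "case_A": [0.7, 0.8, 0.9],   # mean below 1
    "case_B": [1.1, 1.2, 1.3],   # mean of 1 or more
}

groups = {name: has_extensive_loss(vals) for name, vals in cases.items()}
ranked = sorted(cases, key=lambda name: mean_copy_number(cases[name]))
```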

A–D, normalized expression of miR15a and miR16-1 versus SNP 6.0 array-based copy number estimates for typical 13q14 deletions. A–D, mean ΔCt values (miR15a or miR16-1 − housekeeping microRNAs) as single dots (y-axis) versus SNP 6.0 profiling-based copy number estimates for 13q14 deletions (x-axis) grouped by 13q14 deletion status. Red numbers indicate mean ΔCt values for CLL groups with copy number less than 1 or 1 or more and 13q14 deletions or non-del13q14 status. Black dots indicate CLL cases with unresolved miR gene status. Red dots indicate CLL cases with del13q. Blue dots indicate cases without del13q.

Mean normalized expression values of both miR16-1 and miR15a were substantially lower (miR15a ∼8-fold and miR16-1 ∼4-fold, respectively) in the CLL subgroup with more extensive del13q14 deletions (copy number estimates <1) as opposed to the group with no del13q14. For CLL cases with monoallelic 13q14 deletions and associated copy number estimates of 1 or more (the majority of such cases had copy number estimates of ∼1.2) as compared with cases without 13q14 deletions, there was evidence for a gene dosage effect for miR15a expression (∼1.3- to 1.9-fold lower) but no such effect for miR16-1 expression.
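Fold differences of this kind follow from differences in mean ΔCt if one assumes that template doubles with each PCR cycle (100% amplification efficiency, an assumption not stated in the text). Under that assumption, a 3-cycle difference corresponds to an 8-fold difference and a 2-cycle difference to a 4-fold difference. A minimal sketch with a hypothetical function name and made-up ΔCt values:

```python
def fold_lower(delta_ct_group, delta_ct_reference):
    """Fold reduction in expression of a group relative to a reference group.

    Assumes perfect doubling per cycle. A higher delta Ct means lower
    expression, so a positive difference gives a fold reduction above 1.
    """
    return 2.0 ** (delta_ct_group - delta_ct_reference)

# Hypothetical group means (illustration only, not study values).
print(fold_lower(5.0, 2.0))   # 3-cycle difference
print(fold_lower(4.0, 2.0))   # 2-cycle difference
```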

These combined data suggest that approximately 12% to 15% of CLL cases have very low miR16-1/15a levels and that the majority of CLL cases (∼85%) display an overlapping range of expression of these miRs, with mild relative reductions in miR15a expression in monoallelically deleted 13q14 cases.

Assessment of elevated HRs for short overall survival based on analysis of thresholded centromeric or telomeric 13q14 lesion breakpoints

Given various published observations on CLL 13q14 deletion sizes, associated gene deletions, and clinical parameters, including our proposal for 13q14 subclassification into types I and II based on RB1 gene status, we modeled HRs for shortened survival for dichotomized 13q14 groups defined by distinct physical chromosomal break positions. Specifically, various dichotomized 13q14 groups were defined by sliding the separation/break points across all physical positions actually identified in the 13q14 deletions in this cohort (based on the physical positions of either telomeric or centromeric breaks; Supplementary Table S1). Using such an approach, we determined that the location of centromeric breakpoints strongly influenced the risk for short overall survival in CLL, whereas the telomeric breakpoints had little effect (Supplementary Fig. S2).
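The sliding-cutoff idea can be sketched as follows. This is not the model used in the study: as a stand-in for a fitted HR, the sketch uses a crude ratio of event rates (deaths per unit of follow-up time), which only approximates an HR under a constant-hazard assumption. The function name, the direction of the ratio, and all data values are hypothetical.

```python
def sliding_rate_ratios(positions, times, events):
    """Dichotomize the cohort at every observed break position.

    For each cutoff, cases with a break position below the cutoff are
    compared against cases at or above it. Returns (cutoff, rate ratio)
    pairs; cutoffs that leave a group empty or without events are skipped.
    """
    results = []
    for cutoff in sorted(set(positions)):
        below = [i for i, p in enumerate(positions) if p < cutoff]
        above = [i for i, p in enumerate(positions) if p >= cutoff]
        below_events = sum(events[i] for i in below)
        above_events = sum(events[i] for i in above)
        if below_events == 0 or above_events == 0:
            continue
        below_rate = below_events / sum(times[i] for i in below)
        above_rate = above_events / sum(times[i] for i in above)
        results.append((cutoff, below_rate / above_rate))
    return results

# Hypothetical cohort: break position (Mb), follow-up (years), death (1/0).
positions = [47.0, 47.5, 48.2, 49.0, 49.3, 49.4]
times = [2.0, 3.0, 4.0, 8.0, 9.0, 10.0]
events = [1, 1, 1, 0, 1, 0]

results = sliding_rate_ratios(positions, times, events)
best_cutoff, best_ratio = max(results, key=lambda r: r[1])
```

With real data, each cutoff would instead be evaluated with a proportional hazards fit and confidence bands, since the maximum over many cutoffs is optimistically biased.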

Specifically, the HR for short overall survival increased as the centromeric separation break for the two 13q14 groups moved closer to the telomere, with the highest HRs identified for breaks at approximately 48 to 48.5 Mb physical position, followed by a sharp drop in risk. The RB1 gene, with well-documented effects on cell-cycle control and chromosomal instability in cancer cells, is located at 47.776 to 47.954 Mb physical position and is very close to the cohort dichotomization breakpoints that are associated with peak HRs for short overall survival, thus providing justification for a 13q14 classification schema centered on RB1 (see also RB1 mutation and expression data later). Additional genes located telomeric to RB1 and in the area (48–48.5 Mb physical position) demarcated by the highest HRs for short overall survival included LPAR6, P2RY5, RCBTB2, and CYSLTR2, but none of these genes has a well-established role in biological processes that are known to affect cancer cells. Finally, only 4 actual 13q14-associated chromosomal breaks occurred between 48 and 48.5 Mb physical positions, creating modest uncertainty about the actual physical positions associated with peak HRs as estimated by our approach.

Results of univariate outcome analyses of SNP 6.0 array-detectable 13q14 deletions and overall survival in CLL

We initially determined the prognostic value of SNP 6.0 array-defined 13q14 type I and type II deletions on overall survival in the CLL cohort (UT + T; N = 132) using univariate analysis and either the CLL sample procurement/trial enrollment date or the CLL diagnosis date as the reference date. For both analyses, overall survival was significantly shorter for CLL patients with 13q14 type II lesions, as opposed to patients with 13q14 type I lesions (Kaplan–Meier plots for these analyses are displayed in Fig. 3A and C). These findings were true in the analyses in which CLL cases that carried both 13q14 lesion types (N = 7) were included and hierarchically assigned to type II status (shown) or, alternatively, were excluded from analysis. Next, we analyzed the prognostic value of SNP 6.0 array-defined 13q14 type I and type II deletions on overall survival in the untreated subset of patients (UT; N = 100) and again noted shortened survival of CLL patients with 13q14 type II, as opposed to type I lesions (Kaplan–Meier plots for these analyses are displayed in Fig. 3B and D).
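For readers unfamiliar with how Kaplan–Meier curves such as these are built, the following is a minimal sketch of the product-limit estimator for one group. The function name and the follow-up data are hypothetical; a real analysis would compute one curve per 13q14 type and compare them with a log-rank test.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per patient; events: 1 = death, 0 = censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up data in years (illustration only).
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Here the patient censored at year 2 leaves the risk set without lowering the curve, which is why the step at year 3 is computed over 2 patients rather than 3.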

A–H, deletions 13q14 types I or II and overall survival in CLL (Kaplan–Meier plots). A–D, 13q14 type I or II deletions and overall survival in CLL. UT, untreated at enrollment; UT + T, untreated or relapsed at enrollment. A and B, overall survival from date of enrollment. C and D, overall survival from date of diagnosis. E–H, pairwise groupings by 13q14 status (I or II) and any associated FISH lesions. The suffix “complex” indicates a 13q14 deletion with any (≥1) coexisting FISH-25 finding.

Finally, given that the nature of 13q14 deletion may also influence initial CLL disease progression, we analyzed time to first treatment (TTFT) estimates for the two 13q14 types but found no strong effects.

Deletion 13q14 type II lesions are associated with a higher incidence of coexisting CLL-FISH–detectable chromosomal abnormalities than 13q14 type I lesions

Next, we determined the frequency of 13q14 deletions of type I and II lesions that are associated with coexisting genomic lesions as detected through (i) CLL-FISH-25 or (ii) SNP 6.0 array profiling (see later). We defined 13q14 deletions of either type I or II lesions as “FISH-complex” if they were associated with one or more additional abnormalities in the clinically used CLL-FISH panel [and if these abnormalities were detected in ≥25% of nuclei analyzed (FISH-25); this is to remove low percentage lesions of unclear clinical or biological significance]. Using these criteria, the following frequencies were measured: 13q14-I (N = 85): sole abnormality (74 of 85 = 87%), FISH-complex (11 of 85 = 13%); 13q14-II (N = 54): sole abnormality (38 of 54 = 70%), FISH-complex [16 of 54 = 30% (P = 0.03, the Fisher exact test); CLL cases with both type I and II lesions (N = 7) were counted as 13q14-I and 13q14-II].
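A two-sided Fisher exact test on a 2 × 2 table of this kind can be sketched with the standard library alone. The function below is an illustrative implementation, not the software used in the study; it sums the hypergeometric probabilities of all tables that are no more likely than the observed one. The counts passed in at the end are the FISH-complex and sole-abnormality counts quoted above, for which the text reports P = 0.03.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Rows: 13q14-I, 13q14-II; columns: FISH-complex, sole abnormality.
p_value = fisher_exact_two_sided(11, 74, 16, 38)
print(f"P = {p_value:.3f}")
```

Note that the 7 cases carrying both lesion types are counted in both rows, so the two rows are not fully independent samples.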

Outcome analysis for del13q14 subtypes and associated CLL-FISH lesions

We proceeded with an outcome analysis of the relative contributions of 13q14 subtypes by category (I vs. II) versus associated CLL-FISH-25–based genomic complexity. Initially, we focused on bivariate analyses in which 13q14 lesion types were further separated into sole lesions or lesions coexisting with any other CLL-FISH-25 lesion (CLL cases that carried both 13q14 lesion types were included and hierarchically assigned to type II status). Data are summarized in Fig. 3E–H. From these analyses, the following conclusions can be supported: (i) sole 13q14 lesions of either type were associated with a relatively better prognosis than either type existing in the presence of additional FISH findings; (ii) trends for shorter overall survival for 13q14 type II deletions with CLL-FISH complexity versus 13q14 type I deletions with CLL-FISH complexity were observed; (iii) trends for shorter overall survival for sole 13q14 type II versus sole 13q14 type I lesions were observed, albeit more pronounced in CLL cases that were relapsed at the time of analysis (UT + T plots); and (iv) 13q14 type II deletions with CLL-FISH complexity were substantially enriched in relapsed CLL patients (frequency 45%).

Deletion 13q14 type II lesions are associated with a higher incidence of coexisting SNP 6.0 array-detectable acquired subchromosomal copy number aberrations than 13q14 type I lesions

Next, we determined the frequencies of either 13q14 lesion type that were associated with 2 or more additional SNP 6.0 array-based subchromosomal aCNAs (total aCNA complexity ≥3). The following frequencies were measured: (i) 13q14-I (N = 85), sole abnormality (45 of 85 = 53%) and 13q14-I plus 2 or more aCNAs (16 of 85 = 19%); (ii) 13q14-II (N = 54), sole abnormality (20 of 54 = 37%) and 13q14-II plus 2 or more aCNAs (18 of 54 = 33%; P = 0.02, the Fisher exact test; CLL cases with both type I and II lesions were counted as 13q14-I and 13q14-II). For comparison, the frequency of 3 or more SNP 6.0 array-based subchromosomal aCNAs in our entire CLL cohort of 255 profiled patients was 20% (35, 36).

Given the known strong association of p53 mutations with elevated genomic complexity in CLL, we determined p53 exon 2 to 10 mutation frequencies in the CLL cases with 13q14 type I or II deletions and 2 or more additional SNP 6.0-based aCNAs: There were 16 cases with 13q14-I plus 2 or more aCNAs, of which 7 (44%) were mutated in p53 exons 2 to 10, and there were 18 cases with 13q14-II plus 2 or more aCNAs, of which 6 (33%) were mutated in p53 exons 2 to 10.

Results of outcome analyses of 13q14 deletions of type I or II lesions stratified by associated SNP 6.0 array profiling–detectable subchromosomal aCNA loads versus overall survival in CLL

We proceeded with bivariate outcome analysis of 13q14 deletions of type I and II lesions grouped by associated SNP 6.0 array-based subchromosomal aCNA status of 2 or more and 3 or more, respectively (CLL cases that carried both 13q14 lesion types were included and hierarchically assigned to type II status). Kaplan–Meier plots for these analyses are displayed in Fig. 4A–H. From these analyses, the following conclusions are supported: (i) 13q14 lesions of either type with elevated associated aCNA counts were associated with a significantly worse prognosis than either type with no additional aCNAs or with additional aCNA counts below the indicated thresholds and (ii) 13q14 type II deletions were prognostically more adverse than 13q14 type I deletions, as evidenced by an accelerated disease course in cases with low associated complexity (aCNA ≤1 or ≤2, respectively), a finding that was particularly evident in CLL cases that were relapsed at the time of analysis (UT + T plots).

A–H, deletion 13q14 types I or II and associated SNP 6.0 array profiling–based aCNA and overall survival in CLL (Kaplan–Meier plots). A–D, overall survival from date of enrollment. E–H, overall survival from date of diagnosis. UT, untreated at enrollment; UT + T, untreated or relapsed at enrollment. The suffix “sc” indicates subchromosomal aCNA present at indicated thresholds.

Next, we conducted bivariate outcome analysis of 13q14 deletions of type I and II lesions stratified by associated SNP 6.0 array-based subchromosomal aCNA status of 2 or more and 3 or more, respectively, based on aCNA nominations that had been made by algorithmic aCNA calling methods (see Supplementary Methods). Results were similar to results based on aCNA nominations made through visual heatmap inspection. Kaplan–Meier plots are displayed in Supplementary Fig. S3A–H.

Degree of 13q14 deletions and outcome in CLL

The degree of 13q14 deletions was quantified using the mean of 216 consecutive copy number estimates based on probes located within 13q14 deletions and correlated with the outcome measures TTFT and overall survival for the group of 132 CLL patients with 13q14 deletions. The CLL cohort was dichotomized at every actual copy number measurement and HRs for short TTFT or overall survival were computed. As can be seen in Supplementary Fig. S4A–H, a copy number estimate of approximately 1 (N = 32 for copy number <1 and N = 100 for copy number ≥1) and high copy number cutoffs (the latter based on small N) optimally separated the cohort, suggesting that the degree of 13q14 deletion could have negative effects on CLL outcome once larger cohorts are studied (for this cohort, the 95% confidence bands in gray overlapped an HR of 1; ref. 37).

13q14-resident miR15a/16-1 expression and outcome in CLL

The relative expression levels of the 13q14-resident miRs as measured by Q-PCR (see earlier) were tested as a prognostic factor. The CLL cohort was dichotomized at every actual normalized miR expression measurement (actual ΔCt values) and HRs computed. No clear prognostic effects were identified (Supplementary Fig. S5A–H and S6A–H). This was equally true for 13q14-resident miR levels tested as prognostic factors in the entire CLL cohort (N = 255; data not shown).

CLL with 13q14 type II deletions are associated with lower RB1 mRNA expression levels and RB1 frameshift mutations or biallelic RB1 deletions in a minority subset of cases

The RB1 gene coding exons were resequenced in 53 of the CLL with 13q14 type II deletions, resulting in the identification of 2 cases (CLL# 158 and 173) with somatically acquired truncating frameshift mutations. Together with one case of CLL with biallelic RB1 deletion (CLL# 13), this indicates that approximately 5% of CLL with 13q14 type II deletions are RB1 null.

Next, we measured RB1 mRNA expression by normalized Q-PCR in the first 160 consecutively enrolled cases with intact RNA and identified significantly lower RB1 expression in CLL with 13q14 type II deletions as opposed to CLL with 13q14 type I deletions or no 13q14 deletions. Data are summarized in Fig. 5.

A and B, identification of somatically acquired RB1 mutations and results of Q-PCR–based RB1 expression analysis. A, ΔCt values [mean Ct RB1 − mean Ct glyceraldehyde 3-phosphate dehydrogenase (GAPDH)] with each dot representing the mean of duplicate measurements in individual patients (N = 160). Groupings are by 13q14 status. B, RB1 mutation results in CD19+-derived DNA versus CD3+-derived DNA.

Identification of rare somatically acquired mutations in 13q14-resident miR16-3p and miR16-5p in CLL

Sequence analysis of the genomic locus for miR16 and miR15a and flanking sequences in the entire CLL cohort identified 2 somatically mutated miR16 cases and no miR15a mutations. CLL # 61 harbored a heterozygous 6 bp deletion spanning miR16-3p, whereas CLL # 70 carried a homozygous 1 nucleotide deletion in miR16-5p in the setting of chromosome 13 aUPD. The vast majority of CLL cases (99%) carried wild-type miR16 and miR15a genes. Data are summarized in Fig. 6.

A and B, identification of somatically acquired miR16.1 mutations: miR16.1 mutation results in CD19+-derived DNA versus CD3+-derived DNA. The nucleotide sequence of the miR16.1 gene located within 13q14 is indicated and miR16.1-3p and miR16.1-5p are highlighted in yellow.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DA: carried out cytogenetic studies, SNP-array analysis and drafted the manuscript; EGU: carried out clinical examination and evaluation; KM: carried out molecular genetic studies; AW: carried out MCB analysis; FVE: carried out molecular evaluation; KY: carried out neurological examination; NP: carried out array-CGH analysis; NA: carried out molecular evaluation; MA: carried out molecular evaluation and SNP-array analysis; ET: carried out clinical evaluation. All authors read and approved the final manuscript.

