High-throughput annotation of full-length long noncoding RNAs with Capture Long-Read Sequencing

Accurate annotations of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, reference gene collections remain incomplete: many gene models are fragmentary, while thousands more remain uncatalogued—particularly for long noncoding RNAs (lncRNAs). To accelerate lncRNA annotation, the GENCODE consortium has developed RNA Capture Long Seq (CLS), combining targeted RNA capture with third-generation long-read sequencing. We present an experimental re-annotation of the GENCODE intergenic lncRNA population in matched human and mouse tissues, resulting in novel transcript models for 3574 / 561 gene loci, respectively. CLS approximately doubles the annotated complexity of targeted loci, outperforming existing short-read techniques. Full-length transcript models produced by CLS enable us to definitively characterize the genomic features of lncRNAs, including promoter- and gene-structure, and protein-coding potential. Thus CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.

LncRNAs represent a vast and relatively unexplored component of the mammalian genome. The assignment of lncRNA functions depends on the availability of high-quality transcriptome annotations. At present such annotations are still rudimentary: we have little idea of the total number of lncRNAs, and for those that have been identified, transcript structures remain largely incomplete.
Projects using diverse approaches have helped to increase both the number and size of available lncRNA annotations. Early gene sets, derived from a mixture of FANTOM cDNA sequencing efforts and public databases 1,2 , were joined by long intergenic noncoding RNA (linc RNA) sets discovered through chromatin signatures 3 . More recently, researchers have applied transcript-reconstruction software such as Cufflinks 4 to identify novel genes in short-read RNA-sequencing (RNA-seq) data sets [5][6][7][8][9] . However, the standard references for lncRNAs are currently the regularly updated manual annotations from GENCODE, which are based on the curation of cDNAs and expressed sequence tags by human annotators 10,11 and have been adopted by international genomics consortia [12][13][14][15] .
At present, annotation efforts face a necessary compromise between throughput and quality. Short-read-based transcriptomereconstruction methods deliver large annotations with low financial and time investment, whereas manual annotation is slow and requires long-term funding. However, the quality of software-reconstructed annotations is often doubtful because of the inherent difficulty of reconstructing transcript structures from shorter sequence reads.
Such structures tend to be incomplete and often lack terminal exons or splice junctions between adjacent exons 16 . This particularly affects lncRNAs, whose low expression results in low read coverage 11 . The outcome is a growing divergence between large automated annotations of uncertain quality (e.g., 101,700 genes for NONCODE 8 ) and the highly curated, 'conservative' GENCODE collection 11 (15,767 genes for version 25).
Annotation incompleteness takes two forms. First, genes may be entirely missing from an annotation; many genomic regions are suspected to transcribe RNA but contain no annotation, including 'orphan' small RNAs with presumed long precursors 17 , enhancers 18 and ultraconserved elements 19,20 . Second, annotated lncRNAs may represent partial gene structures. Start and end sites frequently lack independent supporting evidence 11 , and lncRNAs are shorter and have fewer exons than mRNAs 7,11,21 . Recently, a method of rapid amplification of cDNA ends followed by sequencing (RACE-seq) was developed to complete lncRNA annotations, albeit at relatively low throughput 21 .
One of the principal impediments to the annotation of lncRNAs is their low steady-state levels 3,11 . To overcome this, RNA capture sequencing (CaptureSeq) 22 is used to boost the concentration of lowabundance transcripts in cDNA libraries. Such studies depend on short-read sequencing and in silico transcript reconstruction [22][23][24] . Thus, although CaptureSeq achieves high throughput, its transcript structures lack the confidence required for inclusion in GENCODE.
High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing was even across tissues, with the exception of testis ( Supplementary  Fig. 2d). ROIs had a median length of 1-1.5 kb (Fig. 2b), in agreement with previous reports 32 and exceeding the average lncRNA annotation of ~0.5 kb (ref. 11).
Capture performance is assessed on the basis of two factors: the 'on-target' rate-that is, the proportion of reads originating from probed regions-and enrichment, or the increase in the on-target rate after capture 34 . To estimate these, we sequenced pre-and post-capture libraries with MiSeq. CLS achieved on-target rates of 29.7%/16.5%, representing 19-fold/11-fold enrichment (Fig. 2c,d and Supplementary Fig. 2e). These rates are competitive with values for intergenic lncRNA capture from previous, short-read studies (Supplementary Fig. 2f,g). The majority of off-target signal arose from nontargeted, annotated protein-coding genes (Fig. 2c).
CLS on-target rates were similar to those from previous studies of fragmented cDNA 35 (Supplementary Fig. 2f,g), but lower than those observed with genomic DNA capture. Side-by-side comparisons showed that the capture of long cDNA fragments implies some loss in capture efficiency (Supplementary Fig. 2h,i), as has been observed by others 24 .
We used synthetic spike-in sequences at known concentrations to assess the sensitivity and quantitativeness of our method. We compared  MicroRNAs  786  494  snRNAs  838  721  snoRNAs  401  850  Enhancers  954  203  UCE  158  156  TOTAL  9,090  6,615  Prot. coding  124  112  Controls  ERCC  42  42  Ecoli  100  100   TOTAL  the relationship between sequence reads and starting concentration for the 42 probed and 50 nonprobed synthetic ERCC sequences in pre-and post-capture samples (Fig. 2e). We found that CLS was notably sensitive, extending detection sensitivity by two orders of magnitude, and was capable of detecting molecules at approximately 5 × 10 −3 copies per cell (Online Methods). It was less quantitative than CaptureSeq 24 , particularly at higher concentrations where the slope fell below unity. This suggests saturation of probes by cDNA molecules during hybridization. A degree of noise, as inferred by the coefficient of determination (R 2 ) between read counts and template concentration, was introduced by the capture process.

CLS expands the complexity of known and novel lncRNAs
CLS uncovered a wealth of novel transcript structures in annotated lncRNA loci. In the SAMMSON oncogene 36 (LINC01212), we discovered previously unannotated exons, splice sites and transcription termination sites (Fig. 3a, Supplementary Figs. 3-5; examples validated by RT-PCR). We quantified the amount of newly discovered complexity in targeted lncRNA loci. CLS detected 58%/45% of targeted lncRNA nucleotides and extended these annotations by 6.3/1.6 Mb (86%/64% increase compared with existing annotations) (Supplementary Fig. 6a). CLS discovered 45,673/11,038 distinct splice junctions, of which 36,839/8,847 were previously unidentified (Fig. 3b, Supplementary  Fig. 6b). We noted 20,327 novel, high-confidence splice junctions in comparison with a deeper human splice junction reference catalog composed of both GENCODE v20 and miTranscriptome 7 annotations (Supplementary Fig. 6c). For independent validation, and given the relatively high sequence insertion-deletion rate detected in PacBio reads (Supplementary Fig. 2m) (an analysis of sequencing error rates is presented in the Online Methods), we deep-sequenced captured cDNA with Illumina HiSeq at an average depth of 35 million/26 million paired-end reads per sample. Split reads from these data exactly matched 78%/75% of splice junctions from CLS. These 'high-confidence' splice junctions alone represent a 160%/111% increase over the existing, probed annotations (Fig. 3b, Supplementary Fig. 6b). The novel high-confidence lncRNA splice junctions were rather tissue specific, with the greatest numbers observed in testis ( Supplementary  Fig. 6d), and were also discovered across other classes of targeted and nontargeted loci (Supplementary Fig. 6e). We observed a greater frequency of intron-retention events in lncRNAs compared with that in protein-coding transcripts (Supplementary Fig. 6f).
To evaluate the biological significance of the novel lncRNA splice junctions, we computed their strength with standard position weight matrix models 37 (Fig. 3c, Supplementary Fig. 7a). High-confidence novel splice junctions from lncRNAs far exceeded the predicted strength of background splice-junction-like dinucleotides and were essentially indistinguishable from annotated splice junctions (Fig. 3c). Even unsupported novel splice junctions (Fig. 3c) tended to have high scores, although with low-scoring tails. Although they showed little evidence of sequence conservation according to standard measures (similar to lncRNA splice junctions in general; Supplementary Fig. 7b    A r t i c l e s novel splice junctions showed weak but nonrandom evidence of selected function (Supplementary Fig. 7c).
We estimated how close these sequencing data were to saturation (i.e., to reaching a definitive annotation). We tested the rate of novel splice junction and transcript model discovery as a function of increasing depth of randomly sampled ROIs (Fig. 3d, Supplementary  Fig. 8a,b). We observed a consistent increase in novelty with increasing depth for both low-and high-confidence splice junctions, up to that presented here. Similarly, no splice-junction-discovery saturation plateau was reached at increasing simulated HiSeq read depths (Supplementary Fig. 8c). Thus, considerable additional sequencing is required to complete existing lncRNA gene structures.
Beyond lncRNAs, CLS can be used to characterize other types of transcriptional units. As an illustration, we searched for precursors of small RNAs, whose annotation remains poor 17 . We probed 1-kb windows around all 'orphan' small RNAs (i.e., those with no annotated overlapping transcript). Note that although mature small nucleolar RNAs are nonpolyadenylated, they are processed from polyA+ precursors 38 . We identified more than 100 likely primary transcripts, and hundreds more potential precursors that harbored small RNAs within their introns (Fig. 3e). One interesting example was the cardiac-enriched hsa-miR-143, for which CLS identified a new RT-PCRsupported primary transcript belonging to the CARMEN1 lncRNA gene (CARMN) 39 (Supplementary Fig. 9).

Assembling a full-length lncRNA annotation
A unique benefit of the CLS approach is the ability to identify fulllength transcript models with confident 5′ and 3′ termini. ROIs of oligo-dT-primed cDNAs carry a fragment of the poly(A) tail, which can identify the polyadenylation site with base-pair precision 32 . Using conservative filters, we found that 73%/64% of ROIs had identifiable polyadenylation sites (Supplementary Table 1) representing 16,961/12,894 novel sites compared with end positions of GENCODE annotations. Known and novel polyadenylation sites were preceded by canonical polyadenylation motifs ( Supplementary  Fig. 10a-d). Similarly, the 5′ completeness of ROIs was confirmed by proximity to methyl-guanosine caps identified by cap analysis of gene expression (CAGE) 15 (Supplementary Fig. 10e). We used CAGE and polyadenylation sites to define the 5′ and 3′ completeness of all ROIs (Fig. 4a). 40 Figure 8b. (e) The identification of putative precursor transcripts of small RNA genes. Shown is the count of unique genes for each gene biotype. "Orphans" indicates genes with no annotated overlapping transcript in GENCODE that were targeted in the capture library. "Potential precursors" are orphan RNAs residing in the intron of a novel CLS transcript model. "Precursors" reside in the exon of a novel transcript. snoRNA, small nucleolar RNA; snRNA, small nuclear RNA; miRNA, microRNA.
We developed a pipeline to merge ROIs into a nonredundant collection of transcript models. In contrast to previous approaches 4 , our 'anchored merging' method preserved confirmed internal transcription start sites (TSSs) and polyadenylation sites (Fig. 4b). Application of this method to captured ROIs resulted in a greater number of unique transcript models than would have been identified otherwise (Fig. 4c, Supplementary Fig. 11a). We identified 179,993/129,556 transcript models across all biotypes (Supplementary Table 2), 86%/87% of which displayed support of their entire intron chain by captured HiSeq split reads (Supplementary Table 3). In the well-studied CCAT1 locus 40 , we identified novel full-length transcripts with 5′ and 3′ support (Fig. 4d). CLS here suggested that adjacent CCAT1 and CASC19 annotations are fragments of a single gene, a conclusion supported by RT-PCR (Fig. 4d).
In addition to probed lncRNA loci, CLS also discovered several thousand novel transcript models that originated from unannotated regions and mapped to probed (Fig. 1b) or unprobed regions (Supplementary Fig. 11h,i). These transcript models tended to have lower detection rates (Supplementary Fig. 11j) consistent with low overall expression (Supplementary Fig. 11k) and lower rates of 5′ and 3′ support than probed lncRNAs, although a small number were full length (Fig. 4e, Supplementary Fig. 11b).
We next compared the performance of CLS to that of conventional, short-read CaptureSeq. We took advantage of our HiSeq analysis (212 million/156 million reads) of the same captured cDNAs to make a fair comparison between methods. Short-read methods depend on in silico transcriptome assembly; using PacBio reads as a reference, we found that the StringTie tool outperformed Cufflinks, which was used in previous CaptureSeq projects 24,41 (Supplementary  Fig. 12a). Using intron chains to compare annotations, we found that CLS identified 69%/114% more novel transcript models than StringTie assembly (Fig. 4h, Supplementary Fig. 12b). CLS transcript models were more complete at 5′ and 3′ ends than StringTie assemblies were, and they were also more complete at the 3′ end compared with probed GENCODE annotations (Fig. 4i, Supplementary  Fig. 12d-h). Thus, although StringTie transcript models are slightly longer (Fig. 4j, Supplementary Fig. 12c), they are far less likely to be full length than CLS models are. This greater length might be attributable to the production of overly long 5′ extensions by StringTie, as suggested by the relatively high CAGE signal density downstream of StringTie TSSs (Supplementary Fig. 12g-h). CLS was more sensitive in the detection of repetitive regions and identified ~20% more repetitive nucleotides in human tissues (Supplementary Fig. 12i).

Redefining lncRNA promoter and gene characteristics
With a full-length lncRNA catalog, we revisited the basic characteristics of lncRNA and protein-coding genes. LncRNA transcripts, as annotated, are substantially shorter and have fewer exons than mRNAs 5,11 . However, it has remained unresolved whether this is a genuine biological trend or simply the result of annotation incompleteness 21 . When we considered full-length transcript models from CLS, we found that the median lncRNA transcript length was 1,108/1,067 nucleotides, similar to that of mRNAs mapped according to the same criteria (1,240/1,320 nucleotides) (Fig. 5a, Supplementary Fig. 13a). This length difference of 11%/19% was statistically significant (P < 2 × 10 −16 for both human and mouse samples; two-sided Wilcoxon test). These measured lengths are still shorter than those of most annotated protein-coding transcripts (median of 1,543 nucleotides in GENCODE v20), but they are much longer than those of annotated lncRNAs (median of 668 nucleotides). There are two factors that preclude our making firm statements regarding the relative lengths of lncRNAs and mRNAs: the upper length limitation of PacBio reads (Fig. 2b), and the fact that our size-selection protocol selected against shorter transcripts. Nevertheless, we did not find evidence that lncRNAs are substantially shorter 11 . We expect that this issue will be definitively answered with future nanopore sequencing approaches.
In a previous study, we observed enrichment for two-exon genes in lncRNAs 11 . However, the results of the current study show that this was clearly an artifact arising from annotation incompleteness: the mean number of exons for lncRNAs in the full-length models was 4.27, compared with 6.69 for mRNAs (Fig. 5b, Supplementary  Fig. 13b). This difference can be explained by lncRNAs' longer exons, although they peak at approximately 150 bp, or one nucleosomal turn (Supplementary Fig. 13c).
Improvements in TSS annotation are further demonstrated by the fact that full-length transcripts' TSSs are, on average, closer to expected promoter features, including promoters and enhancers predicted by genome segmentations 42 and CpG islands, although not evolutionarily conserved elements or phenotypic genome-wide association study variants 43 (Fig. 5c). Accurate mapping of lncRNA promoters may provide new hypotheses for the mechanism by which such variants result in observed phenotypes. For example, improved 5′ annotation brings genomewide association study SNP rs246185 closer to the TSS of RP11-65J2 (ENSG00000262454). Evidence for a functional link between the two is supported by the fact that rs246185 is an expression quantitative trait locus for RP11-65J2, which is expressed in heart and muscle 44 (Supplementary Fig. 13d,e).
The improved 5′ definition provided by CLS transcript models also allowed us to compare lncRNA and mRNA promoters. Recent studies based on the start positions of gene annotations have claimed that strong differences exist between lncRNA and mRNA promoters 45,46 .
To make fair comparisons, we created an expression-matched set of mRNAs in HeLa and K562 cells, and removed bidirectional promoters. We compared these across a variety of data sets from ENCODE 12 (Supplementary Figs. 14 and 15).
We observed a series of similar and divergent features of lncRNA and mRNA promoters. For example, activating promoter histone modifications such as H3K4me3 (Fig. 5d) and H3K9ac (Fig. 5e) were essentially indistinguishable between full-length lncRNAs and protein-coding genes, which suggests that, when expression differences are accounted for, the active promoter architecture of lncRNAs is not unique. The contrast between these findings and previous reports suggests that reliance on annotations alone in prior studies led to inaccurate promoter identification 45,46 .   A r t i c l e s However, as observed previously, lncRNA promoters were distinguished by elevated levels of repressive chromatin marks such as H3K9me3 (Fig. 5f) Figs. 14 and 15). This may have been a consequence of elevated recruitment to lncRNAs of the Polycomb repressive complex, as evidenced by its subunit Ezh2 (Fig. 5g). Promoters of lncRNAs were also distinguished by a localized peak of the insulator protein CTCF (Fig. 5h). Finally, there was a clear signal of evolutionary conservation at lncRNA promoters, although it was lower than that for proteincoding genes (Fig. 5i).
Two conclusions can be drawn. First, CLS-inferred TSSs have a greater density of expected promoter features compared with probed annotations, thus demonstrating that CLS improves TSS annotation. Second, after adjustment for expression, lncRNAs have similar activating histone modifications, but distinct repressive modifications, compared with protein-coding genes.

Discovery of new potential open reading frames
A number of studies have suggested that lncRNA loci encode peptide sequences through unannotated open reading frames (ORFs) 47,48 . We searched for signals of protein-coding potential in full-length models by using two complementary methods based on evolutionary conservation and intrinsic sequence features 49,50 (Fig. 6a, Online Methods,  Supplementary Data Set 1). This analysis revealed evidence for protein-coding potential in a small fraction of lncRNA full-length transcript models (109 of 1,271, or 8.6%), although a similar number of protein-coding full-length transcripts showed no evidence of protein coding (2,900 of 42,758, or 6.8%) (Fig. 6b). CLS full-length models supported reclassification of protein-coding potential for five distinct gene loci (Fig. 6c, Supplementary Fig.  16a, Supplementary Data Set 2). A good example is the KANTR locus, where extension by CLS (supported by independent RT-PCR) identified a placental-mammal-conserved 76-amino-acid ORF with no detectable protein ortholog 51 . It is composed of two sequential transmembrane domains (Fig. 6d, Supplementary Fig.  16e) and derives from a LINE1 transposable element. Another case is LINC01138, linked to prostate cancer, for which a potential 42-aminoacid ORF was found in the extended transcript 52 . We could not find peptide evidence for translation of either ORF (Online Methods). Whole-cell expression, as well as cytoplasmic-to-nuclear distributions, also showed that the behavior of potentially protein-coding lncRNAs was consistently more similar to that of annotated lncRNAs than to that of mRNAs (Supplementary Fig. 16b-d). Hence, CLS will be useful in improving biotype annotation of the small minority of lncRNAs that may encode proteins.

DISCUSSION
We have introduced an annotation methodology that addresses the competing needs of quality and throughput. Capture long-read sequencing produces transcript models with quality approaching that of human annotators, yet with throughput similar to that of in silico transcriptome reconstruction. CLS improves upon existing assemblybased methods through not only confident exon connectivity but also (1) far higher rates of 5′ and 3′ completeness and (2) the carrying of encoded poly(A) tails.
CLS is also competitive in economic terms. Using conservative estimates with 2016 prices ($2,460 for one lane of PE125bp HiSeq, and $500 for one SMRT), and including the cost of sequencing alone,  For each probed GENCODE gene annotation, we took the score of the best ORF across all transcripts (x-axis) and the score of the best ORF in the corresponding full-length CLS transcript models (y-axis). Yellow, loci from the GENCODE v20 annotation predicted to encode proteins; red, lncRNA loci where new ORFs were discovered as a result of CLS transcript models. (d) KANTR, an example of an annotated lncRNA locus where a novel protein-coding sequence was discovered. Top, the structure of the lncRNA and the associated ORF (highlighted region) falling within the range of novel full-length CLS transcripts (red). Note how this ORF lies outside the existing annotation (green) and overlaps a highly conserved region (see the PhastCons conservation track below). A sequence obtained by RT-PCR (black) is also shown. Bottom, conservative substitutions in the predicted 76-amino-acid ORF consistent with a functional peptide, generated by CodAlignView ("URLs"). High-confidence predicted SMART 53 domains are indicated by colored bars. This ORF lies within and antisense to an L1 transposable element (gray bar).
we estimate that CLS yielded one novel, full-length lncRNA structure for every $8 spent, compared with $27 with conventional CaptureSeq. This difference is due to the greater rate of full-length transcript discovery by CLS. Despite its advantages, CLS could still be optimized in several respects. First, the capture efficiency for long cDNAs can be improved by several-fold. Second, various technical factors limit the completeness of CLS transcript models, including sequencing reads that remain shorter than many transcripts, incomplete reverse transcription of the RNA template, and degradation of RNA molecules before reverse transcription. Resolution of these issues will be an important objective of future protocol improvements, and only after it has been achieved can we make definitive judgments about lncRNA transcript properties. In recent work separate from the current study, we further optimized the capture protocol, pushing on-target rates to around 35% (Online Methods and data not shown). However, the most dramatic gains in the cost-effectiveness and completeness of CLS will come from advances in sequencing technology. The latest nanopore cDNA sequencing promises to be ~150-fold less expensive per read than PacBio technology (0.01 versus 15 cents per read, respectively).
Full-length annotations have provided the most confident view so far of lncRNA gene properties. LncRNAs are more similar to mRNAs than previously thought in terms of splice length and exon count 11 . We noted a similar trend for promoters: when lncRNA promoters were accurately mapped by CLS and compared with expressionmatched protein-coding genes, we found them to be surprisingly similar in terms of activating modifications. This suggests that previous studies that placed confidence in annotations of TSSs should be reassessed 45,46 . On the other hand, lncRNA promoters do have unique properties, including elevated levels of repressive histone modification, recruitment of Polycomb group proteins, and interaction with the insulator protein CTCF. To our knowledge, this is the first report to suggest a relationship between lncRNAs and insulator elements. Overall, these results suggest that lncRNA gene features per se are generally similar to those of mRNAs, after normalization for differences in expression. Finally, extended transcript models did not yield evidence for widespread protein-coding capacity encoded in lncRNAs.
Despite our success in mapping novel structures in annotated lncRNAs, we observed surprisingly low numbers of transcript models originating in the relatively fewer numbers of unannotated loci that we probed, including ultraconserved elements and developmental enhancers. This suggests that, at least in the tissue samples probed here, such elements do not give rise to substantial numbers of lncRNA-like, polyA+ transcripts.
In summary, by resolving a longstanding roadblock in lncRNA transcript annotation, the CLS approach promises to accelerate progress toward an eventual 'complete' mammalian transcriptome annotation. These updated lncRNA catalogs represent a valuable resource for the genomic and biomedical communities, and address fundamental issues of lncRNA biology.

METHODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.
pool each were prepared to avoid PCR duplicates. Eighteen PCR cycles were performed, with an increased extension step of 3 min to allow long fragments to be fully amplified. The length of post-capture PacBio and Illumina libraries was verified by Bioanalyzer, and quantity was verified by Qubit.
PacBio sequencing of captured cDNA. Pooling. After quantification and quality control, the four post-capture libraries were pooled together by species to produce one unique human and one unique mouse pool. The 110 µl of each sample were again quantified by Qubit dsDNA BR assay (Thermo Fisher), with 12.3 µg for human and 9.57 µg for mouse.
Size selection. Samples were subsequently size-selected with E-gel (Invitrogen) into three different ranges: 1,000-1,500 bp, 1,500-2,500 bp and >2,500 bp. We collected two shorter fractions of 200-500 bp and 500-1,000 bp, but after reviewing the preliminary sequencing data we decided not to scale them up because of the large number of reads in this size range obtained in the larger fractions. After size selection, each size fraction was dried and resuspended with 20 µl of water and quantified by Qubit dsDNA BR assay (Thermo Fisher). These samples were then amplified again by PCR (four cycles) with Kapa HiFi HotStart (Kapa Biosystems) to reach the required amount for PacBio library preparation. The quality and length of obtained libraries were verified with Bioanalyzer and Qubit.
We checked the efficiency of size selection via analysis of spike-in sequences (Supplementary Fig. 1d). For each size-selected captured library, and for pre-capture libraries, we calculated the sequencing efficiency as a function of spike-in sequence length. Sequencing efficiency was defined for each spikein sequence as follows: (number of reads)/(molar concentration × sequence length × total read count). This showed that, as expected, size selection boosted the sequencing of longer templates.
PacBio library preparation. Approximately 2 µg of each of the size-fractionated and amplified DNAs was used for each of the human and mouse pools, for a total of 6 (3 × 2) distinct samples. Sizes and concentrations were verified by Bioanalyzer. PacBio libraries were constructed for each sample with kit #100-250-100 (Pacific Biosciences) as per the manufacturer's protocol. Briefly, this involved polishing the PCR amplicon ends to 'blunt' them, ligating the SMRTbell adaptors, removing linear (nonligated) fragments of DNA, and carrying out AMPure bead purification followed by Bioanalyzer analysis to assess the size distribution and Qubit quantifications.
PacBio sequencing and collection of post-capture data. We ran each of the PacBio libraries on an initial SMRT cell to assess their respective performance and optimal sequencing concentration. Those that performed well were then scaled up to an additional 20 SMRT cells for deep data collection. The PacBio reagents and metrics used for each sample are listed in Supplementary Table 8. The sequencing was performed on a PacBio RSII instrument. Upon completion of the sequencing, SMRT cells from a given library were aggregated on SMRT Portal, and the PacBio post-processing method "RS_ReadsOfInsert.1" was run on each aggregated sample to generate ROIs for downstream processing. This yielded a single FASTQ file per library.
HiSeq sequencing of captured cDNA. Post-capture Illumina cDNA libraries were sequenced on a HiSeq 2500 machine (2 × 125 nt, v4, high-output mode). One sequencing lane was generated per species at a depth of ~212 million (human) or ~156 million (mouse) pairs of reads. Read pairs were demultiplexed with Illumina software. Note that these libraries were unstranded and Covaris-fragmented before capture.
To demultiplex samples (i.e., to determine the tissue of origin of each ROI), for each adaptor we selected its middle 26 nt. Each of the 26-mers derived from the indexed adaptors contained the hexamer barcode in the center. We used the GEM mapper 55 to demultiplex samples. PacBio reads were compiled into a FASTA file (one file per species) and indexed by GEM. Mapping the middle 26-mer of indexed adaptors to the PacBio read allowed us to assign it to its tissue of origin. The additional presence of the universal adaptor within ROIs was used to confirm the completeness of the insert. The GEM-based demultiplexing procedure allowed up to three mismatches (-m 0.1) and three indels (-e 0.1) for accurate identification of the barcodes. The following non-default GEM parameters were used during the mapping step: -T 3 --max-big-indel-length 0 -s 3 -D 4. We filtered out 'chimeric' ROIs (that is, reads arising from the concatenation of inserts during adaptor ligation) by removing those reads that contained more than one indexed or more than one universal TruSeq Illumina adaptor sequence.
Overall, we were able to demultiplex 1,627,322 and 1,509,374 ROIs in human and mouse samples, respectively (Fig. 2a, Supplementary Fig. 2b). As shown in Supplementary Figure 2d, only a minute fraction of human ROIs were assigned a mouse barcode (and vice versa), which highlights the high specificity of the demultiplexing procedure.
For analysis of MiSeq (pre-capture cDNA) and HiSeq (post-capture) data, FASTQ files were aligned to the human and mouse genomes (plus the sequences of 96 ERCC spike-in controls) with STAR 56 (v.2.4.0.1) compiled for short reads. The reference annotations described above were used to guide the mapper. To maximize the mapping rate, we aligned the mates of each pair of reads separately. The following non-default STAR parameters were specified: --outFilterMismatchNoverLmax 0.04 --alignIntron-Min 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMunmapped Within --runThreadN 6.
Analysis of CLS performance and on-target enrichment. RNA-capture on-target enrichment. We evaluated the overall RNA-capture performance by calculating an on-target rate in both MiSeq pre-capture and PacBio postcapture libraries. The on-target rate was defined as the ratio of the number of distinct ROIs mapping to targeted genomic regions (excluding ERCC RNA spike-in controls) to the total number of mapped ROIs. The number of reads overlapping targeted regions was calculated directly from the STAR BAM file with bedtools intersect 57 . Overlap was defined as ≥1 bp of intersection between the sequencing read and the exonic span of a feature on the same strand. The overall on-target fold enrichment was computed as the on-target rate in the post-capture library divided by the on-target rate in the pre-capture library.
We calculated enrichment separately by referencing two distinct sequencing data sets of post-capture cDNA: (a) the main PacBio reads, and (b) Illumina MiSeq of the same material. Figure 2d shows data for enrichments calculated with the latter data set: MiSeq post-capture versus MiSeq pre-capture. Equivalent enrichments for the former comparison (PacBio post-capture versus MiSeq pre-capture) were 16.6-fold/11.1-fold for human/mouse.
We compared CLS enrichments to values from a previous capture shortread sequencing (CSS) study 24 . We focused our analysis on the CSS tissues that were also assayed in CLS (human brain, heart, liver and testis), and computed on-target rates on lincRNAs more than 5 kb away from any protein-coding gene in both studies, based on GENCODE v20 and v19 for CLS and CSS, respectively. CSS pre-capture rates were estimated from pre-capture MiSeq libraries generated in the present work, and remapped to hg19/GENCODE 19. Across the four tissues studied, CLS outperformed CSS in terms of both on-target enrichment (in all samples) and post-capture on-target rate (in brain and testis only) (Supplementary Fig. 2f,g).