Tools to analyze RNA-seq data



I hope this is a good place to ask this kind of question. I have to do some data analysis on RNA-seq data from human cells, and I am currently searching for tools to help me with that. Specifically, I need tools to analyze gene expression from the data: something to help me plot the expression of selected genes in each fastq file and compare the differences in expression, with the possibility to export the results, or a command-line interface for scripting. Basically, I need something that takes a fastq file, and perhaps also a human genome annotation file, as input and gives gene expression as output. I have looked at Bioconductor and its packages and at Wikipedia's List of RNA-Seq bioinformatics tools. I suppose some of these tools can do what I need, but I have been unable to find out which ones, or how they should be used to achieve that. Could someone please give me some advice?

You will likely need a tool to "map" the reads on the reference genome. You may find such a reference genome, together with annotations, here:

Mapping tools such as bowtie2 or bwa take fastq files and reference genomes and output mapping results in a format called sam.

You then have a lot of options to estimate gene expression.

  • You can write your own algorithm to parse sam format and estimate normalized read counts on each gene.

  • You can combine more or less low-level tools such as samtools, pysam, htseq with some scripting to do this.

  • You can use tools that do the counting (like bedtools or htseq-count) and differential expression analysis (like deseq2).

In the last case, I would advise starting from the documentation of the final tool to find out which tools you need to generate the output of the preceding step.
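To make the "write your own" option above more concrete, here is a minimal sketch of counting reads per gene directly from SAM text lines. It relies only on the SAM column layout (flag in column 2, reference name in column 3, 1-based position in column 4). The gene coordinates, the example reads and the helper names are made up for illustration, and the rule "a read belongs to a gene if its start position falls inside it" is a simplifying assumption, not what htseq-count or bedtools actually do.

```python
from collections import Counter


def count_reads_per_gene(sam_lines, genes):
    """Count mapped reads whose start position falls inside a gene interval.

    sam_lines: iterable of SAM-format text lines.
    genes: dict mapping gene name -> (chromosome, start, end), 1-based inclusive.
    """
    counts = Counter({name: 0 for name in genes})
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:                    # 0x4 flag bit: read is unmapped
            continue
        for name, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and g_start <= pos <= g_end:
                counts[name] += 1
    return dict(counts)


def counts_per_million(counts):
    """Scale raw counts by the total number of counted reads (CPM)."""
    total = sum(counts.values())
    return {g: (c * 1e6 / total if total else 0.0) for g, c in counts.items()}


# Hypothetical toy input: three mapped reads and one unmapped read.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t150\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t180\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tchr2\t500\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
example_genes = {"geneA": ("chr1", 100, 300), "geneB": ("chr2", 400, 900)}
example_counts = count_reads_per_gene(example_sam, example_genes)
example_cpm = counts_per_million(example_counts)
```

For real data, dedicated counting tools handle exons, strandedness and ambiguous reads, which this sketch ignores.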

It is very likely you will use some R or Python, or use the web platform galaxy for some of the steps.


As mentioned by @scribaniwannabe in this answer, the paper about the Tuxedo suite of tools gives a good example of the steps to carry out an RNA-seq analysis using recent tools (as of October 2016).

As @Student T reminds in this answer, RNA-seq data contain reads that can come from exon-exon junctions, so the read mapper has to be set up in such a way as not to discard reads not mapping continuously on all their length on the genome. To my knowledge, HISAT2 and CRAC do this by default. Bowtie2 needs special settings.
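In SAM output, a read spanning an exon-exon junction shows up with an `N` operation (skipped reference region) in its CIGAR string. The sketch below shows how such reads can be recognized. The helper names are hypothetical and the example CIGAR strings are invented.

```python
import re

# One CIGAR element is a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '30M1200N45M' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def is_spliced(cigar):
    """True if the alignment skips a reference region (op 'N'), e.g. an intron."""
    return any(op == "N" for _, op in parse_cigar(cigar))


def reference_span(cigar):
    """Number of reference bases covered, counting the skipped region."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


junction_read = "30M1200N45M"   # 30 bases, a 1200-base gap, then 45 bases
contiguous_read = "75M"
```

A read like `junction_read` covers 1275 reference bases although only 75 of its own bases are aligned, which is why an aligner that does not model introns tends to discard or soft-clip it.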

While I agree with @bli that R and Python (in particular Bioconductor) have more than enough packages for you to compare gene expression, you shouldn't align your reads with bwa or bowtie, because they don't take introns into consideration. You should use TopHat or STAR.

The answer @bli gave is great. I thought I would point out that Johns Hopkins also recently upgraded their tuxedo suite. Looks promising and has great instructions for use.

Also, I've begun to grow quite fond of the GeneTrail 2 tool for my RNA-Seq secondary analysis. Gives great results for enrichment analyses.

Hope this is helpful.

I think that STAR is the preferred splice-aware aligner nowadays. STAR can output counts by gene or by transcript. Assuming you have Illumina data, you can try using the tools on Illumina's BaseSpace. RNASeq might be one of the things that you can do for free there.

I think HTSeq does almost that. It outputs a table of read counts per gene, given an annotation file and the aligned reads (SAM/BAM) for a sample, so the fastq file has to be mapped first.

PCAGO: An interactive tool to analyze RNA-Seq data with principal component analysis

The initial characterization and clustering of biological samples is a critical step in the analysis of any transcriptomics study. In many studies, principal component analysis (PCA) is the clustering algorithm of choice to predict the relationship of samples or cells based solely on differential gene expression. In addition to the pure quality evaluation of the data, a PCA can also provide initial insights into the biological background of an experiment and help researchers to interpret the data and design the subsequent computational steps accordingly. However, to avoid misleading clusterings and interpretations, an appropriate selection of the underlying gene sets to build the PCA and the choice of the most fitting principal components for the visualization are crucial parts. Here, we present PCAGO, an easy-to-use and interactive tool to analyze gene quantification data derived from RNA sequencing experiments with PCA. The tool includes features such as read-count normalization, filtering of read counts by gene annotation, and various visualization options. In addition, PCAGO helps to select appropriate parameters such as the number of genes and principal components to create meaningful visualizations.

Availability and implementation: PCAGO is implemented in R and freely available. The tool can be executed as a web service or locally using a Docker image.



The development of high-throughput next-generation sequencing (NGS) technologies has revolutionized the transcriptomics field, paving the way for large-scale RNA sequencing (RNA-Seq) 1 . RNA-Seq can not only be used to study genome-wide transcription but also, it offers the ability to discover new genes and transcripts 2 or to identify additional elements, such as new non-coding RNAs, small interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs) and micro-RNAs (miRNA). Recently, a new class of RNAs has been described, called circRNAs 3 , that are characterized by their ability to form circular RNA through a covalent linkage at the ends of a single RNA molecule. These circRNAs seem to participate in the regulation of gene expression, acting as regulators of miRNAs by specific binding to them. The appearance of these new regulatory molecules has led to the development of new tools for the identification of circRNAs, also through RNA-Seq experiments 4 .

There are two important aspects of RNA-Seq experiments: the vast amount of data generated in this kind of study, and the ability to extract and interpret biologically relevant information. These issues are particularly relevant since transcriptomics data analysis can easily become an important experimental bottleneck, especially given the additional constraints that both RNA-Seq and miRNA-Seq analyses impose. Indeed, the combination of different statistical and bioinformatics tools with many customizable parameters often makes such analysis difficult for non-experienced researchers. In addition, the use of different tools may involve time-consuming installations, usually requiring human intervention to proceed to the next step. To alleviate this problem, several tools have been generated for gene expression analysis, like ExpressionPlot 5 , GENE-counter 6 , RobiNA 7 , TCW 8 , Grape RNA-Seq 9 or MAP-RSeq 10 . In addition, another set of tools focuses on the analysis of miRNA expression profiles, such as DSAP 11 , miRanalyzer 12 , miRExpress 13 , miRNAkey 14 , iMir 15 , CAP-miRSeq 16 , mirTools 2.0 17 or sRNAtoolbox 18 . Moreover, a few tools have been implemented to perform both RNA-Seq and miRNA-Seq analysis, such as wapRNA 19 , eRNA 20 , BioVLAB-MMIA-NGS 21 or Omics Pipe 22 . Other available methods that integrate several software packages and enable different types of NGS analysis are Galaxy, QuasR 23 , RAP 24 and Subread/edgeR 25 , while others provide a collection of modules to process files, like the ViennaNGS 26 suite.

Although these tools are extremely valuable, their main disadvantage is that, with some exceptions, they often still rely on manual installation procedures and further human input, steps that have proven difficult to automate. There are also other issues that hamper their wider diffusion and implementation: i) some of the tools have been designed to work on web-based platforms, with the consequent restrictions on data upload or a limited choice of parameters (e.g. Galaxy, RAP 24 , BioVLAB-MMIA-NGS 21 , or DSAP 11 ); ii) the analysis pipelines implemented have rigid workflows, so users cannot start the analyses at different steps of the pipeline (e.g. RAP 24 , BioVLAB-MMIA-NGS 21 ); iii) some of these tools have a long list of prerequisites for local installation, which complicates their use by less experienced researchers (e.g. CAP-miRSeq 16 , Omics Pipe 22 , iMir 15 , Galaxy, ExpressionPlot 5 ); iv) the analysis is usually restricted to a few selected model organisms (e.g. QuasR 23 , ExpressionPlot 5 , BioVLAB-MMIA-NGS 21 ); and v) some tools use in-house code which has not been extensively tested in the NGS community (e.g. Grape RNA-Seq 9 or ExpressionPlot 5 ). In addition, to our knowledge, none of these tools has implemented a pipeline for the analysis of circRNAs.

With these limitations in mind, we have developed a comprehensive pipeline analysis suite called “miARma-Seq”, which stands for miRNA-Seq And RNA-Seq Multiprocess Analysis, that is designed to identify mRNAs, miRNAs and circRNAs, as well as for differential expression, target prediction and functional analysis. Most importantly, it can be applied to any sequenced organism, and it can be initiated at any step of the workflow.

How to Analyze Gene Expression with ROSALIND


The study of gene expression provides valuable insights into the nature of diseases and the effect of treatments by quantifying the activity of RNA in a biological sample. RNA-seq is a fast-growing Next Generation Sequencing (NGS) assay for evaluating gene expression, alternative splicing transcripts and fusions.

Scientists working in Oncology, Immunology, Regenerative Medicine, Drug Discovery and other areas of research often conduct experiments between healthy and disease states to identify Differentially expressed genes and biological pathways to discover therapeutic targets. Comparisons between these differential patterns reveal unique gene signatures valuable for drug and diagnostic development.


ROSALIND is a cloud platform that connects researchers to experiment design to quality control, differential expression and pathway exploration in a real-time collaborative environment.

Scientists of every skill level benefit from ROSALIND since no programming or bioinformatics are required. By accepting raw FASTQ sequence data as well as processed counts data, ROSALIND enables powerful downstream analysis and truly insightful visualizations on gene expression datasets. Receive same-day results with every experiment in an interactive experience designed for ease of use and saving valuable time.


ROSALIND enables scientists and researchers to analyze and interpret differential gene expression without the need for bioinformatics or programming skills. All that is required is basic background in biology and a current subscription or active trial.

Biological questions can also be explored independently, or in conjunction with, uploaded experiment data as ROSALIND automates the import of public data from the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) and Gene Expression Omnibus (GEO).


“I can now design and access my sequencing analysis within hours, and I have much more confidence in my results.”


ROSALIND simplifies data analysis and works like a data hub interconnecting every stage of data interpretation. The ROSALIND Gene Expression discovery experience enables visual exploration and self-investigation of experiment results to give researchers the freedom to adjust cut-offs, add comparisons, apply covariate corrections, and even find patterns across multiple datasets, without the need for bioinformatic expertise. There are five easy steps to performing RNA-seq data analysis on ROSALIND.


Starting an RNA-seq data analysis begins with creating a new experiment and capturing the experiment design. ROSALIND walks through the key aspects of an experiment in a guided experience to record biological objectives, sample attributes and analysis parameters. These details become the basis of the experiment discovery dashboard. Researchers who publish papers and work with NCBI public data know the importance of natively supporting NCBI data models. ROSALIND fully supports the NCBI BioProject and BioSample models for metadata assignment and sample attribute descriptions. ROSALIND also enables scientists to create custom attributes to describe biological behaviors in terms relevant to the experiment. Setup of comparisons is simplified by describing and annotating samples using these familiar terms. This methodology minimizes the risk of differential expression errors when selecting samples for comparison.

For RNA-seq data analysis, ROSALIND provides scientists with a choice: a) Begin with raw FASTQ files produced by high throughput sequencing, or b) Use processed data files generated by another analysis pipeline. Processed data is imported as normalized or raw counts. This provides flexibility for scientists to utilize the ROSALIND discovery experience to visualize and interpret data regardless of the data source. When analyzing raw FASTQ files, ROSALIND streamlines data analysis using an advanced pipeline for analysis that includes intelligent quality control with automatic contamination detection, identification of Differentially expressed genes and deep pathway interpretation. Visit the technical specifications section to learn more about the ROSALIND RNA-seq data analysis pipeline and available reference materials.

For proper RNA-seq results, an analysis pipeline must adjust for sample preparation and proprietary differences in library preparation kits used in the experiment. Not only is the kit selection important for targeting and capturing the desired transcriptomic elements, the analysis pipeline adjusts and optimizes for the kit’s unique characteristics, such as strandedness, strand direction, any unique molecular identifiers (UMIs) as well as the adapters used. ROSALIND integrates and supports a broad library of sample and library preparation kits, automatically calibrating each analysis with the appropriate details. To learn more about supported kits, visit the technical specifications section. Featured kits and instrument partners are also listed below.


Researchers must be confident in the quality control phase before gathering insights from an RNA-seq experiment, otherwise the results of the analysis should not be trusted. Biology’s mysteries are elusive and complex. Time should not be lost chasing corrective measures for outliers, contamination, swapped samples and the many other errors that can occur in the course of a well-designed experiment.

Some of the most important quality control metrics to verify are Q30 scores, alignment rates, ribosomal content, duplicate rates, sample correlation, gene coverage, genomic regions and multidimensional scaling (MDS) or principal component analysis (PCA) for all samples. When ROSALIND detects low alignment, non-aligning reads are evaluated for possible contamination. If ribosomal content is higher than expected, ROSALIND generates alerts. With Illumina sequencers, the results are usually good when Q30 values are over 85% and alignment rates are over 80% for the target species. Additionally, duplication rates of less than 25%, with fewer than 10% of reads trimmed, are preferred. Researchers can eliminate offending samples and their deleterious effects on results by identifying the sample as an outlier, and then move confidently into the discovery and exploration phase of results interpretation.
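The rule-of-thumb thresholds quoted above (Q30 over 85%, alignment over 80%, duplication under 25%, trimmed reads under 10%) can be turned into a simple per-sample check. This is only an illustrative sketch: the metric names and the function are hypothetical, and it is not a description of how ROSALIND evaluates quality internally.

```python
# Thresholds taken from the text; the metric names are assumed for illustration.
QC_RULES = {
    "q30_pct": (">", 85.0),
    "alignment_pct": (">", 80.0),
    "duplication_pct": ("<", 25.0),
    "trimmed_pct": ("<", 10.0),
}


def qc_failures(metrics):
    """Return the names of the metrics that miss their threshold."""
    failures = []
    for name, (direction, threshold) in QC_RULES.items():
        value = metrics[name]
        ok = value > threshold if direction == ">" else value < threshold
        if not ok:
            failures.append(name)
    return failures


# Hypothetical samples.
good_sample = {"q30_pct": 92.1, "alignment_pct": 88.4,
               "duplication_pct": 12.0, "trimmed_pct": 3.5}
poor_sample = {"q30_pct": 90.0, "alignment_pct": 61.0,
               "duplication_pct": 40.2, "trimmed_pct": 4.0}
```

A sample with an empty failure list passes all four checks; any listed metric is a prompt to look at that sample more closely rather than an automatic exclusion.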

ROSALIND Quality Control Intelligence identifies potential data quality issues and triages the data before presenting the results. This eliminates the need for researchers to be experts in sequencing quality control issues. Learn how researchers gain confidence in their results through Quality Control Intelligence.


After a researcher has reviewed the quality control phase the interactive presentation of results is ready to begin. The next step is to unlock the experiment. ROSALIND calculates the quantity of Analysis Units (“AU”) required to unlock the results. This is generally 1 AU per single-sample FASTQ file for RNA-seq experiments, however this may differ based on counts files or other experiment parameters. Account balances and quick links for acquiring more AU are directly accessible from the unlock screen. To learn more about Analysis Units, check out the Q&A in the section below, or visit the ROSALIND Store.


A typical RNA-seq analysis provides a list of Differentially expressed genes, generally in the form of a massive and obtuse CSV file. Unfortunately, this often results in more questions than answers for scientists. Multiple applications may also need to be used to generate this CSV file. Such applications often have a wide range of complexity with non-standard input/output formats, many of which are command-line tools requiring advanced knowledge in programming — an exercise well beyond the level of most biologists.

ROSALIND moves beyond the CSV file by providing a comprehensive dashboard for differential expression analysis and interpretation of RNA-seq data. Researchers begin with a list of significant differentially expressed genes determined by a calculated cut-off filter. Default settings for the filter begin with a fold change of 1.5 for upregulated and 1.5 for downregulated genes, with an adjusted P value (p-Adjust) of 0.05. Further adjustments to achieve a significant set of genes are performed by ROSALIND, if needed. Researchers may also create an unlimited set of their own customized filters using fold change and P value parameters. Convenient on-screen controls are easily accessible for modifying filters, adding covariate corrections, applying gene lists and signatures, and adjusting plot color palettes. The ROSALIND gene expression discovery experience features deep interpretation of top pathways, gene ontology diseases and drug interactions, as rich interactive plots that fill the screen and respond to interactions from the scientist, showing customizable heatmaps, volcano and MA plots as well as box and bar plots.
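A cut-off filter of the kind described (fold change of 1.5 in either direction, adjusted P value of 0.05) can be sketched in a few lines. The sketch assumes the results are available as log2 fold changes with adjusted P values, and that the cut-offs are applied as "at least 1.5-fold" and "below 0.05". Both are assumptions for illustration, as is the function name, and the gene results are invented.

```python
import math


def significant_genes(results, fold_change=1.5, padj_cutoff=0.05):
    """Split genes into up- and downregulated lists.

    results: dict mapping gene -> (log2 fold change, adjusted P value).
    """
    min_log2fc = math.log2(fold_change)   # 1.5-fold is about 0.585 on the log2 scale
    up, down = [], []
    for gene, (log2fc, padj) in results.items():
        if padj >= padj_cutoff:
            continue                       # not significant after adjustment
        if log2fc >= min_log2fc:
            up.append(gene)
        elif log2fc <= -min_log2fc:
            down.append(gene)
    return up, down


# Hypothetical differential expression results.
example_results = {
    "G1": (1.0, 0.01),    # 2-fold up, significant
    "G2": (-0.9, 0.03),   # about 1.9-fold down, significant
    "G3": (0.3, 0.001),   # significant but below the fold-change cut-off
    "G4": (2.0, 0.2),     # large change but not significant
}
up_genes, down_genes = significant_genes(example_results)
```

Tightening or relaxing `fold_change` and `padj_cutoff` corresponds to the custom filters mentioned above.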

New comparisons and meta-analyses may be added at any time. Comparisons are created using BioProject attributes. Meta-analyses can span multiple experiments and be multi-omic. Each of these perspectives is available within minutes of setup, reducing internal bioinformatic workload and enabling scientists to react fluidly by focusing directly on the science of the experiment.


The discovery process rarely ends with a single point of view from a single researcher opinion. ROSALIND Spaces enables true scientist-to-scientist collaboration through virtual data rooms where scientists and collaborators can come together on related datasets anywhere in the world to interactively explore shared experiments much like working with Google Docs. Researchers access a consistent version of the data, without the need to transfer unwieldy files or reinterpret origin files. All changes are interactive, instantly available, and viewable everywhere in the world (as authorized by the organization) with real-time activity feeds and historical reports. Spaces participants can add experiments, explore pathways, change cut-offs, add meta-analyses and add new comparisons all within the shared collaborative environment.

Spaces are virtual meeting rooms where scientists meet with niche experts, clients and supporting teams to maximize the discovery value of every experiment and prepare for the next one.

A Beginner’s Guide to Analysis of RNA Sequencing Data

Since the first publications coining the term RNA-seq (RNA sequencing) appeared in 2008, the number of publications containing RNA-seq data has grown exponentially, hitting an all-time high of 2,808 publications in 2016 (PubMed). With this wealth of RNA-seq data being generated, it is a challenge to extract maximal meaning from these datasets, and without the appropriate skills and background, there is risk of misinterpretation of these data. However, a general understanding of the principles underlying each step of RNA-seq data analysis allows investigators without a background in programming and bioinformatics to critically analyze their own datasets as well as published data. Our goals in the present review are to break down the steps of a typical RNA-seq analysis and to highlight the pitfalls and checkpoints along the way that are vital for bench scientists and biomedical researchers performing experiments that use RNA-seq.

RNA sequencing (RNA-seq) was first introduced in 2008 (1–4) and over the past decade has become more widely used owing to the decreasing costs and the popularization of shared-resource sequencing cores at many research institutions. The increased popularity of RNA-seq has led to a fast-growing need for bioinformatics expertise and computational resources. In order for bench scientists to correctly analyze and process large datasets, they will need to understand the bioinformatics principles and limitations that come with the complex process of RNA-seq analysis. Although RNA-seq analysis can be incredibly powerful and can uncover many exciting new findings, it differs from the usual analyses bench scientists are used to in that it comes as a very large dataset that cannot be interpreted without extensive analysis.

The protocol of RNA-seq starts with the conversion of RNA, either total, enriched for mRNA, or depleted of rRNA, into cDNA. After fragmentation, adapter ligation, and index ligation, each cDNA fragment is subsequently sequenced or “read” using a high-throughput platform. Raw read data then are demultiplexed, aligned, and mapped to genes to generate a raw counts table, at which point the data often are handed over to the bench researcher to start his or her own analysis. No true consensus exists yet on the most appropriate pipeline for RNA-seq data processing; however, there are numerous online semiautomated tools available, such as BaseSpace (Illumina), MetaCore (Thomson Reuters), or Bluebee (Lexogen). Although these tools generate principal component analysis (PCA) plots, display heat maps, and run differential gene expression analysis without the assistance of a bioinformatician, they do not allow users to fully assess the quality of their data, determine the accuracy of their own analysis, and tailor the analysis to their biological question, which can lead to misinterpretation of the dataset. It is important for investigators to understand how to approach their dataset, to appreciate the characteristics of their dataset, and to watch for weaknesses in the data that may limit the ability to draw conclusions. In addition, it is imperative that each dataset be analyzed de novo, in the sense that thresholds and methods must be adapted anew, which cannot be achieved by using generic online apps or tools.

For the purposes of this methods paper, we used an example dataset from an experiment within our research group in which naive murine alveolar macrophages were compared with those isolated from transplanted lungs 2 and 24 hours postreperfusion. We present our analysis using this dataset to describe a user-friendly approach to RNA-seq analysis for a bench scientist.

Male Cx3cr1 gfp/+ mice on a C57BL/6 background and wild-type BALB/c mice aged 12–14 weeks were used. All mice were housed in a specific pathogen–free facility. All reagents were certified endotoxin free by the manufacturer. All studies were conducted in compliance with guidelines of the Northwestern University Animal Care and Use Committee.

Transplants were performed between allogeneic mismatched donor–recipient pairs as described previously (5). Specifically, donor lungs from Cx3cr1 gfp/+ mice were used as allografts and implanted into wild-type BALB/c recipients. In brief, donor mice were heparinized and flushed antegrade through the pulmonary artery, the trachea was ligated after lungs were recruited, and then the heart-lung block was harvested and kept at 4°C for a 2-hour period of cold ischemia. Anastomoses for the single left lung transplant were completed using the cuffed technique through a left thoracotomy; the lung was reperfused and re-recruited; and then the thoracotomy was closed in layers. Mice were weaned from the ventilator and extubated during recovery once they were ambulatory. At specified time points after reperfusion, recipient mice were killed, and the lung allograft was harvested.

Lungs were processed for single-cell suspensions as described previously (5). Briefly, the right ventricle was flushed with 10 ml of ice-cold Hanks’ balanced salt solution, then the lungs were infiltrated with a tissue digestion mixture containing collagenase D (Roche) and DNase I (Roche). A combination of mechanical dissociation using the GentleMACS (Miltenyi Biotec) and enzymatic digestion at 37°C for 30 minutes was performed. Samples were then enriched using CD45 microbeads (Miltenyi Biotec) and AutoMACS system (Miltenyi Biotec) before antibody staining.

See Table E1 in the data supplement for antibodies and dilutions used for staining of single-cell suspension and Figure E1 for the gating strategy for sorting of alveolar macrophages. Cells were sorted into magnetic-activated cell sorting buffer at 4°C using a BD FACSAria II SORP four-laser flow cytometer (BD Biosciences).

Freshly sorted cells were pelleted immediately, resuspended in 100 μl of PicoPure Extraction Buffer (Thermo Fisher Scientific), and then stored at −80°C. RNA isolation was performed using the PicoPure RNA isolation kit (Thermo Fisher Scientific), and samples with high-quality RNA (RNA integrity number, >7.0) as measured using the 4200 TapeStation (Agilent Technologies) were used for library preparation. The mRNA was obtained from total RNA using NEBNext Poly(A) mRNA magnetic isolation kits (New England BioLabs), and cDNA libraries were subsequently prepared using the NEBNext Ultra DNA Library Prep Kit for Illumina (New England BioLabs). Libraries were sequenced on a NextSeq 500 platform using a 75-cycle single-end high-output sequencing kit (Illumina). Sequencing yielded libraries with an average size of 8 million reads after alignment. RNA-seq analysis was based on uniquely aligned reads.

Reads were demultiplexed (bcl2fastq), and fastq files were aligned to the mm10 mouse genome (TopHat2 [6]) and mapped to genes (HTSeq [7]) using the Ensembl gene annotation. Pairwise comparisons between the various conditions were run using a negative binomial generalized log-linear model through the glmLRT fit function in edgeR (8, 9).

The RNA-seq data reported in this article have been deposited in NCBI’s Gene Expression Omnibus (GEO) and are accessible through GEO Series accession number GSE116583.

A major goal of RNA-seq analysis is to identify differentially expressed and coregulated genes and to infer biological meaning for further studies. Source material can be cells cultured in vitro, whole-tissue homogenates, or sorted cells. The ability to interpret findings depends on appropriate experimental design, implementation of controls, and correct analysis. Every effort should be made to minimize batch effect, because small and uncontrolled changes in an environment can result in identification of differentially expressed genes (DEGs) unrelated to the designed experiment. Sources of batch effect can occur during the experiment, during the RNA library preparation, or during the sequencing run and include but are not limited to those listed in Table 1. Once a well-designed and controlled experiment is performed, a structured approach to the dataset allows for quality control followed by unbiased analysis of the data. In the present analysis, we use an approach that includes setting low count filtering, establishing a noise threshold, checking for potential outliers, running appropriate statistical tests to identify DEGs, clustering of genes by expression pattern, and testing for gene ontology (GO) enrichment. For each of these analysis components, we aim to highlight important checkpoints and quality controls that will streamline and strengthen data analysis, avoid bias, and allow investigators to maximally use their datasets.

Table 1. Sources of Batch Effect and Proposed Strategies to Mitigate Them

For this tutorial, we use a dataset comprising three groups of alveolar macrophages that were studied in a murine model of lung transplantation during the first 24 hours of reperfusion. This approach (of which we make no claims of originality and refer the reader to an excellent review by Conesa and colleagues [10] outlining the major steps of RNA-seq data analysis) allows the investigator to probe the data in an unbiased manner in an effort to identify transcriptional signatures and to enable further downstream analyses.

When assessing variability within the dataset, it is preferable that the intergroup variability, representing differences between experimental conditions in comparison with control conditions, is greater than the intragroup variability, representing technical or biological variability. A global overview of the data allows for the characterization of variation between replicates and whether investigator-defined experimental groups show actual differences between groups (a group being a set of replicates from the same condition or of the same cell type). One way to visualize the variation in a dataset is through PCA (11). PCA takes a large dataset as input and reduces the number of gene “dimensions” to a minimal set of linearly transformed dimensions reflecting the total variation of the dataset. The results are commonly presented as a two-dimensional plot in which data are visualized along axes that describe the variation within the dataset, known as the principal components (PCs). PC1 describes the most variation within the data, PC2 the second most, and so forth. The variation represented by each PC can be calculated as a percentage of the total variance and visualized by a scree plot. If the first two PCs do not capture the majority of the variance, it may be helpful to generate additional two-dimensional PCA plots displaying other PCs. In this way, a PCA plot may help to visualize grouping among replicates and aid in identifying technical or biological outliers.
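The share of variance captured by a principal component, as shown in a scree plot, is the corresponding eigenvalue of the covariance matrix divided by the sum of all eigenvalues (the trace). The sketch below computes this for PC1 only, using plain power iteration on a tiny samples-by-genes matrix. It is a teaching example with invented numbers and hypothetical function names; for real expression data an established PCA routine (for example in R or scikit-learn) is the practical choice.

```python
import math


def covariance(matrix):
    """Gene-by-gene covariance of a samples x genes matrix (rows are samples)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    centered = [[v - m for v, m in zip(row, means)] for row in matrix]
    p = len(means)
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]


def leading_eigenpair(cov, iterations=500):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    p = len(cov)
    # Non-uniform start vector; assumed not to be orthogonal to PC1.
    v = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))   # Rayleigh quotient
    return eigenvalue, v


def pc1_explained_variance(matrix):
    """Fraction of the total variance that lies along PC1."""
    cov = covariance(matrix)
    total = sum(cov[i][i] for i in range(len(cov)))      # trace = total variance
    eigenvalue, _ = leading_eigenpair(cov)
    return eigenvalue / total


# Toy data: 4 samples x 2 genes, most of the spread along the first gene.
toy_expression = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5], [0.0, -0.5]]
pc1_fraction = pc1_explained_variance(toy_expression)
```

In the toy matrix the two genes are uncorrelated with variances in a 4:1 ratio, so PC1 carries 80% of the total variance.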

Another approach to determining inter- and intragroup variability is to calculate distance as represented by correlation between samples. Two commonly used measures of correlation are the Pearson’s coefficient and the Spearman’s rank correlation coefficient (12–14), which describe the directionality and strength of the relationship between two variables. The Pearson’s correlation reflects the linear relationship between two variables accounting for differences in their mean and SD, whereas the Spearman’s rank correlation is a nonparametric measure using the rank values of the two variables. The more similar the expression profiles for all transcripts are between two samples, the higher the correlation coefficient will be. These correlation coefficients are calculated between all samples and can be visualized as either a table or a heat map, allowing the investigator to assess whether replicates (technical or biological) group together. In addition to allowing an assessment of variability, both PCA and sample correlation analysis can help to identify outliers that were not excluded during upstream steps such as alignment. For instance, a sample that aligned well and demonstrated good read depth might make it to this step of the pipeline; however, a PCA or correlation analysis may identify this library as a mislabeled or contaminated sample, clustering the outlier within another group. It is also possible that a correctly labeled sample will fall out as a biological outlier, such as if it was collected from an animal that was believed to have received a challenge but did not show symptoms. In summary, these analyses provide a global overview of all samples, allow for determination of outliers, and present data in an easy-to-digest format to the investigator and reader.
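The difference between the two coefficients is easy to see in code: Spearman's coefficient is simply Pearson's coefficient computed on ranks. The following sketch uses only the standard library; the function names and the expression values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values of four genes in two samples.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [1.0, 4.0, 9.0, 100.0]   # same ordering, but not a linear relation
r_pearson = pearson(sample_a, sample_b)
r_spearman = spearman(sample_a, sample_b)
```

Here the gene ordering is identical in both samples, so the rank correlation is 1, while the single very high value pulls the Pearson coefficient below 1.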

Using our alveolar macrophage dataset, we show a PCA plot and a heat map of Pearson’s correlation across alveolar macrophage samples: naive, transplant 2 hours postreperfusion, and transplant 24 hours postreperfusion sample groups (Figure 1A). Both the PCA plot and the Pearson’s correlation heat map were generated using normalized reads per kilobase of transcript per 1 million mapped reads (RPKM) counts (see Normalized Counts box). The PCA demonstrated expected grouping among replicates within samples and sample groups spread across the two PCs. PC1 accounts for 68.1% of the variance, and PC2 accounts for an additional 20.3%. The scree plot (Figure E2) confirmed that the majority of the variance within the dataset was described by the first two PCs. Although the PCA plot emphasizes intergroup variability, the Pearson’s correlation analysis (Figure 1B) provides an overview of all the variation between samples showing a correlation value of r > 0.9 (Table 2), consistent with each group belonging to the same cell type.

Figure 1. Assessing inter- and intragroup variability. (A) Principal component (PC) analysis plot displaying all 12 samples along PC1 and PC2, which describe 68.1% and 20.3% of the variability, respectively, within the expression data set. PC analysis was applied to normalized (reads per kilobase of transcript per 1 million mapped reads) and log-transformed count data. (B) Pearson’s correlation plot visualizing the correlation (r) values between samples. Scale bar represents the range of the correlation coefficients (r) displayed.
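For readers unfamiliar with RPKM, the normalization and log transformation mentioned in the caption can be sketched as follows. The counts and gene lengths are invented, and the number of mapped reads per sample is approximated here by the column sum of the count table, which is an assumption of this example and not necessarily what the authors did.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    counts: genes x samples matrix of raw read counts
    gene_lengths_bp: length of each gene/transcript in base pairs
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    total_reads = counts.sum(axis=0)   # assumption: mapped reads = column total
    return counts * 1e9 / (lengths[:, None] * total_reads[None, :])

# Hypothetical counts: 3 genes x 2 samples; sample 2 was sequenced twice as deep
counts = np.array([[100, 200],
                   [300, 600],
                   [600, 1200]])
lengths = [1000, 2000, 500]

values = rpkm(counts, lengths)
log_values = np.log2(values + 1)       # pseudocount of 1 avoids log2(0)
print(values)
```

Because the second sample differs from the first only by sequencing depth, both columns end up with identical RPKM values.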

Integrate input files into AskOmics

In AskOmics, the conversion of data into RDF is called integration.

On the Files page (link at the top of the page), you will see the files you uploaded from Galaxy. We will now integrate all these files.

Hands-on: Integrate data

  1. Go to the Files page
  2. Select all the input files
  3. Click on the Integrate button

You will land on the Integrate page, which shows a preview of the data present in each selected file, depending on its data type.

Target Audience

Graduates, postgraduates, and PIs working or about to embark on an analysis of RNA-seq data. Attendees may be familiar with some aspect of RNA-seq analysis (e.g. gene expression analysis) or have no direct experience.


Prerequisites: Basic familiarity with a Linux environment and S, R, or Matlab.

You will also require your own laptop computer. Minimum requirements: 1024x768 screen resolution, 1.5GHz CPU, 2GB RAM, 10GB free disk space, and a recent version of Windows, Mac OS X or Linux (most computers purchased in the past 3-4 years likely meet these requirements). If you do not have access to your own computer, please contact [email protected] for other possible options.

This workshop requires participants to complete pre-workshop tasks and readings.


A database of software tools for the analysis of single-cell RNA-seq data. To make it into the database, software must be available for download and public use somewhere (CRAN, Bioconductor, PyPI, Conda, GitHub, Bitbucket, a private website, etc.). To view the database head to

This database is designed to be an overview of the currently available scRNA-seq analysis software. It is unlikely to be 100% complete or accurate but will be updated as new software becomes available.

We welcome contributions from the scRNA-seq community! If you would like to contribute, please have a look at the wiki or fill in the submission form on our website. Please be aware that by contributing you are agreeing to abide by the code of conduct.

If you are interested in joining the scRNA-tools team please contact us.

If you find the scRNA-tools database useful for your work please cite our publication:

Reference-based RNA-seq data analysis (Galaxy)

Galaxy is an open source, web-based platform for data-intensive biomedical research. This tutorial is modified from the Reference-based RNA-seq data analysis tutorial on GitHub. In this tutorial, we will use Galaxy to analyze RNA sequencing data using a reference genome and to identify exons that are regulated by a Drosophila melanogaster gene. To achieve these objectives, we will go through:

The original data we use is available at the NCBI Gene Expression Omnibus (GEO) under accession number GSE18508.
To conduct a differential expression analysis, we will look at the first 7 samples:

  • 3 treated samples with Drosophila melanogaster gene depletion: GSM461179, GSM461180, GSM461181
  • 4 untreated samples: GSM461176, GSM461177, GSM461178, GSM461182

Each sample constitutes a separate biological replicate of the corresponding condition (treated or untreated). Moreover, two of the treated and two of the untreated samples are from a paired-end sequencing assay, while the remaining samples are from a single-end sequencing experiment.

We have extracted sequences from the Sequence Read Archive (SRA) files to build FASTQ files. All files are available on Zenodo. First we need to create a new history for this RNA-seq exercise. Detailed instructions are shown below:

  1. Click the “History Option” icon at the top of the History section.
  2. Hit “create new”. A new history will be created. You may rename it by editing the name directly.

Then we need to import a FASTQ pair (e.g. GSM461177_untreat_paired_chr4_R1.fastq and GSM461177_untreat_paired_chr4_R2.fastq) from Zenodo and convert the file format to fastqsanger. Detailed instructions are shown below:

  1. Copy the link location
  2. Open the Galaxy Upload Manager
  3. Select “Paste/Fetch Data”
  4. Paste the link into the text field
  5. Press Start (Note that Galaxy takes the link as the dataset name. It also does not link the dataset to a database or a reference genome by default)
  6. Click on the pencil button displayed in your dataset in the history
  7. Rename the datasets according to the samples
  8. Press Save
  9. Choose Datatype on the top
  10. Select fastqsanger
  11. Press Save

Both files contain the reads that belong to chromosome 4 of a paired-end sample. The sequences are raw sequences from the sequencing machine, without any pretreatment. Their quality needs to be checked.
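To give an idea of what a quality check looks at, here is a minimal sketch that reads FASTQ records and computes the mean Phred quality per read. It assumes the simple four-lines-per-record layout and the Sanger encoding (ASCII offset 33) used by the fastqsanger datatype; the two reads are invented, and this is in no way a replacement for FastQC.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()                 # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual       # drop the leading '@' of the header

def mean_phred(qual, offset=33):
    """Average Phred score of a read; Sanger encoding stores score + 33 as ASCII."""
    return sum(ord(c) - offset for c in qual) / len(qual)

# Hypothetical records: 'I' encodes Phred 40, '!' encodes Phred 0
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
means = {name: mean_phred(qual) for name, _, qual in read_fastq(example)}
print(means)
```

For a real file, the `io.StringIO` object would be replaced by an opened FASTQ file.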

For quality control, we use FastQC and Trim Galore. We first run FastQC on both FASTQ files to check the quality of the reads.

  1. Select the paired-end dataset (e.g. GSM461177_untreat_paired_chr4_R1.fastq and GSM461177_untreat_paired_chr4_R2.fastq)
  2. Find and open FastQC from the Tools bar
  3. Press Execute

Then improve the quality of the sequences by running Trim Galore on the paired-end datasets.

  1. Find and open Trim Galore from Tools
  2. Choose “Pair-end”
  3. Use default value for other parameters
  4. Select the paired ended dataset
  5. Press “Execute”

Finally, we may re-run FastQC on Trim Galore’s outputs and inspect the differences.

To make sense of the reads, their positions within the Drosophila melanogaster genome must be determined. This process is known as aligning or ‘mapping’ the reads to the reference genome. Here, we will use HISAT2, a successor to TopHat2 that is faster and has low memory requirements. To run the mapping efficiently, HISAT2 needs to know one important parameter of the sequencing library: the library type. This information should usually come with your FASTQ files; ask your sequencing facility! If not, try to find it on the site where you downloaded the data or in the corresponding publication. Another option is to estimate this parameter with a preliminary mapping of a downsampled file and some analysis programs. Afterward, the actual mapping can be redone on the original files with the optimized parameters.

We first run a preliminary mapping to estimate the library type, so that HISAT2 can be run efficiently afterwards. This step is not necessary if you already know the library type of your data. The library type corresponds to the protocol used to generate the data: which strand the RNA fragment is synthesized from.

In the previous illustration, you could see that, for example, the dUTP method sequences only the strand from the first strand synthesis (the original RNA strand is degraded due to the incorporated dUTP).

If you do not know the library type, you can determine it yourself by mapping the reads to the reference genome and inferring the library type from the mapping results, comparing the read mapping information to the annotation of the reference genome.

The sequencer always reads from 5’ to 3’. So, in the First Strand case, all reads from the left-most end of an RNA fragment (always from 5’ to 3’) are mapped to the transcript strand, and (for paired-end sequencing) reads from the right-most end are always mapped to the opposite strand.
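The idea of inferring the library type from mapping results can be sketched as a simple tally: for each read (for paired-end data, the first mate), compare the strand it mapped to with the strand of the annotated gene it overlaps. The function names, the 0.8 threshold and the toy data below are assumptions made for illustration and do not reproduce what the Infer Experiment tool does internally.

```python
def strand_fractions(pairs):
    """pairs: list of (read_strand, gene_strand), each '+' or '-'."""
    same = sum(1 for read, gene in pairs if read == gene)
    total = len(pairs)
    return same / total, (total - same) / total

def classify_library(pairs, threshold=0.8):
    """Label a library from the share of reads agreeing with the gene strand.

    The threshold is a hypothetical cutoff chosen for this example.
    """
    same, opposite = strand_fractions(pairs)
    if same >= threshold:
        return "stranded: reads on transcript strand"
    if opposite >= threshold:
        return "stranded: reads on opposite strand"
    return "unstranded"

# Hypothetical observations for three libraries
forward = [("+", "+")] * 95 + [("-", "+")] * 5
reverse = [("-", "+")] * 48 + [("+", "-")] * 48 + [("+", "+")] * 4
mixed = [("+", "+")] * 50 + [("-", "+")] * 50

for name, pairs in [("forward", forward), ("reverse", reverse), ("mixed", mixed)]:
    print(name, strand_fractions(pairs), classify_library(pairs))
```

A library in which about half of the reads agree with the gene strand and half disagree carries no strand information and is treated as unstranded.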

We can now try to determine the library type of our data. The first step is to load the Ensembl gene annotation for Drosophila melanogaster (Drosophila_melanogaster.BDGP5.78.gtf) from Zenodo into your current Galaxy history and rename it. Then run a preliminary HISAT2 mapping on the downsampled reads with:

  1. “FASTQ” as “Input data format”
  2. “Individual paired reads”
  3. Downsampled “Trimmed reads pair 1” (Trim Galore output) as “Forward reads”
  4. Downsampled “Trimmed reads pair 2” (Trim Galore output) as “Reverse reads”
  5. “dm3” as reference genome
  6. Default values for other parameters

Then run Infer Experiment to determine the library type:

  1. HISAT2 output as “Input BAM/SAM file”
  2. Drosophila annotation as “Reference gene model”

Sometimes it is difficult to find out which settings correspond to those of other programs. The following table might be helpful to identify the library type:

We can now map all the RNA sequences to the Drosophila melanogaster genome using HISAT2. HISAT2 will output a BAM file.

  1. “FASTQ” as “Input data format”
  2. “Individual paired reads”
  3. “Trimmed reads pair 1” (Trim Galore output) as “Forward reads”
  4. “Trimmed reads pair 2” (Trim Galore output) as “Reverse reads”
  5. “dm3” as reference genome
  6. Default values for other parameters except “Spliced alignment parameters”
  7. “Specify strand-specific information” to the previously determined value
  8. Drosophila_melanogaster.BDGP5.78.gtf as “GTF file with known splice sites”

We can inspect the mapping statistics:

The BAM file contains information about where the reads are mapped on the reference genome. But it is a binary file, and with information for more than 3 million reads it is difficult to inspect directly. We use IGV to visualize the HISAT2 output BAM file, particularly the region on chromosome 4 between 560 kb and 600 kb.

  1. Download and install IGV on your local machine by following the instructions found here
  2. Click on the BAM file
  3. Click “local” under “display with IGV”
Analysis of the differential gene expression

To compare the expression of single genes between different conditions (e.g. with or without PS depletion), an essential first step is to quantify the number of reads per gene. HTSeq-count is one of the most popular tools for gene quantification. To quantify the number of reads mapped to a gene, an annotation of the gene positions is needed. We have already uploaded Drosophila_melanogaster.BDGP5.78.gtf, with the Ensembl gene annotation for Drosophila melanogaster, to Galaxy.

In principle, counting the reads that overlap genomic features is a fairly simple task. But some details need to be decided, such as how to handle multi-mapping reads. HTSeq-count offers 3 choices (“union”, “intersection_strict” and “intersection_nonempty”) to handle reads mapping to multiple locations, reads overlapping introns, or reads that overlap more than one genomic feature:

The recommended mode is “union”, which counts overlaps even if a read shares only part of its sequence with a genomic feature, and disregards reads that overlap more than one feature.

  1. Drosophila_melanogaster.BDGP5.78.gtf as “GFF file”
  2. The “Union” mode
  3. A “Minimum alignment quality” of 10
  4. Appropriate value for “Stranded” option
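The “union” rule described above can be illustrated with a small sketch. It is a simplification under stated assumptions: a single chromosome, half-open (start, end) intervals, no strand or alignment-quality handling, and invented gene and read coordinates. It is not the HTSeq implementation, only the counting logic.

```python
def count_union(reads, genes):
    """Count reads per gene with the 'union' rule.

    reads: list of (start, end) intervals
    genes: dict mapping gene name -> list of (start, end) exon intervals
    """
    counts = {name: 0 for name in genes}
    counts["__no_feature"] = 0
    counts["__ambiguous"] = 0
    for r_start, r_end in reads:
        # all genes with at least one exon overlapping any part of the read
        hits = {name
                for name, exons in genes.items()
                for e_start, e_end in exons
                if r_start < e_end and e_start < r_end}
        if len(hits) == 1:
            counts[hits.pop()] += 1        # partial overlap with one gene still counts
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1     # overlaps more than one gene: not counted
    return counts

# Hypothetical annotation with two overlapping genes
genes = {"geneA": [(100, 200)], "geneB": [(180, 300)]}
reads = [(90, 120), (190, 210), (400, 450), (250, 280)]
counts = count_union(reads, genes)
print(counts)
```

The first read only partly overlaps geneA and is still counted, whereas the second read lies in the region shared by both genes and is set aside as ambiguous.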

To save time and compute resources, we have run the previous steps for you and obtained 7 count files, available on Zenodo. These files contain, for each Drosophila gene, the number of reads mapped to it. We could compare the files directly to obtain the differential gene expression. But the number of sequenced reads mapped to a gene depends on other factors as well, such as expression level, gene length, and sequencing depth. For both within-sample and between-sample comparisons, the gene counts need to be normalized. We can then run the Differential Gene Expression (DGE) analysis. Differential expression is estimated from read counts, and variability in the measurements is corrected for using replicates, which are absolutely essential for accurate results. For your own analysis, we advise you to use at least 3, and preferably 5, biological replicates per condition. The number of replicates may differ between conditions. In our example, 2 factors can explain differences in gene expression: treatment and sequencing type. Here, treatment is the primary factor of interest.

DESeq2 is a great tool for DGE analysis. It takes the read counts produced by HTSeq-count, combines them into a big table (with genes in the rows and samples in the columns) and applies size factor normalization. To import the read count files and run DESeq2, follow the instructions shown below:

  1. Create a new history
  2. Import the seven count files from Zenodo
  3. Run DESeq2
  4. Set “Treatment” as first factor with “treated” and “untreated” as levels and select the count files corresponding to each level
  5. Press “insert factor”
  6. Set “Sequencing” as second factor with “PE” and “SE” as levels and select the count files corresponding to each level (keep the CTRL key pressed while clicking to select several files)
  7. Hit “execute”
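Size factor normalization can be sketched with the median-of-ratios idea: each sample is compared, gene by gene, with the geometric mean across samples, and the median ratio becomes its size factor. The code below is a simplified NumPy illustration on invented counts, not the DESeq2 code itself.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)           # use genes with counts in every sample
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)   # log geometric mean per gene
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Hypothetical table: sample 2 has twice the depth of sample 1
counts = np.array([[10, 20],
                   [100, 200],
                   [50, 100],
                   [0, 7]])
factors = size_factors(counts)
normalized = counts / factors
print("size factors:", np.round(factors, 4))
print(np.round(normalized, 2))
```

After dividing by the size factors, genes that differ only because of sequencing depth end up with the same normalized count in both samples.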

The first output of DESeq2 is a tabular file. The columns are:

  • Gene identifiers
  • Mean normalized counts, averaged over all samples from both conditions
  • Logarithm (base 2) of the fold change
  • Standard error estimate for the log2 fold change estimate
  • Wald statistic
  • p-value for the statistical significance of this change
  • p-value adjusted for multiple testing with the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR)
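The Benjamini-Hochberg adjustment in the last column can be sketched as follows. This is a plain NumPy version of the standard procedure on made-up p-values; it does not include any additional filtering that a tool may apply before adjusting.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)                          # smallest p-value first
    scaled = p[order] * n / np.arange(1, n + 1)    # p * n / rank
    # make adjusted values non-decreasing with rank, starting from the largest
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0, 1)
    return adjusted

# Hypothetical p-values for four genes
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(padj)
```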

To extract the genes with the most significant changes (adjusted p-value below 0.05), we use Filter.

  1. Launch Filter
  2. Select the DESeq2 result table as input
  3. Type c7 < 0.05 in “With following condition”
  4. Press “execute”
  5. (Optional) Rename the output file for downstream analysis
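Outside Galaxy, the same filter can be written in a few lines of Python. The sketch below assumes a tab-separated table without a header line in which column 7 holds the adjusted p-value, as in the column list above; the gene identifiers and values are invented, and rows whose adjusted p-value is “NA” are skipped.

```python
import csv
import io

def filter_by_padj(handle, column=7, cutoff=0.05):
    """Keep rows whose value in the given 1-based column is below the cutoff."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        value = row[column - 1]
        if value == "NA":                 # no adjusted p-value available
            continue
        if float(value) < cutoff:
            kept.append(row)
    return kept

# Hypothetical result table (id, mean, log2FC, SE, Wald, p-value, adjusted p-value)
table = io.StringIO(
    "FBgn0000001\t100.0\t2.5\t0.3\t8.3\t1e-10\t1e-8\n"
    "FBgn0000002\t50.0\t0.1\t0.4\t0.25\t0.8\t0.9\n"
    "FBgn0000003\t5.0\t1.0\t0.9\t1.1\t0.27\tNA\n"
)
significant = filter_by_padj(table)
print([row[0] for row in significant])
```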

In addition to the list of genes, DESeq2 outputs a graphical summary of the results, useful for evaluating the quality of the experiment: a histogram of p-values for all tests, an MA plot, a principal component analysis (PCA), a heatmap of the sample-to-sample distance matrix, and dispersion estimates.

The MA plot provides a global view of the relationship between the expression change between conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression. The genes that passed the significance threshold (adjusted p-value < 0.25) are colored in red.

The heatmap provides an overview of similarities and dissimilarities between samples.

Dispersion estimates: gene-wise estimates (black), the fitted values (red), and the final maximum a posteriori estimates used in testing (blue).

Analysis of the functional enrichment among differentially expressed genes

We have extracted genes that are differentially expressed in treated (with PS gene depletion) samples compared to untreated samples. We would like to know the functional enrichment among the differentially expressed genes.

The Database for Annotation, Visualization and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.

A query to DAVID is limited to 100 genes, so we need to select the ones we are most interested in.

  1. Launch the Sort tool
  2. Select the previously filtered file under “Sort Query”
  3. “Column:3” under “on column” and “Descending order” under “everything in” to list the most upregulated genes first
  4. Press “Execute”
  5. Launch the Select first tool
  6. Extract the first 100 lines
  7. Launch DAVID
  8. First column as “Column with identifiers”
  9. “ENSEMBL_GENE_ID” as “Identifier type”
  10. Press “Execute”
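The sort-and-select part of this procedure amounts to ordering the rows by column 3 (the log2 fold change) in descending order and keeping the first entries. A minimal sketch, with an invented table and a hypothetical helper name, could look like this:

```python
def top_by_column(rows, column=3, n=100):
    """Sort rows by a numeric 1-based column, largest first, and keep the first n."""
    ordered = sorted(rows, key=lambda row: float(row[column - 1]), reverse=True)
    return ordered[:n]

# Hypothetical filtered rows: identifier, mean normalized count, log2 fold change
rows = [
    ["FBgn0000001", "100.0", "2.5"],
    ["FBgn0000002", "50.0", "-1.2"],
    ["FBgn0000003", "75.0", "4.1"],
]
top = top_by_column(rows, column=3, n=2)
identifiers = [row[0] for row in top]     # the identifier list submitted to DAVID
print(identifiers)
```

Sorting in descending order puts the most strongly upregulated genes first; sorting on the absolute value instead would select the largest changes in either direction.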

The output of the DAVID tool is an HTML file with a link to the DAVID website.

Inference of the differential exon usage

Now, we would like to determine the differential exon usage between treated (PS depleted) and untreated samples using RNA-seq exon counts. We will reuse the mapping results we generated previously.

We will use DEXSeq. DEXSeq detects with high sensitivity genes, and in many cases exons, that are subject to differential exon usage. But first, as for the differential gene expression, we need to count the number of reads mapping to the exons.

This step is similar to counting the number of reads per annotated gene, but here we use DEXSeq-Count instead of HTSeq-count.

  1. Transfer the gene annotation file Drosophila_melanogaster.BDGP5.78.gtf from Zenodo to a Galaxy history.
  2. Launch “DEXSeq-Count”
  3. “Prepare annotation” as “Mode of operation”

The output is again a GTF file that is ready to be used for counting. To count reads using DEXSeq-Count:

  1. “count reads” as “Mode of operation”
  2. HISAT2 output as “Input bam file”
  3. GTF file from previous step as “DEXSeq compatible GTF file”

This outputs a flattened GTF file.

Next, we calculate differential exon usage. As for DESeq2, in the previous step we counted only reads that mapped to exons on chromosome 4, and for only one sample. To identify differential exon usage induced by PS depletion, all datasets (3 treated and 4 untreated) must be analyzed with the same procedure. To save time, we use the results available on Zenodo.

  1. Create a new history
  2. Import the seven count files from Zenodo and the GTF file generated in the previous step
  3. Launch DEXSeq
  4. “Condition” as first factor with “treated” and “untreated” as levels and selection of count files corresponding to both levels
  5. “Sequencing” as second factor with “PE” and “SE” as levels and selection of count files corresponding to both levels

Note that, unlike DESeq2, DEXSeq does not allow flexible primary factor names. Always name your primary factor “condition”. This step will take a couple of hours to run.

Similarly to DESeq2, DEXSeq generates a table with:

  • Exon identifiers
  • Gene identifiers
  • Exon identifiers in the gene
  • Mean normalized counts, averaged over all samples from both conditions
  • Logarithm (base 2) of the fold change
  • Standard error estimate for the log2 fold change estimate
  • p-value for the statistical significance of this change
  • p-value adjusted for multiple testing with the Benjamini-Hochberg procedure, which controls the false discovery rate

Similarly, we also run Filter to extract exons with a significant difference in usage (adjusted p-value equal to or below 0.05) between treated and untreated samples.

In addition, DEXSeq generates an interactive HTML file which allows users to inspect differentially expressed exons graphically.

In this tutorial, we have analyzed real RNA sequencing data to extract useful information, such as which genes are up- or downregulated by depletion of the Drosophila melanogaster gene and which genes are regulated by the Drosophila melanogaster gene. To answer these questions, we analyzed RNA sequence datasets using a reference-based RNA-seq data analysis approach. This approach can be summed up with the following scheme:

Prerequisites: Basic familiarity with a Linux environment and S, R, or Matlab. Must be able to complete and understand the following simple Linux and R tutorials (up to and including “Descriptive Statistics”) before attending:
