Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...

Metadata search page

Loading...

Supported studies overview

Loading...
Tutorial

Introduction

Welcome to RiboCrypt RiboCrypt is an R package for interactive visualization in genomics. RiboCrypt works with any NGS-based method, but much emphasis is put on Ribo-seq data visualization.

This tutorial will walk you through usage of the app.

RibCrypt app currently supports creating interactive browser views for NGS tracks, using ORFik, Ribocrypt and massiveNGSpipe as backend.

Browser

The browser is the main coverage plot display page. It contains a click panel on the left side and display panels on the right. It displays coverage of NGS data in either transcript coordinates (default), or genomic coordinates (like IGV). Each part will now be explained:

Display panel (browser)

The display panel shows the primary settings, (study, gene, sample, etc), the possible select boxes are:

Experiment selection

  • Select an organism: Either select “ALL” to keep all experiments, or select a specific organism to select display only that subset of experiments in experiment select tab.
  • Select an experiment: The experiments contain study names combined with organism (some studies are multi species, so sometimes one study have multiple experiments). Select which one you want. There also exist merged experiments (all samples merged for the organism, etc)

Gene selection

  • Select a gene: A gene can be selected currently using:
    • Gene id (ENSEMBL)
    • Gene symbol (hgnc, etc)
  • Select a transcript: A transcript isoform of the given gene above, default is Ensembl canonical isoform. Can be selected using:
    • Transcript id (ENSEMBL)

Library selection

Each experiment usually have multiple libraries. Select which one to display, by default if you select multiple libraries they will be shown under each other.

Library are by default named:

  • Library type (RFP, RNA etc),
  • Condition (WT, KO (wild type, knock out ) etc)
  • Stage/timepoint (5h, 1d (5 hours, 1 day) etc)
  • fraction (chx, cytosolic, ATF4 (ribosomal inhibitor, cell fraction, gene) etc)
  • replicate (technical/biological replicate number (r1, r2, r3))

The resuting name above could be:

  • RFP_WT_5h_chx_cytosolic_r1

A normal thing to see is that if condition is KO (knockout), the fraction column usually contains a gene name (the name of the gene that was knocked out) Currently, best way to find SRR run number for respective sample is to go to metadata tab and search for the study.

View mode

  • Select frames display type:
    • lines (single line, most clear for middle distance (> 100 nt))
    • columns (single point bars, most clear for single nt resolution)
    • stacks (Area under curve, stacked, most clear for long distance (> 1000 nt))
    • area (Area under curve, with alpha (see-through), most clear for long distance (> 1000 nt))
  • K-mer length: When looking at a large region (> 100nt), pure coverage can usually be hard to inspect. Using K-mer length > 1 (9 is a good starting point to try), you can easily look at patterns over larger regions.

Display panel (settings)

Here additional options are shown:

  • 5’ extension (extend viewed window upstream, outside defined region)
  • 3’ extension (extend viewed window downstream, outside defined region)
  • Custom sequences highlight (Motif search, given in purple color, support IUPAC, examples: CTG or NTG)
  • Genomic region (Browse genomic window instead of gene, syntax: chromosome:start-stop:strand, human/mouse/zebrafish: 1:10000-20000:+ , yeast: I:10000-20000:+. Both 1 and chr1 works, conversion will be done internally)
  • Genomic View (Activate/deactivate genomic view, giving splice information and correct positions in genome, but a lot harder to understand)
  • Protein structures (If you click the annotation name of a transcript in the plot panel it will display the alpha-fold protein colored by the ribo-seq data displayed in the plot panel)
  • Full annotation (display full annotation or just the tx you selected)
  • Summary top track (Add an additional plot track on top, summarizing all selecte libs)
  • Select Summary display type (same as frames display type above, but for the summary track)
  • Export format (When you hover the plot top right image button, and click export (the camera button), which format to export as)

Plot panel

From the options specified in the display panel, when you press “plot” the data will be displayed. It contains the specific parts:

  1. Ribo-seq data (top), the single or multi-track data is displayed on top. By default Ribo-seq is displayed in 3 colors, where
  • red is 0 frame, the start frame of reference transcript.
  • green is +1 frame
  • blue is +2 frame
  1. Sequence track (top middle), displayes DNA sequence when zoomed in (< 100nt)
  2. Annotation track (middle), the annotation track displays the transcript annotation, together with black bars that is displayed on top of the data track.
  3. Frame track (bottom), the 3 frames displayed with given color bars:
  • white (Start codons)
  • black (Stop codons)
  • purple (Custom motifs) When zoomed in, the amino acid sequence is displaced within each frame

Analysis

Here we collect the analysis possibilities, which are usually on whole genome scale.

Codon analysis

This tab displays a heatmap of codons dwell times over all genes selected, for both A and P sites. When pressing “Differential” you swap to a between library differential codon dwell time comparison (minimum 2 libraries selected is required for this method!)

Display panel (codon)

Study and gene select works same as for browser specified above. In addition to have the option to specify all genes (default). - Select libraries (multiple allowed)

Filters

  • Codon filter value (Minimum reads in ORF to be included)
  • Codon score, all scores are normalized for both codon and count per gene level (except for sum):
    • percentage (percentage use relative to max codon, transcript normalized percentages)
    • dispersion(NB) (negative binomial dispersion values)
    • alpha(DMN) (Dirichlet-multinomial distribution alpha parameter)
    • sum (raw sum, (a very biased estimator, since some codons are used much more than others!))

Heatmap

This tab displays a heatmap of coverage per readlength at a specific region (like start site of coding sequences) over all genes selected.

Display panel (heatmap)

Study and gene select works same as for browser specified above. In addition to have the option to specify all genes (default).

  • Select libraries Only 1 library can be selected currently in heatmap mode.
  • View region Select one of:
    • Start codon
    • Stop codon
  • Normalization Normalization mode for data display, select one of:
    • transcriptNormalized (each gene counts sum to 1)
    • zscore (zscore normalization, will give better overview if 1 readlength is extreme)
    • sum (raw sum of counts, is very sensitive to extreme peaks)
    • log10sum (log10 sum of counts, is less sensitive to extreme peaks)
  • Min Readlength The minimum readlength to display from library
  • Max readlength The maximum readlength to display from library

Display panel (settings)

Here additional options are shown:

  • 5’ extension (extend viewed window upstream from point, default 30)

  • 3’ extension (extend viewed window downstreamfrom point, default 30)

  • Extension works like this, first extend to transcript coordinates.

  • After gene end extend in genomic coordinates

  • If chromosome boundary is reached, remove those genes from the full set.

Differential gene expression

Given an experiment with a least 1 design column with two values, like wild-type (WT) vs knock out (of a specific gene), you can run differential expression of genes. The output is an interactive plot, where you can also search for you target genes, making it more useable than normal expression plots, which often are very hard to read.

Display panel (Differential expression)

Organism and experiment explained above - Differential method: FPKM ratio is a pure FPKM ratio calculation without factor normalization (like batch effects), fast and crude check. DESeq2 argument gives a robust version, but only works for experiments with valid experimental design (i.e. design matrix must be full ranked, see deseq2 tutorial for details!) - Select two conditions (which 2 factors to group by)

Display panel (settings)

  • draw unregulated (show dots for unregulated genes, makes it much slower!)
  • Full annotation (all transcript isoforms, default is primary isoform only!)
  • P-value (sliding bar for p-value cutoff, default 0.05)
  • export format for plot (explained above)

Meta Browser

Display all samples for a specific organism over selected gene. This tab does not use bigwig files to load (as that would be very slow). It uses precomputed fst files of coverage over all libraries. Note: Not all isoforms are computed, by default the longest isoform is computed.

Display panel (Meta browser)

Organism, experiment and gene explained above - Group on: the metadata column to order plot by - K-means clusters: How many k-means clusters to use, if > 1, Group will be sorted within the clusters, but K-means have priority.

Statistics tab

This tab gives the statistics of over representation analysis per cluster. Using chi squared test, it gives the residuals per term from metadata (like tissue, cell-line etc). If a value is bigger than +/- 3, it means it is quite certain this is over represented.

If no clustering was applied, this tab gives the number of items per metadata term (40 brain samples, 30 kidney samples etc).

Display panel (settings)

  • Normalization (all scores are tpm normalized and log scaled)
    • transcriptNormalized (each sample counts sum to 1) (default)
    • Max normalized (each position count divided by maximum)
    • zscore (zscore normalization, (count * mean / sd) variance scaled normalization)
    • tpm (raw tpm of counts, is very sensitive to extreme peaks)
  • Color theme: Which color theme to use
  • Color scale multiplier: how much to amplify the color signal (if all is single color, try to reduce or increase this depending on which color is the majority)
  • Sort by other gene: Sort heatmap on expression of another gene (increasing). A line plot will be shown left that display the expression of that other gene per row. Also the enrichment analysis is done on bins of the other genes expression instead of metadata term.
  • Summary top track (same as in browser, the aggregate of all rows, useful to see the frame information, which is not represented in the heatmap)
  • Split by frame: (Not active yet!): When completed will display the heatmap with frame information, so rgb colors from white to dark red etc.
  • Summary display type: Same as in browser
  • Region to view: By default uses entire transcript. Can subset to only view cds or leader+cds etc.
  • Select a transcript: Which transcript isoforms not all isoforms have computed fsts. By default the longest isoforms have been computed for all genes.
  • K-mer length: Smoothens out the signal by applying a mean sliding window, default 1 (off)

Requirements

This mode is very intensive on CPU, so it requires certain pre-computed results for the back end. That is namely: - Premade collection experiments (an ORFik experiment of all experiments per organism) - Premade collection count table and library sizes (for normalizations purpose) - Premade fst serialized coverage calculation per gene (for instant loading of coverage over thousands of libraries)

Note that on the live app, the human collection (4000 Ribo-seq samples) takes around 30 seconds to plot for a ~ 2K nucleotides gene, ~99% of the time is spent on rendering the plot, not actual computation. Future investigation into optimization will be done.

Read length (QC)

This tab displays a QC of pshifted coverage per readlength (like start site of coding sequences) over all genes selected.

Display panel (Read length QC)

The display panel shows what can be specified to display, the possible select boxes are same as for heatmap above:

Plot panel

From the options specified in the display panel, when you press “plot” the data will be displayed. It contains the specific parts:

Top plot: Read length relative usage

  1. Y-axis: Score
  2. Color: Per frame (red, green, blue)
  3. Facet box: the read length

Bottom plot: Fourier transform (3nt periodicity quality, clean peak means good periodicity)

Fastq (QC)

This tab displays the fastq QC output from fastp, as a html page.

Display panel (Read length QC)

The display panel shows what can be specified to display, you can select from organism, study and library.

Plot panel

Displays the html page.

Metadata

Metadata tab displays information about studies.

Study accession number

Here you input a study accession number in the form of either:

  • SRP
  • GEO (GSE)
  • PRJNA (PRJ….)
  • PRJID (Only numbers)

Output

On top the abstract of the study is displayed, and on bottom a table of all metadata found from the study is displayed.

Additional information

All files are packed into ORFik experiments for easy access through the ORFik backend package:

File formats used internally in experiments are:

  • Annotation (gtf + TxDb for random access)
  • Fasta genome (.fasta, + index for random access)
  • Sequencing libraries (all duplicated reads are collapsed)
    • random access (only for collapsed read lengths): bigwig
    • Full genome coverage (only for collapsed read lengths): covRLE
    • Full genome coverage (split by read lengths): covRLElist
  • count Tables (Summarized experiments, r data serialized .rds)
  • Library size list (Integer vector, .rds)
  • Precomputed gene coverages per organism: fst (used for metabrowser)

massiveNGSpipe

For our webpage the processing pipeline used is massiveNGSpipe which wraps over multiple tools:

  1. Fastq files are download with ORFik download.sra
  2. Adapter is detected with either fastqc (sequence detection) and falls back to fastp auto detection.
  3. Reads are then trimmed with fastp (using the wrapper in ORFik)
  • Adapter removal specified
  • minimum read size (20nt)
  1. Read are collapsed (get the set of unique reads and put duplication count in read header)
  2. Reads are aligned with the STAR aligner (using the wrapper in ORFik), that supports contamination removal. Settings:
  • genomic coordinates (to allow both genomic and transcriptomic coordinates)
  • local alignment (to remove unknown flank effects)
  • minimum read size (20nt)
  1. When all samples of study are aligned, an ORFik experiment is created that connects each sample to metadata (condition, inhibitor, fraction, replicate etc)
  2. Bam files are then converted to ORFik ofst format
  3. These ofst files are then pshifted
  4. Faster formats are then created (bigwig, fst and covRLE) for faster visualization

Introduction to Ribo-seq

If you’re not familiar with terms like “p-shifting” or “p-site offset”, it’s best to walk through ORFikOverview vignette, especially chapter 6 “RiboSeq footprints automatic shift detection and shifting”

https://bioconductor.org/packages/release/bioc/vignettes/ORFik/inst/doc/ORFikOverview.html#riboseq-footprints-automatic-shift-detection-and-shifting

API for URL access and sharing

RiboCrypt uses the shiny router API system for creating runable links and backspacing etc. The API specificiation is the following:

Primary url:

https://ribocrypt.org/ (This leads to browser page)

Page selection API:

Page selection is done with “#” followed by the page short name, the list is the following:

  • broser page (/ or /#browser)
  • heatmap (/#heatmap)
  • codon (/#codon)
  • Differential expression (/#Differential expression)
  • Periodicity plot (/#periodicity)
  • fastq QC report (/#fastq)
  • MetaBrowser (/#MetaBrowser)
  • SRA search (/#SRA search)
  • Studies supported (/#Studies)
  • This tutorial (/#tutorial)

Example: https://ribocrypt.org/#tutorial sends you to this tutorial page

Parameter API:

Settings can be specified by using the standard web parameter API:

  • “?”, Starts the parameter specification
  • “&”, to combine terms:

Example: https://RiboCrypt.org/?dff=all_merged-Homo_sapiens&gene=ATF4-ENSG00000128272#browser will lead you to browser and insert gene ATF4 (all other settings being default).

A more complicated call would be: https://RiboCrypt.org/?dff=all_merged-Homo_sapiens&gene=ATF4-ENSG00000128272&tx=ENST00000404241&frames_type=area&kmer=9&go=TRUE&extendLeaders=100&extendTrailers=100&viewMode=TRUE&other_tx=TRUE#browser

browser:

  • dff=all_merged-Homo_sapiens (The Experiment to select: for webpage it is “study id”_“Organism”)
  • gene=ATF4-ENSG00000128272 (For webpage it is: “Gene symbol”-“Ensembl gene id”)
  • tx=ENST00000404241(isoform)
  • frames_type=area (plot type)
  • kmer=9 (window smoothing)
  • go=TRUE (plot on entry, you do not need to click plot for it to happen)
  • extendLeaders=100 (extend 100 nt upstream of 5’ UTR / Leader)
  • extendTrailers=100 (extend 100 nt downstream of 3’ UTR / Trailer)
  • viewMode=TRUE (Genomic coordinates, i.e. with introns)
  • other_tx=TRUE (Full annotation, show transcript graph for all genes/isoforms in area)
  • #browser (open browser window)

About

This app is created as a collaboration with:

  • University of Warsaw, Poland
  • University of Bergen, Norway

Main authors and contact: