This is a list of bioinformatics software available at LUNARC. Please note that this list is not exhaustive. To see whether a specific package is available and which versions are installed, log in (see how to log in) and run 'module spider package-name', e.g. 'module spider BCFtools'.
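As a minimal sketch of the lookup described above, the snippet below runs 'module spider' only where a 'module' command exists. The variable name `pkg` and the fallback message are illustrative assumptions, not part of the LUNARC setup.

```shell
# Package to look up; BCFtools is the example used in the text above.
pkg="BCFtools"

# The `module` command is normally only defined on the cluster,
# so check for it before calling it.
if command -v module >/dev/null 2>&1; then
    # Reports whether the package is available and which versions are installed.
    module spider "$pkg"
else
    echo "No 'module' command found; log in to the cluster first." >&2
fi
```

Replace the value of `pkg` with the name of any other package in this list to check its installed versions.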
Alfred is an efficient and versatile command-line application for BAM statistics, feature counting and feature annotation. It computes multi-sample quality control metrics in a read-group aware manner and supports read counting, feature annotation and haplotype-resolved consensus computation using multiple sequence alignments. Alfred is available as a Bioconda package; you must load Anaconda3/2018.12 before you can use it. https://gear.embl.de/docs/alfred/
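A hedged sketch of the loading step mentioned above: it loads the Anaconda module named in the text and then checks whether an `alfred` executable is reachable. The variable name is illustrative. Whether a conda environment also has to be activated is not stated here, so the check only reports what it finds.

```shell
# Module name and version are taken from the text above.
anaconda_module="Anaconda3/2018.12"

# Only call `module` where it exists (i.e. on the cluster).
if command -v module >/dev/null 2>&1; then
    module load "$anaconda_module"
fi

# Assumption: the executable is called `alfred`. If it is not on PATH after
# loading Anaconda, a conda environment may need to be activated first.
if command -v alfred >/dev/null 2>&1; then
    echo "alfred found at: $(command -v alfred)"
else
    echo "alfred is not on PATH in this shell" >&2
fi
```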
Amber is a package of programs for molecular dynamics simulations of proteins and nucleic acids.
AmberTools consists of several independently developed packages that work well by themselves, and with Amber. The suite can also be used to carry out complete molecular dynamics simulations, with either explicit water or generalized Born solvent models. http://ambermd.org/AmberTools.php
ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others). http://annovar.openbioinformatics.org/en/latest/
AutoDock Vina is an open-source program for doing molecular docking. http://vina.scripps.edu/index.html
BBMap is a short-read aligner, distributed alongside other bioinformatics tools in the BBTools suite. https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/
BCFtools - Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants. http://www.htslib.org/
The Illumina sequencing instruments generate per-cycle base call (BCL) files at the end of the sequencing run. A majority of analysis applications use per-read FASTQ files as input for analysis. You can use the bcl2fastq2 Conversion Software v2.19 to convert base call (BCL) files from a sequencing run into FASTQ files. https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html
beagle-lib is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. https://github.com/beagle-dev/beagle-lib
BEAST is a cross-platform program for Bayesian analysis of molecular sequences using MCMC. It is entirely orientated towards rooted, time-measured phylogenies inferred using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but is also a framework for testing evolutionary hypotheses without conditioning on a single tree topology. BEAST uses MCMC to average over tree space, so that each tree is weighted proportional to its posterior probability. http://beast.community/
Bedtools is a fast, flexible toolset for genome arithmetic. http://bedtools.readthedocs.io/en/latest/
Biopython is a set of freely available tools for biological computation written in Python by an international team of developers. It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics. http://www.biopython.org
BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more. https://genome.ucsc.edu/cgi-bin/hgBlat?command=start
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome. http://bowtie-bio.sourceforge.net/index.shtml
Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie2 supports gapped, local, and paired-end alignment modes. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.
The bx-python project is a Python library and associated set of scripts to allow for rapid implementation of genome scale analyses. https://github.com/bxlab/bx-python
- Cell Ranger
Cell Ranger is a set of analysis pipelines that process Chromium single-cell RNA-seq output to align reads, generate gene-cell matrices and perform clustering and gene expression analysis. https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger
- Cell Ranger ATAC
Cell Ranger ATAC is a set of analysis pipelines that process Chromium Single Cell ATAC data.
UCSF Chimera is a highly extensible program for interactive visualization and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles. https://www.cgl.ucsf.edu/chimera/
Chimerascan is a software package that detects gene fusions in paired-end RNA sequencing (RNA-Seq) datasets. Recurrent gene fusions (a.k.a. chimeras) are a prevalent class of mutations that can produce functional transcripts that contribute to cancer progression. Recent advances in high-throughput sequencing technologies have enabled reliable gene fusion discovery. https://code.google.com/archive/p/chimerascan/
CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent. https://cnvkit.readthedocs.io/en/stable/
CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from targeted DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent. This is a bundle to provide dependencies for cnvkit that aren't available in the standard EasyBuild Python. https://cnvkit.readthedocs.io/en/stable/
CNVnator is a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads. https://github.com/abyzovlab/CNVnator
Cufflinks performs transcript assembly, differential expression, and differential regulation analysis for RNA-Seq. http://cole-trapnell-lab.github.io/cufflinks
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. http://opensource.scilifelab.se/projects/cutadapt/
deepTools is a suite of Python tools particularly developed for the efficient analysis of high-throughput sequencing data, such as ChIP-seq, RNA-seq or MNase-seq. https://deeptools.readthedocs.io/en/develop/
EMBOSS is 'The European Molecular Biology Open Software Suite'. EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. http://emboss.sourceforge.net/
EricScript is a computational framework for the discovery of gene fusions in paired end RNA-seq data. https://sites.google.com/site/bioericscript/
FastQC is a quality control tool for high-throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
The FASTX-Toolkit is a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. http://hannonlab.cshl.edu/fastx_toolkit/
FEELnc (FlExible Extraction of LncRNAs) is an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames.
fineRADstructure is a powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry. The high resolution of this method derives from utilizing haplotype linkage information and from focusing on the most recent coalescence (common ancestry) among the sampled individuals to derive a "co-ancestry matrix", a summary of nearest neighbor haplotype relationships in the dataset. Further advantages when compared with other model-based methods (e.g. STRUCTURE and ADMIXTURE) include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population. http://cichlid.gurdon.cam.ac.uk/fineRADstructure.html
FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads from next-generation sequencing experiments. FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads. The resulting longer reads can significantly improve genome assemblies. They can also improve transcriptome assembly when FLASH is used to merge RNA-seq data. https://ccb.jhu.edu/software/FLASH/
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. https://github.com/ekg/freebayes
FusionCatcher searches for novel/known fusion genes, translocations, and chimeras in RNA-seq data (paired-end reads from Illumina NGS platforms like Solexa/HiSeq/NextSeq/MiSeq) from diseased samples. https://github.com/ndaniel/fusioncatcher
The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size. http://www.broadinstitute.org/gatk/
GENESIS (short for GEneral NEural SImulation System) is a general purpose simulation platform that was developed to support the simulation of neural systems ranging from subcellular components and biochemical reactions to complex models of single neurons, simulations of large networks, and systems-level models. http://genesis-sim.org/
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. http://www.gromacs.org/
HISAT is a fast and sensitive spliced alignment program for mapping RNA-seq reads. It is recommended that HISAT and TopHat2 users switch to HISAT2. https://ccb.jhu.edu/software/hisat/index.shtml
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) against the general human population (as well as against a single reference genome). HISAT2 is a successor to both HISAT and TopHat2. https://ccb.jhu.edu/software/hisat2/index.shtml
HOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and next-gen sequencing analysis. It is a collection of command line programs for Unix-style operating systems written in Perl and C++. HOMER was primarily written as a de novo motif discovery algorithm and is well suited for finding 8-20 bp motifs in large scale genomics data. HOMER contains many useful tools for analyzing ChIP-Seq, GRO-Seq, RNA-Seq, DNase-Seq, Hi-C and numerous other types of functional genomics sequencing data sets. http://homer.ucsd.edu/homer/
HTSeq is a Python library for analysing high-throughput sequencing data. http://htseq.readthedocs.io/
HTSlib is a C library for reading/writing high-throughput sequencing data. This package includes the utilities bgzip and tabix. http://www.htslib.org/
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. http://www.broadinstitute.org/software/igv/
igvtools is a package of command-line utilities for preprocessing, computing feature count density (coverage), sorting, and indexing data files. http://www.broadinstitute.org/software/igv/igvtools_commandline.
IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
IQ-TREE performs efficient tree reconstruction. It is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. http://www.iqtree.org
Jellyfish is a tool for fast, memory-efficient counting of k-mers in DNA. http://www.cbcb.umd.edu/software/jellyfish/
kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. https://pachterlab.github.io/kallisto/about
MACS: Model-based Analysis for ChIP-Seq data. https://github.com/taoliu/MACS/
MAGeCK: Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout. https://sourceforge.net/projects/mageck/
Manta calls structural variants (SVs) and indels from mapped paired-end sequencing reads. It is optimized for analysis of germline variation in small sets of individuals and somatic variation in tumor/normal sample pairs. Manta discovers, assembles and scores large-scale SVs, medium-sized indels and large insertions within a single efficient workflow. https://github.com/Illumina/manta
MAVIS is a command-line tool, written in Python (version 3 or later required), for the post-processing of structural variant calls. On Aurora you will need to load GCC and OpenMPI (module load GCC/7.3.0-2.30 OpenMPI/3.1.1) and Python 3.7.0 (module load Python/3.7.0). http://mavis.bcgsc.ca/
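The prerequisite loads named above can be sketched as a short sequence. The module names and versions are the ones quoted in the text; the variable names and the final interpreter check are illustrative assumptions. The name of the MAVIS module itself is not given here, so it is left out.

```shell
# Module names and versions quoted from the text above.
gcc_module="GCC/7.3.0-2.30"
mpi_module="OpenMPI/3.1.1"
python_module="Python/3.7.0"

# Only call `module` where it exists (i.e. on the cluster).
if command -v module >/dev/null 2>&1; then
    module load "$gcc_module" "$mpi_module"
    module load "$python_module"
fi

# Show which Python 3 interpreter is active, if one is present.
if command -v python3 >/dev/null 2>&1; then
    python3 --version
fi
```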
The MEME Suite allows the biologist to discover novel motifs in collections of unaligned nucleotide or protein sequences, and to perform a wide variety of other motif-based analyses.
The MEME Suite supports motif-based analysis of DNA, RNA and protein sequences. It provides motif discovery algorithms using both probabilistic (MEME) and discrete models (DREME), which have complementary strengths. It also allows discovery of motifs with arbitrary insertions and deletions (GLAM2). In addition to motif discovery, the MEME Suite provides tools for scanning sequences for matches to motifs (FIMO, MAST and GLAM2Scan), scanning for clusters of motifs (MCAST), comparing motifs to known motifs (Tomtom), finding preferred spacings between motifs (SpaMo), predicting the biological roles of motifs (GOMo), measuring the positional enrichment of sequences for known motifs (CentriMo), and analyzing ChIP-seq and other large datasets (MEME-ChIP). http://meme-suite.org/doc/overview.html?man_type=web
Molden is a package for displaying Molecular Density from the Ab Initio packages GAMESS-UK, GAMESS-US and GAUSSIAN and the Semi-Empirical packages Mopac/Ampac. http://www.cmbi.ru.nl/molden/
MuTect is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next generation sequencing data of cancer genomes. http://archive.broadinstitute.org/cancer/cga/mutect
MultiQC aggregates results from bioinformatics analyses across many samples into a single report. It searches a given directory for analysis logs and compiles an HTML report. It is a general-purpose tool, well suited to summarising the output from numerous bioinformatics tools. http://multiqc.info/
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. http://www.ks.uiuc.edu/Research/namd/
The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives. https://github.com/ncbi/ncbi-vdb
NGS is a new, domain-specific API for accessing reads, alignments and pileups produced from Next Generation Sequencing. https://github.com/ncbi/ngs
Picard is a set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. https://broadinstitute.github.io/picard/
Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads. http://gmt.genome.wustl.edu/packages/pindel/
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. http://zzz.bwh.harvard.edu/plink/index.shtml
PLUMED is an open source library for free energy calculations in molecular systems which works together with some of the most popular molecular dynamics engines. Free energy calculations can be performed as a function of many order parameters with a particular focus on biological problems, using state of the art methods such as metadynamics, umbrella sampling and Jarzynski-equation based steered MD. The software, written in C++, can be easily interfaced with both Fortran and C/C++ codes. http://www.plumed.org/
Protégé is an ontology editor and framework for building intelligent systems. https://protege.stanford.edu/products.php
Pysam is a Python module for reading, manipulating and writing genomic data sets. http://pysam.readthedocs.io/en/latest/
QCTOOL is a command-line utility program for basic quality control of GWAS datasets and other genome-wide data. It supports the same file formats used by the WTCCC studies, as well as the binary file format described on the QCTOOL site and the Variant Call Format, and is designed to work seamlessly with SNPTEST and related tools. http://www.well.ox.ac.uk/~gav/qctool_v2/
RasMol is a program for molecular graphics visualisation. http://www.openrasmol.org/
ROOT is a modular scientific software toolkit. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. https://root.cern.ch//
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels.
RSeQC provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity etc. http://rseqc.sourceforge.net/
RevBayes provides an interactive environment for statistical computation in phylogenetics. It is primarily intended for modeling, simulation, and Bayesian inference in evolutionary biology, particularly phylogenetics. http://revbayes.github.io/intro.html
samblaster: a tool to mark duplicates and extract discordant and split reads from SAM files. https://github.com/GregoryFaust/samblaster
Salmon is a wicked-fast program that produces highly accurate, transcript-level quantification estimates from RNA-seq data. https://combine-lab.github.io/salmon/
SAM Tools provide various utilities for manipulating alignments in the SAM/BAM/CRAM format, including sorting, merging, indexing and generating alignments in a per-position format. http://www.htslib.org
SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data. https://www.seqan.de/
SeqMonk is a tool to visualise and analyse high-throughput mapped sequence data. https://www.bioinformatics.babraham.ac.uk/projects/seqmonk/
Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip. https://github.com/lh3/seqtk
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. https://snakemake.readthedocs.io/en/stable/
SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes). http://snpeff.sourceforge.net
SNPTEST is a program for the analysis of single SNP association in genome-wide studies. https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
SplAdder, short for Splicing Adder, is a toolbox for alternative splicing analysis based on RNA-Seq alignment data. https://github.com/ratschlab/spladder
The Sequence Read Archive (SRA) Toolkit, and the source-code SRA System Development Kit (SDK), will allow you to programmatically access data housed within SRA and convert it from the SRA format. https://www.ncbi.nlm.nih.gov/sra/docs/
Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography. http://catchenlab.life.illinois.edu/stacks/
STAR aligns RNA-seq reads to a reference genome using uncompressed suffix arrays. https://github.com/alexdobin/STAR
STAR-Fusion uses the STAR aligner to identify candidate fusion transcripts supported by Illumina reads. STAR-Fusion further processes the output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set. https://github.com/STAR-Fusion/STAR-Fusion
Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs. The germline caller employs an efficient tiered haplotype model to improve accuracy and provide read-backed phasing, adaptively selecting between assembly and a faster alignment-based haplotyping approach at each variant locus. The germline caller also analyzes input sequencing data using a mixture-model indel error estimation method to improve robustness to indel noise. https://github.com/Illumina/strelka
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. https://ccb.jhu.edu/software/stringtie/
Subread: high-performance read alignment, quantification and mutation discovery.
The Subread package comprises a suite of software programs for processing next-gen sequencing read data including:
- Subread: a general-purpose read aligner which can align both genomic DNA-seq and RNA-seq reads. It can also be used to discover genomic mutations including short indels and structural variants.
- Subjunc: a read aligner developed for aligning RNA-seq reads and for the detection of exon-exon junctions. Gene fusion events can be detected as well.
- featureCounts: a software program developed for counting reads to genomic features such as genes, exons, promoters and genomic bins.
- Sublong: a long-read aligner that is designed based on seed-and-vote.
- exactSNP: a SNP caller that discovers SNPs by testing signals against local background noise.
These programs are also implemented in the Bioconductor R package Rsubread.
TelSeq is software that estimates telomere length from whole genome sequencing data (BAMs). https://github.com/zd1/telseq
Structural variant calling: identify chromosomal rearrangements using Mate Pair or Paired End sequencing data. TIDDIT identifies intra and inter-chromosomal translocations, deletions, tandem-duplications and inversions, using supplementary alignments as well as discordant pairs. https://github.com/SciLifeLab/TIDDIT
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. It is recommended that HISAT and TopHat(2) users switch to HISAT2. https://ccb.jhu.edu/software/tophat/index.shtml
Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters are supplied on the command line. http://www.usadellab.org/cms/?page=trimmomatic
Kent tools: command-line tools from the UCSC Genome Browser. http://bioinformatics.readthedocs.io/en/latest/kent/
VarScan: variant calling and somatic mutation/CNV detection for next-generation sequencing data.
vcf2maf converts a VCF into Mutation Annotation Format (MAF), where each variant is annotated to only one of all possible gene isoforms. https://github.com/mskcc/vcf2maf
The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files. http://vcftools.sourceforge.net/
Velvet is a sequence assembler for very short reads. https://www.ebi.ac.uk/~zerbino/velvet/
ZIFA is a zero-inflated dimensionality reduction algorithm for single-cell data. https://github.com/epierson9/ZIF