Welcome to SPARSE’s documentation!¶
SPARSE indexes >100,000 reference genomes in public databases into hierarchical clusters and uses it to predict origins of metagenomic reads.
Installation guide¶
SPARSE runs on Unix and requires Python >= version 2.7
hardware
- SPASE runs best in multi-processes mode. Thus servers with at least 10-20 CPU cores were suggested.
- >= 300 GBytes of memory is required to handle over 10 million metagenomic reads.
- >= 500 GBytes of storage space.
System modules (Ubuntu 16.04):
- pip
- gfortran
- llvm
- libncurses5-dev
- cmake
- xvfb-run (for malt, optional)
3rd-party software:
- samtools (>=1.2)
- mash (>=1.1.1)
- bowtie2 (>=2.3.2)
- malt (>=0.4.0) (optional)
See requirements.txt for python module dependencies.
Installation with a Conda environment (Ubuntu)¶
To install SPARSE and all its dependencies in an isolated conda environment, you can use the environment file provided.
git clone git@github.com:zheminzhou/SPARSE.git
cd SPARSE/envs
conda env create -f sparse_env.yml
source activate sparse
Installation via PIP¶
pip install meta-sparse
Install from source files (Ubuntu)¶
sudo apt-get update
sudo apt-get install gfortran llvm libncurses5-dev cmake python-pip samtools bowtie2
git clone https://github.com/zheminzhou/SPARSE
cd SPARSE/EM && make
pip install -r requirements.txt
Change the parameters if needed.
Updating SPARSE¶
To update SPARSE, move to the installation directory and pull the latest version:
cd SPARSE
git pull
Toy example of SPARSE¶
There is a bash script in the examples folder that runs through an example of SPARSE. This example takes reads from an ancient DNA sample of 800 year old Salmonella enterica from this publication.
The contents of the shell script download and create a Bowtie database of some Salmonella genomes (01-03) and then maps some of the ancient reads and displays a report.
sparse init --dbname toyset
sparse index --dbname toyset --seqlist Salmonella_toyset.txt
sparse query --dbname toyset --tag m==a | python ../SPARSE.py mapDB --dbname toyset --mapDB Salmonella --seqlist stdin
sparse predict --dbname toyset --r1 Ragna.sample.fq.gz --workspace Ragna_toy --mapDB Salmonella
sparse report Ragna_toy
cat Ragna_toy/profile.txt
sparse extract --workspace Ragna_toy --ref_id 10
Parameter settings¶
All the default parameters are stored in parameter.py. Users do not need to specifiy these parameters in most of cases. Some parameters here may need to be specified during installation, while others can be specified for each database or for each SPARSE run.
Installation parameters¶
Parameters that need to be specified during installation, You need only point BIN to a folder that contains all the executables of the dependencies, e. g.
- BIN = ‘/usr/local/bin/’
Alternatively, if you have all executables in the system environmental parameter $PATH, use
- BIN = ‘’
You can also specify a pointer for each executable file :
- mash = ‘{BIN}mash’,
- bowtie2 = ‘{BIN}bowtie2’,
- bowtie2_build = ‘{BIN}bowtie2-build’,
- samtools = ‘{BIN}samtools’,
- malt_run = ‘xvfb-run –auto-servernum –server-num=1 {BIN}malt-run’,
- malt_build = ‘xvfb-run –auto-servernum –server-num=1 {BIN}malt-build’,
Runtime parameters¶
The following parameters that can be specified on-fly. You can also specify there default values for each database in: /path/to/sparse/database/dbsetting.cfg
- mismatch = 0.05 # mismatch parameter is used in the probalistic model. Given a higher value will report less bins
- n_thread = 20 # number of threads for SPARSE. Higher value can accelerate the program
- minFreq = 0.0001 # Minimum frequencies of a strain to be reported. Use minFreq = 0.000001 for ancient DNA samples
- minNum = 10 # Minimum number of specific reads to report a strain. Use * minNum = 5 or less for ancient DNA samples
- HGT_prior = [[0.05, 0.99, 0.1], [0.02, 0.99, 0.2], [0.01, 0.99, 0.5]] # parameters to identify core genomic regions. Suggest to use default values
- UCE_prior = [487, 2000] # parameters to identify ultra-conserved elements. Suggest to use default values
Advanced parameters¶
Parameters to construct SPARSE databases, only for advanced uses:
- msh_param = ‘-k 23 -s 4000 -S 42’ # change the parameter for the MASH program. reduce k and s accelerate the database indexing while bring in slightly more incorrect clusterings
- # following three parameters are pointers to corresponding sub-folders. Change them if you want the actual data in a different folder than the database
- mash_db = ‘{dbname}/mash_db’
- bowtie_db = ‘{dbname}/bowtie_db’
- placer_db = ‘{dbname}/placer_db’
- taxonomy_db = ‘{dbname}/taxonomy’
Parameters for hierarchical clustering levels:
- barcode_dist = [0.1, 0.05,0.02,0.01, 0.005,0.002,0.001, 0.0005]
- barcode_tag = [‘u’, ‘s’ ,’r’ ,’p’ , ‘n’ ,’m’ ,’e’ , ‘c’ ,’a’]
- representative_level = 2
These parameters are for experts, and have not been tested for varied values
- SPARSE = sparse_folder
- ipopt = ‘{SPARSE}/EM/solve-model’
- db_columns = [‘index’, ‘deleted’, ‘barcode’, ‘sha256’, ‘size’]
- metadata_columns = [‘assembly_accession’, ‘version’, ‘refseq_category’, ‘assembly_level’, ‘taxid’, ‘organism_name’, ‘file_path’, ‘url_path’]
- taxa_columns = [‘subspecies’, ‘species’, ‘genus’,’family’, ‘order’, ‘class’, ‘phylum’, ‘kingdom’, ‘superkingdom’],
RefSeq database¶
The refseq database from NCBI stores >100,000 complete genomes and drafts that compass all tree of life. We firstly construct an empty database folder and assigns default control parameters for the database.
sparse init --dbname refseq
Index refseq database or update an exising database¶
A second command allows SPARSE to download all genomes in refseq on-fly and construct the database. The efficiency of the indexing process depends on both the downloading speed and the number of assigned CPUs. When assigning 20 CPUs, you can expect the whole process to finish in about one day.
sparse index --dbname refseq --update
Be aware that the newly added genomes are not ready for metagenomic reads. You need to run another command to update your representative databases.
We also release a pre-compiled database named “refseq_20180519”, on the basis of NCBI RefSeq at 2018.05.19, at http://enterobase.warwick.ac.uk/sparse/
This database contains the MASH indexed master database and four default mapping databases:
representative
subpopulation
Virus
Eukaryota
As well as reference genomes for three important animal hosts:
Human
Swine
Bovine
To use the database, just download and untar the package (~350 GB):
curl -o refseq_20180519.tar.gz http://enterobase.warwick.ac.uk/sparse/refseq_20180519.tar.gz
tar -vxzf refseq_20180519.tar.gz
Custom databases¶
You can also create a custom database, or add in custom genomes to an old database.
Representatives database¶
In order to do read-level taxonomic binning, representative databases need to be compiled. Four default databases were designed cover most of the genetic diversities in metagenomic samples.
ANI 98% database for bacteria and archaea
sparse query --dbname refseq --default representative | python SPARSE.py mapDB --dbname refseq --seqlist stdin --mapDB representative
ANI 99% database for bacteria and archaea (always use together with representative database)
sparse query --dbname refseq --default subpopulation | python SPARSE.py mapDB --dbname refseq --seqlist stdin --mapDB subpopulation
ANI 99% virus database
sparse query --dbname refseq --default Virus | python SPARSE.py mapDB --dbname refseq --seqlist stdin --mapDB Virus
ANI 99% eukaryota database (genome size <= 200MB)
sparse query --dbname refseq --default Eukaryota | python SPARSE.py mapDB --dbname refseq --seqlist stdin --mapDB Eukaryota
Custom databases
In order to index a differet set of references into a representative database, see [here](custom.md)
Building custom representative databases¶
You can also custom the representative databases. Here a human genome is used as an example:
We first query its record in a SPARSE refseq database using the assembly accession:
sparse query --dbname refseq_20171014 --assembly_accession GCF_000001405.37 > human.tsv
The resulting file is:
index deleted barcode sha256 size assembly_accession version refseq_category assembly_level taxid organism_name file_path url_path subspecies species genus family order class phylum kingdom superkingdom
107460 - u107460.s107460.r107460.p107460.n107460.m107460.e107460.c107460.a107460 d236b7835a3f10e596f9ce3c1f988b9e897f2dea216fd3dcde880eb91963863e 3253848404 GCF_000001405.37 37 reference genome Chromosome 9606 Homo sapiens - ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/GCF_000001405.37_GRCh38.p11_genomic.fna.gz - Homo sapiens Homo Hominidae Primates Mammalia Chordata Metazoa Eukaryota
This file can be used as an input to build a new representative database named “Human”:
sparse mapDB --dbname refseq --mapDB Human --seqlist human.tsv
Metagenomic reads are assigned using these representative databases, details see section on “read-level prediction”.
MASH based taxonomic assignment for genomic assemblies or read sets¶
SPARSE allows ultra-efficient taxonomic assignment with genomic assemblies or read sets, by using MASH to approximate average nucleotide identities (ANI).
genomic assembly (fasta format):
sparse mash --dbname refseq --query <assembly file>
Read set (in fastq format, either gzipped or not) :
sparse mash --dbname refseq --query <read file> --read
Map metagenomic reads onto representative databases¶
Template of how to map metagenomic reads onto a representative database:
sparse predict --dbname </path/to/SPARSE/database> --mapDB <comma delimited MapDB's> --r1 <read_1> --r2 <read_2> --workspace <workspace_name>
Example (single end):
sparse predict --dbname refseq --mapDB representative,subpopulation,Virus --r1 read1.fq.gz --workspace read1
The outputs consist of two files, with detailed information in the [“output” section](output.md).
Extract reference specific reads¶
You first need to find out the indices of the interesting references in the [output files](output.md), and use the indexes to extract related reads.
sparse extract --dbname refseq --workspace read1 --ref_id <comma delimited indices>
For example, we extract all reads specific to reference id 16, which is a Vibrio cholerae genome.
sparse extract --dbname refseq --workspace read1 --ref_id 16
Outputs¶
Output for ‘sparse predict’¶
The taxonomic profiling results for ‘sparse query’ are saved in <workspace>/profile.txt
The first three rows in ‘profile.txt’ summarize the status of the reads from the metagenomic sample:
Total <No. reads> <No. matched reads>
Unmatched <% unmatched reads in total reads> 0.000
Uncertain_match <% unreliable matches in total reads> <% unreliable matches in total matches>
Example:
Total 26726530 23783388.0
Unmatched 11.012 0.000
Uncertain_match 36.102 40.570
This example suggests that 89% of reads are matched against at least one reference in the database. Subsequently, 60% of all matches found are used for taxonomic predictions.
The following lines describe the prediction at different taxonomic levels, in the following format:
<SPARSE group> <% in total reads> <% in matched reads> <taxonomic labels> (<reference IDs>)
Example:
~154 2.1706 2.4392 Bacteria|-|Actinobacteria|Actinobacteria|Micrococcales|Micrococcaceae (15969,66991,66935,66915,67189,110179,40981,154,67166,67220,114405,66878,66930,82153,40861,40710,67029)
u154 2.1701 2.4387 Bacteria|-|Actinobacteria|Actinobacteria|Micrococcales|Micrococcaceae|Rothia (15969,66991,66935,66915,67189,110179,40981,154,67166,67220,114405,66878,66930,82153,40861,40710,67029)
s154 2.1551 2.4217 Bacteria|-|Actinobacteria|Actinobacteria|Micrococcales|Micrococcaceae|Rothia|Rothia dentocariosa (*Rothia sp. HMSC067H10/*Rothia sp. HMSC064D08/*Rothia sp. HMSC071F11/*Rothia sp. HMSC069C01) (15969,66991,66935,66915,67189,110179,40981,154,67166,67220,114405,66878,66930,82153,40861,40710,67029)
~613 1.4988 1.6843 Bacteria|-|Firmicutes|Negativicutes|Veillonellales|Veillonellaceae (16778,16416,117596,16415,10931,17276,113949,60730,613)
u613 1.4934 1.6782 Bacteria|-|Firmicutes|Negativicutes|Veillonellales|Veillonellaceae|Veillonella (16778,16416,117596,16415,10931,17276,113949,60730,613)
s613 1.4507 1.6302 Bacteria|-|Firmicutes|Negativicutes|Veillonellales|Veillonellaceae|Veillonella|Veillonella parvula (*Veillonella sp. 6_1_27/*Veillonella sp. S13054-11/*Veillonella sp. 3_1_44) (16778,16416,117596,16415,10931,17276,113949,60730,613)
r15969 0.4677 0.5256 Bacteria|-|Actinobacteria|Actinobacteria|Micrococcales|Micrococcaceae|Rothia|Rothia dentocariosa|- (15969)
p15969 0.3907 0.4391 Bacteria|-|Actinobacteria|Actinobacteria|Micrococcales|Micrococcaceae|Rothia|Rothia dentocariosa|-|Rothia dentocariosa M567: GCF_000143585.1 (15969)
r16416 0.1838 0.2065 Bacteria|-|Firmicutes|Negativicutes|Veillonellales|Veillonellaceae|Veillonella|*Veillonella sp. 6_1_27|- (16416)
p16416 0.1631 0.1833 Bacteria|-|Firmicutes|Negativicutes|Veillonellales|Veillonellaceae|Veillonella|*Veillonella sp. 6_1_27|-|Veillonella sp. 6_1_27: GCF_000163735.1 (16416)
The SPARSE groups are internal hierarchical clustering results stored in the SPARSE database. The group label consists of two components. The prefix presents the ANI level of the cluster and the following number presents the designation of the cluster.
For example, ‘s613’ is a cluster ‘613’ in ‘s’ level (ANI 95%, “species level”) The correlation between prefix and ANI level is:
~ <90% ANI
u 90% ANI
s 95% ANI
r 98% ANI
p 99% ANI
n 99.5% ANI
m 99.8% ANI
e 99.9% ANI
c 99.95% ANI
a 100% ANI
‘s’ (ANI 95%) is normally treated as a ‘gold standard’ criterion for species definition.
For each SPARSE goup, the traditional taxonomic labels follow the format:
<superkingdom>|<kingdom>|<phylum>|<class>|<order>|<family>|<genus>|<species>|<subspecies>|<reference_genome>
These taxonomic labels are summarised from the input database. Sometimes multiple species will be associated with one SPARSE group:
s613 1.4507 1.6302 Bacteria|-|Firmicutes|Negativicutes|Veillonellales|Veillonellaceae|Veillonella|Veillonella parvula (*Veillonella sp. 6_1_27/*Veillonella sp. S13054-11/*Veillonella sp. 3_1_44) (16778,16416,117596,16415,10931,17276,113949,60730,613)
In this example, group s613 is associated with four different species:
Veillonella parvula
*Veillonella sp. 6_1_27
*Veillonella sp. S13054-11
*Veillonella sp. 3_1_44
Informal names are marked with prefix “*”. The most probable species is shown first, and followed by the other three names in a bracket. There is another bracket after the taxonomic labels:
(16778,16416,117596,16415,10931,17276,113949,60730,613)
These are the IDs of the actual reference genomes that were found in the database. They can be used to extract reference specific reads using the command ‘sparse extract’.
Output for ‘sparse report’¶
sparse can provide a report that combines multiple ‘sparse predict’ runs together into a tab-delimited text file. This command also identifies potential pathogens in the predictions.
#Group #Pathogenic ERR1659111 ERR1659110 #Species #Taxon
s3080 non 4.47309775569 4.84028327303 Actinomyces dentalis (*Actinomyces sp. oral taxon 414) Bacteria|-|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae|Actinomyces|Actinomyces dentalis (*Actinomyces sp. oral taxon 414)
s1438 non 0.821962806352 3.57658189557 Desulfomicrobium orale Bacteria|-|Proteobacteria|Deltaproteobacteria|Desulfovibrionales|Desulfomicrobiaceae|Desulfomicrobium|Desulfomicrobium orale
s9975 non 2.04489272864 1.85184148971 *Anaerolineaceae bacterium oral taxon 439 Bacteria|-|Chloroflexi|Anaerolineae|Anaerolineales|Anaerolineaceae|-|*Anaerolineaceae bacterium oral taxon 439
s939 non 1.81538010098 0.712860400235 Pseudopropionibacterium propionicum Bacteria|-|Actinobacteria|Actinobacteria|Propionibacteriales|Propionibacteriaceae|Pseudopropionibacterium|Pseudopropionibacterium propionicum
s8820 non 1.67063037869 0.491279312566 *Ottowia sp. Marseille-P4747 (*Ottowia sp. oral taxon 894) Bacteria|-|Proteobacteria|Betaproteobacteria|Burkholderiales|Comamonadaceae|Ottowia|*Ottowia sp. Marseille-P4747 (*Ottowia sp. oral taxon 894)
s2215 non 1.31802856115 0.34575838713 Lautropia mirabilis Bacteria|-|Proteobacteria|Betaproteobacteria|Burkholderiales|Burkholderiaceae|Lautropia|Lautropia mirabilis
s2590 non 0.665641018802 0.612783437737 Actinomyces cardiffensis Bacteria|-|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae|Actinomyces|Actinomyces cardiffensis
s2189 non 0.87220732902 0.296597041195 Corynebacterium matruchotii Bacteria|-|Actinobacteria|Actinobacteria|Corynebacteriales|Corynebacteriaceae|Corynebacterium|Corynebacterium matruchotii
s108979 non 0.295928369726 0.857545958706 *Actinomyces sp. oral taxon 897 Bacteria|-|Actinobacteria|Actinobacteria|Actinomycetales|Actinomycetaceae|Actinomyces|*Actinomyces sp. oral taxon 897
The first line shows the samples in the report, as well as additional annotations (starts with ‘#’). #Group and #Taxon are identical to the ‘sparse predict’ output. #Species is a simple extraction of the most probably species in the #Taxon column and #Pathogenic contains potential pathogen predictions encoded as:
non - not a pathogen
* - commensal and normally not a pathogen
** - Possibly a pathogen
*** - Pathogen
**** - Important pathogen, possibly fatal
The numbers shows the abundances of the species in each metagenomic read set. It is normally shown in percentages, unless parameter ‘–absolute’ is applied, which changes the numbers to be absolute read counts.
The last row of the output is a summary of all unknown/uncertain reads without taxonomic classifications.
Citation¶
SPARSE has been pulished as a conference paper in RECOMB 2018. Zhou Z., Luhmann N., Alikhan NF., Quince C., Achtman M. (2018) Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes. In: Raphael B. (eds) Research in Computational Molecular Biology. RECOMB 2018. Lecture Notes in Computer Science, vol 10812. Springer, Cham. DOI: https://doi.org/10.1007/978-3-319-89929-9_15
A preprint can also be found in BioRxiv : https://www.biorxiv.org/content/early/2017/11/07/215707