Parameter settings

All the default parameters are stored in parameter.py. Users do not need to specify these parameters in most cases. Some parameters may need to be specified during installation, while others can be specified for each database or for each SPARSE run.

Installation parameters

These parameters need to be specified during installation. You only need to point BIN to a folder that contains all the executables of the dependencies, e.g.

  • BIN = '/usr/local/bin/'

Alternatively, if all the executables can be found through the system environment variable $PATH, use

  • BIN = ''

You can also specify the path to each executable file individually:

  • mash = '{BIN}mash',
  • bowtie2 = '{BIN}bowtie2',
  • bowtie2_build = '{BIN}bowtie2-build',
  • samtools = '{BIN}samtools',
  • malt_run = 'xvfb-run --auto-servernum --server-num=1 {BIN}malt-run',
  • malt_build = 'xvfb-run --auto-servernum --server-num=1 {BIN}malt-build',
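
The {BIN} placeholder in these entries is replaced by the value of BIN. The following is a minimal sketch of that substitution, assuming it behaves like Python's str.format. The helper name expand_executables is hypothetical and is not part of SPARSE.

```python
def expand_executables(bin_dir, templates):
    """Fill the {BIN} placeholder in each command template (illustrative helper)."""
    return {name: cmd.format(BIN=bin_dir) for name, cmd in templates.items()}


# A subset of the entries listed above.
templates = {
    "mash": "{BIN}mash",
    "bowtie2_build": "{BIN}bowtie2-build",
    "malt_run": "xvfb-run --auto-servernum --server-num=1 {BIN}malt-run",
}

with_folder = expand_executables("/usr/local/bin/", templates)
from_path = expand_executables("", templates)

print(with_folder["mash"])  # /usr/local/bin/mash
print(from_path["mash"])    # mash
```

Because the placeholder is joined directly to the executable name, a non-empty BIN needs its trailing slash, as in '/usr/local/bin/'.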

Runtime parameters

The following parameters can be specified on the fly for each SPARSE run. You can also set their default values for each database in: /path/to/sparse/database/dbsetting.cfg

  • mismatch = 0.05 # mismatch parameter used in the probabilistic model. A higher value reports fewer bins
  • n_thread = 20 # number of threads for SPARSE. A higher value can speed up the program
  • minFreq = 0.0001 # minimum frequency of a strain to be reported. Use minFreq = 0.000001 for ancient DNA samples
  • minNum = 10 # minimum number of specific reads required to report a strain. Use minNum = 5 or less for ancient DNA samples
  • HGT_prior = [[0.05, 0.99, 0.1], [0.02, 0.99, 0.2], [0.01, 0.99, 0.5]] # parameters to identify core genomic regions. The default values are recommended
  • UCE_prior = [487, 2000] # parameters to identify ultra-conserved elements. The default values are recommended
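
To illustrate how the two reporting thresholds interact, here is a minimal sketch that keeps only candidates reaching both minFreq and minNum. It is not SPARSE's own code: the function filter_strains, the record fields and the example values are hypothetical, and it assumes that both thresholds must be met and that the comparisons are inclusive.

```python
def filter_strains(strains, minFreq=0.0001, minNum=10):
    """Keep entries whose frequency and specific-read count both reach the thresholds."""
    return [s for s in strains if s["freq"] >= minFreq and s["reads"] >= minNum]


# Made-up candidates, for illustration only.
candidates = [
    {"name": "strain_A", "freq": 0.02, "reads": 150},
    {"name": "strain_B", "freq": 0.00005, "reads": 40},
    {"name": "strain_C", "freq": 0.001, "reads": 6},
]

default_hits = filter_strains(candidates)
adna_hits = filter_strains(candidates, minFreq=0.000001, minNum=5)

print([s["name"] for s in default_hits])  # ['strain_A']
print([s["name"] for s in adna_hits])     # ['strain_A', 'strain_B', 'strain_C']
```

With the default values only strain_A passes. The relaxed values suggested for ancient DNA samples also let the low-frequency strain_B and the low-coverage strain_C through.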

Advanced parameters

Parameters used to construct SPARSE databases, for advanced users only:

  • msh_param = '-k 23 -s 4000 -S 42' # parameters for the MASH program. Reducing k and s accelerates database indexing but introduces slightly more incorrect clusterings
  • # the following parameters point to the corresponding sub-folders. Change them if you want the actual data stored in a different folder than the database
  • mash_db = '{dbname}/mash_db'
  • bowtie_db = '{dbname}/bowtie_db'
  • placer_db = '{dbname}/placer_db'
  • taxonomy_db = '{dbname}/taxonomy'

Parameters for hierarchical clustering levels:

  • barcode_dist = [0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005]
  • barcode_tag = ['u', 's', 'r', 'p', 'n', 'm', 'e', 'c', 'a']
  • representative_level = 2
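
As a rough illustration of how a list of decreasing distance cutoffs defines nested levels, the sketch below counts how many cutoffs a given pairwise distance satisfies. It assumes each barcode_dist entry is the maximum distance at which two genomes could share a cluster at that level. The function shared_levels is hypothetical, and the actual clustering procedure in SPARSE may differ.

```python
barcode_dist = [0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005]


def shared_levels(distance, thresholds=barcode_dist):
    """Count the cutoffs that a pairwise distance does not exceed."""
    return sum(1 for t in thresholds if distance <= t)


print(shared_levels(0.03))    # 2 (within the 0.1 and 0.05 cutoffs)
print(shared_levels(0.0004))  # 8 (within every cutoff)
print(shared_levels(0.2))     # 0 (beyond the coarsest cutoff)
```

Under this assumption, two genomes at distance 0.03 would be grouped together at the two coarsest levels and separated at all finer ones.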

The following parameters are for experts only; values other than the defaults have not been tested:

  • SPARSE = sparse_folder
  • ipopt = '{SPARSE}/EM/solve-model'
  • db_columns = ['index', 'deleted', 'barcode', 'sha256', 'size']
  • metadata_columns = ['assembly_accession', 'version', 'refseq_category', 'assembly_level', 'taxid', 'organism_name', 'file_path', 'url_path']
  • taxa_columns = ['subspecies', 'species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom', 'superkingdom'],