Building custom representative databases

You can also customize the representative databases. Here, a human genome is used as an example.

We first query its record in a SPARSE refseq database using the assembly accession:

sparse query --dbname refseq_20171014 --assembly_accession GCF_000001405.37 > human.tsv

The resulting file is:

index       deleted barcode sha256  size    assembly_accession      version refseq_category assembly_level  taxid   organism_name   file_path       url_path        subspecies      species genus   family  order   class   phylum  kingdom superkingdom

107460      -       u107460.s107460.r107460.p107460.n107460.m107460.e107460.c107460.a107460 d236b7835a3f10e596f9ce3c1f988b9e897f2dea216fd3dcde880eb91963863e        3253848404      GCF_000001405.37        37      reference genome        Chromosome      9606    Homo sapiens    -       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.37_GRCh38.p11/GCF_000001405.37_GRCh38.p11_genomic.fna.gz   -       Homo sapiens    Homo    Hominidae       Primates        Mammalia        Chordata        Metazoa Eukaryota
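Because the record is a single very wide row, it can be easier to read one field per line and to confirm that the query returned a record at all. The following is a minimal sketch in plain shell and awk, not a SPARSE feature. It assumes only that human.tsv is tab-separated with a header row, as the listing above suggests, and the function name show_record is hypothetical.

```shell
# Hypothetical helper (not part of SPARSE): print every record of a
# tab-separated file as "column<TAB>value" lines, skipping blank lines.
# Returns non-zero if there is no data row or if a row's field count
# differs from the header's.
show_record() {
    awk -F '\t' '
        NF == 0 { next }                                  # skip blank lines
        !seen_header {                                    # first non-blank line = header
            n = split($0, name, FS); seen_header = 1; next
        }
        {
            if (NF != n) {
                printf "line %d: expected %d fields, got %d\n", NR, n, NF > "/dev/stderr"
                bad = 1
                next
            }
            for (i = 1; i <= n; i++) print name[i] "\t" $i
            rows++
        }
        END { exit (bad || rows == 0) ? 1 : 0 }
    ' "$1"
}
```

Calling show_record human.tsv prints one line per column, for example taxid and 9606 separated by a tab. A non-zero exit status suggests that the accession was not found or that the file is not in the expected layout.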

This file can be used as an input to build a new representative database named “Human”:

sparse mapDB --dbname refseq --mapDB Human --seqlist human.tsv
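A custom database may need more than one genome. The sketch below merges several query outputs into a single list with one header. It rests on an assumption the text above does not state: that --seqlist accepts a file holding several records under one header. The function name merge_seqlists and the file names in the usage note are hypothetical.

```shell
# Hypothetical helper (not part of SPARSE): concatenate several query
# outputs into one list, keeping the header of the first file only and
# dropping blank lines.
merge_seqlists() {
    awk '
        NF == 0 { next }                                   # drop blank lines
        !have_header { header = $0; have_header = 1; print; next }
        $0 == header { next }                              # skip repeated headers
        { print }
    ' "$@"
}
```

For instance, merge_seqlists human.tsv mouse.tsv > hosts.tsv would produce a combined list, where mouse.tsv stands for a second query result obtained in the same way as human.tsv. Under the assumption above, hosts.tsv could then be given to --seqlist in place of human.tsv.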

Metagenomic reads are assigned using these representative databases. For details, see the section on “read-level prediction”.