Run the HQTB Workflow
This page explains how to run the complete HQTB (high-quality template based) workflow
and how to collect the necessary data.
For more information about the steps of the workflow, see Overview of the HQTB Workflow.
HQTB: Quickstart
The workflow can either be run directly from the command line or its functions can be called from inside a Python script. The input in both cases is a configuration file that contains all information needed (data file paths and parameters) to run it.
The configuration can be downloaded using the command line:
specimen setup config
This downloads a basic version with minimal parameters, suitable for beginners. To download the advanced version, which exposes more parameters for adjustment,
add -t hqtb-advanced to the command. For further options, refer to the manual page (specimen setup config --help).
To download the configuration file using Python, use:
import specimen
specimen.util.set_up.download_config(filename='./my_basic_config.yaml', type='hqtb-basic')
As with the command line access, the type can be changed to hqtb-advanced.
After downloading the configuration file, open it with an editor and change the parameters as needed. Missing entries will be reported when starting the workflow.
To run the workflow using the configuration file, use
specimen hqtb run pipeline "config.yaml"
on the command line or
specimen.hqtb.workflow.run_complete(config_file='config.yaml')
from inside a Python script or Jupyter Notebook with “config.yaml” being the path to your configuration file.
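When the workflow is started from a script, it can help to check the configuration path before handing it over. The following is a minimal sketch, not part of SPECIMEN: the helper name run_hqtb is hypothetical, and only the call specimen.hqtb.workflow.run_complete is taken from the text above. The import is deferred so that the path check also works where SPECIMEN is not installed.

```python
from pathlib import Path


def run_hqtb(config_file):
    """Check that the config file exists, then start the HQTB workflow.

    run_hqtb is a hypothetical convenience wrapper, not a SPECIMEN function.
    """
    config_path = Path(config_file)
    if not config_path.is_file():
        raise FileNotFoundError(f"Configuration file not found: {config_path}")

    # Deferred import: the check above does not need SPECIMEN itself.
    import specimen

    # Call as documented above; the config file is the only input.
    specimen.hqtb.workflow.run_complete(config_file=str(config_path))


# Usage (requires SPECIMEN and an edited configuration file):
# run_hqtb("config.yaml")
```

Failing early on a wrong path gives a clearer message than an error raised somewhere inside the workflow.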
Note
Additionally, the workflow can be run with a wrapper to build models for multiple genomes one after another using the same parameters.
The wrapper can be accessed using specimen hqtb run wrapper "config.yaml" or specimen.workflow.wrapper_pipeline(config_file='/User/path/to/config.yaml', parent_dir="./").
HQTB: Collecting Data
If you are just starting a new project and do not have all the data ready to go, you can use the setup function of
SPECIMEN to help you collect the data you need.
specimen.util.set_up.build_data_directories('your_folder_name')
| folder | contains | tags |
|---|---|---|
| annotated_genomes | template + input annotated + full genome files | manual, required |
| BioCyc | BioCyc smart table | manual, optional |
| medium | media config, external media | manual, optional |
| MetaNetX | MetaNetX mappings | automated, required |
| pan-core-models | pan-core models | manual, optional |
| RefSeqs | DIAMOND database for BLAST | semi, required |
| template-models | template models | manual, required |
| universal-models | universal models | manual, optional |
Note
Regarding the annotated_genomes folder, the program currently only supports
the file types GBFF and FAA + FNA (from the NCBI and PROKKA annotation pipelines respectively)
as genome annotation formats.
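Before starting a run, the filled data directory can be checked against the table and the note above. The sketch below is an illustration under stated assumptions, not a SPECIMEN function: check_data_directory is a hypothetical helper, the folder names are copied from the table, and it assumes that genome files carry the plain suffixes .gbff, .faa or .fna (compressed files would be flagged).

```python
from pathlib import Path

# Folders tagged "required" in the table above.
REQUIRED_FOLDERS = ["annotated_genomes", "MetaNetX", "RefSeqs", "template-models"]
# Annotation file types named in the note above (assumed plain suffixes).
GENOME_SUFFIXES = {".gbff", ".faa", ".fna"}


def check_data_directory(root):
    """Return a list of human-readable problems found under root."""
    root = Path(root)
    problems = []

    # Every required folder must exist and contain at least one entry.
    for name in REQUIRED_FOLDERS:
        folder = root / name
        if not folder.is_dir():
            problems.append(f"missing required folder: {name}")
        elif not any(folder.iterdir()):
            problems.append(f"required folder is empty: {name}")

    # Flag files in annotated_genomes with an unsupported suffix.
    genomes = root / "annotated_genomes"
    if genomes.is_dir():
        for path in sorted(genomes.rglob("*")):
            if path.is_file() and path.suffix.lower() not in GENOME_SUFFIXES:
                problems.append(f"unsupported genome file type: {path.name}")

    return problems
```

Calling check_data_directory('your_folder_name') returns an empty list when nothing is missing; otherwise each entry names one folder or file to fix.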
Further details for collecting the data:
- BioCyc:
  Downloading a SmartTable from BioCyc requires a subscription.
  The SmartTable needs to have the columns ‘Reactions’, ‘EC-Number’, ‘KEGG reaction’, ‘METANETX’ and ‘Reaction-Direction’.
- RefSeq:
  One way to build a DIAMOND reference database is to download a set of reference sequences from the NCBI database, e.g. in the FAA format.
  Use the function
  specimen.util.util.create_DIAMOND_db_from_folder('/User/path/input/directory', '/User/Path/for/output/', name = 'database', extention = 'faa')
  to create a DIAMOND database.
  To speed up the mapping, create an additional mapping file from, e.g., the GBFF files from NCBI using
  specimen.util.util.create_NCBIinfo_mapping('/User/path/input/directory', '/User/Path/for/output/', extention = 'gbff')
  To ensure correct mapping to KEGG, an additional information file can be created by constructing a CSV file with the following columns: ‘NCBI genome’, ‘organism’, ‘locus_tag’ (only the part up to the separator ‘_’, i.e. the part that is the same for all locus tags) and ‘KEGG.organism’.
  The information for the first three columns can be taken from the previous two steps, while for the last column the user needs to check whether the genomes have been entered into KEGG and have an organism identifier.
  This file is purely optional for running the workflow but potentially leads to better results.
- medium:
  The media, either for analysis or gap filling, can be entered into the workflow via a config file. The same media file can be used for both steps, or a separate file can be supplied for each step. The config files are from the refineGEMs [1] toolbox and access its built-in medium database. Additionally, the config files allow for manual adjustment / external input.
  An example config file can be downloaded using the following command:
  specimen.util.set_up.download_config(filename='my_media_config.yaml', type='media')
  Or via the command line (a name can additionally be set using the flag -f <name>):
  specimen setup config -t media
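The optional KEGG information file described under RefSeq can be written with the standard csv module. The following is a minimal sketch: the helper names locus_tag_prefix and write_kegg_info are hypothetical, the column names are the ones listed above, and it is an assumption that a comma-separated file with a header row is what the workflow expects.

```python
import csv

# Column names as listed in the RefSeq section above.
COLUMNS = ["NCBI genome", "organism", "locus_tag", "KEGG.organism"]


def locus_tag_prefix(locus_tag, separator="_"):
    """Return the part of a locus tag before the first separator."""
    return locus_tag.split(separator, 1)[0]


def write_kegg_info(rows, out_path):
    """Write the information file as CSV with a header row.

    Each row is a dict keyed by COLUMNS. A full locus tag may be passed
    under 'locus_tag'; it is shortened to the shared prefix.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            cleaned = dict(row)
            cleaned["locus_tag"] = locus_tag_prefix(row["locus_tag"])
            writer.writerow(cleaned)
```

Each row needs real values from the downloaded genomes and from KEGG. For instance, a locus tag of the made-up form ABC_00010 is stored as ABC.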
Note
The setup can be done via the command line as well; refer to specimen setup --help.