specimen.hqtb.core package
specimen.hqtb.core.refinement subpackage
specimen.hqtb.core.refinement.annotation submodule
specimen.hqtb.core.refinement.cleanup submodule
Perform the second refinement step, cleanup, on a model.
The second refinement step resolves the following issues:
(optional) checking direction of reactions with BioCyc
(optional) gapfilling (a) using refineGEMs GeneGapFiller (b) using cobra + universal model with reactions + a set of media
find and/or resolve duplicates (reactions and metabolites)
- Args:
- model (str):
The Path to an sbml model.
- dir (str):
Path to the directory of the output.
- biocyc_db (str, optional):
Path to the BioCyc/MetaCyc reaction database file. Defaults to None, which leads to skipping the direction check.
- run_gene_gapfiller (Union[None,dict], optional):
If a dictionary is given, tries to run the GeneGapFiller. If set to None, the gap-filling will be skipped. The dictionary needs to contain the keys saved in GGF_REQS. Defaults to None.
- check_dupl_reac (bool, optional):
Option to check for duplicate reactions. Defaults to False.
- check_dupl_meta (str, optional):
Option to check for duplicate metabolites. Defaults to ‘default’, which checks based on MetaNetX first. Further options include ‘skip’ and ‘exhaustive’ (check for all possibilities).
- remove_unused_meta (bool, optional):
Option to remove unused metabolites from the model. Defaults to False.
- remove_dupl_reac (bool, optional):
Option to remove the duplicate reactions. Only applied if check_dupl_reac is also True. Defaults to False.
- remove_dupl_meta (bool, optional):
Option to remove the duplicate metabolites. Only applied if the check via check_dupl_meta is also enabled. Defaults to False.
- universal (str, optional):
Path to a universal model for gapfilling. Defaults to None, which skips the gapfilling.
- media_path (str, optional):
Path to a medium config file for gapfilling. Defaults to None.
- namespace (Literal[‘BiGG’], optional):
Namespace to use for the model. Options include ‘BiGG’. Defaults to ‘BiGG’.
- growth_threshold (float, optional):
Growth threshold for the gapfilling. Defaults to 0.05.
- iterations (int, optional):
Number of iterations for the heuristic version of the gapfilling. If 0 or None is given, uses full set of reactions. Defaults to 3.
- chunk_size (int, optional):
Number of reactions to be used for gapfilling at the same time. If None or 0 is given, use full set, not heuristic. Defaults to 10000.
- memote (bool, optional):
Option to run memote on the cleaned model. Defaults to False.
- Raises:
ValueError: Unknown option for check_dupl_meta
KeyError: Missing parameter for GeneGapFiller
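The two documented errors can be illustrated with a minimal sketch of an up-front option check. This is not the SPECIMEN implementation: the function name and the key names in GGF_REQS_EXAMPLE are hypothetical placeholders, since the actual contents of GGF_REQS are not listed here.

```python
# Hypothetical stand-in for GGF_REQS; the real required keys are not shown in this reference.
GGF_REQS_EXAMPLE = {"required_a", "required_b"}
# Options named in the description of check_dupl_meta.
DUPL_META_OPTIONS = ("default", "skip", "exhaustive")


def check_cleanup_options(check_dupl_meta="default", run_gene_gapfiller=None):
    """Validate two cleanup options before any model is loaded (illustrative only)."""
    if check_dupl_meta not in DUPL_META_OPTIONS:
        raise ValueError(f"Unknown option for check_dupl_meta: {check_dupl_meta}")
    if run_gene_gapfiller is not None:
        # Collect every required key that the caller did not provide.
        missing = GGF_REQS_EXAMPLE.difference(run_gene_gapfiller)
        if missing:
            raise KeyError(f"Missing parameter for GeneGapFiller: {sorted(missing)}")
    return True
```

Checking the options first means a typo in a parameter fails immediately rather than after a long-running step.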
specimen.hqtb.core.refinement.extension submodule
specimen.hqtb.core.refinement.smoothing submodule
specimen.hqtb.core submodules
specimen.hqtb.core.analysis
Analyse a model (step 5 of the workflow).
- specimen.hqtb.core.analysis.run(model_path: str, dir: str, media_path: str = None, namespace: Literal['BiGG'] = 'BiGG', pc_model_path: str = None, pc_based_on: Literal['id'] = 'id', test_aa_auxotrophies: bool = True, pathway: bool = True)[source]
SPECIMEN Step 5: Analyse the generated model.
- Args:
- model_path (str):
Path to the model.
- dir (str):
Path to the output directory.
- media_path (str, optional):
Path to a media config file. Using this enables growth simulation. Defaults to None.
- namespace (Literal[‘BiGG’], optional):
Namespace to work on. Defaults to ‘BiGG’.
- pc_model_path (str, optional):
Path to a core-pan model. Defaults to None.
- pc_based_on (Literal[‘id’], optional):
How to compare the model to the core-pan model. Defaults to ‘id’.
- test_aa_auxotrophies (bool, optional):
Option to enable the amino acid auxotrophy simulation. Defaults to True.
- pathway (bool, optional):
Option to enable KEGG pathway analysis. Defaults to True.
specimen.hqtb.core.bidirectional_blast
Perform a bidirectional blastp using DIAMOND on an input and a template (annotated genomes).
- specimen.hqtb.core.bidirectional_blast.bdbp_diamond(dir: str, template_name: str, input_name: str, template_path: str, input_path: str, sensitivity='sensitive', threads=2)[source]
Perform bidirectional blastp using DIAMOND.
- Args:
- dir (str):
Path to the directory parent to in/out.
- template_name (str):
Name of the template genome.
- input_name (str):
Name of the input genome.
- template_path (str):
Path to the CDS FASTA-file of the template.
- input_path (str):
Path to the CDS FASTA-file of the input.
- sensitivity (str, optional):
Sensitivity mode for DIAMOND. Defaults to ‘sensitive’.
- threads (int, optional):
Number of threads to use when running DIAMOND. Defaults to 2.
- specimen.hqtb.core.bidirectional_blast.create_diamond_db(dir: str, name: str, path: str, threads: int)[source]
Create a DIAMOND database for a given protein FASTA file.
- Args:
- dir (str):
Path to the data directory.
- name (str):
Name of the genome/database.
- path (str):
Path to the FASTA-file.
- threads (int):
Number of threads to use.
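As a rough illustration of what creating a DIAMOND database involves, the following sketch only assembles a `diamond makedb` argument list and does not execute it. The helper name and the `db/<name>.dmnd` layout inside the data directory are assumptions, not the layout used by SPECIMEN.

```python
from pathlib import Path


def makedb_command(dir, name, path, threads):
    """Build (but do not run) a DIAMOND makedb call for one protein FASTA file."""
    # Assumed layout: databases live in a 'db' subfolder of the data directory.
    db_path = Path(dir) / "db" / f"{name}.dmnd"
    return [
        "diamond", "makedb",
        "--in", str(path),          # input protein FASTA
        "-d", str(db_path),         # database file to create
        "--threads", str(threads),
    ]
```

The resulting list could be passed to `subprocess.run` on a system where DIAMOND is installed.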
- specimen.hqtb.core.bidirectional_blast.extract_bestbdbp_hits(tvq: str, qvt: str, name: str, cov: float = 0.25)[source]
Extract the best bidirectional blastp hits from two TSV files that were generated by bdbp_diamond() or similar steps.
- Args:
- tvq (str):
Path to the template vs. query file.
- qvt (str):
Path to the query vs. template file.
- name (str):
Name (path) of the output file.
- cov (float, optional):
Cut-off value for the coverage. All hits with coverage < cov will be excluded. Defaults to 0.25.
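The idea behind bidirectional best hits with a coverage cut-off can be sketched in plain Python. This is a simplified illustration, not the function's actual code: the row keys (`query`, `subject`, `bitscore`, `coverage`) and the use of the bit score to rank hits are assumptions, and plain dictionaries stand in for the TSV files.

```python
def best_hits(rows, cov):
    """Return the best-scoring subject per query, ignoring hits with coverage < cov."""
    best = {}
    for row in rows:
        if row["coverage"] < cov:
            continue
        current = best.get(row["query"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["query"]] = row
    return {query: row["subject"] for query, row in best.items()}


def bidirectional_best_hits(qvt_rows, tvq_rows, cov=0.25):
    """Keep only pairs that are each other's best hit in both directions."""
    forward = best_hits(qvt_rows, cov)    # query -> template
    backward = best_hits(tvq_rows, cov)   # template -> query
    return sorted((q, t) for q, t in forward.items() if backward.get(t) == q)


# Made-up example data: q2's only hit falls below the coverage cut-off.
qvt = [
    {"query": "q1", "subject": "t1", "bitscore": 200, "coverage": 0.9},
    {"query": "q1", "subject": "t2", "bitscore": 150, "coverage": 0.9},
    {"query": "q2", "subject": "t2", "bitscore": 300, "coverage": 0.1},
]
tvq = [
    {"query": "t1", "subject": "q1", "bitscore": 190, "coverage": 0.8},
    {"query": "t2", "subject": "q2", "bitscore": 300, "coverage": 0.9},
]
pairs = bidirectional_best_hits(qvt, tvq)
```

Here only the pair (q1, t1) survives, because t2 points back to q2 while q2 has no forward hit left after filtering.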
- specimen.hqtb.core.bidirectional_blast.extract_cds(file: str, name: str, dir: str, collect_info: list, identifier: str) → str[source]
Extract the CDS from a genbank file (annotated genome). Produces a FASTA-file.
- Args:
- file (str):
File to extract CDS from.
- name (str):
Name of the genome.
- dir (str):
Directory for the output.
- collect_info (list):
Feature identifiers to collect information from.
- identifier (str):
Feature identifier to use as the header of the FASTA file.
- Returns:
- str:
Name of the FASTA-file
- specimen.hqtb.core.bidirectional_blast.run(template: str, input: str, dir: str, template_name: str | None = None, input_name: str | None = None, temp_header: str | None = None, in_header: str | None = None, threads: int = 2, extra_info: list[str] = ['locus_tag', 'product', 'protein_id'], sensitivity: Literal['sensitive', 'more-sensitive', 'very-sensitive', 'ultra-sensitive'] = 'more-sensitive')[source]
Run the bidirectional blast on a template and input genome (annotated).
- Args:
- template (str):
Path to the annotated genome file used as a template.
- input (str):
Path to the annotated genome file used as input.
- dir (str):
Path to the output directory.
- template_name (str, optional):
Name of the annotated genome file used as a template. Defaults to None.
- input_name (str, optional):
Name of the annotated genome file used as input. Defaults to None.
- temp_header (str, optional):
Feature qualifier of the gbff (NCBI) / faa (PROKKA) of the template to use as header for the FASTA files. If None is given, sets it based on file extension (currently only implemented for gbff and faa). Defaults to None.
- in_header (str, optional):
Feature qualifier of the gbff (NCBI) / faa (PROKKA) of the input to use as header for the FASTA files. If None is given, sets it based on file extension (currently only implemented for gbff and faa). Defaults to None.
- threads (int, optional):
Number of threads to be used for DIAMOND. Defaults to 2.
- extra_info (list[str], optional):
List of feature qualifiers to be extracted from the annotated genome files as additional information. Defaults to [‘locus_tag’, ‘product’, ‘protein_id’].
- sensitivity (Literal[‘sensitive’, ‘more-sensitive’, ‘very-sensitive’, ‘ultra-sensitive’], optional):
Sensitivity mode for the DIAMOND blastp run. Defaults to ‘more-sensitive’.
- Raises:
ValueError: Unknown file extension. Please set value for temp_header manually or check file.
ValueError: Unknown file extension. Please set value for in_header manually or check file.
ValueError: Unknown sensitive mode
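The three errors above correspond to two input checks, which might look roughly like the sketch below. The helper names are hypothetical, and the mapping from file extension to feature qualifier is purely illustrative: which qualifier is actually chosen per extension is not stated in this reference.

```python
from pathlib import Path

# Modes taken from the type hint of the sensitivity parameter.
SENSITIVITY_MODES = ("sensitive", "more-sensitive", "very-sensitive", "ultra-sensitive")
# Illustrative mapping only; the real per-extension choice is an assumption.
HEADER_BY_EXTENSION = {".gbff": "locus_tag", ".faa": "locus_tag"}


def resolve_header(path, header=None):
    """Return the given header, or derive one from the file extension."""
    if header is not None:
        return header
    suffix = Path(path).suffix.lower()
    if suffix not in HEADER_BY_EXTENSION:
        raise ValueError("Unknown file extension. Please set the header manually or check file.")
    return HEADER_BY_EXTENSION[suffix]


def check_sensitivity(sensitivity):
    """Reject sensitivity modes that are not in the allowed list."""
    if sensitivity not in SENSITIVITY_MODES:
        raise ValueError(f"Unknown sensitive mode: {sensitivity}")
    return sensitivity
```

Passing an explicit header always wins over the extension-based guess, which matches the documented advice to set temp_header or in_header manually for other file types.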
- specimen.hqtb.core.bidirectional_blast.run_diamond_blastp(dir: str, db: str, query: str, fasta_path: str, sensitivity: str, threads: int)[source]
Run DIAMOND blastp for a given database name and FASTA. Relies on the structure created by bidirectional_blast.
- Args:
- dir (str):
Parent directory of the place to save the files to.
- db (str):
Name of the genome/database used as the database.
- query (str):
Name of the genome used as the query.
- fasta_path (str):
Path to the FASTA-file containing the CDS.
- sensitivity (str):
Sensitivity mode to use for DIAMOND blastp.
- threads (int):
Number of threads that will be used for running DIAMOND.
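To show how these parameters could map onto a DIAMOND call, the sketch below assembles a `diamond blastp` argument list without running it. The helper name, the `db/<db>.dmnd` location, and the `<query>_vs_<db>.tsv` output name are assumptions about the folder structure, not the documented one.

```python
from pathlib import Path


def blastp_command(dir, db, query, fasta_path, sensitivity, threads):
    """Build (but do not run) a DIAMOND blastp call in tabular output format."""
    out_dir = Path(dir)
    db_file = out_dir / "db" / f"{db}.dmnd"        # assumed database location
    out_file = out_dir / f"{query}_vs_{db}.tsv"    # assumed output file name
    return [
        "diamond", "blastp",
        "-d", str(db_file),
        "-q", str(fasta_path),
        f"--{sensitivity}",          # e.g. --more-sensitive
        "--threads", str(threads),
        "--outfmt", "6",             # BLAST-style tabular output
        "-o", str(out_file),
    ]
```

Running the function twice with db and query swapped gives the two directions needed for a bidirectional comparison.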
specimen.hqtb.core.generate_draft_model
Generate a draft model from a template model.
The basic idea has been adapted from Norsigian et al. (2020).
- specimen.hqtb.core.generate_draft_model.check_unchanged(draft: Model, bbh: DataFrame) → Model[source]
Check the gene names (more correctly, the IDs) for original col_names that still exist. Depending on the case, decide whether to keep or remove them.
- Args:
- draft (cobra.Model):
The draft model currently in the making.
- bbh (pd.DataFrame):
The table from run() containing the bidirectional blastp best hits information.
- Returns:
- cobra.Model:
The model after the check and possible removal of genes.
- specimen.hqtb.core.generate_draft_model.edit_template_identifiers(data: DataFrame, edit: Literal['no', 'dot-to-underscore']) → DataFrame[source]
Edit the subject IDs to fit the gene IDs of the template model. Requires further extension if the needed edits are not included.
- Args:
- data (pd.DataFrame):
The data frame containing the bidirectional blastp best hits information.
- edit (Literal[‘no’, ‘dot-to-underscore’]):
Type of edit to perform. Currently possible options: no, dot-to-underscore.
- Returns:
- pd.DataFrame:
The (un)edited DataFrame.
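The two edit options can be illustrated on a plain list of subject IDs. This is a simplified sketch: a list stands in for the DataFrame column, and the function name and the example ID are made up.

```python
def edit_identifiers(subject_ids, edit):
    """Apply the chosen edit to each subject ID (list stands in for a DataFrame column)."""
    if edit == "no":
        return list(subject_ids)
    if edit == "dot-to-underscore":
        # Replace every '.' so the IDs match underscore-style gene IDs.
        return [sid.replace(".", "_") for sid in subject_ids]
    raise ValueError(f"Unknown edit option: {edit}")
```

With the `dot-to-underscore` option, an ID such as `WP_0001.1` becomes `WP_0001_1`.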
- specimen.hqtb.core.generate_draft_model.gen_draft_model(model: Model, bbh: DataFrame, name: str, dir: str, edit: Literal['no', 'dot-to-underscore'], medium: str = 'default', namespace: Literal['BiGG'] = 'BiGG') → Model[source]
Generate a draft model from a template model and the results of a bidirectional blastp (blast best hits) table and save it as a new model.
- Args:
- model (cobra.Model):
The template model.
- bbh (pd.DataFrame):
The bidirectional blastp best hits table.
- name (str):
Name of the newly generated model.
- dir (str):
Path to the directory to save the new model in.
- edit (Literal[‘no’, ‘dot-to-underscore’]):
Type of edit to perform. Currently possible options: no, dot-to-underscore.
- medium (str, optional):
Name of the medium to be loaded from the refineGEMs database or ‘default’ = the one from the template model. If given the keyword ‘exchanges’, will use all exchange reactions in the model as a medium. Defaults to ‘default’.
- namespace (Literal[‘BiGG’], optional):
Namespace of the model. Defaults to ‘BiGG’.
- Returns:
- cobra.Model:
The generated draft model.
- specimen.hqtb.core.generate_draft_model.pid_filter(data: DataFrame, pid: float) → DataFrame[source]
Filter the data based on PID threshold. Entries above the given value are retained.
- Args:
- data (pd.DataFrame):
The data from the previous step (see bidirectional_blast) containing at least a ‘PID’ column.
- pid (float):
PID threshold value, given in percentage e.g. 80.0.
- Returns:
- pd.DataFrame:
The filtered data.
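A minimal sketch of the threshold filter, using a list of dictionaries in place of the DataFrame. Whether the real comparison is strict is an assumption here; the sketch follows the wording "above the given value" and uses a strict `>`.

```python
def pid_filter_rows(rows, pid):
    """Keep rows whose 'PID' value lies above the threshold (strict comparison assumed)."""
    return [row for row in rows if row["PID"] > pid]


# Made-up example rows with percentage identity values.
hits = [
    {"query": "q1", "PID": 95.2},
    {"query": "q2", "PID": 80.0},
    {"query": "q3", "PID": 42.7},
]
kept = pid_filter_rows(hits, 80.0)
```

With a threshold of 80.0, only q1 is retained under the strict comparison; q2 sits exactly on the threshold and is dropped.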
- specimen.hqtb.core.generate_draft_model.remove_absent_genes(model: Model, genes: list[str]) → Model[source]
Remove a list of genes from a given model.
Note
Genes that are not found in the model are skipped.
- Args:
- model (cobra.Model):
A template model to delete genes from. A copy will be created before deleting.
- genes (list[str]):
Gene identifiers of genes that should be deleted.
- Returns:
- cobra.Model:
A new model with the given genes deleted, if found in the original model.
- specimen.hqtb.core.generate_draft_model.rename_found_homologs(draft: Model, bbh: DataFrame) → Model[source]
Rename the genes in the model according to the homologous ones found in the query.
- Args:
- draft (cobra.Model):
The draft model with the to-be-renamed genes.
- bbh (pd.DataFrame):
The table from run() containing the bidirectional blastp best hits information.
- Returns:
- cobra.Model:
The draft model with renamed genes.
- specimen.hqtb.core.generate_draft_model.run(template: str, bpbbh: str, dir: str, edit_names: Literal['no', 'dot-to-underscore'] = 'no', pid: float = 80.0, name: str | None = None, medium: str = 'default', namespace: str = 'BiGG', memote: bool = False)[source]
Generate a draft model from a blastp best hits tsv file and a template model.
- Args:
- template (str):
Path to the file containing the template model.
- bpbbh (str):
Path to the blastp bidirectional best hits.
- dir (str):
Path to output directory.
- edit_names (Literal[‘no’, ‘dot-to-underscore’], optional):
Type of edit to perform. Currently possible options: no, dot-to-underscore. Defaults to ‘no’.
- pid (float, optional):
Threshold value for determining whether a gene is counted as present or absent. Given in percentage, e.g. 80.0 = 80%. Defaults to 80.0.
- name (Union[str,None], optional):
Name of the output model. If not given, takes name from filename. Defaults to None.
- medium (str, optional):
Name of the medium to be loaded from the refineGEMs database or ‘default’ = the one from the template model. If given the keyword ‘exchanges’, will use all exchange reactions in the model as a medium. Defaults to ‘default’.
- namespace (str, optional):
Namespace of the model. Defaults to ‘BiGG’.
- memote (bool, optional):
Option to run memote after creating the draft model. Defaults to False.
- Raises:
ValueError: ‘Edit_names value not in list of allowed values: no, dot-to-underscore’
specimen.hqtb.core.validation
Validate a model (step 4 of the workflow).
Implemented tests include:
- cobra/sbml check using cobrapy
- specimen.hqtb.core.validation.run(dir: str, model_path: str, tests: None | str | list = None, run_all: bool = True)[source]
SPECIMEN Step 4: Validate the model.
Included tests (name : description):
- modelpolisher: Semantic control and BiGG annotation fixing with ModelPolisher
- cobra: SBML validation using COBRApy
- Args:
- dir (str):
Path to the output directory.
- model_path (str):
Path to the model to be validated.
- tests (Union[None, str, list], optional):
Tests to perform. If the test name is either in a string or an element in a list, the corresponding test will be run. Defaults to None.
- run_all (bool, optional):
Run all available tests. If True, overwrites the previous parameter. Defaults to True.
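How tests and run_all might interact can be sketched as follows. The function name is hypothetical and this is not the actual implementation; the test names are taken from the list of included tests above.

```python
# Test names as listed in the description of run().
AVAILABLE_TESTS = ("modelpolisher", "cobra")


def select_tests(tests=None, run_all=True):
    """Return the names of the tests to run, given a string, a list, or None."""
    if run_all:
        # run_all overrides whatever was passed in tests.
        return list(AVAILABLE_TESTS)
    if tests is None:
        return []
    # 'in' covers both cases: substring of a string, or element of a list.
    return [name for name in AVAILABLE_TESTS if name in tests]
```

For example, `select_tests("cobra", run_all=False)` selects only the COBRApy check, while leaving run_all at its default selects every available test regardless of tests.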