2. Setup of an OligoDatabase

The OligoDatabase is the core data structure for storing, organizing, and processing oligo sequences in any pipeline. It can load oligo sequences from one or more FASTA files, where each sequence is labeled with essential metadata in the header (see the previous tutorial for oligo sequences generation). This database provides a flexible and efficient way to manage oligos and their associated information for downstream filtering, scoring, and selection.

In this tutorial we show how to:

Create an oligo database from FASTA files
Pre-filter oligos by properties
Save and load a database object
Export an oligo database as tsv or FASTA

Imports and setup

[1]:

import os
import pprint

from pathlib import Path

from oligo_designer_toolsuite.database import OligoDatabase
from oligo_designer_toolsuite.oligo_property_calculator import (
    IsoformConsensusProperty,
    PropertyCalculator,
)

[2]:

dir_output = os.path.abspath("./results")
Path(dir_output).mkdir(parents=True, exist_ok=True)

n_jobs = 3

Creating the OligoDatabase

An OligoDatabase object is initialized with parameters that define how the oligo data will be managed and stored.

Initialization Parameters:

min_oligos_per_region: Minimum number of oligos required for a region to be retained in the database. Regions with fewer oligos will be logged and excluded.
write_regions_with_insufficient_oligos: Whether to log regions that do not meet the minimum oligo threshold.
max_entries_in_memory: Determines the number of regions loaded into RAM at once, optimizing memory usage.
database_name: Name assigned to the database.
dir_output: Directory where the database and associated files will be saved.
n_jobs: Number of parallel processes to use for database operations.

[3]:

min_oligos_per_region = 3
write_regions_with_insufficient_oligos = True
max_entries_in_memory=n_jobs * 2 + 2
database_name="db_oligos"

[4]:

oligo_database = OligoDatabase(
    min_oligos_per_region=min_oligos_per_region,
    write_regions_with_insufficient_oligos=write_regions_with_insufficient_oligos,
    max_entries_in_memory=max_entries_in_memory,
    database_name=database_name,
    dir_output=dir_output,
    n_jobs=n_jobs,
)

Loading Sequences into the Database

The load_database_from_fasta() method imports sequences into the database from a FASTA file. This method can load both:

Target Sequences: Genomic sequences from which oligos will be designed.
Oligo Sequences: Pre-designed or randomly generated oligos.

The sequence_type parameter determines whether the sequences represent target regions or oligo sequences. For target sequences, the reverse complement will be automatically generated.

Input Requirements for FASTA Files

The input FASTA file must adhere to the following structure:

Header Format: Each sequence must have a header starting with the > character. The header should contain:

region_id: A unique identifier for the genomic region (e.g., gene name or ID). This is mandatory.
additional_information: Optional metadata fields such as transcript ID or exon number, separated by commas.
coordinates: Genomic location in the format chrom:start-end(strand), which is optional.

Sequence Content: The sequence follows the header in standard FASTA format.

Example:

>ASR1::transcrip_id=XM456,exon_number=5::16:54552-54786(+)

AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC

or

>ASR1

AGTTGACAGACCCCAGATTAAAGTGTGTCGCGCAACAC

[5]:

# Loading Target Sequences: Clears the database and loads genomic sequences as targets for oligo design.
gene_ids = ["AARS1", "DECR2", "PRR35"]
files_fasta = ["./data/AARS1.fna", "./data/DECR2.fna", "./data/PRR35.fna"]
oligo_database.load_database_from_fasta(
    files_fasta=files_fasta,
    database_overwrite=True, # Clears the existing database before loading the new sequences.
    sequence_type="target",
    region_ids=gene_ids,
)

# Appending Oligo Sequences: Adds pre-generated oligos (e.g., random sequences) to the existing database without clearing it.
oligo_random_fasta_file = "./data/random_probe_sequences.fna"
oligo_database.load_database_from_fasta(
    files_fasta=oligo_random_fasta_file,
    database_overwrite=False, # Appends the new sequences to the existing database.
    sequence_type="oligo",
    region_ids=None,
)

Once the OligoDatabase is created, oligos are loaded from FASTA files and stored in a nested dictionary structure. This step organizes the oligos, ensuring efficient storage and enabling downstream analysis. The metadata from the FASTA headers is automatically parsed and stored as features for each oligo, enabling flexible filtering and scoring.

Nested Dictionary Structure

The loaded oligos are stored in the OligoDatabase as a nested dictionary with the following structure:

[region_id][oligo_id][oligo_features]

region_id: A unique identifier (e.g., gene name) grouping oligos that belong to the same genomic region.
oligo_id: A unique identifier for each oligo within the region.
oligo_features: A dictionary containing metadata such as sequence, genomic location, and other annotations.

[6]:

region = list(oligo_database.database.keys())[0]
oligo_id_1 = list(oligo_database.database[region].keys())[0]
oligo_id_2 = list(oligo_database.database[region].keys())[1]

sample_oligos_DB = {region: {oligo_id_1: oligo_database.database[region][oligo_id_1],
                           oligo_id_2: oligo_database.database[region][oligo_id_2]}}
pprint.pprint(sample_oligos_DB)

{'AARS1': {'AARS1::1': {'annotation_release': [[110]],
                        'chromosome': [['16']],
                        'end': [[70265662]],
                        'exon_number': [[10, 10]],
                        'gene_id': [['AARS1']],
                        'genome_assembly': [['GRCh38']],
                        'number_total_transcripts': [[2]],
                        'regiontype': [['exon']],
                        'source': [['NCBI']],
                        'species': [['Homo_sapiens']],
                        'start': [[70265623]],
                        'strand': [['-']],
                        'target': 'GAGACACTGCTTGGCTCCTCTATGACACCTATGGGTTTCC',
                        'transcript_id': [['NM_001605.3', 'XM_047433666.1']]},
           'AARS1::2': {'annotation_release': [[110]],
                        'chromosome': [['16']],
                        'end': [[70265661]],
                        'exon_number': [[10, 10]],
                        'gene_id': [['AARS1']],
                        'genome_assembly': [['GRCh38']],
                        'number_total_transcripts': [[2]],
                        'regiontype': [['exon']],
                        'source': [['NCBI']],
                        'species': [['Homo_sapiens']],
                        'start': [[70265622]],
                        'strand': [['-']],
                        'target': 'AGACACTGCTTGGCTCCTCTATGACACCTATGGGTTTCCA',
                        'transcript_id': [['NM_001605.3', 'XM_047433666.1']]}}}

Pre-Filtering Oligos by Properties

After loading the oligos, a pre-filtering step can be performed to eliminate oligos that do not meet basic criteria. This filtering uses properties derived from the metadata in the FASTA headers or calculated using the PropertyCalculator class.

Calculating Properties: Additional properties, such as isoform_consensus (the percentage of transcripts covered by the oligo for the target gene), can be computed for each oligo.
Filtering by Property Threshold: Oligos are filtered based on a threshold for a specific property. For example, to retain only oligos with at least 50% isoform consensus.
Removing Regions with Insufficient Oligos: Regions with fewer than the minimum required number of oligos are removed from the database:

[7]:

isoform_consensus = 50

[8]:

# Calculating Properties
properties = [IsoformConsensusProperty()]
calculator = PropertyCalculator(properties=properties)
oligo_database = calculator.apply(
    oligo_database=oligo_database, sequence_type="oligo", n_jobs=n_jobs
)

# Filtering by Property Threshold
oligo_database.filter_database_by_property_threshold(
    property_name="isoform_consensus", #name of the property that should be used for filtering
    property_thr=isoform_consensus, #threshold for filtering
    remove_if_smaller_threshold=True, #define if the oligo should be removed if the property is greater or smaller than the defined threshold
)

# Removing Regions with Insufficient Oligos
oligo_database.remove_regions_with_insufficient_oligos(pipeline_step="Pre-Filters")

Saving and Retrieving the Database

To preserve the state of the database at any point, the save_database() and load_database() functions can be used. This ensures that progress is not lost and enables resuming the pipeline from intermediate steps.

[9]:

# Save Database
dir_database = oligo_database.save_database(name_database="1_db_oligos_initial")

# Load Database
oligo_database.load_database(dir_database=dir_database, database_overwrite=True, merge_databases_on_sequence_type="oligo")

Exporting the Database

The database can also be exported for analysis in other tools or formats:

Export as TSV: Outputs a table of oligos and their properties (write_database_to_table())
Export as FASTA: Outputs oligo sequences in FASTA format (write_database_to_fasta())
Export as BED: Outputs oligo sequences in BED format (write_database_to_bed())

The setup of the OligoDatabase is critical for several reasons:

Centralized Data Management: Provides a structured repository for oligos and their metadata.
Customizability: Allows for filtering based on the number of oligos per region and specific target regions.
Scalability: Efficiently handles large genomic datasets by managing memory usage and parallel processing.
Flexibility: Supports both genomic and pre-designed oligos, enabling a wide range of experimental setups.

By defining the database structure and loading sequences with proper metadata, this step ensures that the downstream steps (e.g., filtering and scoring) are applied seamlessly and effectively.