5. Oligoset Generation

The final step in designing oligos is to organize them into optimal sets that maximize experimental efficiency and reliability. This step evaluates individual oligos, groups them into sets of non-overlapping oligos and ranks the sets by their overall efficiency scores. The OligosetGenerator ensures that only the best-performing sets of oligos are selected for downstream experimental use.

Key Objectives in Oligoset Generation

  • Scoring Individual Oligos: Each oligo is assigned a score based on its theoretical efficiency in the experimental context. Scores are computed using a class derived from OligoScoringBase. A list of implemented oligo scores is available here.

  • Scoring Oligo Sets: Once individual oligos are scored, they are grouped into sets of oligos based on a set generator. A scoring class derived from SetScoringBase evaluates the overall efficiency of each set. A list of implemented set scores is available here.

  • Generating Oligo Sets: Oligos in each set can be selected based on their positional overlap (OligosetGeneratorIndependentSet) or the homogeneity of specified oligo properties (HomogeneousPropertyOligoSetGenerator).

  • Selection Policies: The OligoSelectionPolicy classes define the strategy for selecting and optimizing non-overlapping oligo sets from a pool of candidates. These policies use greedy or graph-based approaches to navigate the large combinatorial space of possible oligo combinations. This ensures the generated sets meet experimental requirements while adhering to constraints like set size and distance between oligos.

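In this tutorial we show how to score individual oligos and oligo sets, configure a selection policy, and generate and inspect the resulting oligosets.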

Imports and setup

[1]:
import os
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path
from Bio.SeqUtils import MeltingTemp as mt

from oligo_designer_toolsuite.database import OligoDatabase
from oligo_designer_toolsuite.oligo_efficiency_filter import (
    LowestSetScoring,
    IsoformConsensusScorer,
    NormalizedDeviationFromOptimalTmScorer,
    NormalizedDeviationFromOptimalGCContentScorer,
    OligoScoring
)
from oligo_designer_toolsuite.oligo_property_calculator import (
    GCContentProperty,
    PropertyCalculator,
    TmNNProperty,
)
from oligo_designer_toolsuite.oligo_selection import (
    GraphBasedSelectionPolicy,
    OligosetGeneratorIndependentSet,
)
[2]:
dir_output = os.path.abspath("./results")
Path(dir_output).mkdir(parents=True, exist_ok=True)

n_jobs = 3
[3]:
# parameters
oligo_Tm_opt = 65
oligo_GC_content_opt = 50
oligo_isoform_weight = 2
oligo_Tm_weight = 1
oligo_GC_weight = 1

pre_filter = False
n_attempts = 100000
heuristic = True
heuristic_n_attempts = 100
clique_init_approximation = False

max_graph_size = 5000
distance_between_oligos = 0

oligo_size_opt = 5
oligo_size_min = 3
n_sets = 100

oligo_GC_content_min = 40
oligo_GC_content_max = 60

oligo_Tm_min = 60
oligo_Tm_max = 70

Tm_parameters_oligo = {
    "nn_table": "DNA_NN3", # Allawi & SantaLucia (1997)
    "tmm_table": "DNA_TMM1", #default
    "imm_table": "DNA_IMM1", #default
    "de_table": "DNA_DE1", #default
    "dnac1": 50, #[nM]
    "dnac2": 0, #[nM]
    "saltcorr": 7, # Owczarzy et al. (2008)
    "Na": 39, #[mM]
    "K": 75, #[mM]
    "Tris": 20, #[mM]
    "Mg": 10, #[mM]
    "dNTPs": 0, #[mM] default
}
Tm_parameters_oligo["nn_table"] = getattr(mt, Tm_parameters_oligo["nn_table"])
Tm_parameters_oligo["tmm_table"] = getattr(mt, Tm_parameters_oligo["tmm_table"])
Tm_parameters_oligo["imm_table"] = getattr(mt, Tm_parameters_oligo["imm_table"])
Tm_parameters_oligo["de_table"] = getattr(mt, Tm_parameters_oligo["de_table"])

Tm_chem_correction_param_oligo = {
    "DMSO": 0, #default
    "fmd": 20,
    "DMSOfactor": 0.75, #default
    "fmdfactor": 0.65, #default
    "fmdmethod": 1, #default
    "GC": None, #default
}
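
The chemical correction settings above can be read as a simple Tm shift. The sketch below mirrors the linear formamide rule that Biopython's `MeltingTemp.chem_correction` applies for `fmdmethod=1` (Tm is lowered by `fmdfactor` times the formamide percentage). The helper name `formamide_corrected_tm` and the uncorrected Tm of 78 °C are hypothetical and only serve the illustration.

```python
def formamide_corrected_tm(tm, fmd_percent, fmdfactor=0.65):
    """Linear formamide correction (fmdmethod=1): lower Tm by fmdfactor * % formamide."""
    return tm - fmdfactor * fmd_percent


# Hypothetical uncorrected Tm, combined with the 20 % formamide configured above.
tm_uncorrected = 78.0
tm_corrected = formamide_corrected_tm(tm_uncorrected, fmd_percent=20)

# 0.65 * 20 = 13, so the correction lowers the Tm by 13 degrees.
print(f"Tm shift: {tm_uncorrected - tm_corrected:.1f} °C")
```

With these settings, an oligo needs an uncorrected Tm of roughly 73 to 83 °C to fall inside the 60 to 70 °C window defined by oligo_Tm_min and oligo_Tm_max.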

Load the database

Like in previous tutorials, we will also be working with OligoDatabase objects. If you don’t know how they work, please check our oligo database tutorial. In this tutorial, we will load an existing database.

[4]:
# Create Database object
min_oligos_per_region = 3
write_regions_with_insufficient_oligos = True
max_entries_in_memory=n_jobs * 2 + 2
database_name="db_oligos"

oligo_database = OligoDatabase(
    min_oligos_per_region=min_oligos_per_region,
    write_regions_with_insufficient_oligos=write_regions_with_insufficient_oligos,
    max_entries_in_memory=max_entries_in_memory,
    database_name=database_name,
    dir_output=dir_output,
    n_jobs=n_jobs,
)

# Load Database
dir_database = os.path.abspath("./data/3_db_oligos_specificity_filter")
oligo_database.load_database(dir_database=dir_database, database_overwrite=True, merge_databases_on_sequence_type="oligo")

Below we compute properties such as GC content and melting temperature for each oligo and plot their distributions for region AARS1. We can later compare them with the oligos that are kept after set selection.

[5]:
properties = [
    GCContentProperty(),
    TmNNProperty(
        Tm_parameters=Tm_parameters_oligo,
        Tm_chem_correction_parameters=Tm_chem_correction_param_oligo,
        Tm_salt_correction_parameters=None,
    ),
]
calculator = PropertyCalculator(properties=properties)
oligo_database = calculator.apply(
    oligo_database=oligo_database, sequence_type="oligo", n_jobs=n_jobs
)
[6]:
output_table = oligo_database.get_oligo_property_table(properties=["oligo", "oligo_score", "isoform_consensus", "GC_content_oligo", "TmNN_oligo"], flatten=True, region_ids="AARS1")
output_table_plot = output_table.melt(id_vars="oligo_id", value_vars=["isoform_consensus", "GC_content_oligo", "TmNN_oligo"], var_name="properties", value_name="property_value")

sns.displot(data=output_table_plot, x="property_value", hue="properties", kind="kde", fill=True)
plt.title("Distribution of Oligo Properties")
plt.xlabel("Property value")
plt.xlim([0, 100])
plt.tight_layout()
plt.show()

[Figure: kernel density plot "Distribution of Oligo Properties" for isoform_consensus, GC_content_oligo and TmNN_oligo; x-axis: Property value, 0 to 100]

Scoring

Each oligo scorer is implemented as a class inheriting from the abstract base class BaseScorer. This ensures that all scorers have a standardized apply() method, which takes an OligoDatabase object, a region_id and an oligo_id as input and returns the calculated score. The OligoScoring class takes one or more scorers as input, iterates through a given set of region_ids and applies each scorer to the oligos of that region. The individual scores are summed for each oligo.

[7]:
# oligo scoring
isoform_scorer = IsoformConsensusScorer(score_weight=oligo_isoform_weight)
Tm_scorer = NormalizedDeviationFromOptimalTmScorer(
    Tm_min=oligo_Tm_min,
    Tm_opt=oligo_Tm_opt,
    Tm_max=oligo_Tm_max,
    Tm_parameters=Tm_parameters_oligo,
    Tm_chem_correction_parameters=Tm_chem_correction_param_oligo,
    Tm_salt_correction_parameters=None,
    score_weight=oligo_Tm_weight,
)
GC_scorer = NormalizedDeviationFromOptimalGCContentScorer(
    GC_content_min=oligo_GC_content_min,
    GC_content_opt=oligo_GC_content_opt,
    GC_content_max=oligo_GC_content_max,
    score_weight=oligo_GC_weight,
)
oligos_scoring = OligoScoring(scorers=[isoform_scorer, Tm_scorer, GC_scorer])

# set scoring
set_scoring = LowestSetScoring(ascending=True)
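
To make the summed score concrete, here is a minimal sketch of how a "normalized deviation from optimum" score could be computed and combined with the weights defined earlier. The formulas are assumptions, not taken from the library source: the deviation is scaled by the distance between the optimum and the bound on the side the value falls on, and the isoform term is assumed to be zero when all isoforms are covered. All function names are hypothetical.

```python
def normalized_deviation(value, v_min, v_opt, v_max, weight=1.0):
    """Distance from the optimum, scaled by the width of the side it falls on (assumed form)."""
    span = (v_max - v_opt) if value >= v_opt else (v_opt - v_min)
    return weight * abs(value - v_opt) / span


def isoform_penalty(consensus_percent, weight=1.0):
    """Assumed form: 0 when every isoform is covered, growing as coverage drops."""
    return weight * (1.0 - consensus_percent / 100.0)


def oligo_score(tm, gc, consensus_percent):
    """Sum of the three weighted terms, using the tutorial's parameter values."""
    return (
        isoform_penalty(consensus_percent, weight=2)      # oligo_isoform_weight
        + normalized_deviation(tm, 60, 65, 70, weight=1)  # Tm_min, Tm_opt, Tm_max
        + normalized_deviation(gc, 40, 50, 60, weight=1)  # GC_content_min, opt, max
    )


# Values of oligo AARS1::2291 from the property table later in this tutorial.
score = oligo_score(tm=65.01, gc=50.0, consensus_percent=100.0)
print(round(score, 3))  # 0.002
```

Under these assumptions the sketch reproduces the oligo_score of 0.002 listed for AARS1::2291 in the property table further down. Lower scores are better, which is why the set scoring is configured with ascending=True.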

Oligo Selection Policy

The GraphBasedSelectionPolicy uses the scoring strategies, defined above, to select sets of non-overlapping oligos that minimize the overall set score. Key features include:

Pre-Filtering: If pre_filter=True, oligos are pre-filtered before set selection, removing oligos that cannot form sets of at least oligo_size_min oligos. This improves performance for larger sets (e.g., oligo_size_min > 30) but can dramatically slow down small set selection (e.g., oligo_size_min < 30).

Search for Initial Set: The graph-based approach starts by finding an initial set of oligos that fulfills the minimum requirement, i.e. a size of at least oligo_size_min. If no initial set is found, selection is terminated for the respective region. Otherwise, the initial set serves as the starting point for selecting optimal sets by minimizing the overall set score. For larger sets (e.g., oligo_size_min > 15), performance improves with clique_init_approximation=True, an approximation that finds the largest non-overlapping set in the graph. For smaller sets (e.g., oligo_size_min < 15), it is more efficient to iterate through all possible non-overlapping sets by setting clique_init_approximation=False.

Heuristic Search: A heuristic approach is employed to optimize set selection within a feasible runtime:

  • heuristic: Enables or disables heuristic optimization for faster results, which might not find the best possible set.

  • heuristic_n_attempts: Maximum number of attempts to find optimal sets.
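
The search problem the policy solves can be illustrated with a brute-force sketch. This is not the graph-based algorithm itself: it enumerates every combination, which is only feasible for a handful of oligos, and is exactly what the heuristic and clique approximation are meant to avoid. The coordinates and scores are those of the ten AARS1 oligos in the property table later in this tutorial. Ranking by worst score first and summed score second is an assumption inferred from the set_score_worst and set_score_sum columns of the output.

```python
from itertools import combinations

# (oligo_id, start, end, oligo_score), copied from the AARS1 property table.
OLIGOS = [
    ("AARS1::2291", 70261065, 70261108, 0.002),
    ("AARS1::2290", 70261066, 70261109, 0.002),
    ("AARS1::2256", 70261100, 70261143, 0.004),
    ("AARS1::2255", 70261101, 70261144, 0.004),
    ("AARS1::1780", 70262378, 70262421, 0.010),
    ("AARS1::14964", 70267764, 70267805, 0.006),
    ("AARS1::13103", 70270282, 70270321, 0.010),
    ("AARS1::12911", 70271805, 70271848, 0.010),
    ("AARS1::11384", 70276990, 70277033, 0.010),
    ("AARS1::10497", 70282670, 70282713, 0.002),
]


def overlaps(a, b):
    """True if two oligos share at least one position (inclusive coordinates)."""
    return a[1] <= b[2] and b[1] <= a[2]


def rank_sets(oligos, set_size):
    """Enumerate all non-overlapping sets and rank them by (worst score, summed score)."""
    ranked = []
    for combo in combinations(oligos, set_size):
        if any(overlaps(a, b) for a, b in combinations(combo, 2)):
            continue  # at least one pair overlaps, so this is not a valid set
        scores = [o[3] for o in combo]
        ranked.append((max(scores), sum(scores), [o[0] for o in combo]))
    ranked.sort(key=lambda s: (s[0], s[1]))
    return ranked


ranked_sets = rank_sets(OLIGOS, set_size=5)
best_worst, best_sum, best_ids = ranked_sets[0]
print(len(ranked_sets), "valid sets; best:", best_ids)
print("worst:", round(best_worst, 3), "sum:", round(best_sum, 3))
```

Even for ten candidates, 252 combinations have to be checked, and the number grows combinatorially with the size of the pool. That is the motivation for max_graph_size and the heuristic search.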

Generating Oligosets

Using the OligosetGeneratorIndependentSet, the pipeline generates non-overlapping sets of oligos. The generator uses the scoring strategies and selection policies to create optimal sets of a user-defined size.

Set Parameters:

  • set_size_opt: Optimal number of oligos per set.

  • set_size_min: Minimum number of oligos required for a set.

  • n_sets: Number of sets to generate.

Graph Constraints:

  • max_graph_size: Limits the size of the graph for feasible computation.

  • distance_between_oligos: Ensures no overlap between selected oligos.

Note: If set_size_min is set to a large value, consider switching from the graph-based selection policy to the GreedySelectionPolicy, as it increases the likelihood of finding large sets. However, keep in mind that the greedy approach may yield less optimal sets than the graph-based method.
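
As a rough illustration of the greedy idea and of the distance constraint, the sketch below picks oligos in order of increasing score and skips any candidate that lies closer than distance_between_oligos to an oligo already chosen. It is one simple greedy strategy, not necessarily what GreedySelectionPolicy does internally. The toy candidates, the helper names, and the definition of the gap (number of bases between two oligos with inclusive coordinates) are assumptions.

```python
def gap(a, b):
    """Number of bases between two oligos (inclusive coordinates); negative if they overlap."""
    first, second = (a, b) if a["start"] <= b["start"] else (b, a)
    return second["start"] - first["end"] - 1


def greedy_select(oligos, set_size, distance_between_oligos=0):
    """Take the best-scoring oligos first, keeping only those far enough from earlier picks."""
    chosen = []
    for candidate in sorted(oligos, key=lambda o: o["score"]):
        if all(gap(candidate, c) >= distance_between_oligos for c in chosen):
            chosen.append(candidate)
        if len(chosen) == set_size:
            break
    return chosen


# Hypothetical candidates: B overlaps A, C starts 4 bases after A ends.
candidates = [
    {"id": "A", "start": 1, "end": 30, "score": 0.01},
    {"id": "B", "start": 20, "end": 49, "score": 0.02},
    {"id": "C", "start": 35, "end": 64, "score": 0.03},
    {"id": "D", "start": 70, "end": 99, "score": 0.04},
]

adjacent_ok = [o["id"] for o in greedy_select(candidates, set_size=3)]
spaced_out = [o["id"] for o in greedy_select(candidates, 3, distance_between_oligos=10)]
print(adjacent_ok)  # ['A', 'C', 'D']
print(spaced_out)   # ['A', 'D']
```

In the second call the larger distance leaves only two compatible oligos, so a set of three cannot be formed. In the pipeline, a region whose sets fall below set_size_min would yield no oligoset.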

[8]:
selection_policy = GraphBasedSelectionPolicy(
    set_scoring=set_scoring,
    pre_filter=pre_filter,
    n_attempts=n_attempts,
    heuristic=heuristic,
    heuristic_n_attempts=heuristic_n_attempts,
    clique_init_approximation=clique_init_approximation,
)
probeset_generator = OligosetGeneratorIndependentSet(
    selection_policy=selection_policy,
    oligos_scoring=oligos_scoring,
    set_scoring=set_scoring,
    max_oligos=max_graph_size,
    distance_between_oligos=distance_between_oligos,
)
[9]:
oligo_database = probeset_generator.apply(
    oligo_database=oligo_database,
    sequence_type="oligo",
    set_size_opt=oligo_size_opt,
    set_size_min=oligo_size_min,
    n_sets=n_sets,
    n_jobs=n_jobs,
)

# Save Database
dir_database = "4_db_oligoset_selection"
oligo_database.save_database(name_database=dir_database)
[9]:
'/Users/lisasousa/Desktop/odt_projects/oligo-designer-toolsuite/tutorials/results/db_oligos/4_db_oligoset_selection'

Output Structure

The generated sets are saved in a pandas DataFrame with the following structure:

oligoset_id  oligo_0   oligo_1   oligo_2   oligo_n   set_score_1  set_score_2
0            AGRN_184  AGRN_133  AGRN_832  AGRN_706  0.3445       1.2332

  • oligoset_id: Identifies each oligo set.

  • oligo_0, oligo_1, …: Oligos in the set.

  • set_score_*: Scores representing the set’s efficiency.

[10]:
# Show oligosets for a specific gene
oligo_database.oligosets["AARS1"]
[10]:
oligoset_id oligo_0 oligo_1 oligo_2 oligo_3 oligo_4 set_score_worst set_score_sum
0 0 AARS1::2291 AARS1::10497 AARS1::14964 AARS1::12911 AARS1::11384 0.01 0.030
1 1 AARS1::10497 AARS1::2290 AARS1::14964 AARS1::12911 AARS1::13103 0.01 0.030
2 2 AARS1::10497 AARS1::2291 AARS1::14964 AARS1::12911 AARS1::13103 0.01 0.030
3 3 AARS1::10497 AARS1::2255 AARS1::14964 AARS1::12911 AARS1::13103 0.01 0.032
4 4 AARS1::10497 AARS1::2256 AARS1::14964 AARS1::12911 AARS1::13103 0.01 0.032
5 5 AARS1::10497 AARS1::14964 AARS1::12911 AARS1::13103 AARS1::1780 0.01 0.038

We can now inspect the selected oligos below. The table shows that the selected oligos all have an isoform consensus of 100% (i.e. they cover all isoforms), a GC content equal to the optimal GC content, and a melting temperature very close to the optimal melting temperature.

[11]:
output_table = oligo_database.get_oligo_property_table(properties=["start", "end", "oligo", "oligo_score", "isoform_consensus", "GC_content_oligo", "TmNN_oligo"], flatten=True, region_ids="AARS1")
output_table.sort_values(by="start")
[11]:
region_id oligo_id start end oligo oligo_score isoform_consensus GC_content_oligo TmNN_oligo
4 AARS1 AARS1::2291 70261065 70261108 CTGATCCCCCACTTTCAGGTCACCGTAGATGGTTCCAATGTGTA 0.002 100.0 50.0 65.01
3 AARS1 AARS1::2290 70261066 70261109 TGATCCCCCACTTTCAGGTCACCGTAGATGGTTCCAATGTGTAG 0.002 100.0 50.0 65.01
2 AARS1 AARS1::2256 70261100 70261143 CAATGTGTAGCACATACCCTCCTCGGACCTGAGCATTCTTCACT 0.004 100.0 50.0 65.02
1 AARS1 AARS1::2255 70261101 70261144 AATGTGTAGCACATACCCTCCTCGGACCTGAGCATTCTTCACTG 0.004 100.0 50.0 65.02
0 AARS1 AARS1::1780 70262378 70262421 AGCCTTCGTCATAGATCTGGCCTCCTTGCTCAGCATAGAAACAG 0.010 100.0 50.0 65.05
9 AARS1 AARS1::14964 70267764 70267805 CCTTCACCATGTCTGGGTCCTTCTTCAGCTCAGGAAATGCAT 0.006 100.0 50.0 65.03
8 AARS1 AARS1::13103 70270282 70270321 GGCCCATCCCTGTGTCAATGCTTTTCTTGGGAAGAGGTTT 0.010 100.0 50.0 65.05
7 AARS1 AARS1::12911 70271805 70271848 TTCCAGATCTCCAGCACATTAGGGTCGTCCTGGTTGACAAGATG 0.010 100.0 50.0 65.05
6 AARS1 AARS1::11384 70276990 70277033 AGAGCCCAGCATCTCGAAGAAGGTGTGATGATAGACATCCTTGC 0.010 100.0 50.0 65.05
5 AARS1 AARS1::10497 70282670 70282713 TGGTGGCAGACGAGTGAACATACGTATGCTCGTTCCTCTTGAAG 0.002 100.0 50.0 65.01

Applying set selection to the OligoDatabase is critical for several reasons:

  • Ensures Experimental Efficiency: Generates sets of high-scoring oligos, ensuring effective target binding without competition.

  • Customizable and Scalable: Users can tailor scoring strategies and selection policies to meet specific experimental needs.

  • Optimized Workflow: Pre-filtering approaches and heuristic methods enable efficient generation of high-quality oligosets, even for large datasets.

This step finalizes the pipeline by providing optimal, ready-to-use oligosets tailored to experimental requirements. These sets can then be directly integrated into downstream experimental protocols.