18:30
Welcome Reception
8:15 - 8:30
Opening Remarks
Confluence of structure-based models and machine learning methods toward gaining a deeper understanding of biomolecular systems dynamics
Biomolecular systems have access to a spectrum of modes of motion under physiological conditions. These motions, referred to as intrinsic dynamics, ensure adaptation to various interactions in the cell, and largely assist in, if not determine, viable mechanisms of biological function. In recent years, machine learning frameworks have proven uniquely useful in structural biology, permitting us to predict the structures of millions of proteins. However, prediction of structure per se falls short of providing insights into the time-dependent mechanisms of function. There is a need to predict or visualize the intrinsic dynamics of biomolecular structures in order to make inferences on potential mechanisms of function or dysfunction. Not surprisingly, machine learning studies that leverage structural dynamics data are now gaining traction. The merger of machine learning with structural dynamics theory and methods is expected to be mutually beneficial for bridging the gap between structure and function. In particular, physics-based theories and models such as elastic network models provide a unique opportunity to efficiently generate data on the dynamics of the proteome, which may then be advantageously employed by machine learning methods to assist in inferring mechanisms of protein function, assessing pathogenicity, or estimating binding affinities.
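To make the elastic-network idea concrete, the sketch below implements a minimal Gaussian network model (GNM), one member of the model family mentioned above, on synthetic C-alpha coordinates (the cutoff value and toy chain are illustrative assumptions, not from the talk). The slowest non-zero modes of the Kirchhoff matrix approximate a structure's intrinsic dynamics and can serve as cheap training data for downstream learning.

```python
# Minimal GNM sketch (illustrative only): build a Kirchhoff matrix from
# C-alpha contacts, then read intrinsic dynamics off its low-frequency modes.
import numpy as np

def gnm_modes(coords, cutoff=7.3):
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    kirchhoff = -(dists < cutoff).astype(float)          # -1 for contacting pairs
    np.fill_diagonal(kirchhoff, 0.0)
    np.fill_diagonal(kirchhoff, -kirchhoff.sum(axis=1))  # diagonal = contact degree
    evals, evecs = np.linalg.eigh(kirchhoff)
    return evals[1:], evecs[:, 1:]                       # drop the trivial zero mode

rng = np.random.default_rng(0)
coords = np.cumsum(rng.normal(scale=2.5, size=(100, 3)), axis=0)  # toy chain
evals, evecs = gnm_modes(coords)
msf = (evecs[:, :5] ** 2 / evals[:5]).sum(axis=1)        # fluctuations from slow modes
print("residue with largest predicted fluctuation:", int(msf.argmax()))
```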
9:30 - 10:30
Session Chair: Ercüment Çiçek
[PT] Cell segmentation for high-resolution spatial transcriptomics
Hao Chen, Dongshunyi Li and Ziv Bar-Joseph
Spatial transcriptomics promises to greatly improve our ability to understand tissue organization and cell-cell interactions. While most current platforms for spatial transcriptomics only provide multi-cellular resolution (10-15 cells per spot), recent technologies provide much denser spot placement, leading to sub-cellular resolution. A key challenge for these newer methods is cell segmentation and the assignment of spots to cells. Traditional image-based segmentation methods face several drawbacks and do not take full advantage of the information profiled by spatial transcriptomics. Here, we present SCS, which integrates imaging data with sequencing data to improve cell segmentation accuracy. SCS combines information from neighboring spots and employs a transformer model to adaptively learn the relevance of different spots and the relative position of each spot to the center of its cell. We tested SCS on two new sub-cellular spatial transcriptomics technologies and compared its performance to traditional image-based segmentation methods. As we show, SCS achieves better accuracy, identifies more cells, and leads to more realistic cell size estimates. Analysis of RNAs enriched in different sub-cellular regions based on SCS spot assignments provides information on RNA localization and further supports the segmentation results.
[PT] Unraveling causal gene regulation from the RNA velocity graph using Velorama
Rohit Singh, Alexander Wu, Anish Mudide and Bonnie Berger
Gene regulatory network (GRN) inference that incorporates single-cell RNA-seq (scRNA-seq) differentiation trajectories or RNA velocity can reveal causal links between transcription factors and their target genes. However, current GRN inference methods require a total ordering of cells along a linear pseudotemporal axis, which is biologically inappropriate since trajectories with branches cannot be reduced to a single time axis. Such orderings are especially difficult to derive from RNA velocity studies since they characterize each cell’s state transition separately. Here, we introduce Velorama, a novel conceptual approach to causal GRN inference that newly represents scRNA-seq differentiation dynamics as a partial ordering of cells and operates on the directed acyclic graph (DAG) of cells constructed from pseudotime or RNA velocity measurements. In contrast to previous approaches, Velorama is able to work directly with RNA velocity-based cell-to-cell transition probabilities. On a standard set of synthetic datasets, we first demonstrate Velorama’s use with just pseudotime, finding that it improves area under the precision-recall curve (AUPRC) by 1.5–3x over state-of-the-art approaches. Using RNA velocity instead of pseudotime as the input to Velorama further improves AUPRC by an additional 1.8–3x. We also applied Velorama to study cell differentiation in pancreas, dentate gyrus, and bone marrow from real datasets and obtained intriguing evidence for the relationship between regulator interaction speeds and mechanisms of gene regulatory control during differentiation. We expect Velorama to be a critical part of the RNA velocity toolkit for investigating the causal drivers of differentiation and disease.
[PT] PASTE2: Partial alignment of multi-slice spatially resolved transcriptomics data
Xinhao Liu, Ron Zeira and Ben Raphael
Spatially resolved transcriptomics (SRT) technologies measure mRNA expression at thousands of locations in a tissue slice. However, nearly all SRT technologies measure expression in two-dimensional slices extracted from a three-dimensional tissue, thus losing information that is shared across multiple slices from the same tissue. Integrating SRT data across multiple slices can help recover this information and improve downstream expression analyses, but multi-slice alignment and integration remain a challenging task. Existing methods for integrating SRT data either do not use spatial information or assume that the morphology of the tissue is largely preserved across slices, an assumption that is often violated due to biological or technical reasons. We introduce PASTE2, a method for partial alignment and 3D reconstruction of multi-slice SRT datasets, allowing only partial overlap between aligned slices and/or slice-specific cell types. PASTE2 formulates a novel partial Fused Gromov-Wasserstein Optimal Transport problem, which we solve using a conditional gradient algorithm. PASTE2 includes a model selection procedure to estimate the fraction of overlap between slices, and optionally uses information from histological images that accompany some SRT experiments. We show on both simulated and real data that PASTE2 obtains more accurate alignments than existing methods. We further use PASTE2 to reconstruct a 3D map of gene expression in a Drosophila embryo from a 16-slice Stereo-seq dataset. PASTE2 produces accurate alignments of multi-slice datasets from multiple SRT technologies, enabling detailed studies of spatial gene expression across a wide range of biological applications.
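For a rough flavor of the partial-overlap idea (not the authors' implementation, which solves a partial *fused* Gromov-Wasserstein problem with a conditional gradient algorithm and selects the overlap fraction by model selection), the sketch below uses the plain partial Gromov-Wasserstein solver from the POT library on toy spot coordinates, with the overlap fraction m fixed by hand and no expression term:

```python
# Illustrative partial alignment of two toy slices with POT's partial
# Gromov-Wasserstein solver; PASTE2 additionally fuses expression similarity.
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
slice_a = rng.uniform(size=(60, 2))                                    # spots, slice A
slice_b = np.vstack([slice_a[:40] + 0.01, rng.uniform(size=(20, 2))])  # 40 shared spots

C1, C2 = ot.dist(slice_a, slice_a), ot.dist(slice_b, slice_b)  # intra-slice distances
p, q = np.ones(60) / 60, np.ones(60) / 60                      # uniform spot masses

plan = ot.partial.partial_gromov_wasserstein(C1, C2, p, q, m=0.6)  # m = assumed overlap
print("mass matched per A-spot (first 5):", plan.sum(axis=1)[:5])
```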
10:30 - 11:00
Coffee Break
11:00 - 12:00
Session Chair: Layla Oesper
[PT] Percolate: an exponential family JIVE model to design DNA-based predictors of drug response
Soufiane Mourragui, Marco Loog, Mirrelijn van Nee, Mark van de Wiel, Marcel Reinders and Lodewyk Wessels
Motivation: Anti-cancer drugs may elicit resistance or sensitivity through mechanisms which involve several genomic layers. Nevertheless, we have demonstrated that gene expression contains most of the predictive capacity compared to the remaining omic data types. Unfortunately, this comes at a price: gene expression biomarkers are often hard to interpret and show poor robustness.
Results: To capture the best of both worlds, i.e. the accuracy of gene expression and the robustness of other genomic levels, such as mutations, copy-number or methylation, we developed Percolate, a computational approach which extracts the joint signal between gene expression and the other omic data types. We developed an out-of-sample extension of Percolate which allows predictions on unseen samples without the necessity to recompute the joint signal on all data. We employed Percolate to extract the joint signal between gene expression and either mutations, copy-number or methylation, and used the out-of-sample extension to perform response prediction on unseen samples. We showed that the joint signal recapitulates, and sometimes exceeds, the predictive performance achieved with each data type individually. Importantly, molecular signatures created by Percolate do not require gene expression to be evaluated, rendering them suitable to clinical applications where only one data type is available.
Availability: Percolate is available as a Python 3.7 package, and the scripts to reproduce the results are available online.
[PT] Single-cell methylation sequencing data reveal succinct metastatic migration histories and tumor progression models
Yuelin Liu, Xuan C. Li, Farid Rashidi Mehrabadi, Alejandro A. Schäffer, Drew Pratt, David R. Crawford, Salem Malikić, Erin K. Molloy, Vishaka Gopalan, Stephen M. Mount, Eytan Ruppin, Kenneth Aldape and S. Cenk Sahinalp
Recent studies exploring the impact of methylation in tumor evolution suggest that while the methylation status of many CpG sites is preserved across distinct lineages, that of others is altered as the cancer progresses. Since changes in the methylation status of a CpG site may be retained in mitosis, they could be used to infer the progression history of a tumor via single-cell lineage tree reconstruction. In this work, we introduce the first principled distance-based computational method, Sgootr, for inferring a tumor’s single-cell methylation lineage tree and jointly identifying lineage-informative CpG sites which harbor changes in methylation status that are retained along the lineage. We apply Sgootr to the single-cell bisulfite-treated whole genome sequencing data of multiregionally-sampled tumor cells from 9 metastatic colorectal cancer patients made available by Bian et al., as well as multiregionally-sampled single-cell reduced-representation bisulfite sequencing data from a glioblastoma patient made available by Chaligne et al. We demonstrate that the tumor lineages constructed reveal a simple model underlying colorectal tumor progression and metastatic seeding. A comparison of Sgootr against alternative approaches shows that Sgootr can construct lineage trees with fewer migration events, more concordant with the sequential-progression model of tumor evolution, in a fraction of the time used by prior studies. Interestingly, lineage-informative CpG sites identified by Sgootr lie in inter-CpG island (CGI) regions, as opposed to CGIs, which have been the main regions of interest in genomic methylation-related analyses. Sgootr is implemented as a Snakemake workflow, available at https://github.com/algo-cancer/Sgootr.
[PT] Modeling and predicting cancer clonal evolution with reinforcement learning
Stefan Ivanovic and Mohammed El-Kebir
Cancer results from an evolutionary process that typically yields multiple clones with varying sets of mutations within the same tumor. Accurately modeling this process is key to understanding and predicting cancer evolution. Here, we introduce CloMu (Clone To Mutation), a flexible and low-parameter tree-generative model of cancer evolution. CloMu uses a two-layer neural network trained via reinforcement learning to determine the probability of new mutations based on the existing mutations on a clone. CloMu supports several prediction tasks, including the determination of evolutionary trajectories, tree selection, causality and interchangeability between mutations, and mutation fitness. Importantly, previous methods support only some of these tasks, and many suffer from overfitting on datasets with a large number of mutations. Using simulations, we demonstrate that CloMu either matches or outperforms current methods on a wide variety of prediction tasks. In particular, for simulated data with interchangeable mutations, current methods are unable to uncover causal relationships as effectively as CloMu. On breast cancer and leukemia cohorts, we show that CloMu determines similarities and causal relationships between mutations as well as the fitness of mutations. We validate CloMu's inferred mutation fitness values for the leukemia cohort by comparing them to clonal proportion data not used during training, showing high concordance. In summary, CloMu's low-parameter model facilitates a wide range of prediction tasks regarding cancer evolution on increasingly available cohort-level datasets.
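The core parameterization is compact enough to sketch. Below is a schematic (hypothetical layer sizes, written from the abstract rather than the released code) of a two-layer network that maps a clone's binary mutation vector to a distribution over the next acquired mutation; reinforcement learning then tunes these probabilities so that generated trees explain an observed cohort:

```python
# Schematic of CloMu's model class: P(next mutation | current clone)
# from a two-layer network over the clone's binary mutation vector.
import torch
import torch.nn as nn

n_mutations, hidden = 20, 16                      # hypothetical sizes
net = nn.Sequential(
    nn.Linear(n_mutations, hidden),               # embed existing mutations
    nn.ReLU(),
    nn.Linear(hidden, n_mutations),               # score each candidate mutation
)

clone = torch.zeros(1, n_mutations)
clone[0, [2, 7]] = 1.0                            # clone carries mutations 2 and 7
probs = torch.softmax(net(clone), dim=-1)         # distribution over next mutation
print(probs.shape)                                # torch.Size([1, 20])
```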
12:00 - 13:20
Lunch Break
13:20 - 14:40
Session Chair: Louxin Zhang
[PT] TREE-QMC: Improving quartet graph construction for scalable and accurate species tree estimation from gene trees
Yunheng Han and Erin K. Molloy
Background: Summary methods are one of the dominant approaches to species tree estimation from genome-scale data sets. They are utilized as part of a pipeline in which the first step is to estimate trees from individual regions of the genome (called gene trees) and the second step is to “summarize” them into a species tree. This approach can fail to produce accurate species trees when the input gene trees are highly discordant due to gene tree estimation error as well as biological processes, like incomplete lineage sorting (ILS).
Methods: Here, we introduce a new summary method, TREE-QMC, that is fast and accurate under such challenging scenarios. TREE-QMC builds upon the algorithmic framework of wQMC, which takes a set of weighted quartets (four-leaf trees) as input and then builds a species tree in a divide-and-conquer fashion. At each step in the divide phase, a branch (split) in the species tree is identified by constructing a graph and then seeking its max cut. We improve upon this approach in two ways. First, we address scalability by providing an algorithm to construct the graph directly from the input gene trees instead of the Θ(n^4) weighted quartets; this enables TREE-QMC to run in O(n^3 k) time with some mild assumptions on subproblem sizes, where n is the number of taxa and k is the number of gene trees. Second, we address accuracy by normalizing the quartet weights to account for “artificial taxa,” which are introduced during the divide phase so that solutions on subproblems can be combined during the conquer phase. We introduce both uniform (n1) and non-uniform (n2) normalization schemes, with the latter up-weighting quartets with leaves labeled by species more closely related to the subproblem (n0 denotes no normalization).
Results: We explore the utility of TREE-QMC for multi-locus species tree estimation, benchmarking it against the current leading methods. Our simulation study shows TREE-QMC-n2 is at least as accurate and often more accurate than the dominant method ASTRAL-III (and its improvement FASTRAL), while being highly competitive in terms of runtime. Moreover, TREE-QMC-n2 is at least as accurate as wQFM, while scaling to much larger data sets. We also re-analyze an avian data set; the estimated species trees differ only on the short branches suggestive of ILS, with TREE-QMC recovering a tree closer to the published reference tree than the other methods.
[PT] Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash
Mahmudur Rahman Hera, N. Tessa Pierce-Ward and David Koslicki
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. This approach has been successfully applied to metagenomic taxonomic profiling in the widely used tool "sourmash gather". While experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis to derive various statistics of FracMinHash, and prove that while FracMinHash is not unbiased (in the sense that its expected value is not equal to the quantity it attempts to estimate), this bias is easily corrected for both the containment and Jaccard index versions. Next, we show how FracMinHash can be used to compute point estimates as well as confidence intervals for the evolutionary mutation distance between a pair of sequences by assuming a simple mutation model. We also investigate edge cases where these analyses may fail, so as to effectively warn users of FracMinHash when such cases are likely. Our analyses show that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely than traditional MinHash, and the point estimates and confidence intervals perform significantly better in estimating mutation distances. A Python-based implementation of the algorithms and theorems we derive is freely available at "https://github.com/KoslickiLab/mutation-rate-ci-calculator". The results presented in this paper can be reproduced using the code at "https://github.com/KoslickiLab/FracMinHash-reproducibles".
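A minimal sketch of the idea (an illustration written from the abstract, not the authors' code): keep every k-mer whose hash falls below a fraction s of the hash range, then divide the raw containment estimate by the bias factor 1 − (1 − s)^|A| that the paper derives:

```python
# Toy FracMinHash containment with the paper's bias correction.
import hashlib
import random

def kmers(seq, k=21):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def frac_sketch(kmer_set, s=0.1):
    h = lambda x: int.from_bytes(hashlib.sha1(x.encode()).digest()[:8], "big")
    return {x for x in kmer_set if h(x) < s * 2**64}   # keep a fraction s of hash space

def containment(A, B, s=0.1):
    raw = len(frac_sketch(A, s) & frac_sketch(B, s)) / len(frac_sketch(A, s))
    return raw / (1 - (1 - s) ** len(A))               # debias, per the paper

random.seed(0)
A = kmers("".join(random.choice("ACGT") for _ in range(5000)))
B = set(list(A)[:3000]) | kmers("".join(random.choice("ACGT") for _ in range(2000)))
print("corrected containment:", round(containment(A, B), 3))  # ~0.6 expected
```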
[PT] Efficient taxa identification using a pangenome index
Omar Ahmed, Massimiliano Rossi, Christina Boucher and Ben Langmead
Tools that classify sequencing reads against a database of reference sequences require efficient index data structures. The r-index is a compressed full-text index that answers substring presence/absence, count and locate queries in space proportional to the amount of distinct sequence in the database: O(r) space where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents D = {T1, T2, ..., Td} over an alphabet of size σ, we extend the r-index with O(rd) additional words to support document listing queries for a pattern S[1..m] that occurs in ndoc documents in D in O(m log log_w(σ) + ndoc) time and O(rd) space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to 3 times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared to other approaches. Finally, we present strategies for compacting this structure in applications where read lengths or match lengths can be bounded. Our source code can be found at https://github.com/oma219/docprofiles, and our experimental code can be found at https://github.com/oma219/docprof-experiments.
[PT] Leveraging family data to design Mendelian Randomization that is provably robust to population stratification
Nathan LaPierre, Boyang Fu, Steven Turnbull, Eleazar Eskin and Sriram Sankararaman
Mendelian Randomization (MR) is a widely-used analytical tool that uses genetic variants ("genetic instruments") to determine whether one trait (the "exposure") has a causal effect on another (the "outcome"). The validity of MR rests on three key assumptions: (i) that the genetic instrument affects the exposure; (ii) that the genetic instrument is independent of confounders of the exposure-outcome relationship; (iii) that the genetic instrument affects the outcome only through its effect on the exposure. Unfortunately, the latter two assumptions are often violated in practice, due to several factors including horizontal pleiotropy and population stratification (and related phenomena such as assortative mating and dynastic effects). These violations can introduce bias and false positive findings in MR studies. In this abstract, we introduce MR-Twin, a test for causal effects between pairs of traits that leverages family-based genetic data to provably control for population stratification, cross-trait assortative mating, and dynastic effects, and avoids false positives due to weak instrument bias. MR-Twin tests for a causal effect by comparing an appropriate statistic computed on the offspring in the families to their "digital twins", which are simulated offspring created by sampling offspring genotypes from parental genotypes. By testing for causal effects while conditioning on parental genotypes, MR-Twin avoids confounding caused by population stratification, since offspring genotypes are independent of population information given the genotypes of their parents.
14:40 - 15:10
Coffee Break
15:10 - 16:10
Session Chair: Rayan Chikhi
[PT] Entropy predicts sensitivity of pseudo-random seeds
Benjamin Dominik Maier and Kristoffer Sahlin
In sequence similarity search applications such as read mapping, it is desired that seeds match between a read and reference in regions with mutations or read errors (seed sensitivity). K-mers are likely the most well-known and used seed construct in bioinformatics, and many studies on, e.g., spaced k-mers aim to improve sensitivity over k-mers. Spaced k-mers are highly sensitive when substitutions largely dominate the mutation rate but quickly deteriorate when indels are present. Recently, we developed a pseudo-random seeding construct, strobemers, which were empirically demonstrated to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why.
In this study, we demonstrate that a seed’s entropy (randomness) is a good predictor of seed sensitivity. We propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to demonstrate that our new seed constructs improve sequence-matching sensitivity over other strobemers. We implement strobemers in minimap2 and observe slightly faster alignment time and higher accuracy than with k-mers at various error rates. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and it provides a framework for designing even more sensitive seeds. In addition, we show that the three new seed constructs are practically useful. Finally, in cases where our entropy model does not predict the observed sensitivity well, we explain why and how to improve the model in future work.
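To ground the construct discussed above, here is a small order-2 randstrobe sketch (simplified; the parameters and the hash-based link rule are illustrative assumptions, and the published constructions differ in detail): the first strobe is the k-mer at position i, and the second is chosen pseudo-randomly from a downstream window, which is what lets the seed survive indels between the strobes:

```python
# Simplified order-2 randstrobes: link each k-mer to a pseudo-randomly chosen
# partner in a downstream window. (Python's built-in hash is process-salted;
# a real implementation would use a fixed hash function.)
def randstrobes(seq, k=8, w_min=10, w_max=20):
    seeds = []
    for i in range(len(seq) - k - w_max + 1):
        s1 = seq[i:i + k]
        window = range(i + w_min, i + w_max + 1)
        j = min(window, key=lambda p: hash(s1 + seq[p:p + k]))  # pseudo-random link
        seeds.append((i, j, s1 + seq[j:j + k]))
    return seeds

print(len(randstrobes("ACGTTGCA" * 16)))
```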
[PT] Seed-chain-extend alignment is accurate and runs in close to O(m log n) time for similar sequences: a rigorous average-case analysis
Jim Shaw and Yun William Yu
Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment employed by modern sequence aligners. While effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼ n that is indexed (or seeded) and a mutated substring of length ∼ m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear gap cost chaining and quadratic time gap extension is O(mn^(f(θ)) log n) where f(θ) < 2.43·θ holds as a loose bound. In fact, for reasonable θ = 0.05, f(θ) < 0.08, indicating nearly quasilinear running time in practice. The alignment also turns out to be good; we prove that more than 1 − O(1/√m) fraction of the homologous bases are recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, i.e. only a subset of all k-mers is selected. Under the open syncmer sketching method, one can sketch with decreasing density as a function of n and achieve asymptotically smaller chaining time, yet the same bounds for extension time and recoverability hold. In other words, sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and conjecture that f(θ) can be further reduced.
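For intuition about the chaining stage analyzed above, here is a toy quadratic chaining DP with a linear gap cost (illustrative only; the paper's analysis concerns an optimal linear-gap-cost chainer, and production aligners use far more efficient formulations):

```python
# Toy O(n^2) anchor chaining with linear gap cost: each anchor is a
# (reference_pos, query_pos) k-mer match; colinear chains are rewarded and
# differences between reference and query gaps are penalized linearly.
def chain(anchors, k=15, gap_cost=0.01):
    anchors = sorted(anchors)
    n = len(anchors)
    score, back = [float(k)] * n, [-1] * n
    for i in range(n):
        for j in range(i):
            dr = anchors[i][0] - anchors[j][0]
            dq = anchors[i][1] - anchors[j][1]
            if dr > 0 and dq > 0:                       # colinear predecessors only
                s = score[j] + k - gap_cost * abs(dr - dq)
                if s > score[i]:
                    score[i], back[i] = s, j
    i = max(range(n), key=score.__getitem__)            # trace back the best chain
    best = []
    while i != -1:
        best.append(anchors[i])
        i = back[i]
    return best[::-1]

print(chain([(0, 0), (20, 21), (40, 39), (60, 62), (100, 300)]))
```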
[PT] Ultra-fast genome-wide inference of pairwise coalescence times
Regev Schweiger and Richard Durbin
The pairwise sequentially Markovian coalescent (PSMC) algorithm and its extensions infer the coalescence time of two homologous chromosomes at each genomic position. This inference is utilized in reconstructing demographic histories, detecting selection signatures, genome-wide association studies, constructing ancestral recombination graphs and more. Inference of coalescence times between each pair of haplotypes in a large dataset is of great interest, as they may provide rich information about the population structure and history of the sample.
We introduce a new method, Gamma-SMC, which is >14 times faster than current methods. To obtain this speed-up, we represent the posterior coalescence time distributions succinctly as a Gamma distribution with just two parameters, whereas in PSMC and its extensions they are held as a vector over discrete intervals of time. Thus, Gamma-SMC has constant time complexity per site, without dependence on the number of discrete time states. Additionally, due to this continuous representation, our method is able to infer times spanning many orders of magnitude, and as such is robust to parameter misspecification. We describe how this approach works, illustrate its performance on simulated and real data, and use it to study recent positive selection in the 1000 Genomes Project dataset.
Studying single-cell dynamics and fate decisions across modalities, time and space
Single-cell technologies are revolutionizing our understanding of cellular dynamics across biological processes. However, analyzing and interpreting these data poses computational and conceptual challenges, in particular with recent developments regarding spatio-temporal profiling and lineage tracing. Here, I will discuss approaches for studying single-cell dynamics and fate decisions across molecular modalities, time, and space.
After a brief review of pseudotemporal ordering and RNA velocity, I will show how optimal transport can be used consistently across biological applications, including temporal, spatial, and spatio-temporal single-cell problems, such as aligning multi-modal single-cell data across space and time. In addition, I will discuss how to map cells across time points by leveraging lineage as well as state information, and demonstrate application to in-vivo lineage-traced single-cell studies to uncover precise differentiation trajectories and gene dynamics. Finally, I will discuss CellRank and a recent extension beyond RNA velocity to learn dynamics based on any pseudotime, developmental potential, real-time information, and metabolic labeling data.
17:10
Business Meeting
8:30 - 9:30
Session Chair: Krister Swenson
[PT] Statistically consistent rooting of species trees under the multispecies coalescent model
Yasamin Tabatabaee, Sebastien Roch and Tandy Warnow
Rooted species trees are used in several downstream applications of phylogenetics. Most species tree estimation methods produce unrooted trees and additional methods are then used to root these unrooted trees. Recently, Quintet Rooting (QR) (Tabatabaee et al., ISMB and Bioinformatics 2022), a polynomial-time method for rooting an unrooted species tree given unrooted gene trees under the multispecies coalescent, was introduced. QR, which is based on a proof of identifiability of rooted 5-taxon trees in the presence of incomplete lineage sorting, was shown to have good accuracy, improving over other methods for rooting species trees when incomplete lineage sorting was the only cause of gene tree discordance, except when gene tree estimation error was very high. However, the statistical consistency of QR was left as an open question. Here, we present QR-STAR, a polynomial-time variant of QR that has an additional step for determining the rooted shape of each quintet tree. We prove that QR-STAR is statistically consistent under the multispecies coalescent model, and our simulation study shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open source form at https://github.com/ytabatabaee/Quintet-Rooting.
[PT] Startle: a star homoplasy approach for CRISPR-Cas9 lineage tracing
Palash Sashittal, Henri Schmidt, Michelle Chan and Ben Raphael
CRISPR-Cas9 based genome editing combined with single-cell sequencing enables the tracing of the history of cell divisions, or cellular lineage, in tissues and whole organisms. While standard phylogenetic approaches may be applied to reconstruct cellular lineage trees from this data, the unique features of the CRISPR-Cas9 editing process motivate the development of specialized models that describe the evolution of cells with CRISPR-Cas9 induced mutations. Here, we introduce the star homoplasy model, a novel evolutionary model that constrains a phylogenetic character to mutate at most once along a lineage, capturing the non-modifiability property of CRISPR-Cas9 mutations. We derive a combinatorial characterization of star homoplasy phylogenies by identifying a relationship between the star homoplasy model and the binary perfect phylogeny model. We use this characterization to develop an algorithm, Startle (Star tree lineage estimator), that computes a maximum parsimony star homoplasy phylogeny. We demonstrate that Startle infers more accurate phylogenies on simulated CRISPR-based lineage tracing data compared to existing methods, particularly on data with high amounts of dropout and homoplasy. Startle also infers more parsimonious phylogenies with fewer metastatic migrations on a lineage tracing dataset from mouse metastatic lung adenocarcinoma.
[PT] MD-Cat: Phylogenetic dating under a flexible categorical model using expectation-maximization
Uyen Mai, Eduardo Charvel and Siavash Mirarab
Dating phylogenetic trees to obtain branch lengths in units of time is essential for many downstream applications but has remained challenging. Dating requires inferring mutation rates that can change across the tree. While we can assume to have information about a small subset of nodes from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a clock model that defines a distribution over rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification, where the assumed parametric statistical clock model vastly differs from the true distribution. Notably, existing methods tend to assume rigid, often unimodal rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate rate categories and branch lengths in time units. Our model makes fewer assumptions about the true clock model than parametric models such as the Gamma or LogNormal distributions. Our results on simulated data and on two real datasets (Angiosperms and HIV), across a wide selection of rate distributions, show that MD-Cat is often more accurate than the alternatives, especially on datasets with nonmodal or multimodal clock models.
Genomics, Imaging, and AI. Three technologies which are changing basic research and clinical practice
The landscape of biological science has been changing over the last two decades. Genomics and Imaging are the two dominant high dimensional measurement technologies for living systems. They provide complex, high dimensional data which can be utilised in a variety of ways. Coupled with this, the rise of machine learning, including deep neural networks and other technologies grouped under artificial intelligence, has given us a new toolkit to navigate, explore and understand these data. In this talk I will outline this change and the opportunities it presents for the field.
10:30 - 11:00
Coffee Break
11:00 - 12:00
Session Chair: Iman Hajirasouliha
[PT] VStrains: de novo reconstruction of viral strains via iterative path extraction from assembly graphs
Runpeng Luo and Yu Lin
With the high mutation rate in viruses, a mixture of closely related viral strains (called viral quasispecies) often co-infects an individual host. Reconstructing individual strains from viral quasispecies is a key step to characterizing the viral population, revealing strain-level genetic variability, and providing insights into biomedical and clinical studies. Reference-based approaches for reconstructing viral strains suffer from the lack of high-quality references (a consequence of high mutation rates) and from biased variant calling introduced by a selected reference. De novo methods require no references but face challenges due to errors in reads, the high similarity of quasispecies, and uneven abundance of strains.
In this paper, we propose VStrains, a de novo approach for reconstructing strains from viral quasispecies. VStrains incorporates contigs, paired-end reads, and coverage information to iteratively extract the strain-specific paths from assembly graphs. We benchmark VStrains against multiple state-of-the-art de novo and reference-based approaches on both simulated and real datasets. Experimental results demonstrate that VStrains achieves the best overall performance on both simulated and real datasets under a comprehensive set of metrics such as genome fraction, duplication ratio, NGA50, error rate, etc.
[PT] Spectrum preserving tilings enable sparse and modular reference indexing
Jason Fan, Jamshed Khan, Giulio Ermanno Pibiri and Robert Patro
The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences R so that the position of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics.
In this work, we introduce the spectrum preserving tiling (SPT), a general representation of R that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in R. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the k-mer-to-tile mapping, since the former depend on the amount of total sequence while the latter depend on the number of unique k-mers in R.
To address this, we introduce a class of sampling schemes for SPTs that trade speed for a smaller tile-to-occurrence mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3G to 34.6G while incurring only a 3.6x slowdown when querying k-mers from a sequenced read set.
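The two-level decomposition described above is easy to illustrate with plain dictionaries (a toy, far from pufferfish2's compact data structures): a k-mer-to-tile map answers which tile a k-mer lives in and at what offset, and a tile-to-occurrence map expands that to reference positions. The sampling schemes in the paper shrink the second map at the cost of extra work per query.

```python
# Toy two-level SPT-style index: k-mer -> (tile, offset), tile -> occurrences.
k = 4
tiles = {0: "ACGTACG", 1: "TTGCA"}                    # the tile set
occurrences = {0: [("chr1", 0), ("chr2", 10)],        # tile -> (reference, position)
               1: [("chr1", 7)]}
kmer_to_tile = {tile[i:i + k]: (tid, i)
                for tid, tile in tiles.items()
                for i in range(len(tile) - k + 1)}

def locate(kmer):
    tid, offset = kmer_to_tile[kmer]
    return [(ref, pos + offset) for ref, pos in occurrences[tid]]

print(locate("GTAC"))  # [('chr1', 2), ('chr2', 12)]
```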
[PT] Extremely-fast construction and querying of compacted and colored de Bruijn graphs with GGCAT
Andrea Cracco and Alexandru I. Tomescu
Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences, and associate to each k-mer the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs. Compared to Cuttlefish 2 (Genome Biology, 2022), the state-of-the-art for constructing compacted de Bruijn graphs, GGCAT has a speedup of up to 3.4x for k = 63 and up to 20.8x for k = 255. Compared to Bifrost (Genome Biology, 2020), the state-of-the-art for constructing the colored variant, GGCAT achieves a speedup of up to 33.3x for k = 27. GGCAT is up to 480x faster than Bifrost for batch sequence queries on colored graphs. GGCAT is based on a new approach merging the k-mer counting step with the unitig construction step, and on many practical optimizations. GGCAT is implemented in Rust and is freely available at https://github.com/algbio/ggcat.
12:00 - 13:20
Lunch Break
13:20 - 14:40
Session Chair: Qian Peng
[PT] FastRecomb: Fast inference of genetic recombination rates in biobank scale data
Ardalan Naseri, William Yue, Shaojie Zhang and Degui Zhi
While rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies use only one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. While the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage-disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10 kb resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising hundreds of thousands of haplotypes, while other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
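Since the PBWT is the engine here, a compact version of its core sweep (transcribed from Durbin's 2014 PBWT paper, Algorithms 1-2, not from the FastRecomb code) may help: at each site the haplotypes are kept sorted by reversed prefix, and the divergence array records where each adjacent pair's match begins, so long matches, the candidate IBD segments whose boundaries FastRecomb counts, show up as small divergence values.

```python
# PBWT sweep (Durbin 2014): a[] = haplotypes sorted by reversed prefix at each
# site, d[] = start of the match between neighbors a[i-1] and a[i].
def pbwt_sweep(X):
    """X: list of equal-length 0/1 haplotype lists."""
    m, n = len(X), len(X[0])
    a, d = list(range(m)), [0] * m
    for k in range(n):
        p = q = k + 1
        a0, a1, d0, d1 = [], [], [], []
        for idx, start in zip(a, d):
            p, q = max(p, start), max(q, start)
            if X[idx][k] == 0:
                a0.append(idx); d0.append(p); p = 0
            else:
                a1.append(idx); d1.append(q); q = 0
        a, d = a0 + a1, d0 + d1
    return a, d

haps = [[0, 1, 0, 1, 1], [0, 1, 0, 1, 0], [1, 0, 0, 1, 1], [0, 1, 0, 1, 1]]
a, d = pbwt_sweep(haps)
print(a, d)  # neighbors a[i-1], a[i] match from site d[i] to the end
```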
[PT] Phenotypic subtyping via contrastive learning
Aditya Gorla, Sriram Sankararaman, Esteban Burchard, Jonathan Flint, Noah Zaitlen and Elior Rahmani
Defining and accounting for subphenotypic structure has the potential to increase statistical power and provide a deeper understanding of the heterogeneity in the molecular basis of complex disease. Existing phenotype subtyping methods primarily rely on clinically observed heterogeneity or metadata clustering. However, they tend to capture the dominant sources of variation in the data, which often originate from variation that is not descriptive of the mechanistic heterogeneity of the phenotype of interest; in fact, such dominant sources of variation, such as population structure or technical variation, are in general expected to be independent of subphenotypic structure. We instead aim to find a subspace with signal that is unique to a group of samples for which we believe subphenotypic variation exists (e.g., cases of a disease). To that end, we introduce Phenotype Aware Components Analysis (PACA), a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. In the context of disease, PACA learns a gradient of variation unique to cases in a given dataset, while leveraging control samples to account for variation and imbalances of biological and technical confounders between cases and controls. We evaluated PACA using an extensive simulation study, as well as on various subtyping tasks using genotypes, transcriptomics, and DNA methylation data. Our results provide strong evidence that PACA robustly captures weak unknown variation of interest while being calibrated and well-powered, far surpassing the performance of alternative methods. This renders PACA a state-of-the-art tool for defining de novo subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.
[PT] Minimal Positional Substring Cover: A haplotype threading alternative to Li & Stephens Model
Ahsan Sanaullah, Degui Zhi and Shaojie Zhang
The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels containing hundreds to thousands of haplotypes, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and it has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when the sample size is large (tens of thousands to millions) because of its linear time complexity (O(MN), where M is the number of haplotypes and N is the number of sites in the panel). Recently, the PBWT, an efficient data structure capturing local haplotype matching among haplotypes, was proposed as a fast method for computing an optimal (Viterbi) solution to the LS HMM. But the solution space of LS for large panels is still elusive. Previously, we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time independent of the sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC, first showing that every MPSC has a set of required regions, and then proposing an MPSC graph. In addition, we derive a number of optimal algorithms for MPSC, including solution enumeration, the Length Maximal MPSC (an MPSC with the maximum total length), and h-MPSC (h-fold coverage guaranteed) solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solve an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in terms of revealing the characteristics of biobank-scale datasets and can improve genotype imputation.
[HT] PCA outperforms popular hidden variable inference methods for molecular QTL mapping
Heather Zhou, Lei Li, Yumei Li, Wei Li and Jingyi Jessica Li
Background: Estimating and accounting for hidden variables is widely practiced as an important step in molecular quantitative trait locus (molecular QTL, henceforth “QTL”) analysis for improving the power of QTL identification. However, few benchmark studies have been performed to evaluate the efficacy of the various methods developed for this purpose.
Results: Here we benchmark popular hidden variable inference methods including surrogate variable analysis (SVA), probabilistic estimation of expression residuals (PEER), and hidden covariates with prior (HCP) against principal component analysis (PCA)—a well-established dimension reduction and factor discovery method—via 362 synthetic and 110 real data sets. We show that PCA not only underlies the statistical methodology behind the popular methods but is also orders of magnitude faster, better-performing, and much easier to interpret and use.
Conclusions: To help researchers use PCA in their QTL analysis, we provide an R package PCAForQTL along with a detailed guide, both of which are freely available at https://github.com/heatherjzhou/PCAForQTL. We believe that using PCA rather than SVA, PEER, or HCP will substantially improve and simplify hidden variable inference in QTL mapping as well as increase the transparency and reproducibility of QTL research.
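A bare-bones Python version of the recipe the paper advocates (a sketch only; PCAForQTL is an R package and additionally guides choices such as the number of PCs via an elbow rule) is to take the top expression PCs as hidden covariates and regress them out before association testing:

```python
# Sketch: expression PCs as hidden covariates for QTL mapping.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 1000))                  # samples x genes (toy data)

Z = StandardScaler().fit_transform(expr)             # center/scale genes
pcs = PCA(n_components=20).fit_transform(Z)          # hidden covariates (K chosen, e.g., by elbow)

design = np.hstack([np.ones((200, 1)), pcs])         # intercept + covariates
beta, *_ = np.linalg.lstsq(design, expr[:, 0], rcond=None)
residual = expr[:, 0] - design @ beta                # test SNPs against this residual
```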
14:40 - 16:00
Poster Session I and Coffee Break
15:30 - 17:00
17:00 - 18:00
Session Chair: Rohit Singh
[PT] Computing shortest hyperpaths for pathway inference in cellular reaction networks
Spencer Krieger and John Kececioglu
Signaling and metabolic pathways, which consist of a series of reactions producing target molecules from source compounds, are cornerstones of cellular biology. The cellular reaction networks containing such pathways can be precisely modeled by directed hypergraphs, where each reaction corresponds to a hyperedge, directed from its set of reactants to its set of products. Given such a network represented by a directed hypergraph, inferring the most likely set of reactions that produce a given target from a given set of sources corresponds to finding a shortest hyperpath, which is NP-complete. The best methods currently available for shortest hyperpaths either offer no guarantee of optimality, or exclude hyperpaths containing cycles even though cycles are abundant in real biological pathways.
We derive a novel graph-theoretic characterization of hyperpaths, leveraged in a new formulation of the general shortest hyperpath problem as an integer linear program that for the first time handles hyperpaths containing cycles, and present a novel cutting-plane algorithm that can solve this integer program to optimality in practice. This represents a major advance over the best prior exact algorithm, which was limited to acyclic hyperpaths (and hence fails to find a solution for the many biological instances where all hyperpaths are in fact cyclic). In comprehensive experiments over thousands of instances from the standard NCI-PID and Reactome databases, we demonstrate that our cutting-plane algorithm quickly finds an optimal hyperpath, with a median running time of under ten seconds and a maximum time of around ten minutes, even on large instances with many thousands of reactions. Source code implementing our cutting-plane algorithm for shortest hyperpaths in a new tool called Mmunin is available free for research use at http://mmunin.cs.arizona.edu.
[PT] A fast and scalable method for inferring phylogenetic networks from trees by aligning lineage taxon strings
Louxin Zhang, Niloufar Abhari, Caroline Colijn and Yufeng Wu
A method is developed for inferring the minimum tree-child network from multiple trees by aligning the lineage taxon strings that are obtained from the trees for each taxon. This algorithmic innovation enables us to get around the scalability limitations of existing programs for phylogenetic network inference.
[HT] DeepND: Deep multitask learning of gene risk for comorbid neurodevelopmental disorders
Ilayda Beyreli, Oguzhan Karakahya and A. Ercument Cicek
Autism spectrum disorder (ASD) and intellectual disability (ID) are comorbid neurodevelopmental disorders with complex genetic architectures. Despite large-scale sequencing studies, only a fraction of the risk genes has been identified for both. We present a network-based gene risk prioritization algorithm, DeepND, that performs cross-disorder analysis to improve prediction by exploiting the comorbidity of ASD and ID via multitask learning. Our model leverages information from human brain gene co-expression networks using graph convolutional networks, learning which spatiotemporal neurodevelopmental windows are important for disorder etiologies and improving the state-of-the-art prediction in single- and cross-disorder settings. DeepND identifies the prefrontal and motor-somatosensory cortex (PFC-MSC) brain region and the periods from early- to mid-fetal development and from early childhood to young adulthood as the highest neurodevelopmental risk windows for ASD and ID. We investigate ASD- and ID-associated copy-number variation (CNV) regions and report our findings for several susceptibility gene candidates. DeepND can be generalized to analyze any combination of comorbid disorders.
20:00
Gala Dinner
8:30 - 9:30
Session Chair: Nadia Pisanti
[PT] Translation rate prediction and regulatory motif discovery with multi-task learning
Weizhong Zheng, John H.C. Fong, Yuk Kei Wan, Athena H.Y. Chu, Yuanhua Huang, Alan S.L. Wong and Joshua Ho
Many studies have found that sequences in the 5' untranslated region (UTR) impact the translation rate of an mRNA, but the regulatory grammar that underpins this translation regulation remains elusive. Deep learning methods deployed to analyse massive sequencing datasets offer new solutions to motif discovery. However, existing work has focused on extracting sequence motifs from individual datasets, which may not generalise to other datasets from the same cell type. We hypothesise that motifs genuinely involved in controlling translation rate are the ones that can be extracted from diverse datasets generated by different experimental techniques. In order to reveal more generalised cis-regulatory motifs for RNA translation, we develop a multi-task translation rate predictor, MTtrans, to integrate information from multiple datasets. Compared to single-task models, MTtrans reaches higher prediction accuracy on all the benchmarked datasets generated by various experimental techniques. We show that features learned in human samples are directly transferable to a dataset from yeast systems, demonstrating robustness in identifying evolutionarily conserved sequence motifs. Furthermore, our newly generated experimental data corroborated the effect of most of the motifs identified by MTtrans trained on multiple public datasets, further demonstrating its utility for discovering generalisable motifs. MTtrans effectively integrates biological insights from diverse experiments and allows robust extraction of translation-associated sequence motifs in 5'UTRs.
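The multi-task setup is the classic hard-parameter-sharing pattern; the schematic below (hypothetical layer sizes, not the published architecture) shows the shape of such a model: one shared motif-detecting encoder over one-hot 5'UTR sequences, with one translation-rate head per dataset:

```python
# Generic hard-parameter-sharing model: shared sequence encoder,
# one regression head per task/dataset.
import torch
import torch.nn as nn

class MultiTaskUTRModel(nn.Module):
    def __init__(self, n_tasks=3):
        super().__init__()
        self.shared = nn.Sequential(                  # shared motif-detecting tower
            nn.Conv1d(4, 64, kernel_size=8), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleList(                   # task-specific rate heads
            [nn.Linear(64, 1) for _ in range(n_tasks)])

    def forward(self, x, task):
        return self.heads[task](self.shared(x))

model = MultiTaskUTRModel()
one_hot_utr = torch.zeros(2, 4, 100)                  # batch of one-hot 5'UTRs
print(model(one_hot_utr, task=0).shape)               # torch.Size([2, 1])
```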
[PT] Efficient minimizer orders for large values of k using minimum decycling sets
David Pellow, Lianrong Pu, Baris Ekim, Lior Kotlar, Ron Shamir and Yaron Orenstein
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long sub-sequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Unfortunately, generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus cannot help in the many applications that require minimizer orders for larger k.
Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders, new orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
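For reference, here is a plain random-order minimizer scheme (a sketch; the paper's contribution is a better k-mer order derived from a minimum decycling set, which would replace the hash-based order below and can be evaluated on the fly):

```python
# Random-order window minimizers: pick the smallest k-mer (by a fixed hash)
# in every window of w consecutive k-mers.
import hashlib
import random

def minimizers(seq, k=31, w=20):
    def order(kmer):
        return hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    selected = set()
    for i in range(len(seq) - w - k + 2):             # one window per start position
        window = [seq[j:j + k] for j in range(i, i + w)]
        pos = min(range(w), key=lambda t: order(window[t]))
        selected.add((i + pos, window[pos]))
    return sorted(selected)

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(500))
print(len(minimizers(seq)), "selected k-mers")
```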
[PT] Dashing 2: genomic sketching with multiplicities and locality-sensitive hashing
Daniel Baker and Ben Langmead
A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing dataset. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. While existing tools can easily compare tens of thousands of genomes, relevant datasets can reach millions of sequences and beyond. Popular tools also fail to consider k-mer multiplicities, making them less applicable in quantitative settings. We describe a method called Dashing 2 that builds on the SetSketch data structure. SetSketch is related to HyperLogLog (HLL), but discards the leading zero count in favor of a truncated logarithm of adjustable base. Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences. Dashing 2 is free, open source software available at https://github.com/dnbaker/dashing2.
9:30 - 10:00
Coffee Break
Leveraging population and statistical genetics in the biobank era for gaining new insights into human complex traits
Large-scale human genetic datasets contain considerable variability in both linkage disequilibrium (LD) and identity by descent (IBD) among individuals. These signatures can be leveraged to gain new insights into human complex trait architecture, and algorithms built for the scale at which human genomes are now being sequenced must increasingly be motivated by population genetic processes to yield new understanding of the deep history of our species. I will introduce recent methodological advances developed by my group that directly leverage LD and IBD to characterize complex trait architecture and improve long-range inter-chromosomal phasing and parent-of-origin association studies. These approaches, and the evolutionary lens in general, open up new avenues for understanding the causes and consequences of human genetic variation.
11:00 - 12:00
Session Chair: Ardalan Naseri
[HT] Genome-wide mapping of somatic mutation rates uncovers drivers of cancer
Maxwell Sherman, Adam Yaari, Oliver Priebe, Felix Dietlein, Po-Ru Loh and Bonnie Berger
Identification of cancer driver mutations that confer a proliferative advantage is central to understanding cancer; however, searches have often been limited to protein-coding sequences and specific non-coding elements (for example, promoters) because of the challenge of modeling the highly variable somatic mutation rates observed across tumor genomes. Here we present Dig, a method to search for driver elements and mutations anywhere in the genome. We use deep neural networks to map cancer-specific mutation rates genome-wide at kilobase-scale resolution. These estimates are then refined to search for evidence of driver mutations under positive selection throughout the genome by comparing observed to expected mutation counts. We mapped mutation rates for 37 cancer types and applied these maps to identify putative drivers within intronic cryptic splice regions, 5′ untranslated regions and infrequently mutated genes. Our high-resolution mutation rate maps, available for web-based exploration, are a resource to enable driver discovery genome-wide.
[PT] Information-theoretic Classification Accuracy: A criterion that guides data-driven combination of ambiguous outcome labels in multi-class classification
Chihao Zhang, Yiling Elaine Chen, Shihua Zhang and Jingyi Jessica Li
Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets. While practitioners commonly combine ambiguous outcome labels for all data points (instances) in an ad hoc way to improve the accuracy of multi-class classification, there has been no principled approach to guide the label combination for all data points by any optimality criterion. To address this problem, we propose the information-theoretic classification accuracy (ITCA), a criterion that balances the trade-off between prediction accuracy (how well predicted labels agree with actual labels) and classification resolution (how many labels are predictable), to guide practitioners on how to combine ambiguous outcome labels. To find the optimal label combination indicated by ITCA, we propose two search strategies: greedy search and breadth-first search. Notably, ITCA and the two search strategies are adaptive to all machine-learning classification algorithms. Coupled with a classification algorithm and a search strategy, ITCA has two uses: improving prediction accuracy and identifying ambiguous labels. We first verify that ITCA achieves high accuracy with both search strategies in finding the correct label combinations on synthetic and real data. Then we demonstrate the effectiveness of ITCA in diverse applications, including medical prognosis and cell type classification. We also provide theoretical insights into ITCA by studying the oracle and the linear discriminant analysis classification algorithms. The Python package itca (available at https://github.com/JSB-UCLA/ITCA) implements ITCA and the search strategies.
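The greedy strategy is simple to outline. Below is a skeleton (a sketch: `evaluate` stands in for cross-validated training of a chosen classifier plus the paper's ITCA criterion, which is not reproduced here): repeatedly merge the pair of label groups whose combination most improves the score, stopping when no merge helps:

```python
# Skeleton of greedy label combination guided by a user-supplied criterion.
from itertools import combinations

def greedy_combine(labels, evaluate):
    """labels: iterable of class labels; evaluate: {label -> group} -> score."""
    best_combo = {lab: lab for lab in labels}          # start with no combination
    best_score = evaluate(best_combo)
    improved = True
    while improved and len(set(best_combo.values())) > 2:
        improved = False
        for a, b in combinations(sorted(set(best_combo.values())), 2):
            cand = {k: (a if v == b else v) for k, v in best_combo.items()}
            s = evaluate(cand)                         # merge group b into a, rescore
            if s > best_score:
                best_score, best_combo, improved = s, cand, True
    return best_combo, best_score
```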
[PT] Unsupervised deep peak caller for ATAC-seq
Yudi Zhang, Ha Vu, Geetu Tuteja and Karin Dorman
The assay for transposase-accessible chromatin with sequencing (ATAC-seq) is a common assay to identify chromatin-accessible regions by using a Tn5 transposase that can access, cut, and ligate adapters to DNA fragments for subsequent amplification and sequencing. These sequenced regions are quantified and tested for enrichment in a process referred to as “peak calling”. Most unsupervised peak calling methods are based on simple statistical models and suffer from elevated false positive rates. Newly developed supervised deep learning methods can be successful, but they rely on high-quality labeled data for training, which can be difficult to obtain. Moreover, though biological replicates are recognized to be important, there are no established approaches for using replicates in deep learning tools, and the approaches available for traditional methods either cannot be applied to ATAC-seq, where control samples may be unavailable, or are post hoc and do not capitalize on potentially complex but reproducible signal in the read enrichment data. Here, we propose a novel peak caller that uses unsupervised contrastive learning to extract shared signals from multiple replicates. Raw coverage data are encoded to obtain low-dimensional embeddings and optimized to minimize a contrastive loss over biological replicates. These embeddings are passed to another contrastive loss for learning and predicting peaks and decoded to denoised data under an autoencoder loss. We compared our Replicative Contrastive Learner (RCL) method with other existing methods on ATAC-seq data, using annotations from ChromHMM genome segmentations and transcription factor ChIP-seq as a noisy ground truth. RCL consistently achieved the best performance.
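The replicate-level contrastive objective can be sketched as an InfoNCE-style loss in which embeddings of the same genomic region from two biological replicates attract each other while all other regions repel. This is a generic one-directional version for illustration; the published RCL model couples such a loss with a second contrastive loss over predicted peaks and an autoencoder reconstruction loss.

    import numpy as np

    def replicate_contrastive_loss(z1, z2, tau=0.5):
        """InfoNCE over replicates: row i of z1 (region i in replicate 1)
        should match row i of z2 (same region in replicate 2) and differ
        from every other region. z1, z2: (n_regions, dim) embeddings."""
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        sim = z1 @ z2.T / tau                    # cosine similarities / temperature
        sim -= sim.max(axis=1, keepdims=True)    # numerical stability
        log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))          # reward the matched diagonal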
12:00 - 13:20
Lunch Break
13:20 - 15:00
Session Chair: Öznur Taştan
[PT] Pisces: A cross-modal contrastive learning approach to synergistic drug combination prediction
Jiacheng Lin, Hanwen Xu, Addie Woicik, Jianzhu Ma and Sheng Wang
Drug combination therapy is a promising solution to many complicated diseases. Since experimental measurements cannot be scaled to millions of candidate combinations, many computational approaches have been developed to identify synergistic drug combinations. While most of the existing approaches use either SMILES-based features or molecular-graph-based features to represent drugs, we found that neither of these two feature modalities can comprehensively characterize a pair of drugs, necessitating the integration of these two types of features. Here, we propose Pisces, a cross-modal contrastive learning approach for synergistic drug combination prediction. The key idea of our approach is to model the combination of SMILES and molecular graphs as four views of a pair of drugs, and then apply contrastive learning to embed these four views closely to obtain high-quality drug pair embeddings. We evaluated Pisces on a recently released GDSC-Combo dataset, including 102,893 drug combinations and 125 cell lines. Pisces outperformed five existing drug combination prediction approaches under three settings, including vanilla cross validation, stratified cross validation for drug combinations, and stratified cross validation for cell lines. Our case study and ablation studies further confirmed the effectiveness of our novel contrastive learning framework and the importance of integrating the SMILES-based and molecular-graph-based features. Pisces has obtained state-of-the-art results on drug synergy prediction and can potentially be used to model other drug pair applications, such as drug–drug interaction prediction. Availability: Implementation of Pisces and comparison approaches can be accessed at https://github.com/linjc16/Pisces.
[PT] MTGL-ADMET: A novel multi-task graph learning framework for ADMET prediction enhanced by status-theory and maximum flow
Bing-Xue Du, Yi Xu, Siu Ming Yiu, Hui Yu and Jian-Yu Shi
Evaluating drug-like compounds in terms of absorption, distribution, metabolism, excretion, and toxicity (ADMET) is a vital step in drug design. Classical single-task learning based on abundant labels has achieved inspiring progress in predicting individual ADMET endpoints. Multi-task learning (MTL), which has a lower requirement for endpoint labels, can predict multiple ADMET endpoints simultaneously. Nonetheless, it remains an open issue that the performance of existing MTL-based approaches depends on how appropriate the participating tasks are. Furthermore, there is a need to elucidate which substructures are crucial to specific ADMET endpoints. To address these issues, this work constructs a Multi-Task Graph Learning framework for predicting multiple ADMET properties of drug-like small molecules (MTGL-ADMET) under a new MTL paradigm, ‘one primary, multiple auxiliaries’. It first leverages status theory and maximum flow to select appropriate auxiliary tasks for a specific ADMET endpoint task. Then, it designs a novel primary-centered multi-task learning model, which consists of a task-shared atom embedding module, a task-specific molecular embedding module, a primary task-centered gating module, and a multi-task predictor. A comparison with state-of-the-art MTL-based methods demonstrates the superiority of MTGL-ADMET. More elaborate experiments validate its contributions, including the status theory-based auxiliary selection algorithm and the novel MTL architecture. Furthermore, a case study illustrates the interpretability of MTGL-ADMET by indicating crucial substructures w.r.t. the primary task. It is anticipated that this work can boost pharmacokinetic and toxicity analysis in drug discovery. The code and data underlying this article are freely available at https://github.com/dubingxue/MTGL-ADMET.
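To make the flow-based auxiliary selection concrete, the sketch below ranks candidate auxiliary tasks by the flow they carry toward a primary endpoint in a hypothetical task-affinity graph (Python with networkx). The graph, the capacities, and the 0.3 threshold are all made up for illustration; the paper derives its task scores from status theory rather than from any fixed values like these.

    import networkx as nx

    # Hypothetical task-affinity graph: capacities encode how much each
    # candidate auxiliary task is estimated to help the primary endpoint.
    G = nx.DiGraph()
    for u, v, cap in [("src", "logP", 0.9), ("src", "solubility", 0.7),
                      ("src", "hERG", 0.2), ("logP", "absorption", 0.8),
                      ("solubility", "absorption", 0.6), ("hERG", "absorption", 0.1)]:
        G.add_edge(u, v, capacity=cap)

    # Rank candidates by the flow they push toward the primary task.
    _, flow = nx.maximum_flow(G, "src", "absorption")
    contribution = {t: flow["src"][t] for t in ("logP", "solubility", "hERG")}
    auxiliaries = [t for t, f in sorted(contribution.items(), key=lambda kv: -kv[1])
                   if f > 0.3]
    print(auxiliaries)  # ['logP', 'solubility'] under these made-up capacities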
[PT] Drug synergistic combinations predictions via large-scale pre-training and graph structure learning
Zhihang Hu, Qinze Yu, Yucheng Guo, Taifeng Wang, Irwin King, Xin Gao, Le Song and Yu Li
Drug combination therapy is a well-established strategy for disease treatment, with better effectiveness and less safety degradation. However, identifying novel drug combinations through wet-lab experiments is resource-intensive due to the vast combinatorial search space. Recently, computational approaches, specifically deep learning models, have emerged as an efficient way to discover synergistic combinations. While previous methods reported fair performance, their models usually do not take advantage of multi-modal data and are unable to handle new drugs or cell lines. In this study, we collected data from multiple datasets covering various drug-related aspects. We then leveraged large-scale pre-trained models to generate informative representations and features for drugs, proteins, and diseases. On top of these, we built a message-passing graph to propagate information, with the flexibility of graph structure learning; this is, to our knowledge, the first such application to biological networks, and it enables us to generate pseudo-relations in the graph. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods on synergistic prediction benchmark datasets. It can also infer on new drug combinations: in a test on an independent set released by AstraZeneca, we observe a 10% improvement over previous methods. In addition, the model is robust to unseen drugs, surpassing the second-best model by almost 15% AUROC. We believe our framework contributes both to the future wet-lab discovery of novel drug combinations and to promising guidance for precise combination medicine.
[PT] CDGCN: Conditional de novo drug generative model using graph convolution networks
Shikha Mallick and Sahely Bhadra
De novo drug design is a crucial part of drug discovery, which is a highly expensive and slow process. Many deep learning methods have been proposed to automate and accelerate it. However, most of the current state-of-the-art methods are either limited to generating novel drugs specific to proteins that already have known drugs, or generate molecules that lack certain desirable drug-like properties, such as high binding affinity or low binding energy. We introduce our graph generative model, CDGCN (Conditional de novo drug generative model using Graph Convolution Networks), for de novo drug generation for novel proteins, which takes as input a protein sequence of amino acids and generates novel molecular structures having desirable drug-like properties. CDGCN generates desirable molecules for a protein using a sequential decoding scheme by learning the distribution of generation paths of its ligands. We show that CDGCN can quickly generate novel and chemically valid drug-like molecules which have a higher binding affinity with their target proteins as compared to the state-of-the-art methods. The best binding energy between a novel protein and its novel drug-like molecules generated by CDGCN was observed to be at least -7.3 kcal/mol, whereas for the state-of-the-art method it was observed to be -6.2 kcal/mol.
Availability and implementation: Code and data are available at https://github.com/mshik/CDGCN.
[PT] DebiasedDTA: A framework for improving the generalizability of drug-target affinity prediction models
Rıza Özçelik, Alperen Bağ, Berk Atıl, Melih Barsbey, Arzucan Özgür and Elif Ozkirimli
Computational models that accurately predict the binding affinity of an input protein-chemical pair can accelerate drug discovery studies. These models are trained on available protein-chemical interaction datasets, which may contain biases that lead the model to learn dataset-specific patterns instead of generalizable relationships. As a result, the prediction performance of models drops for previously unseen biomolecules, i.e. the prediction models cannot generalize to biomolecules outside of the dataset. The latest approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading prediction performance. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve the generalizability of affinity prediction models. DebiasedDTA reweights the training samples to mitigate the effect of dataset biases and is applicable to most DTA prediction models. The results suggest that models trained in the DebiasedDTA framework achieve improved generalizability in predicting the interactions of previously unseen biomolecules, as well as performance improvements on those previously seen. Extensive experiments with different biomolecule representations, model architectures, and datasets demonstrate that DebiasedDTA can upgrade DTA prediction models irrespective of the biomolecule representation, model architecture, and training dataset. Last but not least, we release DebiasedDTA as an open-source Python library to enable other researchers to debias their own predictors and/or develop their own debiasing methods. We believe that this Python library will corroborate and foster research to develop more generalizable DTA prediction models.
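One simple way to picture the reweighting idea: fit a deliberately weak “guide” model on shallow features, then upweight the training samples it fails on, since samples the guide predicts well are the ones most explainable by dataset bias alone. The sketch below is a generic instantiation with ridge regression as the guide; the guide model, features, and weighting schedule in DebiasedDTA itself follow the paper, not this code.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_predict

    def debias_weights(X_shallow, y_affinity):
        """Weight ~ guide error: samples that a weak guide already fits are
        presumed to reflect dataset bias and receive smaller weights."""
        guide_pred = cross_val_predict(Ridge(), X_shallow, y_affinity, cv=5)
        err = np.abs(np.asarray(y_affinity) - guide_pred) + 1e-8
        return err / err.mean()   # mean weight of 1 for the main model

    # The weights plug into most regressors, e.g.:
    #   model.fit(X_full, y, sample_weight=debias_weights(X_shallow, y))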
15:00 - 16:20
Poster Session II and Coffee Break
Genetic effects on gene expression dosage underlying cellular and physiological phenotypes
Detailed characterization of the molecular and cellular effects of genetic variants is essential for understanding the biological processes that underlie genetic associations to disease. A particularly scalable approach has been linking genetic variants to effects in the transcriptome, which is amenable to scalable measurement in human populations and in model systems, including at the single-cell level. Here, I will describe recent advances in our long-term work to characterize genetic associations to the transcriptome and other molecular traits, as well as our recent work on CRISPR-based perturbation of gene expression levels in cellular models. Altogether, integrating insights from these diverse approaches uncovers the functional genetic architecture of human traits and the molecular and cellular mechanisms that mediate these effects.
17:20 - 18:20
Session Chair: Sriram Sankararaman
[PT] Enabling trade-offs in privacy and utility in genomic data beacons and summary statistics
Rajagopal Venkatesaramani, Zhiyu Wan, Brad Malin and Yevgeniy Vorobeychik
The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web services called Beacons. However, even such limited releases are susceptible to likelihood-ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses against privacy with respect to membership-inference attacks based on likelihood ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood-ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the dataset and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is either in the form of summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public datasets.
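For intuition, the likelihood-ratio attack these defenses target can be sketched in a few lines. The version below assumes an error-free Beacon of n individuals and a target known to carry each queried alternate allele, broadly in the spirit of the classic Shringarpure-Bustamante test; the simplifications are illustrative, and suppressing or noising responses directly shrinks this statistic, which is exactly the trade-off the paper optimizes.

    import math

    def beacon_membership_llr(responses, alt_freqs, n_individuals, eps=1e-12):
        """Simplified log-likelihood ratio for 'target is in the Beacon'.
        responses: Beacon yes/no answers for variants the target carries;
        alt_freqs: population alternate allele frequencies of those variants."""
        llr = 0.0
        for yes, f in zip(responses, alt_freqs):
            p_yes_absent = 1.0 - (1.0 - f) ** (2 * n_individuals)  # someone else has it
            p_yes_present = 1.0 - eps                               # target is in the data
            if yes:
                llr += math.log(p_yes_present / max(p_yes_absent, eps))
            else:
                llr += math.log(eps / max(1.0 - p_yes_absent, eps))
        return llr  # large positive values support membership

    # Rare alleles dominate: a 'yes' for a variant with f = 1e-4 in a Beacon of
    # 1000 people contributes far more evidence than a 'yes' for a common one.
    print(beacon_membership_llr([True, True], [1e-4, 0.3], n_individuals=1000))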
[PT] Accurate evaluation of transcriptomic re-identification risks using discriminative sequence models
Shuvom Sadhuka, Daniel Fridman, Bonnie Berger and Hyunghoon Cho
Gene expression data provide molecular insights into the functional impact of genetic variation, for example through expression quantitative trait loci (eQTLs). With an improving understanding of the association between genotypes and gene expression comes a greater concern that gene expression profiles could be matched to genotype profiles of the same individuals in another dataset, known as a linking attack. Prior work demonstrating such a risk could analyze only the fraction of known eQTLs that are mutually independent, owing to restrictive model assumptions, leaving the full extent of this risk incompletely understood. Here, we introduce discriminative sequence models (DSMs), a novel probabilistic framework for predicting a sequence of genotypes based on gene expression data. By modeling the joint distribution over all variants in a genomic region, DSMs enable an accurate assessment of the power of linking attacks that leverage all known eQTLs, with the necessary calibration for linkage disequilibrium and redundant predictive signals. We demonstrate improved linking accuracy of DSMs compared to two existing approaches, suggesting that DSMs help uncover a substantial additional risk overlooked by previous studies. We also present a new measure of genotypic information leakage, termed empirical information gain (EIG). Our work provides a unified framework for assessing the privacy risks of sharing diverse types of omics data beyond transcriptomics.
[HT] Sequre: A high-performance framework for secure multiparty computation enables biomedical data sharing
Haris Smajlović, Ariya Shajii, Bonnie Berger, Hyunghoon Cho and Ibrahim Numanagić
Secure multiparty computation (MPC) is a cryptographic tool that allows computation on top of sensitive biomedical data without revealing private information to the involved entities. Here, we introduce Sequre, an easy-to-use, high-performance framework for developing MPC applications. Sequre offers a set of automatic compile-time optimizations that significantly improve the performance of MPC applications and incorporates the syntax of the Python programming language to facilitate rapid application development. We demonstrate its usability and performance on various bioinformatics tasks, showing speed-ups of up to 3–4× over existing pipelines along with 7-fold reductions in codebase size.
8:30 - 9:30
Session Chair: Tomas Vinar
[HT] SVDSS: Structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads
Luca Denti, Parsoa Khorsand, Paola Bonizzoni, Fereydoun Hormozdiari and Rayan Chikhi
Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging due to the diploid and highly repetitive structure of the human genome, and to the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS), a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines and effectively leverages mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome. SVDSS is open source and publicly available at https://github.com/Parsoa/SVDSS.
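The intuition behind sample-specific strings can be conveyed with a toy k-mer comparison: substrings present in a sample's reads but absent from the reference pinpoint candidate SV breakpoints. This hash-set caricature is only for illustration; SVDSS computes true sample-specific strings with FMD-index queries and combines them with mapping-based and assembly-based steps.

    def sample_specific_kmers(reads, reference, k=21):
        """Toy surrogate for sample-specific strings: k-mers observed in the
        sample but absent from the reference flag putative SV breakpoints."""
        ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
        specific = set()
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                if kmer not in ref_kmers:   # unseen in reference: candidate signal
                    specific.add(kmer)
        return specific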
[HT] Needle: A fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments
Mitra Darja Darvish, Enrico Seiler, Svenja Mehringer, René Rahn and Knut Reinert
The ever-growing size of sequencing data is a major bottleneck in bioinformatics, as advances in hardware development cannot keep up with the growth of data. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.
As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in < 2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query. Availability and implementation: https://github.com/seqan/needle.
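The interleaved Bloom filter idea can be sketched compactly: bits for all experiments are stored side by side at each hash position, so one k-mer lookup yields a membership bit per experiment. The toy class below shows a single such filter (the hash function, sizes, and API are illustrative, not Needle's implementation); Needle builds several, one per k-mer multiplicity threshold, and uses them to estimate quantification.

    import hashlib

    class InterleavedBloomFilter:
        """Toy IBF: block p holds one membership bit per experiment (bin)."""
        def __init__(self, n_bins, n_blocks=1 << 16, n_hashes=3):
            self.n_bins, self.n_blocks, self.n_hashes = n_bins, n_blocks, n_hashes
            self.bits = [[False] * n_bins for _ in range(n_blocks)]

        def _positions(self, kmer):
            for seed in range(self.n_hashes):
                h = hashlib.blake2b(kmer.encode(), person=bytes([seed])).digest()
                yield int.from_bytes(h[:8], "big") % self.n_blocks

        def insert(self, kmer, bin_id):
            for p in self._positions(kmer):
                self.bits[p][bin_id] = True

        def query(self, kmer):
            """AND across hash positions: one bit per experiment, in one pass."""
            result = [True] * self.n_bins
            for p in self._positions(kmer):
                result = [a and b for a, b in zip(result, self.bits[p])]
            return result

    ibf = InterleavedBloomFilter(n_bins=3)
    ibf.insert("ACGTACGTACGTACGTACGTA", bin_id=0)
    print(ibf.query("ACGTACGTACGTACGTACGTA"))   # [True, False, False]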
[HT] BLEND: A fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis
Can Firtina, Jisung Park, Mohammed Alser, Jeremie S. Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan and Onur Mutlu
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds, as conventional hashing methods assign distinct hash values to different seeds, including highly similar ones. Finding only exact-matching seeds either (i) increases the use of costly sequence alignment or (ii) limits sensitivity.
We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND (i) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and (ii) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4×–83.9× (on average 19.3×), has a lower memory footprint by 0.9×–14.1× (on average 3.8×), and finds higher-quality overlaps that lead to more accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8×–4.1× (on average 1.7×) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.
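The SimHash mechanism at the core of fuzzy seeding fits in a few lines: each fingerprint bit is a majority vote over the corresponding bit of the item hashes, so sets sharing most items agree on most (often all) bits and can collide under a single hash lookup. This is a generic SimHash sketch over k-mer sets, not BLEND's tuned implementation.

    import hashlib

    def simhash(kmers, n_bits=64):
        """SimHash of a seed's k-mer set: bit b is the sign of the vote
        tallied over bit b of every k-mer's hash."""
        counts = [0] * n_bits
        for kmer in kmers:
            h = int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
            for b in range(n_bits):
                counts[b] += 1 if (h >> b) & 1 else -1
        return sum(1 << b for b in range(n_bits) if counts[b] > 0)

    a = simhash(["ACGTA", "CGTAC", "GTACG", "TACGT"])
    b = simhash(["ACGTA", "CGTAC", "GTACG", "TACGA"])   # one k-mer differs
    print(f"fingerprints differ in {bin(a ^ b).count('1')} of 64 bits")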
9:30 - 10:00
Coffee Break
Assembly and analysis of genome sequences from across the tree of life
We are at the start of the third major phase of genome sequencing, following the sequencing of human and other key species with Sanger technologies in 1990-2005 and with short-read sequencing in 2005-2020. With single-molecule long reads we can now assemble essentially complete and highly accurate genomes from arbitrary species. Large-scale projects such as the Vertebrate Genomes Project, the Darwin Tree of Life Project and the European Reference Genome Atlas are all scaling up to hundreds and thousands of species, aiming to contribute to ultimately sequencing O(1 million) accessible species in the next decade under the umbrella of the Earth Biogenome Project. I will talk about challenges and opportunities for computational methods in this context, including in assembly and the analysis of variation within and between species.
11:00 - 12:00
Session Chair: Yaron Orenstein
[PT] PIsToN: Evaluating protein binding interfaces with transformer networks
Vitalii Stebliankin, Azam Shirali, Prabin Baral, Prem Chapagain and Giri Narasimhan
The computational study of protein binding is widely used to investigate fundamental biological processes and facilitate the development of modern drugs, vaccines, and therapeutics. Scoring functions aim to predict complexes that would be formed by the binding of two biomolecules and to assess and rank the strength of the binding at the interface. Despite past efforts, the accurate prediction and scoring of protein binding interfaces remain a challenge. Physics-based methods are computationally intensive and often have to trade accuracy for computational cost. Possible limitations of current machine learning (ML) methods are ineffective data representation, network architectures, and limited training data. Here, we propose a novel approach called PIsToN (evaluating Protein binding Interfaces with Transformer Networks) that aims to distinguish native-like protein complexes from decoys. Each protein interface is transformed into a collection of 2D images (interface maps), where each image corresponds to a geometric or biochemical property and pixel intensity represents the feature value. Such a data representation provides atomic-level resolution of relevant protein characteristics. To build hybrid machine learning models, additional empirical energy terms are computed and provided as inputs to the neural network. The model is trained on thousands of native and computationally predicted protein complexes that contain challenging examples. The multi-attention transformer network is also endowed with explainability by highlighting the specific features and binding sites that were most important for the classification decision. PIsToN significantly outperforms existing state-of-the-art scoring functions on well-known datasets.
Privacy-Preserving Federated Neural Network Learning for Disease-Associated Cell Classification
Sinem Sav, Jean-Philippe Bossuat, Juan R. Troncoso-Pastoriza, Manfred Claassen and Jean-Pierre Hubaux
Training accurate and robust machine learning models requires a large amount of data that is usually scattered across data silos. Sharing or centralizing the data of different healthcare institutions is, however, infeasible or prohibitively difficult due to privacy regulations. We address the problem of privacy-preserving training and evaluation of neural networks in an N-party, federated learning setting. Our solutions enable training under encryption by relying on multiparty homomorphic encryption.
In the first part of this talk, I will present POSEIDON, the first of its kind in the regime of privacy-preserving neural network training. It preserves the confidentiality of the training data, the model, and the evaluation data under a passive-adversary model and collusions between up to N−1 parties. Then, I will demonstrate the applicability of training under encryption to biomedical analysis, for disease-associated cell classification and single-cell analysis. For this, we designed a system, PriCell, that enables training of a published state-of-the-art convolutional neural network in a decentralized and privacy-preserving manner. We compare the accuracy achieved by PriCell with centralized and non-secure solutions and show that PriCell guarantees privacy without reducing the utility of the data.
[PT] Vector-Clustering Multiple Sequence Alignment: Aligning into the twilight zone of protein sequence similarity with protein language models
Claire Mcwhite and Mona Singh
Multiple sequence alignment is a critical step in the study of protein sequence and function. Typically, multiple sequence alignment algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino-acid similarities. While successful, standard methods struggle on sets of proteins with low sequence identity, the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence datasets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to multiple sequence alignment based on clustering and ordering amino-acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of multiple sequence alignment algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher-accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
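A toy version of alignment-by-clustering: pool per-residue embeddings from all sequences, cluster them, and read each cluster off as a candidate alignment column ordered by average residue position. The sketch below (Python with scikit-learn; the function name and the simple positional ordering are illustrative) omits the consistency filtering and proper ordering of clusters that the full method relies on.

    import numpy as np
    from sklearn.cluster import KMeans

    def embedding_columns(per_seq_embeddings, n_columns):
        """per_seq_embeddings: list of (L_i, d) arrays, one per protein,
        e.g. from a protein language model. Returns candidate columns as
        lists of (sequence index, residue position) pairs."""
        X = np.vstack(per_seq_embeddings)
        owner = np.concatenate([np.full(len(e), s) for s, e in enumerate(per_seq_embeddings)])
        pos = np.concatenate([np.arange(len(e)) for e in per_seq_embeddings])
        labels = KMeans(n_clusters=n_columns, n_init=10).fit_predict(X)
        # Order columns left to right by their mean residue position.
        order = np.argsort([pos[labels == c].mean() for c in range(n_columns)])
        return [list(zip(owner[labels == c], pos[labels == c])) for c in order]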
12:00 - 13:20
Lunch Break
13:20 - 14:20
Session Chair: Mehmet Koyutürk
[PT] HOGVAX: Exploiting peptide overlaps to maximize population coverage in vaccine design with application to SARS-CoV-2
Sara C. Schulte, Alexander Dilthey and Gunnar W. Klau
Peptide vaccines present a safe and cost-efficient alternative to traditional vaccines. Their efficacy depends on the peptides included in the vaccine and the ability of major histocompatibility complex (MHC) molecules to bind and present these peptides. Due to the high diversity of MHC alleles, their diverging peptide binding specificities, and physical constraints on the maximum length of peptide vaccine constructs, choosing a set of peptides that effectively achieves immunization across a large proportion of the population is challenging.
Here, we present HOGVAX, a combinatorial optimization approach to select peptides that maximize population coverage. The key idea behind HOGVAX is to exploit overlaps between peptide sequences to include a large number of peptides in limited space and thereby also cover rare MHC alleles. We formalize the vaccine design task as a theoretical problem, which we call the Maximum Scoring k-Superstring Problem (MSKS). We show that MSKS is NP-hard, reformulate it into a graph problem using the hierarchical overlap graph (HOG), and present a haplotype-aware variant of MSKS to take linkage disequilibrium between MHC loci into account. We give an integer linear programming formulation for the graph problem and provide an open source implementation.
We demonstrate on a SARS-CoV-2 case study that HOGVAX-designed vaccine formulations contain significantly more peptides than vaccine sequences built from concatenated peptides. We predict over 98% population coverage and high numbers of per-individual presented peptides, leading to robust immunity against new pathogens or viral variants.
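Why overlaps matter can be seen in a small greedy sketch: when consecutive peptides share a suffix-prefix overlap, the shared characters are free, so more peptides, and hence more MHC alleles, fit into a fixed-length construct. This greedy heuristic is only for intuition; HOGVAX itself solves the problem exactly via an ILP over the hierarchical overlap graph and accounts for haplotype frequencies.

    def overlap(a, b):
        """Length of the longest suffix of a that is a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_vaccine(peptides, covers, allele_freq, budget):
        """peptides: candidate strings; covers[p]: MHC alleles presenting p;
        allele_freq[a]: population frequency of allele a; budget: max length."""
        construct, chosen, covered = "", [], set()
        while True:
            best, best_gain = None, 0.0
            for p in peptides:
                if p in chosen:
                    continue
                added = len(p) - overlap(construct, p)   # overlapping prefix is free
                if added == 0 or len(construct) + added > budget:
                    continue
                gain = sum(allele_freq[a] for a in covers[p] - covered) / added
                if gain > best_gain:
                    best, best_gain = p, gain
            if best is None:
                return construct, chosen
            construct += best[overlap(construct, best):]
            chosen.append(best)
            covered |= covers[best]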
[HT] Prediction of protein–ligand binding affinity from sequencing data with interpretable machine learning
H. Tomas Rube, Chaitanya Rastogi, Siqian Feng, Judith F. Kribelbauer, Allyson Li, Basheer Becerra, Lucas A. N. Melo, Bach Viet Do, Xiaoting Li, Hammaad H. Adam, Neel H. Shah, Richard Mann and Harmen Bussemaker
Protein–ligand interactions are increasingly profiled at high throughput using affinity selection and massively parallel sequencing. However, these assays do not provide the biophysical parameters that most rigorously quantify molecular interactions. Here we describe a flexible machine learning method, called ProBound, that accurately defines sequence recognition in terms of equilibrium binding constants or kinetic rates. This is achieved using a multi-layered maximum-likelihood framework that models both the molecular interactions and the data generation process. We show that ProBound quantifies transcription factor (TF) behavior with models that predict binding affinity over a range exceeding that of previous resources; captures the impact of DNA modifications and conformational flexibility of multi-TF complexes; and infers specificity directly from in vivo data such as ChIP-seq without peak calling. When coupled with an assay called KD-seq, it determines the absolute affinity of protein–ligand interactions. We also apply ProBound to profile the kinetics of kinase–substrate interactions. ProBound opens new avenues for decoding biological networks and rationally engineering protein–ligand interactions.
[HT] Discovery of drug–omics associations in type 2 diabetes with generative deep-learning models
Simon Rasmussen, Søren Brunak, Rosa Allesøe and Ricardo Hernandez Medina
The application of multiple omics technologies in biomedical cohorts has the potential to reveal patient-level disease characteristics and individualized response to treatment. However, the scale and heterogeneous nature of multi-modal data make integration and inference a non-trivial task. We developed a deep-learning-based framework, multi-omics variational autoencoders (MOVE), to integrate such data and applied it to a cohort of 789 people with newly diagnosed type 2 diabetes with deep multi-omics phenotyping from the DIRECT consortium. Using in silico perturbations, we identified drug–omics associations across the multi-modal datasets for the 20 most prevalent drugs given to people with type 2 diabetes, with substantially higher sensitivity than univariate statistical tests. Among other findings, we identified novel associations between metformin and the gut microbiota, as well as opposite molecular responses to the two statins, simvastatin and atorvastatin. We used the associations to quantify drug–drug similarities, assess the degree of polypharmacy and conclude that drug effects are distributed across the multi-omics modalities.
14:20 - 14:50
Coffee Break
14:50 - 15:50
Session Chair: Mohammed Alser
[PT] Aligning distant sequences to graphs using long seed sketches
Amir Joudaki, Alexandru Meterez, Harun Mustafa, Ragnar Groot Koerkamp, André Kahles and Gunnar Rätsch
Sequence-to-graph alignment is an important step in applications such as variant genotyping, read error correction and genome assembly. When a query sequence requires a substantial number of edits to align, approximate alignment tools that follow the seed-and-extend approach require shorter seeds to get any matches. However, in large graphs with high variation, relying on a shorter seed length leads to an exponential increase in spurious matches. We propose a novel seeding approach relying on long inexact matches instead of short exact matches. We demonstrate experimentally that our approach achieves a better time-accuracy trade-off in settings with up to a 25% mutation rate.
We achieve this by sketching a subset of graph nodes and storing them in a k-nearest neighbor index. While sketches are more robust to indels, finding the nearest neighbor of a sketch in a high-dimensional space is more computationally challenging than finding exact seeds. We demonstrate that by storing sketch vectors in a k-nearest neighbor index, we can circumvent the curse of dimensionality. Our long sketch-based seed scheme contrasts with existing approaches and highlights the important role that tensor sketching can play in bioinformatics applications. Our proposed seeding method and implementation have several advantages: i) We empirically show that our method is efficient and scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length. ii) For queries with an edit distance of 25% relative to their length, on the 1 billion node graph, longer sketch-based seeds yield a 4× increase in recall compared to exact seeds. iii) Conceptually, our seeder can be incorporated into other aligners, implying a novel direction for sequence-to-graph alignment.
[PT] Sequence to graph alignment using gap-sensitive co-linear chaining
Ghanshyam Chandra and Chirag Jain
Co-linear chaining is a widely used technique in sequence alignment tools that follow the seed-filter-extend methodology. It is a mathematically rigorous approach to combining short exact matches. For co-linear chaining between two sequences, efficient subquadratic-time chaining algorithms are well known for linear, concave and convex gap cost functions [Eppstein et al. JACM’92]. However, developing extensions of chaining algorithms for directed acyclic graphs (DAGs) has been challenging. Recently, a new sparse dynamic programming framework was introduced that exploits the small path cover of pangenome reference DAGs and enables efficient chaining [Makinen et al. TALG’19, RECOMB’18]. However, the underlying problem formulation did not consider gap cost, which makes chaining less effective in practice. To address this, we develop novel problem formulations and optimal chaining algorithms that support a variety of gap cost functions. We demonstrate empirically the ability of our provably good chaining implementation to align long reads more precisely than existing aligners. For mapping simulated long reads from the human genome to a pangenome DAG of 95 human haplotypes, we achieve 98.7% precision while leaving < 2% of reads unmapped.
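For readers less familiar with co-linear chaining, the quadratic sequence-to-sequence variant below conveys the idea: select a highest-scoring subset of anchors that is increasing in both coordinates, penalizing the indel-like part of each gap. This is only the textbook two-sequence DP with a linear gap cost and a made-up gap weight; the paper's contribution is extending such gap-sensitive formulations to DAGs with efficient sparse algorithms.

    def chain(anchors, gap_weight=0.1):
        """O(n^2) co-linear chaining. anchors: (x, y, length) exact matches
        between two sequences; score = matched length - gap_weight * |gap|."""
        if not anchors:
            return []
        anchors = sorted(anchors)
        score = [a[2] for a in anchors]   # best chain score ending at each anchor
        back = [-1] * len(anchors)
        for j, (xj, yj, lj) in enumerate(anchors):
            for i, (xi, yi, li) in enumerate(anchors[:j]):
                if xi + li <= xj and yi + li <= yj:         # co-linear, non-overlapping
                    gap = abs((xj - xi) - (yj - yi))        # indel-sensitive gap cost
                    if score[i] + lj - gap_weight * gap > score[j]:
                        score[j] = score[i] + lj - gap_weight * gap
                        back[j] = i
        j = max(range(len(anchors)), key=score.__getitem__)  # backtrack from the best
        result = []
        while j != -1:
            result.append(anchors[j])
            j = back[j]
        return result[::-1]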
[PT] mapquik: Efficient low-divergence mapping of long reads in minimizer space
Baris Ekim, Kristoffer Sahlin, Paul Medvedev, Bonnie Berger and Rayan Chikhi
DNA sequencing data continues to progress towards longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g. PacBio HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively-sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultra-fast mapping while retaining high sensitivity.
We demonstrate that mapquik significantly accelerates the seeding and chaining steps, fundamental bottlenecks to read mapping, for both the human and maize genomes with > 96% sensitivity and near-perfect specificity. On the human genome, for both simulated and real HiFi reads, mapquik achieves a 30× speed-up over the state-of-the-art tool minimap2, and on the maize genome, a 350× speed-up over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudo-chaining algorithm, which improves over the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
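The seed construction is easy to prototype: sample window minimizers, slide a window of k consecutive minimizers to form k-min-mers, and keep only those occurring exactly once in the reference so that every index hit is a near-certain anchor. The sketch below uses a generic hash and arbitrary parameter defaults for illustration; mapquik's actual implementation and parameter choices follow the paper.

    from collections import Counter
    from hashlib import blake2b

    def _h(s):
        return int.from_bytes(blake2b(s.encode(), digest_size=8).digest(), "big")

    def minimizers(seq, m=10, w=20):
        """Start positions of window minimizers: the m-mer with the smallest
        hash among each window of w consecutive m-mers (deduplicated)."""
        picked = []
        for start in range(len(seq) - w - m + 2):
            best = min(range(start, start + w), key=lambda i: _h(seq[i:i + m]))
            if not picked or picked[-1] != best:
                picked.append(best)
        return picked

    def kminmers(seq, k=4, m=10, w=20):
        """Seeds in minimizer space: tuples of k consecutive minimizer m-mers."""
        pos = minimizers(seq, m, w)
        return [tuple(seq[i:i + m] for i in pos[j:j + k]) for j in range(len(pos) - k + 1)]

    def unique_index(ref, **kw):
        """Index only k-min-mers occurring exactly once in the reference,
        making every lookup hit a near-certain anchor."""
        counts = Counter(kminmers(ref, **kw))
        return {km: i for i, km in enumerate(counts) if counts[km] == 1}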
15:50
Awards and Closing