LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

arXiv:2606.04552v1 Announce Type: new Abstract: Genomic foundation models increasingly adopt large language model architectures, yet almost universally rely on fixed tokenization schemes such as $k$-mers, BPE, or single nucleotides, which impose arbitrary sequence boundaries that may obscure biologically relevant structure. We present LDARNet, a 120M-parameter hierarchical genomic foundation model that adapts H-Net-style dynamic chunking from autoregressive generation to masked language modeling, combining BiMamba-2 state-space layers with local attention, bidirectional routing, and a ratio-ba
Large language model architectures are increasingly applied to genomics, where fixed tokenization schemes such as k-mers, BPE, or single nucleotides have become a recognized limitation that adaptive representation networks aim to address.
If the approach holds up, it could be a meaningful step forward in genomic modeling, with possible downstream benefits for biological insight, drug discovery, and synthetic biology.
The shift from fixed tokenization to adaptive, learnable tokenization in genomic foundation models is intended to capture biologically relevant structure that arbitrary token boundaries may obscure, and may improve model performance.
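To make the fixed-tokenization limitation concrete, the sketch below shows how non-overlapping k-mer tokenization depends on the reading frame: inserting a single base upstream changes every token, even though the motif itself is untouched. This is only an illustration of the general problem, not LDARNet's method. The helper `kmer_tokenize`, the example sequences, and the choice of k = 3 are all hypothetical and do not come from the paper.

```python
def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Split seq into non-overlapping k-mers, dropping any trailing remainder."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


# Illustrative motif and flanking bases (assumed, not taken from the paper).
motif = "TATAAA"
seq_a = "GC" + motif + "GCGC"  # "GCTATAAAGCGC"
seq_b = "G" + seq_a            # same sequence with one extra base upstream

tokens_a = kmer_tokenize(seq_a)  # ['GCT', 'ATA', 'AAG', 'CGC']
tokens_b = kmer_tokenize(seq_b)  # ['GGC', 'TAT', 'AAA', 'GCG']

# The one-base shift moves every token boundary, so the two
# tokenizations share no tokens despite near-identical sequences.
shared = set(tokens_a) & set(tokens_b)
print(tokens_a, tokens_b, shared)
```

A learnable chunking scheme, as described in the abstract, would instead place boundaries based on the sequence content rather than on a fixed stride.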
- Biomedical Research
- Pharmaceutical Industry
- AI/ML Bio-startups
- Synthetic Biology
- Traditional genomic sequencing methods
- Fixed k-mer tokenization approaches
Improved accuracy and efficiency in genomic data interpretation and prediction.
Faster development of new therapeutics and biotechnologies due to enhanced understanding of genetic mechanisms.
The potential for AI to dramatically reshape personalized medicine and bio-engineering fields.
Read at arXiv cs.CL