Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

arXiv:2606.19626v1. Abstract: Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple : the ontology gathers types, structural principles, compositio
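To make the contrast in the abstract concrete, here is a minimal sketch of declarative, rule-based tokenization of physical quantities written with Brazilian Portuguese number conventions (period for thousands, comma for decimals). It is not TOTEN's algorithm and does not reproduce the OEE ontology: the unit list, the `tokenize` function, and the `NUMBER`/`UNIT`/`TEXT` labels are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny unit inventory; the paper's ontology (OEE) is not reproduced here.
UNITS = ("kPa", "MPa", "Pa", "kN", "N", "mm", "cm", "km", "m", "kg", "g", "°C", "V", "A")

# Longest units first so that "kPa" wins over "Pa" and "mm" over "m".
_UNIT_ALT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# Brazilian convention: "." groups thousands, "," marks the decimal part.
_NUMBER = r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?"

# A quantity is a number, an optional space, and a known unit that is not
# the start of a longer word (so "5 metros" is not read as "5 m").
_QUANTITY = re.compile(
    rf"(?<![\w.,])(?P<number>{_NUMBER})\s?(?P<unit>{_UNIT_ALT})(?!\w)"
)

# Everything outside a quantity falls back to plain word/punctuation splitting.
_FALLBACK = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return (label, token) pairs, keeping each number and unit whole."""
    tokens = []
    pos = 0
    for match in _QUANTITY.finditer(text):
        tokens.extend(("TEXT", t) for t in _FALLBACK.findall(text[pos:match.start()]))
        tokens.append(("NUMBER", match.group("number")))
        tokens.append(("UNIT", match.group("unit")))
        pos = match.end()
    tokens.extend(("TEXT", t) for t in _FALLBACK.findall(text[pos:]))
    return tokens


if __name__ == "__main__":
    for label, token in tokenize("A pressão é de 1.250,5 kPa."):
        print(label, token)
```

A statistical subword tokenizer may split `1.250,5 kPa` at arbitrary points; the sketch instead emits the number and the unit as two typed tokens. A real system would need a far larger unit inventory, compound units, and symbolic expressions, none of which are attempted here.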
As advanced AI models spread into technical domains, the limits of current tokenization methods for specialized technical language become more visible, motivating semantically richer approaches.
This work introduces a knowledge-based approach that could improve how AI models process complex technical and engineering data, which would support more accurate and reliable applications in scientific and industrial domains.
AI models could process technical documentation, scientific papers, and engineering specifications in Brazilian Portuguese with greater semantic precision, reducing errors and improving contextual understanding.
- AI developers
- Engineering sectors
- Scientific research institutions
- Brazilian technology firms
- AI systems relying solely on statistical tokenization for technical texts
Improved performance of AI systems in technical translation, data extraction, and knowledge representation for specialized fields.
Accelerated development of AI assistants and autonomous systems capable of understanding and generating highly technical content.
Potential for new AI-driven tools that can 'reason' about physical quantities and engineering designs, leading to innovations in various industrial sectors.
Read at arXiv cs.CL