
arXiv:2512.03086v2. Abstract: Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations wit
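The Questioner-Solver loop described in the abstract can be pictured with a minimal sketch. This is not the paper's implementation: the names `generate_sample`, `build_dataset`, `Sample`, and the three callables are hypothetical, and the models and the compiler/runtime check are stood in for by plain Python functions. It only illustrates the general idea of retrying a translation with verifier feedback and keeping verified results.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical stand-ins, not taken from the paper.
Questioner = Callable[[int], str]             # task index -> source program
Solver = Callable[[str, str], str]            # (source, last feedback) -> candidate translation
Verifier = Callable[[str], Tuple[bool, str]]  # candidate -> (passed, feedback text)


@dataclass
class Sample:
    source: str
    translation: str
    attempts: int
    feedback_log: List[str]


def generate_sample(source: str, solver: Solver, verifier: Verifier,
                    max_attempts: int = 3) -> Optional[Sample]:
    """Ask the solver for a translation, feeding verifier output back on failure."""
    feedback = ""
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = solver(source, feedback)
        # In a real pipeline this step would compile and run the candidate.
        passed, feedback = verifier(candidate)
        log.append(feedback)
        if passed:
            return Sample(source, candidate, attempt, log)
    return None  # no verified translation within the attempt budget


def build_dataset(questioner: Questioner, solver: Solver, verifier: Verifier,
                  n_tasks: int, max_attempts: int = 3) -> List[Sample]:
    """Keep only samples whose translation passed verification."""
    dataset: List[Sample] = []
    for i in range(n_tasks):
        sample = generate_sample(questioner(i), solver, verifier, max_attempts)
        if sample is not None:
            dataset.append(sample)
    return dataset
```

In this sketch the feedback log is stored with each sample, so failed attempts and their diagnostics remain available alongside the final verified pair.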
The proliferation of large language models (LLMs), combined with persistent data scarcity in specialized programming domains, motivates new approaches to synthetic data generation.
Improving code translation capabilities for low-resource languages and emerging frameworks directly impacts software development efficiency and the reach of AI tools into critical, often legacy, systems.
The ability to generate high-quality, verified code translation data autonomously could accelerate the adoption of LLMs for complex software migration and enhance reliability in specialized computing environments.
- LLM developers
- Organizations with legacy codebases
- Specialized computing platforms (e.g., CUDA)
- Software engineering sector
- Manual code translation services
- Developers reliant on traditional open-source data availability
More robust and versatile LLMs for code translation, reducing development time and cost for cross-platform compatibility.
Accelerated modernization of critical infrastructure and scientific computing currently bound by outdated programming languages.
Reduced barriers to entry for new programming languages and frameworks, as LLMs can more easily bridge knowledge gaps and facilitate adoption.