OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

arXiv:2606.10285v1. Abstract: OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language mode
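To make the composition and the description/code pairing in the abstract concrete, here is a minimal Python sketch. The counts come from the abstract. Everything else is an assumption for illustration: the `Sample` record layout, the field names, the prompt wording, and the `to_finetune_example` helper are hypothetical and are not taken from the dataset's actual schema or the authors' pipeline.

```python
from dataclasses import dataclass

# Module counts by source, in thousands, as reported in the abstract.
SOURCE_COUNTS_K = {
    "github_verilog": 102,
    "vhdl_translation": 5,
    "c_cpp_translation": 24,
}


@dataclass
class Sample:
    """Hypothetical record layout; the real dataset schema may differ."""
    source: str       # one of the keys in SOURCE_COUNTS_K
    description: str  # natural language description paired with the code
    verilog: str      # the Verilog module text


def to_finetune_example(sample: Sample) -> dict:
    """Turn one description/code pair into a prompt/completion pair.

    The prompt template is an assumption, not the authors' format.
    """
    prompt = (
        "Write a Verilog module for the following specification:\n"
        f"{sample.description}\n"
    )
    return {"prompt": prompt, "completion": sample.verilog}


# Total size and the share of each source in the mix.
total_k = sum(SOURCE_COUNTS_K.values())
shares = {name: count / total_k for name, count in SOURCE_COUNTS_K.items()}

# A toy sample written for this sketch, not drawn from the dataset.
example = to_finetune_example(
    Sample(
        source="github_verilog",
        description="A 2-to-1 multiplexer with 1-bit inputs a and b, select s, and output y.",
        verilog=(
            "module mux2(input a, input b, input s, output y);\n"
            "  assign y = s ? b : a;\n"
            "endmodule\n"
        ),
    )
)
```

The three source counts sum to the 131k figure quoted in the abstract, with the GitHub-derived modules making up the majority of the mix.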
The release of OpenRTLSet coincides with the growing capabilities of large language models and rising demand for automated hardware design tools, and it addresses a gap in open-source access to diverse Verilog datasets.
The dataset lowers the barrier to entry for AI-driven hardware design, which could accelerate research and development in a technology sector that underpins much of advanced computing.
Hardware design workflows could be automated and generalized more efficiently with AI, shifting from manual RTL coding toward LLM-assisted or autonomous generation and potentially democratizing access to chip design capabilities.
- AI research community
- Hardware design startups
- Semiconductor industry
- Open-source hardware ecosystem
- Proprietary EDA tool vendors (long-term if not adapted)
- Traditional manual RTL design workflows
The new dataset facilitates the rapid development of advanced LLMs specifically fine-tuned for hardware description languages like Verilog.
Improved AI capabilities in hardware design could lead to faster chip iteration cycles, more complex designs, and potentially new architectures.
Democratized chip design, fueled by AI, could disrupt the existing semiconductor supply chain and foster a new era of diverse and specialized hardware.
Read at arXiv cs.CL