Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark

arXiv:2604.20462v3. Abstract: Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 250
As Behaviour-Driven Development (BDD) suites written in Gherkin grow, they accumulate duplicated step text, which raises maintenance cost and makes robust duplicate detection more important.
The work targets a recurring pain point in software development: a static detector that does not need runnable tests could improve efficiency and reduce maintenance costs for engineering teams.
The ability to detect duplicate Gherkin steps across organizations using paraphrase-robust methods and a public benchmark could standardize and streamline BDD practices.
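To make the idea of step-level duplicate detection concrete, here is a minimal Python sketch. It is not the paper's detector: token-overlap (Jaccard) similarity stands in for the sentence-transformer embeddings named in the title, and the function names (`normalise`, `jaccard`, `find_duplicates`), the placeholder tokens, and the 0.6 threshold are all hypothetical choices made for illustration.

```python
import re
from itertools import combinations


def normalise(step: str) -> str:
    """Reduce a Gherkin step to a comparable form (illustrative rules only)."""
    text = step.strip().lower()
    # Drop the leading Gherkin keyword so "Given X" and "And X" compare equal.
    text = re.sub(r"^(given|when|then|and|but)\s+", "", text)
    # Replace quoted literals and bare integers with placeholders.
    text = re.sub(r'"[^"]*"', "<str>", text)
    text = re.sub(r"\b\d+\b", "<num>", text)
    return re.sub(r"\s+", " ", text)


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a crude stand-in for embedding cosine."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_duplicates(steps, threshold=0.6):
    """Return (i, j, score) for every pair of steps at or above the threshold."""
    normed = [normalise(s) for s in steps]
    pairs = []
    for i, j in combinations(range(len(steps)), 2):
        score = jaccard(normed[i], normed[j])
        if score >= threshold:  # threshold is an assumed value, not calibrated
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    steps = [
        'Given the user "alice" is logged in',
        'And the user "bob" is logged in',
        "When the user clicks the checkout button",
        "Then the cart contains 3 items",
    ]
    for i, j, score in find_duplicates(steps):
        print(f"{steps[i]!r} ~ {steps[j]!r} (score={score:.2f})")
```

Token overlap only catches duplicates that share most of their wording. Paraphrases with different vocabulary are the case an embedding-based similarity is meant to handle, and any threshold would need calibrating against labelled pairs such as the benchmark the abstract describes.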
- Software developers
- Organizations using BDD and Gherkin
- AI/ML researchers in software engineering
- Software teams with inefficient BDD practices
- Manual code reviewers
Reduced technical debt and improved software quality through automated detection of redundant test steps.
Increased adoption of sophisticated natural language processing techniques within software development tooling.
Enhanced overall productivity and faster release cycles for complex software projects globally.
Read at arXiv cs.CL