arXiv:2511.06701v3 Announce Type: replace-cross Abstract: AI-Scientist systems risk manufacturing spurious discoveries through uncontrolled multiple testing. We present a functional architecture that enforces statistical rigor at two levels: a Haskell embedded domain-specific language (the Research monad) that makes it impossible to test a hypothesis without updating the error budget, and a declarative scaffold, backed by an OS-level sandbox, that makes validation data physically absent from the environment in which LLM-generated code runs. We ground the design in a machine-checked Lean 4 form
Source: arXiv cs.AI.
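
The first mechanism in the abstract, a monad in which a hypothesis cannot be tested without updating the error budget, can be illustrated with a minimal sketch. This is not the paper's implementation: the names `Research`, `Verdict`, `testHypothesis`, `runResearch`, and `study` are hypothetical, and the accounting rule is an assumed simple alpha-spending scheme (each test deducts its significance level from a fixed total, so the spent levels never sum to more than the total). The paper's actual budget rule is not stated in the abstract and may differ.

```haskell
-- Hypothetical sketch of a budget-tracking monad; not the paper's API.
-- In a real module the 'Research' constructor would NOT be exported, so
-- 'testHypothesis' would be the only way for client code to run a test.
newtype Research a = Research { unResearch :: Double -> (a, Double) }

instance Functor Research where
  fmap f (Research g) = Research $ \b ->
    let (a, b') = g b in (f a, b')

instance Applicative Research where
  pure a = Research $ \b -> (a, b)
  Research f <*> Research g = Research $ \b ->
    let (h, b1) = f b
        (a, b2) = g b1
    in (h a, b2)

instance Monad Research where
  Research g >>= k = Research $ \b ->
    let (a, b') = g b in unResearch (k a) b'

-- Outcome of a single test request.
data Verdict = Rejected | NotRejected | BudgetExhausted
  deriving (Eq, Show)

-- | Test one hypothesis at level 'alpha' given its p-value.
-- The remaining budget is threaded through the monad, so every test
-- that runs also pays for itself. Requests that the budget cannot
-- cover are refused and leave the budget unchanged.
testHypothesis :: Double -> Double -> Research Verdict
testHypothesis alpha pValue = Research $ \budget ->
  if alpha <= 0 || alpha > budget
    then (BudgetExhausted, budget)
    else ( if pValue <= alpha then Rejected else NotRejected
         , budget - alpha )

-- | Run a study with a total error budget; returns the result and
-- whatever budget is left.
runResearch :: Double -> Research a -> (a, Double)
runResearch total (Research g) = g total

-- Example study: three tests at level 0.02 each (made-up p-values).
study :: Research [Verdict]
study = do
  v1 <- testHypothesis 0.02 0.001
  v2 <- testHypothesis 0.02 0.30
  v3 <- testHypothesis 0.02 0.004
  pure [v1, v2, v3]
```

Evaluating `runResearch 0.05 study` yields the verdicts `[Rejected, NotRejected, BudgetExhausted]` with roughly 0.01 of the budget left: the third test is refused even though its p-value is small, because only about 0.01 remains after the first two tests. The design point is that the budget lives inside the monad's state rather than in a variable the caller must remember to update.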
