When annotators disagree on a label, the disagreement itself carries signal—and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation—whether the model identifies which items elicit disagreement—requires N ≈ 20–50 annotators to converge, while distributional match (KL divergence) saturates by N ≈ 10 (87–95% of improvement across five model…
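To make the subsampling setup above concrete, here is a minimal sketch, not the authors' code. It assumes each item is stored as a vector of per-class vote counts summing to 100, draws N annotators without replacement, and scores the resulting distribution against the full 100-vote distribution with KL divergence and entropy correlation. The function names (`subsample_distribution`, `saturation_sketch`), the smoothing constant, and the synthetic toy data are all hypothetical. The study evaluates fine-tuned model predictions, whereas this sketch compares the subsampled human distribution directly to the full one, so it only illustrates how the two metrics are computed.

```python
import numpy as np


def subsample_distribution(counts, n, rng):
    """Draw n annotators without replacement from per-class vote counts."""
    counts = np.asarray(counts, dtype=np.int64)
    drawn = rng.multivariate_hypergeometric(counts, n)
    return drawn / n


def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), with q lightly smoothed (an assumption) to avoid log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float) + eps
    q = q / q.sum()
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / q[nz])).sum())


def saturation_sketch(all_counts, ns, seed=0):
    """For each N, return (mean KL to the full distribution, entropy correlation)."""
    rng = np.random.default_rng(seed)
    all_counts = np.asarray(all_counts, dtype=np.int64)
    full = all_counts / all_counts.sum(axis=1, keepdims=True)
    full_entropy = np.array([entropy(p) for p in full])
    results = {}
    for n in ns:
        sub = np.array([subsample_distribution(c, n, rng) for c in all_counts])
        mean_kl = float(np.mean([kl_divergence(p, q) for p, q in zip(full, sub)]))
        sub_entropy = np.array([entropy(q) for q in sub])
        # Pearson correlation between full and subsampled per-item entropies.
        corr = float(np.corrcoef(full_entropy, sub_entropy)[0, 1])
        results[n] = (mean_kl, corr)
    return results


def make_toy_counts(num_items=200, num_annotators=100, seed=0):
    """Synthetic 3-class vote counts; a stand-in for real annotation data."""
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet(np.ones(3), size=num_items)
    return np.array([rng.multinomial(num_annotators, p) for p in probs])


if __name__ == "__main__":
    toy = make_toy_counts()
    for n, (kl, corr) in saturation_sketch(toy, ns=(5, 10, 20, 50, 100)).items():
        print(f"N={n:3d}  mean KL={kl:.4f}  entropy corr={corr:.3f}")
```

Sampling without replacement mirrors picking a subset of the 100 available annotators. Numbers produced on the synthetic data say nothing about the reported saturation points.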

Source: Apple Machine Learning Research — read the full report at the original publisher.

This is a curated wire item. The Continuum Brief does not republish full third-party articles; this entry links to the original source.