Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

arXiv:2605.20368v1 (cross-list). Abstract: Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,
The increasing pressure to secure sensitive organizational data, coupled with privacy concerns regarding cloud-based AI services, drives the development of local, open-source solutions.
This development offers a practical, privacy-preserving alternative for organizations dealing with sensitive information, reducing reliance on external cloud infrastructure for AI-driven security analyses.
Organizations can now deploy sophisticated, fine-tuned large language models locally for security document classification, enhancing data sovereignty and reducing data egress risks.
- Organizations with sensitive data
- Open-source AI community
- Data privacy advocates
- On-premise AI solution providers
- Cloud-native security AI providers
- Rule-based security tools
Increased adoption of local AI models for organizational security and data privacy.
Reduced dependence on major cloud AI providers for sensitive data processing, particularly in regulated industries.
Potential for new business models centered around deploying and maintaining secure, local AI infrastructure for enterprises.
Read at arXiv cs.AI