
arXiv:2607.00477v1 Announce Type: new Abstract: Seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using stratified 5-fold cross-validation (CV) with three repetitions. Five of the encoders fed identical frozen LightGBM learners downstream, allowing controlled comparisons among them. CatBoost and TabNet, which use different learners, were included as cross-paradigm comparisons. The entity embeddings produced the highest
As machine learning techniques evolve and data complexity grows, critical applications like fraud detection need ongoing research into which methods work best.
Improving the accuracy and interpretability of fraud detection models trained on high-cardinality data directly affects financial security, reducing losses for institutions and protecting consumers.
This research provides controlled comparisons of encoding methods and suggests that entity embeddings perform best on this fraud benchmark, which could shift best practices for data scientists.
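The evaluation protocol described in the abstract (stratified 5-fold CV repeated three times, with the encoder as the only thing that varies) can be sketched roughly as follows. This is a minimal standard-library illustration, not the paper's code: the toy data, the column values, the helper names (`repeated_stratified_folds`, `fit_target_encoding`, `encode`) and the smoothing value are all assumptions, and smoothed target encoding is used only as a stand-in encoder, since the visible abstract text does not list the seven methods. The key point it shows is that the encoder is fit on the training folds only, so no label information leaks into the held-out fold.

```python
import random
from collections import defaultdict


def repeated_stratified_folds(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs, spreading each class evenly over folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so fold class ratios stay balanced.
            for pos, i in enumerate(shuffled):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train, test


def fit_target_encoding(cats, labels, smoothing=10.0):
    """Smoothed per-category label mean, learned from training rows only."""
    prior = sum(labels) / len(labels)
    total, count = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, labels):
        total[c] += y
        count[c] += 1
    table = {c: (total[c] + smoothing * prior) / (count[c] + smoothing)
             for c in count}
    return table, prior


def encode(cats, table, prior):
    """Map categories to encoded values; unseen categories fall back to the prior."""
    return [table.get(c, prior) for c in cats]


# Hypothetical toy data: one categorical column and a rare positive class.
rng = random.Random(42)
cats = [f"card_{rng.randrange(50)}" for _ in range(1000)]
labels = [1 if rng.random() < 0.05 else 0 for _ in range(1000)]

n_evaluations = 0
for train_idx, test_idx in repeated_stratified_folds(labels):
    table, prior = fit_target_encoding(
        [cats[i] for i in train_idx], [labels[i] for i in train_idx]
    )
    test_features = encode([cats[i] for i in test_idx], table, prior)
    # A downstream learner with fixed hyperparameters would be trained on the
    # encoded training rows and scored on test_features at this point.
    n_evaluations += 1

print(n_evaluations)  # 5 folds x 3 repetitions = 15
```

Keeping the downstream learner and the fold assignments fixed while swapping only the encoder is what makes differences in score attributable to the encoding itself.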
- Financial Institutions
- Fraud Detection Software Providers
- Data Scientists
- Online Retailers
- Fraudsters
Increased adoption of advanced categorical encoding techniques like entity embeddings in fraud detection systems due to validated performance gains.
Reduced financial losses from fraud for businesses and consumers, leading to more secure digital transactions and financial systems.
A potential shift in focus for AI/ML development in fraud detection, emphasizing hybrid models that combine interpretability with advanced learned representations.
Read at arXiv cs.LG