arXiv:2511.09493v1 Announce Type: cross 
Abstract: Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given $k$ models and a prompt, achieves risk competitive with the average risk of the safest $s$ of the $k$ models, where $s$ is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models' ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
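The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way an abstaining, probability-based aggregation could look. It assumes each model is represented, for a single fixed prompt, as a dictionary from outputs to probabilities. The acceptance rule used here (the mean of the $s$ smallest model probabilities divided by the mean over all $k$ models) is an illustrative assumption, as are the names `consensus_sample`, `acceptance_probability`, and `max_rounds`; none of them are taken from the source.

```python
import random
from typing import Dict, List, Optional

# Hypothetical representation: one model's output distribution for a fixed prompt.
Distribution = Dict[str, float]


def acceptance_probability(y: str, models: List[Distribution], s: int) -> float:
    """Assumed rule: mean of the s smallest probabilities of y over the mean of all k."""
    probs = sorted(m.get(y, 0.0) for m in models)
    mean_all = sum(probs) / len(probs)
    if mean_all == 0.0:
        return 0.0
    mean_lowest = sum(probs[:s]) / s
    # The mean of the s smallest values never exceeds the overall mean,
    # so this ratio is always in [0, 1].
    return mean_lowest / mean_all


def consensus_sample(
    models: List[Distribution],
    s: int,
    max_rounds: int,
    rng: random.Random,
) -> Optional[str]:
    """Return an output that enough models support, or None to abstain."""
    if not 1 <= s <= len(models):
        raise ValueError("s must be between 1 and the number of models")
    for _ in range(max_rounds):
        # Propose an output from a uniformly chosen model.
        proposer = rng.choice(models)
        outputs = list(proposer)
        weights = [proposer[o] for o in outputs]
        y = rng.choices(outputs, weights=weights, k=1)[0]
        # Accept only in proportion to how well the least-supportive models agree.
        if rng.random() < acceptance_probability(y, models, s):
            return y
    return None  # insufficient agreement: abstain


if __name__ == "__main__":
    # Two models agree on "safe"; one outlier puts all its mass on "harm".
    models = [{"safe": 1.0}, {"safe": 1.0}, {"harm": 1.0}]
    print(consensus_sample(models, s=2, max_rounds=50, rng=random.Random(0)))
```

In this sketch an output that only one of three models supports has acceptance probability zero when `s=2`, so it can never be returned, while models with no overlap at all lead to abstention. This mirrors the abstract's points that the approach requires some overlap among safe models and abstains when agreement is insufficient.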

A new consensus sampling algorithm for generative AI improves safety by aggregating multiple models. Given $k$ models and a prompt, the aggregated model inherits its safety from the safest subset of a chosen size $s$, achieving risk competitive with the average risk of those $s$ models. The algorithm abstains when the models do not agree sufficiently, which addresses risks that inspecting outputs or activations alone cannot detect. It offers no protection when all models are unsafe, and risk may accumulate over repeated use.

Consensus Sampling for Safer Generative AI
