Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
Positive · Artificial Intelligence
The publication of RFTC on arXiv highlights a pressing concern in AI: stealthy data poisoning during the fine-tuning of large language models (LLMs), which can compromise the safety of downstream applications. Traditional detection methods have struggled, either relying on classifier-style signals ill-suited for generative tasks or degrading model quality through rewriting. RFTC offers a more robust approach: it first compares the model's outputs against those of a reference model to filter out suspicious samples, then clusters those suspicious responses with TF-IDF features so that true backdoor samples are flagged by their small intra-class distance. Results on two machine translation datasets and one QA dataset show that RFTC not only improves detection accuracy but also the performance of the resulting fine-tuned models. This advancement is crucial for maintaining the reliability and safety of AI systems as they become increasingly integrated into various applications.
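To make the two-stage idea concrete, below is a minimal sketch in Python using scikit-learn. It assumes a simple cosine-similarity filter against the reference model's responses and KMeans for the clustering step; the function names, the similarity threshold, and the cluster-selection rule (smallest intra-cluster distance) are illustrative assumptions rather than the paper's exact procedure.

```python
# Sketch of the two-stage detection idea summarized above:
# (1) filter samples whose responses diverge from a reference model's output,
# (2) TF-IDF-cluster the flagged responses so that near-identical backdoor
#     responses form a tight (low intra-class distance) cluster.
# All names and thresholds here are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_by_reference(responses, reference_responses, threshold=0.3):
    """Stage 1 (assumed): keep the indices of responses that are dissimilar
    to the reference model's response for the same prompt."""
    vec = TfidfVectorizer().fit(responses + reference_responses)
    sims = cosine_similarity(
        vec.transform(responses), vec.transform(reference_responses)
    ).diagonal()
    return [i for i, s in enumerate(sims) if s < threshold]


def flag_backdoor_cluster(responses, suspect_idx, n_clusters=2):
    """Stage 2 (assumed): cluster the suspicious responses on TF-IDF features
    and flag the most homogeneous cluster (smallest mean distance to its
    centroid) as the likely backdoor samples."""
    texts = [responses[i] for i in suspect_idx]
    X = TfidfVectorizer().fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # Distance of each suspicious response to its own cluster centroid.
    dists = km.transform(X)[np.arange(len(texts)), km.labels_]
    intra = [dists[km.labels_ == c].mean() for c in range(n_clusters)]
    backdoor_cluster = int(np.argmin(intra))
    return [suspect_idx[i] for i, lab in enumerate(km.labels_)
            if lab == backdoor_cluster]
```

The design intuition, consistent with the paper's title, is that trigger-activated backdoor responses tend to be near-duplicates of one another, so after reference filtration the poisoned samples cluster with a much smaller intra-class distance than the benign false positives do.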
— via World Pulse Now AI Editorial System
