An adversarial attack that recovers supposedly unlearned multi-modality knowledge from MLLMs via prompt-suffix optimization and fine-tuning, exposing vulnerabilities in machine unlearning defenses.
We find that LLMs can get Brain Rot, just as humans do, after consuming enormous amounts of mindless social media content.
A study revealing safety-specific pitfalls of multi-model synthetic preference data in DPO alignment.
The first automated guardrail for agents.
A benchmark for medical hallucinations in LLMs.
A comprehensive privacy assessment of LLMs.
A comprehensive trustworthiness assessment of compressed LLMs.
A new method for safely and robustly injecting a watermark after training, without access to the training data.
Tracking IP leakage in federated learning.
We uncover the security risk of data-free distillation from a poisoned teacher and propose the first countermeasure.