A token-level confidence-calibrated negative preference alignment method for LLM unlearning that removes undesirable knowledge without requiring retention data or contrastive pairs.
We find that LLMs can develop "brain rot" just as humans do after consuming large amounts of low-quality social media content.
A study revealing safety-specific pitfalls of multi-model synthetic preference data in DPO alignment.
The first automated guardrail for LLM agents.
A training-free approach that calibrates chain-of-thought reasoning in LLMs, improving accuracy while reducing computational overhead.
A benchmark for evaluating medical hallucinations in LLMs.
We examine how superficial LLM alignment is through a linear distillation method.
We develop a chatbot for reminiscence therapy.
A comprehensive privacy assessment of LLMs.
A comprehensive trustworthiness assessment of compressed LLMs.