Junyuan Hong
Trustworthy
LLMs Can Get "Brain Rot"!
We find that LLMs can get Brain Rot, just like humans, after consuming large amounts of junk social-media content.
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
A study revealing safety-specific pitfalls of multi-model synthetic preference data in DPO alignment.
GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
The first automated guardrail for agents.
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
A benchmark for detecting medical hallucinations generated by LLMs.
LLM-PBE: Assessing Data Privacy in Large Language Models
A comprehensive privacy assessment of LLMs.
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
A comprehensive trustworthiness assessment of compressed LLMs.
Safe and Robust Watermark Injection with a Single OoD Image
A new method for safely and robustly injecting watermarks into a model after training, without access to the training data.
Who Leaked the Model? Tracking IP Infringers in Accountable Federated Learning
A method for tracking IP infringers in accountable federated learning.
Revisiting Data-Free Knowledge Distillation with Poisoned Teachers
We uncover the security risk of data-free distillation from a poisoned teacher and propose the first countermeasure.
How Robust is Your Fairness? Evaluating and Sustaining Fairness under Unseen Distribution Shifts
Increasing concerns have been raised about fairness in deep learning in recent years. Existing fairness-aware machine learning methods mainly focus on the fairness of in-distribution data. However, in real-world applications, it is common to have …