
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning

A training-free method that robustifies LLM safety alignment against fine-tuning by extrapolating low-rank safety subspaces, significantly reducing attack success rates while preserving model utility.
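As a rough illustration of the low-rank extrapolation idea (a hedged sketch, not the paper's exact algorithm), one can amplify the dominant singular directions of the alignment update ΔW = W_aligned − W_base by a factor α; the function name and the default `rank`/`alpha` values below are hypothetical.

```python
import torch

def low_rank_extrapolate(w_base, w_aligned, rank=8, alpha=1.5):
    # Alignment update for one weight matrix.
    delta = w_aligned - w_base
    # Its dominant (low-rank) directions, treated as the safety subspace.
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    safety = u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank, :]
    # Extrapolate: push the aligned weights further along that subspace,
    # making the safety component harder to erase via fine-tuning.
    return w_aligned + (alpha - 1.0) * safety
```

Because this is a one-shot weight edit applied per matrix, no gradient steps are needed, which is what makes such an approach training-free.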

More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment

A study revealing safety-specific pitfalls of multi-model synthetic preference data in DPO alignment.

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

The first automated guardrail for LLM agents, built on knowledge-enabled reasoning.

SEAL: Steerable Reasoning Calibration of Large Language Models for Free

A training-free approach that calibrates chain-of-thought reasoning in LLMs, improving accuracy while reducing computational overhead.
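One plausible way to realize such training-free calibration (an assumption on our part, not the paper's released code) is to shift a decoder layer's hidden states along a pre-computed steering direction at decode time via a forward hook; the helper name and the tuple-output convention (as in Hugging Face decoder blocks) are assumptions.

```python
import torch

def add_steering_hook(layer, steer_vec, scale=1.0):
    """Shift a layer's hidden states along `steer_vec` during decoding.

    No parameters are updated, hence "for free": only the forward pass
    is modified by a fixed calibration direction.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steer_vec.to(device=hidden.device,
                                                dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    # Returning a value from a forward hook replaces the layer's output.
    return layer.register_forward_hook(hook)
```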

MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

A comprehensive benchmark for detecting medical hallucinations generated by LLMs.

Extracting and Understanding the Superficial Knowledge in Alignment

We examine how superficial LLM alignment is via a linear distillation method.
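As a toy illustration of what a linear distillation could look like (a hypothetical sketch, not necessarily the paper's method), one can fit a closed-form linear map from a base model's outputs to an aligned model's outputs; whatever that map captures is, by construction, the "superficial" linear part of alignment.

```python
import torch

def fit_linear_map(base_feats, aligned_feats):
    """Closed-form least squares: find W with base_feats @ W ~= aligned_feats.

    base_feats, aligned_feats: (n_samples, dim) matrices of paired
    base-model and aligned-model outputs on the same inputs.
    """
    return torch.linalg.lstsq(base_feats, aligned_feats).solution
```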

GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing

We develop an LLM-guided conversational chatbot for reminiscence therapy.

LLM-PBE: Assessing Data Privacy in Large Language Models

A comprehensive privacy assessment of LLMs.

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

A comprehensive trustworthiness assessment of compressed LLMs.

Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark

A benchmark of zeroth-order optimization methods for memory-efficient LLM fine-tuning.
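For context, the sketch below shows the classic two-point SPSA estimator at the heart of MeZO-style zeroth-order fine-tuning: the gradient is approximated from two forward passes along a shared random direction, and that direction is regenerated from a seed rather than stored, so memory stays at inference level. This is a simplified illustration, not code from the benchmark.

```python
import torch

@torch.no_grad()
def spsa_step(params, loss_fn, eps=1e-3, lr=1e-6, seed=0):
    params = list(params)  # fixed iteration order across the three loops

    # theta + eps * z : z is regenerated from the seed, never stored.
    torch.manual_seed(seed)
    for p in params:
        p.add_(torch.randn_like(p), alpha=eps)
    loss_plus = loss_fn()

    # theta - eps * z
    torch.manual_seed(seed)
    for p in params:
        p.add_(torch.randn_like(p), alpha=-2 * eps)
    loss_minus = loss_fn()

    # Scalar projected gradient from just two forward passes.
    grad = (loss_plus - loss_minus).item() / (2 * eps)

    # Restore theta and take an SGD step along z in one update.
    torch.manual_seed(seed)
    for p in params:
        p.add_(torch.randn_like(p), alpha=eps - lr * grad)
```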