Junyuan Hong
Safety
LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
A training-free method that robustifies LLM safety alignment against fine-tuning by extrapolating low-rank safety subspaces, significantly reducing attack success rates while preserving model utility.
MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
A benchmark for detecting medical hallucinations generated by LLMs.
Extracting and Understanding the Superficial Knowledge in Alignment
We examine how superficial LLM alignment is through a linear distillation method.