Preference Alignment

CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

A token-level, confidence-calibrated negative preference alignment method for LLM unlearning that removes undesirable knowledge without requiring retention data or contrastive pairs.