Category: Paper Reading

2025

10-29

Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

10-29

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

10-29

Refusal in Language Models Is Mediated by a Single Direction

10-29

Representation Bending for Large Language Model Safety

10-29

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

10-29

SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering

10-29

Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models

10-29

Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models

10-29

Superficial Safety Alignment Hypothesis

10-29

The Blessing and Curse of Dimensionality in Safety Alignment
