Category: Paper Reading

2025

11-03

S³: Social-network Simulation System with Large Language Model-Empowered Agents

10-29

Advancing LLM Safe Alignment with Safety Representation Ranking

10-29

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

10-29

Aligning Large Language Models with Human Preferences through Representation Engineering

10-29

CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation

10-29

DeAL: Decoding-time Alignment for Large Language Models

10-29

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

10-29

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States

10-29

Improving Alignment and Robustness with Circuit Breakers

10-29

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
