
RobustNLP/CipherChat

A framework for evaluating the generalizability of safety alignment in large language models using cipher-encoded prompts.

628 stars · Python · LLMOps · Eval · Language Models
Velocity (7d): +0.6 ★/day · Trend: steady

CipherChat is a systematic evaluation framework that tests whether the safety alignment of LLMs, which is trained on natural-language human feedback, still holds when prompts are written in non-natural-language ciphers. The framework first teaches the model to understand the cipher by assigning it the role of a cipher expert, then probes for safety vulnerabilities across different cipher methods and instruction domains. Results are provided as query-response pairs that can be loaded and analyzed.
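To make the idea of a cipher-encoded prompt concrete, here is a minimal sketch of one possible cipher, a Caesar shift, applied to a harmless query and then reversed. This is an illustration only: the function names `caesar_shift`, `encode`, and `decode`, the shift of 3, and the `pair` dictionary layout are assumptions made for this example and are not taken from the CipherChat codebase.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def encode(text: str, shift: int = 3) -> str:
    """Encode plain text (shift of 3 is an assumed default)."""
    return caesar_shift(text, shift)


def decode(text: str, shift: int = 3) -> str:
    """Invert `encode` by shifting in the opposite direction."""
    return caesar_shift(text, -shift)


query = "What is the capital of France?"
encoded = encode(query)

# Hypothetical record layout for one query; the real result format may differ.
pair = {"query": query, "encoded_query": encoded}

print(pair["encoded_query"])
print(decode(pair["encoded_query"]))
```

In an evaluation of this kind, the encoded query would be sent to the model and the model's reply decoded with the inverse function before being stored alongside the original query for analysis.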
