RobustNLP/CipherChat
A framework for evaluating the generalizability of safety alignment in large language models using cipher-encoded prompts.

CipherChat is a systematic evaluation framework that tests whether safety alignments in LLMs, which are trained on natural language human feedback, can be bypassed using non-natural ciphers. The framework teaches the model to comprehend cipher language by designating it as a cipher expert, then probes for safety vulnerabilities across different cipher methods and instruction domains. Results are provided as query-response pairs that can be loaded and analyzed.