agencyenterprise/PromptInject
Research framework that evaluates LLM robustness to adversarial prompt injection attacks like goal hijacking and prompt leaking.

PromptInject is a modular prompt assembly framework designed to systematically evaluate vulnerabilities in large language models under adversarial conditions. It provides quantitative metrics for measuring how easily LLMs like GPT-3 can be misaligned through handcrafted inputs targeting goal hijacking and prompt leaking attacks. The framework was developed for AI safety research and received a Best Paper Award at the NeurIPS 2022 ML Safety Workshop.