The rapid advancement of large language models (LLMs) has catalyzed the deployment of LLM-powered agents across numerous applications, raising new concerns regarding their safety and trustworthiness. Existing methods for enhancing the safety of LLMs are not directly transferable to LLM-powered agents due to their diverse objectives and output modalities. In this paper, we propose GuardAgent, the first LLM agent to serve as a guardrail for other LLM agents. Specifically, GuardAgent oversees a target LLM agent by checking whether its inputs/outputs satisfy a set of guard requests defined by the users. GuardAgent comprises two steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines. In both steps, an LLM is utilized as the core reasoning component, supplemented by in-context demonstrations retrieved from a memory module. Such knowledge-enabled reasoning allows GuardAgent to understand various textual guard requests and accurately translate them into executable code that provides reliable guardrails. Furthermore, GuardAgent is equipped with an extendable toolbox containing functions and APIs and requires no additional LLM training, which underscores its generalization capabilities and low operational overhead. Additionally, we propose two novel benchmarks: an EICU-AC benchmark for assessing privacy-related access control for healthcare agents and a Mind2Web-SC benchmark for safety evaluation for web agents. We show the effectiveness of GuardAgent on these two benchmarks, with 98.7% and 90.0% accuracy in moderating invalid inputs and outputs for the two types of agents, respectively. We also show that GuardAgent is able to define novel functions in adaptation to emerging LLM agents and guard requests, which further underscores its strong generalization capabilities.
Disclaimer: This blog is automatically generated by AI and could contain misinformation.
The rapid rise of large language model (LLM) agents has brought new safety and security challenges, especially as these agents are deployed in sensitive domains like healthcare and web automation. Traditional guardrails for LLMs focus on moderating text, but LLM agents require more flexible and reliable safeguards due to their diverse actions and outputs.
GuardAgent is the first LLM agent designed to act as a guardrail for other LLM agents. It dynamically checks whether a target agent's inputs and outputs comply with user-defined guard requests. GuardAgent works in two main steps:

1. Task planning: the core LLM analyzes the user-provided guard requests and produces an action plan for checking the target agent's inputs and outputs.
2. Guardrail code generation and execution: guardrail code is generated from the plan and executed by calling APIs or using external engines.

In both steps, the LLM is supplemented by in-context demonstrations retrieved from a memory module; a minimal sketch of this pipeline follows.
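The sketch below assumes a callable `llm` and a dictionary of toolbox functions; the function names, prompts, and `verdict` convention are illustrative assumptions, not GuardAgent's actual implementation, and the real system also conditions each LLM call on demonstrations retrieved from memory.

```python
# Illustrative sketch of a two-step guardrail pipeline (names are hypothetical).

def plan_guardrail(llm, guard_request: str, agent_spec: str) -> str:
    """Step 1: ask the core LLM to turn the textual guard request
    into a step-by-step task plan for checking the target agent."""
    prompt = (
        "Guard request:\n" + guard_request + "\n\n"
        "Target agent specification:\n" + agent_spec + "\n\n"
        "Produce a numbered plan for checking the agent's inputs/outputs."
    )
    return llm(prompt)

def generate_guardrail_code(llm, plan: str, toolbox_doc: str) -> str:
    """Step 2a: ask the LLM to emit executable guardrail code that
    only calls functions documented in the toolbox."""
    prompt = (
        "Task plan:\n" + plan + "\n\n"
        "Available toolbox functions:\n" + toolbox_doc + "\n\n"
        "Write Python code that sets a `verdict` dict with ALLOW or DENY and a reason."
    )
    return llm(prompt)

def execute_guardrail(code: str, agent_io: dict, toolbox: dict) -> dict:
    """Step 2b: run the generated code against the target agent's
    inputs/outputs, exposing only the toolbox functions."""
    scope = {"agent_io": agent_io, **toolbox}
    exec(code, scope)            # sandboxing omitted in this sketch
    return scope["verdict"]      # e.g. {"decision": "DENY", "reason": "..."}
```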
This approach enables GuardAgent to flexibly adapt to new agents and safety requirements, providing reliable, code-based guardrails without retraining the underlying LLMs.
Figure: GuardAgent safeguards target agents by analyzing safety requests, planning, and generating guardrail code for enforcement.
GuardAgent introduces two new benchmarks:

- EICU-AC: assesses privacy-related access control for healthcare agents, where answers must respect the user's role.
- Mind2Web-SC: evaluates user-defined safety rules for web agents.

A hypothetical example of the kind of access-control check EICU-AC involves is sketched below.
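The role-permission mapping and column names in this sketch are invented for illustration and are not taken from the benchmark itself; they only show the shape of a role-based access-control check.

```python
# Hypothetical EICU-AC-style access-control check (illustrative only;
# the actual benchmark defines its own roles, tables, and columns).

ROLE_PERMISSIONS = {
    # role -> set of database columns the role may access
    "physician": {"diagnosis", "medication", "lab_results"},
    "general_administration": {"admission_time", "discharge_time"},
}

def check_access(role: str, accessed_columns: set) -> dict:
    """Deny the healthcare agent's answer if it touches any column
    outside the user's role permissions."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    violations = accessed_columns - allowed
    if violations:
        return {"decision": "DENY",
                "reason": f"role '{role}' may not access {sorted(violations)}"}
    return {"decision": "ALLOW", "reason": "all accessed columns permitted"}

# Example: an administrator's query that touches lab results is blocked.
print(check_access("general_administration", {"admission_time", "lab_results"}))
```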
On these benchmarks, GuardAgent accurately moderates invalid inputs and outputs, reaching 98.7% label prediction accuracy on EICU-AC and 90.0% on Mind2Web-SC, and outperforming both model-based and rule-based baselines:
| Core LLM | Method | EICU-AC LPA | Mind2Web-SC LPA |
|---|---|---|---|
| GPT-4 | GuardAgent | 98.7% | 90.0% |
| GPT-4 | Model-Guarding-Agent | 97.5% | 82.5% |
| GPT-4 | Hardcoded Rules | 81.0% | 77.5% |
| Llama3 | GuardAgent | 98.4% | 84.5% |
Table: GuardAgent outperforms baselines on both benchmarks (LPA = Label Prediction Accuracy).
Figure: GuardAgent strictly enforces access control, avoiding mistakes made by model-based baselines.
Unlike hardcoded rules or simple prompt-based moderation, GuardAgent leverages code generation and execution, making it robust to ambiguous or complex safety requirements. Its memory module and extendable toolbox allow it to generalize to new tasks and agents, while its non-invasive design ensures that the original agent’s utility is preserved.
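As a rough sketch, the extendable toolbox can be viewed as a registry of functions that generated guardrail code is allowed to call, and the memory module as a store of past demonstrations retrieved by similarity to the current guard request. The registry decorator, toy log parser, and string-similarity retrieval below are assumptions for exposition, not GuardAgent's actual components.

```python
# Illustrative sketch: an extendable toolbox plus a simple demonstration memory.
# Both are assumptions for exposition, not GuardAgent's actual code.

from difflib import SequenceMatcher

TOOLBOX = {}

def tool(fn):
    """Register a function so generated guardrail code may call it.
    New tools (including agent-defined ones) can be added without retraining."""
    TOOLBOX[fn.__name__] = fn
    return fn

@tool
def get_accessed_columns(agent_logs: str) -> set:
    """Toy parser: collect identifier-like tokens from the target agent's
    logs as candidate column names."""
    return {tok.strip(",.") for tok in agent_logs.split() if tok.isidentifier()}

class DemonstrationMemory:
    """Stores (guard_request, plan, code) triples and retrieves the most
    similar past cases to use as in-context demonstrations."""
    def __init__(self):
        self.records = []

    def add(self, request: str, plan: str, code: str):
        self.records.append((request, plan, code))

    def retrieve(self, request: str, k: int = 2):
        scored = sorted(
            self.records,
            key=lambda r: SequenceMatcher(None, r[0], request).ratio(),
            reverse=True,
        )
        return scored[:k]
```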
Figure: GuardAgent achieves high accuracy across all roles and rules in both benchmarks.
GuardAgent represents a significant step toward trustworthy and safe deployment of LLM agents in real-world applications. Its flexible, code-based approach can be adapted to a wide range of domains, from healthcare privacy to web automation safety.
Learn more: arXiv paper | Competition | Project page