Code Agent Sample: Lua Dungeon Crawler
1. Sample Overview
What Is This Sample?
This sample demonstrates a code agent — an LLM that generates executable code in response to natural language commands. Unlike tool-calling agents that select from predefined functions, code agents write arbitrary programs, enabling more flexible and creative behavior.
The sample is a text-based dungeon crawler where the player controls movement directly (WASD), but commands an AI companion using natural language. When the player types a command like “go get the sword,” the LLM generates Lua code that implements that behavior. The AI companion then executes this code each game tick until the task is complete.
NOTE: This sample uses a CUDA-based backend and therefore will not work on non-NVIDIA hardware. Refer to the accompanying documentation for how to switch backends.
Required Models
The code agent sample requires the following models:
| Plugin | Model Name | GUID |
|---|---|---|
| nvigi.plugin.gpt.ggml.* | Qwen3 8B Instruct | 545F7EC2-4C29-499B-8FC8-61720DF3C626 |
See the top-level documentation that shipped with your development pack for information on how to download these models.
NOTE: Other code-capable models may work. Models with strong instruction-following and code generation abilities tend to perform best.
How to Run
Because of the SDK layout, built components land under the _artifacts tree. For the SDK to run an app like this sample, all of the plugin DLLs and the executable must be in the same directory. We do this by copying the DLLs and EXE into the bin\x64 directory and running the app from within that directory, so all plugins are available.
From Command Line:
Open a command prompt in `<SDK_ROOT>`
Run:
bin\x64\nvigi.codeagentlua.exe data/nvigi.models data/nvigi.test/nvigi.codeagentlua/codeagent_prompt_lua.txt
In Debugger:
Edit project settings for `nvigi/samples/nvigi.codeagentlua`
Set “Command” to `<SDK_ROOT>\bin\x64\nvigi.codeagentlua.exe`
Set “Command Arguments” to `data/nvigi.models data/nvigi.test/nvigi.codeagentlua/codeagent_prompt_lua.txt`
Set “Working Directory” to `<SDK_ROOT>/bin/x64`
Build, and run `copy_sdk_binaries.bat <cfg>` after each build
Basic Gameplay
w/a/s/d — Move the player (P) up/left/down/right
Enter — Type a command for the AI companion (A)
q — Quit the game
Objective: Navigate the dungeon, collect weapons, and defeat all monsters.
Entities:
P — Player (you)
A — AI, your companion
G — Gorgon (defeated with sword)
B — Bat (defeated with bow + arrow)
^ — Sword
> — Bow
/ — Arrow
Combat: Walking adjacent to a monster triggers combat. You need the right weapon or you take damage.
Example Commands
Commands that tend to work well:
“Follow me”
“Go get the sword”
“Pick up the bow, then pick up an arrow, then kill a bat”
“Explore the dungeon”
“Stay here”
“How many monsters are in the dungeon?”
The AI will generate Lua code that runs each game tick, moving one space at a time toward completing the goal.
Depending on the LLM, simpler commands may be required. For instance, with GPT-5 you can likely say “kill all of the bats” and it will reason out that it needs to get the bow, then the arrows, then go kill all of the bats. With smaller local LLMs, that same command will likely send the AI to attack the bats unarmed. Sometimes SLMs even need compound tasks broken into multiple simple tasks. “Go get the sword, then kill the gorgon” might work, but it will succeed more often as “Go get the sword”, wait for that task to complete, then follow it with “Go kill the gorgon”. Experiment with it and see what works.
Whenever you run a command, you can look at the file located in <WORKING_DIR>/ai_func_out.txt to see the function that was generated.
2. Code Agents vs Tool-Calling Agents
Modern LLM applications often need to interact with external systems — querying databases, calling APIs, or controlling game characters. There are two primary approaches: tool-calling agents and code agents.
What Is a Tool-Calling Agent?
A tool-calling agent works by having the LLM select from a predefined set of functions. The application provides a schema describing available tools (name, parameters, descriptions), and the LLM outputs structured JSON indicating which tool to call and with what arguments.
User: "What's the weather in Austin?"
LLM Output: { "tool": "get_weather", "args": { "city": "Austin" } }
Application: Calls get_weather("Austin"), returns "72°F, sunny"
LLM: "The weather in Austin is 72°F and sunny."
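The application side of this exchange is a simple table lookup and dispatch. A minimal Lua sketch (the tool table, function names, and decoded call are illustrative; in practice the JSON would come from the LLM and be decoded first):

```lua
-- Registry of predefined tools the LLM may select from (illustrative names)
local tools = {
  get_weather = function(args)
    -- stand-in for a real weather API call
    return "72°F, sunny in " .. args.city
  end,
}

-- The LLM's structured output, already decoded from JSON into a table
local call = { tool = "get_weather", args = { city = "Austin" } }

-- Dispatch: only functions present in the registry can ever run
local fn = tools[call.tool]
assert(fn, "LLM requested an unknown tool")
local result = fn(call.args)
```

Because the registry is fixed, anything outside it is simply unreachable — which is exactly the predictability (and the inflexibility) discussed below.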
Advantages:
Predictable — Only predefined functions can be called
Easy to secure — Each tool is explicitly implemented and validated
Structured output — JSON parsing is well-understood
Lower risk — The LLM cannot execute arbitrary logic
Disadvantages:
Limited flexibility — Complex behaviors require many predefined tools
Combinatorial explosion — As capabilities grow, tool count explodes
No novel solutions — The LLM cannot invent behaviors you didn’t anticipate
Slow — Each function call is likely multiple inference calls to the LLM
No logic — We rely on the LLM for all logic, even though code is superior at exact solutions
Verbose prompts — The JSON specifications make for large, hard-to-read function declarations
What Is a Code Agent?
A code agent generates executable code rather than selecting from predefined functions. Given a natural language request, the LLM writes a program that implements the desired behavior. The application then executes that code in a sandboxed environment.
User: "Go get the sword, then come back to me."
LLM Output:
function update_func(player, ai, monsters, items)
if not ai:has_item("sword") then
for _, item in ipairs(items) do
if item.name == "sword" then
local path = ai:find_path_to(item.position)
if path and #path > 0 then
ai:move(path[1])
return "Moving toward sword"
end
end
end
else
local path = ai:find_path_to(player.position)
if path and #path > 0 then
ai:move(path[1])
return "Returning to player"
end
end
return "Done"
end
Application: Executes this Lua function each game tick
Advantages:
Highly flexible — Can express any logic the language supports
Compositional — Complex behaviors emerge from simple primitives
Novel solutions — LLM can invent behaviors not explicitly designed
Fewer APIs — Expose primitives, let the LLM compose them
More readable prompts — Common scripting function declarations are the documentation
Smaller prompts — Writing code allows the system prompt to be more concise and faster to load
Faster — One inference call can generate code that executes for the duration of the task
Disadvantages:
Security risk — Arbitrary code execution is inherently dangerous
Unpredictable — Generated code may have bugs or unexpected behavior
Harder to validate — Cannot easily verify correctness before execution
Requires sandboxing — Must carefully restrict what the code can do
When to Use Each
| Scenario | Recommended Approach |
|---|---|
| Limited, well-defined actions | Tool-calling |
| High-security requirements | Tool-calling |
| Simple request-response patterns | Tool-calling |
| Complex, multi-step behaviors | Code agent |
| Creative or exploratory tasks | Code agent |
| Game AI or robotics control | Code agent |
| When you can sandbox effectively | Code agent |
Hybrid approaches are also possible: use tool-calling for high-risk operations (payments, deletions) and code agents for low-risk, creative tasks (game AI, data analysis).
This sample demonstrates a code agent because controlling a game character requires flexible, multi-step logic that would be tedious to express as individual tool calls. The character needs to pathfind, make decisions based on game state, and adapt to changing circumstances — all of which are naturally expressed as code.
3. Security Considerations
Executing LLM-generated code is inherently risky. Before evaluating scripting languages, we must understand the threats we need to mitigate.
The Risk of Code Agents
Unlike tool-calling agents where the application controls exactly what functions can be invoked, code agents execute arbitrary programs. The LLM might generate code that:
Accidentally or intentionally accesses the filesystem, network, or system resources
Consumes unbounded memory, crashing the application
Enters infinite loops, hanging the application
Exploits language features to escape the sandbox
Corrupts application state in unexpected ways
These risks exist even when the LLM is “well-intentioned” — bugs in generated code can trigger any of these behaviors. A robust code agent must defend against both malicious and accidental misuse.
Threat Model
We identified six categories of threats that a code agent sandbox must address:
Dangerous Function Access
Threat: Generated code calls functions that access the filesystem, execute shell commands, load arbitrary modules, or interact with the network.
Examples (pseudocode):
`execute_shell("rm -rf /")` — Run shell commands
`read_file("/etc/passwd")` — Read sensitive files
`import("network_library")` — Load dangerous modules
`eval(arbitrary_code_string)` — Execute arbitrary code
Mitigation: The sandbox must either not load dangerous libraries, or selectively remove dangerous functions after loading.
Memory Exhaustion
Threat: Generated code allocates unbounded memory, exhausting system resources and crashing the application.
Examples (pseudocode):
`while true: list.append(large_string)` — Infinite allocation
`s = ""; for i in 1..billion: s = s + "x"` — String concatenation bomb
Mitigation: The runtime must enforce a memory limit and fail gracefully when exceeded.
Stack Overflow
Threat: Generated code uses deep or infinite recursion, overflowing the call stack.
Examples (pseudocode):
`function f(): f()` — Infinite recursion
Mutually recursive functions that never terminate
Mitigation: The runtime must track call depth and abort execution when a threshold is exceeded.
Infinite Loops / Hangs
Threat: Generated code enters an infinite loop, hanging the application and preventing further user interaction.
Examples (pseudocode):
`while true: pass` — Infinite loop with no I/O
`for i in 1..infinity: pass` — Extremely long loop
Mitigation: The runtime must enforce a time limit or instruction count limit and abort long-running code.
Prototype/Metatable Manipulation
Threat: Generated code manipulates object prototypes or metatables to bypass sandbox restrictions or corrupt internal data structures. Many scripting languages allow customizing how objects behave through prototype chains, metatables, or similar mechanisms.
Examples (pseudocode):
`set_prototype(entity, malicious_handler)` — Override entity behavior
`get_prototype(entity).write_handler = null` — Disable write protection
Mitigation: Remove or restrict access to prototype/metatable manipulation functions.
Game State Corruption
Threat: Generated code directly modifies application state in ways that break game logic or cause crashes.
Examples (pseudocode):
`player.health = -999` — Invalid health value
`monster.position = null` — Break rendering/pathfinding
`ai.items = "not a list"` — Type corruption
Mitigation: Expose application state through controlled interfaces that validate or reject invalid modifications.
Summary
The six categories this sample has addressed are:
| Threat | Required Mitigation |
|---|---|
| Dangerous functions | Selective library loading, function removal |
| Memory exhaustion | Custom allocator with limits |
| Stack overflow | Call depth tracking |
| Infinite loops | Timeout / instruction counting |
| Metatable manipulation | Remove metatable functions |
| State corruption | Controlled state access, validation |
This list is not exhaustive and can expand or contract depending on your use case and the language you choose to have the AI write in.
The next section evaluates scripting languages against these requirements.
4. Scripting Engine Evaluation
Choosing a scripting language for a code agent requires balancing several concerns: security (can we sandbox it?), embeddability (can we integrate it into C++?), LLM familiarity (will models generate correct code?), and runtime characteristics (performance, memory footprint).
Candidates Considered
We evaluated three scripting languages:
Python
Pros:
LLMs are extremely proficient at generating Python code
Rich standard library
Familiar to most developers
Excellent documentation and community
Cons:
Difficult to sandbox securely — the standard library has many escape hatches
Larger than desired runtime footprint (~10-20MB+ for embedded Python)
Larger than desired compile time footprint (~150MB+ for embedded Python)
Complex C++ integration (reference counting, GIL, etc.)
Hard to limit memory and CPU usage reliably
Python’s sandboxing challenges are well-documented. Even “restricted” Python environments have historically been bypassed. For a code agent where untrusted LLM-generated code runs, this is a significant concern.
If you do consider Python, pay special attention to these areas:
If you are still using the GIL (Python <= 3.13), threaded implementations become difficult, and it can become impossible to cleanly restart the Python interpreter.
Subinterpreters can be restarted, but at the cost of lost memory.
Subprocesses can be restarted cleanly and handle memory loss properly, but can be difficult to debug.
ChaiScript
Pros:
Designed specifically for C++ embedding
Header-only library, easy to integrate
Clean syntax, similar to JavaScript
No external dependencies
Cons:
LLMs have limited training data on ChaiScript — generated code often has syntax errors
Smaller community and less documentation
Sandboxing support is limited
Less battle-tested than alternatives
ChaiScript’s obscurity was a significant problem. Models frequently hallucinated non-existent functions or used incorrect syntax, requiring extensive prompt engineering and retry logic.
If you do consider ChaiScript, pay special attention to these areas:
ChaiScript has limited ability to properly sandbox memory usage.
Special care must be taken with how variable scoping works when using ChaiScript in a threaded environment.
Lua
Pros:
Designed for embedding from the start
Tiny footprint (~200KB)
Small compile time footprint (a few MB)
Excellent sandboxing support — can selectively load libraries and remove functions
Custom allocator support for memory limiting
Debug hooks for timeout/recursion control
LLMs generate reasonable Lua code (widely used in game modding, WoW, Roblox, etc.)
Battle-tested in thousands of game engines
Cons:
Syntax can be unfamiliar (1-indexed arrays, `~=` for not-equal, `:` vs `.`)
Smaller standard library than Python
LLMs occasionally make syntax errors (though fewer than ChaiScript)
Why Lua Won
Lua was selected for this sample because it best balances our requirements:
| Requirement | Python | ChaiScript | Lua |
|---|---|---|---|
| Sandboxing | ❌ Difficult | ⚠️ Limited | ✅ Excellent |
| Memory limiting | ❌ Hard | ❌ No | ✅ Custom allocator |
| Timeout control | ❌ Complex | ❌ No | ✅ Debug hooks |
| LLM code quality | ✅ Excellent | ❌ Poor | ⚠️ Good |
| Embedding ease | ⚠️ Complex | ✅ Easy | ✅ Easy |
| Runtime size | ❌ Large | ✅ Small | ✅ Tiny |
While Python would produce better LLM-generated code, it cannot be safely sandboxed. ChaiScript embeds easily but is difficult to effectively secure and LLMs struggle with its syntax. Lua provides the best combination: reasonable LLM output quality with excellent security controls.
The occasional Lua syntax errors from LLMs can be mitigated with retry logic (see Section 6), making it a practical choice for production code agents.
5. How Lua Addresses Each Threat
This section explains how the sample implements each security mitigation using Lua’s features. The relevant code is in lua_bindings.cpp.
Selective Library Loading
Unlike luaL_openlibs() which loads everything, we selectively load only safe libraries:
// Load safe libraries
luaL_requiref(L, "_G", luaopen_base, 1); // Basic functions
luaL_requiref(L, LUA_TABLIBNAME, luaopen_table, 1); // Table manipulation
luaL_requiref(L, LUA_STRLIBNAME, luaopen_string, 1); // String functions
luaL_requiref(L, LUA_MATHLIBNAME, luaopen_math, 1); // Math functions
luaL_requiref(L, LUA_UTF8LIBNAME, luaopen_utf8, 1); // UTF-8 support
luaL_requiref(L, LUA_OSLIBNAME, luaopen_os, 1); // OS (then sanitized)
// NOT loading:
// - io: File I/O
// - debug: Can break sandbox
// - package: Can load arbitrary modules
// - coroutine: Could complicate timeout handling
By never loading dangerous libraries, those functions simply don’t exist — there’s nothing to exploit.
Dangerous Function Removal
Even safe libraries contain dangerous functions. After loading, we remove them:
// Remove dangerous os functions (keep time, date, difftime, clock)
removeFromLib(L, "os", "execute"); // Shell commands
removeFromLib(L, "os", "exit"); // Terminate program
removeFromLib(L, "os", "remove"); // Delete files
removeFromLib(L, "os", "rename"); // Move files
removeFromLib(L, "os", "getenv"); // Environment variables
// Remove dangerous base functions
removeGlobal(L, "dofile"); // Execute Lua file
removeGlobal(L, "loadfile"); // Load Lua file
removeGlobal(L, "load"); // Execute arbitrary strings
removeGlobal(L, "require"); // Module loading
The removeFromLib and removeGlobal helpers simply set these to nil, making them undefined.
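From the Lua side, removal really is just assigning nil. A pure-Lua sketch of the same idea (the sample performs this from C++ via the Lua C API; the `sandbox_os` table here is illustrative):

```lua
-- Setting a library entry to nil makes the function undefined
local function removeFromLib(lib, name)
  lib[name] = nil
end

-- A stand-in "os" table before sanitizing (illustrative)
local sandbox_os = {
  time = os.time,
  clock = os.clock,
  execute = function() end, -- pretend dangerous function
}

removeFromLib(sandbox_os, "execute")

assert(sandbox_os.execute == nil) -- dangerous function no longer exists
assert(sandbox_os.time ~= nil)    -- benign functions survive
```

Any sandboxed code that later calls `os.execute(...)` fails with "attempt to call a nil value" rather than running a shell command.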
Custom Memory Allocator
Lua allows replacing its memory allocator via lua_newstate(). We provide a custom allocator that tracks usage and enforces a limit:
static void* luaLimitedAlloc(void* ud, void* ptr, size_t osize, size_t nsize)
{
    LuaMemoryTracker* tracker = static_cast<LuaMemoryTracker*>(ud);
    // Free request: Lua's allocator contract requires freeing and returning null
    if (nsize == 0)
    {
        free(ptr);
        if (ptr) tracker->currentUsage -= osize;
        return nullptr;
    }
    // Calculate new usage (osize is only meaningful when ptr is non-null)
    size_t newUsage = tracker->currentUsage + nsize - (ptr ? osize : 0);
    // Refuse allocation if it would exceed limit
    if (newUsage > tracker->limit)
    {
        return nullptr; // Lua throws out-of-memory error
    }
    // Perform allocation and track usage
    void* newPtr = realloc(ptr, nsize);
    if (newPtr) tracker->currentUsage = newUsage;
    return newPtr;
}
// Create Lua state with 100MB limit
lua_State* L = lua_newstate(luaLimitedAlloc, tracker);
When the limit is reached, Lua receives a null pointer and throws a recoverable out-of-memory error.
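The bookkeeping the allocator performs can be sketched in plain Lua (illustrative only — the real tracker lives in C++; `would_allow` is a made-up name for the accept/refuse decision):

```lua
-- Track total usage against a hard limit, refusing allocations that exceed it
local tracker = { usage = 0, limit = 100 * 1024 * 1024 } -- 100MB

local function would_allow(osize, nsize)
  local new_usage = tracker.usage + nsize - osize
  if new_usage > tracker.limit then
    return false -- refused: Lua would see a null pointer and raise OOM
  end
  tracker.usage = new_usage
  return true
end

assert(would_allow(0, 1024))                  -- small allocation: accepted
assert(not would_allow(0, 200 * 1024 * 1024)) -- over the limit: refused
```

Note that frees (nsize of 0) always succeed and reduce `usage`, so the sandbox recovers headroom as Lua's garbage collector runs.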
Debug Hooks for Timeout and Recursion
Lua’s debug hooks let us intercept execution at key points. We use three hook types:
static void luaTimeoutHook(lua_State* L, lua_Debug* ar)
{
    // Track call depth for recursion limit
    if (ar->event == LUA_HOOKCALL || ar->event == LUA_HOOKTAILCALL)
    {
        g_luaCallDepth++;
        if (g_luaCallDepth > LUA_MAX_CALL_DEPTH) // 200
        {
            luaL_error(L, "Stack overflow (recursion depth exceeded)");
        }
    }
    else if (ar->event == LUA_HOOKRET)
    {
        g_luaCallDepth--;
    }
    // Check timeout every N instructions
    if (ar->event == LUA_HOOKCOUNT)
    {
        auto now = std::chrono::steady_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            now - g_luaStartTime).count();
        if (elapsed > LUA_TIMEOUT_MS) // 1000ms
        {
            luaL_error(L, "Execution timed out");
        }
    }
}
// Install hook before calling Lua code
lua_sethook(L, luaTimeoutHook, LUA_MASKCALL | LUA_MASKRET | LUA_MASKCOUNT, 1000);
This catches both infinite loops (via timeout) and infinite recursion (via call depth tracking).
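The same idea can be tried in plain Lua with `debug.sethook`, using an instruction budget instead of wall-clock time (illustrative — the sample installs its hook from C++, and the `debug` library itself is never exposed to sandboxed code):

```lua
-- Abort runaway code after a fixed instruction budget
local budget = 100000

debug.sethook(function()
  budget = budget - 1000
  if budget <= 0 then
    debug.sethook() -- uninstall before aborting
    error("Execution timed out")
  end
end, "", 1000) -- fire every 1000 VM instructions

-- An infinite loop is cut short by the hook instead of hanging
local ok, err = pcall(function()
  while true do end
end)

assert(not ok and err:match("timed out"))
```

Errors raised inside a hook propagate into the running code, so a `pcall` at the host boundary catches them cleanly.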
Metatable Protection
Metatables control how Lua objects behave. If code could modify our Entity metatable, it could bypass protections. We remove the metatable functions entirely:
removeGlobal(L, "getmetatable"); // Can't inspect metatables
removeGlobal(L, "setmetatable"); // Can't modify metatables
removeGlobal(L, "rawget"); // Can't bypass __index
removeGlobal(L, "rawset"); // Can't bypass __newindex
removeGlobal(L, "rawequal"); // Can't bypass __eq
removeGlobal(L, "rawlen"); // Can't bypass __len
Without these functions, code can only interact with entities through our controlled __index and __newindex metamethods.
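To see why the raw* functions must go, consider this pure-Lua demonstration of `rawset` bypassing a `__newindex` guard (the guarded table is illustrative):

```lua
-- A table whose metatable blocks all writes
local guarded = setmetatable({}, {
  __newindex = function() error("read-only") end,
})

-- Normal writes go through __newindex and are rejected
local ok = pcall(function() guarded.health = -999 end)
assert(not ok)

-- rawset ignores __newindex entirely, defeating the protection
rawset(guarded, "health", -999)
assert(guarded.health == -999) -- the guard was bypassed
```

Inside the sandbox, `rawset` is nil, so the second path simply does not exist.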
Entity Field Write Protection
Our __newindex metamethod controls what happens when code writes to entity fields. We block writes to built-in fields while allowing custom fields for AI state:
static int entity_newindex(lua_State* L)
{
Entity* e = checkEntity(L, 1);
const char* key = luaL_checkstring(L, 2);
// Block writes to built-in fields
if (strcmp(key, "name") == 0 || strcmp(key, "position") == 0 ||
strcmp(key, "health") == 0 || /* ... */)
{
return luaL_error(L, "Cannot modify '%s' - use methods instead", key);
}
// Allow custom fields (stored in shadow table)
getEntityCustomFields(L, e);
lua_pushstring(L, key);
lua_pushvalue(L, 3);
lua_settable(L, -3);
return 0;
}
This prevents monster.health = 0 (cheating) while allowing ai.my_target = monster (state tracking).
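The same write policy can be expressed in pure Lua, which can be handy for prototyping the rules before wiring them into C++ (the `builtin` field list and `makeEntityProxy` helper are illustrative, not the sample's API):

```lua
-- Fields that belong to the engine and must not be written directly
local builtin = { name = true, position = true, health = true }

-- Proxy: reads fall through to engine data, writes to builtins are rejected,
-- writes to anything else land in a per-entity custom-field table
local function makeEntityProxy(data, custom)
  return setmetatable({}, {
    __index = function(_, k)
      if custom[k] ~= nil then return custom[k] end
      return data[k]
    end,
    __newindex = function(_, k, v)
      if builtin[k] then
        error("Cannot modify '" .. k .. "' - use methods instead")
      end
      custom[k] = v
    end,
  })
end

local monster = makeEntityProxy({ name = "gorgon", health = 10 }, {})

assert(not pcall(function() monster.health = 0 end)) -- cheating blocked
monster.my_tag = "target"                            -- custom AI state allowed
assert(monster.my_tag == "target")
assert(monster.health == 10)                         -- engine data intact
```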
Summary
| Threat | Lua Feature Used | Implementation |
|---|---|---|
| Dangerous functions | Selective loading | `luaL_requiref` for safe libraries only |
| Dangerous functions | Function removal | `removeFromLib` / `removeGlobal` set entries to nil |
| Memory exhaustion | Custom allocator | `luaLimitedAlloc` via `lua_newstate` |
| Stack overflow | Debug hooks | Call depth tracking in `luaTimeoutHook` |
| Infinite loops | Debug hooks | Timeout check on `LUA_HOOKCOUNT` |
| Metatable abuse | Function removal | Remove `getmetatable`, `setmetatable`, raw* functions |
| State corruption | Metamethods | `__index` / `__newindex` validation |
6. Design Decisions for LLM Success
Security is necessary but not sufficient — the LLM must also generate correct code. This section covers design decisions that improve code generation success rates.
API Design: Entity Methods vs Global Functions
We expose functionality as both entity methods and global functions, but the system prompt emphasizes methods:
-- Method syntax (preferred, documented in prompt)
ai:move("w")
ai:has_item("sword")
local path = ai:find_path_to(player.position)
-- Global function syntax (also works, but not documented)
move_entity(ai, "w")
has_item(ai, "sword")
local path = find_path_astar(ai.position, player.position, entities)
Why methods work better:
More idiomatic — LLMs trained on Lua (game mods, Roblox, etc.) see method syntax frequently
Cleaner code — `ai:move("w")` vs `move_entity(ai, "w")` is more readable
Fewer parameters — Methods automatically use the entity and global maze, reducing chances for error
Better autocomplete patterns — LLMs predict `ai:` then the method name more reliably
We still register the global functions as a fallback — if the LLM accidentally generates move_entity(ai, "w"), it works. Defense in depth for correctness.
Prompt Engineering
The system prompt (codeagent_prompt_lua.txt) is carefully structured to prevent common LLM errors:
1. Explicit examples for ambiguous cases:
NOTE: These methods take POSITIONS, not strings or names!
WRONG: ai:find_path_to("sword") -- strings don't work!
WRONG: ai:distance_to(monster.weakness) -- that's a string!
RIGHT: ai:find_path_to(item.position) -- use .position
Without this, LLMs frequently hallucinate that find_path_to("sword") will find the sword by name.
2. Clear item location rules:
Items exist in ONE of two places (never both):
1) ON THE GROUND: Found in the global "items" table
2) IN AN INVENTORY: Found in entity.items (use entity:has_item() to check)
LLMs often confuse ground items with inventory items without explicit guidance.
3. Function signature with parameter types:
entity:find_path_to(position)
--[[
Parameters:
position: A position table {row, col} - use entity.position for entities
Returns:
table: Array of direction strings, or empty {} if unreachable
]]
Documenting return types and parameter formats reduces hallucinations.
4. Lua-specific reminders:
- Lua tables are 1-indexed! Use: for i, item in ipairs(array) do ... end
- Method call: entity:method() (passes entity as first arg)
- Check empty table: if #path > 0 then ... end
LLMs trained primarily on Python/JavaScript often forget Lua’s 1-indexing.
Error Handling and Retry
LLM-generated code can fail on the first attempt. We implement a retry loop that feeds errors back to the model in an attempt to autocorrect.
for (int attempt = 0; attempt <= MAX_RETRIES && !success; attempt++)
{
if (attempt == 0)
{
fullPrompt = "Write an update_func that satisfies: \"" + prompt + "\"";
}
else
{
// Include the failed code and error message
fullPrompt = "Your previous code:\n" + code +
"\nfailed with this error:\n" + lastError +
"\n\nGenerate corrected code.";
}
code = llmCreateAIFunc(fullPrompt);
// Try to compile
if (luaL_dostring(L, code.c_str()) != LUA_OK)
{
lastError = lua_tostring(L, -1);
continue; // Retry with error context
}
// Try runtime
std::string result = callLuaAIFunc(L, ...);
if (result.starts_with("Function failed"))
{
lastError = result;
continue; // Retry with error context
}
success = true;
}
This catches both compile-time errors (syntax mistakes) and runtime errors (nil indexing, type errors), giving the LLM a chance to self-correct.
Reference Semantics with Userdata
We pass C++ entities to Lua as userdata pointers, not copies:
// Push entity as pointer (reference semantics)
static void pushEntity(lua_State* L, Entity* entity)
{
Entity** udata = (Entity**)lua_newuserdata(L, sizeof(Entity*));
*udata = entity;
luaL_getmetatable(L, ENTITY_META);
lua_setmetatable(L, -2);
}
Why this matters:
Changes persist — When Lua code calls `ai:move("w")`, the C++ entity’s position actually changes. No sync-back needed.
No stale data — Reading `monster.health` always returns the current value
Efficient — No copying of entity data back and forth
The alternative (copying entities to Lua tables) would require syncing changes back to C++ after every Lua call, which is error-prone and inefficient.
Shadow Tables for Custom Fields
The AI needs to store persistent state across function calls (e.g., “Tell me the number of unique positions I have occupied?”). We support custom fields on entities using a shadow table:
// Shadow table: maps entity pointers to their custom fields
static void getEntityCustomFields(lua_State* L, Entity* entity)
{
pushShadowTable(L); // Global table in registry
lua_pushlightuserdata(L, entity); // Use pointer as key
lua_gettable(L, -2);
if (lua_isnil(L, -1))
{
// Create new table for this entity's custom fields
lua_newtable(L);
lua_pushlightuserdata(L, entity);
lua_pushvalue(L, -2);
lua_settable(L, -4); // shadow[entity] = {}
}
lua_remove(L, -2);
}
This allows:
-- AI can store custom state for tracking
player.visited_positions = player.visited_positions or {}
local pos_key = player.position[1] .. "," .. player.position[2]
player.visited_positions[pos_key] = true
-- Count unique positions
local count = 0
for _ in pairs(player.visited_positions) do count = count + 1 end
return "You have visited " .. count .. " unique positions"
-- Built-in fields still work (read-only)
local pos = player.position -- reads from C++
The shadow table uses weak keys, so when an entity is garbage collected, its custom fields are automatically cleaned up.
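The weak-key behavior is easy to observe in pure Lua (illustrative sketch; the sample keeps its shadow table in the C registry):

```lua
-- A shadow table with weak keys: entries disappear once the key
-- (the entity) is no longer referenced anywhere else
local shadow = setmetatable({}, { __mode = "k" })

local entity = {}
shadow[entity] = { my_target = "bat" }
assert(shadow[entity].my_target == "bat")

-- Drop the last reference to the entity and collect
entity = nil
collectgarbage("collect")
collectgarbage("collect")

-- The entity's custom fields were cleaned up automatically
local count = 0
for _ in pairs(shadow) do count = count + 1 end
assert(count == 0)
```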
Caveat: While this feature works correctly, current SLMs rarely generate code that uses it effectively. Stateful tasks like “track unique positions visited” require the model to correctly initialize state, update it each call, and avoid resetting it with local. In practice, SLM-generated code often gets this wrong, especially with smaller models. Consider this an available capability rather than a reliable feature for LLM-generated code; future models or cloud models may be far better at using such techniques.
Summary
| Decision | Problem Solved | Implementation |
|---|---|---|
| Method syntax | Cleaner code, better LLM predictions | Methods on the entity metatable (`ai:move("w")`) |
| Explicit examples in prompt | Prevent common hallucinations | “WRONG/RIGHT” examples |
| Error feedback retry | Self-correction of mistakes | Loop with error context |
| Userdata pointers | Changes persist, no sync needed | `pushEntity` stores an `Entity*` as userdata |
| Shadow tables | Custom state without corrupting entities | Registry table with weak keys |
7. Alternative Architecture — Curated Function Libraries
While this sample demonstrates runtime code generation, some applications may require stricter control over what code can execute. An alternative approach combines LLM code generation during development with human curation for production.
The Hybrid Approach
Instead of generating code at runtime, you can:
Development Phase: Have your team interact with the code agent during development, making the kinds of requests end users would make. The LLM generates functions as usual, but each generated function is cached and logged.
Curation Phase: Developers review the generated functions, validating and approving the ones that work correctly. Over time, you build a library of vetted, production-ready functions.
Production Phase: When an end user makes a request, use semantic matching (embeddings, similarity search) to find a pre-approved function that matches their intent. If a match exists, use it. If not, either:
Strict mode: Return an error (“I don’t know how to do that yet”)
Fallback mode: Generate code at runtime (with all the sandboxing protections)
Benefits
Human oversight — Every function that runs in production has been reviewed
Predictable behavior — Users get tested, validated code paths
Reduced latency — No LLM inference needed for common requests
Security confidence — No runtime code generation in strict mode
Continuous improvement — New requests in fallback mode become candidates for curation
Implementation Considerations
Function storage: Store approved functions with metadata (original prompt, function code, semantic embedding of the request).
Semantic matching: Use an embedding model to convert user requests to vectors, then find the nearest approved function. Set a similarity threshold — below it, consider the request “unmatched.”
Parameterization: Some functions may need light parameterization (e.g., “go get the sword” vs “go get the bow”). Consider whether exact matches are sufficient or if you need template-based functions.
Fallback policy: Decide whether unmatched requests should fail gracefully or trigger runtime generation. This is a security/flexibility tradeoff.
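The matching step can be sketched in a few lines. This hedged Lua sketch assumes requests have already been embedded as vectors (the two-dimensional embeddings, the `approved` entries, and the threshold value are all illustrative):

```lua
-- Cosine similarity between two equal-length vectors
local function cosine(a, b)
  local dot, na, nb = 0, 0, 0
  for i = 1, #a do
    dot = dot + a[i] * b[i]
    na = na + a[i] * a[i]
    nb = nb + b[i] * b[i]
  end
  return dot / (math.sqrt(na) * math.sqrt(nb))
end

-- Library of curated, human-approved functions with stored embeddings
local approved = {
  { prompt = "go get the sword", embedding = {1, 0}, code = "..." },
  { prompt = "follow me",        embedding = {0, 1}, code = "..." },
}

-- Return the best match at or above the threshold, or nil if unmatched
local function findMatch(queryEmbedding, threshold)
  local best, bestScore = nil, threshold
  for _, fn in ipairs(approved) do
    local score = cosine(queryEmbedding, fn.embedding)
    if score >= bestScore then best, bestScore = fn, score end
  end
  return best
end

assert(findMatch({0.9, 0.1}, 0.8).prompt == "go get the sword")
assert(findMatch({0.7, 0.7}, 0.99) == nil) -- below threshold: unmatched
```

An unmatched result is where the fallback policy applies: error out in strict mode, or fall through to runtime generation.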
When to Use This Approach
| Scenario | Recommendation |
|---|---|
| High-security production environment | Strict mode (no runtime generation) |
| Internal tools with trusted users | Fallback mode acceptable |
| Games with predictable command patterns | Curated library works well |
| Open-ended creative applications | Runtime generation may be necessary |
This approach lets you capture the benefits of LLM code generation during development while maintaining tight control over what runs in production.
8. Conclusion
Code agents offer a powerful alternative to tool-calling agents for tasks that require flexible, multi-step logic. By having the LLM generate executable code rather than selecting from predefined functions, we can build AI companions that adapt to novel situations and compose behaviors in ways we didn’t explicitly anticipate.
However, this power comes with responsibility. Executing LLM-generated code requires careful sandboxing to prevent:
Dangerous function access (file I/O, shell execution)
Resource exhaustion (memory, CPU, stack)
Sandbox escapes (metatable manipulation)
Application state corruption
Lua proved to be an excellent choice for this sample, offering:
Selective library loading and function removal
Custom memory allocators for hard limits
Debug hooks for timeout and recursion control
A syntax that LLMs can generate with reasonable accuracy
Beyond security, we found that API design and prompt engineering significantly impact LLM success rates. Method syntax, explicit WRONG/RIGHT examples, and error-feedback retry loops all contribute to more reliable code generation.
This sample demonstrates that code agents are practical today, but require thoughtful engineering. The techniques shown here — sandboxing, API design, prompt engineering, and graceful error handling — provide a foundation for building code agents in your own applications.
Key Takeaways:
Code agents vs tool-calling — Choose based on task complexity and security requirements
Language choice matters — Prioritize sandboxing capability, then LLM familiarity
Defense in depth — Multiple layers of protection for each threat category
Design for LLM success — API design and prompts are as important as the runtime
Expect iteration — Error feedback and retry loops improve success rates significantly
We encourage you to experiment with this sample, try different commands, and explore how the AI companion responds. The code is designed to be readable and modifiable — use it as a starting point for your own code agent implementations.