SICA - Self Improving Coding Agent

SICA result

We code AI agents that code. What if we code an AI agent that codes another agent, which codes better than us? 😖 Some researchers at the University of Bristol, UK, with crazy ideas and deep pockets had this very idea. The result is in this research paper. No kidding, their 15 iterations cost them $7,000, and it improved the agent's performance three-fold.

Self Improving Coding Agent, aka SICA, is a coding agent that can rewrite its own code. The rationale behind the idea is "What if we overlooked something?" By letting LLMs think on their own and explore new possibilities, we might discover better solutions. You might be familiar with AlphaGo's Move 37: the AI invented a new move in a game that had been played for thousands of years. This means that, due to our limited cognitive ability, we overlooked some winning moves and never played them until AlphaGo unveiled one in Game 2. The SICA project is a similar search for overlooked solutions. What if there is a better way to solve a coding challenge that we are still missing?

Here is a quote from the paper.

A motivation for a self-improving system is that the improvements in coding abilities may be leveraged during subsequent improvement steps, hopefully compounding.

This is a very simple project.

  1. Develop a very basic AI agent that can read, write and execute code.
  2. Give it a 50 test-case benchmark and ask it to solve them.
  3. Based on the results, ask it to rewrite its own code.
  4. Go to Step 2.

Non-gradient-based learning

This research introduces a novel learning method. In a typical machine-learning algorithm, we adjust the weights after every iteration based on a utility (or loss) function. In SICA, the weights are fixed: learning happens online, and the adjustment happens in the agent's code rather than in the weights.

The utility function is simple and direct.

\( U = w_{score} \, p_{score} + w_{cost} \left(1 - \min(1, p_{cost}/\$10)\right) + w_{time} \left(1 - \min(1, p_{time}/300\,\text{s})\right) \)

  • \( p_{score} \) --> Benchmark score, normalized between 0 and 1.
  • \( p_{cost} \) --> Cost in dollars incurred to achieve the result. Anything beyond $10 incurs the same (maximum) loss.
  • \( p_{time} \) --> Time taken to solve the problems. Anything running longer than 5 minutes incurs the same (maximum) loss.

  • \( w_{score} \) --> Weight of the score. Fixed at 0.5.
  • \( w_{cost} \) --> Weight of the cost. Fixed at 0.25.
  • \( w_{time} \) --> Weight of the time. Fixed at 0.25.

They also added a small penalty on timeouts, which is not captured here for simplicity.
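To make the formula concrete, here is a minimal sketch of the utility computation in Python, using the fixed weights and caps listed above (the timeout penalty is omitted here as well):

def utility(p_score: float, p_cost: float, p_time: float) -> float:
    """Fixed-weight utility used to compare agent versions.

    p_score: benchmark score, already normalized to [0, 1]
    p_cost:  dollars spent on the benchmark run
    p_time:  wall-clock seconds taken by the benchmark run
    """
    w_score, w_cost, w_time = 0.5, 0.25, 0.25
    return (
        w_score * p_score
        + w_cost * (1 - min(1.0, p_cost / 10.0))   # cost capped at $10
        + w_time * (1 - min(1.0, p_time / 300.0))  # time capped at 300 s
    )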

The human-developed basic AI agent is given the 50 test-case benchmark. The percentage of test cases solved, normalized between 0 and 1 (\( p_{score} \)), the LLM token cost incurred to complete the task (\( p_{cost} \)) and the time taken to complete the task (\( p_{time} \)) are measured and fed to the utility function. The commit SHA of the code (the identifier of the current version), the execution logs and the utility score are stored in an archive.

Then the LLM is asked to review the logs and rewrite the agent's code to improve performance. The complete benchmark is run again on the new code and the utility value is captured. If the utility value increases, that is the trajectory forward: the new code becomes the baseline and the next iteration starts from it. Otherwise, the agent falls back to the previous version.

In this novel non-gradient-based approach we don't fit the model by adjusting weights. With the weights fixed, we update the model's code instead: the coefficients of the linear function remain fixed, and it is the functions that produce the variables that get rewritten.
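For clarity, the outer improvement loop can be sketched roughly as follows. This is only an illustration of the idea; evaluate and rewrite are placeholders standing in for "run the full benchmark and compute U" and "ask the LLM to review the archive and rewrite the agent", not the paper's actual code.

from typing import Callable, List, Tuple

def improvement_loop(
    agent_code: str,
    evaluate: Callable[[str], float],      # runs the benchmark, returns utility U
    rewrite: Callable[[str, list], str],   # asks the LLM to rewrite the agent's code
    iterations: int = 15,
) -> str:
    archive: List[Tuple[str, float]] = []  # (code version, utility) per iteration
    best_code, best_u = agent_code, evaluate(agent_code)

    for _ in range(iterations):
        candidate = rewrite(best_code, archive)  # LLM reviews logs/archive, edits code
        u = evaluate(candidate)                  # run the complete benchmark again
        archive.append((candidate, u))
        if u > best_u:                           # higher utility: new baseline
            best_code, best_u = candidate, u
        # otherwise fall back to the previous best version
    return best_code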

Guardrails

Neither AI nor humans are mature enough to grant AI full autonomy, so SICA could lead to catastrophe unless proper failsafe mechanisms are in place. For starters, if the agent introduces a compiler error into its own code, it can never recover. So two instances of the agent are used: one acts as the researcher and the other acts as the lab rat. The code that selects the way forward and the code that runs the benchmark are executed in different instances. If the benchmarking agent falls into an unrecoverable loop, the utility agent has a failsafe to either fix the error or pick a different agent from the archive.

There is always a risk of hallucination when the work is at this scale. So the researchers employed an overseer agent that sits outside the self-improving agent. It runs asynchronously, invoked every 30 seconds, to check whether the self-improving agent is still on the right path. If it finds signs of hallucination, the current run of the self-improving agent is killed and the iteration starts over. A fairly involved mechanism presents the complete state and trajectory of the self-improving agent to the overseer agent; it is detailed in the paper.
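Conceptually, the overseer is a watchdog loop along these lines (purely illustrative; get_state, looks_off_track and kill_and_restart are placeholders, not the paper's implementation):

import time

def overseer(get_state, looks_off_track, kill_and_restart, interval_s: int = 30):
    # Poll the self-improving agent's state and abort runs that have gone off track.
    while True:
        state = get_state()            # summary of the agent's current trajectory
        if looks_off_track(state):     # e.g. an LLM judge flags hallucination
            kill_and_restart()         # kill the current run; the iteration starts over
        time.sleep(interval_s)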

Execution, Results

When we plot the iteration number on the x-axis and the utility on the y-axis, we get the graph below.

SICA result

The experiment ran for 15 iterations. The basic human-developed agent, which started with only open, read and write capabilities, gained new tools with every iteration. Over the run, performance improved from 17% to 53%.

At every inflection point there was a new breakthrough in the agent's approach. For example, in the first iteration the agent developed a smart-edit tool, after which performance improved by about 5%. A significant leap occurred when the agent developed an AST symbol locator in the 9th iteration; a 10% improvement was observed after that.

This is the easter egg we're searching for. The LLM understood the core problem and implemented the right tools to solve it. More than being proof that an LLM can be self-reliant, it sets a direction for human developers. Just as AlphaGo enlightened us with a new move, SICA introduced the AST symbol locator. It would have taken months of manual trial and error to foresee the usefulness of such a tool in daily coding tasks and measure its impact, but SICA did it in less than 24 hours. With more iterations, SICA has the potential to invent more innovative tools.

The Influence

Before getting overly optimistic about AI, we should understand that SICA's success was achieved with some intelligent influence from the developers. They introduced two synthetic benchmarks: 1) File editing - ask the LLM to make known changes and score it on how closely they match. 2) Code navigation - ask interesting questions about symbol definitions and references.

The major difficulty in solving the 50 test-case benchmark was editing files at the correct place without syntax errors and navigating the code base to update the relevant code. But the LLM wasn't smart enough to give these two problems first priority, or at least it didn't do so within the small number of initial iterations. So, to focus its attention, the researchers introduced these synthetic benchmarks. Defining a utility over the problem is a more intelligent approach than just stating the problem in the prompt.
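As a rough illustration, a file-editing benchmark can be scored with nothing more than a text-similarity check between the agent's edit and the known-good result (this is my own toy example, not the scoring used in the paper):

import difflib

def file_edit_score(edited_text: str, expected_text: str) -> float:
    # Toy scorer: similarity between the agent's edit and the expected file, in [0, 1].
    return difflib.SequenceMatcher(None, edited_text, expected_text).ratio()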

Conclusion

The SICA project has opened doors to groundbreaking advancements in AI, showcasing the potential of non-gradient-based learning, collaborative LLM systems, and the strategic use of synthetic benchmarks to guide AI focus. By iteratively improving its own code, SICA demonstrated significant performance gains, culminating in the creation of innovative tools like the AST symbol locator. As highlighted in Antonio Gulli's "Agentic Design Patterns," such agentic systems illustrate how AI can surpass human limitations, offering transformative approaches to autonomous learning and development. With robust safeguards in place, SICA sets a precedent for building safer, smarter, and more adaptive AI systems.

Agent-001 Part-3

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In the first part of this series, we explored the problem statement and how to leverage an LLM within a script. The second part covered guiding the LLM to produce structured responses and building automation around those outputs. In this post, we’ll dive into the Agentic model.

With the Agentic model, we don’t prescribe a fixed workflow. Instead, we expose a set of tools to the LLM and provide instructions on when and how to use them. The LLM can then autonomously decide which tools to invoke, in what order, and as many times as needed. Since the LLM operates independently—much like James Bond—we refer to it as an Agent.

As the developer creating these tools for the LLM, you’re essentially playing the role of Q. Pretty cool, right? 😎

The Agentic Architecture

First let's create the tools that we're going to expose to the LLM. In our case we're building two tools.

  1. Browser - browser.py
  2. Send Email - send_email.py

The Browser tool enables the LLM to fetch up-to-date information about a joke, especially when it references recent events that may not be included in the model’s training data. This helps prevent misclassification of jokes that could be offensive due to current global contexts. The LLM can invoke the browser whenever it encounters unfamiliar references.

The send-email tool is responsible for queuing emails to the outbox, and its implementation remains unchanged from the previous post. Both tools are implemented as standalone Python scripts, each accepting command-line arguments to perform their respective actions.

To facilitate integration and add input validation, we also created lightweight wrapper functions around these scripts. While not strictly required, these wrappers give developers more control over parameter handling before executing the underlying scripts.

For example, the run_browse function accepts two parameters: term (the search query) and joke (the context). It then invokes browser.py and returns the script’s output.

agent.py: run_browse
def run_browse(term: str, joke: str) -> str:
    """Invoke the browse.py tool with the search term in the context of the joke and return its stdout."""
    browser_arg = f"Define the term '{term}' in the context of this joke: '{joke}'"
    cmd = ["python", "./browser.py", browser_arg]
    logger.info("Running browse tool for term: %s", term)
    try:
        out = subprocess.check_output(
            cmd, stderr=subprocess.STDOUT, text=True, timeout=600
        )
        logger.debug("browse output: %s", out)
        return out
    except subprocess.CalledProcessError as e:
        logger.error("browse.py failed: %s", e.output)
        return ""
    except Exception:
        logger.exception("Error running browse.py")
        return ""

The send_email function is the same as the one explained in Part 2, so I'm not going to repeat it here.

Expose the tools to the LLM

With our two functions (tools) ready, the next step is to make the LLM aware of them. There are two main ways to provide this information:

  1. Embedding tool descriptions directly in the prompt.
  2. Supplying tool definitions as part of the API call.

In this example, we use both methods. First, we enhance the SYSTEM_PROMPT with clear, unambiguous descriptions of each tool. Precise instructions are essential—any ambiguity can lead to LLM hallucinations. Here’s how we update the SYSTEM_PROMPT to include these details:

agent.py: SYSTEM_PROMPT
SYSTEM_PROMPT = f"""
    You are an helpful assistant that helps me to send a funny morning email to my colleagues.
    You will be provided with a programmer joke.
    Your task is to:
    (1) Decide the safe of the joke (safe: safe/dark/offensive).
    (2) Identify to which group the joke to be sent ({GROUPS.keys()}).
    (3) And briefly explain the joke in 1 paragraph.
    You have multiple steps to complete your task.
    IMPORTANT:
      - If there is ANY technical term you are not 100% certain about, FIRST call the `browse` tool before final JSON.
      - If safe == "safe" you MUST attempt the `send_email` tool once before giving the final JSON.
      - Final JSON ONLY after required tool usage (or explicit determination no browse needed AND email attempted when safe).
    Your final response must be a single JSON object with keys: safe (string), category (string), explanation (string) and is_email_sent (boolean).

    The category must be one of these values: system, oops, web, Other.

    Below you can find relevant keywords for each group to help you decide the correct category:
    {json.dumps({k: v["keywords"] for k, v in GROUPS.items()}, indent=4)}

    The safe value must be one of these values: safe, dark, offensive.
    The explanation must be a brief explanation of the joke.

    You have two tools in your toolbox:
    1) A `browse` tool to look up technical terms you don't understand in the context of the joke. You can use this tool to disambiguate the meaning of the joke before classifying it or deciding whether it is safe for work.
    2) An `send_email` tool to send the joke to the relevant team group once you are confident it's safe and correctly categorized.
    Use the `browse` tool first if you need to look up any terms.
    Only use the `send_email` tool once you are confident in your classification and explanation.

    If the Joke is classified as dark, store that in dark.json in the {OUTPUT_DIR} directory. This is for me to forward to my friends later in the day.
"""

In addition to embedding tool descriptions in the prompt, we’ll also provide function-call definitions directly in the API request. Some LLM APIs may not support passing tool information via the API, in which case prompt heuristics alone are sufficient. However, OpenAI APIs allow us to specify available tools using a JSON schema. We’ll take advantage of this capability.

Let’s define a JSON structure that specifies each function’s name, type, and parameters, making them explicit to the LLM:

agent.py: FUNCTION_TOOLS
FUNCTION_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Look up a technical term within the context of the joke to disambiguate meaning before classification.",
            "parameters": {
                "type": "object",
                "properties": {
                    "term": {
                        "type": "string",
                        "description": "The technical term or phrase to research.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "(Optional) The original joke for extra context.",
                    },
                },
                "required": ["term"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send the joke via email to the relevant team group once you are confident it's safe and correctly categorized.",
            "parameters": {
                "type": "object",
                "properties": {
                    "group_label": {
                        "type": "string",
                        "enum": ALLOWED_CATEGORIES,
                        "description": "Category/team to notify.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "The original joke.",
                    },
                    "explanation": {
                        "type": "string",
                        "description": "Reason the joke is relevant and safe.",
                    },
                },
                "required": ["group_label", "joke", "explanation"],
            },
        },
    },
]

How is this information communicated to the LLM? As described in part 2, the system prompt—containing the instruction heuristics—is included in the message sequence. Additionally, the JSON construct specifying the tools is attached to the API payload when making the API call.

agent.py: classify_and_act_on_joke
        try:
            data = chat_completion(messages, tools=FUNCTION_TOOLS)
agent.py: chat_completion
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"

As shown above, when the tools argument is provided to the chat_completion function (which applies here), the API payload includes a tools key containing the JSON definition of available tools.

In summary, tool information is communicated to the LLM through both the system prompt and the tools field in the API payload.

The agentic loop

Although we've made the tools available to the LLM, it can't directly execute them—these tools exist on our local system. To bridge this gap, we need an environment where the LLM's tool invocation requests are executed and the results are returned. This orchestration happens within what’s called the agentic loop.

The agentic loop operates as follows:

  1. Make the initial LLM call, providing the problem statement and tool information.
  2. Inspect the LLM’s response for tool calls. If present, execute the requested tool and append the result to the message history.
  3. Call the LLM again with the updated messages and repeat step 2.
  4. If no tool calls are detected, consider the task complete and exit the loop.

This loop allows the LLM to function autonomously, deciding which tools to use and when, without developer intervention. The main logic is implemented in the classify_and_act_on_joke function.

To prevent the LLM from entering an infinite loop, we set a maximum number of cycles—here, 10. If the LLM doesn’t finish within these iterations, the loop exits automatically.
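Before walking through the project's code, here is a condensed sketch of the whole loop. It assumes the chat_completion helper shown later and a run_tool dispatcher that maps a tool name and its arguments to one of our wrapper functions; error handling is omitted.

import json

def agentic_loop(messages, tools, run_tool, max_cycles: int = 10):
    # Minimal agentic loop: call the LLM, execute any requested tools, repeat.
    for _ in range(max_cycles):
        msg = chat_completion(messages, tools=tools)["choices"][0]["message"]
        messages.append(
            {k: v for k, v in msg.items() if k in ("role", "content", "tool_calls")}
        )

        tool_calls = msg.get("tool_calls") or []
        if not tool_calls:                     # no tool requests: the task is complete
            return msg.get("content") or ""

        for tc in tool_calls:                  # execute each requested tool locally
            args = json.loads(tc["function"].get("arguments") or "{}")
            result = run_tool(tc["function"]["name"], args)
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": tc.get("id"),
                    "name": tc["function"]["name"],
                    "content": result if isinstance(result, str) else json.dumps(result),
                }
            )
    return ""                                  # gave up after max_cycles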

agent.py: classify_and_act_on_joke
    max_cycles = 10
    email_sent_flag: bool = False
    last_email_attempt_reason: str = ""
    for cycle in range(max_cycles):
        try:
            data = chat_completion(messages, tools=FUNCTION_TOOLS)

As you see above, the first LLM call is made inside the for loop. Then we capture the response and check for tool calls.

agent.py: classify_and_act_on_joke
        msg = _assistant_message(data)
        tool_calls = msg.get("tool_calls") or []
        content = msg.get("content") or ""

        # ALWAYS append assistant message so tool_call references remain valid
        messages.append(
            {k: v for k, v in msg.items() if k in ("role", "content", "tool_calls")}
        )

        if tool_calls:

When the LLM responds, any tool calls are included in a separate tool_calls key in the structured output (for OpenAI models, the main response is under content, and tool invocations are under tool_calls). We check if tool_calls is present and not empty to determine if a tool needs to be executed.

Notice that the assistant's response is always appended to the messages array. This step is essential because LLMs do not retain conversational context between calls. To maintain context, every message in the conversation (the initial system prompt, each user prompt, and every LLM response) must be included in the messages list on each API call.

If tool calls are detected, we parse the tool call data to extract the function name and parameters, then invoke the appropriate tool with the parameters provided by the LLM.

agent.py: classify_and_act_on_joke
                fn = tc["function"]["name"]
                raw_args = tc["function"].get("arguments") or "{}"
                try:
                    args = (
                        json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                    )
                except Exception:
                    args = {}
                if fn == "browse":
                    term = args.get("term", "")
                    logger.info(f" 🌐  Browsing for term: {term}")
                    tool_result = run_browse(term, joke)
                elif fn == "send_email":
                    group_label = args.get("group_label") or "Other"
                    explanation = args.get("explanation", "")
                    logger.info(f" ✉️  Sending email to group: {group_label}")
                    sent = send_email(group_label, joke, explanation)
                    tool_result = {
                        "sent": bool(sent),
                        "reason": "ok" if sent else "failed",
                    }
                    email_sent_flag = email_sent_flag or bool(tool_result.get("sent"))
                    last_email_attempt_reason = tool_result.get("reason", "")
                else:
                    tool_result = {"error": f"Unknown tool {fn}"}

The result of the tool execution is captured in the variable tool_result. Now let's append the result to the messages list as a tool message and restart the loop.

agent.py: classify_and_act_on_joke
                messages.append(
                    {
                        "role": "tool",
                        "tool_call_id": tc.get("id"),
                        "name": fn,
                        "content": tool_result
                        if isinstance(tool_result, str)
                        else json.dumps(tool_result),
                    }
                )
            continue  # next cycle after tools

This loop runs until the LLM makes no more tool calls or the maximum number of cycles is exhausted. You can find the full code at the bottom of the page.

The Agent Architecture

We now have a fully functional agent. Let’s break down the core components that make up this architecture:

  1. Tool Implementations: These are standalone utilities that the LLM can invoke. Any command-line tool that a human could use can be exposed to the LLM, though in this example we focus on non-interactive tools. If you wish to support interactive tools (like vim), you’ll need to simulate user interaction within your execution environment, typically by leveraging LLM APIs to handle the input/output flow.
  2. Tool Awareness: The LLM needs to know what tools are available. In our example, we provided this information through both prompt heuristics (in the system prompt) and a tool definition in JSON included as part of the API payload.
  3. Execution Environment: This is where the LLM’s tool invocation requests are executed. In our case, we ran commands directly on the local system. However, for safety, production systems typically use a sandbox environment with only the necessary tools and data.
  4. LLM Model: Here, we used GPT-5 from Azure OpenAI as the reasoning engine.
  5. Agent Loop: This is the main interaction point between the LLM and the environment. The loop orchestrates the conversation, tool calls, and result handling. In fact, the agent loop itself can be considered the core of the agent, with the other components serving as supporting structures. As mentioned earlier, this loop can be implemented in under 100 lines of code.

Together, these components form what’s often called agent scaffolding. There’s no universal best approach—scaffolding should be tailored to the specific task for optimal results. Designing effective scaffolding is as much an art as it is engineering, and it’s a key skill for agentic developers.

Conclusion

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

Thank you for joining me on this three-part journey into building agentic systems with LLMs. In the first post, we explored the foundational problem and learned how to integrate an LLM into a script to process and analyze data. The second part focused on guiding the LLM to produce structured outputs and demonstrated how to automate actions based on those outputs, laying the groundwork for more complex workflows. In this final installment, we delved into the agentic model, where the LLM is empowered to autonomously select and invoke tools, orchestrated through an agentic loop.

Throughout the series, we covered key concepts such as tool creation, prompt engineering, exposing tool definitions to the LLM, and managing the agentic loop for autonomous decision-making. By combining these elements, you can build flexible, powerful agents capable of handling a wide range of tasks with minimal intervention.

I hope this series has provided you with both the technical know-how and the inspiration to experiment with agentic architectures in your own projects. Thank you for reading, and best of luck in your agentic endeavors. May your agents be resourceful, reliable, and always ready for the next challenge!

Code

agent.py

import os
import sys
import json
import time
import logging
import subprocess
import datetime
import glob
import signal
import re
from pathlib import Path
from typing import Dict, Any, Optional

from dotenv import load_dotenv

load_dotenv()

import requests

from datetime import datetime, timezone

OUTPUT_DIR = Path("/tmp/agent-001/")
STATE_FILE = OUTPUT_DIR / "state.json"
DARK_FILE = OUTPUT_DIR / "dark.json"

# Azure OpenAI settings - must be provided as environment variables
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

# Groups mapping (labels expected from the model)
GROUPS = {
    "system": {
        "email": "system@example.com",
        "description": "OS and Platform developers, System administrators and DevOps team",
        "keywords": [
            "operating systems",
            "Linux",
            "Unix",
            "Windows",
            "macOS",
            "DevOps",
            "SysAdmin",
            "infrastructure",
            "cloud",
            "virtualization",
            "containers",
            "Kubernetes",
            "networking",
        ],
    },
    "oops": {
        "email": "oops@example.com",
        "description": "Application and services developers",
        "keywords": [
            "application",
            "services",
            "java",
            "python",
            "c#",
            "go",
            "ruby",
            "php",
            "node.js",
            "dotnet",
            "API",
            "microservices",
            "REST",
            "SOAP",
        ],
    },
    "web": {
        "email": "web-team@example.com",
        "description": "Web technology, front-end, back-end, react, angular, javascript, css developers",
        "keywords": [
            "Web technology",
            "front-end",
            "back-end",
            "react",
            "angular",
            "javascript",
            "css",
            "HTML",
            "web development",
            "UX",
            "UI",
            "web design",
            "web frameworks",
        ],
    },
    "Other": {
        "email": "all@example.com",
        "description": "Everything else, general audience",
        "keywords": [],
    },
}
ALLOWED_CATEGORIES = list(GROUPS.keys())

SYSTEM_PROMPT = f"""
    You are an helpful assistant that helps me to send a funny morning email to my colleagues.
    You will be provided with a programmer joke.
    Your task is to:
    (1) Decide the safe of the joke (safe: safe/dark/offensive).
    (2) Identify to which group the joke to be sent ({GROUPS.keys()}).
    (3) And briefly explain the joke in 1 paragraph.
    You have multiple steps to complete your task.
    IMPORTANT:
      - If there is ANY technical term you are not 100% certain about, FIRST call the `browse` tool before final JSON.
      - If safe == "safe" you MUST attempt the `send_email` tool once before giving the final JSON.
      - Final JSON ONLY after required tool usage (or explicit determination no browse needed AND email attempted when safe).
    Your final response must be a single JSON object with keys: safe (string), category (string), explanation (string) and is_email_sent (boolean).

    The category must be one of these values: system, oops, web, Other.

    Below you can find relevant keywords for each group to help you decide the correct category:
    {json.dumps({k: v["keywords"] for k, v in GROUPS.items()}, indent=4)}

    The safe value must be one of these values: safe, dark, offensive.
    The explanation must be a brief explanation of the joke.

    You have two tools in your toolbox:
    1) A `browse` tool to look up technical terms you don't understand in the context of the joke. You can use this tool to disambiguate the meaning of the joke before classifying it or deciding whether it is safe for work.
    2) An `send_email` tool to send the joke to the relevant team group once you are confident it's safe and correctly categorized.
    Use the `browse` tool first if you need to look up any terms.
    Only use the `send_email` tool once you are confident in your classification and explanation.

    If the Joke is classified as dark, store that in dark.json in the {OUTPUT_DIR} directory. This is for me to forward to my friends later in the day.
"""

# Define tool (function) schemas for GPT-4.1 function calling
FUNCTION_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Look up a technical term within the context of the joke to disambiguate meaning before classification.",
            "parameters": {
                "type": "object",
                "properties": {
                    "term": {
                        "type": "string",
                        "description": "The technical term or phrase to research.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "(Optional) The original joke for extra context.",
                    },
                },
                "required": ["term"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_email",
            "description": "Send the joke via email to the relevant team group once you are confident it's safe and correctly categorized.",
            "parameters": {
                "type": "object",
                "properties": {
                    "group_label": {
                        "type": "string",
                        "enum": ALLOWED_CATEGORIES,
                        "description": "Category/team to notify.",
                    },
                    "joke": {
                        "type": "string",
                        "description": "The original joke.",
                    },
                    "explanation": {
                        "type": "string",
                        "description": "Reason the joke is relevant and safe.",
                    },
                },
                "required": ["group_label", "joke", "explanation"],
            },
        },
    },
]

# Ensure directories exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("agent")


def load_state() -> Dict[str, Any]:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text(encoding="utf-8"))
        except Exception:
            logger.exception("Failed to load state file, starting fresh")
    # default state
    return {"processed": {}, "last_sent": {}}


def save_state(state: Dict[str, Any]) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")


def _extract_json(text: str) -> Optional[dict]:
    """Try to extract the first JSON object from a text blob."""
    try:
        return json.loads(text)
    except Exception:
        m = re.search(r"\{.*\}", text, re.S)
        if m:
            try:
                return json.loads(m.group(0))
            except Exception:
                return None
    return None


def chat_completion(
    messages, tools=None, temperature=0.0, max_tokens=800
) -> Dict[str, Any]:
    """Call Azure OpenAI chat completion returning the full JSON, supporting tool (function) calls."""
    time.sleep(3 + (2 * os.urandom(1)[0] / 255.0))  # jitter
    if not AZURE_ENDPOINT or not AZURE_KEY:
        raise RuntimeError(
            "Azure OpenAI credentials (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY) not set"
        )

    url = f"{AZURE_ENDPOINT}/openai/deployments/{AZURE_DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    headers = {"Content-Type": "application/json", "api-key": AZURE_KEY}
    payload: Dict[str, Any] = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    resp = requests.post(url, headers=headers, json=payload, timeout=90)
    if resp.status_code >= 400:
        logging.error(
            "Azure OpenAI 4xx/5xx response %s: %s", resp.status_code, resp.text
        )
        resp.raise_for_status()
    return resp.json()


def _assistant_message(data: Dict[str, Any]) -> Dict[str, Any]:
    try:
        return data["choices"][0]["message"]
    except Exception:
        raise RuntimeError(f"Unexpected response format: {data}")


def _parse_final_json(content: str) -> Optional[Dict[str, Any]]:
    obj = _extract_json(content)
    if not obj:
        return None
    # Minimal validation (is_email_sent may be absent; we'll add later)
    required = {"safe", "category", "explanation"}
    if not required.issubset(obj.keys()):
        return None
    if obj.get("category") not in GROUPS.keys():
        return None
    if obj.get("safe") not in {"safe", "dark", "offensive"}:
        return None
    return obj


def _append_dark_joke(joke: str, parsed: Dict[str, Any]) -> None:
    """Persist dark jokes to DARK_FILE as an array of entries."""
    try:
        if DARK_FILE.exists():
            arr = json.loads(DARK_FILE.read_text(encoding="utf-8"))
            if not isinstance(arr, list):  # recover if corrupted
                arr = []
        else:
            arr = []
        arr.append(
            {
                "joke": joke,
                "ts": datetime.now(timezone.utc).isoformat(),
                "explanation": parsed.get("explanation", ""),
            }
        )
        DARK_FILE.write_text(json.dumps(arr, indent=2), encoding="utf-8")
    except Exception:
        logger.exception("Failed to append dark joke to %s", DARK_FILE)


def classify_and_act_on_joke(joke: str, state: Dict[str, Any]) -> Dict[str, Any]:
    """Tool (function) calling loop with GPT-4.1 until final JSON classification.

    Guarantees:
      * If classification is safe, an email attempt is performed (tool call or forced local send) before returning.
      * If classification is dark, joke is stored in dark.json.
      * Adds is_email_sent boolean to final JSON.
    """
    messages: list[dict[str, Any]] = [
        {"role": "system", "content": f"{SYSTEM_PROMPT}"},
        {"role": "user", "content": f"joke: {joke}"},
    ]

    max_cycles = 10
    email_sent_flag: bool = False
    last_email_attempt_reason: str = ""
    for cycle in range(max_cycles):
        try:
            data = chat_completion(messages, tools=FUNCTION_TOOLS)
        except Exception:
            logger.exception("chat_completion failed")
            time.sleep(5)
            continue
        msg = _assistant_message(data)
        tool_calls = msg.get("tool_calls") or []
        content = msg.get("content") or ""

        # ALWAYS append assistant message so tool_call references remain valid
        messages.append(
            {k: v for k, v in msg.items() if k in ("role", "content", "tool_calls")}
        )

        if tool_calls:
            for tc in tool_calls:
                if tc.get("type") != "function":
                    continue
                fn = tc["function"]["name"]
                raw_args = tc["function"].get("arguments") or "{}"
                try:
                    args = (
                        json.loads(raw_args) if isinstance(raw_args, str) else raw_args
                    )
                except Exception:
                    args = {}
                if fn == "browse":
                    term = args.get("term", "")
                    logger.info(f" 🌐  Browsing for term: {term}")
                    tool_result = run_browse(term, joke)
                elif fn == "send_email":
                    group_label = args.get("group_label") or "Other"
                    explanation = args.get("explanation", "")
                    logger.info(f" ✉️  Sending email to group: {group_label}")
                    sent = send_email(group_label, joke, explanation)
                    tool_result = {
                        "sent": bool(sent),
                        "reason": "ok" if sent else "failed",
                    }
                    email_sent_flag = email_sent_flag or bool(tool_result.get("sent"))
                    last_email_attempt_reason = tool_result.get("reason", "")
                else:
                    tool_result = {"error": f"Unknown tool {fn}"}
                messages.append(
                    {
                        "role": "tool",
                        "tool_call_id": tc.get("id"),
                        "name": fn,
                        "content": tool_result
                        if isinstance(tool_result, str)
                        else json.dumps(tool_result),
                    }
                )
            continue  # next cycle after tools

        if content:
            parsed = _parse_final_json(content)
            if parsed:
                # Enforce side-effects BEFORE returning.
                if parsed["safe"] == "safe" and not email_sent_flag:
                    # Model skipped tool call; perform mandatory send_email now.
                    group_label = parsed.get("category", "Other")
                    explanation = parsed.get("explanation", "")
                    sent = send_email(group_label, joke, explanation)

                if parsed["safe"] == "dark":
                    _append_dark_joke(joke, parsed)

                parsed["is_email_sent"] = bool(email_sent_flag)
                if email_sent_flag and not parsed["explanation"]:
                    parsed["explanation"] = parsed.get(
                        "explanation", "Sent without explanation provided"
                    )
                logging.info(" ✅  Task complete")
                logging.info(f"joke: {joke}")
                logging.info(f"safe: {parsed['safe']}")
                logging.info(f"category: {parsed['category']}")
                if parsed["safe"] == "safe":
                    logging.info(
                        "email_sent=%s reason=%s",
                        parsed["is_email_sent"],
                        last_email_attempt_reason,
                    )
                time.sleep(1)
                return parsed
            else:
                messages.append(
                    {
                        "role": "user",
                        "content": "Return only the final JSON object now.",
                    }
                )
                continue

    logging.warning(
        "Exceeded max tool cycles without valid final JSON; returning fallback"
    )
    return {
        "safe": "dark",
        "category": "Other",
        "explanation": "Model failed to return final JSON in time",
    }


def run_browse(term: str, joke: str) -> str:
    """Invoke the browse.py tool with the search term in the context of the joke and return its stdout."""
    browser_arg = f"Define the term '{term}' in the context of this joke: '{joke}'"
    cmd = ["python", "./browser.py", browser_arg]
    logger.info("Running browse tool for term: %s", term)
    try:
        out = subprocess.check_output(
            cmd, stderr=subprocess.STDOUT, text=True, timeout=600
        )
        logger.debug("browse output: %s", out)
        return out
    except subprocess.CalledProcessError as e:
        logger.error("browse.py failed: %s", e.output)
        return ""
    except Exception:
        logger.exception("Error running browse.py")
        return ""


def send_email(group_label: str, joke: str, explanation: str) -> bool:
    """Call send_email.py tool. group_label must be one of GROUPS keys."""
    group_email = GROUPS.get(group_label, GROUPS["Other"])["email"]
    # Use current interpreter for portability (virtualenv compatibility)
    cmd = [sys.executable, "send_email.py", group_email, joke, explanation]
    logger.info("Sending email to %s for group %s", group_email, group_label)
    try:
        subprocess.check_call(cmd)
        return True
    except subprocess.CalledProcessError:
        logger.exception("send_email.py returned non-zero")
        return False
    except Exception:
        logger.exception("Error running send_email.py")
        return False


def process_joke_file(path: Path, state: Dict[str, Any]) -> None:
    logger.info("\n\n*** ***")
    logger.info("Processing joke file: %s", path)
    joke = path.read_text(encoding="utf-8").strip()
    file_id = path.name

    if file_id in state.get("processed", {}):
        logger.info("Already processed %s, skipping", file_id)
        return

    try:
        result = classify_and_act_on_joke(joke, state)
    except Exception:
        logger.exception("LLM tool-driven processing failed for %s", file_id)
        sys.exit(1)
        # result = {"safe": False, "category": "Other", "explanation": "LLM error"}

    # Mark processed
    state.setdefault("processed", {})[file_id] = {
        "agent": "003",
        "joke": joke,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
    }
    save_state(state)


def main_loop(poll_interval: int = 60):
    state = load_state()
    logger.info("Agent started, watching %s", OUTPUT_DIR)

    while True:
        txt_files = sorted(glob.glob(str(OUTPUT_DIR / "*.txt")))
        for f in txt_files:
            process_joke_file(Path(f), state)
            # return
        # Sleep and be responsive to shutdown
        for _ in range(int(poll_interval)):
            time.sleep(1)


if __name__ == "__main__":
    main_loop()

Agent-001 Part-2

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In the first part of this series, we explored the programming jokes API and built a simple automation to extract the meaning of each joke. In this part, we'll automate the cultural-appropriateness check and email notifications using an LLM.

Developers prefer structured data because it's machine-readable and easy to automate. However, LLMs are primarily designed for conversational, natural language output. With the increasing use of LLMs in programming and automation, model providers have started prioritizing structured outputs for developers. For instance, starting with GPT-4, OpenAI has trained its models to follow user instructions more strictly.

For more details on how OpenAI improved programmer workflows in GPT-5, see my earlier blog: GPT-5 for Programmers.

We'll take advantage of this by instructing the LLM to respond in a structured JSON format. Since we're asking for the meaning of multiple jokes, it's best to separate the instructions for output structure from the actual jokes. The output instructions are generic, while the jokes vary each time. Mixing both in a single prompt would generate unique text combinations, reducing the effectiveness of the KV cache. Therefore, we'll place the output instructions in a special prompt known as the system prompt and the jokes in the user prompt. Here's how we construct our system prompt:

automate_with_ai.py: SYSTEM_PROMPT
SYSTEM_PROMPT = (
    "You are an helpful assistant that explains a programmer joke and identify whether it is culturally appropriate to be shared in a professional office environment.\n"
    "Goals:\n"
    "(1) Decide whether the joke is funny or not (funny: true/false).\n"
    "(2) Categorize the joke into one of these categories: 'Safe for work', 'Offensive', 'Dark humor'.\n"
    "(3) And briefly explain the joke in 1 paragraph.\n"
    "Your response must be a single JSON object with keys: funny (bool), category (string), explanation (string).\n"
)

As shown above, we delegate the task of determining whether a joke is funny and appropriate for the workplace to the LLM itself. Crucially, we instruct the LLM to return its output strictly in JSON format.

Then in our process_joke_file we make two modifications.

  1. Include the system prompt in the message
  2. Parse the LLM output as JSON

automate_with_ai.py: process_joke_file
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"joke: `{joke}`"},
        ]
        response = chat_completion(messages)["choices"][0]["message"]["content"]
        result = _parse_final_json(response)

We have also created an external script, send_email.py (full code available at the end of this post). This script takes the recipient group, the joke and its explanation as command-line arguments and queues an email in the outbox. The send_email function in our code is responsible for invoking this script.

Since the LLM now returns structured JSON output, we can easily inspect its response and, based on its assessment, call the send_email function directly from our code.

automate_with_ai.py: process_joke_file
        result = _parse_final_json(response)

        if result['funny'] and result['category'] == 'Safe for work':
            # Send email
            if send_email(joke, result['explanation']):
                logger.info("Email sent for joke %s", file_id)
            else:
                logger.error("Failed to send email for joke %s", file_id)

Conclusion

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In this post, we took a significant step forward by automating the evaluation of jokes for cultural appropriateness and streamlining the email sending process. By leveraging the LLM’s ability to return structured JSON, we eliminated the need for tedious manual checks and made it straightforward to plug the model’s output directly into our automation pipeline. This approach not only saves time but also reduces the risk of human error.

Yet, it’s important to recognize that what we’ve built so far is still traditional automation. The LLM serves as a smart evaluator, but all the decision-making logic and possible actions are hardcoded by us. The workflow is predictable and limited to the scenarios we’ve anticipated.

But what if the LLM could do more than just provide information? Imagine a system where the LLM can actively decide which actions to take, adapt to new situations, and orchestrate workflows on its own. This is the promise of agentic workflows—where the LLM becomes an autonomous agent, capable of selecting from a toolkit of actions and dynamically shaping the automation process.

In the next part of this series, we’ll dive into building such agentic systems. We’ll explore how to empower LLMs to not just inform, but to act—unlocking a new level of flexibility and intelligence in automation.

Complete code

automate_with_ai.py

import os
import sys
import json
import time
import logging
import datetime
import glob
import signal
import re
from pathlib import Path
from typing import Dict, Any, Optional

from dotenv import load_dotenv
load_dotenv()

import requests

OUTPUT_DIR = Path("/tmp/agent-001/")
STATE_FILE = OUTPUT_DIR / "state.json"

# Azure OpenAI settings - must be provided as environment variables
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

# Ensure directories exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("agent")

shutdown_requested = False


def _signal_handler(signum, frame):
    global shutdown_requested
    logger.info("Signal %s received, will shut down gracefully", signum)
    shutdown_requested = True


signal.signal(signal.SIGINT, _signal_handler)
signal.signal(signal.SIGTERM, _signal_handler)


def load_state() -> Dict[str, Any]:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text(encoding="utf-8"))
        except Exception:
            logger.exception("Failed to load state file, starting fresh")
    # default state
    return {"processed": {}, "last_sent": {}}


def save_state(state: Dict[str, Any]) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")


SYSTEM_PROMPT = (
    "You are an helpful assistant that explains a programmer joke and identify whether it is culturally appropriate to be shared in a professional office environment.\n"
    "Goals:\n"
    "(1) Decide whether the joke is funny or not (funny: true/false).\n"
    "(2) Categorize the joke into one of these categories: 'Safe for work', 'Offensive', 'Dark humor'.\n"
    "(3) And briefly explain the joke in 1 paragraph.\n"
    "Your response must be a single JSON object with keys: funny (bool), category (string), explanation (string).\n"
)


def _extract_json(text: str) -> Optional[dict]:
    """Try to extract the first JSON object from a text blob."""
    try:
        return json.loads(text)
    except Exception:
        m = re.search(r"\{.*\}", text, re.S)
        if m:
            try:
                return json.loads(m.group(0))
            except Exception:
                return None
    return None


def chat_completion(messages, tools=None, temperature=0.0, max_tokens=800) -> Dict[str, Any]:
    """Call Azure OpenAI chat completion returning the full JSON, supporting tool (function) calls."""
    # Random jitter 3-5s to reduce rate spikes
    time.sleep(3 + (2 * os.urandom(1)[0] / 255.0))

    if not AZURE_ENDPOINT or not AZURE_KEY:
        raise RuntimeError("Azure OpenAI credentials (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY) not set")

    url = f"{AZURE_ENDPOINT}/openai/deployments/{AZURE_DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_KEY,
    }
    payload: Dict[str, Any] = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    resp = requests.post(url, headers=headers, json=payload, timeout=90)
    resp.raise_for_status()
    return resp.json()


def _parse_final_json(content: str) -> Optional[Dict[str, Any]]:
    obj = _extract_json(content)
    if not obj:
        return None
    # Minimal validation
    if {"safe", "category", "explanation"}.issubset(obj.keys()):
        return obj
    return obj  # return anyway; caller can decide


def send_email(joke:str, explanation: str) -> bool:
    group_email = "all@example.com"
    cmd = [sys.executable, "send_email.py", group_email, joke, explanation]
    logger.info("Sending email to %s with joke", group_email)
    try:
        import subprocess
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            logger.error("Failed to send email: %s", result.stderr)
            return False
        logger.info("Email sent successfully")
        return True
    except Exception as e:
        logger.exception("Exception while sending email: %s", e)
        return False


def process_joke_file(path: Path, state: Dict[str, Any]) -> None:
    logger.info("Processing joke file: %s", path)
    joke = path.read_text(encoding="utf-8").strip()
    file_id = path.name

    if file_id in state.get("processed", {}):
        logger.info("Already processed %s, skipping", file_id)
        return

    try:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"joke: `{joke}`"},
        ]
        response = chat_completion(messages)["choices"][0]["message"]["content"]
        result = _parse_final_json(response)

        if result['funny'] and result['category'] == 'Safe for work':
            # Send email
            if send_email(joke, result['explanation']):
                logger.info("Email sent for joke %s", file_id)
            else:
                logger.error("Failed to send email for joke %s", file_id)

    except Exception as e:
        logger.exception("LLM tool-driven processing failed for %s\nException: %s", file_id, e)
        sys.exit(1)

    # Mark processed
    state.setdefault("processed", {})[file_id] = {"agent": "002", "joke": joke, "processed_at": datetime.datetime.utcnow().isoformat(), "funny": result["funny"], "explanation": result["explanation"], "category": result["category"]}
    save_state(state)


def main_loop(poll_interval: int = 60):
    state = load_state()
    logger.info("Agent started, watching %s", OUTPUT_DIR)

    while not shutdown_requested:
        txt_files = sorted(glob.glob(str(OUTPUT_DIR / "*.txt")))
        for f in txt_files:
            if shutdown_requested:
                break
            process_joke_file(Path(f), state)
        # Sleep and be responsive to shutdown
        for _ in range(int(poll_interval)):
            if shutdown_requested:
                break
            time.sleep(1)

    logger.info("Agent shutting down")


if __name__ == "__main__":
    main_loop()

send_email.py

send_email.py
#!/usr/bin/env python3
import sys
import json
import logging
from pathlib import Path
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("send_email")

OUTBOX = Path("/tmp/agent-001/outbox.json")
OUTBOX.parent.mkdir(parents=True, exist_ok=True)


def main():
    if len(sys.argv) < 4:
        print("Usage: send_email.py <to_group> <quote> <explanation>")
        sys.exit(2)
    to_group = sys.argv[1]
    quote = sys.argv[2]
    explanation = sys.argv[3]

    # Append to outbox file as a record
    record = {"to": to_group, "quote": quote, "explanation": explanation, "ts": datetime.now(timezone.utc).isoformat()}
    if OUTBOX.exists():
        arr = json.loads(OUTBOX.read_text(encoding="utf-8"))
    else:
        arr = []
    arr.append(record)
    OUTBOX.write_text(json.dumps(arr, indent=2), encoding="utf-8")
    logger.info("Queued email to %s", to_group)


if __name__ == "__main__":
    main()
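
send_email.py does not actually talk to a mail server; it simply appends a record to an outbox file, which keeps the example self-contained. Invoking it by hand (with hypothetical, illustrative values) queues a record like this:

Queue a test "email" (illustrative)
 $ python3 send_email.py all@example.com "Why do programmers prefer dark mode? Because light attracts bugs." "Light attracts insects, and dark mode keeps the (software) bugs away."
 $ cat /tmp/agent-001/outbox.json
[
  {
    "to": "all@example.com",
    "quote": "Why do programmers prefer dark mode? Because light attracts bugs.",
    "explanation": "Light attracts insects, and dark mode keeps the (software) bugs away.",
    "ts": "2025-01-01T08:00:00+00:00"
  }
]
 $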

Agent-001 Part-1

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

This three-part series will guide you through building an AI agent from the ground up. Our goal is to create a simple, autonomous, multi-hop agent.

  • Simple: The agent will be implemented in under 100 lines of code.
  • Autonomous: We'll specify only the task, leaving the agent free to determine its own steps to achieve the goal.
  • Multi-Hop: The agent will solve problems by executing multiple commands one after another, after checking the result from each step - similar to a human.

By the end of this series, you'll understand the core components of an AI agent, how to implement and customize one, and how to approach complex, non-linear problems using agent-based solutions.

The name Agent-001 is chosen as a playful nod to James Bond's 007. Our agent is starting out as a humble "001," but with enough development and the right tools, it could one day evolve into a full-fledged "007"—complete with a license to kill -9.

Problem statement

AI shines when working with vast amounts of unstructured, natural language data—such as lengthy log files, technical articles, or legal documents. To illustrate this, let's consider a scenario where natural language content is involved. We have a public API that serves random programming jokes. Each time you call the API, it returns a new joke, providing a dynamic and unpredictable data source for our agent to process.

Programming Jokes API
$ curl --silent https://official-joke-api.appspot.com/jokes/programming/random | jq .
[
  {
    "type": "programming",
    "setup": "What's the best thing about a Boolean?",
    "punchline": "Even if you're wrong, you're only off by a bit.",
    "id": 15
  }
]
$

Let's define a practical use case for our agent. Suppose I want to help my team start each day with a bit of humor by sending out a programming joke via email. To do this, I'll use the programming jokes API and automate the process of sharing a joke with the team. Here are some important considerations for this workflow:

  • Not everyone may understand the joke, so the email should include a brief explanation.
  • The joke must be culturally appropriate and not offensive to anyone in the team.
  • If a joke is specific to a certain area of technology, it should be sent only to the relevant division. For example, a scheduling-related joke should go to the OS and Systems team, not the OOPS team.

A manual version of this workflow would look like this:

  1. After logging in each morning, I fetch a joke from the API.
  2. I use ChatGPT to generate an explanation for the joke.
  3. If the joke is appropriate and funny, I include both the joke and its explanation in an email and send it to the relevant division.

Ask AI

After repeating this manual workflow for several days, I realized it was becoming tedious and inefficient. Even though ChatGPT generates responses quickly, the wait time—however brief—adds unnecessary friction to my morning routine. As any programmer would, I decided to automate the process. Instead of manually fetching a joke and then asking ChatGPT for an explanation, I thought: why not create a script that does both tasks automatically and saves the results? This way, both the joke and its explanation would be ready for me as soon as I log in.

To start, I wrote a script that fetches a joke from the API and stores the JSON output in a file. The full file is given at the end of this post; here we'll look at snippets of the main logic.

producer.py `fetch_text`: API call to programming jokes
def fetch_text(url: str, timeout: int = 10) -> str:
    """Fetch text content from a URL and return it as a string.

    Uses the standard library so there are no extra dependencies.
   """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read()
            try:
                return raw.decode("utf-8")
            except Exception:
                return raw.decode(errors="replace")
    except urllib.error.URLError as e:
        raise RuntimeError(f"Failed to fetch {url}: {e}")
producer.py Call `fetch_text` and store output
            json_text = json.loads(fetch_text(url))[0]
            target_file = out_dir / f"{json_text['id']}.txt"
            with target_file.open("w", encoding="utf-8") as fh:
                fh.write(json_text["setup"] + "\n" + json_text["punchline"] + "\n")
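
For the Boolean joke returned by the curl call shown earlier (id 15), the producer writes a file like this (illustrative):

Produced joke file
 $ cat /tmp/agent-001/15.txt
What's the best thing about a Boolean?
Even if you're wrong, you're only off by a bit.
 $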

The script is named producer.py because it generates the content our agent will process. In a real-world context, this is similar to a CI/CD pipeline producing logs or artifacts. Keeping this script independent allows all three agents in our series to reuse it.

For our initial implementation, we'll make a simple LLM call: the script will pass the fetched joke to the LLM and request a brief explanation. Both the joke and its explanation will be saved to an output file. This way, each morning when I log in, I can quickly access the day's joke and its explanation, ready to include in the team email.

To interact with an LLM, we need an LLM service provider and secret keys to access it. In our case, Azure Foundry is used for deploying the LLMs. The secrets to access the Foundry models should be stored in a .env file.

.env
$ cat .env
AZURE_OPENAI_API_KEY='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
AZURE_OPENAI_ENDPOINT='https://ai-XXXXXXXXXXXXXXXXXXXXXXXXXX.openai.azure.com/'
AZURE_OPENAI_API_VERSION='2024-12-01-preview'
$

These environment variables are consumed by the program via the load_dotenv() library function; the module-level settings are then read from them with os.environ.get().
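
The relevant lines from ask_ai.py look like this (see the complete listing at the end of the post):

ask_ai.py: load credentials from the environment
import os

from dotenv import load_dotenv
load_dotenv()

# Azure OpenAI settings - must be provided as environment variables
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")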

The save_state and load_state functions store and load the record of processed jokes. Each time the program runs, it loads the already-processed jokes into a dictionary and skips them instead of processing them again.

ask_ai.py
### State Management ###

def save_state(state: Dict[str, Any]) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_state() -> Dict[str, Any]:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text(encoding="utf-8"))
        except Exception:
            logger.exception("Failed to load state file, starting fresh")
    # default state
    return {"processed": {}, "last_sent": {}}

### State Management ###

You can mostly ignore the signal handlers; they are only there to handle Ctrl+C gracefully and are reproduced below for reference. The main_loop lists all the files from the producer's output directory and calls process_joke_file for each file.
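
The handler simply flips a module-level flag that both loops check, so the agent finishes its current step and then exits cleanly.

ask_ai.py: graceful shutdown handlers
def _signal_handler(signum, frame):
    global shutdown_requested
    logger.info("Signal %s received, will shut down gracefully", signum)
    shutdown_requested = True


signal.signal(signal.SIGINT, _signal_handler)
signal.signal(signal.SIGTERM, _signal_handler)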

ask_ai.py: main_loop
        txt_files = sorted(glob.glob(str(OUTPUT_DIR / "*.txt")))
        for f in txt_files:
            if shutdown_requested:
                break
            process_joke_file(Path(f), state)

The process_joke_file constructs the message and calls chat_completion.

ask_ai.py: process_joke_file
        # Replace newlines outside the f-string so this also runs on Python < 3.12
        joke_one_line = joke.replace("\n", " ")
        message = {
            "role": "user",
            "content": f"Explain me the joke: '{joke_one_line}' in a short paragraph.",
        }
        explanation = chat_completion([message])["choices"][0]["message"]["content"]
ask_ai.py: chat_completion
    url = f"{AZURE_ENDPOINT}/openai/deployments/{AZURE_DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_KEY,
    }
    payload: Dict[str, Any] = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    resp = requests.post(url, headers=headers, json=payload, timeout=90)
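
If you have not used the chat completions API before, here is roughly what the request body built above and the part of the response we read look like (illustrative, trimmed to the fields this script uses):

Request body (illustrative)
{
  "messages": [{"role": "user", "content": "Explain me the joke: '...' in a short paragraph."}],
  "temperature": 0.0,
  "max_tokens": 800
}

Response, trimmed (illustrative)
{
  "choices": [
    {"message": {"role": "assistant", "content": "The joke plays on ..."}}
  ]
}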

As shown above, the chat_completion function builds an HTTP request to the Azure OpenAI service. The response is saved in the state file, making it easy to review each morning.

ask_ai.py: process_joke_file
    # Mark processed
    state.setdefault("processed", {})[file_id] = {"agent": "001", "joke": joke, "processed_at": datetime.datetime.utcnow().isoformat(), "explanation": explanation}
    save_state(state)
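
After a few jokes have been processed, /tmp/agent-001/state.json looks roughly like this (illustrative values):

state.json (illustrative)
{
  "processed": {
    "15.txt": {
      "agent": "001",
      "joke": "What's the best thing about a Boolean?\nEven if you're wrong, you're only off by a bit.",
      "processed_at": "2025-01-01T08:00:05",
      "explanation": "A Boolean can only be true or false, so even a wrong answer is only one bit away from the right one."
    }
  },
  "last_sent": {}
}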

Each day, I review the processed jokes and forward them to the relevant teams, ensuring they are culturally appropriate before sharing. In the upcoming blogs, we'll automate these tasks as well.

Conclusion

Series

  1. Agent-001 Part-1
  2. Agent-001 Part-2
  3. Agent-001 Part-3

In this first part, we've defined the problem, explored a practical use case, and implemented the foundational scripts to fetch and process programming jokes using an LLM. By automating the initial steps, we've set the stage for building a more capable and autonomous agent. In the next part, we'll focus on enhancing our agent's decision-making abilities—filtering jokes for appropriateness, routing them to the right teams, and handling more complex workflows. Stay tuned as we continue to evolve Agent-001 into a smarter, more autonomous assistant!

Complete code can be found in the code/agent-001 directory of this blog's GitRepo

Complete Code

producer.py

producer.py
import json
import time
import datetime
import logging
import signal
import threading
import urllib.request
import urllib.error
from pathlib import Path

DEFAULT_URL = "https://official-joke-api.appspot.com/jokes/programming/random"
DEFAULT_OUTPUT_DIR = "/tmp/agent-001/"
DEFAULT_INTERVAL_SECONDS = 1 * 60  # 1 minute


def ensure_output_dir(path: str) -> Path:
    """Ensure the output directory exists and return a Path object."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p


def fetch_text(url: str, timeout: int = 10) -> str:
    """Fetch text content from a URL and return it as a string.

    Uses the standard library so there are no extra dependencies.
   """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read()
            try:
                return raw.decode("utf-8")
            except Exception:
                return raw.decode(errors="replace")
    except urllib.error.URLError as e:
        raise RuntimeError(f"Failed to fetch {url}: {e}")


def run_loop(url: str = DEFAULT_URL, output_dir: str = DEFAULT_OUTPUT_DIR, interval_seconds: int = DEFAULT_INTERVAL_SECONDS, stop_event: threading.Event = None) -> None:
    """Run the fetch->write loop until stop_event is set.

    The loop is responsive to shutdown requests via the provided stop_event.
   """
    if stop_event is None:
        stop_event = threading.Event()

    out_dir = ensure_output_dir(output_dir)
    logging.info("Starting producer loop: url=%s interval=%s output=%s", url, interval_seconds, out_dir)

    while not stop_event.is_set():
        start = time.time()
        try:
            json_text = json.loads(fetch_text(url))[0]
            target_file = out_dir / f"{json_text['id']}.txt"
            with target_file.open("w", encoding="utf-8") as fh:
                fh.write(json_text["setup"] + "\n" + json_text["punchline"] + "\n")
            logging.info("Wrote file %s", target_file)
        except Exception as exc:
            logging.exception("Error during fetch/write: %s", exc)

        # Wait for next cycle but be responsive to stop_event
        elapsed = time.time() - start
        wait_for = max(0, interval_seconds - elapsed)
        logging.debug("Sleeping for %.1f seconds", wait_for)
        # wait returns True if the event is set while waiting
        stop_event.wait(wait_for)

    logging.info("Producer loop exiting cleanly.")


def _setup_signal_handlers(stop_event: threading.Event) -> None:
    """Attach SIGINT and SIGTERM handlers to set the stop_event for graceful shutdown."""

    def _handler(signum, frame):
        logging.info("Received signal %s, shutting down...", signum)
        stop_event.set()

    signal.signal(signal.SIGINT, _handler)
    signal.signal(signal.SIGTERM, _handler)


def main() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )

    stop_event = threading.Event()
    _setup_signal_handlers(stop_event)

    try:
        run_loop(stop_event=stop_event)
    except Exception:
        logging.exception("Unexpected error in main loop")
    finally:
        logging.info("Shutdown complete")


if __name__ == "__main__":
    main()

ask_ai.py

ask_ai.py
import os
import sys
import json
import time
import logging
import datetime
import glob
import signal
from pathlib import Path
from typing import Dict, Any

from dotenv import load_dotenv
load_dotenv()

import requests

OUTPUT_DIR = Path("/tmp/agent-001/")
STATE_FILE = OUTPUT_DIR / "state.json"

# Azure OpenAI settings - must be provided as environment variables
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

# Ensure directories exist
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("agent")

shutdown_requested = False

### State Management ###

def save_state(state: Dict[str, Any]) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_state() -> Dict[str, Any]:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text(encoding="utf-8"))
        except Exception:
            logger.exception("Failed to load state file, starting fresh")
    # default state
    return {"processed": {}, "last_sent": {}}

### State Management ###


def _signal_handler(signum, frame):
    global shutdown_requested
    logger.info("Signal %s received, will shut down gracefully", signum)
    shutdown_requested = True


signal.signal(signal.SIGINT, _signal_handler)
signal.signal(signal.SIGTERM, _signal_handler)


def chat_completion(messages, tools=None, temperature=0.0, max_tokens=800) -> Dict[str, Any]:
    """Call Azure OpenAI chat completion returning the full JSON, supporting tool (function) calls."""
    # Random jitter 3-5s to reduce rate spikes
    time.sleep(3 + (2 * os.urandom(1)[0] / 255.0))

    if not AZURE_ENDPOINT or not AZURE_KEY:
        raise RuntimeError("Azure OpenAI credentials (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY) not set")

    url = f"{AZURE_ENDPOINT}/openai/deployments/{AZURE_DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    headers = {
        "Content-Type": "application/json",
        "api-key": AZURE_KEY,
    }
    payload: Dict[str, Any] = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    resp = requests.post(url, headers=headers, json=payload, timeout=90)
    resp.raise_for_status()
    return resp.json()


def process_joke_file(path: Path, state: Dict[str, Any]) -> None:
    logger.info("Processing joke file: %s", path)
    joke = path.read_text(encoding="utf-8").strip()
    file_id = path.name

    if file_id in state.get("processed", {}):
        logger.info("Already processed %s, skipping", file_id)
        return

    try:
        # Replace newlines outside the f-string so this also runs on Python < 3.12
        joke_one_line = joke.replace("\n", " ")
        message = {
            "role": "user",
            "content": f"Explain me the joke: '{joke_one_line}' in a short paragraph.",
        }
        explanation = chat_completion([message])["choices"][0]["message"]["content"]
    except Exception:
        logger.exception("LLM tool-driven processing failed for %s", file_id)
        sys.exit(1)
        # result = {"safe": False, "category": "Other", "explanation": "LLM error"}

    # Mark processed
    state.setdefault("processed", {})[file_id] = {"agent": "001", "joke": joke, "processed_at": datetime.datetime.utcnow().isoformat(), "explanation": explanation}
    save_state(state)


def main_loop(poll_interval: int = 60):
    state = load_state()
    logger.info("Agent started, watching %s", OUTPUT_DIR)

    while not shutdown_requested:
        txt_files = sorted(glob.glob(str(OUTPUT_DIR / "*.txt")))
        for f in txt_files:
            if shutdown_requested:
                break
            process_joke_file(Path(f), state)
        # Sleep and be responsive to shutdown
        for _ in range(int(poll_interval)):
            if shutdown_requested:
                break
            time.sleep(1)

    logger.info("Agent shutting down")


if __name__ == "__main__":
    main_loop()

Parse Function Names from LLVM Bitcode

Large Language Models (LLMs) are increasingly used for coding tasks, but handling extensive codebases effectively requires special techniques. Simply feeding the entire codebase to an LLM can lead to hallucinations due to excessive context and significantly increase costs without yielding meaningful results. We'll discuss this extensively in coming posts. This post focuses on one simple technique.

One effective technique is to avoid providing the entire code file, which often spans thousands of lines. Instead, extract and share only the skeleton of the file—retaining global variables, classes, and their member functions. This approach minimizes context while preserving essential information. You can explore this concept further in the research paper.

But how can we achieve this? How do we generate a skeleton of a source file—essentially a list of exported symbols along with their hierarchy? The paper mentioned above suggests using ctags, a simple reference lookup tool. However, ctags lacks the ability to provide hierarchy information. A better alternative lies in leveraging compilers. In this blog post, we’ll explore how to use LLVM to list all the functions in a C program.

Note: As this is the very first post on LLVM, basic details about the LLVM command-line parser and how to compile an LLVM program using a Makefile will be explained here. This post will act as an onboarding document supporting subsequent LLVM-related posts.

Setup

To follow along, you need to have llvm and clang installed on your system.

Install llvm and clang
 $ sudo apt install llvm clang -y

Test Program

Let's take a simple program that does arithmetic operations.

test.c
#include <stdio.h>

int add(int a, int b) {
    return a + b;
}

int sub (int a, int b) {
    return a - b;
}

int main() {
    int a = 10;
    int b = 20;

    printf("Addition: %d\n", add(a, b));
    printf("Subtraction: %d\n", sub(a, b));

    return 0;
}

Here we have three functions: main, add and sub. Our parser will list these functions along with some additional information.

The LLVM Parser

Complete parser code can be found at the bottom of the post. Here, let me explain the important parts in detail. We initially have the header file inclusions for all the different LLVM modules. Then we use the llvm namespace for easier typing.

Then we declare a static variable FileName to capture the argument passed on the command line.

parse_function_names.cpp
static cl::opt<std::string> FileName(cl::Positional, cl::desc("Bitcode file"), cl::Required);

int main(int argc, char **argv)
{
    cl::ParseCommandLineOptions(argc, argv, "LLVM hello world\n");

Though it is not directly related to parsing or compiling code, command-line handling is a feature provided by LLVM under the namespace llvm::cl. Using it, we declare a static variable called FileName and mark it as a required positional argument. Then, when we call the cl::ParseCommandLineOptions function inside main, the static variable is automatically filled with the user-provided argument. Pretty handy, right?

Then, load the input file into memory.

parse_function_names.cpp
    LLVMContext context;
    std::string error;
    ErrorOr<std::unique_ptr<MemoryBuffer>> BufferOrErr = MemoryBuffer::getFile(FileName);

    if (!BufferOrErr)
    {
        std::cerr << "Error reading file: " << BufferOrErr.getError().message() << "\n";
        return 1;
    }

    std::unique_ptr<MemoryBuffer> &mb = BufferOrErr.get();

Lexing and parsing are resource-intensive tasks. Reading code from disk one token at a time can severely impact performance. To address this, we use a memory buffer to load the entire file into memory. Most interfaces, such as MemoryBuffer::getFile, include built-in error handling. Always check the return value for errors before proceeding. This defensive coding style is essential when working with LLVM APIs.

Please remember that test.c will not be passed as an argument. We'll compile it first, and the compiled bitcode file will be passed as an argument to this parser. So the memory buffer contains the LLVM bitcode, not the C code.

Now parse the functions from the loaded bitcode file.

parse_function_names.cpp
    auto ModuleOrErr = parseBitcodeFile(mb->getMemBufferRef(), context);
    if (!ModuleOrErr)
    {
        std::cerr << "Error parsing bitcode: " << toString(ModuleOrErr.takeError()) << "\n";
        return 1;
    }

    std::unique_ptr<Module> m = std::move(ModuleOrErr.get());

    raw_os_ostream O(std::cout);
    for (Module::const_iterator i = m->getFunctionList().begin(),
                                e = m->getFunctionList().end();
         i != e; ++i)
    {
        if (!i->isDeclaration())
        {
            O << i->getName() << " has " << i->size() << " basic block(s).\n";
        }
    }

In LLVM terms, the parsed bitcode is represented as a Module, which corresponds to a whole translation unit and holds the list of functions it defines or declares. So we parse the bitcode into a Module, iterate over its function list, and print only the function definitions (skipping mere declarations), along with the number of basic blocks in each. Very simple and straightforward.

Parser In Action

Makefile

We need to compile the original source first, then compile the LLVM parser, and finally pass the bitcode of the original source to the parser. Compiling the LLVM parser by hand is not recommended because of the lengthy flags involved, so let's write a Makefile. The complete file can be seen at the bottom of the post; here I'll explain the important snippets.

Makefile
LLVM_CONFIG?=llvm-config

ifndef VERBOSE
QUIET:=@
endif

llvm-config is a very helpful tool. It provides a standardized and portable way to obtain the compiler and linker flags needed to build against LLVM, making developers' lives easier. We also add a VERBOSE switch so the build can be run quietly or verbosely.

With the help of llvm-config we construct our compiler flags.

Makefile
SRC_DIR?=$(PWD)
LDFLAGS+=$(shell $(LLVM_CONFIG) --ldflags)
COMMON_FLAGS=-Wall -Wextra
CXXFLAGS+=$(COMMON_FLAGS) $(shell $(LLVM_CONFIG) --cxxflags)
CPPFLAGS+=$(shell $(LLVM_CONFIG) --cppflags) -I$(SRC_DIR)

Below is a typical multi-stage Makefile with targets - source (.cpp) --> object (.o) --> binary - along with a clean target.

Makefile
PARSE_FUNCTION_NAMES=parse_function_names
PARSE_FUNCTION_NAMES_OBJECTS=parse_function_names.o

%.o : $(SRC_DIR)/%.cpp
    @echo Compiling $*.cpp
    $(QUIET)$(CXX) -c $(CPPFLAGS) $(CXXFLAGS) $<

$(PARSE_FUNCTION_NAMES) : $(PARSE_FUNCTION_NAMES_OBJECTS)
    @echo Linking $@
    $(QUIET)$(CXX) -o $@ $(CXXFLAGS) $(LDFLAGS) $^ `$(LLVM_CONFIG) --libs bitreader core support`

clean:
    @echo Cleaning up...
    $(QUIET)rm -f $(PARSE_FUNCTION_NAMES) $(PARSE_FUNCTION_NAMES_OBJECTS)

.PHONY: all clean

If you look closely, we didn't override CXX, so it falls back to the default g++. That is fine: we don't need to compile the parser with clang itself, we only need to compile it and link it against the LLVM libraries, so g++ works just as well.

Execution

First, compile the input source into LLVM bitcode. This is the only reason we installed clang; so far we haven't used any clang libraries in the parser. clang is the LLVM front-end for C, C++ and Objective-C programs, so we'll use it to convert our test.c into LLVM bitcode.

Note: LLVM employs a front-end --> IR --> back-end compilation model. This means that LLVM uses an Intermediate Representation (IR) as an object representation. During compilation, source code is first converted to IR by a front-end. The IR is then transformed into machine code by a back-end. This design allows developers to implement a front-end without needing to account for the target machine, enabling the same front-end to work with multiple architectures by pairing it with the appropriate back-end.

Compile the test code
 $ clang -c -emit-llvm test.c -o test.bc
 $ file test.bc
test.bc: LLVM IR bitcode
 $

Now compile our parser and pass the test.bc to it.

Build and run the parser
 $ make
 ...
 ...
 Linking parse_function_names
 $ echo $?
 0
 $ ./parse_function_names test.bc
add has 1 basic block(s).
sub has 1 basic block(s).
main has 1 basic block(s).
$

Congratulations! We have successfully extracted and listed the functions defined in a source file by parsing its LLVM bitcode. The complete code of parse_function_names.cpp and the Makefile can be found below. I've also explained an alternate way to do the same task at the end of this article.

Complete Code

parse_function_names.cpp
#include "llvm/Bitcode/BitcodeReader.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/raw_os_ostream.h"
#include <iostream>

using namespace llvm;

static cl::opt<std::string> FileName(cl::Positional, cl::desc("Bitcode file"), cl::Required);

int main(int argc, char **argv)
{
    cl::ParseCommandLineOptions(argc, argv, "LLVM hello world\n");
    LLVMContext context;
    std::string error;
    ErrorOr<std::unique_ptr<MemoryBuffer>> BufferOrErr = MemoryBuffer::getFile(FileName);

    if (!BufferOrErr)
    {
        std::cerr << "Error reading file: " << BufferOrErr.getError().message() << "\n";
        return 1;
    }

    std::unique_ptr<MemoryBuffer> &mb = BufferOrErr.get();

    auto ModuleOrErr = parseBitcodeFile(mb->getMemBufferRef(), context);
    if (!ModuleOrErr)
    {
        std::cerr << "Error parsing bitcode: " << toString(ModuleOrErr.takeError()) << "\n";
        return 1;
    }

    std::unique_ptr<Module> m = std::move(ModuleOrErr.get());

    raw_os_ostream O(std::cout);
    for (Module::const_iterator i = m->getFunctionList().begin(),
                                e = m->getFunctionList().end();
         i != e; ++i)
    {
        if (!i->isDeclaration())
        {
            O << i->getName() << " has " << i->size() << " basic block(s).\n";
        }
    }
    return 0;
}
Makefile
LLVM_CONFIG?=llvm-config

ifndef VERBOSE
QUIET:=@
endif

SRC_DIR?=$(PWD)
LDFLAGS+=$(shell $(LLVM_CONFIG) --ldflags)
COMMON_FLAGS=-Wall -Wextra
CXXFLAGS+=$(COMMON_FLAGS) $(shell $(LLVM_CONFIG) --cxxflags)
CPPFLAGS+=$(shell $(LLVM_CONFIG) --cppflags) -I$(SRC_DIR)

PARSE_FUNCTION_NAMES=parse_function_names
PARSE_FUNCTION_NAMES_OBJECTS=parse_function_names.o

%.o : $(SRC_DIR)/%.cpp
    @echo Compiling $*.cpp
    $(QUIET)$(CXX) -c $(CPPFLAGS) $(CXXFLAGS) $<

$(PARSE_FUNCTION_NAMES) : $(PARSE_FUNCTION_NAMES_OBJECTS)
    @echo Linking $@
    $(QUIET)$(CXX) -o $@ $(CXXFLAGS) $(LDFLAGS) $^ `$(LLVM_CONFIG) --libs bitreader core support`

clean:
    @echo Cleaning up...
    $(QUIET)rm -f $(PARSE_FUNCTION_NAMES) $(PARSE_FUNCTION_NAMES_OBJECTS)

.PHONY: all clean

Alternative way

There is a tool called llvm-dis which converts LLVM bitcode into human-readable IR. Once the bitcode is in a human-readable format, we can list the function definitions simply by looking for the define keyword.

 $ llvm-dis test.bc -o test.ll
 $ grep "define" test.ll
define dso_local i32 @add(i32 noundef %0, i32 noundef %1) #0 {
define dso_local i32 @sub(i32 noundef %0, i32 noundef %1) #0 {
define dso_local i32 @main() #0 {

While there are plenty of tools out there to automate parsing, rolling your own parser has real advantages. You get to tailor the logic to your exact needs—something off-the-shelf solutions often can't do. This flexibility is especially useful when you're building systems like IDEs designed for LLM workflows, where generic tools just don't cut it. Plus, understanding the nuts and bolts of parsing gives you more control and insight into how your tooling works.