This post is not intended as a comprehensive overview of ‘agentic AI’, but as a description of an agentic AI content creation project I was recently brought in to lead. Hopefully it captures abstractions and ways of thinking useful to those who don’t spend much time using these tools but find themselves having to work with them.

Project Motivation: Automating the content generation for an educational product.

Problem: One tricky part I found when dealing with engineers, content creators, managers, leadership, etc. who don’t spend much of their time using LLMs: there was a lot of perceptible ambiguity and time-wasting due to:

  • (1) a lack of intuition on what LLMs can reasonably do,
  • (2) little sense of how to cobble together LLMs to accomplish (essentially) a giant automation project,
  • (3) losing track of the ‘recursive abstraction’ of definitions, which confuses this further (I’ll explain more below), and
  • (4) conflating these discussions with discussions about encouraging the use of AI in engineers’/content creators’ personal workflows.

Tackle this all at once in planning/discovery meetings and management/ICs can quickly get overwhelmed or only superficially engaged. But it’s really a simple abstraction issue. Several abstractions are involved:

Engineering process:

  • Nested loops: Build inner loop to outer loop
  • Verification: Verify before moving on
  • Scale: Scaling up in complexity and scope

I feel like this is basic intuition, even if you never went through any formal schooling, but I often see that it has to be re-emphasized on large, complex projects, even those that are not AI-centric. Whether it’s because (giving people the benefit of the doubt) their human ‘attention tokens’ are overloaded with other responsibilities, or, more cynically, because an incentive issue leads to investment in performative success over technical success, the path forward for you is to motivate your team by continuing to provide consistent clarity.

Just as a quick example of an Engineering Process mistake (Verification):

An engineer downloads an auto-eval tool and runs an automated loop that improves the prompt’s score from X% to Y% (according to the very eval tool they downloaded), and the engineer/manager/leadership reports it. The engineer/manager fails to:

  • (1) Examine the outputs at all with their own eyeballs (or a domain expert’s eyeballs) to see whether the output is actually better or simply overfit to the examples, whether hallucinations have been injected, how it has changed, etc.
  • (2) Run this loop a number of times to see its reliability (see the sketch after this list).
  • If you’re being an extra good engineer, you’d also think about:
    • (3) How to scale this across your (possibly ill-defined) increasingly difficult examples, or just across different categories (e.g. does great with fiction, not with non-fiction; great with language, not with math; etc.).
    • (4) How to re-run this in the future if the AI agent changes by, at the very least, (a) maintaining a set of MD files, and (b) making sure code changes still work on earlier, no-longer-run stages of the project.
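
As a hedged sketch of what that missing verification might look like in practice (the names `generate`, `run_eval`, and the output file paths are hypothetical placeholders, not any real tool’s API):

```python
import statistics

def generate(prompt: str, example: str) -> str:
    """Placeholder: call whatever LLM/agent your project actually uses."""
    raise NotImplementedError

def run_eval(output: str) -> float:
    """Placeholder: the downloaded eval tool, returning a 0-1 score."""
    raise NotImplementedError

def verify(prompt: str, examples: list[str], runs: int = 10) -> None:
    """Re-run the 'improved' prompt several times and keep the raw outputs."""
    scores = []
    for i in range(runs):  # (2) repeat to measure reliability, not one lucky run
        batch = [generate(prompt, ex) for ex in examples]
        scores.append(statistics.mean(run_eval(o) for o in batch))
        # (1) dump the outputs so a human / domain expert actually reads them
        with open(f"outputs_run_{i}.txt", "w") as f:
            f.write("\n---\n".join(batch))
    print(f"mean={statistics.mean(scores):.3f} stdev={statistics.stdev(scores):.3f}")
```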

Once these are accomplished with a reasonable degree of confidence, you can then move up in complexity or scope (or both).

Verification versus Verification Process versus Automated Verification Process

Whether you simply verify once or set up a verification process is up to the resources/timelines/future use/etc. of your project and organization (or, often, the intuition of the ICs absent direct management decisions). Being aware of this distinction during project planning and execution is a step in the right direction.

“Recursive Abstraction” of definitions problem:

  1. [Agent 1] Create a prompt to generate some content
    • “When user gives you a food group, create three recipes with it”
  2. [Eval Tools/Agent 2] Create a second prompt for a second agent to evaluate the output from the first agent (a list of criteria, etc.)
    • “Evaluate AI generated recipes with this list of criteria. A.) Recipe should be between 5-15 ingredients, no more, no less. B.) Recipe should have list of all ingredients on top, with main ingredient listed first. Steps later…”
    • Or something algorithmic, or both AI and algorithms, etc
  3. [Autoeval Tools/Agent 3] Create/use a tool to create the second prompt automatically, based on (a) sourced material somewhere, or (b) a third agent which evaluates the output of the first agent against some examples and edits the second prompt
    • “Compare paired examples of food items + real recipes with proper formatting, to generated paired examples. Identify patterns where they differ. Amend prompt {} with these additional patterns. If patterns in prompt {} are…”
    • MCP tools for running evaluations on Agent #2’s output
  4. Evaluate a series of ten runs of the third agent to see if that (more general) prompt or its evaluation tools have to be changed
    • “We’re currently running an evaluation process to evaluate the auto-evals”
    • “We’re doing a meta-evaluation of the eval tools”

You can see the definitional confusion already. Within this example we have three prompts, each probably fetching different docs, not to mention the series of evaluation tools called by Agent #2 and/or Agent #3.

Whatever your project is, make sure you define these in an internally referenceable way so those in your organization don’t lose motivation due to ambiguity and confusion. The work is quite doable and straightforward if you chunk it and define it correctly.
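
One cheap way to make things internally referenceable: a single registry that names every prompt/agent/tool in the pipeline, so a phrase like “Agent 2’s prompt” always points at exactly one thing. A minimal sketch, with illustrative names and fields rather than a prescribed schema:

```python
# registry.py -- the one place where the pipeline's agents/prompts are defined.
PIPELINE = {
    "agent_1_generator": {
        "role": "generate three recipes for a given food group",
        "prompt_file": "prompts/generator.md",
    },
    "agent_2_evaluator": {
        "role": "score Agent 1's output against the recipe criteria",
        "prompt_file": "prompts/evaluator.md",
        "tools": ["ingredient_count_check", "formatting_check"],  # MCP/custom code
    },
    "agent_3_meta_evaluator": {
        "role": "compare generated vs. real recipe pairs and amend Agent 2's prompt",
        "prompt_file": "prompts/meta_evaluator.md",
    },
}
```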

Inner Loop to Outer Loop

I won’t spend much time on this except to emphasize clear, reproducible progression: verify each piece with repeated testing, and human eyeballs on test cases, before moving on.

[The term ‘loop’ is used in game design to capture the idea that an entire video game is essentially one big loop, such that you want to make sure your most basic inner loop (Mario jumping), along with its behaviors (Mario jumping correctly, the player seeing and hearing confirmation of the jump action, the player psychologically rewarded with noises and motion), works before moving on to your next nested loop [interactions: Mario jumping and hitting something with his head; Mario jumping on enemies; jumping into invisible objects; etc.] and eventually your higher-level loops {level completions}, {stage completions}, etc. The point is you build a solid core loop, verify it’s solid, and then continue to layer loops on top.]

Using our example above:

  1. (Inner loop) Content creation
  2. (Second-level loop) Content creation evaluator
  3. (Third-level loop) Content creation meta-evaluator

We then might want to add a way to break this recipe down into fixed steps (preparation, cooking, plating), or to do so dynamically using another agent or agents. Potentially, we want to add image generation for each of these steps. Potentially, a formatting tool (it doesn’t have to be an LLM) to unburden the LLM’s attention from proper formatting (but then running that in the sequence prior to the evals).
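
As a hedged sketch of this layering (every body here is a placeholder; the point is the order in which the layers get built and verified, and where the extensions would slot in):

```python
def create_content(food_group: str) -> str:
    """Inner loop (Agent 1): generate the recipes. Built and verified first."""
    raise NotImplementedError

def evaluate_content(recipes: str) -> float:
    """Second-level loop (Agent 2): score the recipes against the criteria.
    Layered on only after the inner loop is verified."""
    raise NotImplementedError

def meta_evaluate(scores: list[float]) -> str:
    """Third-level loop (Agent 3): decide whether Agent 2's prompt or its
    evaluation tools need amending. Layered on last."""
    raise NotImplementedError

def pipeline(food_group: str) -> None:
    recipes = create_content(food_group)
    # The extensions above would slot in here: a step breakdown (preparation,
    # cooking, plating), image generation per step, and a (non-LLM) formatting
    # pass -- run *before* the evals, as noted.
    score = evaluate_content(recipes)
    meta_evaluate([score])
```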

Without spending too much time on this: working your way methodically up from the content generation (verifying it with repeated testing and actually checking it with your eyes) is necessary before building more loops on top of it, since from then on you will be blindly relying on it.

Tools of Agentic AI

People will list out ‘agentic frameworks’ and ‘agent frameworks’ (tools for building the schema of agents/decisions/actions, versus tools for building the agents themselves; plenty of overlap) before understanding the simple nuts and bolts of what we’re actually doing:

It really boils down to a limited set of tools:

  1. “context engineering” (prompt editing + info retrieval)
    • clearer, more succinct prompts + more relevant information
  2. evals (auto-evaluate the output)
    • change criteria via prompts, context, MCP tools, custom code, better examples, patterns
  3. “multi-prompt” (chunk the problem into more agents)
    • to avoid overwhelming each agent’s small working memory, hopefully fixing performance issues (sketched below)
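
As a hedged sketch of tool (3), chunking one overloaded prompt into two narrower calls (`call_llm` is a placeholder for whatever client your stack actually uses):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client call."""
    raise NotImplementedError

def generate_recipes(food_group: str) -> str:
    # First agent focuses purely on content; no formatting burden in the prompt.
    return call_llm(f"Create three recipes using {food_group}. Content only.")

def format_recipes(raw: str) -> str:
    # A second, narrow prompt (or plain non-LLM code) handles formatting alone.
    return call_llm(
        "Reformat these recipes: all ingredients listed on top, "
        f"main ingredient first, steps after.\n\n{raw}"
    )

# usage: formatted = format_recipes(generate_recipes("legumes"))
```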

A few others for comprehensiveness, though the above is the main core of it:

  • agents/agent version — Of course, you can change the agent you call to improve performance.
    • I would say this is more relevant as service-style businesses like Anthropic and OpenAI deprecate/alter the available agents, forcing a frustrating situation where you have to revisit your workflow because the new ‘smarter’ agents simply don’t work with your old workflow. This is compounded by non-technical clients/leadership demanding the latest agent, assuming it’ll be better, based on hyped-up media-reported evaluations.
  • modification/local hosting/fine-tuning — maybe fine-tuning services, or training/fine-tuning an in-house model

Basically, you try something –> it has terrible performance. You chunk the problem into smaller pieces, give more specific/general patterns and better evals –> until it works. Run it repeated times (and with varying conditions if relevant) to build confidence it’s working.
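
That loop, as a hedged sketch (the helpers and the 0.9 threshold are made up for illustration):

```python
def build(chunk):      # placeholder: run the prompts/agents for this chunk
    raise NotImplementedError

def evaluate(output):  # placeholder: your eval, returning a 0-1 score
    raise NotImplementedError

def refine(chunk):     # placeholder: smaller chunks / better patterns / better evals
    raise NotImplementedError

def improve_until_confident(chunk, target: float = 0.9, runs: int = 10):
    """try -> terrible performance -> chunk/improve -> re-run until it works."""
    while True:
        scores = [evaluate(build(chunk)) for _ in range(runs)]  # repeated runs
        if min(scores) >= target:  # confidence across runs, not one lucky pass
            return chunk
        chunk = refine(chunk)
```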

Then scale and scope up to the next chunk. Wherever it’s worthwhile, time- and effort-wise (usually for the future), you automate and/or generalize.

Sometimes there are higher-level patterns (repeated processes you use on these chunks) that you can generalize, and sometimes that takes 10x the effort, so you just focus on finishing the project/pipeline.

That’s basically it.

High level view

  • Ultimately, the trick being played inside organizations, and the extra-organizational hype across start-ups, consulting, and media, is the claim that automation can be achieved automatically via an army of AI agents.
  • In reality, agentic AI projects are less about the agents and almost entirely about organizing your project, your people, and your processes such that the hidden, intuitive, and implicit knowledge of your experts can be transferred, over time, into an organized, reproducible framework, in a sustainable way that doesn’t lead to fragility or complexity collapse.