From Context Engineering to AI Agent Harnesses: The New Software Discipline
Lance Martin of LangChain joins High Signal to outline a new playbook for engineering in the AI era, where the ground is constantly shifting under the feet of builders. He explains how the exponential improvement of foundation models is forcing a complete rethink of how software is built, revealing why top products from Claude Code to Manus are in a constant state of re-architecture simply to keep up. We dig into why the old rules of ML engineering no longer apply, and how Rich Sutton's "bitter lesson" dictates that simple, adaptable systems are the only ones that will survive. The conversation provides a clear framework for leaders on the critical new disciplines of context engineering to manage cost and reliability, the architectural power of the "agent harness" to expand capabilities without adding complexity, and why the most effective evaluation of these new systems is shifting away from static benchmarks and towards a dynamic model of in-app user feedback.
Guest
Lance Martin
ML/AI Engineer at LangChain
Key Takeaways
AI Engineering Works at a New Abstraction Layer.
The ML landscape has fundamentally shifted from a world where every organization trained its own specialized models (like in the self-driving era) to one defined by a few large foundation model providers. Most users now operate at a higher level of abstraction, focusing on prompt engineering, context management, and building agents rather than model architecture and training.
The Bitter Lesson Demands Constant Re-Architecture.
In the age of LLMs, applications are built on an exponentially improving primitive. This dictates that structures and assumptions baked into an architecture today will be made obsolete by tomorrow’s models, forcing continuous, aggressive re-architecture (e.g., one major agent product rebuilt five times in eight months) to avoid bottlenecking future performance.
Start Simple and Build for Verifiable Evaluation.
Lessons from traditional ML still apply, emphasizing that simplicity is essential—use a simple prompt, then a workflow, and only move to an agent if the problem is truly open-ended. Evaluation remains critical, and systems should be designed around "Verifier's Law," meaning tasks are easier to solve if their successful completion is easily verifiable.
Match the Problem: Workflows for Predictability, Agents for Autonomy.
System design should be intentional: Workflows are best for predefined, predictable steps (like running a test suite or a migration), ensuring consistency and repeatability. Agents, which allow the LLM to dynamically direct its own processes and tool usage, are reserved for open-ended, adaptive tasks like complex research or debugging.
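The workflow/agent distinction can be sketched in a few lines of Python. This is an illustrative stand-in, not LangChain or LangGraph code: `run_tests`, `call_llm`, and the tool names are hypothetical placeholders for a real test runner, model call, and tool registry.

```python
def run_tests(code: str) -> str:
    # Stand-in for a real test runner.
    return "tests passed"

def migration_workflow(code: str) -> str:
    # Workflow: the steps are fixed in code, so every run is
    # predictable and repeatable.
    plan = f"plan for {code}"     # step 1: always plan
    result = run_tests(code)      # step 2: always verify
    return f"{plan} -> {result}"

def agent_loop(task: str, call_llm, tools: dict, max_turns: int = 5) -> str:
    # Agent: the model chooses the next tool each turn until it
    # decides it is finished (bounded by max_turns as a safety rail).
    history = [task]
    for _ in range(max_turns):
        action, arg = call_llm(history)  # model decides what to do next
        if action == "finish":
            return arg
        history.append(tools[action](arg))
    return history[-1]
```

The design difference is where control flow lives: in the workflow it is hard-coded by the engineer; in the agent loop it is delegated to the model at every turn, which is exactly what makes agents adaptive but less predictable.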
Model Improvement Drives Agent Autonomy and Reliability.
Agents have become significantly more viable because frontier models are much better at instruction following, tool calling, and crucially, self-correction. This increase in LLM capacity means the length of tasks an agent can reliably accomplish is doubling approximately every seven months, making longer-horizon tasks possible.
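The METR doubling claim implies a simple back-of-envelope formula for planning: if the reliable task horizon is h0 today, after t months it is roughly h0 · 2^(t/7). A tiny sketch (the function name and units are my own, not from the source):

```python
def task_horizon(h0_minutes: float, months: float) -> float:
    # Horizon doubles once per ~7-month period, per the METR trend.
    return h0_minutes * 2 ** (months / 7)
```

So a task length an agent handles reliably today quadruples in about fourteen months — a useful sanity check when deciding whether to engineer around a current model limitation or simply wait it out.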
Context Engineering: Reduce, Offload, and Isolate.
Managing the LLM's context window is vital for controlling costs, improving latency, and maintaining output quality, as performance can degrade (context rot) even with very large context windows. Strategies include **Reduction** (pruning/summarizing old messages), **Offloading** (saving data to a file system, or using Bash/CLI tools to expand the action space instead of binding numerous tools), and **Isolation** (using sub-agents for token-heavy tasks).
Ambient Agents Require Thoughtful Human-in-the-Loop Design.
Asynchronous, or ambient, agents (like an email triage system running in the background) are an emerging form factor, but their higher autonomy introduces risk. They must be designed with careful human-in-the-loop checkpoints to prevent them from getting stuck in long, off-track sequences, and should incorporate a memory system to learn user preferences from ongoing feedback.
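An email-triage checkpoint loop might look like the following sketch, with hypothetical callables for drafting (`draft_reply`) and human sign-off (`approve`); it is an illustration of the pattern, not a real system.

```python
def ambient_triage(emails, draft_reply, approve, memory):
    # Background loop with a human checkpoint before anything is sent,
    # and a simple memory of rejections so the agent can learn
    # user preferences from ongoing feedback.
    sent = []
    for email in emails:
        draft = draft_reply(email, memory)
        if approve(email, draft):       # human-in-the-loop checkpoint
            sent.append((email, draft))
        else:
            memory.append(f"rejected draft for: {email}")
    return sent
```

The key design choice is that the irreversible action (sending) sits behind the approval gate, while every rejection feeds the memory that conditions future drafts.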
Protocols Drive Standardization in the LLM Ecosystem.
The rapid proliferation of custom tools and endpoints has led to the emergence of unifying standards like the Model Context Protocol (MCP). The adoption of such protocols and robust frameworks (like LangGraph) is crucial in large organizations to provide a common, well-supported standard for connecting tools, context, and prompts, improving security and developer efficiency.
Evaluation Must Be Dynamic and Component-Driven.
Static benchmarks are quickly saturated by rapidly improving models. Effective evaluation now relies on aggressive "dogfooding," capturing in-app user feedback, inspecting raw execution traces, and rolling new failure cases into dynamic eval sets. Additionally, system quality is improved by setting up separate evaluations for sub-components, such as the retrieval step in a RAG system.
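The "roll failures into a dynamic eval set" idea can be sketched as a tiny class; the names (`DynamicEvalSet`, `record_feedback`, `score`) are my own, not from any eval library.

```python
class DynamicEvalSet:
    # Grow the eval set from real in-app failures instead of
    # freezing a static benchmark that models quickly saturate.
    def __init__(self):
        self.cases = []  # (input, expected) pairs from real usage

    def record_feedback(self, user_input, thumbs_up, expected=None):
        # A thumbs-down with a known correct answer becomes a new case.
        if not thumbs_up and expected is not None:
            self.cases.append((user_input, expected))

    def score(self, system):
        # Fraction of accumulated failure cases the current system
        # now gets right; vacuously 1.0 before any failures arrive.
        if not self.cases:
            return 1.0
        hits = sum(1 for x, want in self.cases if system(x) == want)
        return hits / len(self.cases)
```

Each user-reported failure becomes a permanent regression test, so the eval set hardens exactly where the product actually breaks.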
Avoid the Rush to Fine-Tune; Frontier Models Catch Up.
Leaders should be wary of immediately rushing into model training or fine-tuning. The rapid advancements in frontier models mean that capabilities that required custom fine-tuning yesterday (like generating high-quality structured output) are often integrated into the general models today, risking wasted time and effort.
You can read the full transcript here.
Links From The Show
- Lance on LinkedIn
- Context Engineering for Agents by Lance Martin
- Learning the Bitter Lesson by Lance Martin
- Context Engineering in Manus by Lance Martin
- Context Rot: How Increasing Input Tokens Impacts LLM Performance by Chroma
- Building effective agents by Erik Schluntz and Barry Zhang at Anthropic
- Effective context engineering for AI agents by Anthropic
- How we built our multi-agent research system by Anthropic
- Measuring AI Ability to Complete Long Tasks by METR
- Your AI Product Needs Evals by Hamel Husain
- Introducing Roast: Structured AI workflows made easy by Shopify
- Watch the podcast episode on YouTube
- Delphina's Newsletter
