Kevin Smith
5 min read • 12 August 2025
🔗 Originally published on LinkedIn
For the past two years the AI industry has raced towards ever-larger models. Budgets have soared. Cloud bills have ballooned. And much of the conversation has centred on the promise of Artificial General Intelligence (AGI) and, eventually, Artificial Superintelligence (ASI). Yet a quieter revolution is already under way. It is driven not by ever-bigger neural nets but by agentic AI—systems in which discrete software “agents” perform specialised, repeatable tasks on demand. And the future of these systems is not megascale models but small, fit-for-purpose language models, connected by open protocols that let them work together as an Agentic Web.
Throughout this article, we draw on the recent NVIDIA position paper “Small Language Models are the Future of Agentic AI” and on industry developments reported by Microsoft, Anthropic and others.
The term Agentic AI is bandied about all the time these days, but what is it? At its core, an AI agent is software that uses a language model to decide when and how to use tools or perform actions towards a goal. Rather than an 'unbounded' chatbot, an agent is a discrete workflow engine, with intent-driven prompts, structured tool interfaces and strict output formatting.
Side note: I like to think of agents and agentic systems as software that models behaviours, but you probably won't find that definition in many textbooks.
Broadly speaking, there are two architectural patterns. In Language-Model Agency, the model handles both the dialogue with the user and the orchestration of tool calls—for example, calling a search API, invoking a database query or running code. These are the very public agents that you can see and use right now: ChatGPT's agent mode, Claude Code, Gemini and others.
In Code-Driven Agency, a lightweight controller handles orchestration and error-checking. The model is used only for specific tasks—generating text snippets, parsing inputs or making decisions when rules reach their limits. These are the hidden agents that you probably don't notice, because they look like ordinary computer programs.
Both styles are agentic because they embed the model in a broader, tool-centric workflow. What they share is the need for reliable, repeatable behaviour—exactly the kind of workload where general-purpose LLMs can be overkill.
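The code-driven pattern can be sketched in a few lines. In this minimal illustration, a plain-Python controller owns the workflow and deterministic rules run first; the model is consulted only when the rules reach their limits. The `call_model` function and the routing keywords are hypothetical stand-ins, not any particular product's API.

```python
def call_model(prompt: str) -> str:
    # Placeholder: in a real system this would invoke a small language model.
    return "other"

# Deterministic routing rules the controller tries before any model call.
ROUTES = {
    "refund": "billing_queue",
    "password": "it_queue",
}

def route_ticket(text: str) -> str:
    """Rules first; fall back to the model only for ambiguous cases."""
    lowered = text.lower()
    for keyword, queue in ROUTES.items():
        if keyword in lowered:
            return queue  # rule matched: no model call needed
    # Rules reached their limit: ask the model to classify the ticket.
    label = call_model(f"Classify this support ticket in one word: {text}")
    return f"{label}_queue"
```

Because most traffic never reaches the model, the controller stays cheap, auditable and testable—exactly the properties the hidden-agent style trades on.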
The conventional wisdom is that bigger is better. Scaling laws show that large models tend to improve across many benchmarks as their parameter counts grow. Yet agentic tasks are rarely open-ended. Instead they are repetitive (for instance, extracting dates from emails), scoped (summarising a specific document, not a free-form essay) and structural (producing code, JSON or form-filling with strict syntax).
NVIDIA’s paper argues that in these contexts Small Language Models (SLMs) - models that fit on a consumer device (below ~10 billion parameters) - are often the better choice. Inference cost can be 10–30× cheaper per token, with lower latency and energy use; they can often run on-device or in small cloud instances, avoiding multi-GPU parallelism; specialist models can be adapted overnight with a few GPU-hours, using LoRA or QLoRA techniques; and they can be trained for a single task or format, reducing hallucinations and format errors.
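A back-of-envelope calculation shows how quickly the per-token gap compounds at agentic volumes. The prices below are hypothetical placeholders for illustration, not quotes from any provider.

```python
def monthly_cost(calls_per_day: int, tokens_per_call: int,
                 price_per_million_tokens: float) -> float:
    """Rough monthly token spend for a fixed daily call volume."""
    tokens = calls_per_day * tokens_per_call * 30  # 30-day month
    return tokens / 1_000_000 * price_per_million_tokens

# Hypothetical prices: a large hosted model vs a small specialist.
llm = monthly_cost(100_000, 1_500, 10.00)
slm = monthly_cost(100_000, 1_500, 0.50)
print(round(llm / slm))  # -> 20, inside the claimed 10-30x band
```

At a hundred thousand calls a day, even a modest per-token difference separates a rounding error from a line item.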
Several recent SLMs illustrate this trend. Microsoft Phi-2 (2.7 B) achieves common-sense reasoning and code generation on a par with 30 B models, running ~15× faster. NVIDIA Nemotron-H (4.8 B to 9 B) matches the instruction-following accuracy of 30 B LLMs at a fraction of the FLOPs, and Hugging Face SmolLM2 (135 M to 1.7 B) rivals 14 B models on many benchmarks while using only a few percent of the compute.
In an SLM-first architecture, a fleet of small specialists handles routine subtasks. When broader reasoning is essential, a generalist LLM is invoked selectively. This modular approach drives down cost and improves reliability.
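The SLM-first idea reduces to a dispatch layer: named specialists handle routine subtasks, and anything unrecognised escalates to the generalist. The worker functions below are hypothetical stand-ins for real model calls.

```python
def extract_dates(text: str) -> str:
    return "dates:" + text        # stand-in for a date-extraction SLM

def summarise(text: str) -> str:
    return "summary:" + text      # stand-in for a summarisation SLM

def generalist_llm(text: str) -> str:
    return "llm:" + text          # expensive fallback, invoked selectively

# Registry of small specialists keyed by subtask name.
SPECIALISTS = {"extract_dates": extract_dates, "summarise": summarise}

def dispatch(task: str, payload: str) -> str:
    """Route to a cheap specialist; escalate to the LLM only when needed."""
    handler = SPECIALISTS.get(task, generalist_llm)
    return handler(payload)
```

Swapping a specialist in or out is a one-line change to the registry, which is the modularity the architecture is buying.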
Just as the rise of microservices depended on HTTP, REST and gRPC, agentic AI will require open protocols for inter-agent communication. Two specifications are leading the way:
Model Context Protocol (MCP). Originally introduced by Anthropic and now championed by Microsoft, MCP defines a JSON-RPC-style interface between agents and their clients or peer agents. It provides session management and persistent context, so long-running tasks survive server restarts and clients can resume workflows; tool invocation through standardised method calls (for example, tools/call) with typed parameters and result objects; and notifications and progress updates, giving fine-grained status messages for long-running jobs or human approvals. MCP is already finding use in multi-agent demos and production systems, enabling agents to coordinate without bespoke glue code.
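To make the tools/call idea concrete, here is what such a request looks like on the wire. The envelope (jsonrpc, id, method, params) is standard JSON-RPC 2.0 and tools/call is the MCP method name; the tool name and arguments are made up for illustration.

```python
import json

# A tools/call request in MCP's JSON-RPC style.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_logs",                       # hypothetical tool
        "arguments": {"query": "error", "limit": 10},
    },
}

wire_message = json.dumps(request)  # what actually travels between processes
```

The typed parameters live under `arguments`, so a client can validate them against the tool's declared schema before anything is executed.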
Agent-to-Agent Protocol (A2A). While MCP focuses on the agent-to-tool interface, A2A standardises direct communication between agents. Hosted on GitHub by Google, A2A specifies opaque payloads (secure envelopes that agents exchange without revealing internal state); discovery and handshake (a registry mechanism for locating agents and negotiating capabilities); and versioning and compatibility (a scheme for evolving message formats while preserving backwards support). By combining MCP for tool usage with A2A for peer-to-peer messaging, developers can weave a dynamic ecosystem of specialised intelligent services rather than monolithic AI endpoints.
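Discovery hinges on agents publishing a machine-readable description of what they can do. The sketch below is indicative of that kind of capability record; the field names are illustrative, not quoted from the A2A specification.

```python
# Illustrative capability record an agent might publish for discovery.
agent_card = {
    "name": "invoice-extractor",
    "version": "1.0.0",
    "description": "Extracts line items from PDF invoices",
    "skills": ["extract_line_items", "validate_totals"],
    "endpoint": "https://agents.example.com/invoice-extractor",
}

def supports(card: dict, skill: str) -> bool:
    """Would this agent satisfy the capability we need?"""
    return skill in card.get("skills", [])
```

A registry that indexes such records lets a client find a partner agent by skill rather than by hard-coded address, which is what makes composition dynamic.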
As Microsoft pointed out ahead of its Build 2025 conference, these protocols could foster an “agentic web” much like HTTP and HTML did for the public internet. Agents will publish their capabilities. Clients and other agents will discover and compose them. And value will flow through standardised interfaces rather than closed silos.
It is tempting to fixate on AGI and ASI as the ultimate AI milestone. Yet most enterprises do not need a digital oracle. They need instruction-following agents that execute a fixed workflow—schedule a meeting, extract key performance metrics, monitor system logs. They need repetition with reliability—processes that run a thousand times a day with zero format errors. They need cost-effective scale—thousands of calls per second at a fraction of the cost of LLM-only solutions.
Consider personal assistant agents that automatically manage calendars, book meeting rooms and send reminders. They must integrate with corporate APIs, parse structured data and never miss a beat. Operational intelligence agents aggregate logs, run root cause analysis and escalate faults. They must obey strict JSON schemas, run queries reliably and deliver alerts within seconds. Customer support agents handle Tier-1 queries, route edge cases to humans and summarise transcripts. They must produce concise replies, maintain session data and never hallucinate policy details. All of these use cases align perfectly with SLM-first architectures and open protocols. They do not demand world-class creative reasoning; they demand precision.
NVIDIA’s paper outlines a six-step conversion algorithm for migrating existing agentic systems from large monolithic models to fleets of SLMs. It begins with instrumentation—secure logging of model calls, tool invocations and latencies, with anonymisation. The collected data is then curated and filtered to remove sensitive content and low-quality examples. Next comes task clustering, using unsupervised techniques to group similar invocations such as date extraction, code generation and summarisation. SLM selection involves choosing candidate SLMs based on capabilities, context window size and licensing. Specialist fine-tuning applies lightweight methods (LoRA, QLoRA) or knowledge distillation to adapt SLMs to each subtask. Finally, teams iterate, repeating data collection and retraining to refine performance and extend coverage.
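The task-clustering step can be crudely sketched with a surrogate key—here, the tool invoked plus the prompt's first word—so that candidate SLM subtasks become visible in the logs. A real pipeline would cluster over embeddings with unsupervised methods; the logged calls below are invented examples.

```python
from collections import defaultdict

# Invented examples of logged model invocations from the instrumentation step.
logged_calls = [
    {"tool": "calendar", "prompt": "Extract the meeting date"},
    {"tool": "calendar", "prompt": "Extract the deadline"},
    {"tool": "none",     "prompt": "Summarise this document"},
]

def cluster_key(call: dict) -> str:
    """Cheap surrogate for semantic clustering: tool name + leading verb."""
    return f'{call["tool"]}:{call["prompt"].split()[0].lower()}'

clusters = defaultdict(list)
for call in logged_calls:
    clusters[cluster_key(call)].append(call["prompt"])
# Clusters with many members are candidates for a dedicated specialist SLM.
```

Even this naive grouping surfaces the shape of the workload: a cluster that accounts for thousands of daily calls is exactly where a fine-tuned specialist pays off first.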
Early case studies suggest 40–70 per cent of agentic invocations can be handled by SLM specialists, cutting costs dramatically.
The shift towards SLM-first architectures and open protocols forces us to re-evaluate the status quo. We need to revisit total cost of ownership. We need to be wary of assuming scale advantages from cloud-native LLM services.
SLM architecture means we need to prioritise modularity and break agentic workflows into fine-grained tasks. This aligns with the emerging Agentic Web narrative and would allow us to swap in new models or services without rewriting everything.
Investment in data logging becomes non-negotiable. High-quality usage data is the key asset for fine-tuning and model selection. We will need to ensure our agent platforms can capture structured examples.
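In practice that capture can be as simple as one JSON line per model invocation, recording the fields that fine-tuning later needs. This is a minimal sketch; anonymisation of user content is assumed to happen before logging, and the field set is illustrative.

```python
import json
import time

def log_call(task: str, prompt: str, output: str, latency_ms: float) -> str:
    """Serialise one model invocation as a JSON log line."""
    record = {
        "ts": time.time(),        # when the call happened
        "task": task,             # subtask label, feeds task clustering
        "prompt": prompt,         # anonymised input
        "output": output,         # model response, a future training example
        "latency_ms": latency_ms, # performance signal for model selection
    }
    return json.dumps(record)     # append this line to a log file in practice

line = log_call("summarise", "Summarise the Q3 report", "Revenue up 4%", 120.5)
```

Because each line is self-describing, the same log feeds clustering, SLM selection and fine-tuning without a separate labelling pass.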
Businesses should push to adopt open standards early. Getting ahead of MCP and A2A adoption will position organisations to be able to integrate with third-party agents, collaborate with partners and avoid vendor lock-in.
Ultimately, these practices will yield systems that are more robust, more sustainable and more cost-effective than LLM-centric alternatives.
Agentic AI is not a future prospect - it is the present, messy and juvenile reality. And while the promise of AGI inspires, the real transformation will come from practical agents built on small, specialised models and connected by open protocols.
This is not hyperbole. It is the convergence of technical capability, economic necessity and the industry’s move towards modular, service-oriented architectures.
NVIDIA’s paper on SLMs offers a well-argued roadmap; MCP and A2A provide the plumbing. Leaders who embrace these changes now are likely to find themselves at the vanguard of the Agentic Web—delivering faster, cheaper and more reliable AI services than any monolithic LLM can offer.
This article was originally written and published on LinkedIn by Kevin Smith, CTO and founder of Dootrix.