Why We Built This
Home loan processing is document-intensive by nature. A single application can include national ID documents, bank statements, payslips, company registration certificates, KYC filings, title deeds, and more — each needing to be verified not only for authenticity but for internal consistency across the application. A loan officer manually cross-checking all of this is slow, error-prone, and expensive at scale.
We saw a genuine opportunity for an AI agent to own this verification workflow end-to-end. But as we started planning the build, a question surfaced that we couldn’t ignore: how much does it actually cost to run an LLM-powered agent over a complex, multi-document process — and does that cost vary meaningfully depending on how you build it?
Token consumption is the electricity bill of AI. You can have the smartest agent in the world, but if it burns through tokens inefficiently — redundant context, bloated prompts, unpredictable retries — it becomes economically unviable at scale. We decided to treat this as a proper research exercise rather than a one-shot build.
“Token consumption is the electricity bill of AI. The smartest agent in the world becomes economically unviable if it burns tokens inefficiently.”
What the Agent Does: The RCU Agent
We called it the RCU Agent — short for Review, Check, and Underwrite. Its scope covers the full document verification layer of the home loan intake process. The agent operates across three distinct capability areas:
RCU Agent — Capability Architecture
01 Document Extraction
Extracts structured information from both templated documents (national IDs, bank statements, KYC forms) with defined schemas, and non-templated documents (payslips, company registration certificates, employer letters) that require adaptive extraction logic.
02 Application Verification
Cross-references extracted document data against the applicant’s stated details in the loan application — checking for consistency in identity, income, employment, and property details. Also verifies completeness, flagging missing fields or mismatched values.
03 Business Rules Engine
Applies a rules layer to determine whether the submitted document set is sufficient to proceed. Validates KYC document requirements, income verification documentation, and property-related documents independently — then produces a consolidated go/no-go recommendation with specific gap annotations.
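The verification and rules layers described above can be sketched in a few lines. This is a minimal illustration, not the RCU Agent's actual implementation: the document names in `REQUIRED_DOCS`, the `Recommendation` type, and the case-insensitive comparison are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical rule set: required document types per category.
# The three categories mirror the article; the document names are illustrative.
REQUIRED_DOCS = {
    "kyc": {"national_id", "kyc_form"},
    "income": {"payslip", "bank_statement"},
    "property": {"title_deed"},
}


@dataclass
class Recommendation:
    proceed: bool
    gaps: list[str] = field(default_factory=list)


def cross_check(application: dict, extracted: dict) -> list[str]:
    """Return one gap note per stated field that is missing or differs."""
    gaps = []
    for name, stated in application.items():
        found = extracted.get(name)
        if found is None:
            gaps.append(f"missing field: {name}")
        elif str(found).strip().lower() != str(stated).strip().lower():
            gaps.append(f"mismatch on {name}: stated {stated!r}, found {found!r}")
    return gaps


def evaluate(application: dict, extracted: dict, submitted: set[str]) -> Recommendation:
    """Combine field checks and per-category document checks into go/no-go."""
    gaps = cross_check(application, extracted)
    for category, required in REQUIRED_DOCS.items():
        for doc in sorted(required - submitted):
            gaps.append(f"{category}: missing document {doc}")
    return Recommendation(proceed=not gaps, gaps=gaps)
```

Each category is checked independently, and the recommendation is "go" only when the combined gap list is empty, so every "no-go" carries its own annotations.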
How We Tested: Three Builds, Same Scope
Rather than picking a platform and building once, we ran a structured comparison across three approaches using an identical set of test applications — the same documents, the same verification rules, and the same expected outcomes for each run.
Build A: Scratch (Vibe-Coded)
The first build was an intentionally unconstrained implementation — written quickly, leaning heavily on the LLM to handle logic, with minimal prompt engineering discipline. Think of it as the “move fast” prototype: system prompts written intuitively rather than precisely, context passed in bulk, and no structured orchestration layer controlling the agent’s reasoning flow. This gave us a baseline for what unoptimised AI development actually looks like in practice.
Build B: Kore.ai Orchestration Platform
The second build used Kore.ai, an enterprise conversational and agentic AI platform. Kore provides a structured environment for building agents — pre-built dialogue flows, intent management, and some token-management tooling out of the box. This represented the “platform-assisted” tier of development: more guardrails than a scratch build, but still dependent on prompt quality at the developer level.
Build C: Pega Platform
The third build used Pega’s AI-integrated process automation layer. Pega’s architecture is designed around deterministic process execution first, with AI invoked at specific, bounded decision points — rather than an AI-first design where the LLM drives the overall orchestration. This structural difference turned out to be significant.
Key Design Difference
In the Scratch and Kore builds, the LLM was doing heavy orchestration work: deciding what to do next, when to call tools, and how to structure outputs. In the Pega build, the process architecture handled orchestration deterministically, and the LLM was only invoked for the cognitive tasks that genuinely require it: extraction and judgement calls.
What We Observed
Most token analyses start at runtime — the cost of executing the agent. We went further, measuring across four distinct phases: build, first execute, multiple executions, and at scale. This matters because the build phase alone carries a significant token cost that rarely appears in platform comparisons, and the gap between approaches compounds differently at each stage.
Scope and Methodology
Runtime figures for Pega and Kore cover GenAI node configurations on both platforms, not the full agent frameworks of either. This was a deliberate scope decision: we tested how each platform’s AI invocation layer performs under identical workloads, not their broader agentic capabilities. Build-phase token figures for all three approaches are reasoned estimates based on observed development patterns (prompt iteration cycles, integration testing, debugging runs), not direct measurements. They are presented transparently as such.
Phase 1: Build
Building a 9-node document verification agent requires prompt development, orchestration wiring, integration testing, and debugging — all of which consume tokens. The scratch build carries the heaviest build cost because every element is hand-crafted without platform scaffolding.
Build Activity | Scratch (Est.) | Kore.ai (Est.) | Pega (Est.) |
Prompt dev per node (~12 iters scratch · ~5 Kore · ~3 Pega · avg 5,500 tokens/iter) | ~594,000 | ~248,000 | ~149,000 |
Orchestration building — manual wiring, flow testing, partial pipeline runs | ~450,000 | ~120,000 | ~80,000 |
End-to-end integration testing — full pipeline runs during development | ~750,000 | ~300,000 | ~240,000 |
Bug fixing & regression — cross-node failures, logic errors, retries | ~400,000 | ~80,000 | ~40,000 |
Final validation runs | ~300,000 | ~52,000 | ~21,000 |
Total build (est.) | ~2,494,000 | ~800,000 | ~530,000 |
The scratch build costs roughly 4.7× as many tokens to build as Pega (~2,494,000 versus ~530,000) and about 3.1× as many as Kore, before a single production case is processed. This overhead is largely invisible in standard platform evaluations because most teams measure running cost only, not development cost.
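The build-phase totals can be re-derived from the table's own rows. The sketch below assumes the prompt-development row is 9 nodes × iterations × 5,500 tokens, rounded to the nearest thousand; that reading is inferred from the row label and the "9-node" description, and it does reproduce the published figures.

```python
NODES = 9                      # from the "9-node" agent description
TOKENS_PER_ITERATION = 5_500   # average stated in the table
ITERATIONS = {"scratch": 12, "kore": 5, "pega": 3}

# Remaining rows, copied from the table: orchestration, integration
# testing, bug fixing and regression, final validation.
OTHER_ROWS = {
    "scratch": [450_000, 750_000, 400_000, 300_000],
    "kore": [120_000, 300_000, 80_000, 52_000],
    "pega": [80_000, 240_000, 40_000, 21_000],
}


def prompt_dev(build: str) -> int:
    """Prompt-development estimate, rounded half-up to the nearest thousand."""
    raw = NODES * ITERATIONS[build] * TOKENS_PER_ITERATION
    return (raw + 500) // 1000 * 1000


def build_total(build: str) -> int:
    return prompt_dev(build) + sum(OTHER_ROWS[build])


for build in ITERATIONS:
    ratio = build_total("scratch") / build_total(build)
    print(f"{build:8s} {build_total(build):>10,} tokens  scratch is {ratio:.1f}x")
```

The explicit half-up rounding is deliberate: Python's built-in `round` rounds halves to the nearest even digit, which would turn 148,500 into 148,000 rather than the table's ~149,000.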
Phases 2–4: Execution
Token consumption across execution phases — all three approaches
Phase | Scratch | Kore.ai | Pega | Key Driver |
First execute (1 run) | 103,065 measured | 30,138 measured | 28,063 measured | Scratch agentic loop: 72.1% of tokens consumed by orchestration overhead |
Multiple executions (5 runs avg/run) | 103,065 extrapolated | 30,436 measured | 23,970 measured | Kore completion tokens 89.6% higher than Pega; scratch overhead constant per run |
At scale (100 runs total) | 10,306,500 extrapolated | 3,043,640 measured | 2,396,960 measured | Scratch 3.4× Kore, 4.3× Pega — orchestration overhead multiplies with every run |
The Hidden Tax Inside the Scratch Build
Of the 103,065 tokens consumed in a single scratch run, only 28,720 (27.9%) were actual specialist work: schema mapping, KYC validation, income validation, bank validation, and the final decision. The remaining 74,345 tokens (72.1%) were manager and orchestration overhead: the agentic loop planning what to do next, managing tool-call context, and carrying intermediate outputs between steps.
This is what unstructured AI invocation looks like under the hood. The model is spending the majority of its token budget deciding how to do the task, not doing it. Structured platforms eliminate this overhead by handling orchestration deterministically.
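The overhead share and the at-scale multiples follow directly from the figures already quoted. A short check, using only numbers from the tables above (the variable names are ours):

```python
SCRATCH_RUN = 103_065        # tokens in one measured scratch run
SPECIALIST = 28_720          # tokens spent on actual specialist work
KORE_100 = 3_043_640         # Kore, 100 runs
PEGA_100 = 2_396_960         # Pega, 100 runs

overhead = SCRATCH_RUN - SPECIALIST
overhead_share = overhead / SCRATCH_RUN * 100

# The scratch figure at scale is an extrapolation: one run times 100.
scratch_100 = SCRATCH_RUN * 100

print(f"overhead: {overhead:,} tokens ({overhead_share:.1f}%)")
print(f"scratch vs Kore at 100 runs: {scratch_100 / KORE_100:.1f}x")
print(f"scratch vs Pega at 100 runs: {scratch_100 / PEGA_100:.1f}x")
```

Because the scratch overhead is treated as constant per run, the multiples at 100 runs are the same as they would be at any other volume under this extrapolation.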
Total Cost of Ownership: Build + 100 Runs
Combining the estimated build cost with measured or extrapolated runtime data, the full picture is stark. Over the build plus 100 runs, Scratch consumes roughly 3.3× as many tokens as Kore and 4.4× as many as Pega.
Approach | Total Tokens (Build + 100 Runs) | vs Scratch |
Scratch | ~12,800,500 tokens | Baseline |
Kore.ai | ~3,843,640 tokens | −70% |
Pega | ~2,926,960 tokens | −77% |
The real divide is between unstructured agentic builds and any platform that applies architectural discipline to how the AI is invoked. Scratch is not close to Kore: it uses about 3.4× the runtime tokens over 100 runs, driven by an orchestration overhead that recurs with every single execution.
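The total-cost table is a simple sum of the build estimate and the 100-run runtime figure. The sketch below reproduces it and, as an added assumption, lets the run count vary by scaling the 100-run figure linearly; the study itself only reports 100 runs.

```python
BUILD_EST = {"scratch": 2_494_000, "kore": 800_000, "pega": 530_000}
RUN_100 = {"scratch": 10_306_500, "kore": 3_043_640, "pega": 2_396_960}


def total_tokens(approach: str, runs: int = 100) -> float:
    """Build estimate plus runtime, scaling the 100-run figure linearly."""
    return BUILD_EST[approach] + RUN_100[approach] * runs / 100


def vs_scratch(approach: str, runs: int = 100) -> float:
    """Percentage difference from the scratch build at the same volume."""
    return (total_tokens(approach, runs) / total_tokens("scratch", runs) - 1) * 100


for name in BUILD_EST:
    print(f"{name:8s} {total_tokens(name):>12,.0f}  {vs_scratch(name):+.0f}%")
```

Calling `total_tokens("pega", runs=1000)` gives a rough projection for higher volumes, but it inherits the linear-scaling assumption and should be read as indicative only.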
Output Correctness: Platform Comparison
Token consumption was only part of the story. The more revealing findings emerged when we looked at output correctness and developer experience — whether the agent was actually getting the right answers, and how much control the platform gave us in pursuing that.
Approach | Token Efficiency | Output Correctness | Error Rate | Verdict |
Scratch / Vibe-coded | Poor | Variable | High — factual extraction errors, hallucinated fields | Not viable |
Kore.ai | Moderate | Consistent | Moderate — complex multi-document scenarios and cross-field rule evaluation exposed correctness gaps requiring ongoing prompt engineering | Viable with caveats |
Pega | Good | Deterministic | Very low — same inputs produced same outputs across all runs | Production-ready |
Key Observations
Observation 1: The Scratch Build Consumed the Most — and Made Mistakes
The vibe-coded build was predictably expensive. Without structured prompt engineering or a constrained orchestration layer, the agent passed large, unfiltered context windows to the LLM repeatedly. It also made substantive errors: in some runs, it hallucinated document fields that weren’t present, mismatched applicant names across documents, and occasionally skipped entire rule checks. The high token count was partly a consequence of retry logic attempting to recover from its own inconsistencies.
Observation 2: Kore Improved Efficiency, but Correctness and Flexibility Had Limits
Kore.ai’s platform tooling meaningfully reduced token consumption compared to the scratch build. The structured flow management reduced redundant context passing, and results were consistent across runs. However, we hit two meaningful friction points in practice.
First, because Kore is fundamentally a prompt-led development and execution environment, the quality of every output depended heavily on prompt construction. In several verification scenarios the agent returned incorrect results — not inconsistent results, but confidently wrong ones. It is worth being precise here: LLMs are not deterministic by nature, and a combination of well-engineered prompts, code-based logic, and post-LLM decisioning can make the system deterministic enough for most purposes. The challenge is that reaching and sustaining that threshold requires continuous prompt engineering investment.
Second, we encountered limits in workflow customisation depth. The platform’s structural conventions imposed a ceiling on certain configuration choices that a more process-native architecture would not.
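The "code-based logic and post-LLM decisioning" mentioned in the first friction point can be illustrated as follows. This is a generic sketch, not Kore.ai's API: the field names, the `parse_extraction` helper, and the 5% tolerance are all hypothetical. The idea is that the model only extracts, and plain code validates the output and makes the decision.

```python
import json

# Hypothetical schema for one extraction step.
REQUIRED_FIELDS = {"applicant_name": str, "monthly_income": (int, float)}


def parse_extraction(raw: str) -> dict:
    """Reject malformed or incomplete model output instead of trusting it."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    return data


def income_consistent(extracted: dict, stated_income: float,
                      tolerance: float = 0.05) -> bool:
    """Deterministic rule in code: same inputs always give the same answer."""
    return abs(extracted["monthly_income"] - stated_income) <= tolerance * stated_income
```

A wrapper like this does not make the model's extraction correct, but it does turn silent errors into explicit failures and keeps the pass/fail rule out of the prompt.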
Observation 3: Pega Delivered Efficiency, Correctness, and Greater Workflow Freedom
At GenAI node level, Pega consumed about 21% fewer tokens than Kore at scale: 2,396,960 versus 3,043,640 for 100 applications (put the other way, Kore consumed about 27% more than Pega). A critical qualification: this advantage is specific to the GenAI-node-optimised configuration. The architectural discipline of GenAI nodes is what creates the widening gap at volume.
On model flexibility, Pega’s model catalogue is curated rather than fully open, and integration with external models is not yet available on the platform. The developer freedom we experienced was in the workflow and decisioning layer — the ability to design the process architecture around our exact requirements without being forced into platform defaults.
“A combination of good prompting, code-based logic, and output decisioning can make a prompt-led system deterministic enough — but sustaining that at scale across complex, multi-document workflows requires constant engineering investment.”
Why the Architectural Difference Matters
The reason Pega outperformed both alternatives on token consumption comes down to a principle we’d call bounded AI invocation. When an LLM is responsible for orchestrating its own workflow, it consumes tokens not just on the task at hand but on the meta-reasoning about the task. A deterministic process layer handles this orchestration for free — the AI is told exactly when it is needed, for exactly what purpose, with exactly the context required.
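A minimal sketch of bounded AI invocation, under stated assumptions: the `llm` callable stands in for any model client, the two required documents and the prompts are invented for the example, and none of this reflects Pega's actual API. What it shows is the shape of the idea: the code fixes the sequence of steps, and the model is called a known number of times with only the context each call needs.

```python
from typing import Callable

LLM = Callable[[str], str]  # takes a prompt, returns the model's text

REQUIRED = ("national_id", "payslip")  # hypothetical required documents


def run_case(documents: dict[str, str], llm: LLM) -> dict:
    # Deterministic gate: no tokens are spent if the document set is incomplete.
    missing = [name for name in REQUIRED if name not in documents]
    if missing:
        return {"decision": "no-go", "gaps": missing, "llm_calls": 0}

    # Bounded calls 1..n: extraction only, one document per call.
    extracted = {
        name: llm(f"Extract the key fields from this {name}:\n{documents[name]}")
        for name in REQUIRED
    }

    # Final bounded call: one judgement, given only the extracted fields.
    summary = "\n".join(f"{name}: {value}" for name, value in extracted.items())
    verdict = llm(f"Do these records describe the same applicant? Answer yes or no.\n{summary}")

    decision = "go" if verdict.strip().lower().startswith("yes") else "no-go"
    return {"decision": decision, "gaps": [], "llm_calls": len(REQUIRED) + 1}
```

The number of model calls is `len(REQUIRED) + 1` for every complete case, which is what makes per-run token consumption predictable. An agentic loop, by contrast, decides at runtime how many planning and tool-call turns to take.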
Key Findings
- Orchestration waste: In an unstructured scratch agent build, 72.1% of tokens are consumed by the orchestration loop — planning, tool-call context, and intermediate outputs. Only 27.9% goes toward actual specialist work. This is the cost of letting the LLM manage its own workflow.
- Efficiency gap: Kore uses about 7% more tokens than Pega on the first run, and the gap widens to about 27% at scale, driven almost entirely by completion token verbosity in Kore’s JSON generation and validation summary nodes.
- Scope of findings: These findings are scoped to GenAI node configurations on both platforms. Results may differ when running native agentic implementations. Architecture choices within a platform matter as much as the platform choice itself.
- Determinism: LLMs are not deterministic by nature. Process-controlled invocation enforces reliable outputs structurally, while prompt-led approaches require continuous engineering effort to sustain.
- Completion tokens: In Kore, completion tokens were 89.6% higher than Pega per 100 runs — investigating output verbosity in LLM nodes is the highest-ROI optimisation for any prompt-led build.
- Build-phase costs: Build-phase token costs are invisible in most analyses but real. Process-first architectures reduce this overhead by limiting LLM involvement during the development phase itself.
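The efficiency-gap percentages depend on which platform is used as the baseline, so it is worth computing them both ways. The sketch below uses only the measured first-run and 100-run figures from the execution table; the helper names are ours.

```python
def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a / b - 1) * 100


def pct_fewer(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b."""
    return (1 - a / b) * 100


FIRST_RUN = {"kore": 30_138, "pega": 28_063}
AT_SCALE = {"kore": 3_043_640, "pega": 2_396_960}

for label, data in (("first run", FIRST_RUN), ("100 runs", AT_SCALE)):
    more = pct_more(data["kore"], data["pega"])
    fewer = pct_fewer(data["pega"], data["kore"])
    print(f"{label}: Kore uses {more:.1f}% more; Pega uses {fewer:.1f}% fewer")
```

The ~7% and ~27% figures are Kore measured against Pega. Measured the other way, Pega's saving at 100 runs is closer to 21%, because the larger Kore total becomes the denominator.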
Where Each Platform Genuinely Shines
It would be a misreading of this research to conclude that Kore is simply the inferior choice. These findings are specific to a particular type of agent — a complex, multi-document, rules-heavy verification workflow in a regulated environment. The numbers favour Pega for this use case. They do not describe the full picture of what either platform is built for.
Kore.ai: Where It Leads
Kore’s genuine strengths are in speed, accessibility, and AI-native tooling. It is a low-code environment where teams without deep engineering depth can configure and deploy AI agents quickly. Time-to-market is materially faster than a process-first platform. Beyond the build, Kore brings a rich operational layer: AI governance, agent management, conversation analytics, and multi-agent orchestration are first-class features, not add-ons. For organisations whose primary need is deploying and managing AI agents at scale, Kore’s platform breadth is a genuine advantage.
Pega: Where It Leads
Pega’s strength in this exercise was its workflow-driven architecture: AI invoked at bounded, controlled points within a deterministic process. The industry is moving fast toward a world where AI that actually works is distinguished from AI that merely sounds convincing. Deterministic outcomes, explainable decisions, auditable reasoning trails, and governance baked into the execution layer are not optional features for regulated industries; they are the price of admission.
Beyond the AI layer, Pega is an enterprise operating system covering case management, decisioning, CRM, and operations automation. For organisations thinking about where enterprise AI is going (agents embedded in every process, every decision, every customer interaction, with full governance and traceability), Pega’s architecture is already built for that future.
The Cost Picture Is Bigger Than Tokens — and That’s Exactly the Point
Let’s address the obvious objection head-on: both Kore and Pega come with platform licensing costs — the normal cost of enterprise SaaS, no different from the CRM, the cloud infrastructure, or the data platform your organisation already runs. And yes, a team that vibe-codes an agent on a bare API call pays neither.
But token costs are not static. They scale with every application processed, every agent deployed, every workflow automated. Uber burned through its annual AI budget by April. Microsoft told engineers to stop using Claude. These are early signals from organisations that treated AI consumption as a rounding error until it wasn’t. The teams that hardcode cost efficiency into their architecture from day one are the ones that will still be running AI at scale in two years.
Platform investment reframes this entirely. When Kore costs money, it is buying you AI governance, agent lifecycle management, multi-agent orchestration, and deployment infrastructure that would take a substantial engineering team months to build. When Pega costs money, it is typically justified across the breadth of enterprise problems it can address — AI is leverage on an investment already being made.
“The teams that treat token costs as an afterthought are already building tomorrow’s tech debt. Optimisation isn’t a nice-to-have — it’s the difference between AI that scales and AI that gets shut down.”
An unoptimised AI architecture does not just cost more per token. It costs more in errors caught late, in inconsistent outputs that require human review, in retry logic that multiplies consumption, and in engineering cycles spent firefighting instead of building. The token bill is the visible tip. The operational cost underneath is what sinks teams.
With model costs under constant pressure and AI workloads growing exponentially, every architectural decision you make today is a financial commitment you will be living with at 10× the volume in 18 months. The organisations that figure this out early compound the ROI of their platform investments rather than being consumed by them.
What This Means for AI in Financial Services
Conclusion
The economics of AI agents in production environments are not just about model capability; they are about how the model is invoked. Our research suggests that teams building AI agents for regulated, high-stakes workflows should invest in architectural discipline before investing in model scale. For this specific use case (complex, multi-document verification with strict correctness requirements), a workflow-driven architecture delivered measurably better token efficiency and output reliability than a prompt-driven approach.
That is not a verdict on which platform is “better” in the abstract. Kore brings real strengths in speed of development, AI-native governance, and agent management that matter enormously in the right context. Pega’s advantage here is inseparable from its workflow-first architecture, which is most valuable when the problem is complex, regulated, and demands auditable outcomes.
The “just vibe code it” approach remains a useful starting point for experimentation. But at production scale, architectural choices made early compound quickly: in token costs, in correctness risk, and in the engineering effort required to fix what was built in a hurry.