

We’re introducing CARB, the Collections and Accounts Receivable Benchmark, a task suite for evaluating AI agents on finance back-office work.
CARB v0.1 includes 168 tasks across 58 fictional company ledgers, graded by 969 binary pass/fail criteria. It covers six AR work families: cash application, collections, communications, promise extraction, deductions, and AR analytics.
CARB helps finance teams see which order-to-cash tasks agents can handle today, which need human review, and which should remain manual until they pass organization-specific evaluations.
For the past year, we’ve built agents that apply cash, work collections queues, handle disputes, adjudicate deductions, and answer customer messages. In production, those systems make hundreds of thousands of LLM calls in a typical week.
No finance team can manually review work at that volume. Agents need objective grading before they can handle finance workflows at scale.
Walkthroughs miss failures that only appear against a ledger, after cash has already been posted or a customer has been contacted.
AR is well suited for this kind of benchmark because much of the work has a verifiable answer. A payment allocation either reconciles or fails to reconcile. A collections worklist either follows policy or violates it. A DSO calculation is either correct or incorrect.
CARB tests whether an agent can apply payments to the cent, reject transactions that should not be posted, work collections without contacting the wrong customer, answer messages without inventing numbers, extract promises correctly, adjudicate deductions against policy, and calculate AR metrics from the ledger.
A task passes only when every required criterion passes. There is no partial credit. A cash application task that gets nine of ten allocations right fails because a finance leader would not partially post cash.
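The all-or-nothing rule can be sketched in a few lines of Python. This is an illustrative sketch, not CARB's actual grader: the `CriterionResult` type, the `task_passes` function, and the criterion names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    name: str      # hypothetical label for one binary check
    passed: bool   # True if the check passed


def task_passes(results):
    """Return True only if there is at least one criterion and all passed."""
    return len(results) > 0 and all(r.passed for r in results)


# Nine of ten allocations right is still a failed task.
results = [CriterionResult(f"allocation_{i}", True) for i in range(9)]
results.append(CriterionResult("allocation_9", False))
print(task_passes(results))  # False
```

Treating an empty criteria list as a failure is a design choice made for this sketch, so that a task with no checks cannot pass by default.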
Most grading is deterministic: 940 of the 969 criteria are checked in code against the ledger and answer key. The remaining 29 are judgment-based checks, such as whether a customer reply answers the question or accidentally concedes a discount, and are graded by a fixed LLM judge with strict settings. We also report judge error alongside the results.
The agent sees a bank transaction for $8,041.74 from Copperline Building Products. Somewhere in the document pile is this remittance advice:
COPPERLINE BUILDING PRODUCTS INC
REMITTANCE ADVICE
--------------------------------------------------------
Document          Gross            Paid
--------------------------------------------------------
0300075502        $6,095.45        $6,095.45
300075510         $2,300.00        $1,946.29
    less program fee per vendor agreement
--------------------------------------------------------
TOTAL PAID                         $8,041.74
To pass, the agent has to notice that 300075510 refers to invoice 0300075510 with a leading zero dropped. It has to allocate the $1,946.29 paid amount rather than the $2,300.00 gross, ignore decoy invoices, confirm that the allocations tie out to the $8,041.74 transaction, and classify the missing $353.71 ($2,300.00 gross less $1,946.29 paid) as a short-payment deduction.
Seven criteria check those details. Miss one, and the task fails.
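The arithmetic behind this example can be sketched as follows. This is a minimal illustration under stated assumptions, not CARB's grader or a production matcher: the 10-digit invoice width is inferred from the sample IDs, the decoy invoice 0300075519 is invented for the sketch, and the names `normalize_invoice_id`, `open_invoices`, and `remittance_lines` are hypothetical.

```python
from decimal import Decimal


def normalize_invoice_id(raw, width=10):
    """Left-pad a document number with zeros (width is an assumption)."""
    return raw.strip().zfill(width)


# Open invoices on the ledger. The third entry is an invented decoy whose
# open amount happens to equal the short-paid amount.
open_invoices = {
    "0300075502": Decimal("6095.45"),
    "0300075510": Decimal("2300.00"),
    "0300075519": Decimal("1946.29"),
}

# (document as printed, gross, paid) from the remittance advice above.
remittance_lines = [
    ("0300075502", Decimal("6095.45"), Decimal("6095.45")),
    ("300075510", Decimal("2300.00"), Decimal("1946.29")),
]
bank_amount = Decimal("8041.74")

allocations = {}
deductions = {}
for document, gross, paid in remittance_lines:
    invoice_id = normalize_invoice_id(document)
    if invoice_id not in open_invoices:
        raise ValueError(f"no open invoice for document {document!r}")
    allocations[invoice_id] = paid             # apply the paid amount, not gross
    short = open_invoices[invoice_id] - paid
    if short > 0:
        deductions[invoice_id] = short         # short-payment deduction

ties_out = sum(allocations.values(), Decimal("0")) == bank_amount
print(allocations, deductions, ties_out)
```

Using `Decimal` instead of floats keeps every comparison exact to the cent, which matters when a one-cent mismatch fails the task.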
Dropped leading zeroes, gross-versus-paid mismatches, short-pay notes, and ambiguous remittances are common in cash application work.
A quarter of the 60 cash application tasks are designed to be rejected. Posting the wrong transaction is a finance error, even when the agent found a plausible match.
Everything in CARB is synthetic: company names, customers, invoices, payments, documents, and policies. Production data shaped the distributions for payment sizes, invoice counts, lateness, deduction reasons, inbound languages, and document types.
CARB is publishable, auditable, and regenerable because it contains no customer data. The answer key is exact because the generator records the truth as it builds each world. If the pinned v0.1 set leaks into training data, we can generate a statistically similar unseen version from fresh seeds.
We also tested the benchmark itself. Replaying known-correct answers caught generator bugs and spec issues. A reject-everything agent scored 6.0%, which gives us a floor. Human and adversarial review found places where the instructions or grading needed to be tightened.
CARB runs each task in two modes. In single-shot mode, the model gets the whole world in the prompt. In tool mode, the agent gets only the request and has to use a read-only SQL ledger, document fetches, and a policy binder.
Tool mode is closer to real AR work. Analysts query the ledger, pull the remittance, check policy, and then act. Tool mode also helps identify the bottleneck. Models that improve with tools were constrained by access. Models that do not improve are more likely failing on reasoning or policy-following.
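As a rough picture of what a read-only SQL ledger tool might look like, here is a sketch using Python's built-in `sqlite3` module. The table name, columns, and the `run_readonly_query` helper are assumptions made for illustration; the post does not describe CARB's actual schema or tool interface.

```python
import sqlite3

# Build a tiny in-memory ledger. The schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "invoice_id TEXT PRIMARY KEY, customer TEXT, open_amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("0300075502", "Copperline Building Products", 609545),
        ("0300075510", "Copperline Building Products", 230000),
    ],
)
conn.commit()

# From here on, the connection rejects statements that modify the database.
conn.execute("PRAGMA query_only = ON")


def run_readonly_query(sql, params=()):
    """Hypothetical tool entry point: run a query and return all rows."""
    return conn.execute(sql, params).fetchall()


rows = run_readonly_query(
    "SELECT invoice_id, open_amount_cents FROM invoices "
    "WHERE customer = ? ORDER BY invoice_id",
    ("Copperline Building Products",),
)
print(rows)

# A write attempt through the same connection is refused.
try:
    conn.execute("UPDATE invoices SET open_amount_cents = 0")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
print(write_blocked)
```

Keeping the ledger read-only means an agent can investigate freely while every posting decision still has to be returned as an answer that the grader checks.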
We ran four models across three providers in both modes on the same stratified sample. pass¹ is the pass rate on one attempt. pass² is the share of tasks the model passes on both of two attempts, which measures consistency.
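The two metrics can be computed as in the sketch below. It assumes each task is attempted twice and that pass¹ is taken from the first attempt; the post does not say which attempt is used, so treat that as an assumption. The function name and task IDs are hypothetical.

```python
def pass_rates(attempts):
    """Compute (pass1, pass2) from per-task results.

    attempts maps a task id to (first_attempt_passed, second_attempt_passed).
    pass1 is the share of tasks passed on the first attempt (an assumption).
    pass2 is the share of tasks passed on both attempts.
    """
    n = len(attempts)
    if n == 0:
        raise ValueError("no tasks to score")
    pass1 = sum(1 for first, _ in attempts.values() if first) / n
    pass2 = sum(1 for first, second in attempts.values() if first and second) / n
    return pass1, pass2


attempts = {
    "task_a": (True, True),
    "task_b": (True, False),
    "task_c": (False, True),
    "task_d": (False, False),
}
pass1, pass2 = pass_rates(attempts)
print(pass1, pass2)  # 0.5 0.25
```

Because pass² requires success on both attempts, it can never exceed pass¹ under this definition, and the gap between the two indicates how often a model passes a task only some of the time.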
The results show where production systems need routing, tools, and review.
Tools changed the task ceiling for some families. AR analytics reached 100% for the best models with SQL access after failing in single-shot mode. Collections still depended on whether the model could apply written policy correctly.
Agent performance differed from single-shot performance. GPT-5.2 improved from 51.7% in single-shot mode to 83.3% in tool mode, while its pass² score in tool mode was 71.7%.
No model led every family. That supports routed production systems that use cheaper models where they are reliable and stronger models where policy reasoning or ambiguity drives the failure rate.
CARB gives finance teams a concrete way to decide which agent workflows are ready for production.
In the 60-task sample, promise extraction passed at 100% for all four models in tool mode, and cash application passed at 80% to 90%. Collections varied more by model: Sonnet reached 90%, while the other three models scored 0% to 30%. Collections should stay under closer human review until each organization’s own evaluations show acceptable error rates.
The tool-mode results show what agents need in production: access to the systems where the truth lives, including the ledger, policies, documents, customer history, and payment records. Larger prompts did not produce the same results as tool access.
CARB v0.1 has four important limits. Documents are clean text rather than scans or OCR output. Promises do not come due yet. Collections actions do not change customer behavior. Per-family samples are small enough that scores should be read as directional.
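To see why small per-family samples are only directional, consider a Wilson score interval for a pass rate. The sketch below uses a hypothetical 9 passes out of 10 tasks; the post does not state per-family sample sizes, so those numbers are assumptions chosen for illustration.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate (z=1.96)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical family result: 9 of 10 tasks passed.
low, high = wilson_interval(9, 10)
print(f"{low:.2f} to {high:.2f}")
```

With only ten tasks, a 90% observed pass rate is compatible with a true rate anywhere from roughly 60% to the high 90s, which is why differences of a task or two between models should not be over-read.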
Next, we want to add messier documents, longer-horizon grading, and continuous re-runs as models change.
We intend to release CARB as open source so finance teams, agent builders, and model providers can run it, audit the graders, and add task families.
If you work on AR operations, finance back-office agents, or frontier-model evaluation, we would love to hear from you.
