Meet CARB: the Collections and Accounts Receivable Benchmark

Jason Jho
Chief Architect
TL;DR: CARB (Collections and Accounts Receivable Benchmark) is a 168-task suite that measures how reliably AI agents handle real B2B finance work like cash application, collections, and deductions, using 969 strict pass/fail criteria graded against synthetic company ledgers. In baseline results, the two strongest models pass 83 to 87 percent of sampled tasks on the first attempt with tool access, and the per-family breakdown shows which AR workflows are production-ready today and which still need human review.

We’re introducing CARB, the Collections and Accounts Receivable Benchmark, a task suite for evaluating AI agents on finance back-office work.

CARB v0.1 includes 168 tasks across 58 fictional company ledgers, graded by 969 binary pass/fail criteria. It covers six AR work families: cash application, collections, communications, promise extraction, deductions, and AR analytics.

CARB helps finance teams see which order-to-cash tasks agents can handle today, which need human review, and which should remain manual until they pass organization-specific evaluations.

Why we built this

For the past year, we’ve built agents that apply cash, work collections queues, handle disputes, adjudicate deductions, and answer customer messages. In production, those systems make hundreds of thousands of LLM calls in a typical week.

No finance team can manually review work at that volume. Agents need objective grading before they can handle finance workflows at scale.

Walkthroughs miss failures that only appear against a ledger, after cash has already been posted or a customer has been contacted.

AR is well suited for this kind of benchmark because much of the work has a verifiable answer. A payment allocation either reconciles or fails to reconcile. A collections worklist either follows policy or violates it. A DSO calculation is either correct or incorrect.

What CARB measures

CARB tests whether an agent can apply payments to the cent, reject transactions that should not be posted, work collections without contacting the wrong customer, answer messages without inventing numbers, extract promises correctly, adjudicate deductions against policy, and calculate AR metrics from the ledger.

A task passes only when every required criterion passes. There is no partial credit. A cash application task that gets nine of ten allocations right fails because a finance leader would not partially post cash.
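
The all-or-nothing rule can be sketched in a few lines. This is a minimal illustration, not CARB's actual grader: the `CriterionResult` type, the criterion names, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    name: str     # hypothetical criterion label
    passed: bool  # binary outcome; there are no partial scores


def task_passes(results: list[CriterionResult]) -> bool:
    """A task passes only if it has criteria and every one of them passed."""
    return bool(results) and all(r.passed for r in results)


def pass_rate(tasks: dict[str, list[CriterionResult]]) -> float:
    """Share of tasks that pass outright, not the share of criteria that pass."""
    if not tasks:
        return 0.0
    return sum(task_passes(results) for results in tasks.values()) / len(tasks)


# Nine of ten allocations correct still fails the task.
allocations = [CriterionResult(f"allocation_{i}", True) for i in range(9)]
allocations.append(CriterionResult("allocation_9", False))
nine_of_ten = task_passes(allocations)  # False
```

Scoring at the task level rather than the criterion level is what keeps a 90%-correct posting from counting as 90% of a pass.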

Most grading is deterministic. Of the 969 criteria, 940 are checked in code against the ledger and answer key. The remaining 29 are judgment-based checks, such as whether a customer reply answers the question or accidentally concedes a discount, and are graded by a fixed LLM judge with strict settings. We also report judge error alongside the results.

What a task looks like

The agent sees a bank transaction for $8,041.74 from Copperline Building Products. Somewhere in the document pile is this remittance advice:

COPPERLINE BUILDING PRODUCTS INC
REMITTANCE ADVICE
--------------------------------------------------------
Document            Gross          Paid
--------------------------------------------------------
0300075502        $6,095.45      $6,095.45
300075510         $2,300.00      $1,946.29
 less program fee per vendor agreement
--------------------------------------------------------
TOTAL PAID                       $8,041.74

To pass, the agent has to notice that 300075510 refers to invoice 0300075510 with a leading zero dropped. It has to allocate the $1,946.29 paid amount, ignore decoy invoices, confirm the $8,041.74 transaction ties out, and classify the missing $353.71 as a short-payment deduction.

Seven criteria check those details. Miss one, and the task fails.
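
The Copperline example can be worked through in code. The sketch below is one possible way to do the matching, under stated assumptions: invoice numbers are treated as 10-digit, zero-padded strings, the decoy invoice `0300075511` is invented for illustration, and `apply_remittance` is a hypothetical helper rather than part of CARB.

```python
from decimal import Decimal

INVOICE_WIDTH = 10  # assumption: ledger invoice numbers are 10 digits, zero-padded


def normalize_invoice_id(raw: str) -> str:
    """Restore leading zeros that the payer dropped from a document number."""
    return raw.strip().zfill(INVOICE_WIDTH)


def apply_remittance(bank_amount, lines, open_invoices):
    """Allocate paid amounts, record short-pays, and require an exact tie-out."""
    allocations, deductions = {}, {}
    for document, gross, paid in lines:
        invoice_id = normalize_invoice_id(document)
        if invoice_id not in open_invoices:
            raise ValueError(f"no open invoice matches {document!r}")
        allocations[invoice_id] = paid          # allocate what was paid, not gross
        if paid < gross:
            deductions[invoice_id] = gross - paid  # short-payment deduction
    if sum(allocations.values()) != bank_amount:
        raise ValueError("allocations do not tie out to the bank transaction")
    return allocations, deductions


open_invoices = {
    "0300075502": Decimal("6095.45"),
    "0300075510": Decimal("2300.00"),
    "0300075511": Decimal("1946.29"),  # hypothetical decoy: amount matches, never referenced
}
lines = [
    ("0300075502", Decimal("6095.45"), Decimal("6095.45")),
    ("300075510", Decimal("2300.00"), Decimal("1946.29")),
]
allocations, deductions = apply_remittance(Decimal("8041.74"), lines, open_invoices)
# deductions == {"0300075510": Decimal("353.71")}
```

Using `Decimal` rather than floats keeps the tie-out exact to the cent, and matching on the referenced document number rather than on amount is what keeps the decoy out of the allocation.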

Dropped leading zeroes, gross-versus-paid mismatches, short-pay notes, and ambiguous remittances are common cash application work.

CARB work families

Family               Tasks   Work product
--------------------------------------------------------------
Cash application        60   Apply or reject one bank transaction
Collections             18   Build the day’s worklist
Communications          36   Classify and draft grounded replies
Promise extraction      24   Extract promise-to-pay data
Deductions              15   Adjudicate short-pay claims
AR analytics            15   Answer ledger questions

A quarter of the 60 cash application tasks are designed to be rejected. Posting the wrong transaction is a finance error, even when the agent found a plausible match.

How we made it auditable

Everything in CARB is synthetic: company names, customers, invoices, payments, documents, and policies. Production data shaped the distributions for payment sizes, invoice counts, lateness, deduction reasons, inbound languages, and document types.

CARB is publishable, auditable, and regenerable because it contains no customer data. The answer key is exact because the generator records the truth as it builds each world. If the pinned v0.1 set leaks into training data, we can generate a statistically similar unseen version from fresh seeds.
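
To make the "generator records the truth" idea concrete, here is a toy sketch of a seeded world generator that writes its answer key as it builds the data. It is an assumption-laden illustration, not the CARB generator: `generate_world`, the invoice ID format, and the amount range are all hypothetical.

```python
import random
from decimal import Decimal


def generate_world(seed: int, n_invoices: int = 5):
    """Build a tiny synthetic ledger and its answer key from a single seed."""
    rng = random.Random(seed)  # isolated, seeded RNG: same seed, same world
    invoices = []
    for i in range(n_invoices):
        amount = Decimal(rng.randint(10_000, 1_000_000)) / 100  # cents to dollars
        invoices.append({"id": f"INV-{seed}-{i:04d}", "amount": amount})

    # The generator chooses which invoice the payment settles, so the correct
    # allocation is known exactly and never has to be inferred afterwards.
    paid = rng.choice(invoices)
    world = {"invoices": invoices, "payment": {"amount": paid["amount"]}}
    answer_key = {"invoice_id": paid["id"], "amount": paid["amount"]}
    return world, answer_key


pinned_world, pinned_key = generate_world(seed=1)
fresh_world, fresh_key = generate_world(seed=2)  # unseen variant from a new seed
```

Re-running with the same seed reproduces the pinned set for audits, while a new seed yields a different world drawn from the same distributions.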

We also tested the benchmark itself. Replaying known-correct answers caught generator bugs and spec issues. A reject-everything agent scored 6.0%, which gives us a floor. Human and adversarial review found places where the instructions or grading needed to be tightened.

Two ways to run CARB

CARB runs each task in two modes. In single-shot mode, the model gets the whole world in the prompt. In tool mode, the agent gets only the request and has to use a read-only SQL ledger, document fetches, and a policy binder.

Tool mode is closer to real AR work. Analysts query the ledger, pull the remittance, check policy, and then act. Tool mode also helps identify the bottleneck. Models that improve with tools were constrained by access. Models that do not improve are more likely failing on reasoning or policy-following.
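
A read-only ledger tool of the kind described above might look like the following sketch. The schema, the `query_ledger` name, and the use of an in-memory SQLite database are assumptions for illustration; the source does not describe how CARB implements its SQL tool.

```python
import sqlite3


def build_ledger() -> sqlite3.Connection:
    """Create a tiny in-memory ledger (hypothetical schema) and lock it for reads."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id TEXT PRIMARY KEY, customer TEXT, open_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [
            ("0300075502", "Copperline Building Products", 609545),
            ("0300075510", "Copperline Building Products", 230000),
        ],
    )
    conn.commit()
    conn.execute("PRAGMA query_only = ON")  # SQLite-level guard against writes
    return conn


def query_ledger(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Tool entry point exposed to the agent: SELECT statements only."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("ledger tool is read-only: only SELECT is allowed")
    return conn.execute(sql, params).fetchall()


conn = build_ledger()
rows = query_ledger(
    conn, "SELECT id, open_cents FROM invoices WHERE id = ?", ("0300075510",)
)
```

Keeping the tool read-only means the agent can investigate freely while the only graded side effect remains its final answer.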

Baseline results

We ran four models across three providers in both modes on the same stratified 60-task sample. pass¹ is the pass rate on one attempt. pass² is the share of tasks the model passes on both of two attempts, which measures consistency.
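
The two metrics can be computed as in the sketch below. It assumes pass¹ is taken from the first of the two attempts, which the definition above does not specify, and `pass_metrics` is a hypothetical helper name.

```python
def pass_metrics(attempts: dict[str, tuple[bool, bool]]) -> tuple[float, float]:
    """Return (pass1, pass2) from per-task outcomes of two independent attempts.

    pass1: share of tasks passed on the first attempt (assumption: first attempt).
    pass2: share of tasks passed on both attempts.
    """
    if not attempts:
        raise ValueError("no tasks to score")
    n = len(attempts)
    pass1 = sum(first for first, _ in attempts.values()) / n
    pass2 = sum(first and second for first, second in attempts.values()) / n
    return pass1, pass2


# Four illustrative tasks: only task_a passes on both attempts.
outcomes = {
    "task_a": (True, True),
    "task_b": (True, False),
    "task_c": (False, True),
    "task_d": (False, False),
}
pass1, pass2 = pass_metrics(outcomes)  # (0.5, 0.25)
```

Because a task counts toward pass² only when both attempts succeed, pass² can never exceed pass¹, and the gap between them is a rough measure of run-to-run inconsistency.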

CARB baseline results

Model               Mode          pass¹    pass²    Latency p50
----------------------------------------------------------------
Claude Sonnet 4.6   tools         86.7%    78.3%    22.5s
GPT-5.2             tools         83.3%    71.7%    8.9s
Claude Haiku 4.5    tools         65.0%    60.0%    12.3s
Gemini 3 Flash      tools         61.7%    55.0%    33.9s
Claude Sonnet 4.6   single-shot   68.3%    66.7%    8.2s
Gemini 3 Flash      single-shot   60.0%    56.7%    8.4s
GPT-5.2             single-shot   51.7%    43.3%    2.3s
Claude Haiku 4.5    single-shot   48.3%    48.3%    2.1s

The results show where production systems need routing, tools, and review.

Tools changed the task ceiling for some families. AR analytics reached 100% for the best models with SQL access after failing in single-shot mode. Collections still depended on whether the model could apply written policy correctly.

Agent performance differed from single-shot performance. GPT-5.2 improved from 51.7% in single-shot mode to 83.3% in tool mode. Its tool-mode pass² score was 71.7%, 11.6 points below its pass¹, so some of those passes did not repeat on a second attempt.

No model led every family. That supports routed production systems that use cheaper models where they are reliable and stronger models where policy reasoning or ambiguity drives the failure rate.
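
One simple routing policy consistent with that idea is to send each family to the cheapest model that clears a pass-rate threshold, and to human review when none does. The sketch below is an assumption, not Stuut's production router: the model names, the pass rates, and the threshold are placeholders and are not CARB results.

```python
from typing import Optional


def route_by_family(
    pass_rates: dict[str, dict[str, float]],
    cost_order: list[str],
    threshold: float,
) -> dict[str, Optional[str]]:
    """Pick, per family, the cheapest model meeting the threshold.

    cost_order lists models from cheapest to most expensive.
    A value of None means no model qualifies, so the family goes to human review.
    """
    routes: dict[str, Optional[str]] = {}
    for family, by_model in pass_rates.items():
        eligible = [m for m in cost_order if by_model.get(m, 0.0) >= threshold]
        routes[family] = eligible[0] if eligible else None
    return routes


# Placeholder numbers for illustration only.
pass_rates = {
    "promise_extraction": {"small-model": 1.00, "large-model": 1.00},
    "collections": {"small-model": 0.30, "large-model": 0.90},
    "deductions": {"small-model": 0.60, "large-model": 0.70},
}
routes = route_by_family(
    pass_rates, cost_order=["small-model", "large-model"], threshold=0.90
)
```

With these placeholder inputs, promise extraction goes to the cheaper model, collections to the stronger one, and deductions to human review.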

What this means for finance teams

CARB gives finance teams a concrete way to decide which agent workflows are ready for production.

In the 60-task sample, promise extraction passed at 100% for all four models in tool mode, and cash application passed at 80% to 90%. Collections varied more by model: Claude Sonnet 4.6 reached 90%, while the other three models scored 0% to 30%. Collections should stay under closer human review until each organization’s own evaluations show acceptable error rates.

The tool-mode results show what agents need in production: access to the systems where the truth lives, including the ledger, policies, documents, customer history, and payment records. Larger prompts did not produce the same results as tool access.

CARB v0.1 has four important limits. Documents are clean text rather than scans or OCR output. Promises do not come due yet. Collections actions do not change customer behavior. Per-family samples are small enough that scores should be read as directional.

Next, we want to add messier documents, longer-horizon grading, and continuous re-runs as models change.

We intend to release CARB as open source so finance teams, agent builders, and model providers can run it, audit the graders, and add task families.

If you work on AR operations, finance back-office agents, or frontier-model evaluation, we would love to hear from you.

Over the past two decades, I have built zero-to-one B2B and B2C products across various industries. I am now Chief Architect at Stuut, where I’m building AI-powered agents for accounts receivable.

Frequently asked questions about DSO

Is a higher or lower DSO better?
Lower is better because it means cash reaches your account faster. A DSO of 35 days is better than 55 days if your payment terms are the same.
Does DSO include current AR?
Yes. DSO reflects the total dollar amount you're owed from outstanding invoices, including invoices that aren't yet due.
How does bad debt affect DSO?
Writing off bad debt reduces your AR balance, which artificially lowers DSO even though no cash was collected. Ensure your AR figure is net of bad debt reserves for accurate measurement.
Should I calculate DSO monthly or annually?
Both. Annual DSO tracks long-term trends, while monthly DSO helps you spot process problems quickly and take corrective action before they compound.
What's the difference between DSO and CEI?
DSO measures collection speed in days. CEI measures collection quality as a percentage. A company can have low DSO but poor CEI if they're writing off accounts aggressively.
Can I reduce DSO without upsetting customers?
Yes. Proactive communication before due dates, helpful reminders, and fast dispute resolution improve customer experience while accelerating payment.
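
For reference, the answers above rely on the standard DSO calculation: accounts receivable divided by credit sales for the period, multiplied by the number of days in the period. The sketch below uses invented figures to show the bad-debt effect; the `dso` function name and all amounts are illustrative assumptions.

```python
def dso(accounts_receivable: float, credit_sales: float, days_in_period: int) -> float:
    """Days sales outstanding: AR / credit sales for the period, times days in period."""
    if credit_sales <= 0:
        raise ValueError("credit sales must be positive")
    return accounts_receivable * days_in_period / credit_sales


# Illustrative quarter: $500,000 of AR against $1,000,000 of credit sales over 90 days.
before_write_off = dso(500_000, 1_000_000, 90)  # 45.0 days

# Writing off $50,000 shrinks AR and lowers DSO, although no cash was collected.
after_write_off = dso(450_000, 1_000_000, 90)   # 40.5 days
```

The 4.5-day improvement in this example comes entirely from the write-off, which is why the AR figure should be defined consistently before comparing DSO across periods.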
