Reliable. Scalable. Accurate. Benchmarking the Delos One engine for AI adjudication.

TL;DR

  • What are we publishing? We are unveiling details of our Delos One AI engine along with a set of benchmarks comparing its accuracy in solving legal problems against standalone Large Language Models.
  • What is Delos One? It’s the agentic architecture at the core of our platform, combining AI and logic programming to adjudicate legal cases. Delos One can solve Computational Legal Clauses with 100% accuracy.
  • What are Computational Legal Clauses? These are structured, rule-based legal provisions often found in complex contracts - like energy pricing formulas, diesel surcharges, tiered insurance policies, or loan covenants (e.g., "If destination is X, price is $2.50 per mile; if Y, $2.70").
  • Why is it important? These clauses are central to many legal disputes. To build trustworthy AI for legal adjudication, precision in interpreting and applying them is essential.
  • What does Delos do? Delos is a clearinghouse for B2B contracts. Users upload contracts and submit facts about specific events (e.g., deliveries, services). Once both sides input their data, Delos’ AI adjudicates, applying the contract’s terms to the facts: "Party A owes Party B", "the price should be X", etc. Delos acts as an AI arbitrator for micro-disputes.
  • Why did we build Delos One? Because accurate adjudication requires an engine that can reliably and deterministically solve the most complex contractual problems. That’s what Delos One delivers.

intended audience

Section 1 of this blog provides a business overview of the problem and the Delos One architecture. It is aimed at a business-oriented audience. Section 2 is more technical and goes into the specific benchmarks we've run. It's primarily intended for technical leaders. We have tried, however, to keep it fairly straightforward so that non-technical readers can follow it too.

Some jargon used across the blog will be technical and relate to terms specific to trucking as well as energy contracts. We will be explaining those along the way.

introduction

Contracts, from straightforward purchase agreements to intricate structured financing deals, frequently contain complex clauses defining mutual rights and obligations. Often, these contractual clauses exceed the complexity of the underlying business transaction. Correctly interpreting and applying those clauses usually requires a sophisticated blend of:

  • mathematical calculations (e.g., complex pricing formulas)
  • logical condition evaluation (e.g., cancellation deadlines)
  • contextual language interpretation (e.g., interpreting informal cancellation requests).

We call such clauses Computational Legal Clauses. They are widespread, appearing in sectors like energy contracts, logistics, insurance, tax law, healthcare, and loan agreements. Solving them poses substantial challenges due to their interdisciplinary nature, difficulties in automation, and the dispersed nature of required data across various documents and sources.

Given these complexities, AI emerges naturally as a potential solution. However, Large Language Models have significant shortcomings when it comes to their accuracy on tasks requiring math and logic.

In this blog we're happy to unveil details of our neuro-symbolic AI architecture called Delos One - a blend of agentic AI with logic programming that aims to solve Computational Legal Clauses with 100% accuracy.

1. what is Delos One and why build it?

1.1 Computational Legal Clauses and Where to Find Them

From a simple sales and purchase term sheet to a complex structured debt agreement – contracts contain complex terms and clauses to express the mutual rights and obligations of the parties. The complexity of those clauses often exceeds the complexity of the business transaction in question.

For example, a transportation agreement might cover a seemingly straightforward relationship: Party A will carry goods for Party B. However, the pricing structure of such an agreement might contain over 100 pricing combinations to choose from. Rates can vary for each origin and destination that the goods will be transported from/to. They can also increase or decrease depending on the container size. To protect from potential diesel price hikes, such contracts will contain so-called “diesel surcharges” – tied to the diesel price on a given day or week.

We call such problems Computational Legal Clauses.

Although they are legal in nature, solving them requires a mix of:

  • mathematical calculations (e.g. price = 0.87 × (0.85 × Commodity_Index + 0.15 × TTF) + α - β + Diesel_Surcharge)
  • evaluation of logical conditions (e.g. “if delivery was cancelled less than 24 hours ahead of planned delivery date, a termination clause will apply”)
  • understanding the meaning of words from the context of a document (e.g. is a client emailing to say they “don’t want the delivery anymore” equal to cancellation?)
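
To make the first bullet concrete, here is a minimal Python sketch of that pricing formula. The function name and every input value are hypothetical and chosen only for illustration. `Decimal` is used so that the arithmetic stays exact instead of being subject to binary floating-point rounding.

```python
from decimal import Decimal


def contract_price(commodity_index, ttf, alpha, beta, diesel_surcharge):
    """price = 0.87 x (0.85 x Commodity_Index + 0.15 x TTF) + alpha - beta + Diesel_Surcharge"""
    blended = Decimal("0.85") * commodity_index + Decimal("0.15") * ttf
    return Decimal("0.87") * blended + alpha - beta + diesel_surcharge


# Hypothetical inputs, picked only to keep the arithmetic easy to follow.
price = contract_price(
    commodity_index=Decimal("10.00"),
    ttf=Decimal("20.00"),
    alpha=Decimal("1.50"),
    beta=Decimal("0.50"),
    diesel_surcharge=Decimal("0.25"),
)
print(price)
```

With these made-up inputs the blended index is 11.50, the scaled value is 10.005, and the final price is 11.255. Each intermediate value can be checked by hand, which is the property a deterministic engine needs.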

Computational Legal Clauses are abundant in contracts and across the legal space. Some examples include:

  • pricing structures in energy and commodity contracts
  • logistics pricing tables
  • diesel surcharges in trucking agreements
  • insurance policies
  • tax law
  • healthcare Fee-For-Service agreements
  • debt covenants in loan agreements

An example of a Computational Legal Clause is a fuel/diesel surcharge found in trucking and logistics contracts.

A fuel surcharge is "An extra fee, determined as a percentage of the base rate, charged by transport companies to allow for the fluctuating costs of fuel. It is intended to protect the carrier if the price of fuel rises during transport."

Oftentimes those rates will have a granular structure covering 20-3 possible ranges of what the diesel index can be at a given time. The logic is this: if the diesel index published by the U.S. Energy Information Administration is between $2-3 apply an extra 5% to the base price of the contract.

Here's an example of a diesel surcharge clause we're using in our benchmarks:

The applicable percentage fuel surcharge (“FSC”) is tied to the U.S. DOE Gulf‑Coast diesel index. Percentages in the table below are applied to all billable charges that consume fuel. Index values are reviewed weekly; the FSC in effect on the date of service applies.

Fuel Price Low | Fuel Price High | FSC % | Fuel Price Low | Fuel Price High | FSC %
2.00 | 2.099 | 1% | 4.50 | 4.599 | 27%
... | ... | ... | ... | ... | ...
4.10 | 4.199 | 22% | 6.60 | 6.699 | 48%
4.20 | 4.299 | 23% | | |
4.30 | 4.399 | 24% | | |
4.40 | 4.499 | 25% | | |

Over $6.70 per gallon – add 1 % for every $0.10 increment.
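
A tiered table like this is, in effect, a range lookup followed by a multiplication. The sketch below is one way to express it in Python, under stated assumptions: it contains only the rows visible in the excerpt above (the elided rows are deliberately left out), it treats both bounds of a tier as inclusive, it rounds the result to cents, and it does not attempt the "over $6.70" rule because the excerpt does not say how partial increments are handled. Function and variable names are ours, not part of any contract or of Delos One.

```python
from decimal import Decimal

# (low, high, surcharge fraction) for the rows visible in the excerpt only.
# Elided rows are intentionally NOT filled in.
FSC_ROWS = [
    (Decimal("2.00"), Decimal("2.099"), Decimal("0.01")),
    (Decimal("4.10"), Decimal("4.199"), Decimal("0.22")),
    (Decimal("4.20"), Decimal("4.299"), Decimal("0.23")),
    (Decimal("4.30"), Decimal("4.399"), Decimal("0.24")),
    (Decimal("4.40"), Decimal("4.499"), Decimal("0.25")),
    (Decimal("4.50"), Decimal("4.599"), Decimal("0.27")),
    (Decimal("6.60"), Decimal("6.699"), Decimal("0.48")),
]


def fsc_percentage(index_value):
    """Return the surcharge fraction for a diesel index value, or fail loudly."""
    for low, high, pct in FSC_ROWS:
        if low <= index_value <= high:  # assumption: both bounds inclusive
            return pct
    raise LookupError(f"no surcharge tier covers index value {index_value}")


def diesel_surcharge(base_charge, index_value):
    """Surcharge in dollars, rounded to cents (rounding rule is an assumption)."""
    return (base_charge * fsc_percentage(index_value)).quantize(Decimal("0.01"))


print(diesel_surcharge(Decimal("1000.00"), Decimal("4.25")))
```

Raising an error for an uncovered value is a deliberate choice in this sketch: an index that falls between two printed tiers, or in a row that is not encoded, is reported instead of being silently mapped to a neighbouring tier.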

Solving those problems is complex for multiple reasons.

First, because it may require a mix of domain expertise (medical, energy, trucking), legal knowledge and accounting.

Second, because building software automations around those problems is almost impossible. Terms can easily vary from contract to contract. A well-optimized billing automation system becomes useless if one of your clients asks tomorrow for a special volume-based discount or multi-tiered pricing that your system does not account for.

Third, the inputs needed to solve any such problem can be spread across multiple documents and data sources. Applying the correct rate for one shipment usually requires reconciling data across emails, shipping documents and outside indexes to make the correct price calculation. The final price can also depend on metered readouts of kWh or LNG/gallon usage. Reconciling all those datapoints across 1,000 shipments every month, and tying them to the terms of each contract, becomes a challenge that scales exponentially.

And so, given this complexity – AI seems like a natural candidate to help.

1.2 AI is not all you need

Large Language Models (LLMs) on their own fail to accurately solve such problems. There are multiple reasons for this.

First, such problems are simply not what LLMs are good at. LLMs struggle with tasks that require accurate logic, math and language reasoning at the same time, and legal contractual clauses present exactly such challenges. Even if the mathematical component were handled by a tool/function call, the LLM would still need to handle the logic accurately. Contractual logic comes in different shapes and forms. Sometimes it requires evaluating conditional thresholds with tiered pricing logic, which demands precise analysis of numerical values. For instance, a contract might say that a particular pricing tier applies if a given index value falls between 2.000 and 2.099, while the next tier starts at 2.100 and ends at 2.199. Finally, contracts contain so-called chained clause interpretation problems: the application of one clause and its particular calculation depends on combining conditions scattered across multiple sections (e.g., fuel price, loaded miles, document verification).

Second, contracts themselves are not clearly structured, which adds further complexity. One clause related to pricing could be spread out across multiple pages. Conversely, one paragraph could contain multiple clauses. Even a simple take-or-pay agreement can exceed 50 pages, with obligations scattered across annexes (e.g., “see Section 12.3(b)(ii)”) and shared definitions. An LLM must track parties, dates, and liquidated damages thresholds across the entire context window. Without efficient mechanisms for doing so, dependencies vanish and wrong obligations are inferred.

Third, contractual language can be complex and imprecise. Companies will often use synonyms or industry-specific acronyms. Inaccurate phrasing is also a common occurrence. A formulation like “a flat rate of $2 per gallon” will lead the LLM to apply a flat fee of $2 for transportation, and not $2 per mile.

1.3 Delos One – Blending AI with Logic Programming to Achieve 100% Accuracy on Legal Tasks

Our overarching goal is to build a platform that can adjudicate legal cases between parties – playing the role of a clearinghouse, adjudicator, or arbitrator. Think of a mini-court where parties can upload their contract and then submit information on particular events (shipments, services performed, goods delivered etc.). Once both parties have submitted their facts – Delos adjudicates the matter. That is, it applies the legal terms of the contract to the particular details of the event: “the price should be X”, “this procedure is covered by the policy”, “Party A should pay Party B” etc. Our current focus is on solving what we call “mini-disputes” – small disagreements between companies that are not significant enough to land in court, but important enough that they lead to hours of reconciliation, delayed cashflow and damaged relationships.

Effectively, we want to offer companies a fast, cheap and accurate way to adjudicate disagreements and disputes stemming from their contractual dealings. In the long run however, we see our technology as a first step towards a future where disputes are handled by AI arbitrators or judges. There are over 5 billion people in the world without access to justice – 1.5 billion of those in developed economies.

In order to offer a realistic path towards AI performing adjudication we needed an architecture that is first and foremost accurate. Not 99.9% accurate, but rather 100% accurate. When calculating a price – if the result should be $9,784.45, then it cannot be $9,700.00 or even $9,784.40. Furthermore, this accuracy must be reliable across all cases solved. Whether running 10 or 10,000 queries – an effective system should always yield accurate results. Second, we needed an architecture that is fast and cheap. When solving a few hundred or thousand cases per month we need an architecture that can solve this at a low cost and high speed – offering an unparalleled efficiency.

Delos One is our agentic architecture that leverages a combination of AI and logic programming to achieve precisely that: 100% accurate results when solving Computational Legal Clauses, at 75% of the latency of standalone LLMs and only 2% of the cost. To do that, we’ve built an architecture that combines Large Language Models with logic programming. In essence, every legal clause we want to adjudicate on is transformed into a special format that we’ve called “DL1”. With DL1 we transform contractual clauses into procedural logic graphs. The DL1 encoding effectively serves as a structured domain model for the Delos One engine.

The process of transforming legal clauses into the DL1 format is separate from the process of adjudicating on them. However – in designing DL1 we focused on its scalability and universality. DL1 can be used to represent the logic of virtually any legal clause.

We tested it internally on contractual clauses such as:

  • commodity/energy pricing (subject of this paper)
  • logistics pricing (subject of this paper)
  • diesel surcharges (subject of this paper)
  • insurance logic
  • loan agreement debt covenants
  • simple pricing agreements

as well as some excerpts of EU regulation or public city ordinances in the US.

The approach we took with DL1 can represent clauses found across a variety of contracts in a simple and scalable way. Simple pricing terms can be represented without over-engineering or extra abstractions. At the same time, the abstractions of DL1 allow complex multi-tiered clauses to be expressed in a similarly simple fashion.

Agreements in DL1 can be created from scratch, i.e. a new contract can be encoded directly in this format (via API or a web editor). However, since that is not convenient for legacy contracts, we designed a proprietary semantic parsing algorithm that converts legacy contractual clauses into the DL1 format. We will be benchmarking it separately in later studies. The entire workflow from legal clause in natural language to adjudication performed by Delos One can be represented in the following way:

[Figure: Delos One workflow]

Finally, to provide visibility into how decisions are made, the Delos One engine traces each reasoning path and produces transparent execution logs, showing precisely how values were calculated. In case of errors or incompatible statements, it shows their cause and their place in the calculation logic.
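
The idea of an execution log can be illustrated with a toy evaluator. This is not DL1 or the Delos One engine, whose internals are not described in this post. It is a hypothetical Python sketch in which each step names the inputs it depends on, records what it computed, and halts with a message when an input is absent. All step names and values are made up.

```python
from decimal import Decimal


def evaluate_with_trace(steps, inputs):
    """Run steps in order and return (values, trace).

    Each step is (name, required_keys, function). Evaluation stops at the
    first step with a missing input; in that case values is None.
    """
    values = dict(inputs)
    trace = []
    for name, required, fn in steps:
        missing = [key for key in required if key not in values]
        if missing:
            trace.append(f"{name}: STOPPED, missing input(s): {', '.join(missing)}")
            return None, trace
        values[name] = fn(values)
        trace.append(f"{name} = {values[name]} (from {', '.join(required)})")
    return values, trace


# Hypothetical three-step calculation: linehaul, surcharge, total.
STEPS = [
    ("linehaul", ["rate_per_mile", "miles"], lambda v: v["rate_per_mile"] * v["miles"]),
    ("surcharge", ["linehaul", "fsc_pct"], lambda v: v["linehaul"] * v["fsc_pct"]),
    ("total", ["linehaul", "surcharge"], lambda v: v["linehaul"] + v["surcharge"]),
]

complete = {"rate_per_mile": Decimal("2.50"), "miles": 348, "fsc_pct": Decimal("0.23")}
values, trace = evaluate_with_trace(STEPS, complete)
for line in trace:
    print(line)

# Same calculation with the surcharge percentage withheld.
incomplete = {"rate_per_mile": Decimal("2.50"), "miles": 348}
no_values, short_trace = evaluate_with_trace(STEPS, incomplete)
print(short_trace[-1])
```

In the second run the evaluator computes the linehaul amount, then stops at the surcharge step and names the missing field, so no dollar figure is produced from incomplete data.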

All in all, we believe Delos One to be the future of providing reliable answers to the most complex legal questions – offering a path forward to adjudication at scale.

2. benchmarking Delos One against LLMs

2.1 overview and data

To benchmark Delos One’s performance we compared its accuracy against that of a RAG pipeline with Claude 3.5 Sonnet on a batch of 111 legal queries. We used Anthropic's latest citation feature to improve the accuracy of the RAG pipeline.

For this test we look at 3 particular legal clauses:

  • an LNG purchase price calculation in a 72-page municipal solicitation outlining multi-year RLNG supply terms including pricing and compliance (64 event queries tested)
  • a transportation rate calculation in a 21-page compilation of U.S. freight agreements between a shipper and a carrier (47 queries)
  • a diesel surcharge calculation in a 17-page logistics agreement detailing freight service provisions such as safety, liability, and invoicing (47 queries)

All three are common occurrences in contracts in the energy supply chain industry.

In real life, the work performed by humans to make this adjudication looks as follows:

  • Step 1: Clause Identification & Dependency Mapping: The operations team begins by manually tracing every external variable—fuel indices, shipment dates, mileage figures, tier thresholds. Those are usually transferred and tracked in a spreadsheet.
  • Step 2: Ad Hoc Data Gathering, Cleaning & Alignment: For every shipment, contract managers collect whatever source material arrives—spreadsheet rate sheets, PDF invoices, index snapshots, even email threads. They convert units, and cross check that each invoice date aligns with the correct index window. Tiered pricing tables or rebate schedules are then applied in a working spreadsheet.
  • Step 3: Review & Sign Off: Any ambiguities or exceptions are transferred to legal or accounting for clarification via email or phone calls.

Once approvals are secured, the reconciled package is finalized and filed.

Steps 2 and 3 have to be repeated for every shipment performed. Step 1 is repeated every time a new contract or a contract amendment is signed.

In our evaluation we focus on result estimation for 3 different tasks:

  • (1) determining and calculating the transportation base rate across 47 shipment events
  • (2) calculating the diesel surcharge for the same 47 shipment events
  • (3) determining the appropriate natural gas price and calculating it across 64 shipment events

Transportation Rate Estimation

The transportation rate is estimated based on a contractual clause and a related pricing exhibit that contain the following formulation:

3.8 Pricing. All pricing - including rates, charges (e.g. fuel surcharges), fees or tariffs for the Vendor’s services - will be as set forth in the Pricing Schedule and will remain firm for the term of this Agreement, unless the Parties mutually agree in writing to amend it. If amended in writing, the rates and fuel surcharges in effect on the date the Bill of Lading was issued will govern. Under no circumstances may the Vendor “gross up” the agreed fee for any taxes, fees, licenses or other charges; the Client need not pay any such amounts and may apply any payments made toward future invoices as a credit. No additional rates, charges, fees or tariffs will be imposed by the Vendor without the Client’s prior written authorization. Any rate or charge for oversized shipments or services not covered by the Pricing Schedule must be documented in writing (e.g., email) between the Parties, and a copy of that special rate agreement shall accompany the Vendor’s invoice as supporting documentation.

Anonymised pricing exhibit example:

Section | Sub‑Section | Description | Rate / Reference
1. Base Rates | 1.1 – Linehaul Tier A | Distance‑based lift service (up to 0.5 ton) | $X.XX / mile + FSC; Exhibit A
 | 1.2 – Linehaul Tier B | Distance‑based lift service (0.5–0.75 ton) | $X.XX / mile + FSC; Exhibit A
2. Local Tariffs | 2.1 – Zone 1 (≤ 45 mi) | Local pick‑up / drop‑off services within 45 miles | Flat fees per equipment type; Exhibit B
 | 2.2 – Zone 2 (45–140 mi) | Local services beyond 45 miles up to 140 miles | Flat fees per equipment type; Exhibit C
3. Ancillary Fees | 3.1 – Backhaul Discount | Return‑trip rate adjustment | 50 % off applicable Linehaul Rate; Exhibit D
 | 3.2 – Additional Stops | Per extra pick‑up/delivery point | $100 / stop
4. Fuel Adjustment | 4.1 – FSC Calculation | Fuel surcharge applied to all mileage‑based charges | Method per Exhibit E (rate locks at BOL date)

The decision “state” for a single delivery run can therefore be defined by:

  • DayRate ∈ {0,1} — whether a day-rate applies (0 = no, 1 = yes)
  • EmptyReturn ∈ {0,1} — whether the return trip is empty (dead-head)
  • ReturnType ∈ {0,1,2} — kind of empty return (0 = none, 1 = tandem-axle, 2 = one-ton truck)
  • BackhaulOpt ∈ {“Yes”, “No”, “Unspecified”} — whether there’s a paying backhaul or it’s not defined
  • MilesDriven ∈ ℝ⁺ — total miles driven (later grouped into discrete buckets)
  • CargoClass ∈ {1…7} — type of load, out of seven standard categories

Taken together, the raw state-space size is 2 × 2 × 3 × 3 × |M| × 7 (where |M| is the number of mile-buckets). After pruning logically impossible combos (e.g. “DayRate” and “EmptyReturn” both true can’t happen under most contracts) and grouping miles into three buckets, you still end up with around 126 discrete states.

In other words, from the contractual text and external inputs the model must:

  1. Classify which of the 126 scenarios applies (based on DayRate, EmptyReturn, ReturnType, BackhaulOpt, miles bucket, CargoClass),
  2. Compute the exact dollar amount using the contract’s formula for that scenario.
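
The raw count can be checked by enumeration. The sketch below builds the Cartesian product of the six variables with three mile buckets and then applies the one pruning rule quoted above (a day rate and an empty return cannot both apply). The further rules that bring the count down to around 126 are not listed in this post, so the sketch does not try to reproduce that number. The bucket labels are placeholders.

```python
from itertools import product

DAY_RATE = (0, 1)
EMPTY_RETURN = (0, 1)
RETURN_TYPE = (0, 1, 2)
BACKHAUL_OPT = ("Yes", "No", "Unspecified")
MILE_BUCKETS = ("bucket_1", "bucket_2", "bucket_3")  # placeholder labels
CARGO_CLASS = tuple(range(1, 8))  # classes 1 through 7

raw_states = list(
    product(DAY_RATE, EMPTY_RETURN, RETURN_TYPE, BACKHAUL_OPT, MILE_BUCKETS, CARGO_CLASS)
)


def is_possible(state):
    """Apply only the single pruning rule mentioned in the text."""
    day_rate, empty_return = state[0], state[1]
    return not (day_rate == 1 and empty_return == 1)


pruned_states = [state for state in raw_states if is_possible(state)]
print(len(raw_states), len(pruned_states))
```

The raw space is 2 × 2 × 3 × 3 × 3 × 7 = 756 states, and this single rule removes a quarter of them, leaving 567. Reaching the figure quoted above requires the additional contract-specific exclusions.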

Diesel Surcharge Estimation

A diesel surcharge (or fuel surcharge) is an additional fee added to freight or transportation costs to offset fluctuations in diesel fuel prices. It helps carriers manage the impact of rising fuel expenses, which can significantly affect their operating costs. The surcharge is usually calculated based on diesel price indexes and is adjusted regularly. Shippers pay this fee on top of base rates, ensuring fair compensation for fuel-related cost increases during transport.

The diesel surcharge is calculated from the transportation base rate and a surcharge table with the following structure:

The applicable percentage fuel surcharge (“FSC”) is tied to the U.S. DOE Gulf‑Coast diesel index. Percentages in the table below are applied to all billable charges that consume fuel. Index values are reviewed weekly; the FSC in effect on the date of service applies.

Fuel Price Low | Fuel Price High | FSC % | Fuel Price Low | Fuel Price High | FSC %
2.00 | 2.099 | 1% | 4.50 | 4.599 | 27%
... | ... | ... | ... | ... | ...
4.10 | 4.199 | 22% | 6.60 | 6.699 | 48%
4.20 | 4.299 | 23% | | |
4.30 | 4.399 | 24% | | |
4.40 | 4.499 | 25% | | |

Over $6.70 per gallon – add 1 % for every $0.10 increment.

Natural Gas (LNG) Price Calculation

For the natural gas price calculation we use gas index values, conversion factors and other inputs aggregated from external data sources, which have to be matched with contractual logic as shown below:

Clause 3.3.1: Fuel Cost Determination All prices for fuel per unit shall be based solely on each month's respective regional gas index, with adjustments reflecting each month's index price.

Clause 5.11: Pricing Methodology The price per unit that the purchaser will pay the supplier is based on the following formula: (Regional Gas Index / Conversion Factor) + Y + R + F, where:

  • Regional Gas Index = average natural gas price for the month of delivery, as published in a recognized market report.
  • Conversion Factor = units per MMBtu of natural gas.
  • Y = supplier's costs and profit per unit delivered.
  • R = fixed rebate amount per unit, representing the purchaser's share of any generated credits.
  • F = freight cost per unit, based on standard delivery volumes.

Attachment A: Price Sheet Contains detailed tables with values for variables Y, R, and F for the contract term.
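
As a rough illustration, the formula in Clause 5.11 can be written as a small Python function. This is a sketch under assumptions: the function name is ours, the numeric inputs are invented (the real values would come from the published index and from Attachment A), and R is added with the sign exactly as the clause states it. No rounding is applied because the excerpt does not specify any.

```python
from decimal import Decimal


def lng_unit_price(regional_gas_index, conversion_factor, y, r, f):
    """(Regional Gas Index / Conversion Factor) + Y + R + F, as written in Clause 5.11."""
    if conversion_factor == 0:
        raise ValueError("conversion factor must be non-zero")
    return regional_gas_index / conversion_factor + y + r + f


# Invented values, for illustration only.
unit_price = lng_unit_price(
    regional_gas_index=Decimal("3.00"),  # price per MMBtu for the delivery month
    conversion_factor=Decimal("12"),     # units per MMBtu
    y=Decimal("0.40"),
    r=Decimal("0.05"),
    f=Decimal("0.10"),
)
print(unit_price)
```

With these inputs the index term is 0.25 per unit and the total is 0.80 per unit. Omitting any one of Y, R or F changes every result, which is why each term is an explicit, required argument here.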

queries

The 111 queries we evaluate are a set of factual scenarios (we call them “events”) that require adjudication in relation to a particular legal clause. We're essentially asking - "Based on the following event information: .... and provided this agreement: ... make a determination on this clause... ."

Here's what each query looks like for the particular clauses:

Transportation Rate Estimation:

query_transportation_rate = (
    f"Based on shipment event data: {event}\n\n"
    f"and provided agreement calculate base transportation rate."
)

Diesel Surcharge Estimation:

query_diesel_surcharge = (
    f"Based on shipment event data: {event}\n\n"
    f"and transportation base rate {getattr(rag_tbr, 'result', rag_tbr)} "
    f"calculate diesel surcharge with fuel index value of: {fuel_index_value}."
)

Natural Gas Price Calculation:

query_natural_gas_price = (
    f"Based on shipment event data: {event}\n\n"
    f"and provided agreements calculate fuel price for the shipment based on Socal index value of {socal_index}."
)

events

Each query is passed a specific event. An event is a set of facts about a particular shipment. It is essentially a JSON-based extract from a TMS or other shipment tracking system. It could also be shipment data extracted from an email or other document.

Note: we pass structured data to both the RAG system and the Delos One engine, as in both cases we want to validate the reasoning skills of the model on Computational Legal Clauses, and not its ability to parse unstructured data. It is nevertheless worth noting that both systems still have to pick out the relevant data from the event.

Events have the following structure:

event = {
      "name": "Shipment 2025-01-18",
      "description": "shipment description",
      "details": {
        "Shipment Number": "#12093209",
        "Delivery Code": "09170129321",
        "Trip Number": "#09170129321",
        "Shipment Origin": "HAWKINS",
        "Shipment Pickup Weight Gross (in pounds)": 79380,
        "Shipment Pickup Weight Tare (in pounds)": 44240,
        "Shipment Delivery Date": "2025-01-17",
        "Shipment Delivered Weight Gross (in pounds)": 79260,
        "Shipment Delivered Weight Tare (in pounds)": 44240,
        "Total Distance (in miles)": 348,
        "Bill of Lading Number": "DEL 0912093209",
        "Carrier": "Acme Transport Inc.",
        "Driver": "John Doe",
        "Order #": "#109273012",
        "Delivery Departure Datetime": "2025-01-17 20:42:29",
        "SKU": "LNG",
        "Trip Status": "Delivered",
        "Delivery Arrival Datetime": "2025-01-17 20:12:29",
        "Division": "Energy delivery group"
      }
}
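
To illustrate what "picking the relevant data" can mean in practice, here is a hedged Python sketch that pulls a few fields out of an event shaped like the one above and derives a net weight. Which fields a given clause actually needs is our assumption, and the function and key names in the returned dictionary are hypothetical.

```python
from datetime import date

# Trimmed copy of the sample event above; only the fields used here are kept.
event = {
    "name": "Shipment 2025-01-18",
    "details": {
        "Shipment Delivery Date": "2025-01-17",
        "Shipment Delivered Weight Gross (in pounds)": 79260,
        "Shipment Delivered Weight Tare (in pounds)": 44240,
        "Total Distance (in miles)": 348,
    },
}

REQUIRED_FIELDS = (
    "Shipment Delivery Date",
    "Shipment Delivered Weight Gross (in pounds)",
    "Shipment Delivered Weight Tare (in pounds)",
    "Total Distance (in miles)",
)


def extract_inputs(event):
    """Select the fields a pricing clause might need; fail if any is absent."""
    details = event.get("details", {})
    missing = [field for field in REQUIRED_FIELDS if field not in details]
    if missing:
        raise KeyError(f"event is missing required field(s): {missing}")
    gross = details["Shipment Delivered Weight Gross (in pounds)"]
    tare = details["Shipment Delivered Weight Tare (in pounds)"]
    return {
        "delivery_date": date.fromisoformat(details["Shipment Delivery Date"]),
        "miles": details["Total Distance (in miles)"],
        "net_delivered_lb": gross - tare,  # derived value: gross minus tare
    }


inputs = extract_inputs(event)
print(inputs)
```

For the sample event the net delivered weight is 79,260 − 44,240 = 35,020 pounds.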
 

2.2 experimental setup and methodology

compared architectures

We evaluated two architectures to benchmark legal clause comprehension, procedural reasoning, and execution accuracy. We compare Delos One against a Retrieval-Augmented Generation (RAG) system based on Anthropic Citations. For tracing we use Weave from Weights & Biases, as it makes the reasoning steps easy to inspect.

The RAG configuration uses Claude 3.5 Sonnet. Cleaned, structured PDFs of the agreements were indexed to support clause-level retrieval.

We selected Anthropic Citations as a baseline because of the following advantages for legal work:

  • Transparent Source Attribution: Claude returns not only its answer but also exact citations from the source documents—highlighting the sentence-level origin of its conclusions.
  • Improved Clause Resolution: The citation-aware model can reason with the retrieved clause fragments more effectively by grounding its responses in verifiable language from the contract. This mitigates hallucination risk and supports more faithful clause interpretation.
  • Contextual Retrieval Optimization: Anthropic's Claude incorporates Contextual Embeddings + BM25 hybrid retrieval, improving the relevance of retrieved passages, which is critical when clauses are scattered across sections.

Our goal was to benchmark pure legal reasoning performance. To isolate clause interpretation from document parsing or extraction from external data, we assumed ideal conditions: all external data inputs (e.g., shipment invoice data, fuel indices) for all shipment instances were aggregated and structured.

This setup ensured our comparison focused exclusively on the LLMs’ and agents' ability to:

  • Extract relevant clauses
  • Understand formulaic logic
  • Execute multi-step calculations correctly
  • Handle flawed or missing input data

choice of Claude 3.5 Sonnet with citations

The choice of model for our RAG baseline was preceded by an initial evaluation of several models on a small subset of queries. We looked at Claude 3.7 Sonnet as well as the o1, o3, o4 model family from OpenAI. We chose Claude 3.5 Sonnet because it is the most stable, cost-predictable release in the Claude 3 family and lends itself well to this task.

Claude 3.7 introduces a hybrid “extended-thinking” mode. Hidden chain-of-thought tokens inflate latency and cost, and at times make it impossible to determine whether a correct answer results from accurate clause retrieval or from parametric reasoning. In our experience with contractual language, this often leads to overthinking: the model dwells on the wording of the agreement instead of solving the specific task set out in the prompt. This pattern was especially clear on a standard natural gas pricing calculation. Instead of returning the computed figure, Claude 3.7 Sonnet generated a multi-paragraph summary of the contract, including delivery site locations, pressure limits, and weighing procedures, and did not produce an answer to the query. While factually accurate, such output entirely missed the clause-level computation target, which is precisely the kind of misalignment we aim to avoid with Delos One.

For the same reason we discarded the o1, o3, o4 family from OpenAI. We do plan to include those models in future benchmarks.

2.3 results

We measured the accuracy of the calculations for both architectures against synthetically generated ground truth data in a single pass through the validation set of 111 shipment data instances, called events.
For simplicity, the two transportation contracts were aggregated, since the same types of estimation were analysed for both.

Clause | Events | RAG Accuracy | Delos One Accuracy
Fuel price estimation | 64 | 10.0% | 100.0%
Transportation rate estimation (weighted avg) | 47 | 45.7% | 100.0%
Diesel Surcharge estimation (weighted avg) | 47 | 45.7% | 100.0%
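
For readers who want to reproduce this kind of measurement, the sketch below shows one plausible way to score a system against ground truth. It assumes exact agreement to the cent as the pass criterion, in line with the accuracy requirement stated in Section 1.3; the scoring script actually used for the table above is not shown in this post, and the event identifiers and amounts here are invented.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def accuracy(predictions, ground_truth):
    """Share of events whose predicted amount equals the expected amount to the cent.

    An event with no prediction (None or absent) counts as a miss.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    hits = 0
    for event_id, expected in ground_truth.items():
        predicted = predictions.get(event_id)
        if predicted is not None and predicted.quantize(CENT) == expected.quantize(CENT):
            hits += 1
    return hits / len(ground_truth)


# Invented example: one exact match, one answer that is five cents off.
truth = {"evt-1": Decimal("9784.45"), "evt-2": Decimal("100.00")}
answers = {"evt-1": Decimal("9784.40"), "evt-2": Decimal("100.00")}
print(accuracy(answers, truth))
```

Under this criterion an answer of $9,784.40 against an expected $9,784.45 is simply wrong, so the example scores 0.5.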

error analysis: base rate & diesel surcharge clauses

The RAG workflow could handle simple lookups (e.g., “apply base rate X per mile”), but it often faltered when the logic chained across multiple conditions. Delos One, which translates every if-then condition into an executable graph, evaluated all branches correctly in every test case.

Typical RAG failure modes:

  • Guessed inputs. When a required field (e.g., mileage band) was missing, the model substituted a default value instead of flagging the gap.
  • Dropped conditions. Low-scoring retrieval passages were ignored, so distance thresholds or backhaul modifiers vanished from the calculation.
  • Silent continuation. Even with incomplete data, the model produced a dollar figure; Delos, by design, stopped execution and surfaced a “missing input” alert, preserving the audit trail.

error analysis: index-linked price formula

Fuel pricing combines variables scattered across Sections 3.3.1, 5.11, and Attachment A, then performs floating-point arithmetic. The RAG pipeline often stitched the right text together but misapplied the math; Delos executed the encoded algebra step by step and matched the ground truth every time.

Typical RAG failure modes:

  • Term omission. One coefficient (e.g., freight factor F) was dropped from the formula, systematically under-pricing every delivery.
  • Bracket mis-selection. The model read the wrong row in the conversion factor table, propagating the error through the entire invoice.
  • Context loss. When the prompt neared the token limit, user-supplied inputs such as monthly volume were clipped, and the model either guessed or abandoned the calculation.

cost efficiency analysis

In our neuro-symbolic architecture the LLM functions primarily as a data integrator and validator, rather than being responsible for the entire contract comprehension within its context window. This division of labor allows for a significant reduction in computational overhead and query costs.

After evaluating several LLMs—including GPT-4o, GPT-4.1 Mini (2025-04-14), and GPT-4o Mini (2024-07-18)—we identified GPT-4o Mini as the optimal choice for our use case. Its cost-effectiveness is notable: GPT-4o Mini is approximately 24 times less expensive than Claude 3.5 Sonnet for both input and output tokens, with input costs at $0.15 per million tokens and output costs at $0.60 per million tokens.

In our experiments focusing on fuel price clause calculations, the integration of GPT-4o Mini with the DL1-encoded neuro-symbolic engine resulted in:

  • Latency: Achieving 75% of the latency observed in the RAG baseline utilizing Claude 3.5 Sonnet.
  • Cost: Reducing execution costs to just 2% of the RAG baseline.
