OpenAI’s cutting-edge GPT models are not optimised for mathematical or reasoning tasks.

Kemper (2023) attributes this to the probabilistic nature of these models: text generation is fundamentally about plausibility, probability and educated guesswork, whereas mathematical calculations require exact, 100% accurate answers.

One way to improve the situation is to use an expert model such as TAPAS, developed by Google specifically for answering questions over tabular data. However, because TAPAS is still a transformer at its core, it makes the same kinds of mistakes as any other LLM, which rules it out when important decisions depend on the AI output.
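
As an illustration, a minimal sketch of querying a table with TAPAS through the Hugging Face transformers pipeline might look like the following. The checkpoint name and the toy table are assumptions for illustration only, not part of Kemper’s experiments.

```python
# Minimal sketch: table question answering with TAPAS via Hugging Face.
# The checkpoint and the sample data are illustrative assumptions.
import pandas as pd
from transformers import pipeline

# TAPAS expects every cell to be a string.
table = pd.DataFrame(
    {
        "Region": ["North", "South", "West"],
        "Revenue": ["120000", "95000", "143000"],
    }
)

tqa = pipeline(
    "table-question-answering",
    model="google/tapas-base-finetuned-wtq",  # assumed public checkpoint
)

result = tqa(table=table, query="Which region had the highest revenue?")
print(result["answer"])  # the model's best guess -- not guaranteed to be correct
```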

A more reliable approach (Kemper, 2023) is to use the GPT models’ code-generation (Codex) capability to produce SQL statements and then execute the generated SQL, so the numerical part of the task is handled by the database engine rather than the model. In a series of experiments, Kemper achieved 100% accuracy when querying structured data this way.
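
A minimal sketch of that pattern, assuming the openai Python client and an in-memory SQLite database, could look like the code below. The model name, the prompt wording and the toy schema are illustrative assumptions, not Kemper’s actual setup.

```python
# Minimal sketch: let the GPT model write the SQL, let SQLite do the arithmetic.
import sqlite3
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy data: the numerical work is done by SQLite, not by the LLM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120000), ("South", 95000), ("West", 143000)],
)

schema = "sales(region TEXT, revenue REAL)"
question = "What is the total revenue across all regions?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        {
            "role": "system",
            "content": f"Return a single SQLite SELECT statement for the schema {schema}. "
                       "Reply with SQL only, no explanation.",
        },
        {"role": "user", "content": question},
    ],
)

# In real code the generated SQL should be validated before execution.
sql = response.choices[0].message.content.strip().strip("`")
print(sql)
print(conn.execute(sql).fetchall())
```

Because the arithmetic is performed by SQLite, the numerical answer is exact whenever the generated SQL is correct.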

A more generic approach is tool use: let the model decide which external system or API to call, and let that system do the work it is already good at.
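
As a sketch of that idea, the snippet below uses the OpenAI tool-calling interface with a hypothetical exchange-rate lookup; the function, its parameters and the model name are all assumptions made for illustration.

```python
# Minimal sketch: describe an external capability to the model as a tool,
# then execute the call in application code. Everything here is illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_exchange_rate(base: str, quote: str) -> float:
    # Stand-in for a call to a real external API.
    rates = {("GBP", "USD"): 1.27, ("EUR", "USD"): 1.09}
    return rates.get((base, quote), 1.0)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_exchange_rate",
            "description": "Look up the current exchange rate between two currencies.",
            "parameters": {
                "type": "object",
                "properties": {
                    "base": {"type": "string"},
                    "quote": {"type": "string"},
                },
                "required": ["base", "quote"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": "How many US dollars is 250 GBP?"}],
    tools=tools,
)

# Assumes the model chose to call the tool; real code should check first.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
rate = get_exchange_rate(**args)
print(250 * rate)  # the arithmetic happens outside the model
```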

References