GPT models are not optimised for mathematical or reasoning tasks.

Kemper (2023) attributes this to the probabilistic nature of these models: text generation is inherently about plausibility, probability and guesswork, whereas mathematical calculations require 100% accuracy.

To improve the situation, we can use an expert model such as TAPAS, a model developed by Google specifically for answering questions over tabular data. However, being a transformer at its core, TAPAS still makes the same kinds of mistakes as every other LLM, rendering it unusable when important decisions rely on the AI output.

A more reliable approach (Kemper, 2023) is to use GPT models’ code-generation (Codex) capability to produce SQL statements, then execute the generated code so that the numerical part of the task is handled by the database engine rather than the model. Kemper’s experiment achieved 100% accuracy with this approach.
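A minimal sketch of this pattern, assuming the model has already produced a SQL string (here hard-coded to keep the example self-contained; in a real pipeline `generated_sql` would come from an LLM call). The point is that the arithmetic is performed deterministically by SQLite, not guessed by the language model:

```python
import sqlite3

def run_generated_sql(sql: str, rows):
    # Build an in-memory table, then execute the (model-generated) SQL
    # so the summation is done by the database engine, not the LLM.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(sql).fetchone()[0]
    conn.close()
    return result

rows = [("widget", 19.5), ("gadget", 5.5), ("widget", 4.5)]
# Stand-in for the LLM output: the model only has to translate the
# question into SQL, never to perform the addition itself.
generated_sql = "SELECT SUM(amount) FROM sales WHERE product = 'widget'"
print(run_generated_sql(generated_sql, rows))  # → 24.0
```

The division of labour is what makes the result exact: the model does the language-to-query translation, a task it is good at, while the database does the arithmetic, a task it cannot get wrong.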

References