Can the government trust AI to answer citizens’ questions?

The government has been here before. To retain control of its relationship with citizens, it must ensure AI prioritises GOV.UK info over other sources

By Elena Simperl

17 Feb 2026

Large language models are increasingly part of everyday life: 73% of the UK public have used AI chatbots. As the government advances pilots of AI assistants across public services, citizen queries – questions about benefits, public health, employment, or eligibility for services – are a natural fit for these systems. They can respond to highly specific questions in conversational language that reflects a user’s circumstances, education and accessibility needs. But just how accurate are the answers?

Citizen queries are not low-risk interactions. Tools such as AI-based fraud detection systems operate largely out of sight, analysing transaction patterns in the background under human oversight. In contrast, citizen queries are answered in real time with no human oversight. When the subject matter involves welfare eligibility, tax payments, or legal requirements, details matter. But LLMs are known to be prone to hallucinations and omissions, and highly personal disclosures made to these systems have later resurfaced in unexpected contexts, creating risks for privacy and safety.

With new and updated AI models being released every week, ongoing evaluation of their risks and benefits is essential, particularly when they are used in public services. However, much of the research on citizen queries dates back to the ‘noughties’, when the government shifted from citizens advice bureaux and call centres to government websites in what was known as the ‘e-government’ movement. That technological shift transformed how the public sector delivered information. Twenty years later, as everyday interactions increasingly rely on AI systems like ChatGPT and Gemini, another technological transformation is underway.

Evaluating chatbots

To test how AI models performed when answering real public-service questions, the Open Data Institute and its collaborators[1] mapped 22,000 synthetically generated citizen queries, such as “How do I apply for Universal Credit?”, against authoritative answers from GOV.UK. Responses from models including Anthropic’s Claude-4.5-Haiku, Google’s Gemini-3-Flash, and OpenAI’s GPT-4o[2] were then compared directly with official government sources, creating a new independent benchmark for AI called CitizenQuery-UK.
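To give a sense of what such a comparison involves, the sketch below scores a model’s answer against key facts taken from the corresponding GOV.UK page. The file format, the `ask_model` callable and the string-matching score are illustrative assumptions for this article, not the benchmark’s actual methodology.

```python
import json

def key_fact_coverage(response: str, key_facts: list[str]) -> float:
    """Fraction of reference key facts that appear in the model's response.
    A deliberately crude proxy; real benchmarks typically rely on human or
    model-based grading rather than string matching."""
    if not key_facts:
        return 0.0
    response_lower = response.lower()
    return sum(fact.lower() in response_lower for fact in key_facts) / len(key_facts)

def evaluate(queries_path: str, ask_model) -> float:
    """Average key-fact coverage over a JSONL file of citizen queries.

    Each record is assumed to look like:
    {"query": "How do I apply for Universal Credit?",
     "gov_uk_url": "https://www.gov.uk/universal-credit/how-to-claim",
     "key_facts": ["claim online", "Universal Credit helpline"]}
    """
    scores = []
    with open(queries_path) as f:
        for line in f:
            record = json.loads(line)
            answer = ask_model(record["query"])  # any chat model wrapped as a callable
            scores.append(key_fact_coverage(answer, record["key_facts"]))
    return sum(scores) / len(scores)
```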

Long tails and high variance undermine trustworthiness      

The research findings were consistent across models: while many responses were correct, others were incomplete or wrong, and it was often difficult for users to tell the difference. The study also showed that models can perform well on average while still producing significant errors in individual cases. In practice, this creates a “long tail” of failures: unpredictable inaccuracies that undermine reliability when citizens need certainty. If an incorrect answer about eligibility or deadlines carries financial or legal consequences for the person relying on it, this matters more than ‘overall’ accuracy.
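The distinction between average accuracy and tail risk is easy to see with made-up numbers: two hypothetical models can share the same mean score while one of them fails badly on one query in ten.

```python
# Illustrative, made-up per-query scores for two hypothetical models;
# not figures from the CitizenQuery-UK study.
model_a = [1.0] * 90 + [0.0] * 10   # perfect most of the time, badly wrong 10% of the time
model_b = [0.9] * 100               # consistently "mostly right"

def summarise(scores, failure_threshold=0.5):
    mean = sum(scores) / len(scores)
    severe_failure_rate = sum(s < failure_threshold for s in scores) / len(scores)
    return mean, severe_failure_rate

print(summarise(model_a))  # same average accuracy, but 10% of answers are badly wrong
print(summarise(model_b))  # same average accuracy, no severe failures
```

For a citizen on the wrong side of that one-in-ten, the average score is little comfort.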

Researchers identified examples such as GPT-OSS-20B providing incorrect advice about Guardian’s Allowance; Llama 3.1 8B incorrectly stating that a court order was required to add an ex-partner’s name to a child’s birth certificate; and Qwen3-32B misadvising a charity about tax deadlines. These errors could lead to unnecessary stress, financial cost or other problems for citizens.

AI likes to talk – but at what cost?

Part of the challenge lies in how chatbots are designed to behave. They aim to be helpful by drawing on information from a range of sources and presenting it as a single response. However, in this context, they can swamp people with unrelated information, failing to prioritise official government information or admit when they don’t know the answer. In high-stakes public service contexts, this behaviour introduces risk rather than reducing it.

The research also found important behavioural differences between models. Some demonstrated higher accuracy but generated excessively verbose responses. When researchers experimented with forcing models to be more concise and direct, their factual accuracy actually dropped, suggesting that when asked to condense their answers to citizen queries, models do not prioritise GOV.UK information over other sources.
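A prompt ablation of this kind can be run with the same evaluation loop sketched earlier; the prompt wording and the `make_ask_model` factory below are hypothetical, and simply hold everything constant except the conciseness instruction.

```python
# Hypothetical prompt variants for a conciseness ablation, reusing the
# evaluate() helper from the earlier sketch.
PROMPTS = {
    "default": "Answer the citizen's question using official GOV.UK guidance.",
    "concise": ("Answer the citizen's question using official GOV.UK guidance. "
                "Be as brief and direct as possible."),
}

def run_prompt_ablation(queries_path: str, make_ask_model) -> dict[str, float]:
    """Score each prompt variant on the same queries with the same model,
    so any accuracy change is attributable to the instruction alone."""
    return {
        name: evaluate(queries_path, make_ask_model(system_prompt))
        for name, system_prompt in PROMPTS.items()
    }
```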

But the government has been here before. In the early 2010s, as search engines matured, the UK government prioritised visibility for official domains such as gov.uk and data.gov.uk, treating them as trusted public gateways for citizens seeking information. Officials worked with Google to ensure authoritative pages ranked highly for common public queries, so that searches such as driving licence renewal surfaced DVLA guidance ahead of commercial alternatives.

In many cases in the study, smaller models delivered comparable results at a lower cost than large, closed-source models such as GPT-4.1, and open-weight models such as the Llama and Qwen series performed competitively against closed-source systems. Models that are less verbose and more predictable may be better suited to public-sector use, where reliability and consistency matter more than maximum capability.

"Some demonstrated higher accuracy but generated excessively verbose responses. When researchers experimented with forcing models to be more concise, their factual accuracy dropped"

These findings suggest caution in rushing towards larger or more expensive models and highlight the risks of vendor lock-in at a time when the technology is evolving rapidly. The results also demonstrate that rigorous testing and configuration are essential before models are deployed, including defining their limits and enabling models to admit fallibility.

Technical progress alone is not enough; what matters is whether systems deliver value in real-world public services. This means more independent benchmarking using frameworks such as CitizenQuery-UK, more public testing and continuous evaluation of their risks and benefits, with ethical guardrails established before implementation. AI should be treated as part of service design rather than as a standalone product and integrated carefully where it demonstrably improves outcomes.

The government must retain control of the interface and its relationship with citizens, allowing models to be swapped in and out as technology evolves rather than embedding dependency on a single provider. For citizens, this reinforces the importance of AI literacy, while for the public sector, it points towards projects that prioritise learning and openness over rapid expansion. Without that discipline, there is a risk that premature adoption could undermine trust in public services at precisely the moment that the government can least afford to lose it.
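In engineering terms, that separation is an adapter layer: the service owns the policy prompt and the GOV.UK grounding, and the model sits behind a narrow interface so it can be replaced without rewriting the service. The sketch below is illustrative; the interface, policy text and adapter names are assumptions rather than any department’s actual design.

```python
from typing import Protocol

GOV_UK_POLICY = (
    "Answer only from the supplied GOV.UK extracts. If they do not contain "
    "the answer, say so and direct the user to GOV.UK."
)

class AnswerModel(Protocol):
    """The narrow interface the service depends on; any provider's model
    can be wrapped behind it."""
    def complete(self, system: str, user: str) -> str: ...

class CitizenQueryService:
    """Government-controlled interface: it owns the policy prompt and the
    grounding text, and treats the underlying model as a swappable component."""
    def __init__(self, model: AnswerModel):
        self.model = model

    def respond(self, query: str, gov_uk_extracts: str) -> str:
        user_message = f"{query}\n\nRelevant GOV.UK guidance:\n{gov_uk_extracts}"
        return self.model.complete(system=GOV_UK_POLICY, user=user_message)

# Swapping providers is then a one-line change at construction time, e.g.
# CitizenQueryService(OpenWeightAdapter(...)) or CitizenQueryService(HostedApiAdapter(...)).
```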

To support the UK government’s AI strategy and collaboration with like-minded organisations, the code has been released under an open-source licence, with a preprint posted on arXiv.org.

Elena Simperl is director of research at the Open Data Institute
