University of Waterloo study finds AI models fail software output tasks 25% of the time

Reliability Gaps in AI Data Processing

A benchmarking study conducted by researchers at the University of Waterloo, published in March 2026, has revealed that leading artificial intelligence models fail to complete basic software development and data formatting tasks 25 per cent of the time. The peer-reviewed research, titled StructEval, evaluated the reliability of large language models when tasked with producing structured outputs. These structured formats, which include JSON, XML, and Markdown, are essential for integrating artificial intelligence into automated workflows, database management, and professional reporting pipelines.

For Australian environmental professionals, developers, infrastructure planners, and legal advisors, this research highlights a critical reliability gap. As the environmental and engineering sectors increasingly turn to generative artificial intelligence to streamline operations, compile reporting data, and automate compliance assessments, the assumption that these models can operate autonomously is being challenged. A failure rate of one in four tasks in high-tier models indicates that direct, unvalidated integration of artificial intelligence into professional workflows introduces severe data integrity risks.

Understanding the limits of these tools is essential for maintaining the high standards of accuracy required in regulated industries. Whether a team is compiling laboratory results, drafting environmental site assessments, or parsing council planning schemes, formatting errors can disrupt databases, break programmatic validation systems, and lead to incorrect compliance conclusions. This study is a timely reminder that while artificial intelligence offers significant efficiency gains, it cannot yet be treated as a fully autonomous component of professional service delivery.

Waterloo University StructEval Study Findings

The StructEval benchmarking study from the University of Waterloo subjected 11 leading large language models to rigorous testing to determine their ability to adhere to strict technical constraints. The evaluation spanned 18 different structured output formats and 44 distinct tasks, measuring how consistently the models could restrict their outputs to predefined schemas. This extensive testing framework was designed to replicate real-world software development and data integration scenarios where syntactical perfection is mandatory.

Historically, large language models returned responses in free-form natural language. While this suits conversational applications, free-form text is difficult for computer systems to parse and process reliably. To address this limitation, major artificial intelligence developers, including OpenAI, Google, and Anthropic, introduced structured output capabilities. These features are designed to constrain the model’s response to machine-readable formats such as JSON (JavaScript Object Notation), XML (eXtensible Markup Language), or Markdown, allowing downstream databases and software applications to ingest the data automatically.

The findings of the Waterloo study demonstrate a clear performance ceiling for these structured output tasks. The most advanced proprietary models achieved an average accuracy of only 75 per cent across the evaluated tasks. This translates to a 25 per cent failure rate, where the model either failed to follow the requested schema or produced syntactically invalid output. Open-source models performed worse, with average accuracy dropping to 65 per cent, a 35 per cent failure rate in standard formatting adherence.

The errors identified in the StructEval study are particularly problematic because they often involve minor syntactical discrepancies. An artificial intelligence model might omit a closing bracket, misplace a quotation mark, or fail to escape a special character within a JSON payload. While a human reader can easily interpret the intended meaning of such minor errors, automated software parsers cannot. A single syntax error in a structured file will cause downstream systems to reject the file, halt data processing pipelines, or corrupt database entries, making autonomous operation without verification impossible.
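The failure mode described above can be reproduced with any strict parser. The minimal Python sketch below uses the standard-library `json` module. The payload strings and the `parse_model_output` helper are hypothetical illustrations and are not taken from the StructEval study.

```python
import json


def parse_model_output(raw: str):
    """Return (data, None) on success or (None, message) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # A strict parser rejects the whole payload, not just the bad character.
        return None, f"{exc.msg} (line {exc.lineno}, column {exc.colno})"


# Hypothetical payloads: the second is missing its closing brace.
valid = '{"sample_id": "BH01", "analyte": "analyte_x", "value": 120}'
truncated = '{"sample_id": "BH01", "analyte": "analyte_x", "value": 120'

data, error = parse_model_output(valid)
print(data, error)

data, error = parse_model_output(truncated)
print(data, error)
```

A human reader can still see what the truncated payload was meant to say, but the parser returns no data at all, which is why a single missing character can stop a pipeline.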


Australian context

In Australia, the management and reporting of contaminated land and environmental data are governed by highly prescriptive frameworks. These include the National Environment Protection (Assessment of Site Contamination) Measure 1999, as amended in 2013 (the ASC NEPM), the PFAS National Environmental Management Plan (PFAS NEMP), and state-specific guidelines issued by Environment Protection Authorities (EPAs) in New South Wales, Victoria, Queensland, and South Australia. These regulatory frameworks demand absolute precision, particularly when comparing analytical laboratory results against statutory thresholds such as Health Investigation Levels (HILs) or Ecological Investigation Levels (EILs).

The reliability gap highlighted by the StructEval study has significant implications for how Australian environmental consultancies and planning authorities manage data. Many organisations are currently exploring the use of generative artificial intelligence to parse laboratory certificates, automate the migration of historical site data into modern databases, or prepare environmental e-deliverables such as ESDAT files. If an artificial intelligence model fails to correctly structure this analytical data 25 per cent of the time, the risk of misclassifying a contaminant exceedance or overlooking a critical environmental risk is unacceptably high.
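One way to limit this risk is to validate every structured record before it is compared with a criterion. The sketch below is a minimal illustration in Python under stated assumptions: the field names, the `validate_record` and `exceeds` helpers, and the criterion value are all hypothetical placeholders. It does not represent an ESDAT format or any real HIL or EIL.

```python
import json
import math

# Hypothetical record layout; real laboratory deliverables will differ.
REQUIRED_FIELDS = {
    "sample_id": str,
    "analyte": str,
    "value": (int, float),
    "units": str,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            problems.append(f"wrong type for field: {field}")
    value = record.get("value")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # json.loads accepts NaN and Infinity by default, so check explicitly.
        if not math.isfinite(value) or value < 0:
            problems.append("value must be a finite, non-negative number")
    return problems


def exceeds(record, criterion):
    """Compare a record with a screening criterion, refusing invalid input."""
    problems = validate_record(record)
    if problems:
        raise ValueError("; ".join(problems))
    if record["units"] != criterion["units"]:
        raise ValueError("units do not match the criterion")
    return record["value"] > criterion["value"]


criterion = {"value": 100.0, "units": "mg/kg"}  # placeholder, not a real HIL or EIL
good = json.loads('{"sample_id": "BH01", "analyte": "analyte_x", "value": 120, "units": "mg/kg"}')
bad = json.loads('{"sample_id": "BH01", "analyte": "analyte_x", "value": "120"}')

print(exceeds(good, criterion))
print(validate_record(bad))
```

The second record is syntactically valid JSON but still unusable: the value is a string and the units are missing. Refusing to compare it, rather than guessing, keeps a formatting failure from silently becoming a wrong exceedance result.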

Furthermore, the reliance on unverified automated data pipelines complicates the statutory auditing and planning processes. In jurisdictions like New South Wales and Victoria, site contamination audits and planning certificates, such as Section 10.7 Certificates in New South Wales, require legally defensible data trails. If a consultant utilises an artificial intelligence tool to generate structured outputs that feed into these statutory documents, any formatting error or schema breach could invalidate the data trail, expose the consultant to professional liability, and require costly re-auditing. Manual verification by qualified practitioners remains essential to ensure that automated outputs meet the evidentiary standards expected by regulators and the courts.
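As a rough illustration of what a reviewable record might capture, the Python sketch below logs each raw model output with a hash, a timestamp, a parse result, and an empty sign-off field. The `audit_entry` function, its field names, and the model name are assumptions made for this example. Such a record supports manual verification but does not by itself make a data trail legally defensible.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_entry(raw_output: str, model_name: str) -> dict:
    """Build a review record for one raw model output (hypothetical layout)."""
    try:
        json.loads(raw_output)
        parsed_ok = True
    except json.JSONDecodeError:
        parsed_ok = False
    return {
        "model": model_name,
        "received_at": datetime.now(timezone.utc).isoformat(),
        # The hash lets a reviewer confirm the stored output was not altered later.
        "sha256": hashlib.sha256(raw_output.encode("utf-8")).hexdigest(),
        "parsed_ok": parsed_ok,
        # Stays None until a qualified practitioner signs off.
        "reviewed_by": None,
    }


entry = audit_entry('{"sample_id": "BH01", "value": 120}', "example-model")
print(json.dumps(entry, indent=2))
```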

References and related sources

Primary source: www.eurekalert.org
NEPM Assessment of Site Contamination

How iEnvi can help

iEnvi provides specialist consulting services relevant to this topic. Our team includes CEnvP Site Contamination Specialists with experience across contaminated land, groundwater, remediation, ecology, and regulatory compliance.

This is an iEnvi Machete news summary, prepared by iEnvi to summarise the source article for contaminated land, groundwater, remediation, approvals and site risk professionals.

Published: 17 Jun 2026

Need advice on this topic? Speak to an iEnvi expert at info@ienvi.com.au or 1300 043 684, or contact us online.
