
AI is still making code worse: a new CMU study confirms it

In early 2025 I wrote about GitClear’s analysis of the impact of GenAI on code quality, based on 2024 data, which showed a significant degradation in code quality and maintainability. I recently came across a new study from Carnegie Mellon, “Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects”, which looks at a more recent period, tracking code quality in projects using GenAI tools up to mid-2025. So has code quality improved as the models and tools have matured?

The answer appears to be no. The study finds that AI briefly accelerates code generation, but the underlying code quality trends continue to move in the wrong direction.

How the study was run

Researchers at Carnegie Mellon University analysed 807 open source GitHub repositories that adopted Cursor between January 2024 and March 2025, and tracked how those projects changed through to August 2025. Adoption was identified by looking for Cursor configuration files committed to the repo.
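As a rough illustration of how that kind of detection could work (the paper’s exact criteria aren’t reproduced here, and the file names, repo names and adoption rule below are my assumptions), a repository might be flagged as a Cursor adopter by checking the GitHub contents API for committed Cursor configuration files:

```python
# Sketch: flag a repository as a "Cursor adopter" by checking for committed
# Cursor configuration files via the GitHub contents API.
# The candidate paths and the adoption rule are assumptions for illustration,
# not the study's exact methodology.
import requests

# Paths commonly associated with Cursor configuration (assumed, not from the paper).
CANDIDATE_PATHS = [".cursorrules", ".cursor"]

def adopted_cursor(owner: str, repo: str, token: str | None = None) -> bool:
    """Return True if any candidate Cursor config path exists on the default branch."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    for path in CANDIDATE_PATHS:
        url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 200:  # the path exists in the repository
            return True
    return False

if __name__ == "__main__":
    # Hypothetical repository, purely for illustration.
    print(adopted_cursor("some-org", "some-repo"))
```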

For comparison, the researchers built a control group of 1,380 similar GitHub repositories that didn’t adopt Cursor (see caveats below).

For code quality, they used SonarQube, a widely used and well-respected code analysis tool that scans code for quality and security issues. The researchers ran SonarQube monthly to track how each codebase evolved, focusing on static analysis warnings, code duplication and code complexity.
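For a sense of what that monthly tracking might look like in practice, here is a minimal sketch (my own approximation, not the authors’ pipeline) that checks out the last commit before each month boundary and runs the standard sonar-scanner CLI against it. It assumes git and sonar-scanner are installed and that a SonarQube server is already configured; the repo path, project key and dates are placeholders.

```python
# Sketch: run SonarQube's scanner against monthly snapshots of a repository.
# My own approximation of the idea, not the study's actual tooling.
import subprocess

def commit_before(repo_dir: str, date: str) -> str | None:
    """Return the last commit on the current branch before the given ISO date."""
    out = subprocess.run(
        ["git", "rev-list", "-1", f"--before={date}", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out or None

def scan_month(repo_dir: str, project_key: str, date: str) -> None:
    """Check out the snapshot for one month boundary and run sonar-scanner on it."""
    sha = commit_before(repo_dir, date)
    if sha is None:
        return  # repository has no commits before this date
    subprocess.run(["git", "checkout", "--detach", sha], cwd=repo_dir, check=True)
    subprocess.run(
        ["sonar-scanner",
         f"-Dsonar.projectKey={project_key}",
         f"-Dsonar.projectVersion={date}",
         "-Dsonar.sources=."],
        cwd=repo_dir, check=True,
    )

if __name__ == "__main__":
    # Placeholder repo, project key and month boundaries.
    for month_end in ["2024-02-01", "2024-03-01", "2024-04-01"]:
        scan_month("./some-repo", "some-project", month_end)
```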

Finally, they attempted to filter out toy or throwaway repositories by only including projects with at least 10 GitHub stars.

Key findings

Compared to the control group:

  • A short-lived increase in code generated: Activity spikes in the first one or two months after adoption. Commits rise and lines added jump sharply, with the biggest increase in the first month.
  • The increase does not persist: By month three, activity returns to baseline. There is no sustained increase in code generated.
  • Static analysis warnings increase and remain elevated: Warnings rise by around 30 percent post-adoption and stay high for the rest of the observation window.
  • Code complexity increases significantly: Complexity rises by more than 40 percent, more than can reasonably be accounted for by growth in codebase size alone.
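To make the “compared to the control group” framing concrete, the study’s difference-in-differences approach boils down to comparing the before/after change in adopting repos against the before/after change in similar non-adopting repos over the same period. A toy sketch with made-up numbers (purely illustrative, not the study’s data):

```python
# Toy difference-in-differences calculation with made-up numbers,
# purely to illustrate the comparison; these are NOT the study's figures.

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Estimated treatment effect: change in treated repos minus change in control repos."""
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical average static-analysis warnings per repo, before/after adoption.
effect = diff_in_diff(
    treated_before=100, treated_after=135,   # repos that adopted Cursor
    control_before=100, control_after=105,   # similar repos that did not
)
print(effect)  # 30 extra warnings attributed to adoption, in this toy example
```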

Caveats/Limitations

The study only looked at open source projects, which aren’t directly comparable to production codebases. Adoption is also inferred from committed Cursor configuration files, which I would say is a reasonably reliable signal of usage within those projects. However, the control group is not necessarily free of AI usage: code in those repositories may still have been created using Copilot, Claude Code or other tools.

My Takeaways

A notable period for AI-assisted development

What’s notable is the period this study tracks. In December 2024 Cursor released a major upgrade to its IDE and introduced agent mode. It was the first time I heard experienced developers I respect describe AI coding assistants as genuinely useful. Cursor adoption climbed quickly, and most developers I knew were using Claude Sonnet for day-to-day coding. Then in February 2025 Anthropic released Claude 3.7 Sonnet, its first hybrid reasoning model, followed in May by Claude Sonnet 4 and Claude Opus 4, and in August by Claude Opus 4.1.

If improvements in models or tooling were going to reverse the code quality issues seen previously, you’d expect that to show up during this period. This study shows no such reversal. The pattern is broadly the same as GitClear observed for 2024.

It’s not just “user error”

A common argument is that poor AI-generated code is the user’s fault, not the tool’s. If developers wrote clearer prompts, gave better instructions or reviewed more carefully, quality wouldn’t suffer. This study disagrees. Even across hundreds of real projects, and even after accounting for how much code was added, complexity increased faster in the AI-assisted repos than in the control group. The tools are contributing to the problem, not merely reflecting user behaviour.

Context collapse playing out in real time

Organisations training LLMs probably use signals similar to this study’s to decide which open source repositories to train on: popularity, activity and signs of being “engineered” rather than experimental. This study shows more than 800 popular GitHub projects with code quality degrading after adopting AI tools. It’s hard not to see a form of context collapse playing out in real time. If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends, producing even worse code over time.

Things are continuing to evolve quickly, but…

Of course, things have continued to move quickly since the period this study covers. Claude Code is currently the poster child for GenAI-assisted development. Developers are learning how to instruct these tools more effectively through patterns like CLAUDE.md and AGENTS.md, and support for these conventions is improving within the IDEs.

In my recent experience at least, these improvements mean you can generate good-quality code with the right guardrails in place. Without them, however (or when the tool ignores them, which is another matter), the output still trends towards the same issues: long functions, heavy nesting of conditional logic, unnecessary comments, repeated logic – code that is far more complex than it needs to be.

No doubt the tools will continue to improve, and much of the meaningful progress is happening in the IDE layer rather than in the models themselves. However, this study suggests the underlying code quality issues aren’t shifting. The structural problems remain, and they aren’t helped by the fact that the code these models are trained on is likely getting worse. The work of keeping code simple, maintainable and healthy still sits with the human, at least for the foreseeable future.