I recently wrote that we may be reaching a plateau in GenAI development, and about what these models' current limitations mean for their use in business-critical systems. But just how accurate do they need to be?
🛣 Road Transportation in the UK shows a 99.99996% safety rate [source], with only 0.4 casualties per million miles travelled in 2022. 🛩 Aviation globally is similar, with one accident for every 1.26 million flights in 2023 [source].
🏥 Healthcare, not so great. According to the WHO, 1 in 10 patients is harmed in healthcare globally, with around 1 in 20 experiencing preventable harm [source]. For the sake of argument, let's say accuracy here would need to exceed 95% (it's a lot more complicated than this, of course; see the rough comparison below).
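To put those figures on a single rough scale, here's a quick back-of-the-envelope sketch in Python. It simply converts each headline rate into failures per million operations and the equivalent reliability percentage. Treating a mile driven, a flight, and a patient treated as interchangeable "operations" is obviously a gross simplification, and the 95% healthcare threshold is just the working assumption above.

```python
# Back-of-the-envelope comparison using the figures cited above.
# Each rate is expressed as failures per million "operations" so they sit on one scale.
benchmarks = {
    "UK road transport (casualties per million miles, 2022)": 0.4,
    "Global aviation (accidents per million flights, 2023)": 1 / 1.26,
    "Healthcare today (~1 in 10 patients harmed)": 100_000,
    "Assumed 95% healthcare accuracy threshold": 50_000,
}

for label, failures_per_million in benchmarks.items():
    reliability = 1 - failures_per_million / 1_000_000
    print(f"{label}: {failures_per_million:,.2f} failures per million "
          f"-> {reliability:.5%} reliability")
```

Even on this crude scale the gap is stark: transport and aviation sit at fractions of a failure per million, while a 95%-accurate system makes 50,000 errors per million.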
Most industries have stringent regulatory safety standards to comply with, and errors can carry huge financial penalties and, in some cases, criminal liability.
How accurate are GenAI systems currently?
Studies I found show GPT model accuracy varies widely, from ~50% to ~90% (the higher end generally for simpler tasks) [source].
In healthcare, while LLMs like ChatGPT-4 are good at interpreting medical notes, their accuracy drops for complex diagnosis: 93% at identifying common diseases but only 53.3% at identifying the most likely diagnosis, far behind physicians at 98.3% [source]. In musculoskeletal radiology, they were 18% less accurate at diagnosis than radiologists [source].
As anyone who’s worked on system reliability will tell you, it’s the last mile that’s the hardest. Improvements often face diminishing returns, especially as you approach higher levels of reliability.
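To make the diminishing-returns point concrete, here's a tiny illustrative snippet (not from the studies above): each extra "nine" of reliability means cutting the remaining error rate by a factor of ten, so going from 99% to 99.999% is a thousandfold reduction in errors, not a one-percentage-point tweak.

```python
# Illustrative only: each additional "nine" of reliability requires a tenfold
# cut in the remaining error rate, which is why the last mile is the hardest.
for nines in range(1, 7):
    reliability = 1 - 10 ** -nines
    errors_per_million = 10 ** -nines * 1_000_000
    print(f"{reliability:.5%} reliable -> {errors_per_million:,.0f} errors per million")
```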
This is why, unless there's another exponential leap, we could still be a long way off from these models being reliable enough for business-critical systems (which of course doesn't mean there aren't plenty of other valuable uses for GenAI).
How accurate do GenAI models need to be for use in business-critical systems?