OpenAI Cuts AI Model Operating Costs by Over 50% Using New Efficiency Techniques
OpenAI has reportedly reduced the operating costs of its artificial intelligence models by more than 50 percent during the inference phase, when the model actively generates responses for users. Engineers at OpenAI revealed that the company can now run some services using significantly fewer Nvidia processors, including for users accessing ChatGPT without an account. Although the exact method remains undisclosed, it likely involves a combination of optimizations that reduce redundant computations and improve text processing efficiency. One possible technique is key-value (KV) caching, which lets the model store the results of earlier calculations and reuse them as it generates each new piece of text instead of recomputing them from scratch. Other potential methods include processing multiple queries together in batches and routing simpler requests to less computationally demanding paths within the model, depending on the complexity of the user's query.
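To make the idea of key-value caching concrete, here is a minimal sketch in Python. It does not represent OpenAI's undisclosed method; it is a toy, single-head attention loop written under simplifying assumptions (tiny hand-written weight matrices, and each token's vector used directly as its query). All function names, such as `generate_with_cache`, are hypothetical and exist only for illustration. The sketch counts how many times a token has to be projected into a key and value, with and without a cache.

```python
import math


def project(x, w):
    """Multiply vector x by matrix w (given as a list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(query, keys, values):
    """Softmax attention of one query over all stored keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def generate_without_cache(tokens, wk, wv):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        projections += len(prefix)  # every prefix token is projected again
        # Assumption: the token vector itself serves as the query.
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


def generate_with_cache(tokens, wk, wv):
    """Project each token once and keep its key and value for later steps."""
    outputs, projections = [], 0
    keys, values = [], []  # the key-value cache
    for x in tokens:
        keys.append(project(x, wk))
        values.append(project(x, wv))
        projections += 1  # only the newest token is projected
        outputs.append(attend(x, keys, values))
    return outputs, projections


if __name__ == "__main__":
    # Hypothetical toy inputs: four 2-dimensional token vectors.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    wk = [[1.0, 0.5], [0.0, 1.0]]
    wv = [[0.5, 0.0], [1.0, 1.0]]
    _, slow = generate_without_cache(tokens, wk, wv)
    _, fast = generate_with_cache(tokens, wk, wv)
    print("projections without cache:", slow)
    print("projections with cache:", fast)
```

For these four tokens, the uncached loop performs 1 + 2 + 3 + 4 = 10 token projections while the cached loop performs 4, and both produce the same outputs. The gap widens as the text gets longer, since the uncached count grows with the square of the sequence length and the cached count grows linearly. The trade-off is that the cache occupies memory for as long as the response is being generated.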
Reducing inference costs is critical for AI companies because, unlike training, which is carried out before deployment, inference occurs with every user request, so costs rise as the number of users grows. This breakthrough could help OpenAI lessen its reliance on continuously expanding server farms and expensive Nvidia chips, which are highly sought after in the industry. It remains unclear whether these savings will translate into expanded free usage or lower prices for users, or whether OpenAI will primarily use them to enhance profitability ahead of future business moves. Regardless, the development highlights a key competitive factor in AI: not only who builds the smartest model, but also who can operate it most cost-effectively at scale.