The rapid progress of Generative Artificial Intelligence (GenAI) has raised concerns about the sustainable economics of emerging GenAI services. Can Microsoft, Google, and Baidu offer chat responses to every search query made by billions of global smartphone and PC users? One possible resolution to this challenge is to perform a significant proportion of GenAI processing on edge devices, such as personal computers, tablets, smartphones, extended reality (XR) headsets, and eventually wearable devices.
The first article in this series (GenAI Breaks The Data Center: The Exponential Costs To Data Center) predicted that the processing requirements of GenAI, including Large Language Models (LLMs), will increase exponentially through the end of the decade as rapid growth in users, usage, and applications drives data center growth. Tirias Research estimates that GenAI infrastructure and operating costs will exceed $76 billion by 2028. To improve the economics of emerging services, Tirias Research has identified four steps that can be taken to reduce operating costs: first, usage steering, which guides users to the most efficient computational option for their desired outcome; second, model optimization, which improves the efficiency of the models employed by users at scale; third, computational optimization, which improves neural network computation through compression and advanced computer science techniques; and last, infrastructure optimization, which deploys cost-optimized data center architectures and offloads GenAI workloads to edge devices. This framework shows how, at each step, optimization can shift processing toward client devices.
Usage Steering
GenAI can perform creative and productive work, but it also places an entirely new burden on the cloud and, potentially, on client devices. At several points in the user journey, from research to the creation of a query or task, a service provider can steer users toward specialized neural networks for a more tailored experience. Users can be steered toward models that are specifically trained on their desired outcome, allowing the use of specialized neural networks that contain fewer parameters than more general models. Further, models may be designed so that a user query activates only part of the network, leaving the remainder of the neural network inactive and unexecuted.
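As a rough illustration of that kind of conditional computation, the sketch below routes each query to one of several small expert sub-networks so that only a fraction of the total parameters is exercised per request. The parameter counts, keyword-based router, and expert names are hypothetical placeholders, not a description of any provider's actual system.

# Hypothetical sketch: route each query to a small specialized "expert"
# so only a fraction of the total parameters is active per request.

# Assumed, illustrative parameter counts -- not real model sizes.
EXPERTS = {
    "code":    {"params_b": 3,  "handler": lambda q: f"[code expert] {q}"},
    "image":   {"params_b": 5,  "handler": lambda q: f"[image expert] {q}"},
    "general": {"params_b": 70, "handler": lambda q: f"[general model] {q}"},
}

KEYWORDS = {
    "code":  ("function", "bug", "compile", "python"),
    "image": ("draw", "picture", "photo", "render"),
}

def route(query: str) -> str:
    """Pick the smallest matching expert; fall back to the general model."""
    text = query.lower()
    for name, words in KEYWORDS.items():
        if any(w in text for w in words):
            expert = EXPERTS[name]
            break
    else:
        expert = EXPERTS["general"]
    # Only this expert's parameters are "activated"; the rest stay idle.
    total = sum(e["params_b"] for e in EXPERTS.values())
    print(f"activating ~{expert['params_b']}B of {total}B total parameters")
    return expert["handler"](query)

print(route("Please draw a picture of a lighthouse at sunset"))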
The edge, where users work in web browsers and applications, is a natural point of origin at which a local application or service might capture a GenAI request and choose to execute it locally. This could include complex tasks such as text generation; image and video generation, enhancement, or modification; audio creation or enhancement; and even code generation, review, or maintenance.
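A local service sitting between the application and the cloud could make that call at request time. The sketch below is a minimal, hypothetical example: it checks whether an on-device model covers the task and whether the device has enough free memory, and otherwise forwards the request to a cloud endpoint. The model registry, memory figures, and endpoint URL are all assumptions for illustration.

# Hypothetical local dispatcher: run a GenAI request on-device when a suitable
# local model exists and resources allow, otherwise fall back to the cloud.
import requests  # pip install requests

LOCAL_MODELS = {
    # task -> (model name, approximate memory needed in GB); illustrative values
    "text_generation":  ("local-llm-7b-int4", 4.0),
    "image_generation": ("distilled-diffusion", 2.5),
}

CLOUD_ENDPOINT = "https://api.example.com/v1/generate"  # placeholder URL

def available_memory_gb() -> float:
    """Stub for a platform-specific query of free accelerator/system memory."""
    return 6.0  # assumed value for illustration

def run_local(model_name: str, prompt: str) -> str:
    """Stub for invoking an on-device runtime (e.g., an NPU or GPU delegate)."""
    return f"[{model_name} ran on-device] {prompt}"

def dispatch(task: str, prompt: str) -> str:
    local = LOCAL_MODELS.get(task)
    if local and local[1] <= available_memory_gb():
        return run_local(local[0], prompt)
    # Not enough local capability: send the request to the cloud service.
    resp = requests.post(CLOUD_ENDPOINT, json={"task": task, "prompt": prompt}, timeout=30)
    return resp.json()["output"]

print(dispatch("text_generation", "Summarize this meeting transcript."))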
Model & Computational Optimization
While neural network models can be prototyped without optimization, models deployed to millions of users must balance computational efficiency against accuracy. Typically, the larger the model, the more accurate the result, but in many cases the increase in accuracy comes at a high price for only minimal benefit. Model size is usually measured in parameters, and the time and computational resources required scale roughly linearly with the parameter count. If the number of parameters can be halved while maintaining acceptable accuracy, a provider can run the model on half as many accelerated servers at roughly half the total cost of ownership (TCO), which includes both amortized capital cost and operating costs. This applies equally to models that run multiple passes before generating a result.
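To make that scaling concrete, the short calculation below estimates how server count and annual TCO fall when the parameter count is halved, assuming compute scales roughly linearly with parameters. The baseline figures (per-server throughput, server cost, amortization period) are arbitrary placeholders, not Tirias Research data.

# Back-of-envelope TCO scaling with parameter count, assuming compute and
# therefore server count scale roughly linearly with parameters.
# All baseline numbers are illustrative placeholders, not Tirias Research figures.

def servers_needed(params_b: float, queries_per_s: float, qps_per_server_per_b: float = 10.0) -> float:
    # A server is assumed to sustain qps_per_server_per_b queries/s per billion parameters.
    per_server_qps = qps_per_server_per_b / params_b
    return queries_per_s / per_server_qps

def annual_tco(servers: float, capex_per_server=150_000, years=4, opex_per_server_year=20_000) -> float:
    # Amortized capital cost plus yearly operating cost.
    return servers * (capex_per_server / years + opex_per_server_year)

baseline = annual_tco(servers_needed(params_b=130, queries_per_s=1000))
halved   = annual_tco(servers_needed(params_b=65,  queries_per_s=1000))
print(f"baseline: ${baseline/1e6:.0f}M/yr, halved model: ${halved/1e6:.0f}M/yr "
      f"({halved/baseline:.0%} of baseline)")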
Optimizing AI models is accomplished through quantization, pruning, knowledge distillation, and model specialization. Quantization reduces the precision of a model's numbers by representing the weights and activations with lower-precision data types, such as 4-bit or 8-bit integers (INT4 or INT8), instead of the standard high-precision 32-bit floating point (FP32) data type, so each value is drawn from a smaller, defined set rather than an effectively continuous range. Another way to reduce the size of a neural network is to prune the trained model of parameters that are redundant or unimportant; typical compression targets range from 2X to 3X with nearly the same accuracy. Knowledge distillation uses a large, trained model to train a smaller one. A good example is the Vicuna-13B model, which was fine-tuned from Meta's 13-billion parameter LLaMA model on user-shared conversations with OpenAI's ChatGPT. A subset of knowledge distillation is model specialization, the development of smaller models for specific applications, such as a ChatGPT-like model that answers only questions about literature, mathematics, or medical treatments rather than any generalized question. These optimization techniques can reduce the number of parameters dramatically. In forecasting the operating costs of GenAI, we take these factors into consideration, assuming that competitive and economic pressures push providers toward highly optimized model deployments, reducing the anticipated capital and operating costs over time.
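As a small illustration of two of these techniques, the PyTorch sketch below applies L1-magnitude pruning and post-training dynamic quantization (INT8 weights for linear layers) to a toy model. It is a minimal example of the general approach under simple assumptions, not the pipeline used by any particular GenAI provider; production LLM deployments use more elaborate variants of both.

# Minimal PyTorch sketch of magnitude pruning plus post-training quantization
# on a toy model; illustrative only.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
)

# 1) Pruning: zero out the 30% smallest-magnitude weights in each linear layer,
#    then make the sparsity permanent.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")

# 2) Quantization: convert linear-layer weights from FP32 to INT8 for inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 512])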
Infrastructure Optimization with On-Device GenAI
Improving the efficiency of GenAI models will not, by itself, meet the processing requirements that Tirias Research believes will be necessary to support GenAI over just the next five years. Much of that processing will need to be performed on-device, which is often referred to as “Edge GenAI.” While Edge GenAI workloads seemed unlikely just months ago, models of up to roughly 10 billion parameters are increasingly viewed as candidates for the edge, operating on consumer devices through model optimization and reasonable forecasts for increased device AI performance. For example, at Mobile World Congress earlier this year, Qualcomm demonstrated a Stable Diffusion model generating images on a smartphone powered by the company’s Snapdragon 8 Gen 2 processor. More recently, Qualcomm announced its intention to deliver large language models based on Meta’s LLaMA 2 on the Snapdragon platform in 2024. Similarly, GPU-accelerated consumer desktops can run the LLaMA 1-based Vicuna-13B model with 13 billion parameters, producing results similar to, though of slightly lower quality than, GPT-3.5. Optimization will reduce the parameter count of these networks and thereby reduce the memory and processing requirements, placing them within the capacity of mainstream personal computing devices.
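A rough way to see why optimization matters for on-device deployment is to estimate the memory a model's weights require at different precisions. The calculation below is a simplification that ignores activations, KV caches, and runtime overhead, so real requirements are higher; the figures are illustrative only.

# Approximate weight-storage footprint of a model at different precisions.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

for name, params in [("7B model", 7), ("13B model (e.g., Vicuna-13B)", 13)]:
    for bits in (32, 16, 8, 4):
        print(f"{name}: {weight_memory_gb(params, bits):6.1f} GB at {bits}-bit weights")
# A 13B model needs ~26 GB at 16-bit weights but only ~6.5 GB at 4-bit,
# which is within reach of high-end phones and mainstream PCs.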
It’s not difficult to imagine how GenAI, or any AI application, can move to a device like a PC, smartphone, or XR headset. The smartphone platform has already shown its ability to advance its processing, memory, and sensor technology so rapidly that in under a decade, smartphones replaced point-and-shoot cameras, consumer video cameras, DSLRs, and in some cases even professional cameras. The latest generation of smartphones can capture and process 4K video seamlessly, and in some cases even 8K video using AI-driven computational photo and video processing. All major smartphone brands already leverage AI technology for a variety of functions ranging from battery life and security to audio enhancement and computational photography. Additionally, AMD, Apple, Intel, and Qualcomm are incorporating inference accelerators into PC/Mac platforms. The same is true for almost all major consumer platforms and edge networking solutions. The challenge is matching the GenAI models to the processing capabilities of these edge AI processors.
While the performance improvements in mobile SoCs will not outpace the parameter growth of some GenAI applications like ChatGPT, Tirias Research believes that many GenAI models can be scaled for on-device processing, and the size of the models that are practical to run on-device will increase over time. The forecast assumes an average for on-device GenAI processing; in the development of the GenAI Forecast & TCO (Total Cost of Ownership) model, Tirias Research breaks out different classes of devices. Processing on device not only reduces latency, it also addresses another growing concern: data privacy and security. By eliminating the interaction with the cloud, all data and the resulting GenAI output remain on the device.
Even with the potential for on-device processing, many models will exceed the capabilities of edge devices and/or will require cloud interaction for a variety of reasons. GenAI applications that leverage a hybrid computing model might perform some processing on device and some in the cloud. One reason for hybrid GenAI processing might be the large size of the neural network model or the repetitive use of the model; the device would process the sensor or input data and handle the smaller portions of the model while leaving the heavy lifting to the cloud. Image or video generation is a good example: the initial image could be generated on device, while the enhanced image or the subsequent frames of a video are generated in the cloud. Another reason might be the need for input from multiple sources, such as generating updated maps in real time, where information from many sources combined with pre-existing models can more effectively route vehicle or network traffic. In some cases, the model may use data that is proprietary to a vendor, requiring some level of cloud processing to protect the data, such as for industrial or medical purposes. The need to use multiple GenAI models may also require hybrid computing because of the location or size of those models. Yet another reason is governance: even when an on-device model can generate a solution, there may still be a need to ensure that the solution does not violate legal or ethical guidelines, such as the issues that have already arisen from GenAI solutions that infringe on copyrights, fabricate legal precedents, or advise consumers to do something beyond ethical boundaries.
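As one concrete illustration of such a split, the sketch below generates a low-resolution draft image on the device and sends it to a cloud service for upscaling and refinement. The on-device runtime call, endpoint URL, and payload format are stubs and assumptions, not any vendor's actual API.

# Hypothetical hybrid image-generation flow: draft on device, refine in the cloud.
import base64

import requests  # pip install requests

CLOUD_REFINE_URL = "https://api.example.com/v1/refine"  # placeholder

def generate_draft_on_device(prompt: str) -> bytes:
    """Stub for a small on-device diffusion model producing a low-res draft."""
    return f"draft image for: {prompt}".encode()  # placeholder bytes

def refine_in_cloud(draft_png: bytes, prompt: str) -> bytes:
    """Send the draft to a larger cloud model for upscaling and detail passes."""
    payload = {
        "prompt": prompt,
        "draft": base64.b64encode(draft_png).decode(),
    }
    resp = requests.post(CLOUD_REFINE_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return base64.b64decode(resp.json()["image"])

def generate(prompt: str, refine: bool = True) -> bytes:
    draft = generate_draft_on_device(prompt)      # latency-sensitive step stays local
    return refine_in_cloud(draft, prompt) if refine else draft

print(generate("a lighthouse at sunset", refine=False))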
The Impact of On-Device GenAI on Forecasted TCO
According to the Tirias Research GenAI Forecast and TCO Model, if 20% of the GenAI processing workload could be offloaded from data centers by 2028 using on-device and hybrid processing, then data center infrastructure and operating costs for GenAI processing would decline by $15 billion. This would also reduce overall data center power requirements for GenAI applications by 800 megawatts. When factoring in the efficiencies of various forms of power generation, this equates to savings of approximately 2.4 million metric tons of coal, the output of roughly 93 GE Haliade-X 14MW wind turbines, or several million solar panels plus the associated power storage capacity. Moving these models on-device or to hybrid processing also reduces latency while increasing data privacy and security for a better user experience, factors that have been promoted for many consumer applications, not just AI.
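The headline figures follow from simple proportions, as the rough check below shows. The 20% offload share, $76 billion baseline, 800 megawatts, and 14MW turbine rating come from this article series; the wind capacity factor is an assumed value, not a Tirias Research figure.

# Rough check of the headline savings.
genai_dc_cost_2028_b = 76       # forecast data center infra + operating cost, $B
offload_share = 0.20            # portion of GenAI workload moved on-device/hybrid
print(f"cost avoided: ~${genai_dc_cost_2028_b * offload_share:.1f}B")   # ~$15.2B

power_saved_mw = 800            # reduced data center power draw
turbine_rating_mw = 14          # GE Haliade-X class turbine
capacity_factor = 0.615         # assumed average offshore capacity factor
turbines = power_saved_mw / (turbine_rating_mw * capacity_factor)
print(f"equivalent wind turbines: ~{turbines:.0f}")                     # ~93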
While many are concerned about the rapid pace of GenAI and its impact on society, the technology offers tremendous benefits. The high-tech industry, however, now finds itself in catch-up mode to meet the astronomical demands of GenAI as the technology proliferates. This is similar to the introduction and growth of the internet, but on a much larger scale. Tirias Research believes that the limited forms of GenAI in use today, such as text-to-text, text-to-speech, and text-to-image, will rapidly advance to video, game, and even metaverse generation starting within the next 18 to 24 months, further straining cloud resources.