Large models are "running" onto cell phones, and the fire of AI has spread from the "cloud" to the "mobile terminal".
"Entering the AI era, Huawei Pangu big model will come to help Hongmeng ecology." On August 4, Huawei executive director, terminal BG CEO, intelligent car solutions BU CEO Yu Chengdong introduced, through the underlying technology of Pangu big model, Harmony OS brings the next generation of intelligent terminal operating system.
Using large models on cell phones is nothing new: apps and mini-programs such as ChatGPT, Wenxin Yiyan (ERNIE Bot), and Miaoya Camera have long served mobile AI applications by calling on cloud computing power.
The next step is to let large models run directly on the phone.
Since April and May of this year, the American tech giants Qualcomm, Microsoft, and Nvidia, the hottest AI star OpenAI, and China's AI "first team" of Tencent, Baidu, and others have all been accelerating the lightweight deployment of large AI models on mobile terminals. Qualcomm has even announced that it is gradually transforming into an intelligent edge computing company (one that provides computing services at the source of the data, such as mobile terminals).
Pushed hard by this group of giants, the industry trend of large models moving from the cloud to the device has become very clear.
Why should large models "run" on the phone?
The defining feature of large models is "large": often tens or even hundreds of billions of parameters, sometimes trillions, and to run them better, compute clusters have been upgraded to the "ten-thousand-card" level. Why, then, stuff a large model into a palm-sized phone?
Large models do bring real experience improvements to phone users. For example, Huawei's on-device intelligent assistant Xiaoyi can not only recommend restaurants from voice prompts, but also handle information processing such as summarization, information retrieval, and multilingual translation: given a several-thousand-word English text, a phone assistant backed by a large model can generate a summary and translate it into Chinese. The latter in particular is valuable for learning efficiency in an era of information explosion.
Jia Yongli, President of Huawei Terminal BG's AI and Intelligent Full-Scenario Business Department, explains that, on one hand, large language models generalize well, which helps the phone's intelligent assistant understand users better. On the other hand, the plug-in capability of large models can break down the barriers between applications on the phone and extend its capabilities with external tools.
In addition, AIGC applications such as ChatGPT have always been shadowed by privacy and security controversies, a problem that can be avoided entirely if the model runs wholly on the device: since the model runs on-device, the data never leaves the device. Responses are faster this way, too.
On the other hand, bringing large models to mobile terminals such as cell phones has become genuinely urgent.
The surging momentum of large models is making the cloud increasingly unable to carry the compute demand on its own. Alex Katouzian, senior vice president at Qualcomm, recently said bluntly: "With the accelerated growth of connected devices and data traffic, overlaid with climbing data center costs, [we] can't send everything to the cloud."
Leaving aside the network bandwidth, storage, and hardware consumed by data transmission, cloud compute alone is already becoming hard for vendors to bear. ChatGPT is still only at the inference stage, yet its compute bill is conservatively estimated at $10 million per month.
The biggest problem is not that compute is "expensive" but that it is "scarce".
Earlier, even OpenAI founder Sam Altman revealed that GPUs were in short supply, going so far as to say he did not want too many people using ChatGPT. Recently, industry insiders have speculated that the large H100 clusters of cloud providers big and small will soon run out of capacity, and that demand for the H100 will continue at least through the end of 2024. NVIDIA's current H100 capacity is also heavily constrained by the supply chain.
Therefore, having cloud and terminal cooperate, putting the idle compute of phones and other terminals to use to resolve the mismatch between "centralized" compute and "distributed" demand, has become a clear trend for reducing cost and increasing efficiency in large model development. What's more, compared with the limited number of central nodes, the vast number of mobile terminals are the "capillaries" that reach thousands of scenarios, which makes this entry point key to accelerating the application penetration of large models.
How to put the large model "into the pocket"?
"Compared with traditional PCs or servers, the biggest challenge for mobile terminals is how to balance the experience and energy consumption, which is one of the most important core points of the Hongmeng kernel design." Gong Ti, president of Huawei's terminal business software department, emphasized.
Large models demand enormous computing and storage resources, and on phones' existing hardware in particular, the software system must coordinate them well to improve efficiency and reduce energy consumption.
To boost performance, today's phones carry at least eight chip cores, and coordinating them falls to the phone's system software, a process that itself consumes a great deal of compute. With heterogeneous resource scheduling, the CPU, GPU, and NPU can be coordinated efficiently; according to Gong Ti, scheduling efficiency can be improved by more than 60%.
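The idea behind heterogeneous resource scheduling can be sketched in a few lines: route each kind of work to the processor best suited to it, instead of funneling everything through the CPU. This is purely illustrative; the device names and the op-to-device table below are assumptions for the sketch, not Huawei's actual scheduler.

```python
# Hypothetical sketch of heterogeneous resource scheduling: route each
# operation to the processor best suited to it. The mapping below is
# illustrative, not taken from any real OS.
PREFERRED_DEVICE = {
    "matmul": "NPU",          # dense tensor math -> neural accelerator
    "conv": "NPU",
    "image_decode": "GPU",
    "ui_layout": "CPU",
    "io": "CPU",
}

def dispatch(ops):
    """Group operations by their preferred device so each processor
    receives one batch instead of interleaved single ops."""
    queues = {"CPU": [], "GPU": [], "NPU": []}
    for op in ops:
        device = PREFERRED_DEVICE.get(op, "CPU")  # default to CPU
        queues[device].append(op)
    return queues

queues = dispatch(["matmul", "ui_layout", "conv", "image_decode", "io"])
print(queues["NPU"])   # ['matmul', 'conv']
```

Batching work per device in this way is what lets a scheduler keep each processor busy on the workload it handles most efficiently.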
The smallest unit the phone system computes and schedules is the thread, and a traditional operating system often runs tens of thousands of threads at once, many of them doing nothing useful. Here, a more lightweight concurrency model can handle concurrent operations and cut the compute wasted on switching between idle threads; according to Gong Ti, such a concurrency model can cut task-switching overhead by 50%.
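A small illustration of what "lightweight concurrency" buys: instead of tens of thousands of OS threads, run tens of thousands of coroutines on a single thread, where switching is just a function call rather than a kernel context switch. This is a generic sketch using Python's asyncio, not HarmonyOS's actual concurrency model.

```python
# Lightweight concurrency sketch: 10,000 concurrent tasks on one OS
# thread. Each "context switch" happens in user space, far cheaper
# than switching kernel threads. (Illustrative only.)
import asyncio

async def small_task(i):
    await asyncio.sleep(0)   # yield control back to the event loop
    return i * 2

async def main():
    results = await asyncio.gather(*(small_task(i) for i in range(10_000)))
    return sum(results)

total = asyncio.run(main())
print(total)  # 99990000
```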
In addition, task scheduling, the most basic element of the operating system affecting smoothness, matters too: compared with fair scheduling, dynamic priority scheduling greatly reduces energy consumption. Dynamic priority scheduling resembles an intelligent traffic system that adjusts signal lights based on road conditions and traffic flow: when traffic in one direction builds up, that direction's light turns green earlier, reducing congestion and delay.
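The traffic-light analogy can be made concrete with a toy scheduler: tasks carry a numeric priority, and the system can raise a task's priority at runtime, the way a light turns green early when traffic builds up in one direction. The class and task names below are hypothetical, chosen only for the sketch.

```python
# Toy dynamic-priority scheduler: lower number = higher priority.
# boost() raises a task's priority at runtime, so an interactive task
# is not stuck behind a fixed fair ordering. Illustrative only.
import heapq

class DynamicScheduler:
    def __init__(self):
        self._heap = []
        self._count = 0       # tie-breaker preserving insertion order

    def add(self, name, priority):
        self._count += 1
        heapq.heappush(self._heap, (priority, self._count, name))

    def boost(self, name, new_priority):
        # Rebuild the heap with the named task's priority changed.
        self._heap = [(new_priority if n == name else p, c, n)
                      for (p, c, n) in self._heap]
        heapq.heapify(self._heap)

    def run_next(self):
        _, _, name = heapq.heappop(self._heap)
        return name

sched = DynamicScheduler()
sched.add("background_sync", 5)
sched.add("video_decode", 3)
sched.add("ui_render", 4)
sched.boost("ui_render", 0)   # user touched the screen: UI goes first
print(sched.run_next())  # ui_render
print(sched.run_next())  # video_decode
```

A real kernel scheduler does this with per-task state and aging rather than rebuilding a heap, but the effect is the same: the priority order responds to what the user is doing right now.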
However, upgrading and improving the phone's operating system alone is not enough to deploy a large model to the phone and have it actually run.
As large models predict more accurately and networks grow deeper, the memory the neural network consumes has become a central issue. Memory bandwidth is implicated as well: running the network burns through memory, CPU, and battery at a pace today's phones simply cannot bear.
Therefore, before deployment to phones, the large model must be compressed to reduce the compute needed for inference, while its original performance and accuracy remain largely unchanged.
Quantization is a common and important compression operation that reduces the memory the model occupies and improves inference performance. In essence it converts a floating-point model into an integer model; integer operations are faster and more energy-efficient than floating-point ones, at a modest cost in precision.
Quantization technology has been advancing quickly. Models trained on servers generally use 32-bit floating point (FP32); on the phone side, Qualcomm has quantized and compressed FP32 models into INT4 models, reportedly achieving a 64x gain in memory and computing energy efficiency. Qualcomm's own data show that after its quantization-aware training, many AIGC models can be quantized to INT4, improving performance by about 90% and energy efficiency by about 60% compared with INT8.
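The core idea of quantization can be shown in a minimal sketch: map FP32 weights to small integers with a single scale factor, then dequantize and check the round-trip error. Real INT4/INT8 pipelines such as Qualcomm's quantization-aware training are far more sophisticated (per-channel scales, calibration, retraining); this only demonstrates the basic symmetric scheme.

```python
# Minimal symmetric linear quantization sketch (not a production scheme).

def quantize(weights, bits=8):
    """Map floats to signed integers in roughly [-2^(bits-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1           # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.82, -1.27, 0.003, 0.51]
q, scale = quantize(weights, bits=8)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)        # small integers in [-127, 127]
print(max_err)  # rounding error, bounded by scale / 2
```

Storing `q` (one byte per weight) plus one scale instead of four-byte floats is where the 4x memory saving of INT8, or 8x for INT4, comes from.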
Large-model compression technology is undoubtedly a key weapon for the AI giants in the battle for the mobile terminal. This partly explains why NVIDIA "quietly" acquired OmniML, an AI startup specializing in large-model compression, in February this year.
Large models are forcing terminal hardware to upgrade
"This year we will be able to support generative AI models with 10 billion parameters running on cell phones." Qualcomm senior vice president of product management and head of AI Ziad Asghar, on the other hand, recently said to the public that a model with 10-15 billion parameters can cover the vast majority of AIGC use cases. If the device can already support this parameter level, computing can all be done on the device and the phone will become a true personal assistant.
Yet the current generation of flagship phone chips can carry large models only at about the 1-billion-parameter level. At CVPR, the top computer-vision conference, in June this year, Qualcomm successfully demonstrated a large model running on Android, but it had only 1.5 billion parameters.
With parameters set to jump nearly tenfold, large models heading for mobile have stepped on the gas, and phones will have to upgrade faster to cope.
Phone hardware urgently needs a revolution on two fronts: the AI accelerator and memory.
First, the more parameters a large model has, the more memory and storage it needs to hold the model weights and intermediate results. This demands upgrades in both the capacity of mobile memory chips and the bandwidth of the memory interface.
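Back-of-envelope arithmetic shows why memory is the bottleneck: the weights alone of a 10-billion-parameter model, at various precisions and ignoring activations and runtime overhead, look roughly like this.

```python
# Weight storage for a 10B-parameter model at different precisions.
# Activations, KV cache, and runtime overhead are ignored here.

def weight_bytes(params, bits):
    return params * bits // 8

params = 10_000_000_000  # 10 billion parameters
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_bytes(params, bits) / 1e9
    print(f"{name}: {gb:.0f} GB")
# FP32: 40 GB, FP16: 20 GB, INT8: 10 GB, INT4: 5 GB --
# only the INT4 figure fits comfortably alongside a flagship phone's
# 12-16 GB of RAM, which is why quantization and bigger, faster memory
# go hand in hand.
```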
Second, larger parameter counts inevitably demand more powerful computation and inference capability to process input data and produce results.
Although AI accelerators (e.g., various NPU IPs) are now practically standard on phone chips, their designs largely target the previous generation of convolutional neural networks and are not fully aimed at large models.
To adapt to large models, AI accelerators need greater memory-access bandwidth and lower memory-access latency. This requires changes to the accelerator's interface (e.g., allocating more pins to the memory interface) and corresponding changes to the on-chip data interconnect to meet the accelerator's memory-access needs.
One important reason Qualcomm can call out "10 billion parameters running on the phone within a year" is that it has the second-generation Snapdragon 8 processor, equipped with the fastest and most advanced AI engine in Qualcomm's history: compared with the first-generation Snapdragon 8, AI performance is up 4.35x and energy efficiency is up 60%.
Of course, even in the cloud, training and inference for models with ultra-large parameter counts urgently need to break through five walls: the memory wall, the compute wall, the communication wall, the tuning wall, and the deployment wall. The phone will have to break through them one layer at a time.
However, from "intelligence" to "artificial intelligence", for cell phones, the opportunity is greater than the challenge.
"The impact of the innovation cycle on consumer electronics is more important, and can even lead an industry out of the economic cycle." Zhao Ming, CEO of Glory Terminal, judged that the current smartphone industry is in a new innovation cycle opened by AI and 5G+.