With the PockEngine training method, machine-learning models can efficiently and continuously learn from user data on edge devices like smartphones.
Personalized deep-learning models could power intelligent chatbots that learn to understand a particular user’s accent, or smart keyboards that continually update to better predict the next word based on a person’s typing history. Achieving this customization requires constant fine-tuning of a machine-learning model with fresh data.
Because edge devices like smartphones lack the memory and computational power this fine-tuning process demands, user data are typically uploaded to cloud servers where the model is updated. But transmitting that data consumes a great deal of energy, and sending sensitive user data to an external cloud server poses a security risk.
Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere developed a technique that enables deep-learning models to efficiently adapt to new sensor data directly on an edge device.
Their on-device training method, called PockEngine, determines which parts of a large machine-learning model need to be updated to improve accuracy, and stores and computes only those pieces. It performs the bulk of these computations while the model is being prepared, before runtime, which minimizes computational overhead and speeds up the fine-tuning process.
Compared with other approaches, PockEngine performed fine-tuning up to 15 times faster on some hardware platforms, without any loss of accuracy. The researchers also found that their fine-tuning method enabled a popular AI chatbot to answer complex questions accurately.
On-device fine-tuning can enable better privacy, lower costs, customization, and even lifelong learning, but it is not easy: everything has to happen within a limited budget of resources. The goal is to be able to run both training and inference on edge devices, and with PockEngine, “now we can,” says Song Han, an associate professor in EECS, a member of the MIT-IBM Watson AI Lab, a distinguished scientist at NVIDIA, and senior author of an open-access paper describing PockEngine.
Han is joined on the paper by lead author Ligeng Zhu, a PhD student in EECS, as well as collaborators at MIT, the MIT-IBM Watson AI Lab, and the University of California San Diego. The paper was recently presented at the IEEE/ACM International Symposium on Microarchitecture.
In the realm of deep learning, models are constructed based on neural networks, comprising interconnected layers of nodes, or “neurons.” These neurons process data to generate predictions. When the model is set in motion, a process known as inference takes place. During inference, a data input, such as an image, traverses through the layers until a prediction, like an image label, is produced at the end. Notably, each layer doesn’t need to be stored after processing the input during inference.
However, during the training and fine-tuning phase, the model undergoes backpropagation. In backpropagation, the model is run in reverse after comparing the output to the correct answer. Each layer is updated as the model’s output approaches the correct answer. Since fine-tuning may require updating each layer, the entire model and intermediate results must be stored, making it more memory-intensive than inference.
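This difference is easy to see in code. Below is a minimal sketch, not part of PockEngine, that contrasts the two modes using a small hypothetical PyTorch model: during inference no intermediate activations need to be retained, while fine-tuning must cache every layer’s activations so backpropagation can compute gradients.

```python
import torch
import torch.nn as nn

# A small illustrative two-layer model (hypothetical, for demonstration only).
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
x = torch.randn(32, 64)

# Inference: activations can be discarded as soon as each layer has run.
with torch.no_grad():
    preds = model(x)

# Fine-tuning: the forward pass keeps every layer's activations so that
# backpropagation can walk the model in reverse and update each weight,
# which is why training needs far more memory than inference.
target = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()  # runs the model in reverse, layer by layer
```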
Interestingly, not all layers contribute equally to accuracy improvement; even for crucial layers, the entire layer may not require updates. These unessential layers or parts thereof don’t need to be stored. Moreover, there might be no need to trace back to the initial layer for accuracy enhancement; the process could be halted midway.
PockEngine capitalizes on these insights to accelerate the fine-tuning process and reduce the computational and memory demands. The system sequentially fine-tunes each layer for a specific task, measuring accuracy improvement after each layer adjustment. This approach allows PockEngine to discern the contribution of each layer, assess trade-offs between accuracy and fine-tuning costs, and automatically determine the percentage of each layer that requires fine-tuning.
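The flavor of this selective updating can be sketched as follows. This is a hedged illustration rather than PockEngine’s actual algorithm: the set of “important” layers below is a placeholder, whereas PockEngine derives its choices by measuring the accuracy gain from fine-tuning each layer against its cost.

```python
import torch
import torch.nn as nn

# Hypothetical model used only to illustrate selective fine-tuning.
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Freeze the entire network by default.
for p in model.parameters():
    p.requires_grad_(False)

# Placeholder for the layers the cost/accuracy analysis would select.
important_layers = {2, 4}
for idx in important_layers:
    for p in model[idx].parameters():
        p.requires_grad_(True)

# Only the unfrozen parameters need gradients and optimizer state,
# cutting both computation and memory during fine-tuning.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3)
```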
Han emphasizes, “This method aligns closely with the accuracy achieved through full backpropagation across various tasks and neural networks.”
Traditionally, the backpropagation graph is generated at runtime, which involves a great deal of computation. PockEngine instead does this at compile time, while the model is being prepared for deployment.
PockEngine deletes pieces of code to remove unnecessary layers or parts of layers, creating a pared-down graph of the model to be used during runtime. It then performs additional optimizations on this graph to further improve efficiency.
Since all of this only needs to be done once, computational overhead at runtime is minimal.
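A conceptual sketch of this split between compile time and runtime is shown below. The names, such as build_training_plan, are illustrative rather than PockEngine’s API: the point is that the expensive decisions (which layers to update, where backpropagation can stop) are made once, offline, and the device simply executes the resulting slim plan.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingPlan:
    layers_to_update: tuple  # indices of layers whose weights get gradients
    stop_layer: int          # backprop halts here instead of at layer 0

def build_training_plan(layer_scores, budget):
    """Compile-time step: rank layers by measured accuracy gain and keep
    the set that fits the device's compute/memory budget."""
    ranked = sorted(range(len(layer_scores)), key=lambda i: -layer_scores[i])
    chosen = tuple(sorted(ranked[:budget]))
    return TrainingPlan(layers_to_update=chosen, stop_layer=min(chosen))

# Offline: run once while preparing the model for deployment.
plan = build_training_plan(layer_scores=[0.01, 0.20, 0.05, 0.30], budget=2)

# On device: the runtime just follows `plan`; no graph analysis happens here.
print(plan)  # TrainingPlan(layers_to_update=(1, 3), stop_layer=1)
```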
“It is as if you were preparing for a hike in the woods. You sit down at home and make all the arrangements: which hiking routes you plan to take, and which you will skip over,” Han explains.
When the researchers applied PockEngine to deep-learning models running on different edge devices, including Apple M1 chips, the digital signal processors found in many modern smartphones, and Raspberry Pi computers, it trained models up to 15 times faster than other approaches, with no loss of accuracy. Fine-tuning with PockEngine also required far less memory.
The team also applied the technique to the large language model Llama-V2. With large language models, fine-tuning involves providing many examples, and it is crucial for such models to learn how to interact with users, Han says. The process is also important for models tasked with solving complex problems or reasoning about solutions.
For instance, Llama-V2 models fine-tuned with PockEngine correctly answered the question “What was Michael Jackson’s last album?” while models that weren’t fine-tuned failed to do so. PockEngine also reduced the time for each iteration of the fine-tuning process from about seven seconds to less than one second on an NVIDIA Jetson Orin, an edge GPU platform.
Looking ahead, the researchers aim to leverage PockEngine for fine-tuning even larger models designed to handle both text and images simultaneously.
“This research tackles the increasing efficiency challenges posed by the widespread adoption of large AI models like LLMs across various applications in diverse industries. It not only shows promise for edge applications incorporating larger models but also has the potential to reduce the cost associated with maintaining and updating large AI models in the cloud,” says Ehry MacRostie, a senior manager in Amazon’s Artificial General Intelligence division. MacRostie, who was not involved in this study, collaborates with MIT on related AI research through the MIT-Amazon Science Hub.
Support for this work was provided, in part, by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT-Amazon Science Hub, the National Science Foundation (NSF), and the Qualcomm Innovation Fellowship.