Foundation models, or large models, are currently gaining popularity. This section first introduces what a large model is and its basic concepts, then explores the practical applications of large models through a few use cases, and finally introduces the AI frameworks that support large model training.
In August 2021, Fei-Fei Li and over 100 scholars jointly published a research report titled “On the Opportunities and Risks of Foundation Models,” a comprehensive review, spanning over 200 pages, of the opportunities and challenges facing large-scale pre-trained models.
In the article, AI experts collectively refer to large models as “Foundation Models,” which can be translated as “fundamental models” or “cornerstone models.” The paper affirms the role of foundation models in driving the basic cognitive capabilities of intelligent agents, and it highlights two characteristics exhibited by large models: “emergence” and “homogenization.” “Emergence” means that a system’s behavior arises implicitly rather than being explicitly designed in. “Homogenization” means that the capabilities of the foundation model become the center and core of intelligence, so any improvement to a large model quickly spreads across the entire community; however, any flaw in the foundation model is likewise inherited by all downstream models.
Returning to large models, the introduction of the Transformer architecture in 2017 allowed deep learning models to surpass 100 million parameters. The following graph illustrates the progression from early models like LeNet, AlexNet, and ResNet, with steadily increasing parameter counts. With the introduction of the BERT network model, the parameter count exceeded 300 million for the first time. The GPT-3 model then pushed past 100 billion parameters, and the Pangu model achieved a dense scale of over 100 billion parameters. The Switch Transformer finally broke the trillion-parameter mark.
Taking the GPT series as an example:
- GPT-1 had roughly 100 million parameters (117 million). The dataset used was BookCorpus, a collection of over 7,000 unpublished books.
- GPT-2 had a parameter count of 1.5 billion. Its training data came from the internet: outbound links shared on Reddit, which after cleaning yielded about 8 million documents totaling over 40GB of text (WebText).
- GPT-3 pushed the parameter count to 175 billion, breaking the 100-billion mark for the first time. The dataset was expanded to include filtered Common Crawl (570GB, about 410 billion tokens), WebText2 (19 billion tokens), two book corpora (67 billion tokens combined), and Wikipedia (3 billion tokens).
As we can see, each generation brought a significant leap in data, in both coverage and richness. It follows that the next trillion-parameter model will struggle to deliver substantial improvements unless the quality, sources, and scale of its data change significantly compared to GPT-3.
Large models have made waves across industries, showcasing not only mastery of distributed parallelism and AI algorithms but also the groundbreaking efforts of major companies using large-scale AI clusters to push the boundaries of what is possible.
As network models continue to grow, once the parameter count exceeds billions it becomes increasingly difficult to train them with existing resources, whether a single GPU, a single multi-GPU machine, or even a small multi-machine, multi-GPU cluster. This has led some researchers to raise questions:
- Will blindly making models bigger and exponentially increasing the parameter count truly improve the learning ability of AI models?
- Will it truly bring about genuine intelligence?
- Some even challenge: why can’t these models solve simple elementary-school math problems?
- Why is the text they generate sometimes illogical?
- Why are their medical recommendations unreliable?
It is worth clarifying that large models like GPT-3 currently achieve their zero-shot and few-shot abilities primarily through extensive memorization of massive corpora during pre-training, reinforced by semantic encoding, long-range dependency modeling, text-generation ability, and the design of natural-language task descriptions. However, nothing in training explicitly guides the model to learn few-shot generalization. It is therefore understandable that they may fail on tasks involving niche corpora, logical understanding, mathematical problem-solving, and similar language tasks.
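The “natural language task description” mechanism mentioned above can be illustrated with a minimal sketch: a few-shot prompt simply concatenates an instruction and a handful of labeled examples in front of the query, with no gradient updates. The sentiment examples below are invented for illustration.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Concatenate a task description, labeled examples, and the query
    into a single prompt string, as used for few-shot inference."""
    lines = [task_description]
    for text, label in examples:
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Hypothetical sentiment-classification demonstration.
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [("The movie was wonderful.", "positive"),
     ("The food was cold and bland.", "negative")],
    "A thoroughly enjoyable read.",
)
print(prompt)
```

The model is expected to continue the string after the final “Output:”, which is why memorized patterns from pre-training, rather than any learned generalization procedure, drive the result.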
Although there were initial doubts when large models were introduced, it cannot be denied that they have achieved what early pre-trained models could not or did not do well. For example, in natural language processing, tasks such as text generation, text comprehension, and automated question answering have seen significant improvements. The generated text is not only more fluent but also exhibits improved factual accuracy. However, whether large models can ultimately lead to general artificial intelligence remains unknown. Nevertheless, large models do have the potential to pave the way for the next significant advancements in artificial intelligence.
With the basic introduction to large models, let’s take a closer look at their specific applications.
The figures below demonstrate the breakthroughs in accuracy achieved by deep learning models on the ImageNet image dataset as new models were introduced.
The figure below illustrates the continuous improvement in machine understanding of natural language after the emergence of network pre-training models.
Although deep learning has greatly improved the accuracy and precision in many general domains, there are still many challenges with AI models. The most significant issue is the lack of generalizability, meaning that a model designed for a specific domain A often performs poorly when applied to domain B.
1. Model Fragmentation: Large Models Provide a Pre-training Solution
AI currently faces demand from many industries and business scenarios, and that demand is fragmented and diverse. From development, tuning, and optimization to iteration and deployment, the cost of building AI models is extremely high, making it difficult to meet customized market needs. Hence the quip that AI development today resembles a manual workshop: essentially, any company that wants to empower its business with AI has to hire its own AI developers.
To move from the manual workshop to a factory model, large models offer a feasible path: “pre-train a large model, then fine-tune for downstream tasks.” Large-scale pre-training effectively captures knowledge from vast amounts of labeled and unlabeled data, stores it in a huge number of parameters, and, after task-specific fine-tuning, greatly extends the model’s generalization ability. In NLP, for example, a pre-trained large model shares its parameters between the pre-training task and downstream tasks, to some extent solving the generality problem, and can be applied to natural-language tasks such as translation, question answering, and text generation.
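The “pre-train + fine-tune” pattern can be sketched in a few lines of PyTorch: freeze the shared pre-trained parameters and train only a small task-specific head. The tiny `backbone` below is a stand-in for a real pre-trained encoder (in practice it would be loaded from a checkpoint), and all shapes are illustrative.

```python
import torch
from torch import nn

# Stand-in for a pretrained backbone; in practice this would be
# loaded from a checkpoint (e.g. a BERT or ResNet encoder).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))

# Freeze the pretrained parameters so only the task head is updated.
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(64, 3)  # new head for a hypothetical 3-class downstream task
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One fine-tuning step on a random toy batch.
x = torch.randn(8, 32)
y = torch.randint(0, 3, (8,))
logits = head(backbone(x))
loss = loss_fn(logits, y)
loss.backward()
opt.step()
```

Because only the head’s parameters receive gradients, each downstream task reuses the expensive pre-trained knowledge at a fraction of the original training cost, which is exactly the factory-mode economics described above.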
The field of NLP has seen rapid development of large-scale pre-trained models, from BERT to GPT-3 and now the trillion-parameter Switch Transformer, with model size, data volume, and compute growing at an astonishing rate. Just how large are they? GPT-3 has 175 billion parameters, was trained on over 45TB of raw data, and required over 3.14E23 FLOPs of compute, more than 1,900 times that of BERT. Despite the staggering amounts of data and parameters, the FLOP-matched Switch Transformer still outperforms T5-Base and T5-Large by 4.4% and 2% respectively on the SuperGLUE NLP benchmark. Overall, the Switch Transformer brings significant gains across a range of inference and knowledge tasks, indicating that ultra-large architectures are not only useful for pre-training but also improve quality when fine-tuned for downstream tasks.
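The 3.14E23 figure is consistent with the common rule of thumb that training a dense transformer costs roughly 6 floating-point operations per parameter per token. A quick sanity check, using the 300-billion-token training budget reported for GPT-3 (the constant 6 is an approximation from the scaling-laws literature, not an exact count):

```python
def train_flops(params, tokens):
    """Rule-of-thumb training compute for a dense transformer:
    ~6 floating-point operations per parameter per token."""
    return 6 * params * tokens

# GPT-3: 175 billion parameters, ~300 billion training tokens.
gpt3 = train_flops(175e9, 300e9)
print(f"{gpt3:.2e}")  # ~3.15e+23, matching the 3.14E23 figure above
```

The estimate lands within a fraction of a percent of the cited number, which is why this formula is widely used for back-of-the-envelope compute budgeting.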
2. Large models have the capability of self-supervised learning, which helps to reduce training and development costs.
Self-supervised learning in large models reduces the need for data annotation, to some extent addressing the high cost, long turnaround, and low accuracy of manual labeling. It also gives models better few-shot learning ability than before, an advantage that grows with parameter count. Developers can thus avoid large-scale training of their own and instead adapt the desired model with small samples, greatly reducing development and usage costs.
BERT was first introduced in 2018 and immediately surpassed state-of-the-art results on 11 NLP tasks, becoming a new milestone in the field of NLP. It also opened up new avenues for model training and NLP by demonstrating that deep exploitation of unlabeled data can greatly improve the performance of various tasks. It is important to note that data annotation relies on expensive human labor, while in the era of the internet and mobile internet, a large amount of unlabeled data is readily available.
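BERT’s core pre-training objective, masked language modeling, shows why no human labels are needed: the prediction targets are carved out of the raw text itself. A minimal sketch of the masking step (the 15% rate follows the BERT recipe; the sentence, the `[MASK]` string handling, and the seed are illustrative):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Randomly replace a fraction of tokens with [MASK]; the original
    tokens become the prediction targets, so no annotation is needed."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)  # position not scored in the loss
    return inputs, targets

inputs, targets = mask_tokens("large models learn from unlabeled text".split())
print(inputs)
print(targets)
```

Every web page or book thus yields training signal for free, which is what makes the cheap, abundant unlabeled data mentioned above directly usable.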
3. Large-scale models have the potential to further break through the accuracy limitations of existing model structures.
From the first 10 years of deep learning’s development, it can be observed that improvements in model accuracy relied primarily on structural changes to the network. For example, from AlexNet to ResNet50, and then to EfficientNet discovered through NAS, ImageNet Top-1 accuracy increased from 58% to 84%. However, as neural network design techniques have matured and converged, it has become increasingly difficult to break through accuracy limits by optimizing the network structure alone. In recent years, further accuracy gains have instead come from the continuous growth of data and model size. Research experiments have shown that scaling up both the model and the data can indeed overcome the limits of existing accuracy.
Big Transfer (BiT), a visual transfer-learning model released by Google in 2020, illustrates that scaling up the data alone improves accuracy. For instance, training ResNet50 on the ILSVRC-2012 dataset (1.28 million images, 1,000 categories) versus the JFT-300M dataset (300 million images, 18,291 categories) yielded accuracies of 77% and 79% respectively. Training the larger ResNet152x4 on JFT-300M reached 87.5%, 10.5 percentage points higher than ResNet50 trained on ILSVRC-2012.
Applications of large-scale models
Since large models can break through existing accuracy limits and adapt well to downstream tasks, what specific application scenarios are worth introducing?
- Natural Language Processing (NLP): Large models can be used for tasks such as machine translation, sentiment analysis, text summarization, and question-answering systems. These models can understand and generate human-like text, improving the accuracy and fluency of NLP applications.
- Computer Vision: Large models can be applied to image classification, object detection, and image generation tasks. They can accurately identify and classify objects in images, detect and track multiple objects, and generate realistic images.
- Speech Recognition: Large models can be used for speech recognition tasks, enabling accurate transcription of spoken language into written text. They can improve the accuracy and efficiency of voice assistants, transcription services, and voice-controlled systems.
- Recommendation Systems: Large models can be employed in recommendation systems to provide personalized recommendations for users. By analyzing user behavior and preferences, these models can suggest relevant products, movies, music, or articles, enhancing the user experience.
- Healthcare: Large models can be utilized in medical imaging analysis, disease diagnosis, and drug discovery. They can assist doctors in detecting abnormalities in medical images, predicting disease outcomes, and identifying potential drug candidates.
Large model training framework
Currently, general-purpose deep learning frameworks such as PyTorch and TensorFlow struggle on their own to meet the demands of training very large models. Microsoft therefore developed DeepSpeed on top of PyTorch, while companies such as Huawei (MindSpore) and Baidu (PaddlePaddle) have also explored large-model training in depth, providing native AI-framework support for it.
Microsoft released DeepSpeed in February 2020, with its core feature being the memory-optimization technology ZeRO (Zero Redundancy Optimizer). DeepSpeed advances large-model training along four axes: scale, memory optimization, speed, and cost control. Based on DeepSpeed, Microsoft trained the 17-billion-parameter Turing Natural Language Generation model (Turing-NLG). ZeRO-2, released in May 2020, supports training models with up to 200 billion parameters. Microsoft later collaborated with NVIDIA to release Megatron-Turing NLG, an NLP model with 530 billion parameters, trained on a cluster of 4480 A100 GPUs.
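ZeRO is enabled through DeepSpeed’s JSON configuration. A minimal sketch of a ZeRO stage-2 setup, expressed as the Python dict DeepSpeed accepts (the field names follow the DeepSpeed documentation; the batch size and fp16 flag are illustrative, not recommendations):

```python
# Minimal DeepSpeed configuration enabling ZeRO stage 2, which
# partitions optimizer states and gradients across data-parallel ranks.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,            # 1: optimizer states; 2: + gradients; 3: + parameters
        "overlap_comm": True,  # overlap gradient reduction with the backward pass
    },
}

# Typical entry point (sketch only, requires a distributed launch):
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```

Raising the stage trades communication for memory: stage 3 additionally partitions the parameters themselves, which is what makes hundred-billion-parameter models fit on commodity clusters.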
In the field of large-scale model engineering, the main contenders are NVIDIA’s GPUs with Microsoft’s DeepSpeed, Google’s TPUs with TensorFlow, and Huawei’s Ascend Atlas 800 with MindSpore. These three vendors are capable of optimizing across the full stack; most other manufacturers innovate and optimize on top of NVIDIA GPUs. Ultimately, core technology alone does not decide the market: the true winner is whoever creates the greatest value for customers.