Run LLM Pre-Training on Nutanix

Introduction 

The goal of this article is to show how customers can develop a generative pre-trained transformer (GPT) model from scratch using open source Python® libraries on the Nutanix Cloud Platform™ (NCP), a full-stack infrastructure platform for LLMs.

We largely adopted the model architecture from the GPT-2 paper, which describes a 1.5B-parameter model, and built a miniaturized GPT model with 11M parameters. In this article, the model is trained on the wikitext-2-raw dataset, a corpus extracted from Wikipedia articles. Pre-trained models, such as GPT-2, GPT-3, PaLM, Llama-2, and BloombergGPT, serve as a foundation for sophisticated LLMs performing various NLP tasks such as question answering, summarization, machine translation, code generation, sentiment analysis, named entity recognition, text entailment, commonsense reasoning, and reading comprehension.

We presume that a deep understanding of pre-trained model training leads to better decision-making for LLM system design, including model sizing, dataset sizing, and compute infrastructure provisioning. Some customers might need to train a pre-trained model from the ground up because their datasets are very different from datasets like C4, which are used to train popular pre-trained models such as GPT-3 and Llama-2. 

Compute infrastructure provisioning for LLM pre-training can be a complex endeavor, especially in consideration of data privacy, data sovereignty, and data governance. The breakneck speed of AI innovation around model and training dataset availability and compute optimization is triggering rapid enterprise AI adoption. This gold rush of enterprise AI adoption demands infrastructure solutions that scale easily, securely, and robustly while being managed by teams trained on more traditional workloads. In this article, we show how NCP can be leveraged to train GPT-2 or larger models with WebText or a similar dataset.

LLM Pre-Training

LLM pre-training typically uses heavily over-parameterized, decoder-only transformers as the base architecture and models natural language in a bidirectional, autoregressive, or sequence-to-sequence manner on large-scale unsupervised corpora.

The premise of the GPT-2 model is that a high-capacity, self-supervised, generative model trained on a dataset of sufficiently large volume and variety can perform multiple tasks simultaneously with minimal discriminative supervision in the form of fine-tuning or in-context learning.

The generative capability of some pre-trained, high-capacity models, such as GPT-4 and PaLM, is so good that it obviates the need for downstream discriminative supervision. Due to their autoregressive nature, pre-trained models are most adept at text generation and fill-mask tasks.

Large language models have truly captured popular media, which is abuzz with jargon and catchphrases. This section explains what pre-training means in the context of a large language model and how it differs from other functional phases, such as supervised finetuning and reinforcement learning from human feedback (RLHF), as shown in Figure 1.

Figure 1: Different functional phases of large language model training, including pre-training, supervised finetuning, and RLHF. This article focuses specifically on the pre-training block, which involves pre-training data, the self-supervised learning algorithm, and pre-trained model inference/deployment

Pre-training deals with developing a self-supervised learning model from a large corpus of data. A pre-trained model takes text (a prompt) and generates text (a completion), as shown in Figure 2.

Figure 2: Operational model of a pre-trained large language model

Large language models are autoregressive in nature. They typically use decoder-only transformer models for self-supervised learning. Supervised finetuning and RLHF are used for adapting pre-trained LLMs to domain-specific tasks such as summarization and question answering.
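
To make the autoregressive loop concrete, here is a minimal PyTorch-style sketch of how a decoder-only model generates a completion token by token. The `model`, `encode`, and `decode` names are hypothetical placeholders, not part of the codebase discussed later in this article.

```python
import torch

# Hypothetical helpers: `model` maps a (1, T) tensor of token ids to logits of
# shape (1, T, vocab_size); `encode`/`decode` convert between text and token ids.
@torch.no_grad()
def complete(model, encode, decode, prompt, max_new_tokens=50):
    idx = torch.tensor([encode(prompt)], dtype=torch.long)  # (1, T)
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]           # logits for the last position only
        probs = torch.softmax(logits, dim=-1)   # next-token distribution
        next_id = torch.multinomial(probs, 1)   # sample one token
        idx = torch.cat([idx, next_id], dim=1)  # append and repeat (autoregression)
    return decode(idx[0].tolist())
```

Supervised finetuning and RLHF adjust the weights of such a model; they do not change this basic token-by-token decoding loop.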

AI in Nutanix Infrastructure Stack

Figure 3 shows how AI/ML is integrated into the core Nutanix® infrastructure layer. Foundation models, which are essentially pre-trained generative models such as BERT, GPT-3, and DALL-E, assume a central role in this integration. These pre-trained models run on the cloud-native infrastructure stack of NCP and empower customers to build a wide array of generative AI apps.

Figure 3: AI stack running on the cloud-native infrastructure stack of NCP. The stack provides holistic integration between the supporting cloud-native infrastructure layers (the chip layer, the virtual machine layer, and the supporting library/tooling layer) and the AI stack layer, which includes foundation models (different variants of transformers) and task-specific AI app layers

Setting up NCP for GPT Model Pre-Training

At Nutanix, we are dedicated to enabling customers to build and deploy intelligent applications anywhere—edge, core datacenters, service provider infrastructure, and public clouds. The Prism Element™ management console enables users to fully configure, manage and monitor Nutanix clusters running any hypervisor. Therefore, the first step of the Nutanix infrastructure setup is to log into a Prism Element, as shown in Figure 4.

  1. Log into a Prism Element (the UI is shown in Figure 4)
Figure 4: The UI showing the setup for a Prism Element on which the transformer model for this article was trained. It shows the AHV® hypervisor summary, storage summary, VM summary, hardware summary, monitoring for cluster-wide controller IOPS, monitoring for cluster-wide controller I/O bandwidth, monitoring for cluster-wide controller latency, cluster CPU usage, cluster memory usage, granular health indicators, and data resiliency status
  2. Set up the Virtual Machine

After logging into Prism Element, we create a virtual machine (VM) hosted on our Nutanix AHV® cluster. As shown in Figure 5, the VM has the following resource configuration: the Ubuntu® 22.04 operating system, 16 single-core vCPUs, 64 GB of RAM, and an NVIDIA® A100 Tensor Core passthrough GPU with 40 GB of memory. The GPU is installed with the NVIDIA 15.0 driver for Ubuntu (NVIDIA-Linux-x86_64-525.60.13-grid.run). Large deep learning models with transformer architectures require GPUs or other compute accelerators with high memory bandwidth and large register and L1 memory capacity.

Figure 5: The VM resource configuration UI pane on Nutanix Prism Element. As shown, it helps a user configure the number of vCPU(s), the number of cores per vCPU, memory size (GiB), and GPU choice. We used an NVIDIA A100 with 40 GB of memory for this article
  3. Underlying A100 GPU

The NVIDIA A100 Tensor Core GPU is designed to power the world’s highest-performing elastic datacenters for AI, data analytics, and HPC. Powered by the NVIDIA Ampere™ architecture, A100 is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands.

The A100 80GB debuts the world’s fastest memory bandwidth at over 2 terabytes per second (TB/s) to run the largest models and datasets. To peek into the detailed features of our A100 GPU, we run the `nvidia-smi` command, a command-line utility built on top of the NVIDIA Management Library (NVML) and intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 6. It reports driver version 515.86.01 and CUDA version 11.7.

Figure 6: Output of `nvidia-smi` for the underlying A100 GPU

Figure 6 shows several critical features of the A100 GPU we used. The details of these features are described in Table 1.  

| Feature | Value | Description |
|---|---|---|
| GPU | 0 | GPU index |
| Name | NVIDIA A100 | GPU name |
| Temp | 34 C | Core GPU temperature |
| Perf | P0 | GPU performance state |
| Persistence-M | On | Persistence mode |
| Pwr: Usage/Cap | 36W / 250W | GPU power usage and its cap |
| Bus Id | 00000000:00:06.0 | domain:bus:device.function |
| Disp. A | Off | Display active |
| Memory-Usage | 25939MiB / 40960MiB | Memory allocated out of total memory |
| Volatile Uncorr. ECC | 0 | Count of uncorrectable ECC memory errors |
| GPU-Util | 0% | GPU utilization |
| Compute M. | Default | Compute mode |
| MIG M. | Disabled | Multi-Instance GPU mode |

Table 1: Description of the key features of the underlying A100 GPU
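
Beyond `nvidia-smi`, the same GPU can be verified from inside the Python environment. The snippet below is a small sanity check (not part of the repo walked through later) that uses standard PyTorch calls to confirm the passthrough GPU is visible to the training process.

```python
import torch

# Quick sanity check that the passthrough A100 is visible to PyTorch.
assert torch.cuda.is_available(), "CUDA device not visible inside the VM"
props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))                          # GPU name string
print(f"total memory: {props.total_memory / 2**30:.1f} GiB")  # e.g. ~40 GiB
print(f"compute capability: {props.major}.{props.minor}")
print(f"CUDA version PyTorch was built with: {torch.version.cuda}")
```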

GPT Pre-Training on Nutanix Cloud Platform

Figure 7: Different elements of a pre-trained GPT model

The development of a pre-trained GPT model involves three different artifacts: data, model, and infrastructure. The infrastructure piece was described in the last section. For the data, we used the wikitext-2-raw dataset, a corpus extracted from Wikipedia articles. For the model, we used a transformer-based architecture similar to GPT-2.

Code Walkthrough 

This section walks through the implementation details of the GPT model. The pertinent codebase can be found in this repo, which has the following directory structure:

├── LICENSE
├── README.md
├── configurator.py
├── data
│   ├── data
│   │   └── input.txt
│   └── prepare.py
├── model.py
├── requirements.txt
├── sample.py
└── train.py

| File | Description |
|---|---|
| LICENSE | Apache 2.0 license |
| README.md | Readme markdown |
| configurator.py | Configuration file |
| data/data/input.txt | Input text file |
| model.py | Outlines model-specific details |
| data/prepare.py | Prepares the dataset |
| requirements.txt | Declares required Python libraries |
| sample.py | Generates sample output |
| train.py | Runs training |

Table 2: Details of the GPT codebase used in this article.

Data Engineering

We use the wikitext-2-raw dataset with a character-level vocabulary size of 1,013 and 10,918,892 characters. The relevant data engineering code can be found in the `prepare.py` script and is run with the `python prepare.py` command. It splits the dataset into training and validation tokens, stored as `train.bin` and `val.bin` in the data directory. The training set has 9,827,002 tokens and the validation set has 1,091,890 tokens.
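
To illustrate what this preparation step looks like, here is a minimal character-level sketch in the spirit of `prepare.py`. The exact script in the repo may differ; the 90/10 split ratio is an assumption consistent with the token counts reported above, and the file paths follow the repo layout in Table 2.

```python
import os
import numpy as np

# Minimal character-level data preparation, in the spirit of prepare.py.
# Assumed layout: data/data/input.txt holds the raw wikitext-2-raw text.
input_path = os.path.join("data", "data", "input.txt")
with open(input_path, "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                     # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer token id
print(f"vocab size: {len(chars)}, characters: {len(text)}")

n = len(text)
train_ids = [stoi[c] for c in text[: int(0.9 * n)]]  # assumed 90/10 split
val_ids = [stoi[c] for c in text[int(0.9 * n):]]

# Store token ids as uint16 binaries consumed by the training script.
np.array(train_ids, dtype=np.uint16).tofile(os.path.join("data", "train.bin"))
np.array(val_ids, dtype=np.uint16).tofile(os.path.join("data", "val.bin"))
```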

Model Training Details

Table 3 shows the details of the different model and training parameters used for this article. Due to limited computational resources, we used a low-capacity model with 11M parameters.

| Parameter | Specification |
|---|---|
| batch size | 32 |
| block_size (context window) | 256 |
| number of transformer layers | 6 |
| number of attention heads | 6 |
| embedding dimension | 384 |
| dropout | 0.2 |
| learning rate | 1e-3 |
| weight decay | 1e-1 |
| max iterations | 5000 |
| learning rate decay iterations | 5000 |
| minimum learning rate | 1e-4 |
| beta1 | 0.9 |
| beta2 | 0.99 |
| grad clip | 0.1 |
| warmup iterations | 100 |
| loss function | cross entropy |
Table 3: Details of different model and training parameters used for this article
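
To show how the optimizer-related rows of Table 3 fit together, the sketch below wires the listed values into an AdamW optimizer and a warmup-plus-cosine learning-rate schedule. It is a hedged illustration of one common way to implement such a schedule, not necessarily a verbatim copy of `train.py`; the `model` object is assumed to be the 11M-parameter GPT defined in `model.py`.

```python
import math
import torch

# Optimizer and schedule hyperparameters taken from Table 3.
learning_rate, min_lr = 1e-3, 1e-4
warmup_iters, lr_decay_iters, max_iters = 100, 5000, 5000
weight_decay, beta1, beta2, grad_clip = 1e-1, 0.9, 0.99, 0.1

def get_lr(it):
    """Linear warmup to learning_rate, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

def make_optimizer(model):
    # AdamW with the betas and weight decay listed in Table 3.
    return torch.optim.AdamW(model.parameters(), lr=learning_rate,
                             betas=(beta1, beta2), weight_decay=weight_decay)

# Inside the training loop, the learning rate is set each iteration and
# gradients are clipped before the optimizer step, e.g.:
#   for group in optimizer.param_groups:
#       group["lr"] = get_lr(it)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
```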

Results

| Summary Statistic | Final Value |
|---|---|
| Iterations | 5000 |
| Learning rate | 0.0001 |
| Model FLOPs utilization (MFU) | 13.23778% |
| Training loss | 1.21256 |
| Validation loss | 1.31918 |
Table 4: Summary statistics for the training used in this article
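Model FLOPs utilization (MFU) is the fraction of the accelerator's theoretical peak FLOPs that the training run actually achieves, reported in Table 4 as a percentage. A rough back-of-the-envelope estimate can be computed as below; the ~6 FLOPs-per-parameter-per-token approximation (which ignores the attention term) and the A100's 312 TFLOPS bf16/fp16 tensor-core peak are assumptions for illustration, not values pulled from the training logs.

```python
# Rough MFU estimate: achieved FLOPs per second / peak FLOPs per second.
n_params = 11e6                # model size (11M parameters)
tokens_per_iter = 32 * 256     # batch size x context window from Table 3
peak_flops = 312e12            # assumed A100 bf16/fp16 tensor-core peak

def estimate_mfu(iters_per_second):
    # ~6 FLOPs per parameter per token covers the forward and backward passes.
    achieved = 6 * n_params * tokens_per_iter * iters_per_second
    return 100.0 * achieved / peak_flops  # expressed as a percentage
```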
Figure 8: Summary statistics for the training used in this article.
Figure 9: Output from the GPT model in this article on wikitext data.

Figure 9 shows sample output from the GPT model trained on this data; the quality is reasonable given the validation loss. Sampling generates 500 new tokens per sample and keeps only the 200 most likely tokens at each step.
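
The sampling behavior described above corresponds to temperature-scaled top-k sampling. The sketch below illustrates the idea under the hyperparameters from this article (500 new tokens, top-k of 200, 256-token context window); it is an illustration, not a verbatim copy of `sample.py`.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens=500, temperature=1.0, top_k=200, block_size=256):
    """Top-k sampling; `model` maps (B, T) token ids to (B, T, vocab) logits."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                   # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature  # logits for the next token
        v, _ = torch.topk(logits, top_k)                  # k-th largest logit per row
        logits[logits < v[:, [-1]]] = -float("inf")       # discard everything below it
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```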

Key Takeaways

  1. We trained a miniaturized GPT model with 11M parameters on the wikitext-2-raw dataset. The model has an architecture similar to GPT-2.
  2. The model achieves a reasonably good cross-entropy loss of 1.31918 on the validation dataset.

Next Steps

We have focused on LLM pre-training in this blog. Previously, we covered LLM fine-tuning. Our future blogs will explore the following topics:

  1. Reinforcement learning and how RLHF helps in improving LLM fidelity.
  2. LLM benchmarking with HELM.

Acknowledgement

The authors would like to acknowledge the contribution of Johnu George, staff engineer at Nutanix.

© 2023 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

This post may contain express and implied forward-looking statements, which are not historical facts and are instead based on our current expectations, estimates and beliefs. The accuracy of such statements involves risks and uncertainties and depends upon future events, including those that may be beyond our control, and actual results may differ materially and adversely from those anticipated or implied by such statements. Any forward-looking statements included herein speak only as of the date hereof and, except as required by law, we assume no obligation to update or otherwise revise any of such forward-looking statements to reflect subsequent events or circumstances.
