LLM Model Selection and Inference Benchmarking on Nutanix Cloud Platform



Summary

Large Language Models (LLMs) based on the transformer architecture (for example, GPT, T5, and BERT) have achieved state-of-the-art (SOTA) results in various Natural Language Processing (NLP) tasks such as multi-language understanding (GPT-4 achieving an MMLU score of 86.4), logical interpretation (PaLM-2-L achieving a BBH score of 65.7), multi-step mathematical reasoning (GPT-4 achieving a GSM8K score of 92.0), reading comprehension (PaLM-2-L scoring 86.1 on TriviaQA), and question answering (PaLM-2-L scoring 37.5 on Natural Questions). There are currently hundreds of pre-trained LLMs, with new models released frequently. These models differ in fine-grained details such as architecture, token sizes, data lineage, context length, parameter count, number of layers, model dimensionality, attention head count, training languages, benchmarking results, and so on. Choosing from such a large variety of pre-trained LLMs can be intimidating for enterprise users. As we interact with enterprise customers and partners, we often encounter the question: which LLMs are good, and can Nutanix Cloud Infrastructure (NCI) support them? This article discusses some of the prominent pre-trained LLMs available (either open source or proprietary) and their common use cases. We also present a benchmarking study evaluating the inference latency of different models on a given configuration of the Nutanix Cloud Platform (NCP).

Pre-Training of Large Language Models

Large language models have captured the attention of popular media, which is abuzz with jargon and catchphrases. This section explains what pre-training means in the context of a large language model, and how it differs from the other functional phases, supervised finetuning and RLHF, shown in Figure 1.

Figure 1: Different functional phases of large language model training: pre-training, supervised finetuning, and RLHF (reinforcement learning from human feedback).

Pre-training develops a self-supervised learning model from a large corpus of data. A pre-trained model takes text (a prompt) and generates text (a completion), as shown in Figure 2.

Figure 2: Operational model of a pre-trained large language model.

Large language models are auto-regressive in nature. They typically use decoder-only transformer models for self-supervised learning. Supervised finetuning and RLHF are used for adapting pre-trained LLMs to domain-specific tasks such as summarization and question answering. 
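To make the auto-regressive loop concrete, here is a minimal sketch of greedy next-token decoding with the Hugging Face `transformers` library, using one of the open source models benchmarked later (EleutherAI/gpt-j-6b). The model choice and decoding strategy are illustrative; production inference would typically call `model.generate()` with a KV cache instead.

```python
# A minimal sketch of greedy auto-regressive decoding (no KV cache).
# Requires: pip install torch transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6b", torch_dtype=torch.float16, device_map="auto")

ids = tokenizer("Generative AI is ", return_tensors="pt").input_ids.to(model.device)
for _ in range(50):                        # emit up to 50 new tokens
    with torch.no_grad():
        logits = model(ids).logits         # [batch, seq_len, vocab_size]
    next_id = logits[0, -1].argmax()       # greedy pick of the next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(ids[0]))
```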

List of Pre-Trained Large Language Models

Table 1 shows a list of 15 different pre-trained LLMs and their parameter counts, use cases, and managing organizations. The parameter counts range roughly from 1.1B (SantaCoder) to an unofficially estimated 1.7T (GPT-4). The use cases include conversational interfaces such as chatbots, question answering, logical and mathematical reasoning, and summarization. These models come both from corporate research labs (such as OpenAI) and from open source projects (such as Open Assistant). A model's parameter count is determined by its embedding dimensions, number of attention heads, number of layers, dimensionality of keys and values, and dropout rates; the sketch after Table 1 illustrates this relationship.

| Model | Parameter Count | Use Cases | Organization |
|---|---|---|---|
| LLaMA-2 | 70B | chatbots, question answering, math | Meta |
| GPT-4 | 1.7T¹ | chatbots, AI system conversations, and virtual assistants | OpenAI |
| GPT-3 | 175B | create human-like text and content (images, music, and more), and answer questions in a conversational manner | OpenAI |
| Codex | 12B | programming, writing, and data analysis | OpenAI |
| Claude-V1 | 52B | research, creative writing, collaborative writing, Q&A, coding, summarization | Anthropic |
| Bloom | 176B | text generation, exploring characteristics of language generated by a language model | BigScience |
| FLAN-t5-xxl | 11B | research on language models | Google |
| Open-Assistant SFT-4 12B | 12B | an assistant that responds to user queries with helpful answers | Open Assistant Project |
| SantaCoder | 1.1B | multilingual large language model for code generation | BigCode |
| PaLM 2 | 340B | common sense reasoning, formal logic, mathematics, and advanced coding in 20+ languages | Google |
| Gopher | 280B | reading comprehension, fact-checking, understanding toxic language, and logical and common sense tasks | DeepMind |
| Falcon | 40B | commercial uses, chatting | TII |
| Vicuna | 33B | chatbots, research, hobby use | LMSYS |
| MPT | 30B | reading- and writing-related use cases | MosaicML |
| ERNIE 3.0 Titan | 260B | chatbots | Baidu |

Table 1: A list of pre-trained models and their granular details, including parameter counts, common use cases, and managing organizations.
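As a rough illustration of how these hyperparameters set the parameter count, a widely used approximation for decoder-only transformers is N ≈ 12 · n_layers · d_model² (excluding embeddings). The sketch below is a back-of-the-envelope calculation, not any model's exact accounting; with GPT-3's published hyperparameters it comes close to the quoted 175B.

```python
def approx_params(n_layers: int, d_model: int, vocab_size: int = 0) -> float:
    """Rule-of-thumb size of a decoder-only transformer.

    Per layer: ~4*d^2 weights for the Q/K/V/output projections plus
    ~8*d^2 for a 4x-expansion MLP, i.e. ~12*d^2 per layer; embeddings
    add vocab_size*d_model. Biases and LayerNorms are ignored.
    """
    return 12 * n_layers * d_model**2 + vocab_size * d_model

# GPT-3's published hyperparameters: 96 layers, d_model = 12288
print(f"~{approx_params(96, 12288) / 1e9:.0f}B parameters")  # ~174B vs. the quoted 175B
```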

Inference Benchmarking of Open Source LLMs on Nutanix Cloud Platform

In this section, we present an inference benchmarking study of different open source LLMs with the Apache 2.0 license on the Nutanix Cloud Platform (NCP).

Configuration of Nutanix Cloud Platform

With the Nutanix Cloud Platform, Nutanix delivers the simplicity and agility of a public cloud alongside the performance, security, and control of a private cloud. At Nutanix, we are dedicated to enabling customers to build and deploy intelligent applications anywhere: edge, core data centers, service provider infrastructure, and public clouds. We offer zero-touch AI infrastructure that reduces the configuration and maintenance burden on machine learning scientists and data engineers. Prism Element (PE) is a service built into the platform for every Nutanix cluster deployed; it enables a user to fully configure, manage, and monitor Nutanix clusters running any hypervisor. Therefore, the first step of the Nutanix infrastructure setup is to log into Prism Element, as shown in Figure 3.

Log into Prism Element on a Cluster (UI shown in Figure 3)

Figure 3: The UI showing the Prism Element setup on which the inference benchmarking for this article was run. It shows the hypervisor summary (AHV), storage summary, VM summary, hardware summary, monitoring of cluster-wide controller IOPS, I/O bandwidth, and latency, cluster CPU and memory usage, granular health indicators, and data resiliency status.
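For automation, cluster information like that shown in the Figure 3 dashboard is also available over Prism's REST API on port 9440. The sketch below is a hypothetical example only; the placeholder address, credentials, endpoint path, and response fields are assumptions to adapt to your own cluster and AOS release.

```python
# Hypothetical sketch: query Prism's v2.0 REST API for cluster details.
import requests

resp = requests.get(
    "https://<prism-element-ip>:9440/PrismGateway/services/rest/v2.0/clusters",
    auth=("admin", "<password>"),  # use an RBAC-scoped account (see footnote 2)
    verify=False,                  # lab clusters often use self-signed certs
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity.get("name"), entity.get("version"))
```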

VM Configuration

After logging into Prism Element², we create a VM hosted on our Nutanix AHV cluster. As shown in Figure 4, the VM has the following resource configuration: Ubuntu 22.04 operating system, 16 single-core vCPUs, 64 GB of RAM, and an NVIDIA A100 Tensor Core GPU in passthrough mode with 40 GB of memory (consistent with the nvidia-smi output in Figure 5 and Table 2). The GPU is installed with the NVIDIA vGPU 15.0 driver for Ubuntu (NVIDIA-Linux-x86_64-525.60.13-grid.run). Large deep learning models with the transformer architecture require GPUs or other compute accelerators with high memory bandwidth, large register files, and large L1 caches.

Figure 4: The VM resource configuration UI pane on Nutanix Prism Element. As shown, it helps a user configure the number of vCPU(s), the number of cores per vCPU, memory size (GiB), and GPU choice. We used an NVIDIA A100 40 GB for this work.
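Once the VM is up, a quick way to confirm that the passthrough GPU is visible to the guest is a short PyTorch check. This is just a sanity-check sketch; any CUDA-enabled framework works.

```python
import torch

# Confirm the passthrough GPU is visible inside the guest VM.
assert torch.cuda.is_available(), "No CUDA device visible in this VM"
props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))              # e.g. "NVIDIA A100 ..."
print(f"{props.total_memory / 2**30:.0f} GiB of GPU memory, "
      f"{props.multi_processor_count} SMs")
```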

Underlying A100 GPU 

The NVIDIA A100 Tensor Core GPU is designed to power the world's highest-performing elastic data centers for AI, data analytics, and HPC. Powered by the NVIDIA Ampere architecture, the A100 is the engine of the NVIDIA data center platform. The A100 provides up to 20X higher performance than the prior generation and can be partitioned into as many as seven GPU instances to adjust dynamically to shifting demands. To inspect the detailed features of the A100 GPU, we run the `nvidia-smi` command, a command line utility built on top of the NVIDIA Management Library (NVML) and intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 5. It reports a driver version of 515.86.01 and a CUDA version of 11.7.

Figure 5: The `nvidia-smi` output showing the key features of the A100 40 GB GPU used.

Table 2 shows several key features of the A100 GPU we used. 

| Feature | Value | Description |
|---|---|---|
| GPU | 0 | GPU index |
| Name | NVIDIA A100 | GPU name |
| Temp | 35 C | core GPU temperature |
| Perf | P0 | GPU performance state |
| Persistence-M | On | persistence mode |
| Pwr: Usage/Cap | 65 W / 300 W | GPU power usage and its cap |
| Bus Id | 00000000:00:06.0 | domain:bus:device.function |
| Disp. A | Off | display active |
| Memory-Usage | 25919 MiB / 40960 MiB | memory allocated out of total memory |
| Volatile Uncorr. ECC | 0 | counter of uncorrectable ECC memory errors |
| GPU-Util | 0% | GPU utilization |
| Compute M. | Default | compute mode |
| MIG M. | Disabled | Multi-Instance GPU mode |

Table 2: Description of the key features of the underlying A100 GPU.
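Because `nvidia-smi` sits on top of NVML, the same fields shown in Table 2 can also be read programmatically. Below is a minimal sketch using the `nvidia-ml-py` bindings (the `pynvml` module); all calls are standard NVML queries.

```python
# Read the Table 2 fields through NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)                  # GPU index 0
mem = pynvml.nvmlDeviceGetMemoryInfo(h)
util = pynvml.nvmlDeviceGetUtilizationRates(h)
temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
print(pynvml.nvmlDeviceGetName(h))
print(f"Memory-Usage: {mem.used // 2**20} MiB / {mem.total // 2**20} MiB")
print(f"GPU-Util: {util.gpu}%  Temp: {temp} C")
pynvml.nvmlShutdown()
```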

Benchmarking Results

We conducted our benchmarking study on 16 different open source models with the Apache 2.0 license. All models were tested with a common prompt, “Generative AI is “, and a maximum output length of 50 tokens. Table 3 lists the open source LLMs and their parameter counts; a sketch of the measurement procedure follows the table.

| Model Name (as on Hugging Face) | Parameter Count |
|---|---|
| google/flan-ul2 | 20B |
| cerebras/Cerebras-GPT-13B | 13B |
| cerebras/Cerebras-GPT-6.7B | 6.7B |
| OpenAssistant/oasst-sft-1-pythia-12b | 12B |
| EleutherAI/pythia-12b | 12B |
| EleutherAI/gpt-j-6b | 6B |
| databricks/dolly-v2-12b | 12B |
| aisquared/dlite-v2-1_5b | 1.5B |
| EleutherAI/gpt-j-6b | 6B |
| mosaicml/mpt-7b | 7B |
| RedPajama-INCITE-Base-3B-v1 | 3B |
| RedPajama-INCITE-7B-Base | 7B |
| tiiuae/falcon-7b | 7B |
| h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 | 7B |
| openlm-research/open_llama_7b | 7B |
| openlm-research/open_llama_13b | 13B |

Table 3: The model parameter counts of the open source LLMs used in the inference benchmarking study. (EleutherAI/gpt-j-6b appears twice because it was benchmarked in two independent sessions.)
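The exact benchmarking script is in the repository linked in the conclusion; the sketch below shows the essence of the measurement, assuming fp16 weights, the Hugging Face `transformers` `generate()` API, and timing across 3 runs as reported in Table 4.

```python
# A sketch of the latency measurement: mean and standard error over 3 runs.
import time
import statistics
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_name: str, prompt: str = "Generative AI is ",
              max_new_tokens: int = 50, runs: int = 3):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto",
        trust_remote_code=True)           # required by e.g. mosaicml/mpt-7b
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize()          # wait for GPU work to finish
        times.append(time.perf_counter() - start)
    sem = statistics.stdev(times) / len(times) ** 0.5  # standard error
    return tokenizer.decode(out[0]), statistics.mean(times), sem

text, mean_s, sem_s = benchmark("tiiuae/falcon-7b")
print(f"{mean_s:.2f} +/- {sem_s:.2f} s\n{text}")
```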

Latency Benchmarking

Table 4 shows the mean inference (response generation) time and its standard error across 3 independent runs for the 16 models. Smaller models are expected to have lower inference times, and this is indeed the case: the smallest model, aisquared/dlite-v2-1_5b (1.5B), has the lowest inference time of 22.87 +/- 0.51 s across the three runs. Counterintuitively, however, mosaicml/mpt-7b, with 7B parameters, has the largest generation time of 130.79 +/- 4.31 s, which may be an artifact of its model architecture. We also note that the four largest models (google/flan-ul2, cerebras/Cerebras-GPT-13B, databricks/dolly-v2-12b, openlm-research/open_llama_13b) fail to produce any response because of the limited available GPU memory (~25 GB). This is expected at 16-bit precision, where weights alone take about 2 bytes per parameter; a back-of-the-envelope estimate follows Table 4.

| Model | Mean Time (s) | Std. Error (s) |
|---|---|---|
| google/flan-ul2 | Fail | Fail |
| cerebras/Cerebras-GPT-13B | Fail | Fail |
| cerebras/Cerebras-GPT-6.7B | 105.11 | 1.60 |
| OpenAssistant/oasst-sft-1-pythia-12b | 114.09 | 0.20 |
| EleutherAI/pythia-12b | 113.92 | 0.26 |
| EleutherAI/gpt-j-6b | 87.64 | 5.17 |
| databricks/dolly-v2-12b | Fail | Fail |
| aisquared/dlite-v2-1_5b | 22.87 | 0.51 |
| EleutherAI/gpt-j-6b | 90.48 | 3.69 |
| mosaicml/mpt-7b | 130.79 | 4.31 |
| RedPajama-INCITE-Base-3B-v1 | 26.66 | 1.50 |
| RedPajama-INCITE-7B-Base | 60.23 | 0.91 |
| tiiuae/falcon-7b | 104.40 | 4.56 |
| h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 | 101.73 | 4.04 |
| openlm-research/open_llama_7b | 60.59 | 2.07 |
| openlm-research/open_llama_13b | Fail | Fail |

Table 4: The mean inference (response generation) times with standard errors across 3 independent runs for 16 different models.
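The failures in Table 4 are consistent with a simple memory estimate: at 16-bit precision each parameter takes 2 bytes, so weights alone need roughly 2 GB per billion parameters, before activations, the KV cache, and framework overhead. The snippet below is an illustrative calculation using four of the benchmarked model sizes.

```python
# fp16 weight memory: 2 bytes per parameter, before activations,
# KV cache, and framework overhead.
for name, billions in [("dlite-v2-1_5b", 1.5), ("falcon-7b", 7),
                       ("open_llama_13b", 13), ("flan-ul2", 20)]:
    print(f"{name}: ~{billions * 2:.0f} GB of fp16 weights")
# With ~25 GB of the 40 GB card already allocated in the Table 2
# snapshot, the 13B and 20B weight footprints cannot fit, which is
# consistent with the failures in Table 4.
```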

Accuracy Benchmarking

Table 5 shows the responses from the 16 models to the prompt, “Generative AI is “, along with our accuracy assessments. We set the output cut-off to 50 tokens. We see acceptable results from models such as OpenAssistant/oasst-sft-1-pythia-12b, tiiuae/falcon-7b, h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3, RedPajama-INCITE-Base-3B-v1, and RedPajama-INCITE-7B-Base. The remaining models either hallucinate (produce irrelevant responses) or produce only partially accurate results. In general, models with larger parameter counts perform better. In that context, RedPajama-INCITE-Base-3B-v1, RedPajama-INCITE-7B-Base, tiiuae/falcon-7b, and h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 perform impressively even though they have sub-10B parameter counts. Although mosaicml/mpt-7b has the highest response time, it does not exhibit good accuracy.

| Model | Response to the Prompt: “Generative AI is “ | Accuracy Estimate |
|---|---|---|
| google/flan-ul2 | Fail | Fail |
| cerebras/Cerebras-GPT-13B | Fail | Fail |
| cerebras/Cerebras-GPT-6.7B | Generative AI is \na field that’s been around for a while, but it’s only recently that it has been really taken seriously. It’s a field that is, in many | Partially accurate but not precise |
| OpenAssistant/oasst-sft-1-pythia-12b | Generative AI is \na subfield of machine learning that uses algorithms to generate new data based on patterns found in existing data. It is a type of artificial intelligence that is designed to create novel and diverse outputs that are not explicitly programmed. | Excellent |
| EleutherAI/pythia-12b | Generative AI is \nthe idea that we can create a computer program that can learn to do things that ive never seen before.\nSo, for example, if you give it a bunch of pictures of cats, it can start to learn | Vague |
| EleutherAI/gpt-j-6b | Generative AI is \na field of computer science that is concerned with the creation of intelligent machines.\n\nGeneration of AI\nThe generation of artificial intelligence is the process of creating a computer program that can perform tasks that are similar to those | Poor |
| databricks/dolly-v2-12b | Fail | Fail |
| aisquared/dlite-v2-1_5b | Generative AI is \n\nA type of AI that can generate its own solutions to problems.\nIt is a form of artificial intelligence that uses a large amount of data and a lot of training data to train a model to a solution. It | Not precise |
| EleutherAI/gpt-j-6b | Generative AI is \na field of computer science that is concerned with the creation of intelligent machines.\n\nGeneration of AI\nThe generation of artificial intelligence is the process of creating a computer program that can perform tasks that are similar to those | Poor |
| mosaicml/mpt-7b | Generative AI is icing on the cake for the future of work\nBy: David Cearley\nThe future is here. It’s just not evenly distributed.\nIn the past few years, we’ve seen the rise of the | Hallucination |
| RedPajama-INCITE-Base-3B-v1 | Generative AI is \nthe ability to generate new data from a given dataset.\n\n\\subsection{Generating Data}\nGenerators are the core component of generative models. \n\n\n\nA generator is a function that takes a | Very good |
| RedPajama-INCITE-7B-Base | Generative AI is \ngenerating new content, and it’s not just text.\n\n## Generative Art\nGenerating art is a popular use case for generative models. \xa0The most popular generatives for art are | Very good |
| tiiuae/falcon-7b | Generative AI is “the ability of a computer program to automatically generate new data, such as text, images, or videos, that are similar to existing data.”\nThe term ‘generative’ refers to the ability to create new content | Excellent |
| h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 | Generative AI is “a type of artificial intelligence that is capable of generating new data, rather than just processing and analyzing existing data.”\nIn other words, it’s a type AI that can create new information, not just process and analyze | Excellent |
| openlm-research/open_llama_7b | Generative AI is 100% free to use.\nThe best part is that you can use it to create your own custom avatars, and even use them in your games. You can also use the avatar generator to make | Not precise |
| openlm-research/open_llama_13b | Fail | Fail |

Table 5: The responses from 16 different models to the prompt, “Generative AI is “, with accuracy assessments.

Conclusion

In this article, we demonstrated how the Nutanix Cloud Platform can be used for inference benchmarking of open source LLMs with the Apache 2.0 license, covering 16 different models. For reproducibility, we are releasing the underlying code: Git Repo

Moving Forward

This is the seventh blog in our series on the AI readiness of the Nutanix platform. The previous blogs can be found here: Link. In the future, we plan to publish articles on foundation models, RLHF, and instruction tuning.

Acknowledgement

The authors would like to acknowledge the contribution of Johnu George, Staff Engineer at Nutanix. 

Footnotes

  1.  an unofficial estimate
  2. A third-party user would need to interface with Prism Central to log into Prism Element running on a cluster with an appropriate role-based access control (RBAC) credential.

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.