Build Private LLM Apps on Nutanix Cloud Platform

Summary 

Large Language Models (LLMs) based on the transformer architecture (for example, GPT, T5, and BERT) have achieved state-of-the-art (SOTA) results in various Natural Language Processing (NLP) tasks, especially text generation and conversational bots. The real benefits of LLMs are tapped only when they are translated into applications that generate end-user value. LLM application development involves an LLM service, data connectors, templates that convert raw data into prompts, output parsing, and the operationalization of auxiliary services such as vector databases and message queues. As applications become more heterogeneous, an LLM application can become prohibitively complex, with multiple moving pieces. Another design challenge comes from data security: many LLM apps depend on third-party API services from organizations such as OpenAI and Hugging Face. Zero-trust organizations such as banks and hospitals demand a hardened security posture, which means running the LLM models privately and eliminating the need for third-party API services. In this context, LangChain is a prevalent orchestration platform for gluing together the different application components.

In this article, we present how a private LLM application can be developed using LangChain on the Nutanix Cloud Platform (NCP), which is optimized for easy manageability and resiliency with state-of-the-art AI infrastructure, including high-performance GPUs, fast NVMe drives, and scalable storage.

Benefits of Private LLM Applications

Local model hosting is preferable to sending POST requests to an external API server for the following reasons:

  • Security: Sending data to external API servers expands the enterprise attack surface, increasing the risk of man-in-the-middle (MITM) attacks. 
  • Cost: Sending data to an external API incurs data transfer costs. 
  • Performance: The risks of downtime due to traffic congestion, misconfiguration, and slowdowns caused by long-tail latency increase significantly when you introduce an external API into a critical application component. 

LangChain

The burgeoning LLM ecosystem spans three specialized domains: first, foundation models and related algorithmic techniques such as instruction tuning, PEFT, QLoRA, quantization, in-context learning, and reinforcement learning from human feedback; second, infrastructure elements such as compute accelerators, MLflow, and DeepSpeed; and third, application development platforms such as LangChain and LlamaIndex. LangChain is an orchestration framework for developing applications powered by language models. It is claimed to be data-aware, meaning it allows an LLM to interact with data sources, and agentic, meaning it allows an LLM to interact with its environment. It comes with modules such as model I/O, data connection, chains, agents, memory, and callbacks.

Model I/O

LangChain provides the constituents to interface with any language model. The key constituents are: prompts to templatize, dynamically select, and manage model inputs; the model wrapper to invoke the LLM service; and output parsers to format model output. Figure 1 shows how the different Model I/O components interact.

Figure 1: Core Structure Component of LangChain Model I/O
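
As a concrete illustration, here is a minimal sketch of these three pieces working together, assuming the LangChain 0.0.x API; the FakeListLLM stand-in keeps the sketch self-contained, and the Implementation section later in this article builds the real local model in its place:

from langchain import PromptTemplate
from langchain.llms.fake import FakeListLLM
from langchain.output_parsers import CommaSeparatedListOutputParser

# Output parser: turns raw model text into a structured Python list.
output_parser = CommaSeparatedListOutputParser()

# Prompt: templatizes and manages the model input.
prompt = PromptTemplate(
    template="List three {subject}.\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)

# Model: a stand-in here; swap in a real LLM wrapper for actual generations.
llm = FakeListLLM(responses=["Washington, New Delhi, Beijing"])

raw_output = llm(prompt.format(subject="national capitals"))
print(output_parser.parse(raw_output))  # ['Washington', 'New Delhi', 'Beijing']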

Data Connection

Many LLM applications need user-specific data that is not part of the model’s training set. LangChain offers an ETL pipeline to load, transform, store, and query your data with its Document loaders, Document transformers, Text embedding models, Vector stores, and Retrievers.
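
A minimal sketch of this pipeline, assuming langchain 0.0.x with the sentence-transformers and faiss-cpu packages installed; the file name and query are hypothetical examples:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Document loader: read the raw user-specific data.
docs = TextLoader("private_notes.txt").load()

# Document transformer: split into chunks that fit the model context window.
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Text embedding model (runs locally) feeding a vector store.
store = FAISS.from_documents(chunks, HuggingFaceEmbeddings())

# Retriever: fetch the chunks most relevant to a query.
retriever = store.as_retriever(search_kwargs={"k": 3})
relevant_docs = retriever.get_relevant_documents("What does the note say about storage?")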

Chain 

Chains enable users to combine multiple components into a single, coherent application. For example, a chain can take user input, format it with a PromptTemplate, and then pass the formatted response to an LLM. More complex chains can combine multiple chains together, as sketched below. The foundational chain types in LangChain are the LLM, router, sequential, and transformation chains.
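
Here is a hedged sketch of two LLMChains composed with a SimpleSequentialChain, assuming the LangChain 0.0.x API; the FakeListLLM stand-in and the prompt wordings are illustrative, and a real deployment would use the HuggingFacePipeline LLM built in the Implementation section:

from langchain import LLMChain, PromptTemplate
from langchain.chains import SimpleSequentialChain
from langchain.llms.fake import FakeListLLM

# Stand-in model so the sketch runs anywhere.
llm = FakeListLLM(responses=["Beijing is the capital of China.", "Beijing."])

answer_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Answer briefly: {question}"))
rephrase_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Rewrite this answer in one sentence: {answer}"))

# The output of the first chain becomes the input of the second.
overall_chain = SimpleSequentialChain(chains=[answer_chain, rephrase_chain])
print(overall_chain.run("What is the capital of China?"))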

Memory

Most LLM applications are centered around a conversational interface, which demands references to information introduced earlier in the conversation. At its core, a conversational interface needs access to some window of previous messages. A more complex system needs a world model that allows it to maintain information about entities and their relationships. The ability to reference past information is called memory. A memory system requires two basic capabilities: reading and writing.
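
A minimal sketch using ConversationBufferMemory, the simplest LangChain memory, which keeps the raw window of previous messages (LangChain 0.0.x API; the FakeListLLM responses are scripted stand-ins):

from langchain.chains import ConversationChain
from langchain.llms.fake import FakeListLLM
from langchain.memory import ConversationBufferMemory

llm = FakeListLLM(responses=["Nice to meet you, Alice.", "Your name is Alice."])  # stand-in model

conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())
conversation.predict(input="My name is Alice.")
# The first turn is read back from memory on the second call, so the model
# can resolve the reference to earlier information:
print(conversation.predict(input="What is my name?"))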

Agent

An agent can be thought of as an automated chain that uses an LLM to choose a sequence of actions. In chains, the sequence of actions is hard-coded. In agents, an LLM is used as a reasoning engine to determine which actions to take and in which order.
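
A minimal sketch, assuming the LangChain 0.0.x agent API; the word-counting tool is a hypothetical example, and the FakeListLLM emits pre-scripted ReAct steps so the sketch runs anywhere (a real deployment would need an LLM capable of driving the reasoning loop):

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms.fake import FakeListLLM

# A toy tool the agent can decide to call.
tools = [
    Tool(
        name="WordCounter",
        func=lambda text: str(len(text.split())),
        description="Counts the number of words in the given text.",
    )
]

# Stand-in model that emits scripted "Action / Action Input" steps.
llm = FakeListLLM(responses=[
    "Action: WordCounter\nAction Input: the quick brown fox",
    "Final Answer: 4",
])

agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
agent.run("How many words are in the sentence 'the quick brown fox'?")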

Callbacks

A callback system allows hooking into the various events of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.
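
For instance, a hedged sketch that attaches the built-in StdOutCallbackHandler to a chain run so that chain start and end events are logged to stdout (LangChain 0.0.x API; the stand-in model and prompt are illustrative):

from langchain import LLMChain, PromptTemplate
from langchain.callbacks import StdOutCallbackHandler
from langchain.llms.fake import FakeListLLM

llm = FakeListLLM(responses=["new delhi"])  # stand-in model
chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Question: {question}\nAnswer:"))

# The handler receives chain start/end events and prints them as the run progresses.
handler = StdOutCallbackHandler()
chain.run("What is the capital of India?", callbacks=[handler])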

Nutanix Cloud Platform

With the Nutanix Cloud Platform, Nutanix delivers the simplicity and agility of a public cloud alongside the performance, security, and control of a private cloud. At Nutanix, we are dedicated to enabling customers to build and deploy intelligent applications anywhere: edge, core data centers, service provider infrastructure, and public clouds. We offer a zero-touch AI infrastructure that reduces the infrastructure configuration and maintenance burden on machine learning scientists and data engineers. Prism Element (PE) is a service built into the platform for every Nutanix cluster deployed. Prism Element enables a user to fully configure, manage, and monitor Nutanix clusters running any hypervisor. Therefore, the first step of the Nutanix infrastructure setup is to log into Prism Element, as shown in Figure 2.

Figure 2: The UI showing the setup of the Prism Element on which the transformer models for this article were run. It shows the hypervisor summary (AHV), storage summary, VM summary, hardware summary, monitoring for cluster-wide controller IOPS, monitoring for cluster-wide controller I/O bandwidth, monitoring for cluster-wide controller latency, cluster CPU usage, cluster memory usage, granular health indicators, and data resiliency status.

Set up the VM

After logging into Prism Element, we create a VM hosted on our Nutanix AHV cluster. As shown in Figure 3, the VM has the following resource configuration: Ubuntu 22.04 operating system, 16 single-core vCPUs, 64 GB of RAM, and an NVIDIA A100 Tensor Core passthrough GPU with 80 GB of memory. The GPU is installed with the NVIDIA RTX 15.0 driver for Ubuntu (NVIDIA-Linux-x86_64-525.60.13-grid.run). Large deep learning models with the transformer architecture require GPUs or other compute accelerators with high memory bandwidth, large register files, and ample L1 memory.

Figure 3: The VM resource configuration UI pane in Nutanix Prism Element. As shown, it lets a user configure the number of vCPUs, the number of cores per vCPU, the memory size (GiB), and the GPU choice. We used an NVIDIA A100 80G for this work.

Underlying A100 GPU 

The NVIDIA A100 Tensor Core GPU is designed to power the world’s highest-performing elastic data centers for AI, data analytics, and HPC. Powered by the NVIDIA Ampere architecture, the A100 is the engine of the NVIDIA data center platform. The A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands. To peek into the detailed features of the A100 GPU, we run the `nvidia-smi` command, a command-line utility built on top of the NVIDIA Management Library (NVML) and intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 4. It shows the driver version to be 515.86.01 and the CUDA version to be 11.7.

Figure 4: The key features of the A100 40 GB GPU used
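
The same device visibility can also be verified from Python before loading any model; a minimal sketch, assuming torch was installed with CUDA support:

import torch

# Confirm the passthrough GPU is visible to the deep learning framework.
print(torch.cuda.is_available())           # True if the GPU is usable
print(torch.cuda.get_device_name(0))       # GPU name string
props = torch.cuda.get_device_properties(0)
print(f"{props.total_memory / 1024**3:.1f} GiB total GPU memory")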

Table 1 shows several key features of the A100 GPU we used. 

Feature | Value | Description
GPU | 0 | GPU index
Name | NVIDIA A100 | GPU name
Temp | 35 C | Core GPU temperature
Perf | P0 | GPU performance state
Persistence-M | On | Persistence mode
Pwr: Usage/Cap | 65 W / 300 W | GPU power usage and its cap
Bus-Id | 00000000:00:06.0 | domain:bus:device.function
Disp. A | Off | Display active
Memory-Usage | 18884MiB / 40960MiB | Memory allocated out of total memory
Volatile Uncorr. ECC | 0 | Counter of uncorrectable ECC memory errors
GPU-Util | 0% | GPU utilization
Compute M. | Default | Compute mode
MIG M. | Disabled | Multi-Instance GPU mode
Table 1: Description of the key features of the underlying A100 GPU

Implementation of Private LLM

We ran an LLM service on an instance of the Nutanix Cloud Platform as described in the previous section. The LLM service exposes a question-answering API. We tested it with ten prompts, as shown in Table 2. The first three prompts are of the knowledge retrieval type; the remaining seven cover logical reasoning, cause and effect, analogical reasoning, inductive reasoning, deductive reasoning, counterfactual reasoning, and in-context reasoning, respectively.

Index | Prompt | Type
1 | What is the capital of the United States? | Knowledge Retrieval
2 | What is the capital of India? | Knowledge Retrieval
3 | What is the capital of China? | Knowledge Retrieval
4 | What is the next number in the sequence: 1, 1, 2, 3, 5, 8, …? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail? | Logical Reasoning
5 | If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer? | Cause and Effect
6 | In the same way pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to? | Analogical Reasoning
7 | Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies? | Inductive Reasoning
8 | All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella? | Deductive Reasoning
9 | If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb? | Counterfactual Reasoning
10 | The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall? | In Context
Table 2: Ten different test prompts and their types. 

Application Components

The application uses three Python packages: transformers, langchain, and torch.

From the transformers package (developed by Hugging Face), it uses three modules: AutoTokenizer, AutoModelForSeq2SeqLM, and pipeline.

  • Hf AutoTokenizer: This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the AutoTokenizer.from_pretrained() class method. The relevant statement for this instantiation: 

tokenizer = AutoTokenizer.from_pretrained(model_id)

  • Hf AutoModelForSeq2SeqLM: This is a generic model class that will be instantiated as one of the model classes of the library, with a sequence-to-sequence language modeling head, when created with the from_pretrained() or from_config() class method. The relevant statement for this instantiation:

model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

  • Hf Pipeline: Pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including named entity recognition, masked language modeling, sentiment analysis, feature extraction, and question answering. They come in two flavors: first, pipeline(), the most powerful object, which encapsulates all other pipelines; second, domain-specific pipelines for audio, computer vision, natural language processing, and multimodal tasks. The relevant statement in the implementation:

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=512)

From the LangChain package, the application uses the HuggingFacePipeline class from the llms interface. HuggingFacePipeline facilitates local deployment of Hugging Face models. The implementation can be found in: Link
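
Putting the statements above together, here is a minimal end-to-end sketch of the private QA service; it assumes the transformers, torch, accelerate, and langchain packages, and the prompt wording is illustrative rather than the exact template from the repository:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

model_id = "google/flan-t5-xl"  # any of the Table 3 models

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")  # device_map needs accelerate

# Wrap the local Hugging Face pipeline as a LangChain LLM; no external API calls.
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=512)
llm = HuggingFacePipeline(pipeline=pipe)

qa_chain = LLMChain(llm=llm, prompt=PromptTemplate.from_template("Question: {question}\nAnswer:"))
print(qa_chain.run("What is the capital of the United States?"))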

Pre-trained Models Used

The application was developed with five different pre-trained models from the T5 transformer architecture class; their parameter counts are shown in Table 3. The smallest model, google/flan-t5-small, has 80M parameters, while the largest model, google/flan-t5-xxl, has 11B parameters. These models were fine-tuned on different mixtures of datasets such as CoT, Muffin, T0-SF, and NIV2.

Model Name | Parameter Count
google/flan-t5-small | 80 M
google/flan-t5-base | 250 M
google/flan-t5-large | 780 M
google/flan-t5-xl | 3 B
google/flan-t5-xxl | 11 B
Table 3: Five different pre-trained models used in this article and their respective parameter count

Workflow Sequence

This application uses pre-trained models directly, without any fine-tuning. It uses a four-step sequence to generate an answer from a question, as shown in Figure 5.
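
Tables 4 through 6 report how long each of these steps takes. One plausible way such timings could be collected is sketched below; the timed helper is ours, not from the repository:

import time
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_id = "google/flan-t5-small"  # any of the Table 3 models

def timed(step, fn):
    # Measure the wall-clock time of one workflow step.
    start = time.perf_counter()
    result = fn()
    print(f"{step}: {time.perf_counter() - start:.2f} s")
    return result

tokenizer = timed("tokenizer loading", lambda: AutoTokenizer.from_pretrained(model_id))
model = timed("model loading", lambda: AutoModelForSeq2SeqLM.from_pretrained(model_id))
pipe = timed("pipeline loading", lambda: pipeline("text2text-generation", model=model, tokenizer=tokenizer, max_length=512))
answer = timed("answer generation", lambda: pipe("What is the capital of India?"))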

Figure 5: Workflow sequence for the LLM-based QA application

Table 4 shows the tokenizer loading times for the five models. All models have comparable tokenizer loading times because they use a similar tokenizer class.
Model Name | Tokenizer Loading Time (s)
google/flan-t5-small | 0.26
google/flan-t5-base | 0.14
google/flan-t5-large | 0.14
google/flan-t5-xl | 0.14
google/flan-t5-xxl | 0.15
Table 4: Five different pre-trained models used in this article and their respective tokenizer loading times 

Table 5 shows the model loading times for the five models. As expected, models with more parameters take longer to load.

Model Name | Model Loading Time (s) | Parameter Count
google/flan-t5-small | 1.3 | 80 M
google/flan-t5-base | 3.5 | 250 M
google/flan-t5-large | 9.8 | 780 M
google/flan-t5-xl | 27.0 | 3 B
google/flan-t5-xxl | 93.3 | 11 B
Table 5: Five different pre-trained models used in this article and their respective model loading times

Table 6 shows the pipeline loading times for the five models. All models have comparable pipeline loading times because they use the same pipeline class.

Model Name | Pipeline Loading Time (s)
google/flan-t5-small | 0.05
google/flan-t5-base | 0.02
google/flan-t5-large | 0.05
google/flan-t5-xl | 0.19
google/flan-t5-xxl | 0.62
Table 6: Five different pre-trained models used in this article and their respective pipeline loading time

Effectiveness of Different Models in QA Tasks

Tables 7-9 show the performance of the five models on three knowledge retrieval QA tasks. In general, the accuracy score increases with more complex models such as google/flan-t5-xxl, but the answer generation time also increases with model complexity.

What is the capital of the United States of America? (Knowledge Retrieval)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | San Francisco | 0 | 0.06
google/flan-t5-base | San Francisco | 0 | 0.12
google/flan-t5-large | washington | 0.7 | 0.16
google/flan-t5-xl | washington dc | 1.0 | 0.72
google/flan-t5-xxl | washington dc | 1.0 | 2.85
Table 7: Responses from five different models on a knowledge retrieval question

What is the capital of India? (Knowledge Retrieval)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | sarajevo | 0 | 0.05
google/flan-t5-base | chennai | 0 | 0.1
google/flan-t5-large | calcutta | 0 | 0.26
google/flan-t5-xl | bhopal | 0 | 0.62
google/flan-t5-xxl | new delhi | 1.0 | 2.49
Table 8: Responses from five different models on a knowledge retrieval question

What is the capital of China? (Knowledge Retrieval)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | shanghai | 0 | 0.05
google/flan-t5-base | qingdao | 0 | 0.12
google/flan-t5-large | beijing | 1 | 0.19
google/flan-t5-xl | beijing | 1 | 0.52
google/flan-t5-xxl | beijing | 1 | 2.00
Table 9: Responses from five different models on a knowledge retrieval question

Table 10 shows the performance of the five models on a logical reasoning QA task. The accuracy is high for all models except google/flan-t5-small, while the answer generation time increases steadily with model complexity. This suggests that an overly complex model may be unnecessary for this task type.

What is the next number in the sequence: 1, 1, 2, 3, 5, 8, …? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail? (Logical Reasoning)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | 1 | 0 | 0.03
google/flan-t5-base | yes | 1 | 0.06
google/flan-t5-large | yes | 1 | 0.16
google/flan-t5-xl | yes | 1 | 0.45
google/flan-t5-xxl | yes | 1 | 1.47
Table 10: Responses from five different models on logical reasoning question

Table 11 shows the performance of the five models on a cause-and-effect QA task. The accuracy score is high for the google/flan-t5-xl and google/flan-t5-xxl models. Surprisingly, google/flan-t5-base also produces an accurate answer, while google/flan-t5-large does not. In general, the answer generation time increases steadily with model complexity.

If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer? (Cause and Effect)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | if you eat too much junk food, you will have a higher risk of lung cancer | 0 | 0.16
google/flan-t5-base | Smoking increases the risk of lung cancer. | 1 | 0.17
google/flan-t5-large | no | 0 | 0.13
google/flan-t5-xl | It increases the risk of developing lung cancer. | 1 | 1.15
google/flan-t5-xxl | Smoking increases the risk of lung cancer. | 1 | 3.6
Table 11: Responses from five different models on cause and effect question

Table 12 shows the performance of the five models on an analogical reasoning QA task. The accuracy score is high for the google/flan-t5-xl and google/flan-t5-xxl models. In general, the answer generation time increases steadily with model complexity.

In the same way that pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to? (Analogical Reasoning)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | wood | 0 | 0.03
google/flan-t5-base | tree | 0 | 0.24
google/flan-t5-large | brick is related to brick | 0 | 0.26
google/flan-t5-xl | building | 1 | 0.46
google/flan-t5-xxl | building | 1 | 1.33
Table 12: Responses from five different models on analogical reasoning question

Table 13 shows the performance of the five models on an inductive reasoning QA task. All five models produce accurate answers, while the answer generation time increases steadily with model complexity. Empirically, a highly complex model offers no accuracy benefit for this task type.

Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies? (Inductive Reasoning)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | yes | 1 | 0.03
google/flan-t5-base | yes | 1 | 0.06
google/flan-t5-large | yes | 1 | 0.15
google/flan-t5-xl | yes | 1 | 0.48
google/flan-t5-xxl | yes | 1 | 1.52
Table 13: Responses from five different models on inductive reasoning question

Table 14 shows the performance of the five models on a deductive reasoning QA task. All models except the 80M-parameter google/flan-t5-small produce accurate answers, while the answer generation time increases steadily with model complexity. Again, a highly complex model offers no accuracy benefit for this task type.

All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella? (Deductive Reasoning)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | no | 0 | 0.03
google/flan-t5-base | Mary will take an umbrella. | 1 | 0.13
google/flan-t5-large | yes | 1 | 0.14
google/flan-t5-xl | yes | 1 | 0.45
google/flan-t5-xxl | yes | 1 | 1.4
Table 14: Responses from five different models on deductive reasoning question

Table 15 shows the performance of the five models on a counterfactual reasoning QA task. The accuracy assessment was somewhat subjective given the nature of the question. In general, the larger models perform better, and the answer generation time increases steadily with model complexity. For this task type, the larger models are the safer choice.

If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb? (Counterfactual Reasoning)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | no | 0 | 0.02
google/flan-t5-base | The light bulb would have been invented by Thomas Edison. | 0 | 0.2
google/flan-t5-large | no one would have invented the light bulb | 0.2 | 0.37
google/flan-t5-xl | the world would be dark | 0.2 | 0.79
google/flan-t5-xxl | the world would be a darker place | 0.2 | 3.43
Table 15: Responses from five different models on counterfactual reasoning question

Table 16 shows the performance of the five models on an in-context QA task. The accuracy assessment was somewhat subjective given the nature of the question. In general, the larger models perform better, and the answer generation time increases steadily with model complexity. For this task type, too, the larger models are the safer choice.

The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall? (In Context)

Model | Answer | Accuracy Estimate | Generation Time (s)
google/flan-t5-small | between 80W and 84W | 0.1 | 0.08
google/flan-t5-base | 2/1800 UTC | 0.1 | 0.16
google/flan-t5-large | about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba | 0.2 | 1.85
google/flan-t5-xl | Fort Myers in Florida | 0.2 | 1.25
google/flan-t5-xxl | 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall? | 0.6 | 68.34
Table 16: Responses from five different models on in context question

Model Choice Recommendation

The effectiveness assessment of the five models has empirically led us to a model choice policy that depends on the QA type. Generation time increases with model size for every QA type; the question is whether that extra computation yields superior accuracy. From Tables 7-16, we conclude that whether accuracy improves with larger models depends on the QA type.

QA Type | Recommendation
Knowledge Retrieval | Accuracy improves with model complexity. Hence, use a larger model.
Logical Reasoning | Accuracy does not improve with model complexity. Hence, use a mid-size model.
Cause and Effect | Accuracy improves with model complexity. Hence, use a larger model.
Analogical Reasoning | Accuracy improves with model complexity. Hence, use a larger model.
Inductive Reasoning | Accuracy does not improve with model complexity. Hence, use a mid-size model.
Deductive Reasoning | Accuracy does not improve with model complexity. Hence, use a mid-size model.
Counterfactual Reasoning | Accuracy improves with model complexity. Hence, use a larger model.
In Context | Accuracy improves with model complexity. Hence, use a larger model.

Conclusion

In this article, we demonstrated how to use LangChain on the Nutanix Cloud Platform to build a private LLM application, and discussed the implementation details. For reproducibility, we are releasing the underlying code: Git Repo. We also covered LLM benchmarking in the context of evaluating our QA app: we evaluated it with ten prompts covering eight different QA types across five sizes of Flan-T5 models, and derived an empirical recommendation policy for model selection depending on the QA type.

Moving Forward

This is the sixth blog in our series on AI readiness with Nutanix. The previous blogs can be found here: Link. In the future, we plan to publish articles on foundation models, RLHF, and instruction tuning.

Acknowledgement

The authors would like to acknowledge the contribution of Johnu George, Staff Engineer at Nutanix. 

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not been independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.