Build Private LLM Apps on Nutanix Cloud Platform

By Rajat Ghosh, Staff Data Scientist and Debojyoti Dutta, VP of Engineering, Nutanix
August 3, 2023
7:00 am
Terms Of Use

Summary

Large Language Models (LLMs) based on the transformer architecture (for example, GPT, T5, and BERT), have achieved state-of-the-art (SOTA) results in various Natural Language Processing (NLP) tasks, especially for text generation and conversational bots. The real benefits of LLMs can be tapped only when they can be translated into applications which generate end-user values. LLM application development involves LLM service, data connectors, raw data to prompt converter template, output parsing, and operationalizing auxiliary services such as vector databases, message queue, and others. As applications become more heterogeneous, an LLM application becomes prohibitively complex with multiple moving pieces. Another design challenge comes from data security with many LLM apps depending on third-party API services from organizations such as OpenAI, Hugging Face. In this scenario, zero-trust organizations such as banks, hospitals, and similar organizations demand security posture hardening by running the LLM models private and eliminating the need for the third-party API services. In this context, LangChain is a prevalent orchestration platform to glue together different application components.

In this article, we present how a private LLM application can be developed using LangChain on Nutanix Cloud Platform (NCP), which is optimized for easy manageability and resiliency with state-of-the-art AI infrastructure including high-performance GPU, fast NVMe, and storage scaling.

Benefits of Private LLM Applications

Some of the reasons why you may need to run your model locally, and not use an external API server, include: Local model hosting is preferable over sending POST request to an external API server for the following reasons:

Security: As you send data to external API servers, it compromises the enterprise attack services, increasing the risk of the man-in-the-middle (MITM) attack.
Cost: As you send data to an external API, there is a cost associated with data transfer.
Performance: The risk of downtime due to traffic congestion, configuration, and slow down due to long-tail latency increase significantly as you introduce an external API in your critical application component.

LangChain

The burgeoning LLM ecosystem comes with three specialized domains: first, foundation models and related algorithmic techniques such as instruction-tuning, PEFT, QLoRA, quantization, in-context learning, reinforcement learning on human feedback; second, infrastructure resource related elements such as compute accelerators, MLflow, deepspeed, and application development platforms such as LangChain, LlamaIndex. LangChain is an orchestration framework for developing applications powered by language models. It is claimed to be data-aware, meaning they allow an LLM model to interact with data sources, and agentic, meaning they allow an LLM model to interact with its environment. It comes with modules such as model I/O, data connection, chains, agents, memory and callbacks.

Model I/O

LangChain provides the constituents to interface with any language model. The key constituents are: prompts to templatize, dynamically select, and manage model inputs; model to invoke the LLM service; and output parser to format model output. Figure 1 shows how different Model I/O components interact.

Data Connection

Many LLM applications need user-specific data that is not part of the model’s training set. LangChain offers an ETL pipeline to load, transform, store and query your data with its Document loaders, Document transformers, Text embedding models, Vector stores, and Retrievers.

Chain

Chains enable users to combine multiple components together to compose a single, coherent application. For example, a chain takes user input, formats it with a PromptTemplate, and then passes the formatted response to an LLM. More complex chains can combine multiple chains together. The foundational pieces for LangChain are LLM, router, sequential, and transformation.

Memory

Most LLM applications are centered around a conversational interface. It demands reference to information introduced earlier in a conversation. At its core, a conversational interface needs access to some window of previous messages. A more complex system needs a world model which allows it to do things like maintain information about entities and their relationships. The ability to reference past information is called memory. A memory system requires basic capabilities such as reading and writing.

Agent

Agent can be dubbed as an automated chain that uses an LLM to choose a sequence of actions. In chains, a sequence of actions is hard-coded. In agents, an LLM is used as a reasoning engine to determine which actions to take and in which order.

Callbacks

A callbacks system allows hooking into the various events of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.

Nutanix Cloud Platform

With Nutanix Cloud Platform, Nutanix delivers the simplicity and agility of a public cloud alongside the performance, security, and control of a private cloud. At Nutanix, we are dedicated to enabling customers with the ability to build and deploy intelligent applications anywhere—edge, core data centers, service provider infrastructure, and public clouds. We offer a zero-touch AI infrastructure which reduces the infrastructure configuration and maintenance burden on machine learning scientists and data engineers. Prism Element (PE) is a service built into the platform for every Nutanix cluster deployed. Prism Element enables a user to fully configure, manage, and monitor Nutanix clusters running any hypervisor. Therefore, the first step of the Nutanix infrastructure setup is to log into a Prism Element, as shown in Figure 3.

Log into Prism Element on a Cluster (UI shown in Figure 2)

Figure 2: The UI showing the setup for a Prism Element on which the transformer model for this article was trained. It shows the hypervisor summary (AHV), storage summary, VM summary, hardware summary, monitoring for cluster-wide controller IOPS, monitoring for cluster-wide controller I/O bandwidth, monitoring for cluster-wide controller latency, cluster CPU usage, cluster memory usage, granular health indicators, and data resiliency status.

Set up the VM

After logging into Prism Element, we create a VM hosted on our Nutanix AHV cluster. As shown in Figure 4, the VM has following resource configuration settings: 22.04 Ubuntu operating system, 16 single core vCPUs, 64 GB of RAM, and NVIDIA A100 Tensor Core passthrough GPU with 80GB memory. The GPU is installed with the NVIDIA RTX 15.0 driver for Ubuntu OS (NVIDIA-Linux-x86_64-525.60.13-grid.run). The large deep learning models with transformer architecture require GPU or other compute accelerators with high memory bandwidth, large registers and L1 memory.

Figure 3: The VM resource configuration UI pane on Nutanix Prism Element. As shown, it helps a user configure the number of vCPU(s), the number of cores per vCPUs, memory size (GiB), and GPU choice. We used an NVIDIA A100 80G for this work.

Underlying A100 GPU

NVIDIA A100 Tensor Core GPU is designed to power the world’s highest-performing elastic data centers for AI, data analytics, and HPC. Powered by the NVIDIA Ampere Architecture, A100 is the engine of the NVIDIA data center platform. A100 provides up to 20X higher performance over the prior generation and can be partitioned into seven GPU instances to dynamically adjust to shifting demands. To peek into the detailed features of A100 GPU, we run `nvidia-smi` command which is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices. The output of the `nvidia-smi` command is shown in Figure 5. It shows the Driver Version to be 515.86.01 and CUDA version to be 11.7.

Figure 4: The key features of the A100 40 GB GPU used

Table 1 shows several key features of the A100 GPU we used.

Feature	Value	Description
GPU	0	GPU Index
Name	NVIDIA A100	GPU Name
Temp	35 C	Core GPU Temperature
Perf	P0	GPU Performance
Persistence-M	On	Persistence Mode
Pwr: Usage/Cap	65 W / 300 W	GPU Power Usage and it capability
Bus Id	00000000:00:06.0	domain:bus:device.function
Disp. A	Off	Display Active
Memory-Usage	18884MiB / 40960MiB	Memory allocation out of total memory
Volatile Uncorr. ECC	0	Counter of uncorrectable ECC memory error
GPU-Util	0%	GPU Utilization
Compute M.	Default	Compute Mode
MIG M.	Disabled	Multi-Instance Mode

Table 1: Description of the key features of the underlying A100 GPU

Implementation of Private LLM

We ran an LLM service on an instance of the Nutanix Cloud Platform as described in the previous section. The LLM service features a Question-Answering API. It has been tested with ten prompts and queries, as shown in Table 2. The first three prompts are of type Knowledge Retrieval and the remaining seven questions are of type logical reasoning, cause and effect, analogical reasoning, inductive reasoning, deductive reasoning, counterfactual reasoning, and in context, respectively.

Index	Prompt	Type
1	What is the capital of the United States?	Knowledge Retrieval
2	What is the capital of India?	Knowledge Retrieval
3	What is the capital of China?	Knowledge Retrieval
4	What is the next number in the sequence: 1, 1, 2, 3, 5, 8, …? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail?	Logical Reasoning
5	If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer?	Cause and Effect
6	In the same way pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to?	Analogical Reasoning
7	Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies?	Inductive Reasoning
8	All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella?	Deductive Reasoning
9	If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb?	Counterfactual Reasoning
10	The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall?	In context

Table 2: Ten different test prompts and their types.

Application Components

It uses three Python packages: transformers, langchain, and torch.

From transformers (developed by HuggingFace) package, it uses three modules: AutoTokenizer, AutoModelForSeq2SeqLM, Pipeline

Hf AutoTokenizer: This is a generic tokenizer class that will be instantiated as one of the tokenizer classes of the library when created with the AutoTokenizer.from_pretrained() class method. The relevant statement for this instantiation:

tokenizer = AutoTokenizer.from_pretrained(model_id)

Hf AutoModelForSeq2SeqLM: This is a generic model class that will be instantiated as one of the model classes of the library—with a sequence-to-sequence language modeling head—when created with the when created with the from_pretrained() class method or the from_config() class method. The relevant statement for this instantiation:

model = AutoModelForSeq2SeqM.from_pretrained(model_id)

Hf Pipeline: Pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. It comes in two flavors: first, pipeline() which is the most powerful object encapsulating all other pipelines. Second, domain-specific pipelines are available for audio, computer vision, natural language processing, and multimodal tasks. The relevant statement in the implementation:

pipe = pipeline(“text2textgeneration”, model=model, tokenizer=tokenizer, max_length=512)

From the LangChain package, the application uses the HuggingFacePipeline class from llms interface. HuggingFacePipeline facilitates local deployment of HuggingFace models. The implementation can be found in: Link

Pre-trained Models Used

The application was developed with five different pre-trained models based on T5 transformer architecture class and their parameter counts are shown in Table 3. The smallest model, google/flan-t5-small has 80M parameters, while the largest model google/flan-t5-xxl has 11B parameters. These models use different mixtures of datasets such as CoT, Muffin, T0-SF, NIV2.

Model Name	Parameter Size
google/flan-t5-small	80 M
google/flan-t5-base	250 M
google/flan-t5-large	780 M
google/flan-t5-xl	3 B
google/flan-t5-xxl	11 B

Table 3: Five different pre-trained models used in this article and their respective parameter count

Workflow Sequence

This application uses pre-train models directly without any finetuning. It uses a four step sequence for the answer generation from a question, as shown in Figure 5.

Table 4 shows tokenizer loading times for five different models. Apparently, all models have comparable tokenizer loading times because all models use a similar tokenizer class.

Figure 5: Workflow sequence for the LLM based QA application

Model Name	Tokenizer Loading Time (s)
google/flan-t5-small	0.26
google/flan-t5-base	0.14
google/flan-t5-large	0.14
google/flan-t5-xl	0.14
google/flan-t5-xxl	0.15

Table 4: Five different pre-trained models used in this article and their respective tokenizer loading times

Table 5 shows model loading times for five different models. Apparently, the models with more parameters take more time to load.

Model Name	Model Loading Time (s)	Parameter Count
google/flan-t5-small	1.3	80 M
google/flan-t5-base	3.5	250 M
google/flan-t5-large	9.8	780 M
google/flan-t5-xl	27.0	3 B
google/flan-t5-xxl	93.3	11 B

Table 5: Five different pre-trained models used in this article and their respective model loading times

Table 6 shows pipeline loading times for five different models. Apparently, all models have comparable tokenizer loading times because all models use a similar pipeline class.

Model Name	Pipeline Loading Time (s)
google/flan-t5-small	0.05
google/flan-t5-base	0.02
google/flan-t5-large	0.05
google/flan-t5-xl	0.19
google/flan-t5-xxl	0.62

Table 6: Five different pre-trained models used in this article and their respective pipeline loading time

Effectiveness of Different Models in QA Tasks

Tables 7-9 shows the performance for five different models for three QA tasks of knowledge retrieval type. In general, the accuracy score increases with more complex models such as google/flan-t5-xxl, while the answer generation time also increases with the model complexity.

What is the capital of the United States of America? (Knowledge Retrieval)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	San Francisco	San Francisco	washington	washington dc	washington dc
Accuracy Estimate	0	0	0.7	1.0	1.0
Generation Time (s)	0.06	0.12	0.16	0.72	2.85

Table 7: Responses from five different models on a knowledge retrieval question

What is the capital of India? (Knowledge Retrieval)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	sarajevo	chennai	calcutta	bhopal	new delhi
Accuracy Estimate	0	0	0	0	1.0
Generation Time (s)	0.05	0.1	0.26	0.62	2.49

Table 8: Responses from five different models on a knowledge retrieval question

What is the capital of China? (Knowledge Retrieval)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	shanghai	qingdao	beijing	beijing	beijing
Accuracy Estimate	0	0	1	1	1
Generation Time (s)	0.05	0.12	0.19	0.52	2.00

Table 9: Responses from five different models on a knowledge retrieval question

Table 10 shows the performance for five different models for a QA task of logical reasoning type. In general, the accuracy remains high for all four models except google/flan-f5-small, while the answer generation time increases steadily with the model complexity. It indicates that it might not be a good idea to use a too complex model.

What is the next number in the sequence: 1, 1, 2, 3, 5, 8, …? If all cats have tails, and Fluffy is a cat, does Fluffy have a tail? (Logical Reasoning)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	1	yes	yes	yes	yes
Accuracy Estimate	0	1	1	1	1
Generation Time (s)	0.03	0.06	0.16	0.45	1.47

Table 10: Responses from five different models on logical reasoning question

Table 11 shows the performance for five different models for a QA task of cause and effect type. The accuracy score is high for google/flan-t5-xl and google/flan-t5-xxl models. Surprisingly, google/flan-t5-base produces an accurate answer. In general, the answer generation time increases steadily with the model complexity.

If you eat too much junk food, what will happen to your health? How does smoking affect the risk of lung cancer? (Cause and Effect)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	if you eat too much junk food, you will have a higher risk of lung cancer	Smoking increases the risk of lung cancer.	no	It increases the risk of developing lung cancer.	Smoking increases the risk of lung cancer.
Accuracy Estimate	0	1	0	1	1
Generation Time (s)	0.16	0.17	0.13	1.15	3.6

Table 11: Responses from five different models on cause and effect question

Table 12 shows the performance for five different models for a QA task of analogical reasoning type. The accuracy score is high for google/flan-t5-xl and google/flan-t5-xxl models. In general, the answer generation time increases steadily with the model complexity.

In the same way that pen is related to paper, what is fork related to? If tree is related to forest, what is brick related to? (Analogical Reasoning)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	wood	tree	brick is related to brick	building	building
Accuracy Estimate	0	0	0	1	1
Generation Time (s)	0.03	0.24	0.26	0.46	1.33

Table 12: Responses from five different models on analogical reasoning question

Table 13 shows the performance for five different models for a QA task of inductive reasoning type. All five models produce accurate answers. In general, the answer generation time increases steadily with the model complexity. From these observations, we can empirically conclude that it might not make sense to use too complex models.

Every time John eats peanuts, he gets a rash. Does John have a peanut allergy? Every time Sarah studies for a test, she gets an A. Will Sarah get an A on the next test if she studies? (Inductive Reasoning)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	yes	yes	yes	yes	yes
Accuracy Estimate	1	1	1	1	1
Generation Time (s)	0.03	0.06	0.15	0.48	1.52

Table 13: Responses from five different models on inductive reasoning question

Table 14 shows the performance for five different models for a QA task of deductive reasoning type. All models except 80M parameter google/flan-t5-small produce accurate answers. In general, the answer generation time increases steadily with the model complexity. From these observations, we can empirically conclude that it might not make sense to use too complex models.

All dogs have fur. Max is a dog. Does Max have fur? If it is raining outside, and Mary does not like to get wet, will Mary take an umbrella? (Deductive Reasoning)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	no	Mary will take an umbrella.	yes	yes	yes
Accuracy Estimate	0	1	1	1	1
Generation Time (s)	0.03	0.13	0.14	0.45	1.4

Table 14: Responses from five different models on deductive reasoning question

Table 15 shows the performance for five different models for a QA task of counterfactual reasoning type. The accuracy assessment was somewhat subjective depending on the nature of the question. In general, the larger models perform better. Also, the answer generation time increases steadily with the model complexity. From these observations, we can empirically conclude that it might be safer to use complex models.

If I had studied harder, would I have passed the exam? What would have happened if Thomas Edison had not invented the light bulb? (Counterfactual Reasoning)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	no	The light bulb would have been invented by Thomas Edison.	no one would have invented the light bulb	the world would be dark	the world would be a darker place
Accuracy Estimate	0	0	0.2	0.2	0.2
Generation Time (s)	0.02	0.2	0.37	0.79	3.43

Table 15: Responses from five different models on counterfactual reasoning question

Table 16 shows the performance for five different models for a QA task of counterfactual reasoning type. The accuracy assessment was somewhat subjective depending on the nature of the question. In general, the larger models perform better. Also, the answer generation time increases steadily with the model complexity. From these observations, we can empirically conclude that it might be safer to use complex models.

The center of Tropical Storm Arlene, at 02/1800 UTC, is near 26.7N 86.2W. This position is about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall? (In Context)
	google/flan-t5-small	google/flan-t5-base	google/flan-t5-large	google/flan-t5-xl	google/flan-t5-xxl
Answer	between 80W and 84W	2/1800 UTC	about 425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba	Fort Myers in Florida	425 km/230 nm to the west of Fort Myers in Florida, and it is about 550 km/297 nm to the NNW of the western tip of Cuba. The tropical storm is moving southward, or 175 degrees, 4 knots. The estimated minimum central pressure is 1002 mb. The maximum sustained wind speeds are 35 knots with gusts to 45 knots. The sea heights that are close to the tropical storm are ranging from 6 feet to a maximum of 10 feet. Precipitation: scattered to numerous moderate is within 180 nm of the center in the NE quadrant. Isolated moderate is from 25N to 27N between 80W and 84W, including parts of south Florida. Broad surface low pressure extends from the area of the tropical storm, through the Yucatan Channel, into the NW part of the Caribbean Sea. Where and when will the storm make landfall?
Accuracy Estimate	0.1	0.1	0.2	0.2	0.6
Generation Time (s)	0.08	0.16	1.85	1.25	68.34

Table 16: Responses from five different models on in context question

Model Choice Recommendation

The effectiveness assessment of five different models has empirically led us to a model choice policy depending on the type of QA type. In general, the computation time increases across all model types. The question is whether that extra computation leads to superior accuracy scores? From Table 7-16, we can conclude that whether the accuracy improves with larger models is subject to QA type.

QA Type	Recommendation
Knowledge Retrieval	Accuracy improves with model complexity. Hence, use a larger model.
Logical Reasoning	Accuracy does not improve with model complexity. Hence, use a mid-size model.
Cause and Effect	Accuracy improves with model complexity. Hence, use a larger model.
Analogical Reasoning	Accuracy improves with model complexity. Hence, use a larger model.
Inductive Reasoning	Accuracy does not improve with model complexity. Hence, use a mid-size model.
Deductive Reasoning	Accuracy does not improve with model complexity. Hence, use a mid-size model.
Counterfactual Reasoning	Accuracy improves with model complexity. Hence, use a larger model.
In context	Accuracy improves with model complexity. Hence, use a larger model.

Conclusion

In this article, we demonstrate how we can use the Nutanix Cloud Platform to use LangChain and build a private LLM application. It discusses the implementation of LangChain development. For reproducibility, we are releasing the underlying code: Git Repo. We also cover LLM benchmarking in the context of the evaluation of our QA app. We evaluate the QA app with 10 different prompts covering seven different logical patterns for five architectures of flan-T5 models. We also come up with an empirical recommendation policy for model selection depending on the QA type.

Moving Forward

This is the sixth blog on the topic of AI readiness for Nutanix. The previous blogs can be found here: Link. In the future, we are planning to publish articles on foundation models, RLHF, and instruction tuning.

Acknowledgement

The authors would like to acknowledge the contribution of Johnu George, Staff Engineer at Nutanix.

ai/ml, aos, ncp

© 2024 Nutanix, Inc. All rights reserved. Nutanix, the Nutanix logo and all Nutanix product, feature and service names mentioned herein are registered trademarks or trademarks of Nutanix, Inc. in the United States and other countries. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective holder(s). This post may contain links to external websites that are not part of Nutanix.com. Nutanix does not control these sites and disclaims all responsibility for the content or accuracy of any external site. Our decision to link to an external site should not be considered an endorsement of any content on such a site. Certain information contained in this post may relate to or be based on studies, publications, surveys and other data obtained from third-party sources and our own internal estimates and research. While we believe these third-party studies, publications, surveys and other data are reliable as of the date of this post, they have not independently verified, and we make no representation as to the adequacy, fairness, accuracy, or completeness of any information obtained from third-party sources.

Nutanix.dev