Seyana AI: Cost-effective Generative AI apps
Introduction
Generative AI is transforming the world. However, when organizations and businesses embark on this journey, they start with the well-known brands providing turn-key access to advanced state-of-the-art ("SotA") models.
Within a few months, if not quarters, the people overseeing the finances of this spend start to feel rather uncomfortable when they look at the cost of operations.
Seyana AI is happy to share insights into ways of reducing that cost.
Roads not taken to reduce cost
| Common belief | "Road not taken" approach |
|---|---|
| GenAI mandates use of the biggest GenAI models | Free and open-source models - less is more |
| Enterprise GenAI needs can only be met with GPUs | Use CPUs for inference |
| Invoking APIs (online) is the most effective way to innovate | Use batch-mode (offline) inference as much as possible |
| There is no silver bullet to reducing GenAI Op-Ex costs | A methodical approach saves time, money and carbon footprint |
Hands-on tutorial
Let's now take a look at how to apply the "Road Not Taken" strategy.
We will use Python to implement a carbon-effective and cost-effective approach to using Generative AI.
Objective
In this tutorial we walk you through how to use:
- Open-source LLM models for inference
- Just a CPU for inference
- Batch-mode inference
Installing Ollama
The following command sets up a tool called Ollama that helps run LLMs locally (even on a CPU):
!curl https://ollama.ai/install.sh | sh
Upon running the command, you would see output like this:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0>>> Downloading ollama...
100 10975 0 10975 0 0 34245 0 --:--:-- --:--:-- --:--:-- 34296
############################################################################################# 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
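Optionally, you can verify the installation before starting the server. As a quick check (not part of the original walkthrough), assuming the ollama binary is now on your PATH:
!ollama --version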
Initiating Ollama serving
The following code initiates a subprocess to serve Ollama:
import subprocess
process = subprocess.Popen("ollama serve", shell=True)
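Optionally, you can confirm that the server is responding before moving on. The following is a minimal sketch (not part of the original walkthrough), assuming the default endpoint 127.0.0.1:11434 reported in the install output:
import time
import urllib.request

# Give the background server a few seconds to come up.
time.sleep(5)

# A plain GET against the default Ollama endpoint should return a short
# "Ollama is running" message if the server started successfully.
print(urllib.request.urlopen("http://127.0.0.1:11434").read().decode())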
Using an open-source model
Of the many open-source models supported by Ollama, we will use Phi-3 mini, provided to the community by Microsoft.
Using the following command, we can download phi3:mini from Ollama's model library:
!ollama pull phi3:mini
pulling manifest ⠙
...
...
pulling ed7ab7698fdd... 100% ▕▏ 483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
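To confirm the model is now available locally, you can list the downloaded models (an optional check, not part of the original walkthrough):
!ollama list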
Launching the LLM with Ollama
The following command runs the Phi-3 mini model behind the scenes:
# Phi3 mini
!nohup ollama run phi3:mini &
You would now see something like this:
nohup: appending output to 'nohup.out'
At this point the system is running the GenAI model behind the scenes.
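As an optional smoke test (not part of the original walkthrough), you can query the model directly through Ollama's REST API at /api/generate before wiring up any framework:
import json
import urllib.request

# Send a single, non-streaming generation request to the local Ollama API.
payload = json.dumps({
    "model": "phi3:mini",
    "prompt": "Is the sky blue? : Yes or No",
    "stream": False,
}).encode()

request = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

# The JSON reply carries the generated text in its "response" field.
print(json.loads(urllib.request.urlopen(request).read())["response"])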
LangChain with Ollama
LangChain is a fantastic open-source framework providing advanced GenAI modules and abstractions.
LangChain has first-class integration with Ollama.
https://python.langchain.com/v0.2/docs/integrations/llms/ollama/
Using the following command, we will set up LangChain with Ollama:
# install package
%pip install -U langchain-ollama
Collecting langchain-ollama
Downloading langchain_ollama-0.1.0rc0-py3-none-any.whl (12 kB)
Collecting langchain-core<0.3.0,>=0.2.20 (from langchain-ollama)
Downloading langchain_core-0.2.22-py3-none-any.whl (373 kB)
...
...
...
Installing collected packages: orjson, jsonpointer, h11, jsonpatch, httpcore, langsmith, httpx, ollama, langchain-core, langchain-ollama
Successfully installed h11-0.14.0 httpcore-1.0.5 httpx-0.27.0 jsonpatch-1.33 jsonpointer-3.0.0 langchain-core-0.2.22 langchain-ollama-0.1.0rc0 langsmith-0.1.93 ollama-0.3.0 orjson-3.10.6
Using Phi3 Mini via Ollama
Using the following code, we are going to ask the model a simple binary (yes/no) question using a prompt template.
############
# QUESTION
############
question='Is the sky blue? : Yes or No'
###################
# Hyper parameters
###################
temperature=0.0
num_predict=10
top_k=10
top_p=0.5
repeat_penalty=1.5
model_name='phi3:mini'
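# Note: in this first example the hyperparameters above are not passed to
# OllamaLLM; they are applied in the batch-mode example further below.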
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
template = """Question: {question}
Answer: """
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model=model_name)
#####################
# Defining the chain
#####################
chain = prompt | model
##############################
# Invoking the Generative AI
##############################
chain.invoke({"question": question})
The following is an output from the generative AI model:
' Yes, under clear conditions where there are no particles in the air to scatter sunlight differently. However, during certain weather patterns like fog, smoke from fires, and volcanic ash after eruptions (as seen with Mount St. Helens), it can appear gray or even black due to these particulates scatnering light differently than usual.'
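For comparison, here is a small sketch (not from the original run) of the same single invocation with the hyperparameters defined above actually passed to OllamaLLM, which constrains the model to a much shorter answer:
# Apply the previously defined hyperparameters to the model.
model_tuned = OllamaLLM(
    model=model_name,
    temperature=temperature,
    num_predict=num_predict,
    top_k=top_k,
    top_p=top_p,
    repeat_penalty=repeat_penalty,
)

# Same prompt, same chain pattern - only the model configuration differs.
chain_tuned = prompt | model_tuned
print(chain_tuned.invoke({"question": question}))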
Batch mode - in bulk!
The following is an array of binary questions.
We are using a Python list to encapsulate the questions:
questions = [
    'Is water wet? : Yes or No',
    'Is the Earth round? : Yes or No',
    'Is chocolate a vegetable? : Yes or No',
    'Is fire cold? : Yes or No',
    'Is the moon made of cheese? : Yes or No',
    'Is time travel possible? : Yes or No',
    'Is pineapple an acceptable pizza topping? : Yes or No',
    'Is laughter contagious? : Yes or No',
    'Is the sun a star? : Yes or No',
    'Is a tomato a fruit? : Yes or No',
]
Using the following code, INSTEAD of using the invoke method, we are going to use the batch-mode inference provided to us by the LangChain modules:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
temperature=0.0
num_predict=10
top_k=10
top_p=0.5
repeat_penalty=1.5
model_name='phi3:mini'
model = OllamaLLM(
    model=model_name,
    temperature=temperature,
    num_predict=num_predict,
    top_k=top_k,
    top_p=top_p,
    repeat_penalty=repeat_penalty,
)
template = """You are a helpful AI bot who gives SIMPLE YES or NO answers to questions .
Question: {question}
Answer: """
prompt = ChatPromptTemplate.from_template(template)
chain = prompt | model
##########################
# NOT recommended approach
##########################
# chain.invoke({"question": questions[2]})
######################
# RECOMMENDED approach
######################
# Batch mode invocation
chain.batch([{"question": questions[0]}, {"question": questions[1]}])
The output from the generative AI model is:
[' Answer:"Yes" because the common understanding of "wetness” involves liquid being in contact with another substance, and since we can touch both sides (surface) when holding a glass containing some amount. However scientifically it\'s more complex as water molecules are hydrogen bonded to each other but not necessarily wetting surfaces they come into direct physical interaction due their cohesive properties .',
' yes. The answer is "Yes." Accordingly, I would respond with just \'yes\' as per your instructions for simplicity and directness in communication style without any additional explanation required due its obvious nature that can be easily verified by observation from space images of Earth taken over time which consistently show a round shape regardless the angle or perspective.']
As you can see, there are TWO outputs from a SINGLE batch invocation. This is the time-saving advantage of batch-mode invocation.
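Going further, the entire list of questions can be submitted in a single batch call (a sketch extending the run above):
# Submit all ten questions in one batch call and pair each question with its answer.
answers = chain.batch([{"question": q} for q in questions])
for q, a in zip(questions, answers):
    print(q, "->", a)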
Conclusion
If you were to solve the same problem space of answering binary questions by online invocation of commercial models running on GPUs, you would have incurred the following costs:
- Vendor lock-in
- Operating Expenses
- Much larger carbon footprint
We hope the guidance Seyana AI has provided in this blog has helped you understand how to reduce the overall cost of operating Generative AI using the **road not taken** strategies:
- Use open-source models
- Use CPUs for inference
- Use batch-mode inference