Querying LLMs (Chatbots)

We will use LangChain, an open-source library for building applications with LLMs. If you get an error message, you can search for it online to see whether other users have had the same problem and found a solution. It may also be wise to look up the documentation of the library in question, if its name shows up in the error message.

Transformers and HuggingFace

We are using models from HuggingFace. HuggingFace is an American company that develops tools for machine learning. It created the Transformers library, which provides tools for downloading and using pretrained models. This documentation has a chapter on literature and references (29_references), where you can find URLs and further information on some of these subjects.

Task 4.1:

JupyterLab uses a Python kernel to execute the code in each notebook. To free up GPU memory from the previous chapter, you should stop the kernel for that notebook. In the menu on the left side of JupyterLab, click the dark circle with a white square in it. Then click KERNELS and Shut Down All. Remember to repeat this task whenever you get the error message “CUDA out of memory”.

The Language model

Model Location

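We should tell the HuggingFace library where to store its data, such as downloaded models. If you’re running on Educloud/Fox project ec443, the models are already stored at the path below. Set the variable before importing any HuggingFace library, because the cache location is read when the library is imported.

Task 5.1: Navigate to the location set in Cell 1, and look at the models. Do you recognize any names? Are there any European AIs in the collection?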

Cell 1:

import os
os.environ['HF_HOME'] = '/fp/projects01/ec443/huggingface/cache/'
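
Task 5.1 can also be approached from Python. The following is a minimal sketch, assuming the cache uses the default HuggingFace layout, where each model sits in a folder named models--<organisation>--<name> inside a hub subfolder. The helper list_cached_models is a hypothetical name introduced for illustration; it is not part of any library.

```python
import os
from pathlib import Path


def list_cached_models(hf_home):
    """Return 'organisation/name' ids for the models found under <hf_home>/hub."""
    hub_dir = Path(hf_home) / "hub"
    if not hub_dir.is_dir():
        # Nothing cached here, or the layout differs from what we assume
        return []
    model_ids = []
    for entry in sorted(hub_dir.iterdir()):
        # Assumed layout: cached models live in 'models--<organisation>--<name>'
        if entry.is_dir() and entry.name.startswith("models--"):
            model_ids.append(entry.name[len("models--"):].replace("--", "/"))
    return model_ids


# Fall back to the path from Cell 1 if HF_HOME has not been set yet
hf_home = os.environ.get("HF_HOME", "/fp/projects01/ec443/huggingface/cache/")
for cached_model in list_cached_models(hf_home):
    print(cached_model)
```

If nothing is printed, the folder is probably empty, not readable, or organised differently from what this sketch assumes.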

Loading the model

We import the HuggingFacePipeline class from the LangChain library. The pipeline allows us to feed raw text into a model. The model is then asked to perform a specific task, such as text generation or summarization.

Cell 2:

from langchain_huggingface.llms import HuggingFacePipeline

Cell 3:

model_id = 'meta-llama/Llama-3.2-1B'
# model_id = 'mistralai/Mistral-7B-Instruct-v0.3'
# model_id = 'meta-llama/Llama-3.2-1B-Instruct'
# meta-llama/Llama-3.2-3B-Instruct
# mistralai/Ministral-8B-Instruct-2410

Cell 4:

task = 'text-generation'

If your computer has a GPU, using that will be much faster than using the CPU. The torch library can help check if we have a GPU:

Cell 5:

import torch
torch.cuda.is_available()

The GPU is enabled by setting the argument device=0. The value -1 selects the CPU instead.

Cell 6:

device = 0 if torch.cuda.is_available() else -1

Loading the model.

Cell 7:

llm = HuggingFacePipeline.from_model_id(
    model_id,
    task,
    device=device
)

Using the model

We send some input to the model to see how it responds.

Cell 8:

result = llm.invoke("What is the world's largest lake")
print(result)

Here: Comment on output.

Model Arguments

We can set a number of arguments for the language model. In the context of this pipeline, the arguments are called kwargs, short for keyword arguments. In the example below, only the argument max_new_tokens is effective.

Task 5.2: How can you see that the arguments do_sample and temperature are not effective?

Cell 9:

llm = HuggingFacePipeline.from_model_id(
    model_id,
    task,
    device=device,
    pipeline_kwargs={
        'max_new_tokens': 100,
        #'do_sample': True,
        #'temperature': 0.3,
        #'num_beams': 4,
    }
)

This is a summary of the arguments to the pipeline:

- model_id: the name of the model on HuggingFace
- task: the task you want to use the model for
- device: the GPU hardware device to use. If we don’t specify a device, no GPU will be used.
- pipeline_kwargs: additional parameters that are passed to the model.
  - max_new_tokens: maximum length of the generated text
  - do_sample: if False, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead. The default value seems to be True.
  - temperature: the temperature controls the statistical distribution of the next word and is usually between 0 and 1. A low temperature increases the probability of common words. A high temperature increases the probability of outputting a rare word. Model makers often recommend a temperature setting, which we can use as a starting point.
  - num_beams: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
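
To make the effect of do_sample and temperature more concrete, here is a small sketch in plain Python. It does not use the model at all: the three candidate words and their scores are invented for illustration, and softmax_with_temperature is a hypothetical helper name. The real pipeline works on a vocabulary of many thousands of tokens, but the principle should be similar.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities; a lower temperature sharpens them."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented scores for three candidate next words (not taken from a real model)
words = ["lake", "sea", "pond"]
scores = [2.0, 1.0, 0.1]

# Compare how the probabilities change with the temperature
for temperature in (0.3, 1.0, 3.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Corresponds to do_sample=False: always pick the highest-scoring word
greedy = words[scores.index(max(scores))]

# Corresponds to do_sample=True: draw words at random according to the probabilities
rng = random.Random(0)
sampled = rng.choices(words, weights=softmax_with_temperature(scores, 0.3), k=5)
print(greedy, sampled)
```

With a low temperature almost all of the probability ends up on the highest-scoring word, so sampling behaves nearly like the deterministic choice. With a high temperature the probabilities even out and rarer words are drawn more often.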

Recommended literature for those interested in reading more on parameters is Jurafsky and Martin’s textbook Speech and Language Processing.

Making a prompt

Cell 10:

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

Cell 11:

messages = [
    SystemMessage("You are a pirate chatbot who always responds in pirate speak in complete sentences!"),
    MessagesPlaceholder(variable_name="messages"),
]

Cell 12:

prompt = ChatPromptTemplate.from_messages(messages)

Cell 13:

chatbot = prompt | llm

Cell 14:

result = chatbot.invoke([HumanMessage("Who are you?")])
print(result)

Cell 15:

result = chatbot.invoke([HumanMessage("Tell me about your ideal boat?")])
print(result)
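
The model in this pipeline is a text-generation model, so it receives plain text and not a list of message objects. As far as we understand, LangChain therefore turns the system and human messages into a single string before passing them on. The sketch below imitates that idea in plain Python. The function flatten_messages is hypothetical, and the role labels and layout are assumptions; the real text can be inspected with prompt.invoke([HumanMessage("Who are you?")]).to_string().

```python
def flatten_messages(messages):
    """Join (role, text) pairs into one string with one 'Role: text' line per message."""
    return "\n".join(f"{role}: {text}" for role, text in messages)


# The system message from the pirate example, followed by one user question.
# The role labels "System" and "Human" are assumed for this illustration.
system = ("System", "You are a pirate chatbot who always responds in pirate speak in complete sentences!")
conversation = [("Human", "Who are you?")]

prompt_text = flatten_messages([system] + conversation)
print(prompt_text)
```

Seeing the prompt as one block of text may help explain the output: a base model such as meta-llama/Llama-3.2-1B simply continues the text, while an Instruct model has been trained to treat it as a conversation.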

Task 4.2:

The model meta-llama/Llama-3.2-1B is a small model and will yield low accuracy on many tasks. To get the benefit of the power of the GPU, we should use a larger model. Try to change the code in the pirate example to use the model mistralai/Mistral-7B-Instruct-v0.3 instead. How does this change the output?

Task 4.3:

Continue using the model mistralai/Mistral-7B-Instruct-v0.3. Change the temperature parameter. The value must be written as a decimal number in order to work, for example 0.9 or 10.0. For the temperature to have an effect, you must also set the parameter 'do_sample': True. How does this change the output?

Task 4.4:

Find the relevant input cells in your notebook, and replace some of the code with this:

SystemMessage("You are a world class economist chatbot who always responds in understandable speak in complete sentences!"),
result = chatbot.invoke([HumanMessage("Tell me about income equality and colonial history?")])