{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "3f3664ff-162c-4333-bbf1-e796c66f20a8", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Summarization\n", "\n", "We are going to summarize the documents in the folder\n", "\"documents\". The PDFs for this lesson are located at the path\n", "\"/fp/projects01/ec443/documents\". In task number 2 in the chapter \"Getting started\", you were asked to make your own documents folder in your home directory. If you did not have time then, you get a second chance now. Do Task 5.1 below if you have not done it already.\n", "\n", "![fox_dokument.png](../fox_dokument.png)\n", "\n", "The easiest way to do this is to use the browser view for Fox. The idea is that you are a researcher with a specific subject in mind. In this case, the documents come from a search for \"terrorism\" and \"western europe\" in DOAJ.\n", "\n", "```{admonition} Task 5.1:\n", "Copy all of the content from this path:\n", "``/fp/projects01/ec443/documents``, into your own folder named \"documents\" in your home directory.\n", "```\n", "\n", "```{admonition} Task 5.2:\n", "JupyterLab uses a Python kernel to execute the code in each notebook. To free up GPU memory from the previous chapter, you should stop the kernel for that notebook. In the menu on the left side of JupyterLab, click the dark circle with a white square in it. 
Then click KERNELS and Shut Down All.\n", "```\n", "\n", "Cell 1:" ] }, { "cell_type": "code", "execution_count": null, "id": "b5a11836-a2db-4856-a56c-1cc38738bbc1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "document_folder = '/fp/projects01/ec443/documents/terrorism'" ] }, { "cell_type": "markdown", "id": "4e56f380-89ea-43bf-97b6-e2d1ed0f855b", "metadata": {}, "source": [ "We set the location of the downloaded models again, in case this notebook runs in a fresh kernel.\n", "\n", "Cell 2:" ] }, { "cell_type": "code", "execution_count": null, "id": "2ee67962-e0ff-46b9-9e37-3acaa7ce7a0e", "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['HF_HOME'] = '/fp/projects01/ec443/huggingface/cache/'" ] }, { "cell_type": "markdown", "id": "2b20c979", "metadata": {}, "source": [ "We check whether a GPU is available. `device` is set to 0 (the first GPU) if one is found, and to -1 (CPU) otherwise.\n", "\n", "Cell 3:" ] }, { "cell_type": "code", "execution_count": null, "id": "f30bec81", "metadata": {}, "outputs": [], "source": [ "import torch\n", "device = 0 if torch.cuda.is_available() else -1" ] }, { "cell_type": "markdown", "id": "9dad5d94-29b7-4c7f-87dd-ee7a8a4d0b0d", "metadata": {}, "source": [ "Cell 4:" ] }, { "cell_type": "code", "execution_count": null, "id": "9d36006e-2d57-4d3f-ae47-b3023d8fcdf7", "metadata": {}, "outputs": [], "source": [ "from langchain_huggingface.llms import HuggingFacePipeline\n", "\n", "llm = HuggingFacePipeline.from_model_id(\n", "    model_id='mistralai/Ministral-8B-Instruct-2410',\n", "    task='text-generation',\n", "    device=device,  # set in Cell 3: 0 = first GPU, -1 = CPU\n", "    pipeline_kwargs={\n", "        'max_new_tokens': 1000,\n", "        #'do_sample': True,\n", "        #'temperature': 0.3,\n", "        #'num_beams': 4,\n", "    }\n", ")" ] }, { "cell_type": "markdown", "id": "175d3ebd", "metadata": {}, "source": [ "We can give some arguments to the pipeline:\n", "- `model_id`: the name of the model on HuggingFace\n", "- `task`: the task you want to use the model for\n", "- `device`: the GPU hardware device to use. 
If we don't specify a device, no GPU will be used.\n", "- `pipeline_kwargs`: additional parameters that are passed to the model.\n", "  - `max_new_tokens`: maximum length of the generated text\n", "  - `do_sample`: if `False`, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead. The default value seems to be `True`.\n", "  - `temperature`: the temperature controls the statistical *distribution* of the next word and is usually between 0 and 1. A low temperature increases the probability of common words. A high temperature increases the probability of outputting a rare word. Model makers often recommend a temperature setting, which we can use as a starting point.\n", "  - `num_beams`: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one at the end.\n" ] }, { "cell_type": "markdown", "id": "1165e38a-f724-460b-8057-99b87794eae4", "metadata": {}, "source": [ "## Making a prompt\n", "\n", "Cell 5:" ] }, { "cell_type": "code", "execution_count": null, "id": "8274a2c0-7509-4824-8de5-4e1b00a16094", "metadata": {}, "outputs": [], "source": [ "from langchain_classic.chains.combine_documents import create_stuff_documents_chain\n", "from langchain_classic.chains.llm import LLMChain\n", "from langchain_classic.prompts import PromptTemplate" ] }, { "cell_type": "markdown", "id": "687575de-361a-41b2-9aab-4b9e2e69eb88", "metadata": {}, "source": [ "Cell 6:" ] }, { "cell_type": "code", "execution_count": null, "id": "880f239e-9e91-4be7-bf5d-6e2f65ea28b3", "metadata": {}, "outputs": [], "source": [ "separator = '\\nYour Summary:\\n'\n", "prompt_template = '''Write a summary of the following:\n", "\n", "{context}\n", "''' + separator\n", "prompt = PromptTemplate(template=prompt_template,\n", "                        input_variables=['context'])" ] }, { "cell_type": "markdown", "id": 
"1145cf41-a132-4599-b814-c92edac672e1", "metadata": {}, "source": [ "## Separating the Summary from the Input\n", "\n", "LangChain returns both the input prompt and the generated response in one long text.\n", "To get only the summary, we must split the summary from the document that we sent as input.\n", "We can use the LangChain *output parser*\n", "[RegexParser](https://api.python.langchain.com/en/latest/langchain/output_parsers/langchain.output_parsers.regex.RegexParser.html) for this.\n", "\n", "Cell 6:" ] }, { "cell_type": "code", "execution_count": null, "id": "caa4ff32", "metadata": {}, "outputs": [], "source": [ "from langchain_classic.output_parsers import RegexParser\n", "import re\n", "\n", "output_parser = RegexParser(\n", " regex=rf'{separator}(.*)',\n", " output_keys=['summary'],\n", " flags=re.DOTALL)" ] }, { "cell_type": "markdown", "id": "73f554fe", "metadata": {}, "source": [ "## Create chain\n", "\n", "The document loader loads each PDF page as a separate 'document'. This is partly for technical reasons because that is the way PDFs are structured. Therefore, we use the chain called `create_stuff_documents_chain` which joins multiple documents into a single large document.\n", "\n", "Cell 7:" ] }, { "cell_type": "code", "execution_count": null, "id": "15856ef3-cd9c-4988-acda-a89286843840", "metadata": {}, "outputs": [], "source": [ "# chain = create_stuff_documents_chain(\n", "# llm, prompt, output_parser=output_parser)\n", "\n", "chain = create_stuff_documents_chain(llm, prompt)" ] }, { "cell_type": "markdown", "id": "40b02bae-25d1-4a3f-ab8d-ff30591740eb", "metadata": {}, "source": [ "A function to split the summary from the input. LangChain returns both the input prompt and the generated response in one long text. 
To get only the summary, we keep the text that follows the separator and discard everything before it.\n", "\n", "Cell 8:" ] }, { "cell_type": "code", "execution_count": null, "id": "ff718953-a6c4-4a27-9f22-eedb8bc149a7", "metadata": {}, "outputs": [], "source": [ "def split_result(result):\n", "    \"Return only the generated summary, without the prompt.\"\n", "    position = result.find(separator)\n", "    if position == -1:\n", "        # Separator not found: return the text unchanged\n", "        return result\n", "    return result[position + len(separator):]" ] }, { "cell_type": "markdown", "id": "722ee223-5d63-4d2f-bb80-6a8833640aa2", "metadata": {}, "source": [ "## Loading the Documents\n", "\n", "We use LangChain’s `DirectoryLoader` to load all files in\n", "`document_folder`, which is defined at the start of this\n", "notebook.\n", "\n", "Cell 8:" ] }, { "cell_type": "code", "execution_count": null, "id": "0510094f-f67f-4e95-81af-8b3e0ed6e54b", "metadata": {}, "outputs": [], "source": [ "from langchain_community.document_loaders import DirectoryLoader\n", "\n", "loader = DirectoryLoader(document_folder)\n", "documents = loader.load()\n", "print('number of documents:', len(documents))" ] }, { "cell_type": "markdown", "id": "836f2c14-64ee-481e-92dc-d28d5e4b537c", "metadata": {}, "source": [ "## Creating the Summaries\n", "\n", "Now we can iterate over the documents with a for-loop, summarize each one, and store the result under its filename.\n", "\n", "Cell 9:" ] }, { "cell_type": "code", "execution_count": null, "id": "8b507946-2c15-4a00-ab31-376a8ee908aa", "metadata": {}, "outputs": [], "source": [ "summaries = {}\n", "\n", "for document in documents:\n", "    filename = document.metadata['source']\n", "    print(filename)\n", "    summary = chain.invoke({\"context\": [document]})\n", "    summary = split_result(summary)\n", "    summaries[filename] = summary\n", "    print('Summary of file', filename)\n", "    print(summary)" ] }, { "cell_type": "markdown", "id": "0bda5a33-132e-4b46-b73d-edff672c0703", "metadata": {}, "source": [ "## Saving the Summaries to Text Files\n", "\n", "Finally, we save the summaries for later 
use. In the example below, we\n", "save all the summaries in the file `summaries_2.txt`.\n", "\n", "Cell 10:" ] }, { "cell_type": "code", "execution_count": null, "id": "0eb1f5b9-1b43-4742-84d2-d1a0b2b2f592", "metadata": {}, "outputs": [], "source": [ "with open('summaries_2.txt', 'w', encoding='utf-8') as outfile:\n", "    for filename in summaries:\n", "        print('Summary of', filename, file=outfile)\n", "        print(summaries[filename], file=outfile)\n", "        print(file=outfile)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5c5713dd-681a-4a4b-8b86-59b89bc7cfbb", "metadata": {}, "source": [ "## Make an overall summary\n", "\n", "This is covered in the [bonus\n", "material](https://uio-library.github.io/LLM-course/3_summarizing.html).\n", "\n", "```{admonition} Task 5.3:\n", "The Chatbot and Summarization chapters can be run on either the largest or the second-largest GPU on Fox (40GB memory). The next chapter, on RAG, requires the largest GPU with its 80GB of memory. Make sure your job is running on that GPU. Also go to the menu in JupyterLab and choose,\n", "as shown in the illustration below: Kernel --\\> Shut down all kernels. Then open a new notebook, save it with a name of your choice, and run the RAG process in that notebook, without any other content in the cells.\n", "\n", "![shut_kernel.png](../shut_kernel.png)\n", "```\n", "\n", "```{admonition} Task 5.4:\n", "How can you see whether a single kernel is running, and how do you\n", "shut kernels down one by one?\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 5 }