Blog - Introduction

Ancud Blog

Welcome to the Ancud Blog. Here you will find a variety of interesting articles on a wide range of topics. Dive into our world of knowledge!

Blogs

Learn continuously, succeed consistently

Who here hasn’t come across ChatGPT and large language models? I believe many of you are already using them in some way and are familiar with the topic. Great! Generative AI is becoming increasingly important, and it’s crucial for us to pay attention to it these days. As I always strive to stay up to date with the latest trends, I also make technical skills a priority. My goal is to explain things that will be useful to you in this evolving landscape.

Each article features a step-by-step tutorial, allowing you to easily build the project and apply the knowledge shared. Each tutorial also includes a well-explained architecture, so you understand the various services involved. Let’s begin our journey!

Introduction : 

As we’re all aware, data is omnipresent, and its forms are rapidly diversifying with the applications that produce it. The absence of a standardized data format poses a challenge: how can we effectively keep up with this diversity and derive valuable insights? Particularly noteworthy is that unstructured data (like emails, slides, images, social media posts, and meeting records) makes up a substantial 80%, far outweighing structured data, which accounts for only 20%.

So, what’s our course of action? What’s the optimal solution for this situation? 

Yes, congratulations! You guessed it correctly. Large language models are the key to processing all sorts of data without worrying about its structure; this is exactly where generative AI comes into play. And here’s the exciting part: we’re going to build our own localGPT using 100% private data.

What is a Transformer ? 

A transformer is a type of neural network architecture widely used in natural language processing, known for its ability to efficiently process and understand sequential data using self-attention mechanisms. 

Imagine you have a sentence: “The cat sat on the mat.” Now, instead of reading it from left to right like we do, a transformer looks at each word and decides how much attention to give to every other word. It figures out relationships between words in the sentence and uses that understanding to perform tasks like translation or summarization. It’s like having a really smart friend who can analyze and connect information from all parts of a story at once, rather than reading it linearly. 
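To make the self-attention idea a bit more concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. The inputs are tiny random vectors standing in for the six tokens of our example sentence; it only illustrates the mechanics, not a real transformer layer.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q attends to every row of K and mixes the rows of V accordingly."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how strongly each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the attention scores
    return weights @ V                                # weighted mix of the value vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))                      # 6 tokens ("The cat sat on the mat"), 4 dimensions each
output = scaled_dot_product_attention(tokens, tokens, tokens)   # self-attention: Q, K and V share the same input
print(output.shape)                                   # (6, 4): one context-aware vector per token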

If you’re curious about how transformers really work, check out popular ones like BERT, GPT, and T5. See how they’re used in everyday situations to get a better idea. 

However, if you want to grasp the concepts I’m discussing in my article, let’s move on to the next section. I’ll do my best to explain things clearly so you can get a good sense of what’s happening. 

What is Langchain ? 

When discussing LangChain, it’s important for you to grasp that it is a Python framework designed to empower applications with the capabilities of large language models. 

Here is a more detailed look at its building blocks:

It’s important to understand that LangChain consists of five fundamental components that collectively contribute to its functionality:

1. Indexes: This component encompasses document loaders, text splitters, and vector stores. These elements work harmoniously to enhance data access by making it not only faster but also more efficient. Document loaders facilitate the retrieval of documents, text splitters help in breaking down text into manageable units, and vector stores optimize the storage and retrieval of vector representations, collectively streamlining the overall data access process. 

2. Prompts: The prompt-related functionalities involve three key aspects — prompt management, prompt optimization, and prompt serialization. Prompt management ensures the effective handling and organization of prompts, while prompt optimization focuses on refining prompts for improved performance. Prompt serialization involves the conversion of prompts into a format suitable for storage or transmission, ensuring seamless integration within the LangChain framework. 

3. Models: This component acts as an interface to various model types. By providing a versatile interface, LangChain accommodates different types of models, allowing users to leverage the strengths of diverse models for specific applications. This flexibility enhances the adaptability of LangChain across a range of use cases and scenarios. 

4. Chains: Going beyond a single Large Language Model (LLM) call, the concept of chains within LangChain is pivotal. Chains enable the creation of sequences of calls, allowing for the orchestration of multiple LLM operations. This capability is particularly useful when dealing with complex tasks or scenarios that necessitate a series of interconnected language model interactions. 

5. Agents: Agents serve as entities within the LangChain framework that utilize Large Language Models (LLMs) to make decisions. These decisions revolve around selecting appropriate actions to take. After executing an action, agents observe the outcomes and iterate through the decision-making process until their assigned task is successfully completed. This iterative approach, driven by LLMs, empowers agents to navigate and accomplish tasks efficiently within the LangChain ecosystem. 
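To give you a feel for how these components fit together, here is a minimal sketch of a chain that combines a prompt template with a model into a single reusable call. The import paths follow the classic langchain 0.0.x package layout from the time of writing (newer releases reorganize the modules), and the OpenAI model is only one interchangeable choice.

from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain

# A prompt with one input variable
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one short paragraph for a non-technical reader.",
)

# The chain glues the prompt and the model into one callable step;
# longer pipelines feed one chain's output into the next.
llm = OpenAI(temperature=0)              # expects OPENAI_API_KEY in the environment
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(topic="vector databases"))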

Let’s get a better feel for these components by walking through the major steps that have to be implemented in every LLM project.

1. Load data :

LangChain offers a feature to bring in your data. It includes various functions called document loaders, which bring your data in as documents. A document is like a package with both the actual data and some information about it. You can use these loaders for text files (.txt), PDFs, CSV files, and many other supported types.
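Here is a minimal, hedged sketch of loading documents (the file names are made up, and PyPDFLoader additionally needs the pypdf package installed):

from langchain.document_loaders import TextLoader, PyPDFLoader

# Each loader returns a list of Document objects: page_content plus metadata
txt_docs = TextLoader("SOURCES_DOCUMENTS/notes.txt").load()
pdf_docs = PyPDFLoader("SOURCES_DOCUMENTS/report.pdf").load()

documents = txt_docs + pdf_docs
print(documents[0].metadata)   # e.g. the source file each document came from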

2. Split data into chunks : 

If your data is large, we need to break it into smaller pieces or chunks. These chunks will then be treated as individual data points and stored in a vector store database. 
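Continuing from the loading sketch above, a text splitter turns the documents into overlapping chunks. The chunk size and overlap below are common starting values, not recommendations from the original repository:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)   # `documents` comes from the loading step above
print(f"{len(documents)} documents -> {len(chunks)} chunks")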

3. Embeddings : 

Our task now is to enable our algorithms to understand and compare the text components effectively. This requires finding a method to convert human language into a digital format, using bits and bytes to represent the information. 
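In LangChain this conversion is handled by an embedding model. A small sketch, assuming a sentence-transformers model from Hugging Face (any other embedding model could be swapped in):

from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("The cat sat on the mat.")
print(len(vector))   # a fixed-length list of floats, e.g. 384 dimensions for this model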

4. Build Semantic Index : 

Building a Semantic Index is like creating a clever summary of important ideas from a text. First, pick out key words and understand how they relate. Then, make a smart list considering the context. This list becomes a quick, intelligent index that tells you what the text is about — a bit like having a clever friend summarize a long story for you! 

5. Create a vector store : 

A vector database is a place to store information in a way that computers can quickly understand. Instead of traditional rows and columns, it uses vectors (think of them like arrows with directions). Each vector represents data points and their relationships, making it efficient for tasks like searching and comparing.
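Here is a small sketch of persisting our chunks into Chroma DB and querying them, continuing with the `chunks` and `embeddings` objects from the previous steps (the folder name and query are illustrative):

from langchain.vectorstores import Chroma

# Persist the embedded chunks to a local folder so they can be reloaded later
db = Chroma.from_documents(chunks, embeddings, persist_directory="DB")

# Similarity search returns the chunks whose vectors lie closest to the query vector
hits = db.similarity_search("What does the report say about unstructured data?", k=3)
for doc in hits:
    print(doc.metadata, doc.page_content[:80])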

6. Define our Prompt Template : 

Now that we have short bits of information, let’s tell the LLM how to answer questions. We can use the LLM for various tasks, like summarizing text, answering questions, or even writing emails. For our case, we want the LLM to act like a friendly helper that answers questions. So, we give it a simple instruction to use only the given info. This makes sure the answers are reliable and can be trusted, which is important for making good decisions in a company. 
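A minimal sketch of such a template follows; the wording is only one example of the "answer from the given context only" instruction:

from langchain.prompts import PromptTemplate

template = """Use only the following context to answer the question.
If the answer is not contained in the context, say that you don't know.

Context: {context}

Question: {question}
Helpful answer:"""

qa_prompt = PromptTemplate(template=template, input_variables=["context", "question"])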

7. Choose the model : 

Langchain has different models, including OpenAI’s GPT. If we choose GPT, we start by getting an API Key. There’s free usage with OpenAI, but once we use a lot of tokens, it becomes a paid service. Answering simple questions with GPT is cheap, but if we ask complex questions, especially about personal stuff, it can get costly because it uses more tokens. But hey, you can set a cost limit to keep things in check. 
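To tie the previous sketches together, here is one way to combine the chosen model, the vector store retriever, and the prompt into a question-answering chain. It uses OpenAI purely as an example; the localGPT project we build below relies on a local Hugging Face model instead, and its exact wiring differs:

from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

llm = OpenAI(temperature=0)   # needs OPENAI_API_KEY; a local model wrapper could be used instead

# "stuff" simply stuffs the retrieved chunks into the prompt's {context} slot
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 4}),   # `db` and `qa_prompt` come from the steps above
    chain_type_kwargs={"prompt": qa_prompt},
)

print(qa.run("How much of our data is unstructured?"))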

Tutorial : 

Let’s start building our project. The idea comes from this repository; please show it some love by giving it a star and thanking the person who did a great job. To make our project more interesting and add some data handling features, we’ll use Apache Airflow. If you want to know more about Airflow, check out my article.


Let me give you a quick overview of the workflow. Apache Airflow will be in charge of both extracting and orchestrating our data. Langchain will assist in generating local embeddings, saving them in a vector database (Chroma DB). To interact with the model, we’ll develop a Streamlit app for the interface. 

What’s crucial here is that I’ll also guide you on running your project either locally or using a Docker container and how to attach a GPU to it. 

1. Understand the project architecture :


In this part, I’ll break down the essential documents you need to understand to run the project. This will give you the freedom to adjust it according to what you need. 

Here are the essential files you should know about: 

1. SOURCES_DOCUMENTS: This folder is where you store all your data. 

2. ingest.py: This script loads your data from SOURCES_DOCUMENTS, breaks them into chunks, and creates embeddings. 

3. run_localGPT.py: This script uses the local language model (LLM) to answer questions. The context for the answers comes from the local vector store, retrieved via similarity search. Run this script in your terminal.

4. localGPT_UI.py: Similar to the previous file, but with this script, you get a Streamlit UI. It provides a user interface to interact with the model. 

5. constants.py: In this file, you specify the embedding model and LLM you want to use. 
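To give you an idea of what lives in there, here is a simplified sketch of the kind of settings constants.py typically holds. The variable names and model identifiers below are illustrative assumptions; check the cloned repository for the exact ones it expects.

# constants.py (illustrative sketch -- the real file in the repository has more options)

# Folder with your raw documents and folder where the Chroma DB is persisted
SOURCE_DIRECTORY = "SOURCES_DOCUMENTS"
PERSIST_DIRECTORY = "DB"

# Embedding model pulled from Hugging Face by ingest.py
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# Local LLM used by run_localGPT.py and localGPT_UI.py to answer questions
MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"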

2. Steps to run the project :

a. Clone the project 

b. Create a virtual environment : 

conda create -n GPT python=3.10.0 

c. Activate the environment : 

conda activate GPT

d. Install all the requirements : 

pip install -r requirements.txt 

e. Run the ingestion script : 

python ingest.py  

At this point, the project will download models from Hugging Face and initiate the data processing steps we discussed. A new folder called “DB” will be created to store all your embeddings. 


f. Interact with the model : 

streamlit run localGPT_UI.py  

And here we are : 

Now, let me guide you through dockerizing the entire project and launching it as a Docker container. If you’re not familiar with Docker, you can refer to my articles; they’ll provide valuable assistance.


Let’s take a closer look at the Dockerfile together : 

FROM nvidia/cuda:11.7.1-runtime-ubuntu22.04 

This line specifies the base image for your Docker container: the official NVIDIA CUDA runtime image, version 11.7.1, based on Ubuntu 22.04. This base image provides NVIDIA CUDA support, which is commonly needed for GPU workloads.

RUN apt-get update && apt-get install -y python3.10

This block installs Python version 3.10 inside the container. It updates the package list (apt-get update) and then installs Python 3.10 (apt-get install -y python3.10). This step is necessary to run Python-based applications in your container. 

WORKDIR /app 

This line sets the working directory inside the container to /app. The WORKDIR instruction is used to set the working directory for any subsequent COPY, RUN, CMD, ENTRYPOINT, or ADD instructions that follow in the Dockerfile. 

RUN apt-get update && \ 
    apt-get install -y python3-pip gcc-11 && \ 
    rm -rf /var/lib/apt/lists/* 

This block installs additional system dependencies. It updates the package list, installs python3-pip (Python package installer) and gcc-11 (GNU Compiler Collection), and then cleans up the package cache to reduce the image size. 

COPY requirements.txt /app/ 
RUN pip3 install --no-cache-dir -r requirements.txt 

This block copies a requirements.txt file from the host machine to the /app/ directory inside the container. It then installs the Python dependencies listed in the requirements.txt file using pip3. This is a common practice to manage and install dependencies in Python projects. 

COPY . /app/ 

This line copies the entire content of the current directory from the host machine into the /app/ directory inside the container. This is used to bring your application code and files into the container. 

EXPOSE 8501 

This line informs Docker that the container will listen on port 8501. However, it doesn’t actually publish the port to the host machine. Port exposure is more of a documentation feature in Docker, and to publish the port, you would typically use the -p option when running the container. 

# Run app.py when the container launches 
CMD ["streamlit", "run", "localGPT_UI.py"] 

This line sets the default command that will be executed when the container starts. It runs a Streamlit application named localGPT_UI.py using the command streamlit run. Streamlit is a Python library for creating web applications for data science and machine learning. 

Let’s build the image : 

docker build -t localgpt .   

After the image is built, we can start the container :

docker run -it --mount src="$HOME/.cache",target=/root/.cache,type=bind --gpus all -d -p 8501:8501 localgpt  

Conclusion : 

Wrapping it up, I hope you enjoyed the article. As I continue to share updates on IT topics beneficial to data engineering, I value your engagement and welcome your suggestions for new project ideas. Thanks again for reading! 

Author: Chiheb Mhamdi