Deploy Large Language Models Into Production At NO COST!

Published on Aug 06, 2024


Introduction

In this tutorial, you will learn how to deploy large language models into production at no cost using the vLLM framework. This guide will walk you through setting up an API server to serve your model using Google Colab and Postman, making it easy for you to experiment with large language models in a practical setting.

Step 1: Set Up Google Colab

  1. Open Google Colab.
  2. Select Runtime from the menu.
  3. Click on Change runtime type.
  4. Select GPU from the hardware accelerator dropdown (the free T4 GPU is sufficient for this tutorial).

Step 2: Install vLLM Framework

  1. In a code cell, run the following command to install vLLM (the leading ! tells Colab to run it as a shell command):
    !pip install vllm
    

Step 3: Import Required Libraries

  1. After installation, import the necessary classes from vLLM (note that the class is named SamplingParams):
    from vllm import LLM, SamplingParams
    

Step 4: Define Your Prompts

  1. Prepare the prompts you want your model to complete. For instance:

    • "Abidjan is located in"
    • "A data scientist is a person who"
    • "The future of agriculture in Africa is"
  2. Set the sampling parameters (a code sketch combining both items follows this list):

    • Temperature: 0.8
    • Top-p (nucleus sampling): 0.95
    • Maximum tokens: 50
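
Putting the two items above together, here is a minimal sketch of the prompts-and-parameters cell (the variable names are illustrative):

    from vllm import LLM, SamplingParams  # already imported in Step 3

    # Prompts the model will complete
    prompts = [
        "Abidjan is located in",
        "A data scientist is a person who",
        "The future of agriculture in Africa is",
    ]

    # Sampling parameters from the list above
    sampling_params = SamplingParams(
        temperature=0.8,  # randomness of sampling
        top_p=0.95,       # nucleus sampling probability
        max_tokens=50,    # cap on generated tokens per prompt
    )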

Step 5: Load the Model

  1. Load the model you want to use. For example, the small Facebook OPT-125M model fits comfortably on a free Colab GPU:
    model = LLM("facebook/opt-125m")
    

Step 6: Generate Text Outputs

  1. Run the model against your prompts to generate outputs:

    outputs = model.generate(prompts, sampling_params)
    
  2. Print the generated outputs to see the completions for each prompt, as in the loop below.
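
Each result pairs the original prompt with its generated completion. A minimal printing loop over the outputs variable from the previous step:

    # Each item in outputs is a RequestOutput holding the prompt
    # and one or more generated completions.
    for output in outputs:
        print("Prompt:", output.prompt)
        print("Completion:", output.outputs[0].text)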

Step 7: Set Up the API Server

  1. To set up an API server, start vLLM's OpenAI-compatible server. In Colab, run it in the background (nohup plus a trailing &) so the notebook stays usable:

    !nohup python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8000 &
    
  2. Make sure the server is running to handle requests.
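
Loading the model can take a minute. One quick way to confirm the server is ready from another cell is to ping its health endpoint (recent vLLM versions of the OpenAI-compatible server expose /health; treat that as an assumption for your version):

    import requests

    # /health returns 200 once the model is loaded and serving
    response = requests.get("http://localhost:8000/health")
    print(response.status_code)  # expect 200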

Step 8: Create a Public Endpoint

  1. To make your API accessible online, run localtunnel in another cell:

    !npx localtunnel --port 8000
    
  2. This will generate a public URL. Copy the URL for use in Postman.

Step 9: Test the API with Postman

  1. Open Postman and create a new request.

  2. Set the request type to POST.

  3. Enter the public URL followed by /v1/completions (e.g., https://your_public_url/v1/completions).

  4. In the body of the request (raw JSON), specify the parameters:

    • model: facebook/opt-125m
    • prompt: "Abidjan is located in"
    • max_tokens: 50
    • temperature: 0.8
    • top_p: 0.95
  5. Send the request and observe the response for generated text.
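
If you prefer code over Postman, the same request can be sent with Python's requests library. A sketch, with a placeholder standing in for your localtunnel URL:

    import requests

    # Replace with the public URL generated in Step 8 (placeholder shown)
    url = "https://your_public_url/v1/completions"

    payload = {
        "model": "facebook/opt-125m",
        "prompt": "Abidjan is located in",
        "max_tokens": 50,
        "temperature": 0.8,
        "top_p": 0.95,
    }

    response = requests.post(url, json=payload)
    # OpenAI-style responses put the generated text under choices[0].text
    print(response.json()["choices"][0]["text"])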

Step 10: Experiment with Different Prompts

  1. You can modify the prompt in Postman to test various queries and see how the model performs with different inputs.
  2. Keep experimenting with the sampling parameters and different models available from Hugging Face.

Conclusion

You have successfully set up an API server to serve a large language model using the vLLM framework and Google Colab. This setup allows you to experiment with various prompts and models at no cost. For further exploration, consider trying out different models listed on Hugging Face or tweaking the sampling parameters for varied results. Happy experimenting!