Run Llama 2 with 32k Context Length!

Published on Apr 24, 2024. This article was partially generated with the help of AI and may contain inaccuracies.

Step-by-Step Tutorial: Running Llama 2 with 32k Context Length

  1. Access the Notebook on Github:

    • Visit github.com and navigate to the author's repositories.
    • Open the repository named "codelama 32k" to access the notebook.
  2. Open the Notebook in Google Colab:

    • Click on the notebook file (.ipynb) to open it in Google Colab.
    • To run it on the free tier, go to Runtime > Change runtime type and select a GPU (the free T4), which supports roughly an 8k-token context; a quick check that the GPU is attached is shown in the first sketch after this list.
  3. Run the Notebook for 16k Context:

    • Once the notebook is running, you'll see a summary of the Berkshire Hathaway 2023 annual meeting.
    • The context length is set to 16,000 tokens by default, but you can adjust it in the notebook up to the full 32k; the loading sketch after this list shows the relevant pieces.
  4. Deploy the Notebook:

    • Click Continue and then Deploy to connect to the JupyterLab tab.
    • In JupyterLab, you can access the workspace; the Pro version adds features such as adjusting the context length and uploading your own files.
  5. Run All Cells in the Notebook:

    • Run all cells in the notebook to check the installation progress and load the model onto the GPU.
    • The model shards will be loaded onto the GPU, and the CPU usage will decrease as the GPU usage increases.
  6. Use the 13-Billion-Parameter Model:

    • Switch to the 13-billion-parameter model for better quality and accuracy.
    • Run all cells again to load the model onto the GPU and start generating summaries.
  7. Monitor Memory Usage and Summary Generation:

    • Keep an eye on GPU memory usage as the model generates summaries (a small monitoring helper is sketched after this list).
    • GPU memory grows as the model samples each next token while generating the summary of the Berkshire Hathaway meeting transcript.
  8. Optimizing for Long Context Length:

    • To achieve a long context length, use a large GPU and optimize memory usage.
    • Use techniques such as FlashAttention, quantization, and more memory-efficient software to reduce memory requirements while preserving quality (see the sketch after this list).
  9. Consider Quality and Memory Usage:

    • Model quality is crucial for accurate summaries, so choose an appropriate model size and parameters.
    • Fine-tune the model and use techniques such as RoPE scaling to extrapolate to longer context lengths (see the RoPE-scaling sketch after this list).
  10. Final Notes and Tips:

    • Experiment with different models and context lengths to find the optimal balance between quality and memory usage.
    • Include a system prompt to guide the model toward accurate summaries (an example in the Llama 2 chat format is sketched after this list).
    • Check the video description for additional resources and tips on running Llama 2 with a 32k context length.
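
The sketches below illustrate some of the steps in code. They are illustrative only; exact model names, file paths, and library arguments are assumptions and may differ from the notebook. First, a quick check for step 2 that the Colab runtime actually has a GPU attached:

```python
# Confirm a GPU (e.g. a free-tier T4) is attached to the Colab runtime.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected - select one under Runtime > Change runtime type.")
```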
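
Next, a minimal sketch of loading a 32k-context Llama 2 variant and summarizing a transcript (steps 3-6). The checkpoint name and input file are assumptions; swap in the values used by the notebook, for example a 13B variant for better quality:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/LLaMA-2-7B-32K"  # assumed 32k checkpoint; a 13B variant trades memory for quality

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the shards fit on the GPU
    device_map="auto",          # streams the model shards onto the available GPU(s)
)

transcript = open("berkshire_2023_meeting.txt").read()  # hypothetical transcript file
prompt = f"Summarize the following meeting transcript:\n\n{transcript}\n\nSummary:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print("Prompt tokens:", inputs.input_ids.shape[1])  # keep this under your 16k/32k budget

output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```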
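
For step 7, a small helper (the function name is my own) to watch GPU memory grow as the KV cache fills during long-context generation:

```python
import torch

def report_gpu_memory(tag: str = "") -> None:
    """Print current and peak GPU memory in GB."""
    allocated = torch.cuda.memory_allocated() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"{tag}: allocated {allocated:.2f} GB, peak {peak:.2f} GB")

report_gpu_memory("before generation")
# ... model.generate(...) ...
report_gpu_memory("after generation")
```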
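
Step 8 mentions FlashAttention and quantization; here is a sketch of how these are typically enabled in transformers (argument names vary by library version, flash-attn and bitsandbytes must be installed, and the checkpoint is assumed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights cut VRAM roughly 4x vs fp16
    bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
)

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",        # assumed checkpoint
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # memory-efficient attention for long contexts
    device_map="auto",
)
```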
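
Step 9's RoPE scaling can be sketched like this: a base Llama 2 checkpoint trained at 4k positions is loaded with a linear scaling factor of 8 to cover roughly 32k positions. The checkpoint and factor are assumptions, and without fine-tuning at the longer length quality usually degrades:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",                     # assumed base checkpoint (gated on Hugging Face)
    rope_scaling={"type": "linear", "factor": 8.0},  # 4k trained positions * 8 ~= 32k
    torch_dtype="auto",
    device_map="auto",
)
```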
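
Finally, for step 10, a sketch of a system prompt in the Llama 2 chat format (the wording is an example; chat checkpoints can also build this via tokenizer.apply_chat_template):

```python
system_prompt = (
    "You are a careful assistant. Summarize the transcript accurately "
    "and do not invent facts that are not in the text."
)
transcript = "..."  # the meeting transcript goes here

# Llama 2 chat format: the system prompt is wrapped in <<SYS>> inside the first [INST] block.
prompt = (
    f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    f"Summarize the following transcript:\n\n{transcript} [/INST]"
)
```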

By following these steps, you can effectively run Llama 2 with a 32k context length and generate accurate summaries using the provided notebook in Google Colab.