MBart Language Translation (English to Tamil/Hindi) using Hugging Face | Python NLP
Introduction
In this tutorial, you'll learn how to use the MBart model from Facebook AI to translate text from English to Tamil and Hindi using the Hugging Face Transformers library. This guide provides a straightforward approach to leveraging advanced natural language processing (NLP) capabilities in Python.
Step 1: Set Up Your Environment
To start, you need to set up your Python environment and install the necessary libraries.
- Install the Hugging Face Transformers library and PyTorch:
pip install transformers torch
- You can also run this tutorial in Google Colab for a more streamlined experience.
Step 2: Import Necessary Libraries
Once your environment is ready, import the required libraries in your Python script or notebook.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
Step 3: Load the MBart Model and Tokenizer
You will need to load the pre-trained MBart-50 model and its corresponding tokenizer from the Hugging Face Model Hub. Because this checkpoint translates out of English, set the tokenizer's source language to en_XX. Note that the mbart-50 checkpoints use the MBart50 tokenizer, not the original MBartTokenizer.
model_name = "facebook/mbart-large-50-one-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)
Step 4: Prepare the Input Text for Translation
Before you can perform the translation, you need to prepare your input text by tokenizing it. You will also need the MBart-50 code of your target language, which is used later when generating the translation.
- For Hindi, use
hi_IN
- For Tamil, use
ta_IN
Here’s how to prepare your text:
input_text = "This is a test sentence."
tokenized_input = tokenizer(input_text, return_tensors="pt", padding=True)
Step 5: Perform the Translation
Now that you have your tokenized input, you can call the model to generate the translation.
translated_tokens = model.generate(**tokenized_input, forced_bos_token_id=tokenizer.lang_code_to_id['hi_IN']) # For Hindi
# OR
translated_tokens = model.generate(**tokenized_input, forced_bos_token_id=tokenizer.lang_code_to_id['ta_IN']) # For Tamil
Step 6: Decode and Display the Translation
After generating the translated tokens, decode them back to a human-readable format.
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(translated_text)
Practical Tips
- Use recent, mutually compatible versions of Python, PyTorch, and Transformers to avoid dependency issues.
- Use a GPU if available, as translation models can be computationally intensive.
- Experiment with different input sentences to see how the model performs across various contexts.
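For the GPU tip above, a minimal device-selection sketch: the resulting device object can be passed to model.to(device), and the tokenized batch from Step 4 can be moved the same way with its own .to(device) method.

```python
import torch

# Prefer a CUDA GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
# Example usage with the objects from the earlier steps:
#   model = model.to(device)
#   tokenized_input = tokenized_input.to(device)
```

The code still runs unchanged on a CPU-only machine; it just falls back to the slower device.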
Conclusion
In this tutorial, you learned how to set up the MBart model for translating text from English to Tamil and Hindi using Python. With just a few lines of code, you can leverage advanced machine translation capabilities. Explore further by trying out different sentences and integrating this functionality into your applications. Happy coding!