MBart Language Translation (English to Tamil/Hindi) using Hugging Face | Python NLP


Introduction

In this tutorial, you'll learn how to use the MBart model from Facebook AI to translate text from English to Tamil and Hindi using the Hugging Face Transformers library. This guide provides a straightforward approach to leverage advanced natural language processing (NLP) capabilities in Python.

Step 1: Set Up Your Environment

To start, you need to set up your Python environment and install the necessary libraries.

  • Install the Hugging Face Transformers library, PyTorch, and SentencePiece (used by the mBART tokenizer):

    pip install transformers torch sentencepiece
    
  • You can also run this tutorial in Google Colab for a more streamlined experience; a quick sanity check for the installation is sketched just after this list.
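
If you want to confirm the installation, the short check below (a minimal sketch) prints the installed transformers version and reports whether PyTorch can see a GPU:

import torch
import transformers

# Print the installed library version; any reasonably recent release should work.
print(transformers.__version__)

# True if a CUDA-capable GPU is visible to PyTorch, False otherwise.
print(torch.cuda.is_available())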

Step 2: Import Necessary Libraries

Once your environment is ready, import the required libraries in your Python script or notebook.

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

Step 3: Load the MBart Model and Tokenizer

You will need to load the pre-trained MBart model and its corresponding tokenizer from the Hugging Face Model Hub.

model_name = "facebook/mbart-large-50-one-to-many-mmt"

# The mBART-50 checkpoints use their own tokenizer class; set the source
# language to English, since this model translates from English into other languages.
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)
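
If you prefer not to track the exact tokenizer class, the Auto classes resolve it from the checkpoint name. This is an optional variation, not something the rest of the tutorial depends on:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# AutoTokenizer picks the matching mBART-50 tokenizer for this checkpoint.
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")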

Step 4: Prepare the Input Text for Translation

Before you can perform the translation, you need to tokenize your input text. The target language itself is selected at generation time (Step 5) via its mBART-50 language code:

  • For Hindi, use the code hi_IN
  • For Tamil, use the code ta_IN

Here’s how to prepare your text:

# Tokenize the English sentence into PyTorch tensors.
input_text = "This is a test sentence."
tokenized_input = tokenizer(input_text, return_tensors="pt", padding=True)

Step 5: Perform the Translation

Now that you have your tokenized input, you can call the model to generate the translation.

# Force the target language code as the first generated token.
translated_tokens = model.generate(**tokenized_input, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])  # For Hindi
# OR
translated_tokens = model.generate(**tokenized_input, forced_bos_token_id=tokenizer.lang_code_to_id["ta_IN"])  # For Tamil
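
generate also accepts the usual decoding options; beam search and a length cap, for example, often improve translation quality at the cost of speed. A minimal sketch with illustrative (not tuned) values:

translated_tokens = model.generate(
    **tokenized_input,
    forced_bos_token_id=tokenizer.lang_code_to_id["ta_IN"],  # Tamil
    num_beams=5,      # beam search instead of greedy decoding
    max_length=128,   # cap the length of the generated output
)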

Step 6: Decode and Display the Translation

After generating the translated tokens, decode them back to a human-readable format.

# Convert the generated token ids back into a string, dropping special tokens.
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)
print(translated_text)
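
If you translate several sentences at once, pass them as a list (padding=True aligns their lengths) and use batch_decode to get one string per input. A short sketch reusing the tokenizer and model loaded earlier:

sentences = ["This is a test sentence.", "How are you today?"]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**batch, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])

# batch_decode decodes every sequence in the batch in one call.
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
for src, tgt in zip(sentences, translations):
    print(src, "->", tgt)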

Practical Tips

  • Use a recent Python 3 release with up-to-date transformers and torch installations to avoid compatibility issues.
  • Use a GPU if available, as translation models are computationally intensive; a sketch of moving the model onto a GPU follows this list.
  • Experiment with different input sentences to see how the model performs across various contexts.
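
To expand on the GPU tip above, here is a minimal sketch of moving the model and the tokenized inputs onto a CUDA device when one is available; the rest of the code stays the same:

import torch

# Fall back to the CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = model.to(device)
tokenized_input = tokenizer(input_text, return_tensors="pt", padding=True).to(device)

translated_tokens = model.generate(
    **tokenized_input,
    forced_bos_token_id=tokenizer.lang_code_to_id["ta_IN"],  # Tamil
)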

Conclusion

In this tutorial, you learned how to set up the MBart model for translating text from English to Tamil and Hindi using Python. With just a few lines of code, you can leverage advanced machine translation capabilities. Explore further by trying out different sentences and integrating this functionality into your applications. Happy coding!