hindi2vec

State-of-the-Art Language Modeling and Text Classification in Hindi

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared to 40.68 for English (lower is better).

To the best of my knowledge, as of September 18, 2018.

Update: nlp-for-hindi uses sentencepiece instead of the word-based spaCy tokenizer that I use. On those tokens, the measured perplexity for that LM is ~35. Because the tokenization differs, that figure is not directly comparable to the 46.81 reported above. I encourage you to check that work out as well.
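For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to the per-token loss, and why the token count matters when comparing models that tokenize differently. The function names and the toy numbers are illustrative assumptions, not code or measurements from this repository.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def renormalized_perplexity(token_nlls, n_units):
    """Perplexity per reference unit (e.g. per word) rather than per model token.

    The total negative log-likelihood of the text stays the same; only the
    number of units it is averaged over changes.
    """
    return math.exp(sum(token_nlls) / n_units)

# Toy data (hypothetical): a model that assigns probability 1/4 to each of
# 10 tokens has a per-token perplexity of about 4.
toy_nlls = [math.log(4)] * 10
print(perplexity(toy_nlls))

# If those 10 subword tokens spell out only 5 words, the per-word perplexity
# is higher, so per-token numbers from different tokenizers are not comparable.
print(renormalized_perplexity(toy_nlls, n_units=5))
```

A word-level model and a subword model can therefore only be compared fairly after normalizing both by the same unit count on the same text.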

Downloads

Pretrained Language Models that you can use in your classification for transfer learning
EXCLUSIVE: BBC Hindi data of 4335 documents for text classification and text summarization. Release Notes
Raw data for the language model shared above: Hindi Wikipedia, with about 21k unique tokens at minfreq = 50 (tokens seen fewer than 50 times are dropped from the vocabulary)
- Wikipedia Processed Data - please use this to train your model
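The minfreq cutoff mentioned above can be sketched as follows. This is a minimal illustration using only the standard library, not the code used to produce the released data; the function names and the `_unk_` / `_pad_` special tokens are assumptions made for the example.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_freq=50, specials=("_unk_", "_pad_")):
    """Build index<->token maps, keeping tokens seen at least min_freq times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Most frequent tokens first; special tokens always occupy the first ids.
    kept = [tok for tok, c in counts.most_common()
            if c >= min_freq and tok not in specials]
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(doc, stoi, unk="_unk_"):
    """Map tokens to ids, sending out-of-vocabulary tokens to the unk id."""
    unk_id = stoi[unk]
    return [stoi.get(tok, unk_id) for tok in doc]

# Toy usage with a tiny corpus and a low cutoff (hypothetical data).
docs = [["यह", "एक", "यह"], ["यह", "वाक्य", "एक"]]
itos, stoi = build_vocab(docs, min_freq=2)
print(itos)
print(numericalize(["यह", "वाक्य", "एक"], stoi))
```

Raising min_freq shrinks the vocabulary and maps more rare words to the unknown token, which also affects the perplexity a model reports.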

TODO

Language modeling based on wikipedia dump
Release Language Models: Hindi Language Model
Create Text classification Datasets: BBC Hindi
Benchmark text classification with FastText

Idea Dump

Change the custom head to perform transliteration instead of classification: Hindi script (Devanagari) to English script (Roman)
MTL tasks for training and inference using custom heads
Text to Speech - using datasets from news recordings or Hindi subtitles of dubbed movies

FastAI Installation

This version of the notebook uses v0.7 of the fastai library, as used in their Part 2 v2 course in Summer 2018. The best way to install it is via conda, as mentioned here

Special thanks to Jeremy, Rachel and other contributors to fastai. This work reproduces their English-language work for Hindi. Thanks to @cstorm125 for thai2vec, which inspired this work.