Virtually all mobile phones support autocompletion nowadays. This might look like a casual feature but when you start thinking about how to implement this, something I’ve been wanting to know for a very long time, it can be quite complex. If you know a little bit about deep learning, it’s not that hard to implement autocomplete yourself using the
Before we start, make sure to copy the notebook on Google Colab to your account so you can follow along. Don’t forget to set the accelator to GPU.
Introduction to Natural Language Processing
Natural language processing, or NLP for short, is a subfield or machine learning that focuses on text processing and understanding. As I described in an earlier post, natural language processing increases the rate at which humans and computers can communicate.
They key step in natural language processing is tokenization. This is a preprocessing step that turns words and phases into tokens which can be easier understood by a computer. A few things tokenization does are
- Removing HTML code;
- Dealing with punctuation;
- Splitting words as “aren’t” into “are n’t” to make to make the presence of negation more obvious;
- Substituting names with a special token.
Other preprocessing steps are:
- Stemming: turning words with the same meaning into a shared abbreviated form. “Works, worked” -> “work” for example. In Python, this can be done using
- Lemmatization: assigning a token to a group of words which together have a meaning. “Natural language processing” and “machine learning” are examples of groups a lemmatization algorithm would assign a single token to.
Unfortunately, there is no way to use those in
fastai yet. I think it would be great to make a contribution to the library adding this feature after I get more familiar with the framework.
Getting the data
Because this model will predict the next word in general texts, not just reviews, I used the [wiki text dataset] to train my model on in contrast to the IMDB dataset used in the lecture.
As this dataset is part of the fastai framework, downloading and loading is easy:
path = untar_data(URLs.WIKITEXT_TINY) path.ls() data_lm = (TextList.from_csv(path, 'train.csv') .split_by_rand_pct(0.1) # select 10% for validation .label_for_lm() # prepare the data to be used for NLP .databunch(bs=bs)) data_lm.save('data_lm.pkl')
Because the tokenization process might take a while, we usually save the data so we only have to perform this step once. You could load the data from disk like so:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)
What the dataset looks like:
Note the “xxmaij” tokens. These are tokens that were not recognized by the tokenizer and therefore updated to this placeholder. The
language_model_learnerwill interpret these as names.
After the preprocessing step, we create a
language_model_learner class. Similar to the vision models in previous posts, this model also uses transfer learning.
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)
Choosing the learning rate is almost always done using the
lr_find() method of the model. In this case, the graph looks like this and it seems that would be an appropriate learning rate.
After training for 6 epochs, you should expect to see a validation accuracy of about . This might seem very low compared to the image classifier, but it’s quite good considering the model is able to correctly predict the next word out of all 60 000 words present in the token set a third of the time.
Trying it out!
This is the most fun part of this project, having the model predict things you are going to say. Similar to the api of
fastai.vision we can just call
.predict to get the models prediction. Because
learn is a
language_model_learner, you can pass
temperature to this call. Citing from the documentation: “Lowering temperature will make the texts less randomized.” This gives a vague sense of what it does, but in order to really understand it, you should try out some values.
text = "<Enter some text here>" learn.predict(text, n_words = 1, temperature = 0.75)
You should definitely try out some sentences. You’ll find the model is quite good most of the time but sometimes it will also fail hilariously or in an ungraspable manner.
n_words equal to 1 because that’s what the auto prediction feature in the keyboard does. Higher values can be used though; the model is able to write entire sentences.
Where to go from here?
If you want to learn more about natural language processing, here are some links you might find interesting:
- My post about natural language processing in Swift;
- Stanford’s course on natural language processing;
- The Natural Language Processing Toolkit: a Python framework for NLP.
A huge thanks to Sam Miserendino for proofreading this post!