Introduction to Word Embeddings
Overview
Word embeddings represent the meaning of words as vectors in a multidimensional space. They are typically learned by performing some surrogate task, such as predicting neighboring words (see the models below), and can then be extracted and reused for other tasks. See Figure 1 below for the general architecture.
Skip-gram Model
Overview
The idea is that words with similar meanings tend to appear surrounded by similar words. These surrounding words are called the context words. The skip-gram model uses a target (aka center) word to predict its context words. The context is a fixed-size window of size 2m, meaning the context is the m words on either side of the target word. So given the target word \(w_t\), we want to predict \(w_{t-m}\) through \(w_{t+m}\) (excluding \(w_t\) itself). That is, we want to predict the 2m context words.
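To make the windowing concrete, here is a minimal sketch of extracting (target, context) training pairs from a tokenized sentence. The function name and the toy sentence are illustrative, not from the original text:

```python
# Sketch: extract (target, context) pairs with a window of m words on each side.
def skipgram_pairs(tokens, m=2):
    """Yield (target, context) pairs for every position in the sentence."""
    pairs = []
    for t, target in enumerate(tokens):
        # Context is the m words on either side, clipped at the sentence edges.
        for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if j != t:  # exclude the target word itself
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox", "jumps"], m=2)
```

Note that near sentence boundaries the window is clipped, so edge words contribute fewer than 2m pairs.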
Training
The input is thus the index of the target word in the vocab list, and the labels are the indices of the context words in the vocab list. The training for this is computed by first getting the probabilities of all words in the vocab list given the input word:
\[ P(w_j | w_t, \theta) = \frac{\exp(\mathbf{u}_{w_j}^\top \mathbf{v}_{w_t})}{\sum_{i=1}^{V} \exp(\mathbf{u}_{w_i}^\top \mathbf{v}_{w_t})} \tag{1}\label{1}\]where \(\mathbf{v}_w\) is a word's input (target) vector and \(\mathbf{u}_w\) is its output (context) vector. Then we compute the average over the batch of the negative log of the probabilities of the context words:
\[ J(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \sum_{j \in \text{context}(w_t)} \log P(w_j | w_t, \theta) \tag{2}\label{2}\]Here T is the number of target words (which could be considered the batch size), \(\theta\) is the model parameters, \(w_t\) is the target word, and \(w_j\) is one of the 2m context words.
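Equations \(\ref{1}\) and \(\ref{2}\) can be sketched numerically as follows. The embedding matrices, sizes, and random initialization here are illustrative assumptions, not part of the original model description:

```python
import numpy as np

# Toy setup: V_in holds input (target) vectors, U_out holds output (context)
# vectors, one row per vocabulary word.
rng = np.random.default_rng(0)
V, d = 10, 4                        # vocab size, embedding dimension
V_in = rng.normal(size=(V, d))      # input embeddings
U_out = rng.normal(size=(V, d))     # output embeddings

def skipgram_loss(targets, contexts):
    """Average negative log-probability of context words given their targets."""
    total = 0.0
    for t, ctx in zip(targets, contexts):
        scores = U_out @ V_in[t]                       # dot products with every vocab word
        probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocab, eq (1)
        total += -sum(np.log(probs[c]) for c in ctx)   # sum over the context window
    return total / len(targets)                        # average over T targets, eq (2)

loss = skipgram_loss(targets=[3, 7], contexts=[[2, 4], [6, 8]])
```

Note the \(O(V)\) work in the denominator per target word; this is exactly the cost that negative sampling (next section) avoids.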
Negative Sampling
There is one caveat to the above explanation: since the vocab list is usually large (typically hundreds of thousands to millions of words), computing the full denominator of the softmax (eq \(\ref{1}\)) is prohibitively expensive. To address this, a technique called negative sampling is often employed. Negative sampling draws a small number of negative examples (words that are not in the context) and updates the model based only on the true context word and these sampled negatives, rather than the full vocabulary. This approximation makes computing the loss during training far more scalable.
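The sampling idea can be sketched as follows before the formal derivation. The embedding matrices are illustrative, and for simplicity negatives are drawn uniformly (word2vec itself uses a smoothed unigram noise distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10, 4, 5               # vocab size, embedding dim, number of negatives
V_in = rng.normal(size=(V, d))   # input (target) embeddings
U_out = rng.normal(size=(V, d))  # output (context) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target, context):
    # Draw k noise words; a sketch, so we don't bother excluding the true context.
    negatives = rng.integers(0, V, size=k)
    pos = -np.log(sigmoid(V_in[target] @ U_out[context]))             # true pair -> 1
    neg = -np.log(sigmoid(-(U_out[negatives] @ V_in[target]))).sum()  # noise pairs -> 0
    return pos + neg

loss = neg_sampling_loss(target=3, context=7)
```

Only k + 1 dot products are computed per training pair, versus V for the full softmax.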
More concretely, with negative sampling we treat the problem as binary classification: given the target word and a possible context word, the output should be 1 if that word really is a context word and 0 if it is not. This means that instead of computing scores against the full d x V output matrix for each example, we only need dot products with the output vectors of the words involved. The probability that a candidate context word is a true context word, given the target word, is the sigmoid of their dot product:
\[ P(D=1 | w_t, w_c) = \sigma(\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c}) = \frac{1}{1 + e^{-\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c}}} \tag{3}\label{3}\]And the probability that a negative sample is not a context word is the sigmoid of the negated dot product:
\[ P(D=0 | w_t, w_{\text{neg}}) = \sigma(-\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_{\text{neg}}}) = \frac{1}{1 + e^{\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_{\text{neg}}}}} \tag{4}\label{4}\]Thus, the loss for a training example with 1 positive context word and \(k\) negative samples is:
\[ \text{Loss} = -\log \sigma(\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c}) - \sum_{w_{\text{neg}} \in \mathcal{N}} \log \sigma(-\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_{\text{neg}}}) \tag{5}\label{5}\]Where \( \mathcal{N} \) represents the set of \(k\) negative samples drawn from the noise distribution. Note that typically \(k\) ranges from about 5 to 20. The loss over a batch is thus:
\[ \text{BatchLoss} = \frac{1}{N} \sum_{i=1}^{N} \left( -\log \sigma(\mathbf{v}_{w_{t,i}} \cdot \mathbf{u}_{w_{c,i}}) - \sum_{w_{\text{neg}} \in \mathcal{N}_i} \log \sigma(-\mathbf{v}_{w_{t,i}} \cdot \mathbf{u}_{w_{\text{neg}}}) \right) \tag{6}\label{6}\]where \(\mathcal{N}_i\) is the set of negatives drawn for example \(i\).
Continuous Bag of Words Model (CBOW)
Overview
CBOW is nearly identical to the skip-gram model, except that instead of predicting the context (surrounding) words from the target (center) word, we do the opposite: predict the target word from the context words.
Training
Training is more straightforward for CBOW:
- Sum (or average) the embeddings of all the context words together
- Multiply this by a d x V matrix, where d is the embedding dimension and V is the vocab size
- Take the softmax of the resulting size-V vector
- Use cross-entropy loss with the target word
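The steps above can be sketched as follows; the matrix sizes and random initialization are toy assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4
E = rng.normal(size=(V, d))   # input embedding matrix, one row per word
W = rng.normal(size=(d, V))   # output weight matrix (d x V)

def cbow_forward(context_ids):
    h = E[context_ids].sum(axis=0)       # 1) sum the context embeddings
    scores = h @ W                       # 2) multiply by the d x V matrix
    exp = np.exp(scores - scores.max())  # 3) softmax (shifted for numerical stability)
    return exp / exp.sum()

def cross_entropy(probs, target_id):
    return -np.log(probs[target_id])     # 4) cross-entropy with the target word

probs = cbow_forward([1, 2, 4, 5])
loss = cross_entropy(probs, target_id=3)
```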
Since the softmax is a large computation due to the large vocab size, we can use negative sampling just like with skip-gram. That is, instead of outputting a value for each word in the vocab list, we output a single value for a pair of words. If the pair contains the true target word, the sigmoid of the output should be close to 1, while if it contains a fake (i.e. negatively sampled) target word, the sigmoid of the output should be close to 0.
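A minimal sketch of this pair-scoring variant, under the same toy assumptions as before (illustrative matrices, uniform noise distribution rather than word2vec's smoothed unigram):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10, 4, 5
E = rng.normal(size=(V, d))   # input (context) embeddings
U = rng.normal(size=(V, d))   # output (target) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_neg_loss(context_ids, target_id):
    h = E[context_ids].sum(axis=0)                 # summed context vector
    fakes = rng.integers(0, V, size=k)             # negatively sampled fake targets
    pos = -np.log(sigmoid(h @ U[target_id]))       # true target: push sigmoid toward 1
    neg = -np.log(sigmoid(-(U[fakes] @ h))).sum()  # fake targets: push sigmoid toward 0
    return pos + neg

loss = cbow_neg_loss([1, 2, 4, 5], target_id=3)
```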