note – deep contextualized word representations

Peters et al. - 2018 - Deep contextualized word representations

Introduction

They introduce a new type of deep contextualised word representation that models complex characteristics of word use (e.g., syntax and semantics), as well as how these uses vary across linguistic contexts (i.e., polysemy). Their word vectors are learned functions of the internal states of a deep bidirectional language model (biLM).

Bidirectional language models

Given a sequence of $N$ tokens $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modelling the probability of token $t_k$ given its history $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$

A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$$
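
As a small sanity check on the two decompositions (all numbers invented), the following sketch shows that for a length-2 sequence the forward factorisation $p(t_1)\,p(t_2 \mid t_1)$ and the backward factorisation $p(t_2)\,p(t_1 \mid t_2)$ both recover the same joint probability:

```python
import math

# Toy joint distribution over length-2 sequences (probabilities made up).
joint = {("a", "a"): 0.1, ("a", "b"): 0.3,
         ("b", "a"): 0.4, ("b", "b"): 0.2}

def p1(t1):                      # marginal p(t_1)
    return sum(p for (a, _), p in joint.items() if a == t1)

def p2_given_1(t2, t1):          # forward conditional p(t_2 | t_1)
    return joint[(t1, t2)] / p1(t1)

def p2(t2):                      # marginal p(t_2)
    return sum(p for (_, b), p in joint.items() if b == t2)

def p1_given_2(t1, t2):          # backward conditional p(t_1 | t_2)
    return joint[(t1, t2)] / p2(t2)

seq = ("a", "b")
forward = p1(seq[0]) * p2_given_1(seq[1], seq[0])    # p(t1) p(t2|t1)
backward = p2(seq[1]) * p1_given_2(seq[0], seq[1])   # p(t2) p(t1|t2)
assert math.isclose(forward, backward)
assert math.isclose(forward, joint[seq])
```

Both directions are valid factorisations of the same joint distribution; in practice, of course, two separately parameterised LMs trained on data need not agree numerically.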

This formulation jointly maximises the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

The parameters for both the token representation ($\Theta_x$) and the softmax layer ($\Theta_s$) are shared between the forward and backward directions, while separate parameters are maintained for the LSTMs in each direction ($\overrightarrow{\Theta}_{LSTM}$ and $\overleftarrow{\Theta}_{LSTM}$).
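
A minimal PyTorch sketch of this parameter sharing (names and sizes illustrative, not the paper's implementation; ELMo actually uses a character-CNN token representation and multi-layer LSTMs with residual connections): the embedding and output projection play the roles of the shared $\Theta_x$ and $\Theta_s$, while the two LSTMs hold the direction-specific parameters.

```python
import torch
import torch.nn as nn

class BiLM(nn.Module):
    """Sketch of a biLM: shared embedding/softmax, separate direction LSTMs."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)        # Theta_x (shared)
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # forward params
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # backward params
        self.softmax = nn.Linear(hidden_dim, vocab_size)      # Theta_s (shared)

    def forward(self, tokens):
        # tokens: (batch, N) integer ids
        x = self.embed(tokens)
        # Forward direction: state at position k has seen t_1..t_k,
        # so it is used to predict t_{k+1}.
        h_fwd, _ = self.fwd_lstm(x)
        logits_fwd = self.softmax(h_fwd[:, :-1])              # predicts t_2..t_N
        # Backward direction: run over the reversed sequence, then flip back,
        # so the state at position k has seen t_k..t_N and predicts t_{k-1}.
        h_bwd, _ = self.bwd_lstm(torch.flip(x, dims=[1]))
        h_bwd = torch.flip(h_bwd, dims=[1])
        logits_bwd = self.softmax(h_bwd[:, 1:])               # predicts t_1..t_{N-1}
        return logits_fwd, logits_bwd

def joint_nll(model, tokens):
    """Negative joint log likelihood: forward plus backward cross-entropy."""
    logits_fwd, logits_bwd = model(tokens)
    loss = nn.CrossEntropyLoss(reduction="sum")
    fwd = loss(logits_fwd.reshape(-1, logits_fwd.size(-1)), tokens[:, 1:].reshape(-1))
    bwd = loss(logits_bwd.reshape(-1, logits_bwd.size(-1)), tokens[:, :-1].reshape(-1))
    return fwd + bwd
```

Minimising `joint_nll` corresponds to maximising the joint log likelihood above; because `embed` and `softmax` appear in both terms, their gradients receive signal from both directions, while each LSTM is updated only by its own direction's loss.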