From what I have seen, the (second-order) Kneser-Ney smoothing formula is given in one form or another as

$$P^2_{KN}(w_n \mid w_{n-1}) = \frac{\max\{C(w_{n-1}, w_n) - D,\ 0\}}{\sum_{w'} C(w_{n-1}, w')} + \lambda(w_{n-1})\, P_{cont}(w_n)$$

with the normalization factor $\lambda(w_{n-1})$ given as

$$\lambda(w_{n-1}) = \frac{D}{\sum_{w'} C(w_{n-1}, w')}\, N_{1+}(w_{n-1}\,\bullet)$$

and the continuation probability $P_{cont}(w_n)$ of a word $w_n$ as

$$P_{cont}(w_n) = \frac{N_{1+}(\bullet\, w_n)}{\sum_{w'} N_{1+}(\bullet\, w')}$$

where $N_{1+}(\bullet\, w)$ is the number of contexts $w$ was seen in or, more simply, the number of distinct words $\bullet$ that precede the given word $w$. As far as I understand, the formula can be applied recursively.
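To make this concrete, here is a minimal Python sketch of the second-order formula above; the function name, the absolute discount of 0.75, and the toy corpus are my own choices for illustration, not taken from any particular reference:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):  # discount is a common but arbitrary choice
    """Second-order Kneser-Ney estimates built from a plain token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()    # sum_{w'} C(w_{n-1}, w')
    followers = defaultdict(set)  # N_{1+}(w_{n-1} .): distinct words after each context
    preceders = defaultdict(set)  # N_{1+}(. w): distinct words before each word
    for (prev, word), count in bigrams.items():
        context_totals[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    total_bigram_types = len(bigrams)  # sum_{w'} N_{1+}(. w')

    def p_cont(word):
        # continuation probability: distinct left contexts / distinct bigram types
        return len(preceders[word]) / total_bigram_types

    def p_kn(word, prev):
        c_total = context_totals[prev]
        if c_total == 0:
            return 0.0  # unseen context: the zero-denominator problem raised below
        lam = discount * len(followers[prev]) / c_total
        return max(bigrams[(prev, word)] - discount, 0.0) / c_total + lam * p_cont(word)

    return p_kn

p_kn = kneser_ney_bigram("the cat sat on the mat and the cat ran".split())
print(p_kn("cat", "the"))  # seen bigram: discounted count plus continuation mass
```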
Now this handles known words in unknown contexts nicely for different n-gram lengths, but what it does not explain is what to do when there are out-of-dictionary words. I tried to follow this example, which states that in the recursion step for unigrams, $P^0_{KN}(w) = \frac{1}{V}$. The document then uses this, citing Chen and Goodman, to justify the above formula as $P^1_{KN}(w) = P_{cont}(w)$.

However, I do not see how this works in the presence of an unknown word. In these cases $P_{cont}(\text{unknown}) = 0$ since, obviously, the unknown word continues nothing with respect to the training set. Likewise, the n-gram count will be $C(w_{n-1}, \text{unknown}) = 0$.

Furthermore, the whole $\sum_{w'} C(w_{n-1}, w')$ term might be zero if a sequence of unknown words, say, a trigram of OOV words, is encountered.
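Spelling that out with the definitions above, for an unknown word and any seen context $w_{n-1}$ both terms vanish:

$$P^2_{KN}(\text{unknown} \mid w_{n-1}) = \frac{\max\{0 - D,\ 0\}}{\sum_{w'} C(w_{n-1}, w')} + \lambda(w_{n-1}) \cdot \frac{0}{\sum_{w'} N_{1+}(\bullet\, w')} = 0$$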
What am I missing?
Answers:
Dan Jurafsky has published a chapter on N-Gram models which talks a bit about this problem:

$$P_{KN}(w) = \frac{\max(c_{KN}(w) - d,\ 0)}{\sum_{w'} c_{KN}(w')} + \lambda(\epsilon)\,\frac{1}{|V|}$$

I've tried to find out what this means, but am not sure if $\epsilon$ just means $\lim_{x \to 0} x$. If this is the case, and you assume that as the count goes to zero, maybe $\lambda(\epsilon)$ goes to $d$ according to the definition of $\lambda$ above, then the unknown word just gets assigned a fraction of the discount, i.e.:

$$P_{KN}(\text{unknown}) \approx \frac{d}{|V|}$$
I'm not confident about this answer at all, but wanted to get it out there in case it sparks some more thoughts.
Update: Digging around some more, it seems like $\epsilon$ is typically used to denote the empty string (""), but it's still not clear how this affects the calculation of $\lambda$. $\frac{d}{|V|}$ is still my best guess.
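For what it's worth, substituting an unseen word (so $c_{KN}(\text{unknown}) = 0$) into the chapter's unigram formula makes that guess explicit:

$$P_{KN}(\text{unknown}) = \frac{\max(0 - d,\ 0)}{\sum_{w'} c_{KN}(w')} + \lambda(\epsilon)\,\frac{1}{|V|} = \lambda(\epsilon)\,\frac{1}{|V|}$$

which equals $\frac{d}{|V|}$ exactly when $\lambda(\epsilon) = d$.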
There are many ways to train a model with <UNK>, though Jurafsky suggests choosing those words that occur very few times in training and simply changing them to <UNK>. Then simply train the probabilities as you normally would.

See this video starting at 3:40: https://class.coursera.org/nlp/lecture/19
Another approach is to simply consider a word as <UNK> the very first time it is seen in training, though from my experience this approach assigns too much of the probability mass to <UNK>.
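As a sketch of the first approach; the threshold of 2 and the function names are arbitrary choices of mine, not something the lecture prescribes:

```python
from collections import Counter

UNK = "<UNK>"

def replace_rare_with_unk(train_tokens, min_count=2):  # threshold chosen arbitrarily
    """Map every training word seen fewer than min_count times to <UNK>."""
    counts = Counter(train_tokens)
    return [w if counts[w] >= min_count else UNK for w in train_tokens]

def map_oov(tokens, vocab):
    """At test time, map anything outside the training vocabulary to <UNK>."""
    return [w if w in vocab else UNK for w in tokens]

train = replace_rare_with_unk("a a a b b c".split())
vocab = set(train)                      # {"a", "b", "<UNK>"}
test = map_oov("a c d".split(), vocab)  # ["a", "<UNK>", "<UNK>"]
print(train, test)
```

The n-gram probabilities (Kneser-Ney or otherwise) are then trained on the <UNK>-mapped corpus as usual.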
Just a few thoughts; I am far from being an expert on the matter, so I do not intend to provide an answer to the question but to analyze it.
The simple thing to do would be to calculate $\lambda(\epsilon)$ by forcing the sum to be one. This is reasonable since the empty string is never seen in the training set (nothing can be predicted out of nothing) and the sum has to be one.

If this is the case, $\lambda(\epsilon)$ can be estimated by:

$$\lambda(\epsilon) = 1 - \sum_{w} \frac{\max(c_{KN}(w) - d,\ 0)}{\sum_{w'} c_{KN}(w')}$$
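A minimal Python sketch of that normalization idea, assuming the unigram counts are the continuation counts $c_{KN}(w) = N_{1+}(\bullet\, w)$ and using a discount of 0.75 (both choices mine, for illustration):

```python
def lambda_epsilon(ckn, discount=0.75):  # discount chosen arbitrarily
    """lambda(eps) = 1 - discounted unigram mass, so the distribution sums to one."""
    total = sum(ckn.values())
    kept = sum(max(c - discount, 0.0) for c in ckn.values()) / total
    return 1.0 - kept

def p_kn_unigram(word, ckn, vocab_size, discount=0.75):
    """Unigram Kneser-Ney with the uniform 1/|V| back-off; OOV words get lambda(eps)/|V|."""
    total = sum(ckn.values())
    lam = lambda_epsilon(ckn, discount)
    return max(ckn.get(word, 0) - discount, 0.0) / total + lam / vocab_size

ckn = {"the": 5, "cat": 2, "sat": 1}           # toy continuation counts
print(p_kn_unigram("dog", ckn, vocab_size=4))  # unknown word: lambda(eps) / |V|
```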
Another option would be to estimate the <unk> probability with the methods mentioned by Randy and treat it as a regular token.

I think this step is made to ensure that the formulas are consistent. Notice that the term $\lambda(\epsilon)\frac{1}{|V|}$ does not depend on the context and assigns a fixed value to the probability of every token. If you want to predict the next word you can leave this term out; on the other hand, if you want to compare the Kneser-Ney probabilities assigned to a token under two or more different contexts, you might want to keep it.
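To see why the constant can be dropped for prediction, note that for two candidate words under the same unigram distribution it cancels in the comparison:

$$P_{KN}(w_i) - P_{KN}(w_j) = \frac{\max(c_{KN}(w_i) - d,\ 0) - \max(c_{KN}(w_j) - d,\ 0)}{\sum_{w'} c_{KN}(w')}$$

so the ranking of candidates is unchanged, while the absolute probabilities (e.g. for comparing contexts or computing perplexity) still depend on it.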