From what I have seen, the (second-order) Kneser-Ney smoothing formula is given in one form or another as

$$P^2_{KN}(w_n \mid w_{n-1}) = \frac{\max\{C(w_{n-1}, w_n) - D,\ 0\}}{\sum_{w'} C(w_{n-1}, w')} + \lambda(w_{n-1})\, P_{cont}(w_n)$$

with the normalization factor $\lambda(w_{n-1})$ given as

$$\lambda(w_{n-1}) = \frac{D}{\sum_{w'} C(w_{n-1}, w')}\, N_{1+}(w_{n-1}\,\bullet)$$

and the continuation probability $P_{cont}(w_n)$ of a word $w_n$ as

$$P_{cont}(w_n) = \frac{N_{1+}(\bullet\, w_n)}{\sum_{w'} N_{1+}(\bullet\, w')}$$

where $N_{1+}(\bullet\, w)$ is the number of contexts $w$ was seen in or, more simply, the number of distinct words $\bullet$ that precede the given word $w$. As far as I understand, the formula can be applied recursively.
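To make this concrete, here is a minimal Python sketch of the second-order formula above; the function name, the absolute discount of 0.75, and the toy corpus are my own choices for illustration, not taken from any particular reference:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):  # discount is a common but arbitrary choice
    """Second-order Kneser-Ney estimates built from a plain token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_totals = Counter()    # sum_{w'} C(w_{n-1}, w')
    followers = defaultdict(set)  # N_{1+}(w_{n-1} .): distinct words after each context
    preceders = defaultdict(set)  # N_{1+}(. w): distinct words before each word
    for (prev, word), count in bigrams.items():
        context_totals[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    total_bigram_types = len(bigrams)  # sum_{w'} N_{1+}(. w')

    def p_cont(word):
        # continuation probability: distinct left contexts / distinct bigram types
        return len(preceders[word]) / total_bigram_types

    def p_kn(word, prev):
        c_total = context_totals[prev]
        if c_total == 0:
            return 0.0  # unseen context: the zero-denominator problem raised below
        lam = discount * len(followers[prev]) / c_total
        return max(bigrams[(prev, word)] - discount, 0.0) / c_total + lam * p_cont(word)

    return p_kn

p_kn = kneser_ney_bigram("the cat sat on the mat and the cat ran".split())
print(p_kn("cat", "the"))  # seen bigram: discounted count plus continuation mass
```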
Now this handles known words in unknown contexts nicely for different n-gram lengths, but what it does not explain is what to do when there are out-of-dictionary words. I tried to follow this example, which states that in the recursion step for unigrams, $P^0_{KN}(w) = \frac{1}{V}$. The document then uses this, citing Chen and Goodman, to justify the above formula as $P^1_{KN}(w) = P_{cont}(w)$.

However, I do not see how this works in the presence of an unknown word. In these cases $P_{cont}(\text{unknown}) = 0$ since, obviously, the unknown word continues nothing with respect to the training set. Likewise, the n-gram count will be $C(w_{n-1}, \text{unknown}) = 0$.

Furthermore, the whole $\sum_{w'} C(w_{n-1}, w')$ term might be zero if a sequence of unknown words, say, a trigram of OOV words, is encountered.
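Spelling that out with the definitions above, for an unknown word and any seen context $w_{n-1}$ both terms vanish:

$$P^2_{KN}(\text{unknown} \mid w_{n-1}) = \frac{\max\{0 - D,\ 0\}}{\sum_{w'} C(w_{n-1}, w')} + \lambda(w_{n-1}) \cdot \frac{0}{\sum_{w'} N_{1+}(\bullet\, w')} = 0$$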
What am I missing?
Answers:
Dan Jurafsky has published a chapter on N-Gram models which talks a bit about this problem:

$$P_{KN}(w) = \frac{\max(c_{KN}(w) - d,\ 0)}{\sum_{w'} c_{KN}(w')} + \lambda(\epsilon)\,\frac{1}{|V|}$$

I've tried to find out what this means, but am not sure if $\epsilon$ just means $\lim_{x \to 0} x$. If this is the case, and you assume that as the count goes to zero, maybe $\lambda(\epsilon)$ goes to $d$ according to the definition of $\lambda$ above, then the unknown word just gets assigned a fraction of the discount, i.e.:

$$P_{KN}(\text{unknown}) \approx \frac{d}{|V|}$$
I'm not confident about this answer at all, but wanted to get it out there in case it sparks some more thoughts.
Update: Digging around some more, it seems like $\epsilon$ is typically used to denote the empty string (""), but it's still not clear how this affects the calculation of $\lambda$. $\frac{d}{|V|}$ is still my best guess.
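For what it's worth, substituting an unseen word (so $c_{KN}(\text{unknown}) = 0$) into the chapter's unigram formula makes that guess explicit:

$$P_{KN}(\text{unknown}) = \frac{\max(0 - d,\ 0)}{\sum_{w'} c_{KN}(w')} + \lambda(\epsilon)\,\frac{1}{|V|} = \lambda(\epsilon)\,\frac{1}{|V|}$$

which equals $\frac{d}{|V|}$ exactly when $\lambda(\epsilon) = d$.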
There are many ways to train a model with <UNK>, though Jurafsky suggests choosing those words that occur very few times in training and simply changing them to <UNK>. Then simply train the probabilities as you normally would.

See this video starting at 3:40: https://class.coursera.org/nlp/lecture/19
Another approach is to simply consider a word as <UNK> the very first time it is seen in training, though from my experience this approach assigns too much of the probability mass to <UNK>.
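As a sketch of the first approach; the threshold of 2 and the function names are arbitrary choices of mine, not something the lecture prescribes:

```python
from collections import Counter

UNK = "<UNK>"

def replace_rare_with_unk(train_tokens, min_count=2):  # threshold chosen arbitrarily
    """Map every training word seen fewer than min_count times to <UNK>."""
    counts = Counter(train_tokens)
    return [w if counts[w] >= min_count else UNK for w in train_tokens]

def map_oov(tokens, vocab):
    """At test time, map anything outside the training vocabulary to <UNK>."""
    return [w if w in vocab else UNK for w in tokens]

train = replace_rare_with_unk("a a a b b c".split())
vocab = set(train)                      # {"a", "b", "<UNK>"}
test = map_oov("a c d".split(), vocab)  # ["a", "<UNK>", "<UNK>"]
print(train, test)
```

The n-gram probabilities (Kneser-Ney or otherwise) are then trained on the <UNK>-mapped corpus as usual.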
Just a few thoughts; I am far from being an expert on the matter, so I do not intend to provide an answer to the question but to analyze it.
The simple thing to do would be to calculate $\lambda(\epsilon)$ by forcing the sum to be one. This is reasonable since the empty string is never seen in the training set (nothing can be predicted out of nothing) and the sum has to be one.

If this is the case, $\lambda(\epsilon)$ can be estimated by:

$$\lambda(\epsilon) = 1 - \sum_{w} \frac{\max(c_{KN}(w) - d,\ 0)}{\sum_{w'} c_{KN}(w')}$$
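A minimal Python sketch of that normalization idea, assuming the unigram counts are the continuation counts $c_{KN}(w) = N_{1+}(\bullet\, w)$ and using a discount of 0.75 (both choices mine, for illustration):

```python
def lambda_epsilon(ckn, discount=0.75):  # discount chosen arbitrarily
    """lambda(eps) = 1 - discounted unigram mass, so the distribution sums to one."""
    total = sum(ckn.values())
    kept = sum(max(c - discount, 0.0) for c in ckn.values()) / total
    return 1.0 - kept

def p_kn_unigram(word, ckn, vocab_size, discount=0.75):
    """Unigram Kneser-Ney with the uniform 1/|V| back-off; OOV words get lambda(eps)/|V|."""
    total = sum(ckn.values())
    lam = lambda_epsilon(ckn, discount)
    return max(ckn.get(word, 0) - discount, 0.0) / total + lam / vocab_size

ckn = {"the": 5, "cat": 2, "sat": 1}           # toy continuation counts
print(p_kn_unigram("dog", ckn, vocab_size=4))  # unknown word: lambda(eps) / |V|
```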
Another option would be to estimate the <unk> probability with the methods mentioned by Randy and treat it as a regular token.

I think this step is made to ensure that the formulas are consistent. Notice that the term $\lambda(\epsilon)\frac{1}{|V|}$ does not depend on the context and assigns a fixed value to the probability of every token. If you want to predict the next word you can leave this term out; on the other hand, if you want to compare the Kneser-Ney probabilities assigned to a token under two or more different contexts, you might want to keep it.
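To see why the constant can be dropped for prediction, note that for two candidate words under the same unigram distribution it cancels in the comparison:

$$P_{KN}(w_i) - P_{KN}(w_j) = \frac{\max(c_{KN}(w_i) - d,\ 0) - \max(c_{KN}(w_j) - d,\ 0)}{\sum_{w'} c_{KN}(w')}$$

so the ranking of candidates is unchanged, while the absolute probabilities (e.g. for comparing contexts or computing perplexity) still depend on it.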