[Translation] Everything you know about word2vec is wrong

The classic explanation of word2vec as a Skip-gram model with negative sampling, found in the original paper and countless blog posts, looks like this:

  while (1) {
     1. vf = vector of focus word
     2. vc = vector of context word
     3. train such that (vc . vf = 1)
     4. for (0 <= i <= negative samples):
          vneg = vector of word *not* in context
          train such that (vf . vneg = 0)
  }

Indeed, if you google [word2vec skipgram], this is what you see:

But all of these implementations are wrong.

The original C implementation of word2vec works differently and is radically different from this. People who professionally build systems on top of word2vec word embeddings do one of the following:

  1. Call the original C implementation directly.
  2. Use the gensim implementation, which is transliterated from the C source so closely that even the variable names match.

Indeed, gensim is the only implementation I know of that is faithful to the C original.

C implementation

The C implementation actually maintains two vectors for each word: one for the word when it is in focus, and a second for the word when it is in context. (Sound familiar? Indeed, the GloVe authors borrowed this idea from word2vec without ever mentioning it!)

The C implementation is exceptionally well written:

  • The syn0 array contains the vector embedding of a word when it appears as the focus word. It is initialized randomly:

      for (a = 0; a < vocab_size; a++) for (b = 0; b < layer1_size; b++) {
        next_random = next_random * (unsigned long long)25214903917 + 11;
        syn0[a * layer1_size + b] =
            (((next_random & 0xFFFF) / (real)65536) - 0.5) / layer1_size;
      }
  • A second array, syn1neg, contains the vector of a word when it appears as a context word. It is initialized to zero.
  • During training (Skip-gram with negative sampling, though the other cases work about the same), we first select a focus word. It is held fixed across both the positive and negative examples. The gradients for the focus vector accumulate in a buffer and are applied to the focus word only after it has been trained on both the positive and negative examples.

      if (negative > 0) for (d = 0; d < negative + 1; d++) {
        // if we are performing negative sampling, in the 1st iteration,
        // pick the dot product target
        if (d == 0) {
          target = word;
          label = 1;
        } else {
          // for all other iterations, pick a word randomly and set the dot
          // product target to 0
          next_random = next_random * (unsigned long long)25214903917 + 11;
          target = table[(next_random >> 16) % table_size];
          if (target == 0) target = next_random % (vocab_size - 1) + 1;
          if (target == word) continue;
          label = 0;
        }
        l2 = target * layer1_size;
        f = 0;
        // compute the dot product and store it in f
        for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
        // set g = sigmoid(f) (roughly; the actual formula is slightly more complex)
        if (f > MAX_EXP) g = (label - 1) * alpha;
        else if (f < -MAX_EXP) g = (label - 0) * alpha;
        else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
        // 1. update the vector syn1neg,
        // 2. DO NOT UPDATE syn0
        // 3. STORE the syn0 gradient in a temporary buffer neu1e
        for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
        for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
      }
      // Finally, after all samples, update syn0 from neu1e
      // https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L541
      // Learn weights input -> hidden
      for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];

Why random and zero initialization?

Once again, since this is not explained in the original papers or anywhere else online, I can only guess.

The hypothesis is that since negative samples are drawn from the whole corpus and are not weighted by frequency, you can end up picking any word, and most often a word whose vector has barely been trained at all. If that vector had a nonzero value, it would randomly shift the genuinely important focus word.

The trick is to initialize all negative examples to zero, so that a word's representation is affected only by the vectors of words that occur reasonably often.

Actually, this is pretty clever, and I had never thought about how important initialization strategies are.

Why am I writing this

I spent two months of my life trying to reproduce word2vec as described in the original scientific publication and countless articles on the Internet, and failed. I could not achieve the same results as word2vec, although I tried my best.

I could not imagine that the authors of the publication had literally described an algorithm that does not work, while the implementation does something completely different.

In the end, I decided to study the source code. For three days I was convinced I was misreading it, because literally everyone on the Internet described a different implementation.

I have no idea why the original publication and the articles on the Internet say nothing about how word2vec actually works, so I decided to publish this information myself.

It also explains GloVe's seemingly radical choice of separate vectors for the negative context: they just did what word2vec does, but told people about it :).

Is it a scientific hoax? I don't know, hard question. But honestly, I'm incredibly angry. I will probably never again be able to take an explanation of a machine learning algorithm at face value: next time, I will go straight to the source code.

Source text: Everything you know about word2vec is wrong