Practical Approach to Word Embeddings using Word2Vec and Gensim

By Harsh Panwar

Department of Computer Science and Engineering
Jaypee University of Information Technology

Dec 15, 2019 - 5 min read

Email  /  CV  /  ResearchGate  /  Google Scholar  /  Github  /  Twitter

Introduction

Word embeddings capture relations between words that are related to each other in some sense. For instance, given a dataset of 100 candidates where each candidate's skills are stored as a list, word embeddings let us relate the words so that Java and C++ end up more related than Java and Business Development. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension. With the help of a practical example I will try to explain the concepts of word embeddings. I have used Word2Vec as the algorithm, implemented through Gensim. The full Github repo is available here.

Task

We are given a dataset which is in the form:

[[skill1,skill2,skill3,...],[skill1,skill4,skill6,...],....]

Each list inside the main list belongs to a unique candidate, so the number of candidates is equal to the length of the outer list, which in our dataset is around 3 million. We want to learn an embedding representation of each skill and plot it using TensorBoard for analysis. The key idea behind word embeddings is that words of similar meaning tend to lie close to each other: for example, red and black lie closer together in the embedding space than semantically different words like black and apple. Since our dataset contains skills instead of general words, skills like Python and Java should lie closer to each other than, say, Python and Sales Management.
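A quick way to see what "closer in the embedding space" means is cosine similarity, the standard measure of closeness between embedding vectors. The sketch below uses made-up three-dimensional vectors (real embeddings have 100 or more dimensions, and these numbers are purely illustrative) to show that related skills score higher than unrelated ones:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two vectors: 1.0 means same direction, 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (made up for illustration):
python = np.array([0.9, 0.8, 0.1])
java   = np.array([0.8, 0.9, 0.2])
sales  = np.array([0.1, 0.2, 0.9])

# Related skills score higher than unrelated ones:
print(cosine_similarity(python, java) > cosine_similarity(python, sales))  # True
```

A trained Word2Vec model produces exactly this kind of comparison, just over vectors it has learned from the data rather than hand-picked ones.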

Data Pre-Processing

Now we have to clean and preprocess our data: it is large and unstructured, so our model cannot understand it in this format. First, since the data is in .txt format, we open it using the with open statement in Python as follows:

    with open("all_linkedin_skill_data", mode="r") as i, open("out.txt", "w") as o:
        lastreadchar = ''   # track the previous character while streaming the file

Then, to organise the data and make it more understandable, we can use one of the following approaches to clean it.
Thanks to Uwe Ziegenhagen for answering my question on Stack Overflow and giving me this better, character-by-character approach to process the data:

    # runs inside the `with` block above, so `i` and `o` are still open
    while True:
        x = i.read(1)
        if x == '':                   # end of file has been reached
            break
        elif x == ' ':
            pass
        elif x == ']':
            pass
        elif x == '[':
            pass                      # opening brackets carry no skill data
        elif x == ',':
            if lastreadchar == ']':   # a candidate's list just ended
                o.write('\n')         # start a new line for the next candidate
            else:
                o.write(x)
        else:
            o.write(x)
        lastreadchar = x

Or you can use a simpler approach, which I wrote myself:

    with open('file.txt') as f:
        temp = f.read()

    l = []
    for chunk in temp.split(']'):
        skills = chunk.strip(' ,[\n')    # drop the surrounding brackets and commas
        if skills:                       # skip the empty fragments after the final ']]'
            l.append(skills.split(', '))
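To sanity-check the splitting logic, here is the same idea run on a small in-memory string in the dataset's format (the skill names are made up for illustration):

```python
# A hypothetical two-candidate sample in the same format as the dataset:
raw = "[[Python, Java, C++], [Sales Management, Business Development]]"

candidates = []
for chunk in raw.split(']'):
    skills = chunk.strip(' ,[')      # drop the brackets/commas around each candidate
    if skills:                       # skip the empty fragments after the final ']]'
        candidates.append(skills.split(', '))

print(candidates)
# [['Python', 'Java', 'C++'], ['Sales Management', 'Business Development']]
```

Each inner list is one candidate's skills, which is exactly the shape Gensim's Word2Vec expects as its training corpus.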

If you have any doubts regarding the preprocessing of this data, please refer to this Stack Overflow thread.

Word2Vec

Word2Vec comes in two variants: CBOW (Continuous Bag of Words) and continuous skip-gram. It can be understood as a two-layer neural network that takes a corpus of text as input and returns a vector for each word. It cannot be considered a deep neural network, but it turns text into a numeric form that deep nets can understand.

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
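The difference between the two architectures is easiest to see in the training pairs they derive from the same text. The pure-Python sketch below (no Gensim needed; window size 1 and a three-word "sentence" chosen for brevity) builds CBOW pairs, where the context predicts the target, and skip-gram pairs, where the target predicts each context word:

```python
sentence = ["python", "java", "sales"]
window = 1

cbow_pairs = []       # (context words, target): context predicts the target
skipgram_pairs = []   # (target, context word): target predicts each context word
for i, target in enumerate(sentence):
    # All words within `window` positions of the target, excluding the target itself
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))
    for c in context:
        skipgram_pairs.append((target, c))

print(cbow_pairs)
# [(['java'], 'python'), (['python', 'sales'], 'java'), (['java'], 'sales')]
print(skipgram_pairs)
# [('python', 'java'), ('java', 'python'), ('java', 'sales'), ('sales', 'java')]
```

In Gensim, switching between the two is a single constructor argument (sg=0 for CBOW, which is the default, and sg=1 for skip-gram).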

Now we are going to implement the CBOW model first, and later on we will implement the skip-gram model.

    import os
    import gensim
    from gensim.models import Word2Vec

    if not os.path.exists('model_out'):
        model1 = gensim.models.Word2Vec(l, min_count=1, size=100, window=5)
        model1.save('model_out')

    skill1 = input("Enter first skill: ").lower()
    skill2 = input("Enter second skill: ").lower()
    model_new = Word2Vec.load('model_out')
    print(model_new.similarity(skill1, skill2))

Gensim, an open-source library built for implementing unsupervised machine learning algorithms, can be used here to implement our Word2Vec models. It can be installed simply by using pip:

pip install gensim

After importing the Word2Vec model from gensim.models we can directly use the Word2Vec() constructor to train our model. We can tune the parameters by trial and error to obtain better results.

from gensim.models import Word2Vec

To obtain the similarity of different skills we can use the model.similarity(skill1, skill2) method, which returns the cosine similarity between the two skill vectors as a value between -1 and 1.

Visualisation using TensorBoard

TensorBoard provides the visualisation and tooling needed for machine learning experimentation.

The following code can be used for visualising the data with the help of the TensorBoard projector, which lays the words out using PCA so that nearby words reflect high cosine similarity.

    import os
    import gensim
    import numpy as np
    import tensorflow as tf
    from tensorflow.contrib.tensorboard.plugins import projector

    model2 = gensim.models.keyedvectors.KeyedVectors.load('model_out')
    max_size = len(model2.wv.vocab) - 1
    w2v = np.zeros((max_size, model2.layer1_size))

    if not os.path.exists('projections'):
        os.makedirs('projections')

    # Write one word per line so the projector can label each point
    with open('projections/metadata.tsv', 'w+') as file_metadata:
        for i, word in enumerate(model2.wv.index2word[:max_size]):
            w2v[i] = model2.wv[word]
            file_metadata.write(word + '\n')

    sess = tf.InteractiveSession()
    with tf.device('/cpu:0'):
        embedding = tf.Variable(w2v, trainable=False, name='embedding')

    tf.global_variables_initializer().run()
    saver = tf.train.Saver()
    writer = tf.summary.FileWriter('projections', sess.graph)

    config = projector.ProjectorConfig()
    embed = config.embeddings.add()
    embed.tensor_name = 'embedding'
    embed.metadata_path = 'metadata.tsv'

    projector.visualize_embeddings(writer, config)
    saver.save(sess, 'projections/model.ckpt', global_step=max_size)

This part is a little complex and hard to follow, so have a look at my Github repo here for a better understanding, and raise an issue if you are stuck anywhere. Once the TensorBoard code has run successfully, its output can be viewed with the following command:

tensorboard --logdir=projections --port=8000

This will create a localhost server accessible at http://localhost:8000/.

Conclusion

Word embeddings can be helpful in building many NLP models, such as chatbots, text translators, document similarity measures, and plagiarism detectors. By establishing connections between words, our machine learning model can better understand natural language and its intent.

The full code repo is available on my Github account here .