Understanding TF-IDF and Word2Vec (with Simple Examples)


When you work with text data in Natural Language Processing (NLP), computers cannot “understand” text directly; they need numbers.
So we convert words into numerical form, and that is where TF-IDF and Word2Vec come in: two popular techniques for representing words as numbers.

Let’s break them down in simple terms 👇


🔹 1. What is TF-IDF?

TF-IDF stands for Term Frequency – Inverse Document Frequency.
It’s a numerical value that shows how important a word is in a document compared to all other documents.

🧩 Step-by-Step Concept:

Let’s imagine you have 3 small documents:

Document   Text
D1         “I love Python programming”
D2         “Python is great for data science”
D3         “Data science and machine learning are related”

Now let’s see what TF and IDF mean 👇


📍 1. Term Frequency (TF)

TF measures how often a word appears in a document.

👉 Formula:

TF = \frac{\text{Number of times word appears in document}}{\text{Total number of words in document}}

Example:
In document D1 → “I love Python programming”

  • Total words = 4

  • The word “Python” appears 1 time

TF(Python, D1) = 1/4 = 0.25
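
Here is a minimal sketch of this TF calculation in plain Python (just string splitting, no libraries):

doc1 = "I love Python programming"
words = doc1.lower().split()                    # ['i', 'love', 'python', 'programming']
tf_python = words.count("python") / len(words)  # 1 occurrence out of 4 words
print(tf_python)                                # 0.25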


📍 2. Inverse Document Frequency (IDF)

Some words like “I”, “is”, “and” appear in almost every document.
These are common words — they don’t tell us much about the topic.

IDF helps reduce the importance of such common words.

👉 Formula:

IDF = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)

Example:
Word “Python” appears in 2 out of 3 documents (D1, D2).

IDF(Python) = \log(3 / 2) = 0.176
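
A quick sketch of the same calculation in Python. Note that the numbers in this post use the base-10 log; scikit-learn’s TfidfVectorizer uses the natural log with smoothing, so its values will differ slightly:

import math

n_docs = 3            # total documents: D1, D2, D3
docs_with_word = 2    # "Python" occurs in D1 and D2
idf_python = math.log10(n_docs / docs_with_word)
print(round(idf_python, 3))   # 0.176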


📍 3. TF-IDF = TF × IDF

It combines both values —

  • High TF means word appears frequently in a document

  • High IDF means word is rare across all documents

So TF-IDF highlights words that are frequent in one document but rare across the whole collection.

👉 Example Result:

Word          TF     IDF     TF-IDF
Python (D1)   0.25   0.176   0.044
love (D1)     0.25   0.477   0.119

Here, “love” has the higher TF-IDF: it appears only in D1, while “Python” also appears in D2, so “love” is more distinctive for D1.
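
You can reproduce these numbers with a small sketch in plain Python (again using the base-10 log to match the table):

import math

docs = [
    "I love Python programming",
    "Python is great for data science",
    "Data science and machine learning are related",
]
tokenized = [doc.lower().split() for doc in docs]
d1 = tokenized[0]

for word in d1:
    tf = d1.count(word) / len(d1)
    df = sum(word in doc for doc in tokenized)   # documents containing the word
    tf_idf = tf * math.log10(len(tokenized) / df)
    print(word, round(tf_idf, 3))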


🧠 Intuition:

  • Common words like “the”, “is”, “and” → Low TF-IDF

  • Rare, topic-specific words like “Python”, “machine learning” → High TF-IDF

That’s how TF-IDF helps identify keywords in a document.


🔹 2. What is Word2Vec?

TF-IDF tells us how important a word is,
but it doesn’t capture the meaning or context of the word.

That’s where Word2Vec helps.

🧩 Concept:

Word2Vec converts words into vectors (arrays of numbers) in such a way that words with similar meanings have similar vector representations.

👉 Example:

  • "King" → [0.7, 0.3, 0.9]

  • "Queen" → [0.6, 0.4, 0.9]

  • "Apple" → [0.1, 0.8, 0.2]

Here, “King” and “Queen” are closer in vector space, meaning they are related.
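
The three vectors above are made up for illustration, but we can still use them to show how “closeness” is usually measured: cosine similarity. A minimal sketch:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king  = np.array([0.7, 0.3, 0.9])
queen = np.array([0.6, 0.4, 0.9])
apple = np.array([0.1, 0.8, 0.2])

print(cosine_similarity(king, queen))   # ~0.99 -> very similar
print(cosine_similarity(king, apple))   # ~0.50 -> much less similar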


🧠 How does Word2Vec learn meanings?

It uses two main models:

🔸 a. CBOW (Continuous Bag of Words)

Predicts a word from its surrounding context.

Example:
Sentence: “The cat sat on the ___.”
→ The model predicts “mat”.

🔸 b. Skip-Gram

Opposite of CBOW — it predicts surrounding words from the current word.

Example:
Word: “cat”
→ Predicts surrounding words like “the”, “sat”, “on”.
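
In gensim, both models come from the same Word2Vec class, and the sg parameter chooses between them. A minimal sketch with a toy one-sentence corpus:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"]]

# sg=0 -> CBOW (the default), sg=1 -> Skip-Gram
cbow = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

print(cbow.wv["cat"])       # 10-dimensional vector learned by CBOW
print(skipgram.wv["cat"])   # 10-dimensional vector learned by Skip-Gram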


🧩 Simple Example

Let’s say we train Word2Vec on many sentences about people:

  • “King is a man”

  • “Queen is a woman”

  • “Man is strong”

  • “Woman is kind”

Now Word2Vec learns relationships like:

King – Man + Woman = Queen 😮

This means the model has captured the semantic meaning of the words!
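
In gensim, this vector arithmetic is exposed through most_similar with positive and negative word lists. Keep in mind that the four toy sentences above are far too little data for the famous result to actually emerge; in practice it takes a model trained on a large corpus. A sketch of the call:

from gensim.models import Word2Vec

sentences = [
    ["king", "is", "a", "man"],
    ["queen", "is", "a", "woman"],
    ["man", "is", "strong"],
    ["woman", "is", "kind"],
]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1)

# king - man + woman: with enough training data, "queen" should top this list
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))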


🧩 TF-IDF vs Word2Vec — Key Differences

Feature            TF-IDF                   Word2Vec
Type               Statistical              Neural network-based
Captures meaning   ❌ No                    ✅ Yes
Representation     Sparse (large vectors)   Dense (small vectors)
Example use        Keyword extraction       Semantic analysis, chatbots
Output             Importance score         Word embedding (vector)


🧠 Real-World Applications

Application              Uses TF-IDF                     Uses Word2Vec
Search engines           ✅ Rank documents by keywords   ✅ Understand context
Chatbots                                                 ✅ Identify intent and meaning
Spam detection           ✅ Identify spam keywords       ✅ Analyze context
Recommendation systems   ✅ Keyword similarity           ✅ Semantic similarity


💡 Example Code in Python

from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

# Sample data
docs = [
    "I love Python programming",
    "Python is great for data science",
    "Data science and machine learning are related"
]

# --- TF-IDF Example ---
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print("TF-IDF Matrix:")
print(X.toarray())
print("Feature Names:", tfidf.get_feature_names_out())

# --- Word2Vec Example ---
sentences = [doc.lower().split() for doc in docs]
model = Word2Vec(sentences, vector_size=5, window=3, min_count=1)
print("\nWord2Vec vector for 'python':")
print(model.wv['python'])
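
Running this prints a TF-IDF matrix with one row per document and one column per vocabulary word, followed by a 5-dimensional vector for “python” (the size set by vector_size=5). The exact Word2Vec numbers will differ between runs, since training starts from random weights.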


🔚 Conclusion

  • TF-IDF focuses on how important a word is in a document.

  • Word2Vec focuses on what the word actually means based on its context.

Together, they form the foundation of many NLP tasks.
