Tokenization, Stemming, and Lemmatization in NLP (with Python Examples)
🌟 Introduction
When working with Natural Language Processing (NLP), one of the biggest challenges is helping a machine understand human language.
Words, punctuation, and grammar can vary endlessly — and before any AI model can learn patterns, you need to clean and prepare your text data.
That’s where tokenization, stemming, and lemmatization come in.
These three techniques are the foundation of text preprocessing in NLP.
In this tutorial, we’ll cover:
✅ What these terms mean
✅ Why they’re essential
✅ How to implement them in Python (with examples)
✅ Real-world use cases for each
By the end, you’ll know how to transform raw text into a structured form that machines can understand — all while keeping it intuitive and practical.
💬 What Is Tokenization in NLP?
Ever tried reading a paragraph without spaces?
“ILovetoLearnPythonEveryDay.”
It’s impossible to read, right?
Tokenization solves that by breaking text into smaller parts, called tokens — usually words or sentences.
🧩 Example:
✅ Explanation:
- word_tokenize() splits the text into words.
- sent_tokenize() splits it into sentences.
This step helps models analyze text word by word or sentence by sentence.
🔍 Real-World Use Case:
In sentiment analysis, tokenization helps separate each word of a review:
“I love this phone, but the battery drains fast.”
Tokens: [‘I’, ‘love’, ‘this’, ‘phone’, ‘but’, ‘the’, ‘battery’, ‘drains’, ‘fast’]
Now, an NLP model can analyze which words express positive (love) and negative (drains fast) sentiment.
🌿 What Is Stemming in NLP?
Stemming means reducing words to their base or root form — often by chopping off prefixes or suffixes.
For example:
- “Playing”, “Played”, “Plays” → “Play”
- “Running”, “Runner”, “Runs” → “Run”
🧩 Example:
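A minimal sketch with NLTK's PorterStemmer (assumed here, since it produces the “easili”-style output discussed below; stemming needs no extra data downloads):

```python
# Stemming sketch using NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["playing", "played", "plays", "easily", "running"]

# The stemmer strips suffixes; results are not always real words
print([stemmer.stem(w) for w in words])
# ['play', 'play', 'play', 'easili', 'run']
```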
✅ Explanation:
The stemmer removes suffixes like -ing, -ly, or -ed to reduce words to their base.
But notice — sometimes the result isn’t a real word (like “easili”).
That’s why stemming is considered a rough cut, not grammatically perfect, but fast and useful in large-scale NLP tasks.
⚖️ When to Use Stemming
Use stemming when:
- You’re working with a large dataset
- Speed matters more than linguistic accuracy
- Applications: search engines, spam filtering, or quick topic tagging
Example:
If users search for “connects”, “connecting”, “connection”, stemming helps return results for all related forms.
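For instance (a sketch assuming NLTK's PorterStemmer), all three query forms collapse to the same stem, so a stem-keyed search index matches any of them:

```python
# All variants of "connect" reduce to one stem.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
queries = ["connects", "connecting", "connection"]
stems = {q: stemmer.stem(q) for q in queries}
print(stems)
# {'connects': 'connect', 'connecting': 'connect', 'connection': 'connect'}
```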
📘 What Is Lemmatization in NLP?
Lemmatization is similar to stemming — but smarter.
Instead of chopping words blindly, it uses linguistic rules and dictionaries to find the base (lemma) of each word.
For example:
- “Running” → “Run”
- “Better” → “Good”
- “Was” → “Be”
🧩 Example:
✅ Explanation:
The lemmatizer uses the part of speech (POS) tag to identify the correct lemma.
If you don’t specify pos='v' (verb), it may not always return the expected base form.
🔍 Difference Between Stemming and Lemmatization
| Feature | Stemming | Lemmatization |
|---|---|---|
| Method | Cuts word endings | Uses dictionary & grammar rules |
| Output | May not be real words | Always real words |
| Accuracy | Lower | Higher |
| Speed | Faster | Slower |
| Example | “easily” → “easili” | “was” → “be” |
✅ Tip:
If accuracy matters (like in chatbots or summarization), use lemmatization.
If performance matters (like in large-scale indexing), use stemming.
🧠 Example: Tokenization + Stemming + Lemmatization Pipeline
Let’s combine all three steps using Python:
✅ Observation:
- Stemming gives rough results like “happili”
- Lemmatization gives meaningful words like “child” and “be”
This demonstrates why lemmatization is preferred when meaning matters.
🧩 A Practical Example: Sentiment Analysis Preprocessing
Let’s use an example review:
“The movie was amazing! I loved the storyline and the actors were outstanding.”
After preprocessing:
- Tokenization:
  ['The', 'movie', 'was', 'amazing', '!', 'I', 'loved', 'the', 'storyline', 'and', 'the', 'actors', 'were', 'outstanding', '.']
- Stemming:
  ['the', 'movi', 'wa', 'amaz', 'i', 'love', 'the', 'storylin', 'and', 'the', 'actor', 'were', 'outstand']
- Lemmatization:
  ['the', 'movie', 'be', 'amazing', 'i', 'love', 'the', 'storyline', 'and', 'the', 'actor', 'be', 'outstanding']
✅ Cleaned text like this can now be used for:
- Sentiment analysis (positive/negative)
- Topic classification
- Keyword extraction
🔭 Modern NLP Approach (2025 Update)
In 2025, NLP models like BERT, GPT, and spaCy Transformers handle text differently.
They use subword tokenization and contextual embeddings, meaning:
- They can understand variations like “run”, “running”, and “ran” in the same context.
- However, basic preprocessing (like tokenization, lemmatization) is still useful when working with smaller datasets or classical ML models.
Example (using spaCy):
✅ spaCy automatically performs tokenization and lemmatization in one go — fast and efficient.
🚀 My 3-Step NLP Preprocessing Framework
I call this the “T-S-L Framework” — Tokenize → Stem → Lemmatize.
Step 1: Tokenize — break text into pieces
Step 2: Stem — simplify the variations
Step 3: Lemmatize — restore meaning
Use TSL whenever preparing text for machine learning.
It’s like cleaning your ingredients before cooking a great meal 🍳.
🧭 Conclusion
Tokenization, stemming, and lemmatization are core techniques in Natural Language Processing.
They make your text cleaner, smaller, and more meaningful, helping AI models understand patterns effectively.
Remember the T-S-L flow:
- Tokenize → Split text into units
- Stem → Simplify the forms
- Lemmatize → Get real, meaningful roots
These techniques bridge the gap between human language and machine understanding — turning words into intelligence.
🧹 Clean text means smarter models!