Descriptive Statistics in Data Science


In the fast-evolving world of data science, understanding your data is crucial. Descriptive statistics is one of the fundamental techniques that help data scientists summarize and interpret data in a meaningful way. This post is a simple guide to descriptive statistics: its key concepts and how they are applied in data science.

What is Descriptive Statistics?

Descriptive statistics involves summarizing and organizing data so that it can be easily understood. Unlike inferential statistics, which aims to make predictions or inferences about a population based on a sample, descriptive statistics focuses purely on describing the data at hand. It provides simple summaries and visualizations of the data's main characteristics.

Key Concepts in Descriptive Statistics

1. Measures of Central Tendency

Measures of central tendency are statistical measures that describe the center or typical value of a dataset. The main measures of central tendency are:

Mean

The mean, or average, is calculated by adding all the numbers in a dataset and then dividing by the count of numbers. It gives a single number that represents the dataset's typical value, but it is sensitive to outliers: one extreme value can pull it noticeably up or down.

  Mean = (Σx) / n
  

Example: Consider the dataset: [10, 20, 20, 40, 50]

  Mean = (10 + 20 + 20 + 40 + 50) / 5 = 140 / 5 = 28
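The same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the variable names are chosen here for the example and are not part of the original post.

```python
from statistics import mean

data = [10, 20, 20, 40, 50]

# Manual calculation: sum of the values divided by how many there are.
manual_mean = sum(data) / len(data)

# The standard library's statistics.mean should give the same value.
library_mean = mean(data)

print(manual_mean)  # 28.0
print(manual_mean == library_mean)
```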
  

Median

The median is the middle value of a dataset when it is ordered from lowest to highest. If there is an even number of observations, the median is the average of the two middle numbers. The median is particularly useful when the dataset contains outliers, because a few extremely large or small values barely change it.

Example: For the dataset: [10, 20, 20, 40, 50]

Ordered dataset: [10, 20, 20, 40, 50]

Median = 20 (the middle value)

For an even-numbered dataset: [10, 20, 30, 40]

Ordered dataset: [10, 20, 30, 40]

  Median = (20 + 30) / 2 = 25
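As a rough sketch, both cases (odd and even number of values) can be handled in a few lines of Python. The helper name manual_median and the with_outlier list are made up for this illustration; statistics.median from the standard library does the same job.

```python
from statistics import median

def manual_median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two

odd_data = [10, 20, 20, 40, 50]
even_data = [10, 20, 30, 40]
with_outlier = [10, 20, 20, 40, 5000]  # hypothetical data with one extreme value

print(manual_median(odd_data))      # 20
print(manual_median(even_data))     # 25.0
print(manual_median(with_outlier))  # 20
print(median(even_data) == manual_median(even_data))
```

Replacing 50 with 5000 leaves the median at 20, which is the robustness to outliers described above.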
  

Mode

The mode is the value that appears most frequently in a dataset. A dataset can have one mode, more than one mode, or no mode at all. The mode is useful for understanding the most common value in a dataset.

Example: For the dataset: [10, 20, 20, 40, 50]

Mode = 20 (it appears most frequently)
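The three cases (one mode, several modes, no mode) can be sketched with a small Python helper. The function name modes is hypothetical, and treating "every value occurs exactly once" as "no mode" is a convention assumed here; some libraries instead report every value as a mode in that case.

```python
from collections import Counter

def modes(values):
    """Return a list of the most frequent values (empty if nothing repeats)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: no repeated value means "no mode"
    return [value for value, count in counts.items() if count == highest]

print(modes([10, 20, 20, 40, 50]))  # [20]
print(modes([10, 10, 20, 20, 30]))  # [10, 20]
print(modes([10, 20, 30]))          # []
```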

2. Measures of Variability

Measures of variability, or dispersion, describe the spread of data points in a dataset. They help to understand the distribution of the data.

Range

The range is the difference between the highest and lowest values in a dataset. It gives a quick sense of the spread but is highly sensitive to outliers.

  Range = Maximum Value - Minimum Value
  

Example: For the dataset: [10, 20, 20, 40, 50]

  Range = 50 - 10 = 40
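A short Python sketch of the range, including what happens when a single outlier is added. The extra value 500 is invented purely to illustrate the sensitivity mentioned above.

```python
data = [10, 20, 20, 40, 50]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)
print(data_range)  # 40

# Adding one hypothetical extreme value changes the range dramatically.
with_outlier = data + [500]
outlier_range = max(with_outlier) - min(with_outlier)
print(outlier_range)  # 490
```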
  

Variance

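Variance measures how far, on average, the data points lie from the mean. It is calculated by taking the average of the squared differences between each data point and the mean (μ). The formula below is the population variance; when the data is a sample drawn from a larger population, the sum is usually divided by n - 1 instead of n.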

  Variance = Σ (x - μ)² / n
  

Example: For the dataset: [10, 20, 20, 40, 50]

Mean = 28

  Variance = [(10-28)² + (20-28)² + (20-28)² + (40-28)² + (50-28)²] / 5
           = (324 + 64 + 64 + 144 + 484) / 5
           = 1080 / 5
           = 216
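The worked example can be reproduced step by step in Python. This is a sketch using the standard library; statistics.pvariance divides by n as in the formula above, while statistics.variance divides by n - 1 (the sample variance), which is shown only for comparison.

```python
from statistics import pvariance, variance

data = [10, 20, 20, 40, 50]

mean = sum(data) / len(data)

# Squared difference of each point from the mean.
squared_diffs = [(x - mean) ** 2 for x in data]
print(squared_diffs)  # [324.0, 64.0, 64.0, 144.0, 484.0]

# Population variance: average of the squared differences.
manual_variance = sum(squared_diffs) / len(data)
print(manual_variance)  # 216.0

# Library equivalents: population (divide by n) and sample (divide by n - 1).
population_variance = pvariance(data)
sample_variance = variance(data)
print(population_variance == manual_variance)
print(sample_variance)  # 1080 / 4, i.e. 270
```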
  

Standard Deviation

The standard deviation is the square root of the variance and indicates how far a typical data point lies from the mean. It is widely used because it is in the same units as the data, which makes it easier to interpret than the variance.

  Standard Deviation = √(Σ (x - μ)² / n)
  

Example: For the dataset: [10, 20, 20, 40, 50]

Variance = 216

  Standard Deviation = √216 ≈ 14.7
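In Python, the standard deviation follows directly from the variance. The sketch below uses math.sqrt for the manual version and statistics.pstdev for the population standard deviation; statistics.stdev (which divides by n - 1) is included only to show the difference.

```python
import math
from statistics import pstdev, stdev

data = [10, 20, 20, 40, 50]

mean = sum(data) / len(data)
population_variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance.
manual_std = math.sqrt(population_variance)
print(round(manual_std, 2))  # 14.7

# Library versions: population (n) and sample (n - 1) standard deviation.
print(round(pstdev(data), 2))  # 14.7
print(round(stdev(data), 2))   # 16.43
```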
  

3. Distribution Shape

The shape of the data distribution is important for understanding the data's characteristics.

Skewness

Skewness measures the asymmetry of the data distribution. If the data is symmetrically distributed, skewness will be close to zero. Positive skew indicates a distribution with a long right tail, while negative skew indicates a long left tail.

Example: For a positively skewed dataset: [10, 20, 30, 40, 100]

The long tail on the right side shows positive skewness.
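Python's standard library has no skewness function, so the sketch below computes it by hand with one common definition: the third central moment divided by the second central moment raised to the power 1.5. The function name skewness and the two comparison datasets are assumptions made for this illustration; statistical packages may apply a small-sample correction and report slightly different values.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_tailed = [10, 20, 30, 40, 100]  # dataset from the example above
symmetric = [10, 20, 30, 40, 50]      # hypothetical symmetric data
left_tailed = [10, 70, 80, 90, 100]   # hypothetical data with a long left tail

print(round(skewness(right_tailed), 3))  # 1.138
print(skewness(symmetric))               # 0.0
print(round(skewness(left_tailed), 3))   # -1.138
```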

Kurtosis

Kurtosis measures the "tailedness" of the data distribution. High kurtosis means more of the variance is due to infrequent extreme deviations, whereas low kurtosis means more of the variance is due to frequent modestly sized deviations.

Example: For a dataset with high kurtosis: [10, 10, 10, 10, 1000]

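Most values sit at 10, and the single extreme value (1000) forms a heavy tail. It is this tail, rather than the height of the peak, that raises kurtosis.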
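Kurtosis can be sketched by hand in the same way, here as the fourth central moment divided by the squared second central moment. The function name kurtosis and the evenly spread comparison dataset are assumptions for this example. Under this definition a normal distribution scores 3, and many libraries report "excess kurtosis" (this value minus 3) instead.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2 ** 2 (a normal distribution gives 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

heavy_tailed = [10, 10, 10, 10, 1000]  # dataset from the example above
evenly_spread = [10, 20, 30, 40, 50]   # hypothetical data without extremes

print(kurtosis(heavy_tailed))   # 3.25
print(kurtosis(evenly_spread))  # 1.7
```

With only five values the result cannot get very large, so 3.25 is only slightly above the normal-distribution value of 3. It is still clearly higher than the 1.7 of the evenly spread data.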

Applying Descriptive Statistics in Data Science

Descriptive statistics are vital in the initial stages of data analysis. They help data scientists understand the basic features of the data, identify patterns, and detect anomalies. Here’s how descriptive statistics are typically applied in data science:

1. Data Summarization:

Descriptive statistics condense large datasets into a handful of understandable numbers. This includes calculating the mean, median, mode, range, variance, and standard deviation.

2. Data Visualization:

Visual tools like histograms, bar charts, pie charts, and box plots are used to graphically represent data. These visualizations provide insights into the distribution, central tendency, and variability of the data.

3. Detecting Outliers:

By understanding the spread of the data, data scientists can detect outliers that may affect the analysis. Outliers are extreme values that differ significantly from other observations.

4. Data Cleaning:

Descriptive statistics aid in identifying missing values, errors, and inconsistencies in the data. This is an essential step before performing further analysis.

5. Hypothesis Testing:

Before conducting inferential statistics, descriptive statistics help formulate hypotheses and determine the appropriate statistical tests to use.
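The summarization and outlier-detection steps above can be sketched together in Python. The post does not name a specific outlier method, so this example assumes the widely used 1.5 × IQR rule: values more than 1.5 interquartile ranges below the first quartile or above the third quartile are flagged. The function names summarize and iqr_outliers and the value 500 are hypothetical. statistics.quantiles requires Python 3.8 or later.

```python
from statistics import mean, median, pstdev, pvariance, quantiles

def summarize(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "variance": pvariance(values),  # population variance (divide by n)
        "std_dev": pstdev(values),      # population standard deviation
    }

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, k=1.5)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

clean = [10, 20, 20, 40, 50]
with_outlier = clean + [500]  # hypothetical extreme value

print(summarize(clean))
print(iqr_outliers(clean))         # []
print(iqr_outliers(with_outlier))  # [500]
```

In practice, libraries such as pandas provide similar summaries in one call, but the standard-library version keeps each step visible.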


