
Pandas Descriptive Statistics Functions

In this tutorial we will learn how to use Pandas for statistical analysis, using examples to illustrate the concepts.

The Basics of Pandas

Pandas is built around two main data structures: Series (one-dimensional) and DataFrame (two-dimensional). You can create a DataFrame with the pd.DataFrame() constructor, for example from a dictionary that maps column names to lists of values.

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500]
}

df = pd.DataFrame(data)

print(df)

# output

#     Name  Age  Salary
# 0   John   28    3000
# 1   Anna   22    3500
# 2  Peter   35    4500
# 3  Linda   58    6500

This will generate a DataFrame storing names, ages, and salaries.
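The text above mentions Series but only shows a DataFrame. As a minimal sketch (the variable names ages and age_column are chosen here for illustration), a Series can be built directly or obtained by selecting one column of a DataFrame:

```python
import pandas as pd

# A Series is a single labelled, one-dimensional column of values.
ages = pd.Series([28, 22, 35, 58], name='Age')

# Selecting one column of a DataFrame also yields a Series.
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
})
age_column = df['Age']

print(type(age_column).__name__)              # Series
print(ages.tolist() == age_column.tolist())   # True
```

Most of the statistical functions in this tutorial work on both structures: on a Series they return a single value, and on a DataFrame they return one value per column.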

Basic Statistical Functions in Pandas

1. mean(): This function returns the average of the values for the requested axis.

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500]
}

df = pd.DataFrame(data)
print(df)
print("Mean Salary")
print(df['Salary'].mean())

# output

#     Name  Age  Salary
# 0   John   28    3000
# 1   Anna   22    3500
# 2  Peter   35    4500
# 3  Linda   58    6500
# Mean Salary
# 4375.0
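The phrase "for the requested axis" refers to the axis parameter. A short sketch of the difference, using only the numeric columns of the same data: axis=0 (the default) averages down each column, while axis=1 averages across each row. The row-wise result has no real meaning for this data (it mixes ages and salaries) and is shown only to illustrate the mechanics.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

numeric = df[['Age', 'Salary']]

# axis=0 (default): one mean per column
column_means = numeric.mean(axis=0)
print(column_means)
# Age         35.75
# Salary    4375.00

# axis=1: one mean per row
row_means = numeric.mean(axis=1)
print(row_means.tolist())  # [1514.0, 1761.0, 2267.5, 3279.0]
```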

2. median(): This function returns the median of the values for the requested axis.

print(df['Age'].median())
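For reference, here is a self-contained version of the median call with its result. The second call is an addition for illustration: it computes the median of every numeric column at once, with numeric_only=True so that the non-numeric Name column is skipped.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

# With an even number of values, the median is the average
# of the two middle values: (28 + 35) / 2
median_age = df['Age'].median()
print(median_age)  # 31.5

# Median of each numeric column
medians = df.median(numeric_only=True)
print(medians)
# Age         31.5
# Salary    4000.0
```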

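3. mode(): This function returns the mode (the most frequent value) of the values for the requested axis. Because several values can tie for most frequent, the result is a Series rather than a single number; in this data every age occurs once, so all four ages are returned.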

print(df['Age'].mode())
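A small sketch of how mode() behaves with and without ties. The second Series, ages_with_repeat, is hypothetical data made up for this illustration and is not part of the tutorial's dataset.

```python
import pandas as pd

# The tutorial's ages: every value occurs once, so all of them tie
ages = pd.Series([28, 22, 35, 58])
all_modes = ages.mode()
print(all_modes.tolist())  # [22, 28, 35, 58]

# Hypothetical data in which 22 occurs twice
ages_with_repeat = pd.Series([28, 22, 35, 22])
single_mode = ages_with_repeat.mode()
print(single_mode.tolist())  # [22]

# mode() always returns a Series; take the first element
# when a single value is needed
print(single_mode.iloc[0])  # 22
```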

What is the describe() function in Pandas?

The describe() function in pandas generates descriptive statistics for a DataFrame or Series. It is a quick way to see the central tendency, dispersion, and shape of a dataset’s distribution. NaN values are excluded from the calculations.

By default, it provides the following statistics:

  1. count: Number of non-null observations.
  2. mean: Mean of the values.
  3. std: Standard deviation of the observations.
  4. min: Minimum value in the dataset.
  5. 25%: First quartile (25th percentile).
  6. 50%: Second quartile or Median (50th percentile).
  7. 75%: Third quartile (75th percentile).
  8. max: Maximum value in the dataset.

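For non-numeric data (for example, a column of strings), describe() provides a different set of statistics. Note that on a DataFrame with both numeric and non-numeric columns, describe() summarizes only the numeric columns by default; pass the include parameter (for example, include='all') to cover the others: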

  1. count: Number of non-null observations.
  2. unique: Number of distinct objects in the series.
  3. top: Most frequent object in the series.
  4. freq: Number of times the top object appears in the series.
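The non-numeric statistics listed above can be seen in a short sketch. The cities Series is hypothetical data invented for this example (it contains a repeated value so that top and freq are unambiguous); the DataFrame is the one used throughout this tutorial.

```python
import pandas as pd

# Hypothetical string data with one repeated value
cities = pd.Series(['Paris', 'Lima', 'Paris', 'Oslo'])
city_summary = cities.describe()
print(city_summary)
# Index labels: count, unique, top, freq

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

# include='all' summarizes numeric and non-numeric columns together;
# statistics that do not apply to a column are filled with NaN
full_summary = df.describe(include='all')
print(full_summary)
```

In the combined summary, rows such as mean and std are NaN for the Name column, and rows such as unique and top are NaN for the numeric columns.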

The following example applies describe() to the DataFrame used earlier in this tutorial.

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500]
}

df = pd.DataFrame(data)
print(df)
print(df.describe())

Output of the above is given below:

    Name  Age  Salary
0   John   28    3000
1   Anna   22    3500
2  Peter   35    4500
3  Linda   58    6500
             Age       Salary
count   4.000000     4.000000
mean   35.750000  4375.000000
std    15.755951  1547.847968
min    22.000000  3000.000000
25%    26.500000  3375.000000
50%    31.500000  4000.000000
75%    40.750000  5000.000000
max    58.000000  6500.000000

The Name column is omitted because describe() summarizes only numeric columns by default. For Age and Salary, it returns the count, mean, standard deviation, minimum and maximum values, and the quartiles of the data.
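The quartiles shown by describe() are only the default choice. As a sketch, the percentiles parameter accepts a list of fractions between 0 and 1; the values 0.1, 0.5, and 0.9 below are an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

# Request the 10th, 50th, and 90th percentiles instead of the quartiles
salary_summary = df['Salary'].describe(percentiles=[0.1, 0.5, 0.9])
print(salary_summary)

# Individual rows can be read by label
print(salary_summary['50%'])  # 4000.0
```

The percentiles are computed with linear interpolation between neighbouring values, which is why the results need not be values that appear in the data.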

List of Pandas Statistical Functions with Description

Here is a list of basic statistical functions provided by the pandas library, along with more detailed descriptions:

Function    Description
count()     Returns the number of non-null observations in a Series or DataFrame.
sum()       Adds up all the values in a Series or DataFrame along the specified axis.
mean()      Computes the arithmetic mean of a Series or DataFrame along the specified axis.
median()    Finds the middle value of the sorted data; with an even number of values, it is the average of the two middle values.
min()       Returns the smallest value in a Series or DataFrame.
max()       Returns the largest value in a Series or DataFrame.
mode()      Finds the most frequently occurring value or values; ties are all returned.
abs()       Computes the absolute value of each element (an element-wise operation, not a summary statistic).
prod()      Returns the product of all the values in a Series or DataFrame.
std()       Computes the sample standard deviation (normalized by N-1 by default), a measure of the dispersion in a set of values.
var()       Computes the sample variance (normalized by N-1 by default), a measure of the spread between numbers in a data set.
sem()       Computes the standard error of the mean of a Series or DataFrame.
skew()      Returns the unbiased skewness over the requested axis, a measure of the asymmetry of the data.
kurt()      Computes the unbiased excess kurtosis (Fisher's definition, where a normal distribution gives 0), a measure of the 'tailedness' of the distribution, over the requested axis.
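Several of the dispersion functions in the list are related to each other. The sketch below, which reuses the Salary column from the earlier examples, shows those relationships; the ddof=0 call is included only to contrast the population formula with the default sample formula.

```python
import math
import pandas as pd

salary = pd.Series([3000, 3500, 4500, 6500], name='Salary')

sample_std = salary.std()            # normalized by N-1 (ddof=1)
population_std = salary.std(ddof=0)  # normalized by N
variance = salary.var()              # the square of the sample std
std_error = salary.sem()             # sample std / sqrt(N)

print(round(sample_std, 6))  # 1547.847968, matching the describe() output
print(math.isclose(variance, sample_std ** 2))                        # True
print(math.isclose(std_error, sample_std / math.sqrt(len(salary))))  # True

# Shape statistics; skew() needs at least 3 values and kurt() at least 4
print(salary.skew())
print(salary.kurt())
```

Several statistics can also be requested in one call with agg(), for example salary.agg(['count', 'sum', 'min', 'max']).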
