Pandas Descriptive Statistics Functions
In this tutorial we will learn how to use Pandas for statistical analysis, using examples to illustrate the concepts.
The Basics of Pandas
Pandas is built around two main data structures: the Series (one-dimensional) and the DataFrame (two-dimensional). You can create a DataFrame in Python with the pd.DataFrame() constructor.
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 22, 35, 58],
'Salary': [3000, 3500, 4500, 6500]
}
df = pd.DataFrame(data)
print(df)
# output
# Name Age Salary
# 0 John 28 3000
# 1 Anna 22 3500
# 2 Peter 35 4500
# 3 Linda 58 6500
This will generate a DataFrame storing names, ages, and salaries.
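For comparison, the one-dimensional Series mentioned above can be built directly with pd.Series(), and selecting a single DataFrame column also gives a Series. The following is a minimal sketch; the variable names ages and age_column are chosen here for illustration only.

```python
import pandas as pd

# A Series is a single labelled column of values.
ages = pd.Series([28, 22, 35, 58], name='Age')

# Selecting one column of a DataFrame also yields a Series.
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
})
age_column = df['Age']

print(type(age_column).__name__)
print(ages.equals(age_column))
```

Most of the statistical functions covered in this tutorial work on both a Series and a DataFrame.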
Basic Statistical Functions in Pandas
1. mean(): This function returns the average of the values for the requested axis.
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 22, 35, 58],
'Salary': [3000, 3500, 4500, 6500]
}
df = pd.DataFrame(data)
print(df)
print("Mean Salary")
print(df['Salary'].mean())
# output
# Name Age Salary
# 0 John 28 3000
# 1 Anna 22 3500
# 2 Peter 35 4500
# 3 Linda 58 6500
# Mean Salary
# 4375.0
2. median(): This function returns the median of the values for the requested axis.
print(df['Age'].median())
3. mode(): This function returns the mode (the most frequent value or values) for the requested axis.
print(df['Age'].mode())
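The three functions above can be compared side by side on the sample data. This is a small sketch rather than a complete recipe; the names numeric, column_means, median_age, and age_modes are illustrative. Note that mode() returns a Series rather than a single number, because several values can tie for most frequent.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

# Restrict to the numeric columns before aggregating the whole frame.
numeric = df[['Age', 'Salary']]

# axis=0 (the default) aggregates down each column.
column_means = numeric.mean()
print(column_means)             # Age 35.75, Salary 4375.0

# The median of an even number of values is the mean of the two middle ones.
median_age = df['Age'].median()
print(median_age)               # 31.5

# Every age occurs exactly once, so all four values tie for the mode
# and mode() returns them all, sorted in ascending order.
age_modes = df['Age'].mode()
print(age_modes.tolist())       # [22, 28, 35, 58]
```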
What is the describe() function in Pandas?
The describe() function in pandas is a convenient method to generate descriptive statistics of a DataFrame or Series. It is a quick way to understand the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
By default, it provides the following statistics:
count: Number of non-null observations.
mean: Mean of the values.
std: Standard deviation of the observations.
min: Minimum value in the dataset.
25%: First quartile (25th percentile).
50%: Second quartile or median (50th percentile).
75%: Third quartile (75th percentile).
max: Maximum value in the dataset.
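For non-numeric data (for example a column of strings), describe() provides a different set of statistics. Note that on a DataFrame with mixed column types, describe() summarizes only the numeric columns by default; the statistics below appear when it is called on a non-numeric column or with the include parameter: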
count: Number of non-null observations.
unique: Number of distinct objects in the series.
top: Most frequent object in the series.
freq: Number of times the top object appears in the series.
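To see the non-numeric statistics on the sample data, describe() can be called on the Name column directly, or on the whole DataFrame with include='all'. This is a hedged sketch; the variable names name_stats and full_summary are illustrative. Because every name occurs once, the value reported as top is just one of the tied names, so it is not shown here.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

# Describing a text column gives count, unique, top and freq.
name_stats = df['Name'].describe()
print(name_stats)

# include='all' summarizes every column; statistics that do not apply
# to a column (e.g. the mean of Name) are filled with NaN.
full_summary = df.describe(include='all')
print(full_summary)
```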
The following example calls describe() on the sample DataFrame:
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 22, 35, 58],
'Salary': [3000, 3500, 4500, 6500]
}
df = pd.DataFrame(data)
print(df)
print(df.describe())
Output of the above is given below:
Name Age Salary
0 John 28 3000
1 Anna 22 3500
2 Peter 35 4500
3 Linda 58 6500
Age Salary
count 4.000000 4.000000
mean 35.750000 4375.000000
std 15.755951 1547.847968
min 22.000000 3000.000000
25% 26.500000 3375.000000
50% 31.500000 4000.000000
75% 40.750000 5000.000000
max 58.000000 6500.000000
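By default, only the numeric columns (Age and Salary) are summarized, and the Name column is omitted. For each numeric column the result contains the count, mean, standard deviation, minimum and maximum values, and the quartiles of the data.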
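The quartiles reported by describe() can be swapped for other percentiles through its percentiles parameter. The sketch below requests the 10th, 50th, and 90th percentiles; the variable name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 22, 35, 58],
    'Salary': [3000, 3500, 4500, 6500],
})

# Percentiles are given as fractions between 0 and 1.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Individual statistics can be looked up by row and column label.
print(summary.loc['50%', 'Age'])   # 31.5
```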
List of Pandas Statistical Functions with Description
Here is a list of basic statistical functions provided by the pandas library, along with more detailed descriptions:
Function | Description |
count() | This function returns the number of non-null observations in a Series or DataFrame. |
sum() | It adds up all the values in a Series or DataFrame along the specified axis. |
mean() | This function computes the arithmetic mean of a Series or DataFrame along the specified axis. |
median() | It finds the middle value of a Series or DataFrame when the data is sorted in ascending order. |
min() | This function returns the smallest value in a Series or DataFrame. |
max() | It returns the largest value in a Series or DataFrame. |
mode() | This function finds the most frequently occurring value or values in a Series or DataFrame; all tied values are returned. |
abs() | It computes the absolute value for each element in a Series or DataFrame. |
prod() | This function returns the product of all the values in a Series or DataFrame. |
std() | It computes the sample standard deviation (normalized by N-1 by default), a measure of the amount of variation or dispersion in a set of values. |
var() | This function calculates the sample variance (normalized by N-1 by default), a measure of the spread between numbers in a data set. |
sem() | It computes the standard error of the mean of a Series or DataFrame. |
skew() | This function returns the unbiased skewness over the requested axis, a measure of the asymmetry of the data. |
kurt() | It computes the unbiased excess kurtosis (Fisher's definition, where a normal distribution scores 0), a measure of the 'tailedness' of the distribution, over the requested axis. |
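The dispersion functions in the table are closely related to each other. As a rough sketch on the Salary values from the earlier examples (the variable names are illustrative), the variance is the square of the standard deviation, and the standard error of the mean is the standard deviation divided by the square root of the number of observations.

```python
import pandas as pd

salary = pd.Series([3000, 3500, 4500, 6500], name='Salary')

std = salary.std()   # sample standard deviation (ddof=1 by default)
var = salary.var()   # sample variance, equal to std squared
sem = salary.sem()   # standard error of the mean, std / sqrt(n)
print(std, var, sem)

# ddof=0 gives the population standard deviation, which is smaller.
population_std = salary.std(ddof=0)
print(population_std)

# Shape statistics; kurt() needs at least four observations.
print(salary.skew(), salary.kurt())
```

The std value matches the figure reported for Salary in the describe() output shown earlier.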