What is a DataFrame in Pandas?
A DataFrame in Pandas is a two-dimensional, size-mutable tabular data structure whose columns can hold different data types. It is one of the two primary data structures in Pandas, alongside the Series.
A DataFrame organizes data into rows and columns. Like a Series, each row has an index label; unlike a Series, the columns are labeled too. DataFrames suit real-world data because they can store heterogeneous types (numeric, date-time, text, etc.) in the same table, with every column aligned on the same index.
For example, imagine we have a DataFrame that contains information about those same four fruits, such as their color and weight. In a pandas DataFrame, this data will look something like this:
Index | Item | Color | Weight |
0 | Apple | Red | 150 |
1 | Banana | Yellow | 120 |
2 | Cherry | Red | 5 |
3 | Blueberry | Blue | 1 |
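As a rough sketch, the table above could be built in code as follows. The variable name `fruits` is illustrative and not from the original; the values are copied from the table.

```python
import pandas as pd

# Hypothetical variable name; values are taken from the table above
fruits = pd.DataFrame({
    "Item": ["Apple", "Banana", "Cherry", "Blueberry"],
    "Color": ["Red", "Yellow", "Red", "Blue"],
    "Weight": [150, 120, 5, 1],
})

print(fruits.shape)          # (number of rows, number of columns)
print(list(fruits.columns))  # column labels
print(list(fruits.index))    # default integer row labels
```

Because no index was passed, the rows receive the default labels 0 through 3, matching the Index column in the table.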
Features of DataFrame
Here are some key features of DataFrame:
- Size Mutability: The size of a DataFrame can be changed in a variety of ways, including the insertion or deletion of columns.
- Labelled Axes: Rows and columns are labelled, and can be referenced by these labels.
- Heterogeneous Data: A DataFrame can contain a mix of different data types: integers, floats, strings, Python objects, and so on.
- Handling Missing Data: A DataFrame can hold missing values, typically represented as NaN, which makes them easy to detect and modify.
- Efficient Operations: DataFrames provide a lot of functionality for efficient data operations such as join, merge, group by, and reshape.
- Flexibility: DataFrames allow you to manipulate the data in many ways. You can slice the data, index it, and take subsets of it.
- Data Alignment: Data alignment is intrinsic, meaning that the link between labels and data will not be broken unless you break it explicitly.
- Robust I/O Tools: Pandas provides robust I/O tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving and loading data in the fast HDF5 format.
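A few of these features can be seen in a small sketch. The column names, labels, and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with string row labels
df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])

# Size mutability: add a column, then build a version without it
df["b"] = [10.0, np.nan, 30.0]
without_b = df.drop(columns="b")

# Missing data: NaN marks the gap and isna() detects it
missing = df["b"].isna()

# Labelled axes: select a value by row label and column label
value = df.loc["z", "a"]

# Data alignment: addition matches rows by label, not by position
other = pd.Series([100, 200], index=["z", "x"])
aligned = df["a"] + other
print(aligned)
```

In the last step, row `y` has no partner in `other`, so its result is NaN rather than a value taken from the wrong row.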
Syntax for DataFrame
Creating a DataFrame in Pandas is quite straightforward. Here's the basic syntax:
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
Let's go through each parameter:
- data: The input data used to build the DataFrame. It can take several forms, such as a dict, Series, list, ndarray, constant, or another DataFrame.
- index: The row labels to use. If no index is provided and the data carries none, it defaults to range(n), where n is the number of rows.
- columns: Optional column labels. If none are given and the data carries no column labels (for example, a list of lists rather than a dict), they default to range(n), where n is the number of columns.
- dtype: Optional. Forces a single data type on the data. If it is not specified, the dtype is inferred from the data.
- copy: Optional boolean controlling whether the input data is copied. In pandas 1.3 and later the default is None, which behaves like True for dict input and like False for DataFrame or 2D ndarray input.
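The effect of these parameters can be sketched with a small list of lists. The labels `r1`, `r2`, `r3`, `a`, and `b` are hypothetical and chosen only for illustration.

```python
import pandas as pd

rows = [[1, 2], [3, 4], [5, 6]]

# No index or columns given: both default to 0, 1, 2, ...
default_df = pd.DataFrame(rows)

# Explicit row labels, column labels, and a forced dtype
labelled_df = pd.DataFrame(
    rows,
    index=["r1", "r2", "r3"],
    columns=["a", "b"],
    dtype="float64",
)

print(default_df)
print(labelled_df.dtypes)
```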
Create an Empty DataFrame in Pandas
Here's how you can create an empty DataFrame in Pandas, and add columns to it later:
import pandas as pd
# Create an empty DataFrame
df = pd.DataFrame()
print(df)
# Output:
# Empty DataFrame
# Columns: []
# Index: []
# Now, let's say you have some data and want to add a new column
data = [1, 2, 3, 4, 5]
df['Roll No'] = data
print(df)
# Output:
#    Roll No
# 0        1
# 1        2
# 2        3
# 3        4
# 4        5
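If the column names are known in advance, one option is to declare them when creating the empty DataFrame and add rows afterwards. This is a minimal sketch; the `Name` column and the row values are assumptions made for the example.

```python
import pandas as pd

# Empty DataFrame with column names declared up front
df = pd.DataFrame(columns=["Roll No", "Name"])
print(df.empty)  # True, because there are no rows yet

# Assigning to a new row label with .loc appends a row
df.loc[0] = [1, "Alex"]
df.loc[1] = [2, "Bob"]
print(len(df))
```

Adding rows one at a time is convenient for small examples; for larger data it is usually simpler to collect the rows first and build the DataFrame in one call.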
How to Create a DataFrame from a List?
If you have a single list, you can still create a DataFrame, but it will only have one column. Here's an example:
import pandas as pd
# create a single list
data = ['Alex', 'Bob', 'Clarke']
# create DataFrame
df = pd.DataFrame(data, columns=['Name'])
print(df)
# Output:
#      Name
# 0    Alex
# 1     Bob
# 2  Clarke
Here's an example of creating a DataFrame from a list of lists, where each sub-list represents a row in the DataFrame:
import pandas as pd
# create a list of lists
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
# create DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
# Output:
#      Name  Age
# 0    Alex   10
# 1     Bob   12
# 2  Clarke   13
How to Create DataFrame using Dictionary in Pandas?
Creating a DataFrame from a dictionary in Pandas is straightforward. The keys of the dictionary become the column labels, and the values associated with these keys form the data in the respective columns. Here's an example:
import pandas as pd
# create a dictionary
data = {
    'Name': ['Tom', 'Nick', 'John', 'Peter'],
    'Age': [20, 21, 19, 22]
}
# create DataFrame
df = pd.DataFrame(data)
print(df)
# Output:
#     Name  Age
# 0    Tom   20
# 1   Nick   21
# 2   John   19
# 3  Peter   22
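Two details of dictionary input are worth a short sketch: the `index` parameter can replace the default row labels, and the lists must all have the same length. The labels `s1` and `s2` are hypothetical.

```python
import pandas as pd

data = {"Name": ["Tom", "Nick"], "Age": [20, 21]}

# Custom row labels instead of the default 0, 1, ...
df = pd.DataFrame(data, index=["s1", "s2"])
print(df.loc["s2", "Age"])

# Lists of different lengths cannot form a rectangular table
try:
    pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [20]})
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```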
How to Create DataFrame using Series in Pandas?
Creating a DataFrame from a Pandas Series is also straightforward. If you use a single Series, the DataFrame will have a single column. If you use multiple Series, each Series will form a column. Here's an example:
import pandas as pd
# create Series
s1 = pd.Series([1, 2, 3, 4], name='Numbers')
s2 = pd.Series(['a', 'b', 'c', 'd'], name='Letters')
# create DataFrame using multiple series
df = pd.DataFrame({s1.name: s1, s2.name: s2})
print(df)
# Output:
#    Numbers Letters
# 0        1       a
# 1        2       b
# 2        3       c
# 3        4       d
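When the Series do not share the same index, the constructor aligns them by label and fills the gaps with NaN. The sketch below uses made-up labels to show this.

```python
import pandas as pd

# Two Series whose indexes only partly overlap
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"], name="Numbers")
s2 = pd.Series(["x", "y"], index=["b", "d"], name="Letters")

# The resulting index is the union of both indexes
df = pd.DataFrame({s1.name: s1, s2.name: s2})
print(df)
```

Only label `b` appears in both Series, so it is the only row with a value in both columns.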
How to Create DataFrame using NumPy ndarrays in Pandas?
Creating a DataFrame from a NumPy ndarray is quite easy in Pandas. As with a list of lists, each row of a 2D ndarray becomes a row in the DataFrame. Here's an example:
import pandas as pd
import numpy as np
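# create a 2D NumPy array (mixing labels and numbers stores every element as a string)
data = np.array([['', 'Column1', 'Column2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])
# create DataFrame: slice out the values and convert them back to integers
df = pd.DataFrame(data=data[1:, 1:].astype(int),
                  index=data[1:, 0],
                  columns=data[0, 1:])
print(df)
# Output:
#       Column1  Column2
# Row1        1        2
# Row2        3        4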
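A NumPy array stores a single dtype, so labels and numbers placed in the same array all end up as strings. One way to avoid this, sketched here with the same labels, is to keep the numbers in their own array and pass the labels separately.

```python
import numpy as np
import pandas as pd

# Mixing labels and numbers in one array makes every element a string
mixed = np.array([["Row1", 1, 2], ["Row2", 3, 4]])
print(mixed.dtype.kind)  # 'U' means Unicode string

# Keeping the numbers in their own array preserves the integer dtype
values = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(values, index=["Row1", "Row2"], columns=["Column1", "Column2"])
print(df.dtypes)
print(df["Column1"].sum())
```

With integer columns, arithmetic such as `sum()` works on numbers instead of concatenating strings.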
How to Create DataFrame using Another DataFrame in Pandas?
Creating a DataFrame from another DataFrame in Pandas is very straightforward: pass the existing DataFrame to the constructor. The new DataFrame has the same data, index, and columns as the original. Here's how you can do it:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Tom', 'Nick', 'John', 'Peter'],
        'Age': [20, 21, 19, 22]}
df1 = pd.DataFrame(data)
# Create another DataFrame from df1
df2 = pd.DataFrame(df1)
print(df2)
# Output:
#     Name  Age
# 0    Tom   20
# 1   Nick   21
# 2   John   19
# 3  Peter   22
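In this example, df1 is the original DataFrame, created from a dictionary, and df2 is created by passing df1 to the DataFrame constructor. The two hold the same values, but with the default copy setting the constructor does not copy DataFrame input, so df2 may share its underlying data with df1. For an independent copy, use df1.copy() or pd.DataFrame(df1, copy=True).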
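When the new DataFrame must be independent of the original, an explicit copy is a common option. Here is a minimal sketch with shortened data; the replacement value 99 is arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [20, 21]})

# Explicit, independent copy of df1
df2 = df1.copy()

# Changing the copy leaves the original untouched
df2.loc[0, "Age"] = 99

print(df1.loc[0, "Age"])
print(df2.loc[0, "Age"])
```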