
What is a DataFrame in Pandas?

A DataFrame in Pandas is a two-dimensional, size-mutable tabular data structure whose columns can hold different data types. It is one of the two primary data structures in Pandas, alongside the Series.

A DataFrame arranges data in rows and columns. As in a Series, every row has an index label; in a DataFrame, the columns are labeled as well. Index labels are usually unique, although Pandas does not enforce this. DataFrames are well suited to real-world data because they can store heterogeneous types of data (numeric, date-time, text, etc.) in the same table, all aligned by the same index.

For example, imagine we have a DataFrame that contains information about those same four fruits, such as their color and weight. In a pandas DataFrame, this data will look something like this:

Index  Item       Color   Weight
0      Apple      Red     150
1      Banana     Yellow  120
2      Cherry     Red     5
3      Blueberry  Blue    1
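
As a minimal sketch, the table above could be built from a dictionary of columns. The variable name fruits is only an illustration, and the weights are copied from the table as plain numbers because the text does not state a unit.

```python
import pandas as pd

# Each dictionary key becomes a column label; each list holds that column's values.
fruits = pd.DataFrame({
    "Item": ["Apple", "Banana", "Cherry", "Blueberry"],
    "Color": ["Red", "Yellow", "Red", "Blue"],
    "Weight": [150, 120, 5, 1],
})

print(fruits.shape)   # (4, 3): four rows, three columns
print(fruits.dtypes)  # text columns and an integer column side by side
```

The row labels 0 to 3 are generated automatically because no index was passed.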

Features of DataFrame

Here are some key features of DataFrame:

  1. Size Mutability: DataFrame size can be changed in a variety of ways, including the insertion or deletion of columns.

  2. Labelled Axes: Rows and columns are labelled, and can be referenced based on these labels.

  3. Heterogeneous Data: A DataFrame can contain a mix of different data types: integers, floats, strings, Python objects, and so on.

  4. Handling Missing Data: DataFrame can handle missing data and typically represents it as NaN (or NaT for date-time values) for easy detection and modification.

  5. Efficient Operations: DataFrame provides a lot of functionality to perform efficient data operations like join, merge, group by, reshape, etc.

  6. Flexibility: DataFrames allow you to manipulate the data in many ways. You can slice the data, index it, and subset the DataFrame.

  7. Data Alignment: Data alignment is intrinsic, meaning that the link between a label and its data is preserved across operations unless you break it explicitly.

  8. Robust IO Tools: Pandas provides robust I/O tools for loading data into DataFrames from flat files (CSV and delimited), Excel files, and databases, and for saving to and loading from the fast HDF5 format.
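
A few of the features listed above can be sketched in a handful of lines. This is only an illustration on made-up sample data; the variable names are hypothetical, and it touches size mutability, labelled axes, missing data, and data alignment rather than every feature in the list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Item": ["Apple", "Banana", "Cherry"], "Weight": [150.0, np.nan, 5.0]},
    index=["a", "b", "c"],
)

# Size mutability: add a column, then drop it again.
df["Color"] = ["Red", "Yellow", "Red"]
df = df.drop(columns="Color")

# Labelled axes: pick one cell by row label and column label.
apple_weight = df.loc["a", "Weight"]

# Missing data: NaN can be detected with isna() and replaced with fillna().
missing = df["Weight"].isna()
filled = df["Weight"].fillna(0.0)

# Data alignment: arithmetic pairs values by label, not by position.
extra = pd.Series({"c": 1.0, "a": 10.0})
total = filled + extra  # label "b" has no partner in extra, so it becomes NaN

print(total)
```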

Syntax for DataFrame 

Creating a DataFrame in Pandas is quite straightforward. Here's the basic syntax:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Let's go through each parameter:

  1. data: This is the input data that will be used to form the DataFrame. It can take several forms, such as a dict, list, NumPy ndarray, Series, or another DataFrame.

  2. index: These are the row labels to use for the DataFrame. If no index is provided and the data carries none of its own, the default is 0, 1, ..., n-1, where n is the number of rows.

  3. columns: This is an optional parameter where you can provide the column labels. If none are given and the data does not supply its own labels, the columns default to 0, 1, ..., n-1, where n is the number of columns.

  4. dtype: This is another optional parameter. You can use it to force a single data type for the whole DataFrame. If it's not specified, the dtype of each column is inferred from the data.

  5. copy: This is an optional boolean parameter (default None). If set to True, the input data is copied. In recent Pandas versions (1.3 and later), the default of None copies dict input but does not copy DataFrame or 2D ndarray input.
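
To make the parameters concrete, here is a small sketch that sets data, index, columns, and dtype explicitly and then compares the result with the defaults. The labels r1, r2, r3, x, and y are arbitrary placeholders.

```python
import pandas as pd

rows = [[1, 2], [3, 4], [5, 6]]

# Explicit row labels, column labels, and a forced dtype.
df = pd.DataFrame(
    data=rows,
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
    dtype="float64",
)

# With only data given, rows and columns are both numbered from 0.
default = pd.DataFrame(rows)

print(df)
print(default)
```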

Create an Empty DataFrame in Pandas 

Here's how you can create an empty DataFrame in Pandas, and add columns to it later:

import pandas as pd

# Create an empty DataFrame
df = pd.DataFrame()
print(df)
# Output:
# Empty DataFrame
# Columns: []
# Index: []

# Now, let's say you have some data and want to add a new column
data = [1, 2, 3, 4, 5]
df['Roll No'] = data

print(df)
# Output:
#    Roll No
# 0        1
# 1        2
# 2        3
# 3        4
# 4        5
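
An empty DataFrame can also declare its column labels up front. The sketch below assumes a second, hypothetical column called Name and adds rows one at a time with loc; this is fine for a few rows, but for larger data it is usually faster to collect the rows first and build the DataFrame once.

```python
import pandas as pd

# No rows yet, but the column labels are already in place.
df = pd.DataFrame(columns=["Roll No", "Name"])
was_empty = df.empty  # True while there are no rows

# len(df) is the next unused integer label, so each assignment adds one row.
df.loc[len(df)] = [1, "Alex"]
df.loc[len(df)] = [2, "Bob"]

print(df)
```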

How to Create a DataFrame from List?

If you have a single list, you can still create a DataFrame, but it will only have one column. Here's an example:

import pandas as pd

# create a single list
data = ['Alex', 'Bob', 'Clarke']

# create DataFrame
df = pd.DataFrame(data, columns=['Name'])

print(df)
# Output:
#      Name
# 0    Alex
# 1     Bob
# 2  Clarke

Here's an example of creating a DataFrame from a list of lists, where each sub-list represents a row in the DataFrame:

import pandas as pd

# create a list of lists
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]

# create DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])

print(df)
# Output:
#      Name  Age
# 0    Alex   10
# 1     Bob   12
# 2  Clarke   13
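
If the values start out as separate, parallel lists rather than a list of rows, one possible approach is to pair them up with zip first. This is a sketch that reuses the sample values from the example above.

```python
import pandas as pd

names = ["Alex", "Bob", "Clarke"]
ages = [10, 12, 13]

# zip pairs the i-th name with the i-th age, giving one tuple per row.
df = pd.DataFrame(list(zip(names, ages)), columns=["Name", "Age"])

print(df)
```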

How to Create DataFrame using Dictionary in Pandas?

Creating a DataFrame from a dictionary in Pandas is straightforward. The keys of the dictionary become the column labels, and the values associated with these keys form the data in the respective columns. Here's an example:

import pandas as pd

# create a dictionary
data = {
    'Name': ['Tom', 'Nick', 'John', 'Peter'],
    'Age': [20, 21, 19, 22]
}

# create DataFrame
df = pd.DataFrame(data)

print(df)
# Output:
#     Name  Age
# 0    Tom   20
# 1   Nick   21
# 2   John   19
# 3  Peter   22
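
Two details are worth a quick sketch when building from a dictionary: index and columns can still be passed, and all the lists must have the same length. The row labels s1 to s4 are made up for illustration.

```python
import pandas as pd

data = {"Name": ["Tom", "Nick", "John", "Peter"], "Age": [20, 21, 19, 22]}

# Custom row labels instead of the default 0, 1, 2, 3.
df = pd.DataFrame(data, index=["s1", "s2", "s3", "s4"])

# With dict input, `columns` selects and orders the keys; it does not rename them.
ages_first = pd.DataFrame(data, columns=["Age", "Name"])

# Lists of different lengths are rejected with a ValueError.
error = None
try:
    pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [20]})
except ValueError as exc:
    error = str(exc)

print(ages_first)
print(error)
```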

How to Create DataFrame using Series in Pandas?

Creating a DataFrame from a Pandas Series is also straightforward. If you use a single Series, the DataFrame will have a single column. If you use multiple Series, each Series will form a column. Here's an example:

import pandas as pd

# create Series
s1 = pd.Series([1, 2, 3, 4], name='Numbers')
s2 = pd.Series(['a', 'b', 'c', 'd'], name='Letters')

# create DataFrame using multiple series
df = pd.DataFrame({s1.name: s1, s2.name: s2})

print(df)
# Output:
#    Numbers Letters
# 0        1       a
# 1        2       b
# 2        3       c
# 3        4       d
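
When the Series do not share the same index, Pandas aligns them by label and fills the gaps with NaN. The sketch below uses made-up labels to show this, and also shows to_frame() as one way to turn a single Series into a one-column DataFrame.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"], name="Numbers")
s2 = pd.Series(["x", "y"], index=["b", "d"], name="Letters")

# The result has one row per label found in either Series: a, b, c, d.
df = pd.DataFrame({s1.name: s1, s2.name: s2})

# A single Series becomes a one-column DataFrame named after the Series.
single = s1.to_frame()

print(df)
print(single)
```

Only label b appears in both Series, so it is the only row with a value in both columns.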

How to Create DataFrame using NumPy ndarrays in Pandas?

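Creating a DataFrame from a NumPy ndarray is quite easy in Pandas. Similar to the creation from a list of lists, each row of the two-dimensional array becomes a row in the DataFrame. In the example below, the row and column labels are stored in the array itself and sliced out when the DataFrame is built. Because a NumPy array holds a single data type, the numbers in this particular array are stored as strings. Here's an example: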

import pandas as pd
import numpy as np

# create a 2d numpy array
data = np.array([['', 'Column1', 'Column2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])

# create DataFrame
df = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])

print(df)
# Output:
#      Column1 Column2
# Row1       1       2
# Row2       3       4
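
A NumPy array has a single dtype, so an array that mixes text labels with numbers stores everything as strings. The sketch below shows that effect and two possible ways around it: converting afterwards with astype, or keeping the numbers in their own array and passing the labels separately. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Labels and numbers in one array: NumPy converts every element to a string.
mixed = np.array([["Row1", 1, 2], ["Row2", 3, 4]])
from_mixed = pd.DataFrame(
    mixed[:, 1:], index=mixed[:, 0], columns=["Column1", "Column2"]
)
first_value = from_mixed.loc["Row1", "Column1"]  # the string '1', not the integer 1

# Option 1: convert the string values to integers after the fact.
as_int = from_mixed.astype(int)

# Option 2: keep the numbers in a purely numeric array and pass labels separately.
numeric = np.array([[1, 2], [3, 4]])
from_numeric = pd.DataFrame(
    numeric, index=["Row1", "Row2"], columns=["Column1", "Column2"]
)

print(from_numeric.dtypes)
```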

How to Create DataFrame using Another DataFrame in Pandas?

Creating a DataFrame from another DataFrame in Pandas is very straightforward. Passing an existing DataFrame to the constructor returns a new DataFrame object with the same index, columns, and values. Here's how you can do it:

import pandas as pd

# Create a DataFrame
data = {'Name': ['Tom', 'Nick', 'John', 'Peter'],
        'Age': [20, 21, 19, 22]}
df1 = pd.DataFrame(data)

# Create another DataFrame from df1
df2 = pd.DataFrame(df1)

print(df2)
# Output:
#     Name  Age
# 0    Tom   20
# 1   Nick   21
# 2   John   19
# 3  Peter   22

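In this example, df1 is the original DataFrame, created from a dictionary. df2 is created from df1 by simply passing df1 to the DataFrame constructor, and it has the same contents as df1. Note that the copy parameter is not set to True here, so, depending on the Pandas version, df2 may share its underlying data with df1 rather than hold an independent copy. To guarantee an independent copy, use df1.copy() or pass copy=True to the constructor.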
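
When the new DataFrame must be fully independent of the original, one option is the copy() method, which makes a deep copy by default. A minimal sketch with shortened sample data:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [20, 21]})

# copy() duplicates both the data and the index.
df2 = df1.copy()

# Changing the copy leaves the original untouched.
df2.loc[0, "Age"] = 99

print(df1)
print(df2)
```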

