Different Types of Data Structures in Pandas
Pandas has two main data structures: Series and DataFrame. Both are built on top of NumPy, which lets many operations run as fast, vectorized array computations rather than Python-level loops.
1. Series
A Series is a one-dimensional array-like object that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). Each data point has a label, and together the labels form the index. By default the labels are the integers 0 to N - 1, where N is the length of the data.
For example, imagine we have a series of four different fruits. In a pandas Series, this data will look something like this:
Index | Item |
0 | Apple |
1 | Banana |
2 | Cherry |
3 | Blueberry |
In the example above, each fruit is associated with a unique index (0 to 3).
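The fruit Series above can be built in a few lines. This is a minimal sketch: the variable name fruits and the Series name "Item" are chosen here for illustration and are not prescribed by pandas.

```python
import pandas as pd

# Build a Series from a plain Python list.
# No index is passed, so pandas assigns the default labels 0..N-1.
fruits = pd.Series(["Apple", "Banana", "Cherry", "Blueberry"], name="Item")

print(fruits)

# The default index holds the integer labels 0, 1, 2, 3.
print(list(fruits.index))

# .loc looks a value up by its index label.
print(fruits.loc[2])  # Cherry
```

Passing an explicit index argument to pd.Series would replace the default integer labels with labels of your choice.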
2. DataFrame
A DataFrame is a two-dimensional table of data with rows and columns. Like a Series, its rows are labeled by an index. In a DataFrame, however, the columns are labeled as well. DataFrames are well suited to real-world data because they let you store heterogeneous types of data (numeric, date-time, text, etc.) in the same table, with every column aligned on the same index.
For example, imagine we have a DataFrame that contains information about those same four fruits, such as their color and weight. In a pandas DataFrame, this data will look something like this:
Index | Item | Color | Weight |
0 | Apple | Red | 150 |
1 | Banana | Yellow | 120 |
2 | Cherry | Red | 5 |
3 | Blueberry | Blue | 1 |
In the example above, each fruit is associated with an index (0 to 3), and each attribute of the fruit (color, weight) is a labeled column.
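One common way to build this table is from a dictionary that maps column labels to lists of values. The sketch below assumes that approach; the variable names fruit_table and weights are illustrative only.

```python
import pandas as pd

# Each dictionary key becomes a column label; each list becomes that column's values.
fruit_table = pd.DataFrame(
    {
        "Item": ["Apple", "Banana", "Cherry", "Blueberry"],
        "Color": ["Red", "Yellow", "Red", "Blue"],
        "Weight": [150, 120, 5, 1],
    }
)

print(fruit_table)

# Every column carries its own data type (text vs. integer here).
print(fruit_table.dtypes)

# .loc selects by row label and column label.
print(fruit_table.loc[1, "Color"])  # Yellow

# Selecting one column returns a Series that shares the DataFrame's index.
weights = fruit_table["Weight"]
print(weights)
```

Because a single column is itself a Series, everything that applies to a Series (labels, lookups, arithmetic) also applies to each column of a DataFrame.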
Key Points about Pandas Data Structures
- Two Main Data Structures: Pandas provides two key data structures: Series and DataFrame. A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
- Handling of Different Data Types: Both Series and DataFrames can hold various data types such as integers, floats, strings, and Python objects. A DataFrame can hold a different data type in each column.
- Data Alignment: A key feature of pandas data structures is how arithmetic behaves between objects with different indexes. Pandas automatically aligns the data by index label before calculating, and labels present in only one of the objects produce missing values in the result.
- Handling Missing Data: Pandas data structures handle missing data well. By default, missing or NA values in numeric data are represented by NumPy's np.nan.
- Manipulation and Transformation: Pandas data structures are mutable. They can be modified directly or transformed to derive new objects. You can add, remove, or update values. This makes pandas powerful for data wrangling and preprocessing.
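The alignment, missing-data, and mutability points can be seen together in a short sketch. The stock counts and the names store_a, store_b, total, and filled are made up for illustration.

```python
import pandas as pd

# Hypothetical stock counts for two stores; the labels only partly overlap.
store_a = pd.Series({"Apple": 10, "Banana": 5, "Cherry": 7})
store_b = pd.Series({"Banana": 3, "Cherry": 1, "Blueberry": 8})

# Addition aligns on index labels, not on position.
# Labels found in only one Series ("Apple", "Blueberry") become NaN.
total = store_a + store_b
print(total)

# isna() flags the missing entries.
print(total.isna())

# Series.add with fill_value treats a label missing from one side as 0.
filled = store_a.add(store_b, fill_value=0)
print(filled)

# The structures are mutable: update a value in place...
filled["Apple"] = 12

# ...or derive a new object with an entry removed.
filled = filled.drop("Cherry")
print(filled)
```

Whether to keep the NaN values or fill them depends on what a missing label means in your data: filling with 0 is reasonable for counts like these, but it would be misleading for measurements that were simply never taken.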