Introduction to Pandas Library in Python
Pandas is a popular Python library for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, making it an essential tool for data scientists, analysts, and anyone working with tabular data.
Usage of Pandas:
Pandas is incredibly versatile and can help you with a wide range of tasks:
- Data Cleaning: Pandas makes it easy to handle missing data, duplicate records, and outliers. You can filter, fill, or drop data as needed.
- Data Exploration: You can quickly summarize and explore your data by calculating statistics, plotting charts, and aggregating information.
- Data Transformation: Transforming data for analysis or machine learning is a breeze with Pandas. You can reshape, pivot, or merge datasets effortlessly.
- Data Analysis: Perform data analysis tasks like grouping, sorting, and filtering data to gain insights from your dataset.
DataFrames in Pandas
A DataFrame in pandas represents a versatile and powerful data structure that resembles a rectangular table of data. It is a fundamental component for data manipulation and analysis. In this guide, we'll explore DataFrames in depth, covering their characteristics, creation, data access, modification, and more.
Characteristics of DataFrames:
A DataFrame possesses several key characteristics:
- Rectangular Structure: DataFrames are organized into rows and columns, creating a grid-like structure.
- Row and Column Indexes: A DataFrame includes both row and column indexes. These indexes allow for efficient data retrieval and manipulation.
Creating a DataFrame
You can create a DataFrame from a dictionary of equal-length lists or NumPy arrays. Here's an example:
import pandas as pd
data = {
"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
print(frame)
This code creates a DataFrame called frame with the given data. The resulting DataFrame will look like this:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
We can also specify a custom index:
# Specify a custom index using the 'index' parameter
custom_index = ["a", "b", "c", "d", "e", "f"]
frame2 = pd.DataFrame(data, index=custom_index)
print(frame2)
The output will look like this:
state year pop
a Ohio 2000 1.5
b Ohio 2001 1.7
c Ohio 2002 3.6
d Nevada 2001 2.4
e Nevada 2002 2.9
f Nevada 2003 3.2
Exploring DataFrames
1. head():
The head() method in pandas allows you to view the first few rows of a DataFrame. It's useful for quickly inspecting the beginning of the DataFrame and understanding its structure. By default, head() displays the first 5 rows, but you can specify the number of rows to display as an argument.
Syntax:
DataFrame.head(n=5)
n (optional): The number of rows to display. By default, it's set to 5.
Example:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)
# Display the first 3 rows
print(df.head(3))
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
2. tail():
The tail() method, similar to head(), allows you to view the last few rows of a DataFrame. It's helpful for examining the end of the DataFrame, especially when dealing with large datasets. By default, tail() displays the last 5 rows, but you can specify the number of rows to display as an argument.
Syntax:
DataFrame.tail(n=5)
n (optional): The number of rows to display. By default, it's set to 5.
Example:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)
# Display the last 2 rows
print(df.tail(2))
Output:
Name Age
3 David 40
4 Eve 45
3. Getting Info of DataFrame:
The info() method provides a concise summary of a DataFrame, including information about the data types of each column and the number of non-null values. It's a quick way to get an overview of your dataset.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Display information about the DataFrame
df.info()
# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 3 non-null object
1 Age 3 non-null int64
2 City 3 non-null object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
4. Getting Summary of DataFrame:
The describe() method provides summary statistics of the numerical columns in a DataFrame, including count, mean, std (standard deviation), min, 25%, 50%, 75%, and max values.
# Generate summary statistics for numerical columns
df.describe()
# Output:
Age
count 3.0
mean 30.0
std 5.0
min 25.0
25% 27.5
50% 30.0
75% 32.5
max 35.0
5. Check Column Names:
To print the column names of a DataFrame, you can use the columns attribute.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Print column names
print(df.columns)
# Index(['Name', 'Age', 'City'], dtype='object')
You can rename columns using the rename() method or by directly assigning new column names to the columns attribute.
# Rename columns using the rename() method
df = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})
# Rename columns by directly assigning new names
df.columns = ['Name', 'Age', 'City']
6. Check Index:
To print the index (row labels) of a DataFrame, you can use the index attribute.
# Print index
print(df.index)
# Output:
# RangeIndex(start=0, stop=3, step=1)
7. Check Values:
To print the values of a DataFrame, you can use the values attribute. This returns the values as a NumPy array.
# Print values
print(df.values)
# Output:
# [['Alice' 25 'New York']
# ['Bob' 30 'San Francisco']
# ['Charlie' 35 'Los Angeles']]
8. Check Shape:
To print the shape of a DataFrame, you can use the shape attribute. This returns a tuple representing the dimensionality of the DataFrame.
# Print shape
print(df.shape)
# Output:
# (3, 3)
Accessing Data in a DataFrame
You can access data within a DataFrame using various methods:
1. Accessing by Column Name:
import pandas as pd
data = {
"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
"year": [2000, 2001, 2002, 2001, 2002, 2003],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
state_column = frame["state"]
print(state_column)
This code retrieves the "state" column. Similarly, you can access other columns by their names.
# Output:
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
5 Nevada
Name: state, dtype: object
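You can also select several columns at once by passing a list of column names, which returns a DataFrame rather than a Series:
subset = frame[["state", "pop"]]
print(subset)
# Output:
#     state  pop
# 0    Ohio  1.5
# 1    Ohio  1.7
# 2    Ohio  3.6
# 3  Nevada  2.4
# 4  Nevada  2.9
# 5  Nevada  3.2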
2. Accessing by Row and Column:
row = frame.loc[1] # Get the second row (index 1)
print(row)
print('\n')
specific_row = frame.loc[1, ["state"]] # Get the "state" entry of the second row
print(specific_row)
This code uses .loc to access a specific row by its label (index). You can replace 1 with any row label you want.
# Output:
state Ohio
year 2001
pop 1.7
Name: 1, dtype: object
state Ohio
Name: 1, dtype: object
Slicing:
subset = frame.loc[2:4, ["state", "pop"]]
print(subset)
This code slices the DataFrame:
- 2:4 specifies the row slice. It includes rows with labels 2, 3, and 4. Note that, unlike regular Python slicing, .loc slicing is label-based and includes the endpoint.
- ["state", "pop"] specifies the columns you want to include in the subset. You're selecting only the "state" and "pop" columns.
state pop
2 Ohio 3.6
3 Nevada 2.4
4 Nevada 2.9
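For position-based access, pandas also provides .iloc. Unlike .loc, integer slicing with .iloc excludes the endpoint, matching regular Python slicing. A minimal sketch using the same frame:
row = frame.iloc[1] # Second row, by position
subset = frame.iloc[2:4, [0, 2]] # Rows at positions 2 and 3; columns 0 ("state") and 2 ("pop")
print(subset)
# Output:
#     state  pop
# 2    Ohio  3.6
# 3  Nevada  2.4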
Filtering Data
You can filter rows based on specific conditions using boolean indexing.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "San Francisco", "Los Angeles"]
}
df = pd.DataFrame(data)
adults = df[df["Age"] >= 30]
print(adults)
# Output:
Name Age City
1 Bob 30 San Francisco
2 Charlie 35 Los Angeles
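Conditions can be combined with & (and) and | (or); wrap each condition in parentheses because of operator precedence:
# Rows where Age is at least 30 and City is not "Los Angeles"
subset = df[(df["Age"] >= 30) & (df["City"] != "Los Angeles")]
print(subset)
# Output:
#   Name  Age           City
# 1  Bob   30  San Francisco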
Sorting Data:
Use the sort_values() method to sort the DataFrame based on one or more columns.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "San Francisco", "Los Angeles"]
}
df = pd.DataFrame(data)
sorted_df = df.sort_values("Age", ascending=False)
In the code above, we are sorting the DataFrame df by the "Age" column in descending order (ascending=False).
Sorting a DataFrame:
- by: The first parameter specifies the column or columns by which you want to sort the DataFrame. It accepts either a single column name as a string or a list of column names for multi-level sorting. If you provide multiple columns, pandas sorts first by the first column, then by the second, and so on.
- ascending: By default, this parameter is set to True, which sorts the data in ascending order. If you want to sort in descending order, set it to False.
# Sorting by multiple columns
sorted_df = df.sort_values(["City", "Age"], ascending=[True, False])
In-Place Sorting:
By default, sort_values() returns a new DataFrame with the sorted data, leaving the original DataFrame unchanged. If you want to sort the DataFrame in place (i.e., modify the original DataFrame), you can pass inplace=True.
df.sort_values("Age", ascending=False, inplace=True)
Adding and Dropping Columns
In pandas, you can easily add and drop columns from a DataFrame, which allows you to manipulate and tailor your data as needed.
Adding Columns:
To add a new column to a DataFrame, you can simply assign a new column of data to it. You can create a new column from a Python list or a NumPy array. Here's how to add columns:
import pandas as pd
# Sample DataFrame
data = {
"Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
"Age": [25, 30, 22, 35, 28]
}
df = pd.DataFrame(data)
# Adding a new column
df["City"] = ["New York", "San Francisco", "Los Angeles", "Chicago", "Boston"]
In the code above, we added a new column "City" to the DataFrame df, with values provided as a Python list.
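A new column can also be computed from existing ones:
# Derive a new column from an existing one
df["Age Next Year"] = df["Age"] + 1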
Dropping Columns:
To drop one or more columns from a DataFrame, you can use the drop() method. Specify the column name(s) or label(s) you want to drop and set axis=1 to indicate that you're dropping columns (as opposed to rows).
# Dropping a single column
df = df.drop("City", axis=1)
To drop multiple columns, provide a list of column names:
# Dropping multiple columns
columns_to_drop = ["Name", "Age"]
df = df.drop(columns=columns_to_drop)
When you drop columns, pandas returns a new DataFrame with the specified columns removed. If you want to modify the original DataFrame in place, you can pass inplace=True:
# Drop "Age" column in-place
df.drop("Age", axis=1, inplace=True)
Adding and Dropping Rows
In pandas, you can easily add and drop rows from a DataFrame, which allows you to manipulate and tailor your data as needed.
Adding Rows:
The append() method was historically used to add rows to an existing DataFrame, but it was deprecated and removed in pandas 2.0; the recommended replacement is pd.concat(). Combining rows is useful when you want to combine data from different sources or add new observations to your DataFrame.
Here's how you can do it step by step:
import pandas as pd
data = {'Name': ['Alice', 'Bob'],
'Age': [25, 30]}
df = pd.DataFrame(data)
new_data = {'Name': ['Charlie', 'David'],
'Age': [35, 28]}
new_df = pd.DataFrame(new_data)
combined_df = pd.concat([df, new_df], ignore_index=True)
print(combined_df)
Make sure to set ignore_index=True to reindex the resulting DataFrame.
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 David 28
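To append just a single row, a common alternative is assigning to a new label with .loc. A minimal sketch, assuming the default integer index used above:
# Add one row at the next integer label; values follow the column order (Name, Age)
combined_df.loc[len(combined_df)] = ['Eve', 27]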
Dropping Rows:
To drop one or more rows from a DataFrame, use the drop() method and specify the row label(s) you want to drop. By default, axis=0 indicates that you're dropping rows.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Drop the row with index 1
df_dropped_row = df.drop(1, axis=0)
print(df_dropped_row)
# Output:
# Name Age City
# 0 Alice 25 New York
# 2 Charlie 35 Los Angeles
You can also drop multiple rows by passing a list of row labels:
# Drop rows with index 0 and 2
df_dropped_multiple_rows = df.drop([0, 2], axis=0)
print(df_dropped_multiple_rows)
# Output:
# Name Age City
# 1 Bob 30 San Francisco
Reset Indexing
In pandas, the reset_index() function is used to reset the index. This operation is often necessary when you perform data manipulations that leave the DataFrame's index in a state that is not suitable for further analysis, and you want to revert to the default integer-based index.
Here's how to use reset_index():
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
# Set a custom index
df.set_index('Name', inplace=True)
# Reset the index
df = df.reset_index()
In this example, we first set the 'Name' column as the index using set_index(). Later, we use reset_index() to revert to the default integer-based index.
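If you don't want the old index added back as a column, pass drop=True:
# Discard the old index entirely instead of keeping it as a column
df = df.reset_index(drop=True)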
Frequency of Unique Values
The value_counts() method is used to determine the frequency of unique values in a column. It is particularly useful for categorical data.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Get the frequency of values in a column
city_counts = df['City'].value_counts()
# Output:
New York 1
San Francisco 1
Los Angeles 1
Name: City, dtype: int64
# For sorted results: city_counts = df['City'].value_counts(sort=True)
# For relative frequencies: city_counts = df['City'].value_counts(normalize=True)
You can also use the unique() method, which returns an array of the unique values found in the specified column.
# Get unique values in a column
unique_cities = df['City'].unique()
# Output:
array(['New York', 'San Francisco', 'Los Angeles'], dtype=object)
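Relatedly, the nunique() method returns just the number of distinct values:
# Count the number of distinct values in a column
print(df['City'].nunique())
# Output:
# 3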
Maximum Value Index
idxmax() is a pandas function used to find the index (row label) of the maximum value in a Series or DataFrame. It's like asking pandas, "Which row has the highest value for a particular column?"
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 92, 78]}
df = pd.DataFrame(data)
# Find the index (row label) of the maximum score
max_index = df['Score'].idxmax()
print(max_index)
# Output:
# 1
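The label returned by idxmax() can be fed straight into .loc to retrieve the entire row:
# Retrieve the full row with the highest score
print(df.loc[df['Score'].idxmax()])
# Output:
# Name     Bob
# Score     92
# Name: 1, dtype: object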
Drop Duplicate Values of a DataFrame
The drop_duplicates() method allows you to remove duplicate rows from a DataFrame, keeping only the first occurrence.
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
In pandas, when you're dealing with duplicate rows in a DataFrame, you can control which duplicates to keep based on the order of appearance using the keep parameter of the drop_duplicates() method. The keep parameter accepts three possible values: 'first', 'last', and False. Here's what each of these options means:
1. keep = 'first': This option keeps the first occurrence of a duplicated row and removes all subsequent duplicates.
df = df.drop_duplicates(subset="column_name", keep='first')
In this case, the first occurrence of a duplicated row is retained, and any subsequent duplicates are dropped.
2. keep = 'last': This option keeps the last occurrence of a duplicated row and removes all previous duplicates.
df = df.drop_duplicates(subset="column_name", keep='last')
Here, the last occurrence of a duplicated row is retained, and any previous duplicates are dropped.
3. keep = False: This option removes all duplicated rows entirely, keeping none of them.
df = df.drop_duplicates(subset="column_name", keep=False)
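Here is a small runnable example, using a hypothetical DataFrame with a repeated name:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 27]})
# Keep only the first occurrence of each Name
deduped = df.drop_duplicates(subset="Name", keep='first')
print(deduped)
# Output:
#     Name  Age
# 0  Alice   25
# 1    Bob   30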
Summary Statistics
Pandas provides a variety of methods to compute summary statistics for DataFrames. Here are some of the most commonly used methods:
1. Sum:
The sum() method returns the sum of the values for the requested axis.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [70000, 80000, 90000]}
df = pd.DataFrame(data)
# Sum of the Age and Salary columns
print(df[['Age', 'Salary']].sum())
# Output:
# Age 90
# Salary 240000
# dtype: int64
2. Mean:
The mean() method returns the mean of the values for the requested axis.
# Mean of the Age and Salary columns
print(df[['Age', 'Salary']].mean())
# Output:
# Age 30.0
# Salary 80000.0
# dtype: float64
3. Median:
The median() method returns the median of the values for the requested axis.
# Median of the Age and Salary columns
print(df[['Age', 'Salary']].median())
# Output:
# Age 30.0
# Salary 80000.0
# dtype: float64
4. Count:
The count() method returns the number of non-null values for each column.
# Count of non-null values in each column
print(df.count())
# Output:
# Name 3
# Age 3
# Salary 3
# dtype: int64
5. Max:
The max() method returns the maximum value for each column.
# Maximum value of the Age and Salary columns
print(df[['Age', 'Salary']].max())
# Output:
# Age 35
# Salary 90000
# dtype: int64
6. Min:
The min() method returns the minimum value for each column.
# Minimum value of the Age and Salary columns
print(df[['Age', 'Salary']].min())
# Output:
# Age 25
# Salary 70000
# dtype: int64
7. Quantile:
The quantile() method returns the value at the given quantile. By default, it calculates the 50th percentile (median).
# 25th, 50th, and 75th percentiles of the Age and Salary columns
print(df[['Age', 'Salary']].quantile([0.25, 0.5, 0.75]))
# Output:
# Age Salary
# 0.25 27.5 75000.0
# 0.50 30.0 80000.0
# 0.75 32.5 85000.0
Grouping and Aggregating Data
In pandas, you can perform grouping and aggregating operations on your DataFrame to summarize and analyze your data effectively. Grouping allows you to group rows of data based on one or more columns, and aggregation functions help you compute summary statistics for each group. Let's dive into the details of grouping and aggregating data in pandas.
Grouping Data:
To group your data, you can use the groupby() method in pandas. This method allows you to specify one or more columns by which you want to group your data. Here's how you can group data by a single column:
import pandas as pd
# Sample DataFrame
data = {
"Category": ["A", "B", "A", "B", "A", "B"],
"Value": [10, 15, 12, 18, 14, 20]
}
df = pd.DataFrame(data)
# Grouping by a single column
grouped = df.groupby("Category")
In this example, we grouped the DataFrame df by the "Category" column. Now, we can perform aggregation operations on each group.
Aggregating Data:
Once you've grouped your data, you can apply various aggregation functions to compute summary statistics for each group. Some common aggregation functions include sum(), mean(), median(), count(), max(), min(), and more. Here are some examples:
# Calculating the sum for each group
sums = grouped["Value"].sum()
# Calculating the mean for each group
means = grouped["Value"].mean()
# Counting the number of elements in each group
counts = grouped["Value"].count()
print(sums)
print(means)
print(counts)
# Output:
Category
A 36
B 53
Name: Value, dtype: int64
Category
A 12.000000
B 17.666667
Name: Value, dtype: float64
Category
A 3
B 3
Name: Value, dtype: int64
You can also apply multiple aggregation functions simultaneously using the agg() method:
# Applying multiple aggregation functions
result = grouped["Value"].agg(["sum", "mean", "count"])
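For the sample data above, printing result produces output along these lines:
# Output:
#           sum       mean  count
# Category
# A          36  12.000000      3
# B          53  17.666667      3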
Grouping by Multiple Columns:
You can also group data by multiple columns by passing a list of column names to the groupby() method, as sketched below:
# Grouping by multiple columns
grouped = df.groupby(["Category", "Subcategory"])
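Since the sample DataFrame above has no "Subcategory" column, here is a self-contained sketch with a hypothetical one:
import pandas as pd
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B"],
    "Subcategory": ["X", "Y", "X", "Y"],
    "Value": [10, 12, 15, 18]
})
# Each (Category, Subcategory) pair becomes one group
sums = df.groupby(["Category", "Subcategory"])["Value"].sum()
print(sums)
# Output:
# Category  Subcategory
# A         X              10
#           Y              12
# B         X              15
#           Y              18
# Name: Value, dtype: int64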
Pivot Tables
Pivot tables are a powerful feature in pandas that allow you to reshape and summarize data. They are especially useful for aggregating data across multiple dimensions. Pandas provides the pivot_table() function to create pivot tables.
The pivot_table() function aggregates data based on one or more keys and applies a specified aggregation function.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb'],
'Sales': [200, 150, 300, 220, 180, 320]}
df = pd.DataFrame(data)
# Create a pivot table
pivot = df.pivot_table(values='Sales', index='Name', columns='Month', aggfunc='sum')
print(pivot)
# Output:
# Month Feb Jan
# Name
# Alice 220 200
# Bob 180 150
# Charlie 320 300
You can apply multiple aggregation functions to the pivot table.
# Create a pivot table with multiple aggregation functions
pivot = df.pivot_table(values='Sales', index='Name', columns='Month', aggfunc=['sum', 'mean'])
print(pivot)
# Output:
# sum mean
# Month Feb Jan Feb Jan
# Name
# Alice 220 200 220.0 200.0
# Bob 180 150 180.0 150.0
# Charlie 320 300 320.0 300.0
You can fill missing values in the pivot table with a specified value using the fill_value parameter.
# Create a pivot table and fill missing values with 0
pivot = df.pivot_table(values='Sales', index='Name', columns='Month', aggfunc='sum', fill_value=0)
print(pivot)
# Output:
# Month Feb Jan
# Name
# Alice 220 200
# Bob 180 150
# Charlie 320 300
The margins parameter in the pivot_table() function allows you to add subtotals and grand totals to the pivot table. When set to True, it adds an "All" row and column, which contain the sum of the values for each row and column.
# Create a pivot table with margins
pivot = df.pivot_table(values='Sales', index='Name', columns='Month', aggfunc='sum', fill_value=0, margins=True)
print(pivot)
# Output:
# Month Feb Jan All
# Name
# Alice 220 200 420
# Bob 180 150 330
# Charlie 320 300 620
# All 720 650 1370
Handling Missing Data
isna() checks whether each element in a DataFrame is null (missing). It returns a DataFrame of Boolean values (True if null, False otherwise).
import pandas as pd
data = {'A': [1, 2, None, 4],
'B': [None, 5, 6, 7]}
df = pd.DataFrame(data)
# Check for null values in the DataFrame
null_mask = df.isna()
print(null_mask)
Output:
A B
0 False True
1 False False
2 True False
3 False False
isna().sum() calculates the total number of missing values in each column of a DataFrame.
# Calculate the total number of null values in each column
null_counts = df.isna().sum()
print(null_counts)
Output:
A 1
B 1
dtype: int64
In this example, 'A' has 1 missing value, and 'B' has 1 missing value.
Pandas provides methods for dealing with missing values. You can use dropna() to remove rows with missing values and fillna() to fill missing values.
# DataFrame.dropna(axis=0, subset=None, inplace=False)
df.dropna(axis=0) # Returns a copy with rows containing missing values removed
df.dropna(subset=["A"]) # Drops rows with null values in column 'A'
# To fill a single column, assign the result back: df["col"] = df["col"].fillna(fill_value)
df.fillna(0) # Returns a copy with missing values filled with 0
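Applied to the sample DataFrame above, a minimal sketch:
# Remove rows that contain any missing value
cleaned = df.dropna()
print(cleaned)
# Output:
#      A    B
# 1  2.0  5.0
# 3  4.0  7.0
# Fill missing values instead of dropping them
filled = df.fillna(0)
print(filled)
# Output:
#      A    B
# 0  1.0  0.0
# 1  2.0  5.0
# 2  0.0  6.0
# 3  4.0  7.0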
Merging or Joining Data
Merging or joining data in pandas involves combining multiple DataFrames based on common columns or indices. These operations are essential when you have data distributed across different tables that you want to consolidate for analysis. In pandas, you can perform merging or joining using various options.
Types of Joins:
- Inner Join (default): An inner join returns only the rows that have matching values in both DataFrames. It retains only the common keys.
- Outer Join: An outer join returns all rows from both DataFrames, filling in missing values with NaN where there is no match.
- Left Join: A left join returns all rows from the left DataFrame and the matching rows from the right DataFrame. Unmatched rows from the left DataFrame are filled with NaN.
- Right Join: A right join is similar to a left join but returns all rows from the right DataFrame and the matching rows from the left DataFrame.
Merging DataFrames:
The merge() function in pandas is used for merging DataFrames. Here's how you can use it:
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
'value': [5, 6, 7, 8]})
# Inner join (default)
inner_merged = df1.merge(df2, on='key')
# Outer join
outer_merged = df1.merge(df2, on='key', how='outer')
# Left join
left_merged = df1.merge(df2, on='key', how='left')
# Right join
right_merged = df1.merge(df2, on='key', how='right')
print("Inner Join (default):")
print(inner_merged)
print("\nOuter Join:")
print(outer_merged)
print("\nLeft Join:")
print(left_merged)
print("\nRight Join:")
print(right_merged)
df1 = pd.DataFrame({'key1': ['A', 'B', 'C', 'D'],
'key2': [1, 2, 3, 4],
'value1': [10, 20, 30, 40]})
df2 = pd.DataFrame({'key1': ['B', 'D', 'E', 'F'],
'key2': [2, 4, 6, 8],
'value2': [50, 60, 70, 80]})
# Inner join on multiple columns
inner_merged = df1.merge(df2, on=['key1', 'key2'])
print("\nInner Join on multiple columns:")
print(inner_merged)
df3 = pd.DataFrame({'left_key': ['A', 'B', 'C', 'D'],
'value1': [10, 20, 30, 40]})
df4 = pd.DataFrame({'right_key': ['B', 'D', 'E', 'F'],
'value2': [50, 60, 70, 80]})
# Inner join using `left_on` and `right_on`
inner_merged_lr = df3.merge(df4, left_on='left_key', right_on='right_key')
print("\nInner Join using left_on and right_on:")
print(inner_merged_lr)
In the code above, on="key" specifies the column to merge on. You can merge on multiple columns by passing a list to the on parameter, and on differently named key columns using left_on and right_on, as the later examples show.
# Output:
Inner Join (default):
  key  value_x  value_y
0   B        2        5
1   D        4        6

Outer Join:
  key  value_x  value_y
0   A      1.0      NaN
1   B      2.0      5.0
2   C      3.0      NaN
3   D      4.0      6.0
4   E      NaN      7.0
5   F      NaN      8.0

Left Join:
  key  value_x  value_y
0   A        1      NaN
1   B        2      5.0
2   C        3      NaN
3   D        4      6.0

Right Join:
  key  value_x  value_y
0   B      2.0        5
1   D      4.0        6
2   E      NaN        7
3   F      NaN        8

Inner Join on multiple columns:
  key1  key2  value1  value2
0    B     2      20      50
1    D     4      40      60

Inner Join using left_on and right_on:
  left_key  value1 right_key  value2
0        B      20         B      50
1        D      40         D      60
Concatenation of Data
Concatenation in pandas involves combining DataFrames along a particular axis (either rows or columns). Pandas provides the concat() function for this purpose.
Here's how you can use the concat() function to concatenate DataFrames:
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7']})
# Concatenating along rows (axis=0)
result = pd.concat([df1, df2], ignore_index = True)
In this example, we concatenate df1 and df2 along rows (axis=0), effectively stacking df2 below df1. Passing ignore_index=True gives the result a fresh integer index instead of repeating the original row labels.
You can also concatenate along columns (axis=1):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
# Concatenating along columns (axis=1)
result = pd.concat([df1, df2], axis=1)
In this case, df2 is concatenated next to df1 along columns. Make sure the index values match when concatenating along columns, or you may end up with misaligned data.
Pandas with Functions
Working with functions in pandas involves applying custom or built-in functions to DataFrame objects to perform data transformations, aggregations, or other operations.
You can also apply functions to DataFrames along rows or columns using apply(). Here's an example:
import pandas as pd
# Create a sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
# Function to apply
def double_column(col):
    return col * 2
# For the whole DataFrame
df = df.apply(double_column)
# For one column of the DataFrame (assign back to the column, not to df)
df['A'] = df['A'].apply(double_column)
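For one-off transformations, a lambda is often more convenient than defining a named function:
# Same transformation written with a lambda
df['A'] = df['A'].apply(lambda x: x * 2)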
Pandas with CSV Data
In pandas, you can easily read and write data from/to CSV files. This is a common and essential part of data analysis, as many datasets are stored in CSV (Comma-Separated Values) format. Here's how you can work with CSV data in pandas:
Reading Data from CSV Files:
You can use the pd.read_csv() function to read data from a CSV file into a pandas DataFrame. Here's an example:
import pandas as pd
# Read data from a CSV file into a DataFrame
df = pd.read_csv('data.csv')
In this example, replace 'data.csv' with the path to your CSV file.
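read_csv() accepts many optional parameters for controlling how the file is parsed. A few commonly used ones (the file name and column names here are hypothetical):
# Read a comma-separated file, keeping only two columns and the first 100 data rows
df = pd.read_csv('data.csv', sep=',', usecols=['Name', 'Age'], nrows=100)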
Writing Data to CSV Files:
You can use the to_csv() method to write a pandas DataFrame to a CSV file. Here's an example:
# Write a DataFrame to a CSV file
df.to_csv('output.csv', index=False)
In this example, replace output.csv with the desired output file path. Setting index=False omits writing the DataFrame index to the CSV file.