Excel in 2024 — Nah! Let’s learn and use Pandas Library in Python!
Pandas is a powerful open-source data manipulation and analysis library for Python. It is widely used for cleaning, aggregating, and transforming data, and is an essential tool for many data scientists and analysts.
We can handle any task when is time taking in excel and more important is we can automate those repititive task. Specially for people who do not like Excel Macros as it freezes, make error and even sometime shutdown all system while rulling if have some bugs. Pandas let you handle all those excel works you can imagine just with few line of codes.
I recommend to start using Spyder Kernal to start with as it provide a interface to see the dataframe uploaded into system .
To start using pandas, you will need to install it. You can install pandas using pip
, the Python package manager, by running the following command in your terminal:
pip install pandas
Once you have installed pandas, you can start using it in your Python scripts by importing it:
The pd
alias is a common convention and will be used throughout this tutorial.
Reading data
One of the most basic tasks in pandas is reading data into a DataFrame. A DataFrame is a two-dimensional size-mutable tabular data structure with rows and columns. You can think of it as a spreadsheet or a SQL table.
There are several ways to read data into a DataFrame. Some of the most common ones are:
pd.read_csv()
: Read a CSV (comma-separated values) filepd.read_excel()
: Read an Excel filepd.read_json()
: Read a JSON (JavaScript Object Notation) filepd.read_sql()
: Read data from a SQL database
Here’s an example of how to read a CSV file into a DataFrame using pd.read_csv()
:# First, import pandas and load the Excel file
import pandas as pd
# Replace 'path/to/file.xlsx' with the path to your Excel file
df = pd.read_excel('path/to/file.xlsx')
# You can view the first few rows of the DataFrame using the head() method
print(df.head())
# You can select specific columns by name using the square bracket [] operator
print(df['column_name'])
# You can also select multiple columns by passing a list of column names
print(df[['column_1', 'column_2']])
# You can filter the DataFrame using a boolean mask
mask = df['column_name'] > value
filtered_df = df[mask]
# You can also use the query() method to filter the DataFrame using a string expression
filtered_df = df.query('column_name > value')
# You can group the data by one or more columns and apply a function to each group
grouped_df = df.groupby('column_name')['column_name_2'].mean()
# You can apply multiple functions to a group by using the agg() method
grouped_df = df.groupby('column_name').agg(['mean', 'std'])
This will read the data.csv
file into a DataFrame and store it in the df
variable. The read_csv()
function has many optional parameters that allow you to customize the way the data is read. For example, you can use the sep
parameter to specify a different delimiter, or the header
parameter to specify which row should be used as the column names.
Exploring data
Once you have your data in a DataFrame, you can start exploring it and performing various operations on it. Here are a few common tasks:
df.head()
: Display the first few rows of the DataFramedf.tail()
: Display the last few rows of the DataFramedf.info()
: Display information about the DataFrame, such as the data types of each column and the number of non-missing valuesdf.describe()
: Generate descriptive statistics for the numerical columnsdf.columns
: Display the column names of the DataFramedf.dtypes
: Display the data types of the columns
You can also use the []
operator to select specific columns or rows. For example, to select the 'name'
and 'age'
columns, you can do:
To select a single row, you can use the .loc[]
or .iloc[]
attributes, which allow you to select rows by label or integer index, respectively. For example:# First, import pandas and read the data into a DataFrame
import pandas as pd
df = pd.read_csv('path/to/data.csv')
# View the first few rows of the DataFrame
print(df.head())
# Get information about the DataFrame, including the data types and number of rows and columns
print(df.info())
# Get statistical summary of the data
print(df.describe())
# Select a specific column
print(df['column_name'])
# Select multiple columns
print(df[['column_1', 'column_2']])
# Select rows that meet a certain condition
mask = df['column_name'] > value
filtered_df = df[mask]
# You can also use the query() method to filter the DataFrame using a string expression
fil
Well just try this first and get familier with your data, In Upcoming Blogs we will discuss how to handle some complex excel functions using Pandas!
#Data Analysis
[…] Start to use pandas! — Say GoodBye to Excel in 2023! […]