Pandas

Data manipulation and analysis with Python's Pandas library. DataFrames, grouping, merging, and cleaning real-world datasets.

Pandas

Pandas: Working with DateTime

Time-based data appears everywhere: server logs, financial transactions, sensor readings, user activity streams. Yet datetime handling remains one of the most frustrating aspects of data analysis….

Read more →
Pandas

Pandas - Transpose DataFrame

• Transposing DataFrames swaps rows and columns using the .T attribute or .transpose() method, essential for reshaping data when features and observations need to be inverted

Read more →
Pandas

Pandas - str.extract() with Regex

The str.extract() method applies a regular expression pattern to each string in a Series and extracts matched groups into new columns. The critical requirement: your regex must contain at least one…

Read more →
Pandas

Pandas: String Operations Guide

Text data is messy. Customer names have inconsistent casing, addresses contain extra whitespace, and product codes follow patterns that need parsing. If you’re reaching for a for loop or apply()

Read more →
Pandas

Pandas - Stack and Unstack

• Stack converts column labels into row index levels (wide to long), while unstack does the reverse (long to wide), making them essential for reshaping hierarchical data structures

Read more →
Pandas

Pandas - Set DatetimeIndex

Time-series data without proper datetime indexing forces you into string comparisons and manual date arithmetic. A DatetimeIndex enables pandas’ temporal superpowers: automatic date-based slicing,…

Read more →
Pandas

Pandas - Set/Reset Column as Index

• Setting a column as an index transforms it from regular data into row labels, enabling faster lookups and more intuitive data alignment—use set_index() for single or multi-level indexes without…

Read more →
Pandas

Pandas - Select Multiple Columns

The most straightforward method for selecting multiple columns uses bracket notation with a list of column names. This approach is readable and works well when you know the exact column names.

Read more →
Pandas

Pandas - Reorder/Rearrange Columns

The most straightforward approach to reorder columns is passing a list of column names in your desired sequence. This creates a new DataFrame with columns arranged according to your specification.

Read more →
Pandas

Pandas - Resample Time Series Data

Resampling reorganizes time series data into new time intervals. Downsampling reduces frequency (hourly to daily), requiring aggregation. Upsampling increases frequency (daily to hourly), requiring…

Read more →
Pandas

Pandas - Reset Index of DataFrame

• The reset_index() method converts index labels into regular columns and creates a new default integer index, essential when you need to flatten hierarchical indexes or restore a clean numeric…

Read more →
Pandas

Pandas - Right Join DataFrames

A right join (right outer join) returns all records from the right DataFrame and matched records from the left DataFrame. When no match exists, Pandas fills left DataFrame columns with NaN values….

Read more →
Pandas

Pandas - Rename Column by Index

When working with DataFrames from external sources, you’ll frequently encounter datasets with auto-generated column names, duplicate headers, or names that don’t follow Python naming conventions….

Read more →
Pandas

Pandas - Rename Column Names

The rename() method is the most versatile approach for changing column names in Pandas. It accepts a dictionary mapping old names to new names and returns a new DataFrame by default.

Read more →
Pandas

Pandas: Reading and Writing Files

Every data project starts and ends with file operations. You pull data from CSVs, databases, or APIs, transform it, then export results for downstream consumers. Pandas makes this deceptively…

Read more →
Pandas

Pandas - Read from S3 Bucket

• Pandas integrates seamlessly with S3 through the s3fs library, allowing you to read files directly using standard read_csv(), read_parquet(), and other I/O functions with S3 URLs

Read more →
Pandas

Pandas - Pivot Table with Examples

A pivot table reorganizes data from a DataFrame by specifying which columns become the new index (rows), which become columns, and what values to aggregate. The fundamental syntax requires three…

Read more →
Pandas

Pandas - Rank Values in Column

• Pandas provides multiple ranking methods (average, min, max, first, dense) that handle tied values differently, with the rank() method offering fine-grained control over ranking behavior

Read more →
Pandas

Pandas - Merge on Multiple Columns

Merging on multiple columns follows the same syntax as single-column merges, but passes a list to the on parameter. This creates a composite key where all specified columns must match for rows to…

Read more →
Pandas

Pandas: Merge vs Join vs Concat

Combining DataFrames is one of the most common operations in data analysis, yet Pandas offers three different methods that seem to do similar things: concat, merge, and join. This creates…

Read more →
Pandas

Pandas - Join on Index

Pandas provides the join() method specifically optimized for index-based operations. Unlike merge(), which defaults to column-based joins, join() leverages the DataFrame index structure for…

Read more →
Pandas

Pandas - Left Join DataFrames

A left join returns all records from the left DataFrame and matching records from the right DataFrame. When no match exists, pandas fills the right DataFrame’s columns with NaN values. This operation…

Read more →
Pandas

Pandas - Memory Optimization Tips

• Pandas DataFrames can consume 10-100x more memory than necessary due to default data types—switching from int64 to int8 or using categorical types can reduce memory usage by 90% or more

Read more →
Pandas

Pandas: Memory Usage Reduction

Pandas defaults to memory-hungry data types. Load a CSV with a million rows, and Pandas will happily allocate 64-bit integers for columns that only contain values 0-10, and store repeated strings…

Read more →
Pandas

Pandas - Inner Join DataFrames

An inner join combines two DataFrames by matching rows based on common column values, retaining only the rows where matches exist in both datasets. This is the default join type in Pandas and the…

Read more →
Pandas

Pandas: GroupBy with DataFrames

The GroupBy operation is one of the most powerful features in pandas, yet many developers underutilize it or misuse it entirely. At its core, GroupBy implements the split-apply-combine paradigm: you…

Read more →
Pandas

Pandas: Handling Missing Data

Every real-world dataset has holes. Missing data shows up as NaN (Not a Number), None, or NaT (Not a Time) in Pandas, and how you handle these gaps directly impacts the quality of your analysis.

Read more →
Pandas

Pandas - GroupBy and Sum

The GroupBy sum operation is fundamental to data aggregation in Pandas. It splits your DataFrame into groups based on one or more columns, calculates the sum for each group, and returns the…

Read more →
Pandas

Pandas - GroupBy and Transform

The groupby() operation splits a DataFrame into groups based on one or more keys, applies a function to each group, and combines the results. This split-apply-combine pattern is fundamental to data…

Read more →
Pandas

Pandas - GroupBy Multiple Columns

• GroupBy with multiple columns creates hierarchical indexes that enable multi-dimensional data aggregation, essential for analyzing data across multiple categorical dimensions simultaneously.

Read more →
Pandas

Pandas - GroupBy Single Column

The groupby() method partitions a DataFrame based on unique values in a specified column. This operation doesn’t immediately compute results—it creates a GroupBy object that holds instructions for…

Read more →
Pandas

Pandas - GroupBy and Count

• GroupBy operations in Pandas enable efficient data aggregation by splitting data into groups based on categorical variables, applying functions, and combining results into a structured output

Read more →
Pandas

Pandas - GroupBy and Filter Groups

GroupBy filtering differs fundamentally from standard DataFrame filtering. While df[df['column'] > value] filters individual rows, GroupBy filtering operates on entire groups. When you filter…

Read more →
Pandas

Pandas - Drop Rows by Index

• Pandas offers multiple methods to drop rows by index including drop(), boolean indexing, and iloc[], each suited for different scenarios from simple deletions to complex conditional filtering

Read more →
Pandas

Pandas - eval() for Performance

Standard pandas operations create intermediate objects for each step in a calculation. When you write df['A'] + df['B'] + df['C'], pandas allocates memory for df['A'] + df['B'], then adds…

Read more →
Pandas

Pandas - Drop Duplicate Rows

• The drop_duplicates() method removes duplicate rows based on all columns by default, but accepts parameters to target specific columns, choose which duplicate to keep, and control in-place…

Read more →
Pandas

Pandas - Drop Multiple Columns

• Pandas offers multiple methods to drop columns: drop() with column names, drop() with indices, and direct column selection—each suited for different scenarios and data manipulation patterns.

Read more →
Pandas

Pandas - Drop Rows by Condition

• Pandas offers multiple methods to drop rows based on conditions: boolean indexing with bracket notation, drop() with index labels, and query() for SQL-like syntax—each with distinct performance…

Read more →
Pandas

Pandas - Create Empty DataFrame

• Creating empty DataFrames in Pandas requires understanding the difference between truly empty DataFrames, those with defined columns, and those with predefined structure including dtypes

Read more →
Pandas

Pandas - Cross Join DataFrames

A cross join (Cartesian product) combines every row from the first DataFrame with every row from the second DataFrame. If DataFrame A has m rows and DataFrame B has n rows, the result contains m × n…

Read more →
Pandas

Pandas - Cumulative Sum (cumsum)

The cumsum() method computes the cumulative sum of elements along a specified axis. By default, it operates on each column independently, returning a DataFrame or Series with the same shape as the…

Read more →
Pandas

Pandas - Convert Column to Integer

• Converting columns to integers in Pandas requires handling null values first, as standard int types cannot represent missing data—use Int64 (nullable integer) or fill/drop nulls before conversion

Read more →
Pandas

Pandas - Append DataFrames

Appending DataFrames is a fundamental operation in data manipulation workflows. The primary method is pd.concat(), which concatenates pandas objects along a particular axis with optional set logic…

Read more →
Pandas

Pandas - Apply Function to Column

• The apply() method transforms DataFrame columns using custom functions, lambda expressions, or built-in functions, offering more flexibility than vectorized operations for complex transformations

Read more →
Pandas

Pandas - Add Multiple Columns

The most straightforward approach to adding multiple columns is direct assignment. You can assign multiple columns at once using a list of column names and corresponding values.

Read more →
Pandas

How to Write to CSV in Pandas

Every data pipeline eventually needs to export data somewhere. CSV remains the universal interchange format—it’s human-readable, works with Excel, imports into databases, and every programming…

Read more →
Pandas

How to Write to Excel in Pandas

Pandas makes exporting data to Excel straightforward, but the simplicity of df.to_excel() hides a wealth of options that can transform your output from a raw data dump into a polished,…

Read more →
Pandas

How to Write to SQL in Pandas

Pandas excels at data manipulation, but eventually you need to persist your work somewhere more durable than a CSV file. SQL databases remain the backbone of most production data systems, and pandas…

Read more →
Pandas

How to Use Value Counts in Pandas

When you’re exploring a new dataset, one of the first questions you’ll ask is ‘what values exist in this column and how often do they appear?’ The value_counts() method answers this question…

Read more →
Pandas

How to Use Transform in Pandas

Pandas gives you three main methods for applying functions to data: apply(), agg(), and transform(). Understanding when to use each one will save you hours of debugging and rewriting code.

Read more →
Pandas

How to Use str.replace in Pandas

Real-world data is messy. You’ll encounter inconsistent formatting, unwanted characters, legacy encoding issues, and text that needs standardization before analysis. Pandas’ str.replace() method is…

Read more →
Pandas

How to Use str.split in Pandas

String splitting is one of the most common data cleaning operations you’ll perform in Pandas. Whether you’re parsing CSV-like fields, extracting usernames from email addresses, or breaking apart full…

Read more →
Pandas

How to Use str.contains in Pandas

String matching is one of the most common operations when working with text data in pandas. Whether you’re filtering customer names, searching product descriptions, or parsing log files, you need a…

Read more →
Pandas

How to Use str.extract in Pandas

Pandas’ str.extract method solves a specific problem: you have a column of strings containing structured information buried in text, and you need to pull that information into usable columns. Think…

Read more →
Pandas

How to Use Pipe in Pandas

If you’ve written Pandas code for any length of time, you’ve probably encountered the readability nightmare of nested function calls or sprawling intermediate variables. The pipe() method solves…

Read more →
Pandas

How to Use Query in Pandas

Pandas gives you two main ways to filter DataFrames: boolean indexing and the query() method. Most tutorials focus on boolean indexing because it’s the traditional approach, but query() often…

Read more →
Pandas

How to Use pd.cut in Pandas

Continuous numerical data is messy. When you’re analyzing customer ages, transaction amounts, or test scores, the raw numbers often obscure patterns that become obvious once you group them into…

Read more →
Pandas

How to Use pd.qcut in Pandas

Binning continuous data into discrete categories is a fundamental data preparation task. Pandas offers two primary functions for this: pd.cut and pd.qcut. Understanding when to use each will save…

Read more →
Pandas

How to Use Melt in Pandas

Data rarely arrives in the format you need. You’ll encounter ‘wide’ datasets where each variable gets its own column, and ’long’ datasets where observations stack vertically with categorical…

Read more →
Pandas

How to Use loc in Pandas

Pandas provides two primary indexers for accessing data: loc and iloc. Understanding the difference between them is fundamental to writing clean, bug-free data manipulation code.

Read more →
Pandas

How to Use Map in Pandas

Pandas gives you several ways to transform data, and choosing the wrong one leads to slower code and confused teammates. The map() function is your go-to tool for element-wise transformations on a…

Read more →
Pandas

How to Use json_normalize in Pandas

Nested JSON is everywhere. APIs return it, NoSQL databases store it, and configuration files depend on it. But pandas DataFrames expect flat, tabular data. The gap between these two worlds causes…

Read more →
Pandas

How to Use iloc in Pandas

Pandas provides two primary indexers for accessing data: loc and iloc. While they look similar, they serve fundamentally different purposes. iloc stands for ‘integer location’ and uses…

Read more →
Pandas

How to Use GroupBy in Pandas

Pandas GroupBy is one of those features that separates beginners from practitioners. Once you internalize it, you’ll find yourself reaching for it constantly—summarizing sales by region, calculating…

Read more →
Pandas

How to Use Get Dummies in Pandas

Machine learning algorithms work with numbers, not strings. When your dataset contains categorical variables like ‘red’, ‘blue’, or ‘green’, you need to convert them into a numerical format. One-hot…

Read more →
Pandas

How to Use Eval in Pandas

Pandas provides two eval functions that let you evaluate string expressions against your data: the top-level pd.eval() and the DataFrame method df.eval(). Both parse and execute expressions…

Read more →
Pandas

How to Use Describe in Pandas

Exploratory data analysis starts with one question: what does my data actually look like? Before building models, creating visualizations, or writing complex transformations, you need to understand…

Read more →
Pandas

How to Use Apply in Pandas

The apply() function in pandas lets you run custom functions across your data. It’s the escape hatch you reach for when pandas’ built-in methods don’t cover your use case. Need to parse a custom…

Read more →
Pandas

How to Use Applymap in Pandas

When you need to transform every single element in a Pandas DataFrame, applymap() is your tool. It takes a function and applies it to each cell individually, returning a new DataFrame with the…

Read more →
Pandas

How to Use Assign in Pandas

The assign() method is one of pandas’ most underappreciated features. It creates new columns on a DataFrame and returns a copy with those columns added. This might sound trivial—after all, you can…

Read more →
Pandas

How to Stack and Unstack in Pandas

Pandas provides two complementary methods for reshaping data: stack() and unstack(). These operations pivot data between ’long’ and ‘wide’ formats by moving index levels between the row and…

Read more →
Pandas

How to Sort a DataFrame in Pandas

Sorting is one of the most frequent operations you’ll perform during data analysis. Whether you’re finding top performers, organizing time-series data chronologically, or simply making a DataFrame…

Read more →
Pandas

How to Sort by Index in Pandas

Pandas DataFrames maintain an index that serves as the row identifier, but this index doesn’t always stay in the order you expect. After merging datasets, filtering rows, or creating custom indices,…

Read more →
Pandas

How to Set Index in Pandas

Every pandas DataFrame has an index, whether you set one explicitly or accept the default integer sequence. The index isn’t just a row label—it’s the backbone of pandas’ data alignment system. When…

Read more →
Pandas

How to Shift Values in Pandas

Shifting values is one of the most fundamental operations in time series analysis and data manipulation. The pandas shift() method moves data up or down along an axis, creating offset versions of…

Read more →
Pandas

How to Select Columns in Pandas

Column selection is the bread and butter of pandas work. Before you can clean, transform, or analyze data, you need to extract the specific columns you care about. Whether you’re dropping irrelevant…

Read more →
Pandas

How to Reset Index in Pandas

Understanding how to manipulate DataFrame indexes is fundamental to working effectively with pandas. The index isn’t just a row label—it’s a powerful tool for data alignment, fast lookups, and…

Read more →
Pandas

How to Right Join in Pandas

A right join returns all rows from the right DataFrame and the matched rows from the left DataFrame. When there’s no match in the left DataFrame, the result contains NaN values for those columns.

Read more →
Pandas

How to Sample Random Rows in Pandas

Random sampling is fundamental to practical data work. You need it for exploratory data analysis when you can’t eyeball a million rows. You need it for creating train/test splits in machine learning…

Read more →
Pandas

How to Rename Columns in Pandas

Every data scientist has opened a CSV file only to find column names like Unnamed: 0, cust_nm_1, or Total Revenue (USD) - Q4 2023. Messy column names create friction throughout your analysis…

Read more →
Pandas

How to Rank Values in Pandas

Ranking assigns ordinal positions to values in a dataset. Instead of asking ‘what’s the value?’, you’re asking ‘where does this value stand relative to others?’ This distinction matters in countless…

Read more →
Pandas

How to Read CSV Files in Pandas

CSV files remain the lingua franca of data exchange. Despite the rise of Parquet, JSON, and database connections, you’ll encounter CSVs constantly—from client exports to API downloads to legacy…

Read more →
Pandas

How to Read Excel Files in Pandas

Excel files remain stubbornly ubiquitous in data workflows. Whether you’re receiving sales reports from finance, customer data from marketing, or research datasets from academic partners, you’ll…

Read more →
Pandas

How to Read JSON Files in Pandas

JSON has become the lingua franca of web APIs and configuration files. It’s human-readable, flexible, and ubiquitous. But flexibility comes at a cost—JSON’s nested, hierarchical structure doesn’t map…

Read more →
Pandas

How to Pivot a DataFrame in Pandas

Pivoting transforms data from a ’long’ format (many rows, few columns) to a ‘wide’ format (fewer rows, more columns). If you’ve ever received transactional data where each row represents a single…

Read more →
Pandas

How to One-Hot Encode in Pandas

One-hot encoding transforms categorical variables into a numerical format that machine learning algorithms can process. Most algorithms expect numerical input, and simply converting categories to…

Read more →
Pandas

How to Outer Join in Pandas

An outer join combines two DataFrames while preserving all records from both sides, regardless of whether a matching key exists. When a row from one DataFrame has no corresponding match in the other,…

Read more →
Pandas

How to Left Join in Pandas

A left join returns all rows from the left DataFrame and the matched rows from the right DataFrame. When there’s no match, the result contains NaN values for columns from the right DataFrame.

Read more →
Pandas

How to Merge DataFrames in Pandas

Every real-world data project involves combining datasets. You have customer information in one table, their transactions in another, and product details in a third. Getting useful insights means…

Read more →
Pandas

How to Merge on Index in Pandas

Most pandas tutorials focus on merging DataFrames using columns, but index-based merging is often the cleaner, faster approach—especially when your data naturally has meaningful identifiers like…

Read more →
Pandas

How to Iterate Over Rows in Pandas

Row iteration is one of those topics where knowing how to do something is less important than knowing when to do it. Pandas is built on NumPy, which processes entire arrays in optimized C code….

Read more →
Pandas

How to Join DataFrames in Pandas

Combining data from multiple sources is one of the most common operations in data analysis. Whether you’re merging customer records with transaction data, combining time series from different…

Read more →
Pandas

How to Label Encode in Pandas

Machine learning algorithms work with numbers, not text. When your dataset contains categorical columns like ‘color,’ ‘size,’ or ‘region,’ you need to convert these string values into numerical…

Read more →
Pandas

How to Inner Join in Pandas

An inner join combines two DataFrames by keeping only the rows where the join key exists in both tables. If a key appears in one DataFrame but not the other, that row gets dropped. This makes inner…

Read more →
Pandas

How to Handle MultiIndex in Pandas

Hierarchical indexing (MultiIndex) lets you work with higher-dimensional data in a two-dimensional DataFrame. Instead of creating separate DataFrames or adding redundant columns, you encode multiple…

Read more →
Pandas

How to GroupBy and Count in Pandas

Counting things is the foundation of data analysis. Before you build models or create visualizations, you need to understand what’s in your data: How many orders per customer? How many defects per…

Read more →
Pandas

How to GroupBy and Sum in Pandas

Grouping data by categories and calculating sums is one of the most common operations in data analysis. Whether you’re calculating total sales by region, summing expenses by department, or…

Read more →
Pandas

How to Forward Fill in Pandas

Forward fill is exactly what it sounds like: it takes the last known valid value and carries it forward to fill subsequent missing values. If you have a sensor reading at 10:00 AM and missing data at…

Read more →
Pandas

How to Filter NaN Values in Pandas

NaN values are the silent saboteurs of data analysis. They creep into your datasets from incomplete API responses, failed data entry, sensor malfunctions, or mismatched joins. Left unchecked, they’ll…

Read more →
Pandas

How to Filter Rows in Pandas

Row filtering is something you’ll do in virtually every pandas workflow. Whether you’re cleaning messy data, preparing subsets for analysis, or extracting records that meet specific criteria,…

Read more →
Pandas

How to Fill NaN Values in Pandas

Missing data is inevitable in real-world datasets. Whether it’s a sensor that failed to record a reading, a user who skipped a form field, or data that simply doesn’t exist for certain combinations,…

Read more →
Pandas

How to Fill NaN with Mean in Pandas

Missing data is inevitable. Whether you’re working with survey responses, sensor readings, or scraped web data, you’ll encounter NaN values that need handling before analysis or modeling. Mean…

Read more →
Pandas

How to Filter by Date in Pandas

Date filtering is one of the most common operations in data analysis. Whether you’re analyzing sales trends, processing server logs, or building financial reports, you’ll inevitably need to slice…

Read more →
Pandas

How to Explode a Column in Pandas

When working with real-world data, you’ll frequently encounter columns containing list-like values. Maybe you’re parsing JSON from an API, dealing with multi-select form fields, or processing…

Read more →
Pandas

How to Delete a Column in Pandas

Deleting columns from a DataFrame is one of the most frequent operations in data cleaning. Whether you’re removing irrelevant features before model training, dropping columns with too many null…

Read more →
Pandas

How to Cross Join in Pandas

A cross join, also called a Cartesian product, combines every row from one table with every row from another table. If DataFrame A has 3 rows and DataFrame B has 4 rows, the result contains 12…

Read more →
Pandas

How to Create a Crosstab in Pandas

A crosstab—short for cross-tabulation—is a table that displays the frequency distribution of variables. Think of it as a pivot table specifically designed for categorical data. When you need to…

Read more →
Pandas

How to Check Data Types in Pandas

Data types in Pandas aren’t just metadata—they determine what operations you can perform, how much memory your DataFrame consumes, and whether your calculations produce correct results. A column that…

Read more →
Pandas

How to Backward Fill in Pandas

Backward fill is a data imputation technique that fills missing values with the next valid observation in a sequence. Unlike forward fill, which carries previous values forward, backward fill looks…

Read more →
Pandas

How to Bin Data in Pandas

Binning—also called discretization or bucketing—converts continuous numerical data into discrete categories. You take a range of values and group them into bins, turning something like ‘age: 27’ into…

Read more →
Pandas

How to Add a New Column in Pandas

Adding columns to a Pandas DataFrame is one of the most common operations you’ll perform in data analysis. Whether you’re calculating derived metrics, categorizing data, or preparing features for…

Read more →