Engineering

Unique Paths: Grid Movement DP

Grid movement problems are the gateway drug to dynamic programming. They’re visual, intuitive, and map cleanly to the core DP concepts you’ll use everywhere else. The ‘unique paths’ problem—counting…

SQL - USING Clause in Joins

The USING clause is a syntactic shortcut for joining tables when the join columns share the same name. Instead of writing out the full equality condition, you simply specify the column name once…

SQL - Subquery in SELECT Clause

A subquery in the SELECT clause is a query nested inside the column list of your main query. Unlike subqueries in WHERE or FROM clauses, these must return exactly one value—a single row with a single…

SQL - Self Join with Examples

A self join is exactly what it sounds like: joining a table to itself. While this might seem circular at first, it’s one of the most practical SQL techniques for solving real-world data problems.

SQL - ROLLUP with Examples

ROLLUP is a GROUP BY extension that generates subtotals and grand totals in a single query. Instead of writing multiple queries and combining them with UNION ALL, you get hierarchical aggregations…

SQL - Natural Join

Natural join is SQL’s attempt at making joins effortless. Instead of explicitly specifying which columns should match between tables, a natural join automatically identifies columns with identical…

SQL - INNER JOIN with Examples

INNER JOIN is the workhorse of relational database queries. It combines rows from two or more tables based on a related column, returning only the rows where the join condition finds a match in both…

SQL - GROUP BY Multiple Columns

GROUP BY is fundamental to SQL analytics, but single-column grouping only gets you so far. Real business questions rarely fit into one dimension. You don’t just want total sales—you want sales by…

SQL - GROUPING SETS

GROUPING SETS solve a common analytical problem: you need aggregations at multiple levels in a single result set. Think sales totals by region, by product, by region and product combined, and a grand…

SQL - EXISTS and NOT EXISTS

EXISTS is one of SQL’s most underutilized operators. It answers a simple question: ‘Does at least one row exist that matches this condition?’ Unlike IN, which compares values, or JOINs, which combine…

SQL - FULL OUTER JOIN

A FULL OUTER JOIN combines the behavior of both LEFT and RIGHT joins into a single operation. It returns every row from both tables in the join, matching rows where possible and filling in NULL…

SQL - CUBE with Examples

CUBE is a GROUP BY extension that generates subtotals for all possible combinations of columns you specify. If you’ve ever built a pivot table in Excel or created a report that shows totals by…

SQL - Convert Date to String

Converting dates to strings is one of those tasks that seems trivial until you’re debugging a report that shows ‘2024-01-15’ in production but ‘01/15/2024’ in development. Date formatting affects…

SQL - Convert String to Date

Every database developer eventually faces the same problem: dates stored as strings. Whether it’s data imported from CSV files, user input from web forms, legacy systems that predate proper date…

SQL - ANY and ALL Operators

SQL’s ANY and ALL operators solve a specific problem: comparing a single value against a set of values returned by a subquery. While you could accomplish similar results with JOINs or EXISTS clauses,…

Splay Tree: Self-Adjusting BST

Splay trees are binary search trees that reorganize themselves with every operation. Unlike AVL or Red-Black trees that maintain strict balance invariants, splay trees take a different approach: they…

Spark Scala - Read JSON File

JSON remains the lingua franca of data interchange. APIs return it, logging systems emit it, and configuration files use it. When you’re building data pipelines with Apache Spark, you’ll inevitably…

Spark Scala - Window Functions

Window functions solve a fundamental problem in data processing: how do you compute values across multiple rows while keeping each row intact? Standard aggregations with GROUP BY collapse rows into…

Spark Scala - DataFrame Union

Union operations combine DataFrames vertically—stacking rows from multiple DataFrames into a single result. This differs fundamentally from join operations, which combine DataFrames horizontally…

Spark Scala - Kafka Integration

Streaming data pipelines have become the backbone of modern data architectures. Whether you’re processing clickstream data, IoT sensor readings, or financial transactions, the ability to handle data…

Spark Scala - RDD Operations

Resilient Distributed Datasets (RDDs) are Spark’s original abstraction for distributed data processing. While DataFrames and Datasets have become the preferred API for most workloads, understanding…

Spark Scala - Read CSV File

CSV files refuse to die. Despite the rise of Parquet, ORC, and Avro, you’ll still encounter CSV in nearly every data engineering project. Legacy systems export it. Business users create it in Excel…

Spark Scala - Build with SBT

If you’re building Spark applications in Scala, SBT should be your default choice. While Maven has broader enterprise adoption and Gradle offers flexibility, SBT provides native Scala support that…

R-Tree: Spatial Data Indexing

Traditional B-trees excel at one-dimensional data. Finding all users with IDs between 1000 and 2000 is straightforward—the data has a natural ordering. But what about finding all restaurants within 5…

R lubridate - Date Arithmetic

Date arithmetic sounds simple until you actually try to implement it. Adding 30 days to January 15th is straightforward. Adding ‘one month’ is not—does that mean 28, 29, 30, or 31 days? What happens…

R - format() Dates

Date formatting is one of those tasks that seems trivial until you’re debugging why your report shows ‘2024-01-15’ instead of ‘January 15, 2024’ at 2 AM before a client presentation. R’s format()…

Python - pow() Function

Python provides multiple ways to calculate powers, but the built-in pow() function stands apart with capabilities that go beyond simple exponentiation. While most developers reach for the **…

Python - Nested Loops

A nested loop is simply a loop inside another loop. The inner loop executes completely for each single iteration of the outer loop. This structure is fundamental when you need to work with…
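
As a minimal sketch of that behavior, the following records the order in which a nested loop visits its index pairs. The helper name `iteration_pairs` and the loop sizes are illustrative assumptions, not taken from the article.

```python
# Hypothetical helper for illustration; not part of the original article.
def iteration_pairs(outer_count, inner_count):
    """Return (outer, inner) index pairs in the order a nested loop visits them."""
    pairs = []
    for i in range(outer_count):        # outer loop: runs outer_count times
        for j in range(inner_count):    # inner loop: runs fully for each i
            pairs.append((i, j))
    return pairs


pairs = iteration_pairs(2, 3)
print(pairs)
```

With 2 outer and 3 inner iterations, the inner index cycles through 0, 1, 2 before the outer index advances, giving 2 × 3 = 6 pairs in total.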

Python - None Type Explained

Python’s None is a singleton object that represents the intentional absence of a value. It’s not zero, it’s not an empty string, and it’s not False—it’s the explicit statement that ‘there is…

Python - If/Elif/Else Statement

Every useful program makes decisions. Should we grant access to this user? Is this input valid? Does this order qualify for free shipping? Conditional statements are how you encode these decisions in…

Python - divmod() Function

Python’s divmod() function is one of those built-ins that many developers overlook, yet it solves a common problem elegantly: getting both the quotient and remainder from a division operation in…

Python - Data Types Overview

Python is dynamically typed, meaning you don’t declare variable types explicitly—the interpreter figures it out at runtime. This doesn’t mean Python is weakly typed; it’s actually strongly typed. You…

Python - Complex Numbers

Python includes complex numbers as a built-in numeric type, sitting alongside integers and floats. This isn’t a bolted-on afterthought—complex numbers are deeply integrated into the language,…

Python - chr() and ord() Functions

Every character you see on screen is stored as a number. The letter ‘A’ is 65. The digit ‘0’ is 48. The emoji ‘🐍’ is 128013. This mapping between characters and integers is called character encoding,…
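
The three values quoted above can be checked with Python's built-in `ord()` and reversed with `chr()`. This is a small sketch; the variable names are illustrative assumptions.

```python
# Map each sample character to its integer code point with ord().
code_points = {ch: ord(ch) for ch in ("A", "0", "🐍")}
print(code_points)

# chr() is the inverse: it turns each code point back into its character.
round_trip = [chr(n) for n in code_points.values()]
print(round_trip)
```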

Python - Boolean Operations

Python’s boolean type represents one of two values: True or False. These aren’t just abstract concepts—they’re first-class objects that inherit from int, making True equivalent to 1 and…
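
A brief sketch of what that inheritance allows in practice. The `passed` list is made-up sample data, and the variable names are assumptions for illustration.

```python
# bool is a subclass of int, so True compares equal to 1.
is_int_subclass = issubclass(bool, int)
true_equals_one = (True == 1)

# Because booleans behave as integers, sum() can count the True values.
passed = [True, False, True, True]   # hypothetical test results
pass_count = sum(passed)

print(is_int_subclass, true_equals_one, pass_count)
```

Counting with `sum()` is a common idiom, though `passed.count(True)` states the intent more directly when readability matters.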

Python - Bytes and Bytearray

Binary data is everywhere in software engineering. Every file on disk, every network packet, every image and audio stream exists as raw bytes. Python’s text strings (str) handle human-readable text…

PySpark: Handling Skewed Data

Data skew occurs when certain keys in your dataset appear far more frequently than others, causing uneven distribution of work across your Spark cluster. In a perfectly balanced world, each partition…

KISS Principle: Keep It Simple

The KISS principle—‘Keep It Simple, Stupid’—originated not in software but in aerospace. Kelly Johnson, the legendary engineer behind Lockheed’s Skunk Works, demanded that aircraft be designed so a…

How to Write to CSV in PySpark

CSV remains the lingua franca of data exchange. Despite its limitations—no schema enforcement, no compression by default, verbose storage—it’s universally readable. When you’re processing terabytes…

How to Use UDF in PySpark

PySpark’s built-in functions cover most data transformation needs, but real-world data is messy. You’ll inevitably encounter scenarios where you need custom logic: proprietary business rules, complex…

How to Use Map Type in PySpark

PySpark’s MapType is a complex data type that stores key-value pairs within a single column. Think of it as embedding a dictionary directly into your DataFrame schema. This becomes invaluable when…

How to Outer Join in PySpark

Every data engineer eventually hits the same problem: you need to combine two datasets, but they don’t perfectly align. Maybe you’re merging customer records with transactions, and some customers…

How to Left Join in PySpark

Left joins are the workhorse of data engineering. When you need to enrich a primary dataset with optional attributes from a secondary source, left joins preserve your complete dataset while pulling…

How to Inner Join in PySpark

Joins are the backbone of relational data processing. Whether you’re building ETL pipelines, preparing features for machine learning, or generating reports, you’ll spend a significant portion of your…

How to GroupBy in PySpark

GroupBy operations are the backbone of data analysis in PySpark. Whether you’re calculating sales totals by region, counting user events by session, or computing average response times by service,…

How to Filter Rows in PySpark

Row filtering is the bread and butter of data processing. Whether you’re cleaning messy datasets, extracting subsets for analysis, or preparing data for machine learning, you’ll filter rows…

How to Explode Arrays in PySpark

Array columns are everywhere in PySpark. Whether you’re parsing JSON from an API, processing log files with repeated fields, or working with denormalized data from a NoSQL database, you’ll eventually…

How to Cross Join in PySpark

A cross join, also called a Cartesian product, combines every row from one dataset with every row from another. Unlike inner or left joins that match rows based on key columns, cross joins have no…

Apache Spark vs Apache Flink

The big data processing landscape has consolidated around two dominant frameworks: Apache Spark and Apache Flink. Both can handle batch and stream processing, both scale horizontally, and both have…

Apache Spark - Partition Pruning

Partition pruning is Spark’s mechanism for skipping irrelevant data partitions during query execution. Think of it like a library’s card catalog system: instead of walking through every aisle to find…

Apache Spark - Column Pruning

Column pruning is one of Spark’s most impactful automatic optimizations, yet many developers never think about it—until their jobs run ten times slower than expected. The concept is straightforward:…
