Spark

SQL

Spark SQL - Window Functions Tutorial

Window functions perform calculations across a set of rows that are related to the current row. Unlike aggregate functions with GROUP BY that collapse multiple rows into one, window functions…

Read more →
SQL

Spark SQL - Hive Integration

To enable Hive support in Spark, you need the Hive dependencies and proper configuration. First, ensure your spark-defaults.conf or application code includes Hive metastore connection details:

Read more →
SQL

Spark SQL - JSON Functions

• Spark SQL provides over 20 specialized JSON functions for parsing, extracting, and manipulating JSON data directly within DataFrames without requiring external libraries or UDFs

Read more →
SQL

Spark SQL - Managed vs External Tables

Spark SQL supports two table types that differ in how they manage data lifecycle and storage. Managed tables (also called internal tables) give Spark full control over both metadata and data files….

Read more →
SQL

Spark SQL - Map Functions

• Map functions in Spark SQL enable manipulation of key-value pair structures through native SQL syntax, eliminating the need for complex UDFs or RDD operations in most scenarios

Read more →
SQL

Spark SQL - Struct Type Operations

Struct types represent complex data structures within a single column, similar to objects in programming languages or nested JSON documents. Unlike primitive types, structs contain multiple named…

Read more →
SQL

Spark SQL - Aggregate Functions

Spark SQL provides comprehensive aggregate functions that operate on grouped data. The fundamental pattern involves grouping rows by one or more columns and applying aggregate functions to compute…

Read more →
SQL

Spark SQL - Array Functions

• Spark SQL provides 50+ array functions that enable complex data transformations without UDFs, significantly improving performance through Catalyst optimizer integration and whole-stage code…

Read more →
SQL

Spark SQL - Catalog API

The Spark Catalog API exposes metadata operations through the SparkSession.catalog object. This interface abstracts the underlying metastore implementation, whether you’re using Hive, Glue, or…

Read more →
SQL

Spark SQL - Create Database and Tables

Spark SQL databases are logical namespaces that organize tables and views. By default, Spark creates a default database, but production applications require proper database organization for better…

Read more →
SQL

Spark SQL - Data Types Reference

• Spark SQL supports 20+ data types organized into numeric, string, binary, boolean, datetime, and complex categories, with specific handling for nullable values and schema evolution

Read more →
Engineering

Spark Scala - Read JSON File

JSON remains the lingua franca of data interchange. APIs return it, logging systems emit it, and configuration files use it. When you’re building data pipelines with Apache Spark, you’ll inevitably…

Read more →
Engineering

Spark Scala - Window Functions

Window functions solve a fundamental problem in data processing: how do you compute values across multiple rows while keeping each row intact? Standard aggregations with GROUP BY collapse rows into…

Read more →
Engineering

Spark Scala - DataFrame Union

Union operations combine DataFrames vertically—stacking rows from multiple DataFrames into a single result. This differs fundamentally from join operations, which combine DataFrames horizontally…

Read more →
Engineering

Spark Scala - Kafka Integration

Streaming data pipelines have become the backbone of modern data architectures. Whether you’re processing clickstream data, IoT sensor readings, or financial transactions, the ability to handle data…

Read more →
Engineering

Spark Scala - RDD Operations

Resilient Distributed Datasets (RDDs) are Spark’s original abstraction for distributed data processing. While DataFrames and Datasets have become the preferred API for most workloads, understanding…

Read more →
Engineering

Spark Scala - Read CSV File

CSV files refuse to die. Despite the rise of Parquet, ORC, and Avro, you’ll still encounter CSV in nearly every data engineering project. Legacy systems export it. Business users create it in Excel….

Read more →
Engineering

Spark Scala - Build with SBT

If you’re building Spark applications in Scala, SBT should be your default choice. While Maven has broader enterprise adoption and Gradle offers flexibility, SBT provides native Scala support that…

Read more →
Engineering

Apache Spark vs Apache Flink

The big data processing landscape has consolidated around two dominant frameworks: Apache Spark and Apache Flink. Both can handle batch and stream processing, both scale horizontally, and both have…

Read more →
Engineering

Apache Spark - Partition Pruning

Partition pruning is Spark’s mechanism for skipping irrelevant data partitions during query execution. Think of it like a library’s card catalog system: instead of walking through every aisle to find…

Read more →
Engineering

Apache Spark - Column Pruning

Column pruning is one of Spark’s most impactful automatic optimizations, yet many developers never think about it—until their jobs run ten times slower than expected. The concept is straightforward:…

Read more →