Source

Jan 27, 2026 Data Engineering

Spark Streaming - Rate Source for Testing

The rate source is a built-in streaming source in Spark Structured Streaming that generates rows at a specified rate. Unlike file-based or socket sources, it requires no external setup and produces…

Jan 26, 2026 Data Engineering

Spark Streaming - File Source Processing

Spark Structured Streaming treats file sources as unbounded tables, continuously monitoring a directory for new files. Unlike traditional batch processing, the file source uses checkpoint metadata to…

Jan 26, 2026 Data Engineering

Spark Streaming - Kafka Source Integration

Spark Structured Streaming integrates with Kafka through the kafka source format. The minimal configuration requires the bootstrap servers and a topic subscription.

Oct 30, 2025 Python

PySpark - Streaming from File Source

PySpark Structured Streaming treats file sources as unbounded tables, continuously monitoring directories for new files. Unlike batch processing, the streaming engine maintains state through…

Oct 30, 2025 Python

PySpark - Streaming from Socket Source

PySpark’s socket streaming provides a lightweight way to process real-time data streams over TCP connections, ideal for development, testing, and scenarios where you need to integrate with legacy…
