Spark Scala - Read JSON File
JSON remains the lingua franca of data interchange. APIs return it, logging systems emit it, and configuration files use it. When you’re building data pipelines with Apache Spark, you’ll inevitably…
Apache Parquet has become the de facto standard for storing analytical data in big data ecosystems. As a columnar storage format, Parquet stores data by column rather than by row, which provides…
CSV files refuse to die. Despite the rise of Parquet, ORC, and Avro, you’ll still encounter CSV in nearly every data engineering project. Legacy systems export it. Business users create it in Excel….
For simple CSV files without complex quoting or escaping, Scala’s standard library provides sufficient functionality. Use scala.io.Source to read files line by line and split on delimiters.
• R offers multiple CSV reading methods—base R’s read.csv() provides universal compatibility while readr::read_csv() delivers 10x faster performance with better type inference
The readxl package comes bundled with the tidyverse but can be installed independently. It reads both modern .xlsx files and legacy .xls formats without external dependencies.
Fixed-width files allocate specific character positions for each field. Unlike CSV files that use delimiters, these files rely on consistent positioning. A record might look like this:
The DBI (Database Interface) package provides a standardized way to interact with databases in R. RSQLite implements this interface for SQLite databases, offering a zero-configuration option that…
Base R handles simple URL reading through readLines() and url() connections. This works for plain text, CSV files, and basic HTTP requests without authentication.
The jsonlite package is the de facto standard for JSON operations in R. Install it once and load it for each session:
The most straightforward approach uses readlines(), which returns a list where each element represents a line from the file, including newline characters:
The readline() method reads a single line from a file, advancing the file pointer to the next line. This approach gives you explicit control over when and how lines are read.
The with statement is the standard way to read files in Python. It automatically closes the file even if an exception occurs, preventing resource leaks.
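A minimal sketch combining the three idioms above; the temp-file path and its contents are invented for the example:

```python
import os
import tempfile

# Create a throwaway sample file (contents invented for illustration).
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# readlines(): one list element per line, trailing newlines preserved.
with open(path) as f:
    lines = f.readlines()

# readline(): read exactly one line, advancing the file pointer each call.
with open(path) as f:
    first = f.readline()
    second = f.readline()
```

Each `with` block closes the file automatically, even if an exception is raised mid-read.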
File I/O operations form the backbone of data persistence in Python applications. Whether you’re processing CSV files, managing application logs, or storing user preferences, understanding file…
Reading JSON files into a PySpark DataFrame starts with the spark.read.json() method. This approach automatically infers the schema from the JSON structure.
PySpark’s JSON reader expects newline-delimited JSON (NDJSON) by default. Each line must contain a complete, valid JSON object:
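The NDJSON contract can be illustrated without Spark at all: parsing each line independently with the standard json module mirrors what the reader expects (the two records below are invented sample data):

```python
import json

# NDJSON: every line is a complete, self-contained JSON object.
ndjson = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# Parse line by line, exactly as a newline-delimited reader would.
records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
```

A pretty-printed JSON array spread across multiple lines would fail this line-by-line parse, which is why PySpark provides a multiLine option for such files.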
The simplest approach to reading multiple CSV files uses wildcard patterns. PySpark’s spark.read.csv() method accepts glob patterns to match multiple files simultaneously.
PySpark’s spark.read.json() method automatically infers schema from JSON files, including nested structures. Start with a simple nested JSON file:
ORC is a columnar storage format optimized for Hadoop workloads. Unlike row-based formats, ORC stores data by columns, enabling efficient compression and faster query execution when you only need…
Reading Parquet files in PySpark starts with initializing a SparkSession and using the DataFrame reader API. The simplest approach loads the entire file into memory as a distributed DataFrame.
PySpark requires the spark-xml package to read XML files. Install it via pip or include it when creating your Spark session.
PySpark’s spark.read.csv() method provides the simplest approach to load CSV files into DataFrames. The method accepts file paths from local filesystems, HDFS, S3, or other distributed storage…
• Defining custom schemas in PySpark eliminates costly schema inference and prevents data type mismatches that cause runtime failures in production pipelines
• PySpark’s inferSchema option automatically detects column data types by sampling data, but adds overhead by requiring an extra pass through the dataset—use it for exploration, disable it for…
Reading a Delta Lake table in PySpark requires minimal configuration. The Delta Lake format is built on top of Parquet files with a transaction log, making it straightforward to query.
PySpark’s native data source API supports formats like CSV, JSON, Parquet, and ORC, but Excel files require additional handling. Excel files are binary formats (.xlsx) or legacy binary formats (.xls)…
Before reading from Hive tables, configure your SparkSession to connect with the Hive metastore. The metastore contains metadata about tables, schemas, partitions, and storage locations.
• PySpark’s JDBC connector enables distributed reading from relational databases with automatic partitioning across executors, but requires careful configuration of partition columns and bounds to…
PySpark’s Structured Streaming API treats Kafka as a structured data source, enabling you to read from topics using the familiar DataFrame API. The basic connection requires the Kafka bootstrap…
• PySpark requires the spark-avro package to read Avro files, which must be specified during SparkSession initialization or provided at runtime via --packages
• Pandas read_json() handles multiple JSON structures including records, split, index, columns, and values orientations, with automatic type inference and nested data flattening capabilities
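A sketch of two of those read_json() orientations with invented data, wrapping the literal strings in StringIO so the example is self-contained:

```python
import io

import pandas as pd

# "records" orientation: a JSON array of row objects (invented sample data).
records_json = '[{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]'
df_records = pd.read_json(io.StringIO(records_json), orient="records")

# "split" orientation: separate lists for columns, index, and data.
split_json = '{"columns": ["x", "y"], "index": [0, 1], "data": [[1, "a"], [2, "b"]]}'
df_split = pd.read_json(io.StringIO(split_json), orient="split")
```

Both orientations describe the same table here; they differ only in how the JSON lays it out.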
• Use pd.read_excel() with the sheet_name parameter to read single, multiple, or all sheets from an Excel file into DataFrames or a dictionary of DataFrames
Parquet is a columnar storage format designed for analytical workloads. Unlike row-based formats like CSV, Parquet stores data by column, enabling efficient compression and selective column reading.
The usecols parameter in read_csv() is the most straightforward approach for reading specific columns. You can specify columns by name or index position.
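A small sketch with invented data showing both selection styles:

```python
import io

import pandas as pd

csv_data = "id,name,score,notes\n1,ana,90,x\n2,bo,85,y\n"  # invented sample

# Select columns by name...
by_name = pd.read_csv(io.StringIO(csv_data), usecols=["id", "score"])
# ...or by zero-based position; both yield the same two columns.
by_index = pd.read_csv(io.StringIO(csv_data), usecols=[0, 2])
```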
The read_sql() function executes SQL queries and returns results as a pandas DataFrame. It accepts both raw SQL strings and SQLAlchemy selectable objects, working with any database supported by…
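For a self-contained sketch, an in-memory SQLite database works with a plain sqlite3 connection; the table and rows below are invented:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with an invented sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "bo")])
conn.commit()

# read_sql() runs the query and returns the result set as a DataFrame.
df = pd.read_sql("SELECT id, name FROM users ORDER BY id", conn)
conn.close()
```

SQLite is the one database pandas accepts via a raw DBAPI2 connection; other databases go through a SQLAlchemy engine.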
The read_csv() function reads comma-separated value files into DataFrame objects. The simplest invocation requires only a file path:
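That minimal invocation might look like this; the CSV is written to a temp directory so the sketch runs anywhere:

```python
import os
import tempfile

import pandas as pd

# Write a small invented CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("a,b\n1,2\n3,4\n")

df = pd.read_csv(path)  # only the file path is required
```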
• Use skiprows parameter with integers, lists, or callable functions to exclude specific rows when reading CSV files, reducing memory usage and processing time for large datasets
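The three skiprows forms can be sketched side by side on the same invented input, where two junk lines precede the real header:

```python
import io

import pandas as pd

raw = "junk header\nmore junk\na,b\n1,2\n3,4\n"  # invented sample

# Integer: skip the first N lines.
df_int = pd.read_csv(io.StringIO(raw), skiprows=2)
# List: skip specific zero-based line numbers.
df_list = pd.read_csv(io.StringIO(raw), skiprows=[0, 1])
# Callable: skip any line whose index satisfies the predicate.
df_call = pd.read_csv(io.StringIO(raw), skiprows=lambda i: i < 2)
```

All three produce the same DataFrame here; the callable form is the one that scales to patterns like "every tenth row".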
The read_csv() function in Pandas defaults to comma separation, but real-world data files frequently use alternative delimiters. The sep parameter (or its alias delimiter) accepts any string or…
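Two common alternative delimiters, sketched with invented one-row samples:

```python
import io

import pandas as pd

# Semicolon-delimited sample (common in European locales).
semi = "a;b\n1;2\n"
df_semi = pd.read_csv(io.StringIO(semi), sep=";")

# Tab-delimited sample; sep accepts any string, including "\t".
tsv = "a\tb\n1\t2\n"
df_tab = pd.read_csv(io.StringIO(tsv), sep="\t")
```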
• CSV files can have various encodings (UTF-8, Latin-1, Windows-1252) that cause UnicodeDecodeError if not handled correctly—detecting and specifying the right encoding is critical for data integrity
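A sketch of the happy path: write a Latin-1 encoded file (the accented name is invented sample data) and read it back with the matching encoding:

```python
import os
import tempfile

import pandas as pd

# Write a Latin-1 encoded CSV to a temp path.
path = os.path.join(tempfile.mkdtemp(), "latin.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write("name\nJosé\n")

# Reading this file as UTF-8 would raise UnicodeDecodeError;
# specifying the actual encoding round-trips the data cleanly.
df = pd.read_csv(path, encoding="latin-1")
```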
The read_excel() function is your primary tool for importing Excel data into pandas DataFrames. At minimum, you only need the file path:
• read_fwf() handles fixed-width format files where columns are defined by character positions rather than delimiters, common in legacy systems and government data
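A minimal read_fwf() sketch on an invented fixed-width sample, with colspecs giving the (start, end) character offsets of each field:

```python
import io

import pandas as pd

# Fixed-width sample: fields live at fixed character positions, no delimiters.
raw = (
    "id  name    amount\n"
    "1   ana     10.50 \n"
    "2   bo      3.25  \n"
)
# colspecs: id occupies chars 0-3, name 4-11, amount 12-17.
df = pd.read_fwf(io.StringIO(raw), colspecs=[(0, 4), (4, 12), (12, 18)])
```

If colspecs is omitted, read_fwf() tries to infer the column boundaries from whitespace, which works for clean files but is fragile on ragged data.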
• Pandas integrates seamlessly with S3 through the s3fs library, allowing you to read files directly using standard read_csv(), read_parquet(), and other I/O functions with S3 URLs
The read_html() function returns a list of all tables found in the HTML source. Each table becomes a separate DataFrame, indexed by its position in the document.
• Pandas read_clipboard() provides instant data import from copied spreadsheet cells, eliminating the need for intermediate CSV files during exploratory analysis
While pandas dominates CSV loading in data science workflows, np.genfromtxt() offers advantages when you need direct NumPy array output without pandas overhead. For numerical computing pipelines,…
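A sketch of that direct-to-ndarray path on invented numeric data:

```python
import io

import numpy as np

csv_data = "1.0,2.0,3.0\n4.0,5.0,6.0\n"  # invented numeric sample

# genfromtxt returns a plain ndarray, skipping the pandas layer entirely.
arr = np.genfromtxt(io.StringIO(csv_data), delimiter=",")
```

The trade-off: no column labels or mixed dtypes per column, but the result drops straight into NumPy-based pipelines.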
Parquet is a columnar storage format that has become the de facto standard for analytical workloads. Unlike row-based formats like CSV where data is stored record by record, Parquet stores data…
Parquet has become the de facto standard for analytical data storage. Its columnar format, efficient compression, and schema preservation make it ideal for data engineering workflows. But the tool…
Parquet has become the de facto standard for storing analytical data in big data ecosystems, and for good reason. Its columnar storage format means you only read the columns you need. Built-in…
CSV files remain the lingua franca of data exchange. Despite the rise of Parquet, JSON, and database connections, you’ll encounter CSVs constantly—from client exports to API downloads to legacy…
Polars has rapidly become the go-to DataFrame library for Python developers who need speed without sacrificing usability. Built in Rust with a Python API, it consistently outperforms pandas on CSV…
CSV files refuse to die. Despite better alternatives like Parquet, Avro, and ORC, you’ll encounter CSV data constantly in real-world data engineering. Vendors export it, analysts create it, legacy…
Excel files remain stubbornly ubiquitous in data workflows. Whether you’re receiving sales reports from finance, customer data from marketing, or research datasets from academic partners, you’ll…
JSON has become the lingua franca of web APIs and configuration files. It’s human-readable, flexible, and ubiquitous. But flexibility comes at a cost—JSON’s nested, hierarchical structure doesn’t map…
Polars has become the go-to DataFrame library for performance-conscious Python developers. While pandas remains ubiquitous, Polars consistently benchmarks 5-20x faster for most operations, and JSON…
JSON has become the lingua franca of data interchange. Whether you’re processing API responses, application logs, configuration dumps, or event streams, you’ll inevitably encounter JSON files that…
Every developer has felt the pain: you’ve got a domain model that started clean and simple, but now it’s bloated with computed properties for display, lazy-loaded collections for reports, and…