R - Vectors - Create, Access, Modify
Atomic vectors store elements of a single type. Use c() to combine values or type-specific constructors for empty vectors.
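As a minimal sketch of the create, access, and modify steps named in the title (the variable names here are illustrative, not taken from the article):

```r
# Combine values with c(); every element shares one type
scores <- c(90, 85, 77)

# Type-specific constructors create vectors filled with a default value
empty_num <- numeric(3)    # three zeros
empty_chr <- character(2)  # two empty strings

# Access by 1-based position, then modify in place
second <- scores[2]
scores[3] <- 80

# Mixing types coerces everything to the most general type (character here)
mixed <- c(1, "a", TRUE)
```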
• The which() function returns integer positions of TRUE values in logical vectors, enabling precise element selection and manipulation in R data structures
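A short illustration of which() on a logical condition, using made-up temperature values:

```r
temps <- c(12, 25, 31, 18, 29)

# Positions where the condition is TRUE
hot_idx <- which(temps > 24)

# Use the positions to select the matching elements
hot_vals <- temps[hot_idx]

# which.max() returns the position of the largest value
peak_idx <- which.max(temps)
```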
The while loop in R evaluates a condition before each iteration. If the condition is TRUE, the code block executes; if FALSE, the loop terminates.
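The check-then-run behavior can be sketched with a small running total (names are illustrative):

```r
count <- 0
total <- 0

# The condition is tested before every iteration;
# the loop stops as soon as it evaluates to FALSE
while (total < 10) {
  count <- count + 1
  total <- total + count
}
```

Here the loop adds 1, 2, 3, and 4 in turn and stops once total reaches 10.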
The write.csv() function is R’s built-in solution for exporting data frames to CSV format. It’s a wrapper around write.table() with sensible defaults for comma-separated values.
The R ecosystem offers several Excel writing solutions: xlsx (Java-dependent), openxlsx (requires zip utilities), and writexl. The writexl package stands out by having zero external dependencies…
The tryCatch() function wraps code that might fail and defines handlers for different conditions. The basic syntax includes an expression to evaluate and named handler functions.
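A minimal sketch of that syntax, assuming a hypothetical helper called safe_log() that falls back to NA when the expression signals a warning or an error:

```r
# safe_log() is a hypothetical helper used only for illustration
safe_log <- function(x) {
  tryCatch(
    log(x),                               # expression to evaluate
    warning = function(w) NA_real_,       # e.g. log of a negative number
    error   = function(e) NA_real_        # e.g. non-numeric input
  )
}

ok      <- safe_log(10)
warned  <- safe_log(-1)
errored <- safe_log("a")
```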
• R uses <- as the primary assignment operator by convention, though = works in most contexts—understanding the subtle differences prevents unexpected scoping issues
Long-format data stores observations in rows where each row represents a single measurement. Wide-format data spreads these measurements across columns. pivot_wider() from the tidyr package…
The replace_na() function from tidyr provides a streamlined approach to handling missing data. It works with vectors, lists, and data frames, making it more versatile than base R’s is.na()…
• The separate() function splits one column into multiple columns based on a delimiter, with automatic type conversion and flexible handling of edge cases through parameters like extra and fill
The unite() function from the tidyr package merges multiple columns into one. The basic syntax requires the data frame, the name of the new column, and the columns to combine.
Five dplyr verbs handle 90% of data manipulation tasks. Master these before anything else.
• The t-test determines whether means of two groups differ significantly, with three variants: one-sample (comparing to a known value), two-sample (independent groups), and paired (dependent…
The table() function counts occurrences of unique values in vectors or factor combinations. It returns an object of class ’table’ that behaves like a named array.
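For instance, counting values in a small character vector (sample data invented for illustration):

```r
colors <- c("red", "blue", "red", "green", "red", "blue")

# Count how often each unique value occurs
counts <- table(colors)

# Index by name like a named array; names are sorted alphabetically
red_count   <- counts["red"]
value_names <- names(counts)
```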
Implicit missing values are combinations of variables that don’t appear in your dataset but should exist based on the data’s structure. These are fundamentally different from explicit NA values that…
The drop_na() function from tidyr provides a targeted approach to handling missing data in data frames. While base R’s na.omit() removes any row with at least one NA value across all columns,…
Both expand_grid() and crossing() create data frames containing all possible combinations of their input vectors. They’re essential for generating test scenarios, creating complete datasets for…
The fill() function from tidyr addresses a common data cleaning challenge: missing values that should logically carry forward from previous observations. This occurs frequently in spreadsheet-style…
List-columns are the foundation of tidyr’s nesting capabilities. Unlike typical data frame columns that contain atomic vectors (numeric, character, logical), list-columns contain lists where each…
• pivot_longer() transforms wide-format data into long format by converting column names into values of a new variable, essential for tidy data analysis and visualization in R
• The subset() function provides an intuitive way to filter rows and select columns from data frames using logical conditions without repetitive bracket notation or the $ operator
R’s switch() function evaluates an expression and returns a value based on the match. Unlike traditional switch statements in languages like C or Java, R’s implementation returns values rather than…
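A small sketch of switch() returning a value, wrapped in a hypothetical describe() function:

```r
# describe() is a hypothetical wrapper used only for illustration
describe <- function(kind) {
  switch(kind,
    mean   = "average value",
    median = "middle value",
    "unknown statistic"  # unnamed final argument acts as the default
  )
}

known   <- describe("median")
unknown <- describe("mode")
```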
R provides two native binary formats for persisting objects: RDS and RData. RDS files store a single R object, while RData files can store multiple objects from your workspace. Both formats preserve…
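The difference between the two formats can be sketched as follows, writing to temporary files so nothing is left behind (object names are illustrative):

```r
# RDS: one object per file; the caller chooses the name on read
inventory <- data.frame(item = c("bolt", "nut"), qty = c(10L, 25L))
rds_path <- tempfile(fileext = ".rds")
saveRDS(inventory, rds_path)
restored <- readRDS(rds_path)

# RData: several objects per file, restored under their original names
a <- 1:3
b <- "text"
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)
rm(a, b)
load(rdata_path)  # recreates a and b in the current environment
```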
• The reshape() function transforms data between wide format (multiple columns per subject) and long format (one row per observation) without external packages
R implements object-oriented programming differently than languages like Java or Python. Instead of methods belonging to objects, R uses generic functions that dispatch to appropriate methods based…
Variance measures how far data points spread from their mean. It’s calculated by averaging the squared differences from the mean (R’s var() divides by n - 1, giving the sample variance). Standard deviation is simply the square root of variance,…
• R offers multiple CSV reading methods: base R’s read.csv() provides universal compatibility, while readr::read_csv() is typically faster on large files and offers better type inference
The readxl package is installed with the tidyverse (though library(tidyverse) does not attach it) and can also be installed independently. It reads both modern .xlsx files and legacy .xls formats without external dependencies.
Fixed-width files allocate specific character positions for each field. Unlike CSV files that use delimiters, these files rely on consistent positioning. A record might look like this:
The DBI (Database Interface) package provides a standardized way to interact with databases in R. RSQLite implements this interface for SQLite databases, offering a zero-configuration option that…
Base R handles simple URL reading through readLines() and url() connections. This works for plain text, CSV files, and basic HTTP requests without authentication.
The jsonlite package is the de facto standard for JSON operations in R. Install it once and load it for each session:
While map() handles single-input iteration elegantly, real-world data operations frequently require coordinating multiple inputs. Consider calculating weighted averages, combining data from…
• possibly() and safely() transform functions into error-resistant versions that return default values or captured error objects instead of halting execution
library(purrr)
R’s mean() function calculates the arithmetic average of numeric vectors. The function handles NA values through the na.rm parameter, essential for real-world datasets with missing data.
The merge() function combines two data frames based on common columns, similar to SQL JOIN operations. The basic syntax requires the two data frames, with optional parameters controlling join…
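A brief sketch of how the all, all.x, and by arguments map onto SQL-style joins, using two invented tables:

```r
employees <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
salaries  <- data.frame(id = c(2, 3, 4), salary = c(50, 60, 70))

# Inner join (default): only ids present in both tables
inner <- merge(employees, salaries, by = "id")

# Left join: keep every row of the first table, filling gaps with NA
left <- merge(employees, salaries, by = "id", all.x = TRUE)

# Full outer join: keep rows from both tables
full <- merge(employees, salaries, by = "id", all = TRUE)
```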
• R provides four core functions for working with normal distributions: dnorm() for probability density, pnorm() for cumulative probability, qnorm() for quantiles, and rnorm() for random…
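The four functions can be tried side by side on the standard normal distribution; the seed and sample size below are arbitrary choices:

```r
density_at_mean <- dnorm(0)      # height of the density curve at x = 0
prob_below      <- pnorm(1.96)   # P(X <= 1.96), approximately 0.975
cutoff          <- qnorm(0.975)  # inverse of pnorm, approximately 1.96

set.seed(42)                     # make the random draws reproducible
draws <- rnorm(5, mean = 10, sd = 2)
```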
• keep() and discard() filter lists and vectors using predicate functions, providing a more expressive alternative to bracket subsetting when working with complex filtering logic
Base R’s lapply() always returns a list. You then coerce it to your desired type, often discovering type mismatches late in execution. The purrr approach enforces types immediately:
The purrr package revolutionizes functional programming in R by providing a consistent, predictable interface for iteration. While base R’s lapply() works, map() offers superior error handling,…
R packages extend base functionality through collections of functions, data, and documentation. The primary installation source is CRAN (Comprehensive R Archive Network), accessed through…
The lm() function fits linear models using the formula interface y ~ x1 + x2 + .... The function returns a model object containing coefficients, residuals, fitted values, and statistical…
• Lists in R are heterogeneous data structures that can contain elements of different types, including vectors, data frames, functions, and even other lists, making them the most flexible container…
Logistic regression models the probability of a binary outcome using a logistic function. Unlike linear regression, which predicts continuous values, logistic regression outputs probabilities…
R offers multiple approaches to create matrices. The matrix() function is the most common method, taking a vector of values and organizing them into rows and columns.
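A quick sketch of matrix() with its default column-wise fill and the byrow alternative:

```r
# Values fill column by column unless byrow = TRUE
m <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)

# Index with [row, column]
last_cell  <- m[2, 3]
first_col2 <- by_row[2, 1]
```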
Hypothesis testing follows a structured approach: formulate a null hypothesis (H0) representing no effect or difference, define an alternative hypothesis (H1), collect data, calculate a test…
R’s conditional statements follow a straightforward structure. Although most R operations are vectorized and apply element-wise, the base if statement evaluates a single logical value.
• The ifelse() function provides vectorized conditional logic, evaluating conditions element-wise across vectors and returning values based on TRUE/FALSE results
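An element-wise example, with invented exam marks and a missing value to show how NA propagates:

```r
marks <- c(55, 72, 90, NA)

# The condition is evaluated for each element; NA in the test yields NA
outcome <- ifelse(marks >= 60, "pass", "fail")
```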
The fundamental structure of a ggplot2 line plot combines the ggplot() function with geom_line(). The data must include at least two continuous variables: one for the x-axis and one for the…
• The patchwork package provides intuitive operators (+, /, |) for combining ggplot2 plots with minimal code, making it the modern standard for multi-plot layouts
The ggsave() function provides a streamlined approach to exporting ggplot2 visualizations. At its simplest, you specify a filename and the function handles the rest.
The fundamental ggplot2 scatter plot requires a dataset, aesthetic mappings, and a point geometry layer. Here’s the minimal implementation:
• Violin plots combine box plots with kernel density estimation to show the full distribution shape of your data, making them superior for revealing multimodal distributions and data density patterns…
R functions follow a straightforward structure using the function keyword. The basic anatomy includes parameters, a function body, and an optional explicit return statement.
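That anatomy can be sketched with a hypothetical rect_area() function that has one required parameter and one default:

```r
# rect_area() is a hypothetical example function
rect_area <- function(width, height = 1) {
  area <- width * height
  return(area)  # optional: the last evaluated expression is returned anyway
}

both_args   <- rect_area(3, 4)
default_arg <- rect_area(5)
```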
The labs() function provides the most straightforward approach to adding labels in ggplot2. It handles titles, subtitles, captions, and axis labels in a single function call.
ggplot2 creates bar plots through two primary geoms: geom_bar() and geom_col(). Understanding their difference prevents common confusion. geom_bar() counts observations by default, while…
Box plots display the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. In ggplot2, creating a box plot requires mapping a categorical variable to the…
Install ggplot2 from CRAN or load it as part of the tidyverse:
ggplot2 provides dedicated scale functions for every aesthetic mapping. For discrete data, scale_color_manual() and scale_fill_manual() offer complete control over color assignment.
Faceting creates small multiples—a series of similar plots using the same scale and axes, allowing you to compare patterns across subsets of your data. Instead of overlaying multiple groups on a…
The fundamental histogram in ggplot2 requires a dataset and a continuous variable mapped to the x-axis. The geom_histogram() function automatically bins the data and counts observations.
• ggplot2 provides granular control over legend appearance through theme(), guides(), and scale functions, allowing you to position, style, and organize legends to match publication requirements
• R uses lexical scoping with four environment types (global, function, package, empty), where variable lookup follows a parent chain until reaching the empty environment
Factors represent categorical variables in R, internally stored as integer vectors with associated character labels called levels. This dual nature makes factors memory-efficient while maintaining…
R for loops iterate over elements in a sequence, executing a code block for each element. The basic syntax follows the pattern for (variable in sequence) { expression }.
The select() function from dplyr extracts columns from data frames using intuitive syntax. Unlike base R’s bracket notation, select() always returns a data frame of the same class as its input (never a bare vector) and allows unquoted column names.
• The select() function in dplyr offers helper functions that match column names by patterns, eliminating tedious manual column specification and reducing errors in data manipulation workflows
The slice() function selects rows by their integer positions. Unlike filter() which uses logical conditions, slice() works with row numbers directly.
The summarise() function from dplyr condenses data frames into summary statistics. At its core, it takes a data frame and returns a smaller one containing computed aggregate values.
The dplyr package superseded top_n() in version 1.0.0, recommending slice_max() and slice_min() as replacements. This wasn’t arbitrary: top_n() had ambiguous behavior, particularly around tie…
Joins combine two dataframes based on shared key columns. Each join type handles non-matching rows differently, which directly impacts your result set size and content.
The mutate() function from dplyr adds new variables or transforms existing ones in your data frame. Unlike base R’s approach of modifying columns with $ or [], mutate() keeps your data…
• n() counts rows within groups while n_distinct() counts unique values, forming the foundation of aggregation operations in dplyr
The ntile() function from dplyr divides a vector into N bins of approximately equal size. It assigns each observation a bin number from 1 to N based on its rank in ascending order. This differs…
The pipe operator revolutionizes R code readability by eliminating nested function calls. Instead of writing function3(function2(function1(data))), you write `data %>% function1() %>% function2()…
The relocate() function from dplyr moves columns to new positions within a data frame. By default, it moves specified columns to the leftmost position.
The rename() function from dplyr uses a straightforward syntax where you specify the new name on the left and the old name on the right. This reversed assignment feels natural when reading code…
The dplyr package provides three distinct ranking functions that assign positional values to rows. While they appear similar, their handling of tied values creates fundamentally different outputs.
The case_when() function evaluates conditions from top to bottom, returning the right-hand side value when a condition evaluates to TRUE. Each condition follows the formula syntax: `condition ~…
dplyr transforms data manipulation in R by providing a grammar of data manipulation. Instead of learning dozens of functions with inconsistent interfaces, you master five verbs that combine to solve…
The dplyr package provides two complementary functions for counting observations: count() and tally(). While both produce frequency counts, they differ in their workflow position. count()…
The distinct() function from dplyr identifies and removes duplicate rows from data frames. Unlike base R’s unique(), it works naturally with tibbles and integrates into pipe-based workflows.
The filter() function from dplyr selects rows where conditions evaluate to TRUE. Unlike base R subsetting with brackets, filter() drops rows where the condition evaluates to NA and integrates cleanly into piped…
The filter() function from dplyr accepts multiple conditions separated by commas, which implicitly creates an AND relationship. Each condition must evaluate to a logical vector.
The group_by() function transforms a regular data frame into a grouped tibble, which subsequent operations treat as separate partitions. This grouping is metadata—the physical data structure…
The fundamental distinction between if_else() and ifelse() lies in type checking. if_else() enforces strict type consistency between the true and false branches, preventing silent type coercion…
• The lag() and lead() functions shift values within a vector by a specified number of positions, essential for time-series analysis, calculating differences between consecutive rows, and…
The data.table package addresses fundamental performance limitations in base R. While modifying a data.frame often copies data, data.table uses reference semantics and…
The across() function operates within dplyr verbs like mutate(), summarise(), and filter(). Its basic structure takes a column selection and a function to apply:
The dplyr package provides two filtering joins that differ fundamentally from mutating joins like inner_join() or left_join(). While mutating joins combine columns from both tables, filtering…
The arrange() function from dplyr provides an intuitive interface for sorting data frames. Unlike base R’s order(), it returns the entire data frame in sorted order rather than just indices.
The between() function in dplyr filters rows where values fall within a specified range, inclusive of both boundaries. The syntax is straightforward:
library(dplyr)
• Chi-square tests evaluate relationships between categorical variables, with the test of independence being most common for analyzing contingency tables and the goodness-of-fit test validating…
• R is a specialized language for statistical computing and data visualization, with a syntax optimized for vectorized operations that eliminate most explicit loops
• Confidence intervals quantify estimation uncertainty by providing a range of plausible values for population parameters, with the 95% level being standard practice in most fields
The cor() function computes correlation coefficients between numeric vectors or matrices. The most common method is Pearson correlation, which measures linear relationships between variables.
R packages aren’t just for CRAN distribution. Any collection of functions you use repeatedly across projects benefits from package structure. You get automatic dependency management, integrated help…
The data.frame() function constructs a data frame from vectors. Each vector becomes a column, and all vectors must have equal length.
The cut() function divides a numeric vector into intervals and returns a factor representing which interval each value falls into. The basic syntax requires two arguments: the data vector and the…
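A small sketch of cut() with explicit breaks and labels; the age bands are invented for illustration:

```r
ages <- c(5, 17, 30, 65, 80)

# Intervals are right-closed by default: (0,18], (18,65], (65,Inf]
groups <- cut(ages,
              breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))
```

Because the default is right = TRUE, the value 65 falls in the "adult" interval rather than "senior".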
Data frames store tabular data with columns of potentially different types. The data.frame() function constructs them from vectors, lists, or other data frames.
R operates with six atomic vector types: logical, integer, numeric (double), complex, character, and raw. This article focuses on the four essential types you’ll use daily: numeric, character,…
• R data frames support multiple indexing methods including bracket notation [], double brackets [[]], and the $ operator, each with distinct behaviors for subsetting rows and columns
• Data frames in R support multiple methods for adding columns: direct assignment ($), bracket notation ([]), and functions like cbind() and mutate() from dplyr
The most straightforward approach uses rbind() to bind rows together. Create a new row as a data frame or list with matching column names:
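One way this might look, using an invented two-column data frame:

```r
people <- data.frame(name = c("Ana", "Ben"), age = c(30, 25))

# The new row is a one-row data frame with the same column names
new_row <- data.frame(name = "Cy", age = 41)

# rbind() matches columns by name and appends the row
people <- rbind(people, new_row)
```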
• The aggregate() function provides a straightforward approach to split-apply-combine operations, computing summary statistics across grouped data without external dependencies
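A minimal split-apply-combine sketch with the formula interface, on invented sales figures:

```r
sales <- data.frame(
  region = c("N", "S", "N", "S"),
  amount = c(10, 20, 30, 40)
)

# Split by region, apply sum() to amount, combine into one data frame
totals <- aggregate(amount ~ region, data = sales, FUN = sum)
```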
ANOVA partitions total variance into between-group and within-group components. The F-statistic compares these variances: if between-group variance significantly exceeds within-group variance, at…
The apply family functions provide vectorized operations across R data structures. They replace traditional for-loops with functional programming patterns, reducing code complexity and often…
Arrays are homogeneous data structures that extend beyond two dimensions. While vectors are one-dimensional and matrices are two-dimensional, arrays can have any number of dimensions. All elements…