How to Perform Polynomial Fitting in NumPy

Key Insights

  • NumPy provides two APIs for polynomial fitting: the legacy polyfit/polyval functions and the modern numpy.polynomial module, with the latter offering better numerical stability for most use cases.
  • Choosing the right polynomial degree is critical—too low causes underfitting, too high causes overfitting. Use residual analysis and domain knowledge rather than blindly maximizing fit.
  • Always consider data scaling when working with large x-values, as polynomial fitting can become numerically unstable without proper conditioning.

Introduction to Polynomial Fitting

Polynomial fitting is the process of finding a polynomial function that best approximates a set of data points. You’ve likely encountered it when drawing trend lines in spreadsheets or analyzing sensor data that follows a curved pattern.

The technique serves several practical purposes: modeling relationships between variables, smoothing noisy data, interpolating between known points, and extrapolating trends. Unlike machine learning approaches, polynomial fitting is deterministic, interpretable, and requires no training infrastructure.

NumPy provides robust tools for polynomial operations. You can fit polynomials to data, evaluate them at arbitrary points, and perform algebraic operations—all with a few function calls. This article covers both the legacy API and the modern polynomial module, giving you the knowledge to choose the right approach for your specific needs.

Understanding the Core Function: numpy.polyfit()

The numpy.polyfit() function is the traditional workhorse for polynomial fitting. It uses least squares regression to find coefficients that minimize the sum of squared residuals between your data and the fitted polynomial.

Here’s the basic syntax:

numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)

The essential parameters are:

  • x: Array of x-coordinates (independent variable)
  • y: Array of y-coordinates (dependent variable)
  • deg: Degree of the polynomial to fit

The function returns an array of coefficients, ordered from highest degree to lowest. For a degree-2 polynomial, you get [a, b, c] representing ax² + bx + c.

Let’s start with a linear fit:

import numpy as np

# Sample data with some noise
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1, 14.0, 16.3, 17.8, 20.1])

# Fit a degree-1 polynomial (linear)
coefficients = np.polyfit(x, y, deg=1)

print(f"Coefficients: {coefficients}")
print(f"Equation: y = {coefficients[0]:.3f}x + {coefficients[1]:.3f}")

Output:

Coefficients: [1.99151515 0.10666667]
Equation: y = 1.992x + 0.107

The coefficients tell us the slope is approximately 2 and the y-intercept is near 0, which matches our data’s underlying pattern.
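
Beyond the coefficients, np.polyfit can also report its own least-squares diagnostics when called with full=True. A quick sketch, reusing the data above:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1, 14.0, 16.3, 17.8, 20.1])

# full=True returns diagnostics alongside the coefficients:
# the sum of squared residuals, the rank of the design matrix,
# its singular values, and the rcond cutoff that was used
coeffs, residuals, rank, sv, rcond = np.polyfit(x, y, deg=1, full=True)

print(f"Sum of squared residuals: {residuals[0]:.4f}")
print(f"Design-matrix rank: {rank}")
```

A tiny residual sum and full rank here confirm the linear model is a good, well-posed fit for this data.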

Evaluating Polynomials with numpy.polyval()

Once you have coefficients, numpy.polyval() evaluates the polynomial at any x-value:

import numpy as np
import matplotlib.pyplot as plt

# Original data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1, 14.0, 16.3, 17.8, 20.1])

# Fit and evaluate
coefficients = np.polyfit(x, y, deg=1)

# Create smooth x values for plotting the fitted line
x_smooth = np.linspace(0, 11, 100)
y_fitted = np.polyval(coefficients, x_smooth)

# Predict specific values
y_at_5_5 = np.polyval(coefficients, 5.5)
print(f"Predicted y at x=5.5: {y_at_5_5:.2f}")

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y, color='blue', label='Original data', s=50)
plt.plot(x_smooth, y_fitted, color='red', label='Fitted line', linewidth=2)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True, alpha=0.3)
plt.title('Linear Polynomial Fit')
plt.show()

The key insight here is that polyval accepts both scalar and array inputs. Pass a single value for point predictions, or an array to generate a smooth curve for visualization.

Choosing the Right Polynomial Degree

Selecting the appropriate degree is where polynomial fitting becomes more art than science. A degree-1 polynomial (line) might miss curvature in your data. A degree-10 polynomial will pass through every point but oscillate wildly between them.

Here’s a practical comparison:

import numpy as np
import matplotlib.pyplot as plt

# Generate data with quadratic relationship plus noise
np.random.seed(42)
x = np.linspace(0, 10, 20)
y_true = 0.5 * x**2 - 2 * x + 3
y = y_true + np.random.normal(0, 2, len(x))

# Fit polynomials of different degrees
degrees = [1, 2, 5]
x_smooth = np.linspace(0, 10, 100)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, deg in zip(axes, degrees):
    coeffs = np.polyfit(x, y, deg)
    y_fitted = np.polyval(coeffs, x_smooth)
    
    # Calculate residuals and R²
    y_pred = np.polyval(coeffs, x)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    
    ax.scatter(x, y, color='blue', alpha=0.6, label='Data')
    ax.plot(x_smooth, y_fitted, color='red', linewidth=2, label=f'Degree {deg}')
    ax.set_title(f'Degree {deg} (R² = {r_squared:.4f})')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

The degree-1 fit (underfitting) misses the curvature entirely. The degree-2 fit captures the true relationship. The degree-5 fit has a marginally higher R² but introduces unnecessary complexity and will perform poorly on new data.

My recommendation: start with the lowest degree that captures the underlying pattern. Use domain knowledge—if you’re modeling projectile motion, a degree-2 polynomial makes physical sense. If you’re just smoothing noise, consider whether a polynomial is even appropriate.
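
One practical way to act on this advice is a simple held-out comparison: fit several degrees on part of the data and score each on the rest. A minimal sketch (the split scheme and variable names here are illustrative, not a prescribed method):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 0.5 * x**2 - 2 * x + 3 + rng.normal(0, 2, x.size)

# Hold out every third point as a validation set
val_mask = np.zeros(x.size, dtype=bool)
val_mask[::3] = True
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

val_mse = {}
for deg in range(1, 7):
    coeffs = np.polyfit(x_train, y_train, deg)
    val_mse[deg] = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    print(f"degree {deg}: validation MSE = {val_mse[deg]:.3f}")
```

Because the underlying relationship is quadratic, validation error should drop sharply from degree 1 to degree 2 and then flatten (or creep back up) as higher degrees start fitting noise.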

Working with numpy.polynomial.polynomial Module

NumPy’s newer polynomial module offers better numerical stability, especially for high-degree polynomials or data with large x-values. The API is object-oriented and more intuitive.

import numpy as np
from numpy.polynomial import Polynomial

# Same data as before
np.random.seed(42)
x = np.linspace(0, 10, 20)
y_true = 0.5 * x**2 - 2 * x + 3
y = y_true + np.random.normal(0, 2, len(x))

# Fit using the modern API
poly = Polynomial.fit(x, y, deg=2)

# The polynomial object is callable
x_smooth = np.linspace(0, 10, 100)
y_fitted = poly(x_smooth)

# poly.coef is expressed in an internal, scaled domain; convert()
# maps it to the standard basis (ordered low-to-high, the opposite of polyfit)
poly_standard = poly.convert()
print(f"Coefficients (low to high degree): {poly_standard.coef}")
print(f"Standard form: {poly_standard}")

Key differences from the legacy API:

  • Coefficients are ordered from lowest to highest degree
  • The Polynomial.fit() method automatically handles domain scaling internally
  • The returned object is callable and supports arithmetic operations
  • Better numerical conditioning for edge cases

For new projects, prefer the numpy.polynomial module unless you have a specific reason to use polyfit.
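
The arithmetic support mentioned above is worth a quick demonstration. A brief sketch of what Polynomial objects can do beyond fitting:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Build polynomials directly from coefficients (low to high degree)
p = Polynomial([3, -2, 0.5])   # 3 - 2x + 0.5x^2
q = Polynomial([1, 1])         # 1 + x

# Arithmetic returns new Polynomial objects
print((p + q).coef)    # coefficient-wise sum
print((p * q).coef)    # polynomial product

# Calculus and root finding are methods on the object
print(p.deriv().coef)  # derivative: -2 + x
print(p.roots())       # roots of 0.5x^2 - 2x + 3 (complex here)
```

These operations compose naturally, so you can differentiate a fitted trend or subtract one fit from another without touching raw coefficient arrays.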

Practical Example: Real-World Data Fitting

Let’s work through a complete example with realistic data—monthly sales figures that show seasonal growth:

import numpy as np
from numpy.polynomial import Polynomial
import matplotlib.pyplot as plt

# Simulated monthly sales data (24 months)
np.random.seed(42)  # make the noise reproducible
months = np.arange(1, 25)
# Underlying trend: growth with some seasonal variation
base_sales = 1000 + 50 * months + 2 * months**2
seasonal = 200 * np.sin(2 * np.pi * months / 12)
noise = np.random.normal(0, 100, len(months))
sales = base_sales + seasonal + noise

# Fit a quadratic trend (ignoring seasonality for now)
poly = Polynomial.fit(months, sales, deg=2)

# Generate predictions
months_extended = np.linspace(1, 30, 100)  # Extend 6 months into future
sales_trend = poly(months_extended)

# Calculate R² for the trend fit
sales_predicted = poly(months)
ss_res = np.sum((sales - sales_predicted) ** 2)
ss_tot = np.sum((sales - np.mean(sales)) ** 2)
r_squared = 1 - (ss_res / ss_tot)

# Calculate residuals for analysis
residuals = sales - sales_predicted

print(f"R² Score: {r_squared:.4f}")
print(f"Mean Absolute Residual: {np.mean(np.abs(residuals)):.2f}")

# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Main plot
axes[0].scatter(months, sales, color='blue', alpha=0.7, label='Actual Sales')
axes[0].plot(months_extended, sales_trend, color='red', linewidth=2, 
             label='Polynomial Trend')
axes[0].axvline(x=24, color='gray', linestyle='--', alpha=0.5, label='Forecast Start')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Sales ($)')
axes[0].set_title(f'Sales Trend Analysis (R² = {r_squared:.4f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residual plot
axes[1].bar(months, residuals, color='steelblue', alpha=0.7)
axes[1].axhline(y=0, color='red', linestyle='-', linewidth=1)
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Residual')
axes[1].set_title('Residuals (Actual - Predicted)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

The residual plot reveals the seasonal pattern we didn’t capture. This is valuable diagnostic information—it tells us a polynomial alone isn’t sufficient and we might need to add Fourier terms or use a different modeling approach.
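
As a sketch of that idea, one way to fold Fourier terms into the fit is to build a design matrix by hand and solve it with np.linalg.lstsq (the specific columns below are illustrative assumptions, not the only reasonable choice):

```python
import numpy as np

rng = np.random.default_rng(42)
months = np.arange(1, 25)
sales = (1000 + 50 * months + 2 * months**2
         + 200 * np.sin(2 * np.pi * months / 12)
         + rng.normal(0, 100, months.size))

# Design matrix: quadratic trend plus one annual Fourier pair
t = months.astype(float)
A = np.column_stack([
    np.ones_like(t),                 # intercept
    t,                               # linear trend
    t**2,                            # quadratic trend
    np.sin(2 * np.pi * t / 12),      # annual seasonality (sine)
    np.cos(2 * np.pi * t / 12),      # annual seasonality (cosine)
])

coeffs, *_ = np.linalg.lstsq(A, sales, rcond=None)
fitted = A @ coeffs

# Amplitude of the recovered seasonal component
print(f"Seasonal amplitude estimate: {np.hypot(coeffs[3], coeffs[4]):.1f}")
```

The recovered amplitude should land near the 200 used to generate the data, and the residuals of this combined fit no longer show the sinusoidal pattern the pure polynomial left behind.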

Common Pitfalls and Best Practices

Poorly Conditioned Data

When x-values are large (like years: 2020, 2021, 2022), polynomial fitting becomes numerically unstable. The modern Polynomial.fit() handles this automatically through internal domain mapping, but with polyfit, you should center and scale your data:

import numpy as np

# Problematic: large x values
years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
values = np.array([100, 120, 115, 140, 160, 175])

# Solution: center the data
years_centered = years - np.mean(years)
coeffs = np.polyfit(years_centered, values, deg=2)

# When predicting, remember to center new values too
new_year = 2024
prediction = np.polyval(coeffs, new_year - np.mean(years))
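
For comparison, the same data through the modern API needs no manual centering, since Polynomial.fit maps the x-values into a well-conditioned interval internally:

```python
import numpy as np
from numpy.polynomial import Polynomial

years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
values = np.array([100, 120, 115, 140, 160, 175])

# Polynomial.fit rescales the x-data internally,
# so raw years are fine as-is
poly = Polynomial.fit(years, values, deg=2)
print(f"Prediction for 2024: {poly(2024):.1f}")
```

Both approaches should agree to within floating-point noise; the modern API simply removes a step you can forget.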

Extrapolation Dangers

Polynomials can behave erratically outside the fitted range. A well-behaved quadratic fit can shoot off to infinity or negative values just beyond your data. Always visualize extrapolations and apply domain knowledge.
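
A small illustration of this: fit a modest-degree polynomial to one period of a sine wave, then evaluate it just beyond the data:

```python
import numpy as np

# Fit a degree-5 polynomial to one period of a sine wave
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)
coeffs = np.polyfit(x, y, deg=5)

# Inside the fitted range, the approximation is close...
print(f"At x = pi:   fit = {np.polyval(coeffs, np.pi):+.3f}, true = {np.sin(np.pi):+.3f}")

# ...but one period past the data, the polynomial diverges badly
x_out = 3 * np.pi
print(f"At x = 3*pi: fit = {np.polyval(coeffs, x_out):+.3f}, true = {np.sin(x_out):+.3f}")
```

Within the data range the fit tracks the sine closely, but at 3π the polynomial is off by an order of magnitude, while the true function is still bounded by 1.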

When Polynomials Aren’t Appropriate

Consider alternatives when:

  • Your data has discontinuities (use piecewise functions)
  • You need local flexibility without global effects (use splines via scipy.interpolate)
  • The relationship is fundamentally non-polynomial (exponential, logarithmic)
  • You need a very high degree to capture the shape (the least-squares fitting cost grows with both sample size and degree, and high-degree fits are numerically fragile)

Practical Checklist

  1. Visualize your data before fitting
  2. Start with the lowest reasonable degree
  3. Check residuals for patterns
  4. Validate on held-out data if possible
  5. Use the modern numpy.polynomial module for new code
  6. Be skeptical of high R² values with high-degree polynomials

Polynomial fitting remains a fundamental tool in data analysis. It’s fast, interpretable, and requires no external dependencies beyond NumPy. Master these techniques, and you’ll have a reliable method for trend analysis and curve fitting that works in production without the complexity of machine learning pipelines.
