ML Regularized cost function

Optimizing a Machine Learning Cost Function with Regularization

1. Understanding Cost Function and Regularization

In our discussion, we explored a machine learning cost function with regularization:

Cost Function Formula

\[J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2\]
  • First Term (Error Term): Measures the average squared error between the predicted and actual values.
  • Second Term (Regularization Term): Prevents overfitting by penalizing large weight values.

Key Parameters

  • $m$: Number of training samples.
  • $n$: Number of features or parameters in the model.
  • $\lambda$: Regularization parameter controlling the strength of the penalty.
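
To make the formula concrete, here is a minimal NumPy sketch of the regularized cost, assuming a linear model $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$ (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Regularized squared-error cost J(w, b) for a linear model f(x) = w·x + b."""
    m = X.shape[0]
    errors = X @ w + b - y                    # f(x^(i)) - y^(i) for every sample
    cost = (errors @ errors) / (2 * m)        # (1/2m) * sum of squared errors
    reg = (lambda_ / (2 * m)) * (w @ w)       # (lambda/2m) * sum of w_j^2 (b is not penalized)
    return cost + reg
```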

2. Selecting $\lambda$ and Optimizing Weights

2.1 Finding the Best $\lambda$

  • Small $\lambda$: Low penalty; model may overfit.
  • Large $\lambda$: High penalty; model may underfit.
  • Method: Use cross-validation or grid search to determine the optimal $\lambda$.
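
As a rough illustration of the grid-search approach, the sketch below tunes $\lambda$ with scikit-learn. Note that `Ridge`'s `alpha` parameter plays the role of $\lambda$ only up to a constant scaling of the penalty, and the toy data is made up for the example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a real training set (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, -2.0, 0.0, 3.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Try lambda values spread over several orders of magnitude, scored by 5-fold CV.
param_grid = {"alpha": np.logspace(-3, 3, 13)}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("best lambda (alpha):", search.best_params_["alpha"])
```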

2.2 Updating Weight Vector $\vec{w}$ with Gradient Descent

Gradient Computation

\[\frac{\partial J(\vec{w}, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)} + \frac{\lambda}{m} w_j\]

\[\frac{\partial J(\vec{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right)\]

Update Rule

\[w_j := w_j - \alpha \cdot \frac{\partial J(\vec{w}, b)}{\partial w_j}\]

\[b := b - \alpha \cdot \frac{\partial J(\vec{w}, b)}{\partial b}\]

  • $\alpha$: Learning rate controlling the step size.
  • All parameters $w_j$ and $b$ are updated simultaneously in each iteration, as in the sketch below.
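
A minimal batch gradient-descent sketch, assuming the same linear model and NumPy-style arrays as above (the helper names `gradient_step` and `gradient_descent` are made up for this post):

```python
import numpy as np

def gradient_step(X, y, w, b, alpha, lambda_):
    """One simultaneous update of w and b for the regularized squared-error cost."""
    m = X.shape[0]
    errors = X @ w + b - y                           # f(x^(i)) - y^(i)
    dj_dw = (X.T @ errors) / m + (lambda_ / m) * w   # dJ/dw_j, including the (lambda/m) * w_j term
    dj_db = errors.sum() / m                         # dJ/db has no regularization term
    return w - alpha * dj_dw, b - alpha * dj_db

def gradient_descent(X, y, alpha=0.01, lambda_=1.0, iters=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        w, b = gradient_step(X, y, w, b, alpha, lambda_)
    return w, b
```

Because $b$ is not regularized, only the weight gradient carries the $\frac{\lambda}{m} w_j$ term.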

3. Visualization and Extreme Examples

Example: Extreme Regularization Effects

We demonstrated the effect of the regularization term with two extreme cases:

  1. No Regularization ($\lambda=0$):

    \[J(\vec{w}, b) = 2756\]
    • Model focuses entirely on minimizing the error term.
  2. Strong Regularization ($\lambda=100$):

    \[J(\vec{w}, b) = 15256\]
    • Regularization dominates, heavily penalizing large weights.
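
To reproduce the flavor of this comparison (not the exact numbers above, since the underlying data and weights are not shown here), the `regularized_cost` sketch from earlier can be evaluated at $\lambda = 0$ and $\lambda = 100$ with deliberately oversized, made-up weights:

```python
import numpy as np

# Illustrative toy data and oversized weights; these are NOT the values behind 2756 / 15256.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -7.0, 2.5]) + rng.normal(size=50)
w, b = np.array([10.0, -15.0, 8.0]), 0.5

for lam in (0.0, 100.0):
    print(f"lambda={lam:>5}: J = {regularized_cost(X, y, w, b, lam):.1f}")
```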

Observations

  • Small $\lambda$: Allows more flexibility but risks overfitting.
  • Large $\lambda$: Reduces overfitting but risks underfitting.

This post is licensed under CC BY 4.0 by the author.