Optimizing Machine Learning Cost Function with Regularization
1. Understanding Cost Function and Regularization
In our discussion, we explored a machine learning cost function with regularization:
Cost Function Formula
\[J(\vec{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2\]
- First Term (Error Term): Measures the average squared error between the predicted and actual values.
- Second Term (Regularization Term): Prevents overfitting by penalizing large weight values.
Key Parameters
- $m$: Number of training samples.
- $n$: Number of features or parameters in the model.
- $\lambda$: Regularization parameter controlling the strength of the penalty.
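To make the formula concrete, here is a minimal NumPy sketch of this cost for a linear model $f_{\vec{w}, b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$; the function name and signature are illustrative, not part of the original discussion.

```python
import numpy as np

def compute_cost(X, y, w, b, lam):
    """Regularized squared-error cost J(w, b) for a linear model.

    X   : (m, n) feature matrix
    y   : (m,)   target vector
    w   : (n,)   weight vector
    b   : scalar bias
    lam : regularization parameter lambda (>= 0)
    """
    m = X.shape[0]
    predictions = X @ w + b                       # f_{w,b}(x^(i))
    error_term = np.sum((predictions - y) ** 2) / (2 * m)
    reg_term = lam * np.sum(w ** 2) / (2 * m)     # note: b is not regularized
    return error_term + reg_term
```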
2. Selecting $\lambda$ and Optimizing Weights
2.1 Finding the Best $\lambda$
- Small $\lambda$: Low penalty; model may overfit.
- Large $\lambda$: High penalty; model may underfit.
- Method: Use cross-validation or grid search to determine the optimal $\lambda$, as in the sketch below.
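As one way to carry out such a search, this sketch runs a 5-fold cross-validated grid search over candidate $\lambda$ values using scikit-learn's Ridge regression on synthetic stand-in data. Note that Ridge's `alpha` plays the role of $\lambda$, though its internal scaling differs from the $\lambda/(2m)$ convention above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with your own training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = X_train @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

# Candidate lambda values spanning several orders of magnitude.
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

best_lam, best_score = None, -np.inf
for lam in lambdas:
    # Mean cross-validated score (higher = lower MSE).
    score = cross_val_score(Ridge(alpha=lam), X_train, y_train,
                            cv=5, scoring="neg_mean_squared_error").mean()
    if score > best_score:
        best_lam, best_score = lam, score

print(f"best lambda: {best_lam}")
```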
2.2 Updating Weight Vector $\vec{w}$ with Gradient Descent
Gradient Computation
\[\frac{\partial J(\vec{w}, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)} + \frac{\lambda}{m} w_j\]
\[\frac{\partial J(\vec{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)}\right)\]
Update Rule
\[w_j := w_j - \alpha \cdot \frac{\partial J(\vec{w}, b)}{\partial w_j}\]
\[b := b - \alpha \cdot \frac{\partial J(\vec{w}, b)}{\partial b}\]
- $\alpha$: Learning rate controlling the step size.
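Putting the gradients and the update rule together, here is a minimal NumPy sketch of batch gradient descent for this regularized linear model; the function name and default hyperparameters are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, lam, alpha=0.01, num_iters=1000):
    """Batch gradient descent implementing the update rules above."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        residual = X @ w + b - y                       # f_{w,b}(x^(i)) - y^(i)
        dj_dw = (X.T @ residual) / m + (lam / m) * w   # gradient w.r.t. each w_j
        dj_db = residual.mean()                        # gradient w.r.t. b
        w -= alpha * dj_dw                             # simultaneous parameter update
        b -= alpha * dj_db
    return w, b
```

Note that the regularization term adds $\frac{\lambda}{m} w_j$ only to the weight gradients; the bias $b$ is left unpenalized, matching the cost function above.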
3. Visualization and Extreme Examples
Example: Extreme Regularization Effects
We demonstrated the effect of $\lambda$ with two cases:
No Regularization ($\lambda=0$):
\[J(\vec{w}, b) = 2756\]
- The model focuses entirely on minimizing the error term.
Strong Regularization ($\lambda=100$):
\[J(\vec{w}, b) = 15256\]
- The regularization term dominates, heavily penalizing large weights.
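The specific values 2756 and 15256 come from the data and weights used in our discussion; the sketch below reproduces the same qualitative effect on synthetic data with deliberately oversized weights, so the exact numbers will differ.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Deliberately oversized weights so the penalty term is clearly visible.
w = np.array([20.0, -15.0, 10.0])
b = 0.0

m = X.shape[0]
for lam in (0.0, 100.0):
    err = np.sum((X @ w + b - y) ** 2) / (2 * m)
    reg = lam * np.sum(w ** 2) / (2 * m)
    print(f"lambda = {lam:>5.0f}: J = {err + reg:.1f} "
          f"(error {err:.1f} + penalty {reg:.1f})")
```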
Observations
- Small $\lambda$: Allows more flexibility but risks overfitting.
- Large $\lambda$: Reduces overfitting but risks underfitting.