import pandas as pd
# Toy dataset: monthly income, age, and the loan limit granted
data = [[4000, 25, 20000], [8000, 30, 70000], [5000, 28, 35000], [7500, 33, 50000], [12000, 40, 85000]]
df = pd.DataFrame(data, columns=["Monthly Income", "Age", "Loan Limit"])
df
Let $\theta_1$ be the parameter (coefficient) for age and $\theta_2$ the parameter for monthly income.
Fit a plane: $h_\theta (x) = \theta_0 + \theta_1x_1 + \theta_2x_2$
($\theta_0$ is the intercept term.)
# Same data with a constant column x0 = 1 prepended so the intercept theta0 can be folded into the parameter vector
data = [[1, 4000, 25, 20000], [1, 8000, 30, 70000], [1, 5000, 28, 35000], [1, 7500, 33, 50000], [1, 12000, 40, 85000]]
df = pd.DataFrame(data, columns=["x0", "Monthly Income (x2)", "Age (x1)", "Loan Limit"])
df
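For the matrix form used later, we can pull the design matrix $X$ (with the constant column $x_0 = 1$) and the target vector $y$ out of the DataFrame. A minimal sketch, assuming NumPy is available and using the column names above:

import numpy as np

# Columns ordered as [x0, x1 (age), x2 (income)] so they line up with theta = [theta0, theta1, theta2]
X = df[["x0", "Age (x1)", "Monthly Income (x2)"]].to_numpy(dtype=float)
y = df["Loan Limit"].to_numpy(dtype=float)  # target vector
print(X.shape, y.shape)  # (5, 3) (5,)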
Later we will assume the error term follows a Gaussian (normal) distribution; the plot below shows 1000 samples from $N(0,\,0.1^2)$ together with the corresponding density curve.
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)  # 1000 samples from N(mu, sigma^2)
# Histogram of the samples (as a density) with the Gaussian density curve overlaid
count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)), linewidth=2, color='r')
plt.show()
Combining these into a single expression (with $x_0 = 1$): $$ h_\theta (x) = \sum_{i=0}^{n}\theta_ix_i = \theta^\mathrm{T} x $$
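In code this compact form is just a matrix-vector product over all samples at once. A minimal sketch; the parameter values below are hypothetical placeholders, not fitted coefficients:

import numpy as np

X = df[["x0", "Age (x1)", "Monthly Income (x2)"]].to_numpy(dtype=float)
theta = np.array([1000.0, 300.0, 5.0])  # hypothetical [theta0, theta1, theta2]
h = X @ theta                           # h_theta(x^{(i)}) = theta^T x^{(i)} for every row
print(h)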
An error is the difference between the observed value and the true value (which is usually unobservable, being generated by the underlying data-generating process). A residual is the difference between the observed value and the value predicted by the model.
$\epsilon$ denotes the error term.
Error = Observed Value - True Value
Residual = Observed Value - Predicted Value
In statistics, the actual (observed) value is the value obtained by observation or measurement of the available data; the predicted value is the value produced by the regression model.
So, for each sample:
$$y^{(i)} = \theta^\mathrm{T} x^{(i)}+\epsilon^{(i)}$$
The error terms $\epsilon^{(i)}$ are assumed to be independent and identically distributed, following a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
Here we assume $\mu = 0$, so
$$P(y^{(i)}|x^{(i)}; \theta) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})} $$
Its likelihood function: $$L(\theta) = \prod\limits_{i=1}^{m}P(y^{(i)}|x^{(i)}; \theta) = \prod\limits_{i=1}^{m}\frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})}$$ and its log-likelihood function: $$\log L(\theta) = \log \prod\limits_{i=1}^{m}\frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})}$$
We are looking for the parameters $\theta$ that make the observed values as probable as possible. We convert the likelihood into a log-likelihood because the logarithm turns the product into a sum, which simplifies the calculation: $$\log AB = \log A + \log B$$
Thus $$\sum_{i=1}^{m}\log\frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})} = m\log\frac{1}{{\sigma \sqrt {2\pi } }} - \frac{1}{{\sigma ^2}}\times \frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2 $$
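As a quick numerical check of this decomposition (the residual values and sigma below are made up purely for illustration):

import numpy as np

sigma = 2.0                                  # hypothetical noise scale
r = np.array([0.5, -1.2, 0.3, 2.0, -0.7])    # hypothetical residuals y^(i) - theta^T x^(i)
m = len(r)

# Left-hand side: sum of log Gaussian densities of the residuals
lhs = np.sum(np.log(1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-r**2 / (2 * sigma**2))))
# Right-hand side: constant term minus the scaled sum of squares
rhs = m * np.log(1 / (sigma * np.sqrt(2 * np.pi))) - (1 / sigma**2) * 0.5 * np.sum(r**2)
print(np.isclose(lhs, rhs))  # True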
The first term, $m\log\frac{1}{{\sigma \sqrt {2\pi } }}$, does not depend on $\theta$, and the factor $\frac{1}{{\sigma ^2}} > 0$.
So maximizing the log-likelihood is equivalent to minimizing $$\frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2$$ which we refer to as $J(\theta)$.
Objective function: $$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 = \frac{1}{2}(X\theta - y)^\mathrm{T}(X\theta-y)$$
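A minimal sketch of $J(\theta)$ in code, computing both the summation form and the matrix form to show they agree (the theta vector is again a hypothetical placeholder):

import numpy as np

X = df[["x0", "Age (x1)", "Monthly Income (x2)"]].to_numpy(dtype=float)
y = df["Loan Limit"].to_numpy(dtype=float)
theta = np.array([1000.0, 300.0, 5.0])        # hypothetical [theta0, theta1, theta2]

J_sum = 0.5 * np.sum((X @ theta - y) ** 2)    # 1/2 * sum of squared residuals
resid = X @ theta - y
J_mat = 0.5 * (resid @ resid)                 # 1/2 * (X theta - y)^T (X theta - y)
print(np.isclose(J_sum, J_mat))  # True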
Its gradient with respect to $\theta$: $$\nabla _ \theta J(\theta) = \nabla _ \theta \left( \frac{1}{2}(X\theta - y)^\mathrm{T}(X\theta-y) \right) = \nabla _ \theta \left( \frac{1}{2}(\theta ^\mathrm{T} X^\mathrm{T} - y^\mathrm{T})(X\theta-y) \right) = \nabla _ \theta \left( \frac{1}{2}(\theta ^\mathrm{T} X^\mathrm{T} X \theta - \theta ^\mathrm{T} X^\mathrm{T}y - y^\mathrm{T}X \theta + y^\mathrm{T}y) \right) = \frac{1}{2}(2X^\mathrm{T}X \theta - X^\mathrm{T}y - (y^\mathrm{T} X)^\mathrm{T}) = X^\mathrm{T}X \theta - X^\mathrm{T}y$$
At the minimum, the gradient is zero; setting $\nabla_\theta J(\theta) = 0$ gives the normal equation: $$\theta = (X^\mathrm{T}X)^{-1} X^\mathrm{T}y $$
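Applied to the loan data above, the normal equation can be evaluated directly with NumPy; this sketch uses np.linalg.solve on $X^\mathrm{T}X\theta = X^\mathrm{T}y$ rather than forming the inverse explicitly:

import numpy as np

X = df[["x0", "Age (x1)", "Monthly Income (x2)"]].to_numpy(dtype=float)
y = df["Loan Limit"].to_numpy(dtype=float)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # theta = (X^T X)^{-1} X^T y
print(theta)                                # [theta0, theta1 (age), theta2 (income)]
y_hat = X @ theta                           # fitted loan limits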
RSS, the residual sum of squares: $ \sum_{i=1}^{m}(\hat{y}_i - y_i)^2 $
$$R^2 = 1 - \frac{{\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}}{{\sum_{i=1}^{m}(y_i - \bar{y})^2}}$$ The closer $R^2$ is to 1, the better the model fits the data.
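A short sketch of RSS and $R^2$, continuing from the normal-equation cell above (reusing y and the fitted values y_hat):

rss = np.sum((y_hat - y) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
print(rss, r2)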