Linear regression

Case

Data: monthly income and age

Estimate the loan limit offered by the bank

How income and age affect the loan limit

In [6]:
import pandas as pd
data = [[4000, 25, 20000], [8000, 30, 70000], [5000, 28, 35000], [7500., 33, 50000], [12000, 40, 85000]]
df = pd.DataFrame(data, columns=["Monthly Income", "Age", "Loan Limit"])
df
Out[6]:
Monthly Income Age Loan Limit
0 4000.0 25 20000
1 8000.0 30 70000
2 5000.0 28 35000
3 7500.0 33 50000
4 12000.0 40 85000

Let $\theta_1$ be the parameter for age and $\theta_2$ the parameter for monthly income. Fit a plane: $h_\theta (x) = \theta_0 + \theta_1x_1 + \theta_2x_2$
($\theta_0$ is the intercept term)
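
As a quick illustration (a minimal sketch; the $\theta$ values below are arbitrary placeholders, not fitted parameters), the planar hypothesis can be evaluated directly:

import numpy as np

# Placeholder parameters: [theta_0 (intercept), theta_1 (age), theta_2 (monthly income)]
theta = np.array([1000.0, 500.0, 5.0])

def h(age, income, theta):
    # h_theta(x) = theta_0 + theta_1 * x1 + theta_2 * x2
    return theta[0] + theta[1] * age + theta[2] * income

# Example: a 30-year-old applicant earning 8000 per month
print(h(30, 8000, theta))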

In [9]:
data = [[1, 4000, 25, 20000], [1, 8000, 30, 70000], [1, 5000, 28, 35000], [1, 7500., 33, 50000], [1, 12000, 40, 85000]]
df = pd.DataFrame(data, columns=["x0", "Monthly Income (x2)", "Age (x1)", "Loan Limit"])
df
Out[9]:
x0 Monthly Income (x2) Age (x1) Loan Limit
0 1 4000.0 25 20000
1 1 8000.0 30 70000
2 1 5000.0 28 35000
3 1 7500.0 33 50000
4 1 12000.0 40 85000
In [15]:
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)  # draw 1000 samples from N(mu, sigma^2)

# Histogram of the samples against the Gaussian density curve
count, bins, ignored = plt.hist(s, 30, density=True)
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ), linewidth=2, color='r')
plt.show()

Putting it together (with $x_0 = 1$ for the intercept): $$ h_\theta (x) = \sum_{i=0}^{n}\theta_ix_i = \theta^\mathrm{T} x $$
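
The vectorized form can be checked with a minimal sketch (again with placeholder $\theta$ values), where $x_0 = 1$ is prepended so the intercept is absorbed into the dot product:

import numpy as np

theta = np.array([1000.0, 500.0, 5.0])   # [theta_0, theta_1 (age), theta_2 (income)]
x = np.array([1.0, 30.0, 8000.0])        # [x0 = 1, x1 = age, x2 = monthly income]

# h_theta(x) = theta^T x
print(theta @ x)  # equals theta_0 + theta_1*x1 + theta_2*x2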

Residual

An error is the difference between the observed value and the true value (which is usually unobserved, produced by the data-generating process). A residual is the difference between the observed value and the value predicted by the model.

$\epsilon$ - Residual
Prediction error = actual value - predicted value
Residual = observed value - predicted value
In statistics, the actual value is the value obtained by observation or by measuring the available data; it is also called the observed value. The predicted value is the value of the variable predicted by the regression model.
So for each sample:
$$y^{(i)} = \theta^\mathrm{T} x^{(i)}+\epsilon^{(i)}$$
The error terms $\epsilon^{(i)}$ are assumed to be independent and identically distributed (i.i.d.).
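
A small simulation of this data-generating process (a sketch only: the "true" $\theta$, the feature ranges, and the noise level are all assumed here), with i.i.d. Gaussian errors as discussed below:

import numpy as np

rng = np.random.default_rng(0)
m = 100                                         # number of samples

true_theta = np.array([1000.0, 500.0, 5.0])     # assumed "true" parameters
X = np.column_stack([
    np.ones(m),                                 # x0 = 1 (intercept)
    rng.uniform(20, 60, m),                     # x1 = age
    rng.uniform(3000, 15000, m),                # x2 = monthly income
])
eps = rng.normal(0, 2000, m)                    # i.i.d. errors, epsilon^(i) ~ N(0, sigma^2)
y = X @ true_theta + eps                        # y^(i) = theta^T x^(i) + epsilon^(i)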

Gaussian (Normal) Distribution

$$P(\epsilon^{(i)}) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(\epsilon^{(i)})^2}{{2\sigma ^2}})} $$

In this case we assume $\mu=0$; substituting $\epsilon^{(i)} = y^{(i)}-\theta^\mathrm{T} x^{(i)}$ gives
$$P(y^{(i)}|x^{(i)}; \theta) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})} $$

Its likelihood function: $$L(\theta) = \prod\limits_{i=1}^{m}P(y^{(i)}|x^{(i)}; \theta) = \prod\limits_{i=1}^{m}\frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})}$$ and its log-likelihood function: $$\log L(\theta) = \log \prod\limits_{i=1}^{m}\frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})}$$

We are looking for the combination of parameters that makes the probability of the observed values as large as possible. We take the logarithm of the likelihood because it turns the product into a sum, which simplifies the calculation: $$\log AB = \log A+\log B$$

Thus $$\sum_{i=1}^{m}\log\frac{1}{{\sigma \sqrt {2\pi } }}e^{(-\frac{(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2}{{2\sigma ^2}})} = m\log\frac{1}{{\sigma \sqrt {2\pi } }} - \frac{1}{{\sigma ^2}}\times \frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2 $$
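
This identity can be checked numerically. The sketch below uses simulated data with assumed $\theta$ and $\sigma$, and compares the summed Gaussian log-densities with the closed form:

import numpy as np

rng = np.random.default_rng(1)
m, sigma = 50, 1.0
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
theta = np.array([2.0, -1.0, 0.5])
y = X @ theta + rng.normal(0, sigma, m)

r = y - X @ theta   # residuals y^(i) - theta^T x^(i)

# Direct sum of Gaussian log-densities
log_lik = np.sum(np.log(1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-r**2 / (2 * sigma**2))))

# Closed form: m*log(1/(sigma*sqrt(2*pi))) - (1/sigma^2) * (1/2) * sum of squared residuals
closed_form = m * np.log(1 / (sigma * np.sqrt(2 * np.pi))) - (r @ r) / (2 * sigma**2)

print(np.isclose(log_lik, closed_form))  # True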

The first term, $m\log\frac{1}{{\sigma \sqrt {2\pi } }}$, does not depend on $\theta$, and the second term, $\frac{1}{{\sigma ^2}}\times \frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2$, is non-negative.

Thus, the smaller $$\frac{1}{{\sigma ^2}}\times \frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2$$ is, the greater the likelihood of the observed values. Dropping the constant factor $\frac{1}{\sigma^2}$, we refer to the remaining quantity as $J(\theta)$:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(y^{(i)}-\theta^\mathrm{T} x^{(i)})^2$$
(Least squares)
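
A direct implementation of $J(\theta)$ on the five-sample loan data from the table above (a sketch; the $\theta$ passed in is a placeholder):

import numpy as np

# Design matrix: columns are [x0 = 1, x1 = age, x2 = monthly income]
X = np.array([[1, 25,  4000.],
              [1, 30,  8000.],
              [1, 28,  5000.],
              [1, 33,  7500.],
              [1, 40, 12000.]])
y = np.array([20000., 70000., 35000., 50000., 85000.])

def J(theta, X, y):
    # J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2
    r = X @ theta - y
    return 0.5 * (r @ r)

print(J(np.array([1000.0, 500.0, 5.0]), X, y))  # cost at a placeholder theta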

Objective function: $$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 = \frac{1}{2}(X\theta - y)^\mathrm{T}(X\theta-y)$$

Its gradient with respect to $\theta$: $$\nabla _ \theta J(\theta) = \nabla _ \theta \left( \frac{1}{2}(X\theta - y)^\mathrm{T}(X\theta-y) \right) = \nabla _ \theta \left( \frac{1}{2}(\theta ^\mathrm{T} X^\mathrm{T} - y^\mathrm{T})(X\theta-y) \right) = \nabla _ \theta \left( \frac{1}{2}(\theta ^\mathrm{T} X^\mathrm{T} X \theta - \theta ^\mathrm{T} X^\mathrm{T}y - y^\mathrm{T}X \theta + y^\mathrm{T}y) \right) = \frac{1}{2}(2X^\mathrm{T}X \theta - X^\mathrm{T}y - (y^\mathrm{T} X)^\mathrm{T}) = X^\mathrm{T}X \theta - X^\mathrm{T}y$$

At the minimum of $J$, all partial derivatives are 0. Setting the gradient to zero gives the normal equation: $$\theta = (X^\mathrm{T}X)^{-1} X^\mathrm{T}y $$
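
A sketch of the closed-form solution on the same loan data; np.linalg.solve is used on the system $X^\mathrm{T}X\theta = X^\mathrm{T}y$ instead of explicitly inverting $X^\mathrm{T}X$, which is numerically preferable:

import numpy as np

X = np.array([[1, 25,  4000.],
              [1, 30,  8000.],
              [1, 28,  5000.],
              [1, 33,  7500.],
              [1, 40, 12000.]])
y = np.array([20000., 70000., 35000., 50000., 85000.])

# theta = (X^T X)^{-1} X^T y, computed by solving the normal equation as a linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)      # [theta_0, theta_1 (age), theta_2 (income)]
print(X @ theta)  # fitted loan limits for the five samples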

Evaluation Method

RSS, residual sum of squares: $ \sum_{i=1}^{m}(\hat{y}_i - y_i)^2 $

$$R^2 = 1 - \frac{{\sum_{i=1}^{m}(\hat{y}_i - y_i)^2}}{{\sum_{i=1}^{m}(y_i - \bar{y})^2}}$$

The closer $R^2$ to 1, the better the model.
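
A sketch computing RSS and $R^2$ for the normal-equation fit on the same loan data:

import numpy as np

X = np.array([[1, 25,  4000.],
              [1, 30,  8000.],
              [1, 28,  5000.],
              [1, 33,  7500.],
              [1, 40, 12000.]])
y = np.array([20000., 70000., 35000., 50000., 85000.])

theta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ theta

rss = np.sum((y_hat - y) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss
print(rss, r2)                        # R^2 close to 1 indicates a good fit on this data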