Linear Regression

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.

For multivariate linear regression, the hypothesis takes the following form, and our job is to find the right 𝜃.

$$h_𝜃(x) = 𝜃_0+𝜃_1x_1+𝜃_2x_2+…+𝜃_nx_n$$

For convenience of notation, define $x_0 = 1$.

$$ x=\begin{bmatrix}x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \\ \end{bmatrix} ∈ R^{n+1} \qquad 𝜃=\begin{bmatrix}𝜃_0 \\ 𝜃_1 \\ 𝜃_2 \\ \vdots \\ 𝜃_n \\ \end{bmatrix} ∈ R^{n+1}$$

So
$$h_𝜃(x) = 𝜃^Tx$$

Hypothesis:
$$h_𝜃(x)=𝜃^Tx=𝜃_0+𝜃_1x_1+𝜃_2x_2+…+𝜃_nx_n$$

Parameters:
$$𝜃_0, 𝜃_1,…, 𝜃_n$$

Cost function:
$$J(𝜃) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_𝜃(x^{(i)}) - y^{(i)}\right)^2$$
where $m$ is the number of training examples.
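As a quick check of the notation, here is a minimal MATLAB sketch. The numbers are just the first three examples from the dataset at the end of this post; X stacks one example per row with $x_0 = 1$ prepended.

X = [1 2104 3; 1 1600 3; 1 2400 3];    % each row is one training example, with x_0 = 1
y = [399900; 329900; 369000];
theta = zeros(3, 1);                   % start with all parameters at zero
m = length(y);

h = X * theta;                         % vectorized hypothesis h_theta(x) for all examples
J = sum((h - y).^2) / (2*m);           % cost function J(theta)
fprintf('Cost at theta = [0;0;0]: %f\n', J);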

Gradient Descent

Repeat {

$$𝜃_j := 𝜃_j - α\frac{\partial}{\partial 𝜃_j}J(𝜃)$$

} (simultaneously update for every j = 0,…,n)

The formula above is equivalent to the following one:
$$𝜃_j := 𝜃_j - α\frac{1}{m}\sum_{i=1}^{m}\left(h_𝜃(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
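The same update can be written in vectorized form. A minimal sketch, reusing X, y, theta, and m from the snippet above and assuming a learning rate alpha has been chosen (see the sections on feature scaling and the learning rate below):

grad = X' * (X*theta - y) / m;         % all partial derivatives dJ/dtheta_j at once
theta = theta - alpha * grad;          % simultaneous update of every theta_j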

Feature Scaling

The idea is to make sure features are on a similar scale, so that gradient descent converges to the optimum faster.
There are many ways to do this; one common method is mean normalization:

$$x_i = \frac{x_i - \mu_i}{s_i}$$

where
$\mu_i$ is the average value of $x_i$ in the training set, and
$s_i$ is the range of $x_i$, that is, the maximum value of $x_i$ minus the minimum value of $x_i$;
$s_i$ can also be the standard deviation.
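A minimal sketch of mean normalization in MATLAB, assuming X holds the raw features with one column per feature and using the standard deviation as $s_i$ (the full script below does the same thing column by column):

mu = mean(X);                          % row vector of feature means mu_i
sigma = std(X);                        % row vector of feature standard deviations s_i
X_norm = (X - mu) ./ sigma;            % implicit expansion; use bsxfun on very old MATLAB versions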

Learning Rate α

$$ 𝜃_j := 𝜃_j - α\frac{\partial}{\partial 𝜃_j}J(𝜃) $$
If you choose α correctly, the cost function $J(𝜃)$ should decrease after each iteration.

If gradient descent is not converging and your code is correct, try a smaller α.

For sufficiently small α, $J(𝜃)$ should decrease on every iteration.
But if α is too small, gradient descent can be slow to converge, so in practice you try several values of the learning rate by hand and pick one that works well.
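One practical way to do this is to run a short burst of gradient descent for a few candidate values of α and compare how far $J(𝜃)$ drops. A sketch, assuming a normalized design matrix X (with the column of ones), targets y, and m as in the script below:

for alpha = [0.01 0.03 0.1 0.3]        % candidate learning rates, roughly 3x apart
    theta = zeros(size(X, 2), 1);
    for iter = 1:50
        theta = theta - alpha/m * X' * (X*theta - y);   % vectorized update
    end
    J = sum((X*theta - y).^2) / (2*m);
    fprintf('alpha = %.2f -> J after 50 iterations: %f\n', alpha, J);
end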

Normal Equation

Normal equation: Method to solve for 𝜃 analytically.
$$𝜃 = (X^TX)^{-1}X^T \vec y$$
You might wonder how this formula comes about. In general the system $X𝜃 = \vec y$ has no exact solution, so we pick the $𝜃$ that minimizes the squared error; at that minimum the residual $\vec y - X𝜃$ is orthogonal to every column of $X$, which gives:
$$\begin{align} X^T(\vec y-X𝜃) & = 0 \\
X^T\vec y-X^TX𝜃 & = 0 \\
X^TX𝜃 & = X^T\vec y \\
𝜃 & = (X^TX)^{-1}X^T\vec y \end{align}$$
The normal equation is slow when $n$ is very large, because you have to compute $(X^TX)^{-1}$, which is $O(n^3)$.
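Also note that in MATLAB you rarely form the inverse explicitly; solving the linear system directly is faster and more numerically stable. A minimal sketch, assuming X already contains the intercept column of ones:

theta = (X' * X) \ (X' * y);           % solves X'*X*theta = X'*y without forming the inverse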

Algorithm Implementation

We will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be.

Suppose you have the file ex1data2.txt which contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.

Linear_Regression_Gradient_Descent.m

%Linear regression with multiple variables
%Code in matlab by Kim Logan
%% Clear and Close Figures
clear ; close all; clc

%% Load Data
fprintf('Loading data ...\n');
data = load('ex1data2.txt');
X = data(:, 1:2);
y = data(:, 3);
m = length(y);

% Scale features and set them to zero mean
fprintf('Normalizing Features ...\n');
mu = [mean(X(:,1)), mean(X(:,2))];
sigma = [std(X(:,1)), std(X(:,2))];
X = [(X(:,1) - mu(1)) / sigma(1) ,(X(:,2) - mu(2)) / sigma(2)];

% Add intercept term to X
X = [ones(m, 1) X];

% Run Gradient Descent
fprintf('Running gradient descent ...\n');

% Choose some alpha value
alpha = 0.03;
num_iters = 400;
theta = zeros(3, 1);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
    % Partial derivative terms: residual dotted with each feature column,
    % all computed from the current theta before any update
    dp1 = dot(X*theta-y, X(:,1));
    dp2 = dot(X*theta-y, X(:,2));
    dp3 = dot(X*theta-y, X(:,3));
    % Simultaneous update of every theta_j
    theta(1) = theta(1) - alpha/m*dp1;
    theta(2) = theta(2) - alpha/m*dp2;
    theta(3) = theta(3) - alpha/m*dp3;
    % Save the cost J in every iteration
    residual = y - X*theta;
    J_history(iter) = dot(residual, residual)/(2*m);
    fprintf('%f\n', J_history(iter));
end

% Plot the convergence graph
figure;
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');

%% Display gradient descent's result
fprintf('Theta computed from gradient descent: \n');
fprintf(' %f \n', theta);
fprintf('\n');

%% Estimate the price of a 1650 sq-ft, 3 br house
input1 = (1650 - mu(1)) / sigma(1);
input2 = (3 - mu(2)) / sigma(2);
price = [1, input1, input2] * theta;

fprintf(['Predicted price of a 1650 sq-ft, 3 br house ' ...
'(using gradient descent):\n $%f\n'], price);

Normal_Equation.m

%Normal Equation
%Code in matlab by Kim Logan
%% Clear and Close Figures
clear ; close all; clc

%% Load Data
data = csvread('ex1data2.txt');
X = data(:, 1:2);
y = data(:, 3);
m = length(y);

% Add intercept term to X
X = [ones(m, 1) X];

% Calculate the parameters from the normal equation
% (pinv is more robust than inv when X'*X is close to singular)
theta = pinv(X'*X)*X'*y;

% Display normal equation's result
fprintf('Theta computed from the normal equations: \n');
fprintf(' %f \n', theta);
fprintf('\n');

% Estimate the price of a 1650 sq-ft, 3 br house
price = [1, 1650, 3] * theta;
fprintf(['Predicted price of a 1650 sq-ft, 3 br house ' ...
'(using normal equations):\n $%f\n'], price);

The dataset ex1data2.txt is shown below:

2104,3,399900
1600,3,329900
2400,3,369000
1416,2,232000
3000,4,539900
1985,4,299900
1534,3,314900
1427,3,198999
1380,3,212000
1494,3,242500
1940,4,239999
2000,3,347000
1890,3,329999
4478,5,699900
1268,3,259900
2300,4,449900
1320,2,299900
1236,3,199900
2609,4,499998
3031,4,599000
1767,3,252900
1888,2,255000
1604,3,242900
1962,4,259900
3890,3,573900
1100,3,249900
1458,3,464500
2526,3,469000
2200,3,475000
2637,3,299900
1839,2,349900
1000,1,169900
2040,4,314900
3137,3,579900
1811,4,285900
1437,3,249900
1239,3,229900
2132,4,345000
4215,4,549000
2162,4,287000
1664,2,368500
2238,3,329900
2567,4,314000
1200,3,299000
852,2,179900
1852,4,299900
1203,3,239500