Data Mining by Mehmed Kantardzic
Generalized linear regression models are currently the most frequently applied statistical techniques. They are used to describe the relationship between the trend of one variable and the values taken by several other variables. Modeling this type of relationship is often called linear regression. Fitting models is not the only task in statistical modeling. We often want to select one of several possible models as being the most appropriate. An objective method for choosing between different models is called ANOVA, and it is described in Section 5.5.
The relationship that fits a set of data is characterized by a prediction model called a regression equation. The most widely used form of the regression model is the general linear model, formally written as

Y = α + β1·X1 + β2·X2 + β3·X3 + … + βn·Xn
Applying this equation to each of the given samples, we obtain a new set of equations

yj = α + β1·x1j + β2·x2j + … + βn·xnj + εj,   j = 1, 2, … , m
where εj's are errors of regression for each of m given samples. The linear model is called linear because the expected value of yj is a linear function: the weighted sum of input values.
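As a minimal illustration of this weighted-sum form (a sketch written for this text, not taken from the book; the coefficient and input values are arbitrary assumptions), one prediction of the general linear model can be computed as:

```python
# Illustrative sketch: the general linear model as a weighted sum of inputs.
# The coefficient and input values below are arbitrary, chosen only for demonstration.
alpha = 1.0                      # intercept (beta_0)
betas = [0.5, -0.2, 2.0]         # beta_1 ... beta_n
x = [3.0, 10.0, 0.5]             # one input sample x_1 ... x_n

# y' = alpha + sum_j beta_j * x_j
y_pred = alpha + sum(b * xj for b, xj in zip(betas, x))
print(y_pred)                    # 1.0 + 1.5 - 2.0 + 1.0 = 1.5
```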
Linear regression with one input variable is the simplest form of regression. It models a random variable Y (called a response variable) as a linear function of another random variable X (called a predictor variable). Given n samples or data points of the form (x1, y1), (x2, y2), … , (xn, yn), where xi ∈ X and yi ∈ Y, linear regression can be expressed as

Y = α + β·X
where α and β are regression coefficients. With the assumption that the variance of Y is a constant, these coefficients can be solved by the method of least squares, which minimizes the error between the actual data points and the estimated line. The residual sum of squares is often called the sum of squares of the errors about the regression line, and it is denoted by SSE (sum of squares error):

SSE = Σ (yi − yi′)²,   i = 1, 2, … , n
where yi is the real output value given in the data set, and yi′ is a response value obtained from the model. Differentiating SSE with respect to α and β, we have

∂(SSE)/∂α = −2 Σ (yi − α − β·xi)
∂(SSE)/∂β = −2 Σ (yi − α − β·xi)·xi
Setting the partial derivatives equal to 0 (minimization of the total error) and rearranging the terms, we obtain the equations

n·α + β·Σ xi = Σ yi
α·Σ xi + β·Σ xi² = Σ xi·yi
which may be solved simultaneously to yield the computing formulas for α and β. Using standard relations for the mean values, the regression coefficients for this simple case of optimization are

β = Σ (xi − meanx)·(yi − meany) / Σ (xi − meanx)²
α = meany − β·meanx
where meanx and meany are the mean values for random variables X and Y given in a training data set. It is important to remember that our values of α and β, based on a given data set, are only estimates of the true parameters for the entire population. The equation y = α + βx may be used to predict the mean response y0 for the given input x0, which is not necessarily from the initial set of samples.
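The closed-form estimates above translate directly into code. The following sketch is an illustrative implementation written for this text (not the author's), using only the mean-based formulas for β and α:

```python
# Closed-form least-squares fit for the simple model y = alpha + beta * x.
def simple_linear_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # beta = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    beta = numerator / denominator
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Prediction of the mean response for a new input x0,
# which is not necessarily from the initial set of samples.
def predict(alpha, beta, x0):
    return alpha + beta * x0
```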
For example, if the sample data set is given in the form of a table (Table 5.2), and we are analyzing the linear regression between two variables (predictor variable A and response variable B), then the linear regression can be expressed as

B = α + β·A
where the α and β coefficients can be calculated based on previous formulas (using meanA = 5.4 and meanB = 6), and they have the values

β = 0.92
α = 1.03
TABLE 5.2. A Database for the Application of Regression Methods

A    B
1    3
8    9
11   11
4    5
3    2
The optimal regression line is

B = 1.03 + 0.92·A
The initial data set and the regression line are graphically represented in Figure 5.4 as a set of points and a corresponding line.
Figure 5.4. Linear regression for the data set given in Table 5.2.
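The Table 5.2 example can be checked numerically with a short script; the sketch below reproduces meanA = 5.4, meanB = 6, and coefficient values close to β = 0.92 and α = 1.03:

```python
# Numerical check of the Table 5.2 example: predictor A, response B.
A = [1, 8, 11, 4, 3]
B = [3, 9, 11, 5, 2]

mean_a = sum(A) / len(A)    # 5.4
mean_b = sum(B) / len(B)    # 6.0
num = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B))
den = sum((a - mean_a) ** 2 for a in A)
beta = num / den                  # ~0.92
alpha = mean_b - beta * mean_a    # ~1.03
print(round(beta, 2), round(alpha, 2))
```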
Multiple regression is an extension of linear regression, and it involves more than one predictor variable. The response variable Y is modeled as a linear function of several predictor variables. For example, if the predictor attributes are X1, X2, and X3, then the multiple linear regression is expressed as

Y = α + β1·X1 + β2·X2 + β3·X3
where α, β1, β2, and β3 are coefficients that are found by using the method of least squares. For a linear regression model with more than two input variables, it is useful to analyze the process of determining the β parameters through a matrix calculation:

Y′ = X·β
where β = {β0, β1, … , βn}, β0 = α, and X and Y are input and output matrices for a given training data set. The residual sum of the squares of errors SSE will also have the matrix representation

SSE = (Y − X·β)′ · (Y − X·β)
and after optimization

∂(SSE)/∂β = 0
the final β vector satisfies the matrix equation

β = (X′·X)⁻¹·X′·Y
where β is the vector of estimated coefficients in a linear regression. Matrices X and Y have the same dimensions as the training data set. Therefore, an optimal solution for the β vector is relatively easy to find in problems with several hundred training samples. For real-world data-mining problems, the number of samples may increase to several millions. In these situations, because of the extreme dimensions of the matrices and the greatly increased complexity of the algorithm, it is necessary to find modifications and/or approximations in the algorithm, or to use totally different regression methods.
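For a small training set, the matrix solution can be computed directly. The sketch below is only an illustration: the data values are invented, NumPy is assumed as the numerical library, and a column of ones is appended to X so that β0 plays the role of the intercept α:

```python
import numpy as np

# Toy training set with three predictor variables X1, X2, X3 (values invented).
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 0.0, 1.5],
              [3.0, 1.0, 2.0],
              [4.0, 3.0, 2.5],
              [5.0, 2.0, 3.0]])
Y = np.array([3.0, 4.5, 7.0, 10.0, 11.5])

# Append a column of ones so that beta_0 acts as the intercept alpha.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal-equation solution: beta = (X'X)^(-1) X'Y.
beta = np.linalg.inv(X1.T @ X1) @ (X1.T @ Y)
print(beta)          # [beta_0, beta_1, beta_2, beta_3]
```

In practice, a least-squares solver such as numpy.linalg.lstsq (based on a more stable matrix decomposition) is usually preferred over forming the explicit inverse, which is one way of addressing the scalability concerns mentioned above.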
There is a large class of regression problems, initially nonlinear, that can be converted into the form of the general linear model. For example, a polynomial relationship such as

Y = α + β1·X1 + β2·X2 + β3·X3 + β4·X1·X3 + β5·X2·X3
can be converted to the linear form by setting new variables X4 = X1 · X3 and X5 = X2 · X3. Also, polynomial regression can be modeled by adding polynomial terms to the basic linear model. For example, a cubic polynomial curve has the form

Y = α + β1·X + β2·X² + β3·X³
By applying the transformation to the predictor variables (X1 = X, X2 = X², and X3 = X³), the cubic model is converted into the general linear model with three input variables, and the coefficients can again be estimated with the method of least squares.
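The cubic case can be sketched in code as well: under the substitution X1 = X, X2 = X², X3 = X³, the fit reduces to the same matrix solution used for multiple regression. The data values below are invented purely for illustration:

```python
import numpy as np

# Invented one-dimensional sample: y roughly follows a cubic curve in x.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-7.5, -0.5, 1.0, 2.5, 9.0, 29.5])

# Transform the single predictor into three variables: X1 = x, X2 = x^2, X3 = x^3,
# and prepend a column of ones for the intercept alpha.
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

# The cubic model is now a general linear model; solve the normal equations.
beta = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta)          # [alpha, beta_1, beta_2, beta_3]
```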