MIXED; PROC RSREG; PROC IML; PROC PRINCOMP; PROC VARCOMP; PROC FACTOR; PROC CANCORR; PROC DISCRIM, etc. Some of these are described in the sequel.

 

PROC TTEST is the procedure used for comparing the mean of a given sample with a specified value (the one-sample t test). It is also used to compare the means of two independent samples. The paired-observations t test compares the mean of the differences in the observations to a given number. The underlying assumption of the t test in all three cases is that the observations are random samples drawn from normally distributed populations. This assumption can be checked using the UNIVARIATE procedure; if the normality assumption is not satisfied, one should analyze the data using the NPAR1WAY procedure. PROC TTEST computes the group-comparison t statistic based on the assumption that the variances of the two groups are equal. It also computes an approximate t based on the assumption that the variances are unequal (the Behrens-Fisher problem). The following statements are available in PROC TTEST.

PROC TTEST <options>;
CLASS variable;
PAIRED variables;
BY variables;
VAR variables;
FREQ variable;
WEIGHT variable;

 

No statement can be used more than once. There is no restriction on the order of the statements after the PROC statement. The following options can appear in the PROC TTEST statement.

ALPHA=p: specifies that confidence intervals are to be 100(1-p)% confidence intervals, where 0<p<1. By default, PROC TTEST uses ALPHA=0.05. If p is 0 or less, or 1 or more, an error message is printed.

COCHRAN: requests the Cochran and Cox approximation of the probability level of the approximate t statistic for the unequal-variances situation.

H0=m: requests tests against m instead of 0 in all three situations (one-sample, two-sample, and paired-observation t tests). By default, PROC TTEST uses H0=0.

 

A CLASS statement giving the name of the classification (or grouping) variable must accompany the PROC TTEST statement in the two-independent-sample case. It should be omitted for the one-sample or paired-comparison situations. The class variable must have two, and only two, levels. PROC TTEST divides the observations into the two groups for the t test using the levels of this variable. Either a numeric or a character variable can be used in the CLASS statement.
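For illustration, here is a minimal sketch of a two-independent-sample analysis, assuming a hypothetical data set YIELD with a two-level character variable VARIETY and a numeric response Y; all data set and variable names are assumptions for the example. The normality check with PROC UNIVARIATE and the PROC NPAR1WAY fallback mentioned above are included.

proc univariate data=yield normal;        /* NORMAL requests tests of normality */
   class variety;
   var y;
run;

proc ttest data=yield alpha=0.05 h0=0 cochran;
   class variety;                         /* grouping variable with exactly two levels */
   var y;                                 /* response whose group means are compared */
run;

/* If the normality tests are rejected, a rank-based alternative: */
proc npar1way data=yield wilcoxon;
   class variety;
   var y;
run;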

In the statement PAIRED PairLists, the PairLists identify the variables to be compared in paired comparisons. One or more PairLists can be used. Variables or lists of variables are separated by an asterisk (*) or a colon (:). Examples of the use of the asterisk and the colon are shown in the following table.

 

The PAIRED Statements              Comparisons made
PAIRED A*B;                        A-B
PAIRED A*B C*D;                    A-B and C-D
PAIRED (A B)*(C B);                A-C, A-B and B-C
PAIRED (A1-A2)*(B1-B2);            A1-B1, A1-B2, A2-B1 and A2-B2
PAIRED (A1-A2):(B1-B2);            A1-B1 and A2-B2
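As a further sketch, assuming a hypothetical data set BP in which the variables PRE and POST are recorded on the same subjects, a paired comparison could be requested as:

proc ttest data=bp;
   paired pre*post;   /* tests whether the mean of the differences PRE-POST is 0 */
run;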

 

PROC ANOVA performs analysis of variance for balanced data only, from a wide variety of experimental designs, whereas PROC GLM can analyze both balanced and unbalanced data. Because PROC ANOVA takes into account the special features of a balanced design, it is faster and uses less storage than PROC GLM for balanced data. The basic syntax of the ANOVA procedure is given below:

 

PROC ANOVA <options>;
CLASS variables;
MODEL dependents = independent variables (or effects) / options;
MEANS effects / options;
ABSORB variables;
FREQ variables;
TEST H = effects E = effect;
MANOVA H = effects E = effect M = equations / options;
REPEATED factor-name levels / options;
BY variables;

 

The PROC ANOVA, CLASS and MODEL statements are mandatory; the other statements are optional. The CLASS statement defines the variables for classification (numeric or character variables, with a maximum length of 16 characters).

 

The MODEL statement names the dependent variables and the independent variables or effects. If no effects are specified in the MODEL statement, ANOVA fits only the intercept. Included in the ANOVA output are F-tests of all effects in the MODEL statement; all of these F-tests use the residual mean square as the error term. The MEANS statement produces tables of the means corresponding to the list of effects. Among the options available in the MEANS statement are several multiple comparison procedures, viz. the Least Significant Difference (LSD), Duncan's new multiple-range test (DUNCAN), the Waller-Duncan test (WALLER) and Tukey's Honest Significant Difference (TUKEY). The LSD, DUNCAN and TUKEY options take the level of significance ALPHA = 5% unless the ALPHA= option is specified; only ALPHA = 1%, 5% and 10% are allowed with Duncan's test. 95% confidence intervals about the means can be obtained using the CLM option of the MEANS statement.
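A minimal sketch combining these statements, assuming a hypothetical randomized complete block data set RCBD with class variables REP and TRT and a response YIELD (all names are assumptions for the example):

proc anova data=rcbd;
   class rep trt;
   model yield = rep trt;
   means trt / duncan tukey alpha=0.05;   /* multiple comparison procedures */
   means trt / lsd clm;                   /* 95% confidence limits about each mean */
run;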

 

The TEST statement is used to test effects for which the residual mean square is not the appropriate error term, such as main-plot effects in a split-plot experiment (a sketch follows below). There can be multiple MEANS and TEST statements (in PROC GLM as well), but only one MODEL statement, which must precede the RUN statement. The ABSORB statement implements the technique of absorption, which saves time and reduces storage requirements for certain types of models. The FREQ statement is used when each observation in a data set represents n observations, where n is the value of the FREQ variable. The MANOVA statement is used for implementing multivariate analysis of variance. The REPEATED statement is useful for analyzing repeated measurement designs, and the BY statement specifies that separate analyses are performed on observations in groups defined by the BY variables.
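For example, in a split-plot experiment the main-plot factor is tested against the replication by main-plot interaction rather than against the residual. A sketch, assuming hypothetical class variables REP, A (main-plot factor) and B (subplot factor) with response Y:

proc anova data=splitplot;
   class rep a b;
   model y = rep a rep*a b a*b;
   test h=a e=rep*a;    /* main-plot factor tested against the main-plot error */
run;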

 

Using PROC GLM for analysis of variance is similar to using PROC ANOVA. The statements listed for PROC ANOVA are also used with PROC GLM. In addition, the following statements can be used with PROC GLM:

 

CONTRAST 'label' effect name <... effect coefficients> </options>;
ESTIMATE 'label' effect name <... effect coefficients> </options>;
ID variables;
LSMEANS effects </options>;
OUTPUT <OUT = SAS-data-set> keyword = names <... keyword = names>;
RANDOM effects </options>;
WEIGHT variables;

 

Multiple comparisons as used in the options of the MEANS statement are useful when there are no particular comparisons of special interest, but situations do occur where preplanned comparisons are required. Using the CONTRAST and LSMEANS statements, specific hypotheses regarding pre-planned comparisons can be tested. The basic form of the CONTRAST statement is as described above, where label is a character string used for labelling the output, effect name is the class variable (the independent variable) and effect coefficients is a list of numbers that specifies the linear combination of parameters in the null hypothesis. The contrast is a linear function such that the elements of the coefficient vector sum to 0 for each effect. While using CONTRAST statements, the following points should be kept in mind.

 

Check how many levels (classes) there are for the effect. If there are more levels of that effect in the data than the number of coefficients specified in the CONTRAST statement, PROC GLM adds trailing zeros. Suppose there are 5 treatments in a completely randomized design, denoted T1, T2, T3, T4, T5, and the null hypothesis to be tested is

H0: T2 + T3 = 2T1, or -2T1 + T2 + T3 = 0

Suppose the treatments in the data are classified using TRT as the class variable; then the effect name is TRT and the statement is

CONTRAST 'T1 VS 2&3' TRT -2 1 1 0 0;

If the last two zeros are not given, the trailing zeros are added automatically. The use of this statement gives a sum of squares with 1 degree of freedom (d.f.) and an F-value tested against the residual mean square as error, unless otherwise specified. The name or label of the contrast must be 20 characters or less.
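As a sketch of the complete step, assuming a hypothetical data set CRD with class variable TRT (levels T1 to T5) and response Y:

proc glm data=crd;
   class trt;
   model y = trt;
   contrast 'T1 VS 2&3' trt -2 1 1 0 0;   /* tests H0: -2T1 + T2 + T3 = 0 */
run;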

 

 

 

The available CONTRAST statement options are:

E: prints the entire vector of coefficients in the linear function, i.e., the contrast.
E = effect: specifies an effect in the model that can be used as an error term.
ETYPE = n: specifies the type (1, 2, 3 or 4) of the E= effect.
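For instance, if the earlier split-plot sketch were fitted with PROC GLM instead of PROC ANOVA, a contrast among levels of the main-plot factor A (assumed here to have three levels, with REP*A in the model) could name the main-plot error term explicitly; the coefficients are illustrative only:

CONTRAST 'A1 VS A2' A 1 -1 0 / E E=REP*A ETYPE=1;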

 

Multiple-degree-of-freedom contrasts can be specified by repeating the effect name and coefficients as needed, separated by commas. Thus, for the above example, the statement is

CONTRAST 'All' TRT -2 1 1 0 0, TRT 0 1 -1 0 0;

 

 

 

This statement produces a two-d.f. sum of squares due to both contrasts. This feature can be used to obtain partial sums of squares for effects through the reduction principle, using sums of squares from multiple-degree-of-freedom contrasts that include and exclude the desired contrasts. Although only t-1 linearly independent contrasts exist for t classes, any number of contrasts can be specified.

 

The ESTIMATE statement can be used to estimate linear functions of the parameters that may or may not be obtainable with the CONTRAST or LSMEANS statements. To specify it, only the word CONTRAST is replaced by ESTIMATE in the CONTRAST statement.

 

Fractions in the effect coefficients can be avoided by using DIVISOR = common denominator as an option. This statement provides the value of the estimate, a standard error and a t-statistic for testing whether the estimate is significantly different from zero.
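For instance, to estimate the difference between T1 and the average of T2 and T3 without writing fractional coefficients (same hypothetical CRD data and TRT variable as above):

proc glm data=crd;
   class trt;
   model y = trt;
   estimate 'T1 vs mean of T2,T3' trt 2 -1 -1 0 0 / divisor=2;   /* estimates T1 - (T2+T3)/2 */
run;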

 

The LSMEANS statement produces the least squares estimates of the CLASS variable means, i.e. adjusted means. For a one-way structure, these are simply the ordinary means. The least squares means of the five treatments for all dependent variables in the MODEL statement can be obtained using the statement

LSMEANS TRT / options;

Various options available with this statement are:

STDERR: gives the standard error of each estimated least squares mean and the t-statistic for a test of the hypothesis that the mean is zero.

PDIFF: prints the p-values for the tests of equality of all pairs of CLASS means.

SINGULAR: tunes the estimability checking. The options E, E= and ETYPE= are similar to those discussed under the CONTRAST statement.

ADJUST=T: gives the probabilities of significance of pairwise comparisons based on the t-test.

ADJUST=TUKEY: gives the probabilities of significance of pairwise comparisons based on Tukey's test.

LINES: gives letters on the treatments showing significant and non-significant groups.
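Putting these options together for the hypothetical TRT factor used above:

proc glm data=crd;
   class trt;
   model y = trt;
   lsmeans trt / stderr pdiff adjust=tukey lines;   /* adjusted means, standard errors, Tukey-adjusted pairwise p-values and letter groupings */
run;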

 

When predicted values are requested as a MODEL statement option, the values of the variables specified in the ID statement are printed beside each observed, predicted and residual value for identification. The OUTPUT statement produces an output data set that contains the original data set values along with the predicted and residual values.
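A sketch of this, again with the hypothetical CRD data and an assumed identifying variable PLOT:

proc glm data=crd;
   class trt;
   model y = trt / p;                /* P requests the predicted and residual values */
   id plot;                          /* printed beside each observed, predicted and residual value */
   output out=diag p=pred r=resid;   /* original data plus predicted and residual values */
run;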

 

Besides other options available under the MODEL statement of PROC GLM, one can give the options: 1. SOLUTION, 2. XPX (= X'X), 3. I (g-inverse).
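For example, with the hypothetical CRD data (INVERSE is the full form of the I option):

proc glm data=crd;
   class trt;
   model y = trt / solution xpx inverse;   /* parameter estimates, the X'X matrix and its g-inverse */
run;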

 

PROC GLM recognizes different theoretical approaches to ANOVA by providing four types of sums of squares and associated statistics. The four types of sums of squares in PROC GLM are called Type I, Type II, Type III and Type IV.
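The types of sums of squares to be printed can be requested explicitly as MODEL statement options. A short sketch, assuming a hypothetical (possibly unbalanced) two-factor data set TWOWAY with class variables A and B and response Y:

proc glm data=twoway;
   class a b;
   model y = a b a*b / ss1 ss2 ss3 ss4;   /* print Type I, II, III and IV sums of squares */
run;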

 

The Type I sums of squares are the classical sequential sums of squares obtained by adding the terms to the model in some logical sequence. The sum of squares for each class of effects is adjusted for only those effects that precede it in the model. Thus the sums of squares and their expectations are dependent on the order in which the model is specified.

 

The Type II, III and IV are 'partial sums of squares' in the sense that each is adjusted for all other classes of effects in the model, but each is adjusted according to different rules. One general rule applies to all three types: the estimable functions that generate the sums of squares for one class of effects will not involve any other classes of effects except those that "contain" the class of effects in question.

 

For example, the estimable functions that generate SS(AB) in a three-factor factorial will have zero coefficients on the main effects and on the (A x C) and (B x C) interaction effects. They will have non-zero coefficients on the (A x B x C) interaction effects, because the A x B x C interaction "contains" the A x B interaction.

 

Type II, III and IV sums of squares differ from each other in how the coefficients are determined for the classes of effects that do not have zero coefficients, i.e. those that contain the class of effects in question. The estimable functions for the Type II sums of squares impose no restriction on the values of the non-zero coefficients on the remaining effects; they are allowed to take whatever values result from the computations adjusting for effects that are required to have zero coefficients. Thus, the coefficients on the higher-order interaction effects and higher-level nesting effects are functions of the number of observations in the data. In general, the Type
