#### Statistical Review

**Confidence Intervals:** An interval estimate of a population parameter, usually established at the 95% level. If one were to draw 100 samples from the population and construct a 95% CI from each, approximately 95 of those intervals would contain the true population mean. The narrower the range of the 95% CI around the sample estimate, the more precise the estimate and the stronger the results of that test.
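As a minimal sketch, a 95% CI for a sample mean can be computed with the normal approximation (z = 1.96); the scores below are hypothetical:

```python
import math
import statistics

def ci95(scores):
    """95% confidence interval for a sample mean (normal approximation, z = 1.96)."""
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    return (mean - 1.96 * se, mean + 1.96 * se)

# Hypothetical outcome scores from a sample of eight patients
scores = [42, 38, 45, 40, 44, 39, 41, 43]
low, high = ci95(scores)  # a narrower (low, high) range means a more precise estimate
```

With small samples a t-based interval would be more appropriate; the z-based version is shown only to keep the arithmetic transparent.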

**Standard Error of Measurement:** This value tells the range (+/-) within which a patient's true score is likely to fall on a given test. If the SEM for goniometric measurement of range of motion for knee flexion is 3.5 degrees, then one could expect the true/actual range of motion to fall between 116.5 and 123.5 degrees when the measured value is 120 degrees. It can also be thought of as the standard deviation of the scores from repeated administrations of the test.
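A common way to estimate the SEM is from the score standard deviation and a reliability coefficient (SEM = SD x sqrt(1 - reliability)). The SD and ICC values below are hypothetical, chosen so the result lands near the 3.5-degree example in the text:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical: knee-flexion ROM scores with SD = 7.8 degrees and ICC = 0.80
s = sem(7.8, 0.80)                   # roughly 3.5 degrees
observed = 120
band = (observed - s, observed + s)  # roughly (116.5, 123.5), as in the text
```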

**Sensitivity:** If a patient does have a condition, what are the chances that the clinical test will be positive? This is your measure of True Positives. The values range from 0 to 1.0, where 1.0 = 100% true positives. The mnemonic SnOut is used to apply these findings: if a test has high sensitivity and the test is negative, a clinician can feel better about ruling Out the disease (SnOut). Clinical tests with higher sensitivity are better for screening patients for the target condition, but not as good for providing a specific diagnosis. In other words, when a highly sensitive test is negative you can feel more assured that the patient does not have the condition; however, if the test is positive you cannot be assured that they do have that condition, unless the test is also highly specific.

**Specificity:** If a patient does not have a condition, what are the chances that the clinical test will be negative? This is your measure of True Negatives. The values range from 0 to 1.0, where 1.0 = 100% true negatives. The mnemonic SpIn is used to apply these findings: if a test has high specificity and the test is positive, a clinician can feel better about ruling In the disease (SpIn). When a highly specific test is positive you can feel more assured that the patient does have the condition; however, if the test is negative you cannot be assured that they do not have that condition, unless the test is also highly sensitive.
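Both values come straight from a 2x2 table of test results against a reference standard. A small sketch with hypothetical counts:

```python
def sensitivity(tp, fn):
    """Proportion of patients WITH the condition whose test is positive: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of patients WITHOUT the condition whose test is negative: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical 2x2 table: 90 of 100 diseased patients test positive,
# and 80 of 100 healthy patients test negative
sens = sensitivity(tp=90, fn=10)  # 0.90: a negative result helps rule OUT (SnOut)
spec = specificity(tn=80, fp=20)  # 0.80: a positive result helps rule IN (SpIn)
```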

**Positive Likelihood Ratio (+LR):** Expresses the change in odds favoring the condition when given a positive test. It is calculated from the sensitivity and specificity of a test (+LR = Sensitivity / (1 - Specificity)). A +LR > 1.0 increases the likelihood that the condition is present given a positive test result, and the further above 1.0 it is, the more a positive test shifts the odds.

**Negative Likelihood Ratio (-LR):** Expresses the change in odds that a condition is absent when given a negative test. It is calculated from the sensitivity and specificity of a test (-LR = (1 - Sensitivity) / Specificity). A -LR < 1.0 increases the likelihood that the condition is absent given a negative test result, and the closer to 0 it is, the more a negative test shifts the odds.
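Using the hypothetical sensitivity of 0.90 and specificity of 0.80 from above, both likelihood ratios follow directly from their formulas:

```python
def positive_lr(sens, spec):
    """+LR = sensitivity / (1 - specificity)."""
    return sens / (1 - spec)

def negative_lr(sens, spec):
    """-LR = (1 - sensitivity) / specificity."""
    return (1 - sens) / spec

plr = positive_lr(0.90, 0.80)  # about 4.5: a positive test raises the odds of disease ~4.5-fold
nlr = negative_lr(0.90, 0.80)  # about 0.125: a negative test cuts the odds of disease to ~1/8
```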

**Odds Ratio:** This is an estimate of the relative risk and is typically used when the relative risk cannot be determined accurately because of the limitations of the study (inability to accurately calculate cumulative incidence, i.e. a case-control study). It is often used to express the effect size. It is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. An odds ratio of 1.0 (1:1 odds) means no difference in odds between the groups (the event or condition occurs equally in both groups).

**Relative Risk:** This is the measure of the relative effect, which is the ratio that describes the risks associated with the exposed group compared to the unexposed group. It indicates the likelihood that someone who has been exposed to a risk factor will develop the condition in comparison to someone who has not been exposed.
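Both measures can be read off a 2x2 exposure-by-outcome table. A sketch with hypothetical cohort counts (in a true case-control design only the odds ratio would be valid, since incidence cannot be calculated):

```python
def odds_ratio(a, b, c, d):
    """Odds of the outcome in the exposed group (a/b) over the unexposed group (c/d)."""
    return (a / b) / (c / d)

def relative_risk(a, b, c, d):
    """Risk (cumulative incidence) in the exposed group over the unexposed group."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical cohort: exposed   -> 30 cases, 70 non-cases
#                      unexposed -> 10 cases, 90 non-cases
or_est = odds_ratio(30, 70, 10, 90)     # about 3.86
rr_est = relative_risk(30, 70, 10, 90)  # about 3.0 (30% risk vs 10% risk)
```

Note that the OR (3.86) overestimates the RR (3.0) here; the two converge only when the outcome is rare.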

**Effect Size:** The magnitude of the difference between two treatments or the relationship between two variables. A larger effect size for one treatment indicates that it resulted in a larger positive difference in the outcome that was measured.
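One common effect-size index for two-group comparisons is Cohen's d, the mean difference divided by the pooled standard deviation. A sketch with hypothetical post-treatment scores:

```python
import math
import statistics

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

# Hypothetical post-treatment scores for two interventions
treatment = [24, 27, 22, 26, 25, 28]
control = [20, 23, 19, 22, 21, 24]
d = cohens_d(treatment, control)  # a large standardized difference favoring treatment
```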

**Validity:** Does the clinical test measure what it is intended to measure? This is the question that validity answers. This can often be measured by Sensitivity and Specificity values as well as Likelihood Ratios (positive and negative predictive values are used often but are not as helpful as likelihood ratios).

**Reliability:** How well do examiners agree on the findings of a test? Reliability is a measure of agreement, but not validity. It is based on the amount of error that is present in a set of scores. In order for a clinical test to have good validity, good reliability is required. However, tests that do not have good validity can still have excellent reliability. Examiners may be able to measure a test very reliably between themselves and other examiners, but that does not necessarily mean that the test is a good measure of a specific condition or diagnosis. It is measured by coefficients (Kappa or Intraclass Correlation Coefficients depending on the type of variable).

**Kappa:** This is a measure of agreement that has been chance-corrected. This statistic evaluates the proportion of observed agreement and then takes into account the proportion of agreement that can be expected by chance. It was designed primarily for non-parametric data such as dichotomous variables (Yes/No or Positive/Negative answers) and categorical variables like manual muscle test grades. The range of scores runs from 0 to 1.0 and the interpretation of scores has been suggested as:

1.0 = Perfect

0.8 to 1.0 = Excellent

0.6 to 0.8 = Substantial

0.4 to 0.6 = Moderate

< 0.4 = Poor
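A minimal sketch of the calculation for two raters, using hypothetical positive/negative ratings: observed agreement is compared against the agreement expected from each rater's marginal proportions.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement by chance, from each rater's marginal proportions
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical: two examiners rate the same 10 special tests positive/negative
rater1 = ["+", "+", "-", "+", "-", "-", "+", "+", "-", "+"]
rater2 = ["+", "-", "-", "+", "-", "+", "+", "+", "-", "+"]
kappa = cohens_kappa(rater1, rater2)  # about 0.58: moderate on the scale above
```

Note how the 80% raw agreement shrinks once chance agreement (52% here) is removed.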

**Weighted Kappa:** The regular Kappa statistic does not differentiate among disagreements. If a researcher wants to assign a greater weight to one disagreement over another due to greater possible risks, then a Weighted Kappa is used. Some disagreements may be more serious than others. Not all data can be differentiated like this, but when it can, the Weighted Kappa can be used to estimate reliability.

**Intraclass Correlation Coefficient (ICC):** The ICC is a reliability coefficient. It calculates the variance in scores and is able to reflect both degree of correspondence and agreement between ratings. It ranges from 0 to 1.0. The ICC is a measure of reliability designed primarily for parametric variables (interval or ratio data), which are continuous, such as range of motion measurements, height, weight, etc. The interpretation of scores has been suggested as:

1.0 = Perfect

0.9 to 1.0 = Excellent

0.75 to 0.9 = Good

0.5 to 0.75 = Moderate

< 0.5 = Poor
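The ICC has several forms (one-way vs. two-way models, agreement vs. consistency); as an illustrative sketch only, the one-way random-effects form ICC(1,1) can be computed from the between-subjects and within-subjects mean squares. The ROM values below are hypothetical:

```python
import statistics

def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for a subjects-by-raters table of scores."""
    n = len(scores)     # subjects
    k = len(scores[0])  # raters (repeated measurements per subject)
    grand = statistics.mean(v for row in scores for v in row)
    row_means = [statistics.mean(row) for row in scores]
    # Mean squares between and within subjects
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for row, m in zip(scores, row_means) for v in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical: knee-flexion ROM for 4 patients, each measured by 2 raters
rom = [[120, 122], [135, 133], [110, 112], [125, 124]]
icc = icc_oneway(rom)  # about 0.98: excellent on the scale above
```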

**Correlation:** Correlation is a measure of association, not agreement (reliability measures agreement). It indicates the linear relationship between variables, ranges from -1 through 0 to 1, and is measured by coefficients (Pearson's or Spearman's). The closer the coefficient is to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation. A value of zero indicates no linear correlation between the variables. Correlation sizes have been defined as:

+/- 0.1 to 0.3 = Small

+/- 0.3 to 0.5 = Medium

+/- 0.5 to 1.0 = Large

**Correlation Coefficients:** Statistics that quantitatively describe the strength and direction of a relationship between two variables.
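As a sketch, the Pearson product-moment coefficient divides the covariation of two variables by the product of their variations; the strength and functional-outcome scores below are hypothetical:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between two variables."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical: strength scores vs. functional outcome scores
strength = [50, 55, 60, 65, 70]
func_score = [30, 34, 37, 43, 46]
r = pearson_r(strength, func_score)  # close to 1: a large positive correlation
```

A strong correlation like this still says nothing about agreement: one measure could run consistently 20 points below the other and r would be unchanged.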

**Reference:** Portney LG, Watkins MP (2000), Foundations of Clinical Research: Applications to Practice. Upper Saddle River: Prentice-Hall, Inc.