Submitted by jpedrin on Wed, 2016-12-21 16:27

**Happy Holidays from the MGH Biostatistics Center**

Many investigators use Excel spreadsheets or Access as a database for clinical data. In order to analyze data in a statistical package, it is usually necessary first convert your Access database or Excel spreadsheets into tables that can be read into statistical packages. By following the guidelines below, investigators can structure their spreadsheets so that the conversion is simple. Click here for a video demonstration of how to construct your Excel sheet.

The most common format of a dataset for statistical analysis is a rectangular grid of rows and columns. Each column represents a data element or variable (corresponding to a field in Access) such as age, gender, blood pressure, etc. while each row is the observed values for one patient at one particular point in time. Some sample data from a clinical trial on the effect of caffeine on blood pressure might look like Table 1 and Table 2.

The field names used by SAS and other packages must be no more than 32 characters, must contain only letters, digits, underscore and must not start with a digit. In SAS, upper case and lower case letters are equivalent so that two columns named AGE and age would cause confusion. You can use mixed uppercase and lowercase to make the field names more readable but make sure that the field name is unique (within the table) when converted to all uppercase.

Most statistical packages support at least two data types, numeric and character (text). If a variable is numeric don’t enter any characters in that column (except for the field name in row 1). To enter a missing numeric value just leave the cell blank.

Do NOT use quotes, the number sign (#) or commas in your text fields or as codes because these have special meanings in statistical packages. For example, if you want an abbreviation for inches use in or spell it out rather than using the quotes. For names such as O’Brien, use OBrien.

Use the same field name for a subject id (e.g. ID, PatID, PatientID) in all tables where the id’s should match. If you have codes for visits, e.g. baseline, week1, week2, discharge, use the same codes in all tables. Use the same field name for the visit field (e.g. Visit) in all tables.

IMPORTANT: The information that you include on a spreadsheet that will be used by multiple researchers must follow HIPAA guidelines for patient confidentiality.

When possible enter dichotomous variables as numeric with 1=yes and 0=no. This makes them easier to include in analyses such as logistic regression without conversion.

Do not use formulas for cell values. The sheet must be simple rows by columns (fields). If you are going to use Excel to compute things, e.g. means or percentages, then make a copy of the file before you do it. Also, make a code book and place it on a separate sheet. This way you have built in documentation for your data.

An important additional table to include in your Excel file or Access database is a code book. The code book includes a row for each variable in each table, and four features of each variable should be described. The first feature (TABLE) is the table in which the variable is located. If you have only one table, this feature is not needed. The second feature is the name of the variable in the table (FIELD). The third feature is a descriptive label of the variable (LABEL). Finally, the fourth feature provides a code for any categorical or binary variables (CODE).

PID | Age | Female | TX |
---|---|---|---|

101 | 23 | 1 | 1 |

102 | 14 | 0 | 0 |

101 | 25 | 0 | 1 |

PID | Visit | Blood pressure |
---|---|---|

101 | 0 | 121 |

101 | 1 | 123 |

101 | 2 | 124 |

102 | 0 | 142 |

102 | 1 | 147 |

102 | 2 | 145 |

103 | 0 | 161 |

103 | 1 | 165 |

103 | 2 | 168 |

Table | Field | Label | Code |
---|---|---|---|

Baseline | PID | patient id | |

Baseline | Age | age (years) | |

Baseline | Female | female gender | 1=female, 0=male |

Baseline | TX | treatment | 1=regular, 0=decaf |

Visit | PID | patient id | |

Visit | Visit | visit | 0=Baseline,1=6 month, 2=12 month |

Visit | Blood pressure | Systolic blood pressure (mmHg) |

Click here for a video of how to import data into STATA.

The first step in any data analysis is to investigate the data using *summary statistics* and *graphics*. This is the information presented in the description of the study population in Table 1 of most scientific papers. One approach to choosing how to summarize and graph your data is based upon the type of data you are summarizing. Five common types of data are *continuous*, *binary*, *categorical*, *ordinal*, and *time to event*. We will focus on the first two here, but if you would like more information on the other types of data please contact us. Click here for a video of how to summarize data in Excel. Click here for a video of how to summarize data in STATA.

A continuous variable takes any numerical magnitude within the range of biologically possible values (age, BMI). As an example, here are five BMI measurements (18.3, 22.1, 23.3, 23.7, 30.0).

Continuous data are described in terms of the *location* and *variability* in the data, most commonly using the *mean* and the *standard deviation*. For example, in many papers, the age of study participants is summarized using these measures in Table 1.

The sample mean is equal to the sum of the observations divided by the number of observations. For the BMI measurements above, we obtain the sum, 18.3 + 22.1 + 23.3 + 23.7 + 30.0 = 117.4, so that the mean is 117.4/5=23.48.

The sample standard deviation (SD) is proportional to the square root of the average squared distance of each observation from the mean. For the BMI measurements above, the standard deviation is about 4.22.

If the data are *skewed* (see the *histograms* under Graphics tab for an example), authors sometimes choose to summarize continuous data using the *median* and *range* rather than the mean and standard deviation.

The sample median is the 50th *percentile* of the measurements (the midpoint of the ordered values). For the BMI measurements above, the median is 23.3.

The sample range is the interval defined by the maximum and minimum observations. For the BMI measurements above, the range is 18.3 to 30.0.

These summary statistics could be displayed in a table like this:

Variable | Mean (SD) | Median [Range] |
---|---|---|

BMI | 23.48 (4.22) | 23.3 [18.3 - 30.0] |

To calculate these summary statistics for your dataset, you can input your data into this Excel worksheet. If you would like additional information, please contact us.

A binary variable takes two possible values, one value for each of two categories (sex, alive/dead). Binary data is summarized by simply counting the *frequency* of each category or dividing the frequency by the total number of samples to obtain the *proportion* or percentage of each category. Frequencies and the percent of the column total can be displayed in contingency tables such as this:

Active Rx | Placebo Rx | Total | |
---|---|---|---|

Alive | 12 (43%) | 14 (61%) | 26 (51%) |

Dead | 16 (57%) | 9 (39%) | 25 (49%) |

Total | 28 | 23 | 51 |

In addition to summary statistics, graphs provide an alternative way to summarize a sample of continuous observations.

The histogram is a commonly used graph for a single continuous variable. A histogram shows the *distribution* of continuous data by breaking the data into bins and showing the frequency in each bin.

In this histogram, the data is roughly *symmetric*:

In this histogram the data is skewed to the right:

To plot a histogram, you can input your data into this Excel worksheet. If you would like additional information, please contact us.

A commonly used graph for displaying the relationship between a pair of continuous variables is the scatter plot. In a scatter plot, each point represents a single patient or animal. The horizontal or x-axis is usually the *explanatory variable* and the vertical or y-axis is usually the *outcome variable*.

Here is a scatter plot of the 50th systolic blood pressure percentile of girls in the 50th height percentile for ages 1 through 10. (Data taken from http://www.nhlbi.nih.gov/guidelines/hypertension/child_tbl.pdf)

To plot a scatter plot, you can input your data into this Excel worksheet. If you would like additional information, please contact us.

A common scenario in clinical research is a study that compares a continuous outcome in two groups. For example, a clinical trial might compare the cholesterol level in patients given a new treatment and patients given a standard of care treatment to determine if the treatment improves cholesterol level. A first step in such an analysis is to calculate the mean cholesterol in each group. If you do not know how to calculate the mean, please see the previous section. If the group means are different, there are four potential reasons for the difference:

- Actual difference between the groups: In the example, the treatment caused a change in cholesterol.
- Chance: Two groups will often not have the same mean just by chance
*Bias:**Confounding:*

Although the last two are very important, we will not focus on them here. If you are concerned with bias and confounding in your study, please contact us.

The question in most studies is whether the difference between the groups is *statistically significant*. This is asking if the observed difference between the groups is unlikely due just to chance. The value that we calculate to assess whether the difference is likely due to chance is called the *p-value*, which is the chance of the observed group difference (or something more extreme) assuming the true group means were the same. If the *p-value* is less than 0.05, we say that the group difference is unlikely due to chance and therefore is statistically significant.

A *two-sample t-test* allows us to test for a group difference with a continuous outcome. A t-test calculates the difference between the group means and determines how probable the observed difference is if there was truly no difference between the groups. This is the p-value. If the p-value is low enough (i.e. less than 0.05), we conclude that there is a significant difference between the groups. If you would like to complete a two-sample t-test on your data, please use this Excel worksheet. If you have additional questions, please contact us.

A *chi-square test* allows us to test for a group difference in the relative proportions of a binary outcome. As with the t-test, if the p-value is low enough (i.e. less than 0.05), we conclude that there is a difference in proportions between the groups. If you would like to complete a chi-square test on your data, please use this Excel worksheet. If you have additional questions, please contact us.

The *Pearson correlation* between two continuous variables is a measure of how closely the scatter plot comes to falling on a straight line so that the two variables either increase together or one variable increases as the other variable decreases. It ranges from -1 to 1, with 1 indicating that the scatter plot falls exactly on a line sloping up to the right, -1 indicating a line sloping down to the right, and 0 indicating no linear association whatsoever.

The following pair of variables has a correlation of about 0.9.

While the following have a correlation of about 0.3.

If you would like to calculate the Pearson correlation between two variables and the *Fisher’s Z-test* p-value to test whether it is significantly different from zero, please use this Excel worksheet. If you have additional questions, please contact us.

A well designed study needs a sample large enough to minimize the probability of drawing an erroneous conclusion from the study results. In a randomized clinical trial comparing an active treatment to a control treatment, two types of errors are possible.

A *Type I error* is made by concluding that the active treatment is superior to the control treatment when in fact it is no better. *A Type II error* is made by concluding that the active treatment is no better than the control treatment when in fact it is superior.

The probability of making a Type I error is denoted by *alpha* and the probability of making a Type II error is denoted by *beta*. *Power* is defined as 1 minus beta.

The sample size needs to be large enough to guarantee high power, usually 0.80 or more, to conclude that an effective active treatment is indeed effective while simultaneously holding alpha to a low level, usually 0.05 or less. This guarantees a low Type I error rate if the active treatment is in fact ineffective.

To compute the sample size for a two-sample t-test of means with equal size groups we need to specify:

- The alpha level, usually 0.05 or less.
- The power, usually 0.80 or more.
- The (common) within group standard deviation.
- The minimum mean difference between the groups considered to be clinically significant.

To compute the sample size for a chi-square test of two proportions with equal size treatment groups we need to specify:

- The alpha level, usually 0.05 or less.
- The power, usually 0.80 or more.
- The hypothesized true proportion of successes in the control treatment group.
- The hypothesized true proportion of successes in the active treatment group.

To compute the sample size to test for non-zero Pearson correlation between two variables we need to specify:

- The alpha level, usually 0.05 or less.
- The power, usually 0.80 or more.
- The hypothesized true correlation.

If you would like to calculate the approximate sample size for one of the above tests, please use this Excel worksheet. An alternative approach that addresses additional scenarios is our web-based sample size calculator available here. If you have additional questions, please contact us.