Recent News...

Happy Holidays!

Happy Holidays from the MGH Biostatistics Center

Site Redesign

Welcome to our redesigned website!

Support

You are here

Basic Clinical Statistics

How to create an Excel worksheet for analysis

Introduction

Many investigators use Excel spreadsheets or Access as a database for clinical data. In order to analyze data in a statistical package, it is usually necessary first convert your Access database or Excel spreadsheets into tables that can be read into statistical packages. By following the guidelines below, investigators can structure their spreadsheets so that the conversion is simple. Click here for a video demonstration of how to construct your Excel sheet.

Datasets for statistical analysis

The most common format of a dataset for statistical analysis is a rectangular grid of rows and columns. Each column represents a data element or variable (corresponding to a field in Access) such as age, gender, blood pressure, etc. while each row is the observed values for one patient at one particular point in time. Some sample data from a clinical trial on the effect of caffeine on blood pressure might look like Table 1 and Table 2.

The field names used by SAS and other packages must be no more than 32 characters, must contain only letters, digits, underscore and must not start with a digit. In SAS, upper case and lower case letters are equivalent so that two columns named AGE and age would cause confusion. You can use mixed uppercase and lowercase to make the field names more readable but make sure that the field name is unique (within the table) when converted to all uppercase.

Most statistical packages support at least two data types, numeric and character (text). If a variable is numeric don’t enter any characters in that column (except for the field name in row 1). To enter a missing numeric value just leave the cell blank.

Text or Character fields

Do NOT use quotes, the number sign (#) or commas in your text fields or as codes because these have special meanings in statistical packages. For example, if you want an abbreviation for inches use in or spell it out rather than using the quotes. For names such as O’Brien, use OBrien.

Patient identifiers and visit identifiers

Use the same field name for a subject id (e.g. ID, PatID, PatientID) in all tables where the id’s should match. If you have codes for visits, e.g. baseline, week1, week2, discharge, use the same codes in all tables. Use the same field name for the visit field (e.g. Visit) in all tables.

IMPORTANT: The information that you include on a spreadsheet that will be used by multiple researchers must follow HIPAA guidelines for patient confidentiality.

Coding Dichotomous (yes/no) variables

When possible enter dichotomous variables as numeric with 1=yes and 0=no. This makes them easier to include in analyses such as logistic regression without conversion.

Special Consideration for Excel

Do not use formulas for cell values. The sheet must be simple rows by columns (fields). If you are going to use Excel to compute things, e.g. means or percentages, then make a copy of the file before you do it. Also, make a code book and place it on a separate sheet. This way you have built in documentation for your data.

Constructing the Code Book

An important additional table to include in your Excel file or Access database is a code book. The code book includes a row for each variable in each table, and four features of each variable should be described. The first feature (TABLE) is the table in which the variable is located. If you have only one table, this feature is not needed. The second feature is the name of the variable in the table (FIELD). The third feature is a descriptive label of the variable (LABEL). Finally, the fourth feature provides a code for any categorical or binary variables (CODE).

Table 1: Blood pressure study: BASELINE Table
PID Age Female TX
101 23 1 1
102 14 0 0
101 25 0 1
Table 2: Blood pressure study: VISIT Table
PID Visit Blood pressure
101 0 121
101 1 123
101 2 124
102 0 142
102 1 147
102 2 145
103 0 161
103 1 165
103 2 168
Table 3: Blood pressure study: CODE BOOK Table
Table Field Label Code
Baseline PID patient id
Baseline Age age (years)
Baseline Female female gender 1=female, 0=male
Baseline TX treatment 1=regular, 0=decaf
Visit PID patient id
Visit Visit visit 0=Baseline,1=6 month, 2=12 month
Visit Blood pressure Systolic blood pressure (mmHg)

How to import data into statistical software

Click here for a video of how to import data into STATA.

How to summarize data

The first step in any data analysis is to investigate the data using summary statistics and graphics. This is the information presented in the description of the study population in Table 1 of most scientific papers. One approach to choosing how to summarize and graph your data is based upon the type of data you are summarizing. Five common types of data are continuous, binary, categorical, ordinal, and time to event. We will focus on the first two here, but if you would like more information on the other types of data please contact us. Click here for a video of how to summarize data in Excel. Click here for a video of how to summarize data in STATA.

Continuous Data

A continuous variable takes any numerical magnitude within the range of biologically possible values (age, BMI). As an example, here are five BMI measurements (18.3, 22.1, 23.3, 23.7, 30.0).

Continuous data are described in terms of the location and variability in the data, most commonly using the mean and the standard deviation. For example, in many papers, the age of study participants is summarized using these measures in Table 1.

Mean

The sample mean is equal to the sum of the observations divided by the number of observations. For the BMI measurements above, we obtain the sum, 18.3 + 22.1 + 23.3 + 23.7 + 30.0 = 117.4, so that the mean is 117.4/5=23.48.

Standard Deviation

The sample standard deviation (SD) is proportional to the square root of the average squared distance of each observation from the mean. For the BMI measurements above, the standard deviation is about 4.22.

If the data are skewed (see the histograms under Graphics tab for an example), authors sometimes choose to summarize continuous data using the median and range rather than the mean and standard deviation.

Median

The sample median is the 50th percentile of the measurements (the midpoint of the ordered values). For the BMI measurements above, the median is 23.3.

Range

The sample range is the interval defined by the maximum and minimum observations. For the BMI measurements above, the range is 18.3 to 30.0.

These summary statistics could be displayed in a table like this:

Variable Mean (SD) Median [Range]
BMI 23.48 (4.22) 23.3 [18.3 - 30.0]

To calculate these summary statistics for your dataset, you can input your data into this Excel worksheet. If you would like additional information, please contact us.

Binary Data

A binary variable takes two possible values, one value for each of two categories (sex, alive/dead). Binary data is summarized by simply counting the frequency of each category or dividing the frequency by the total number of samples to obtain the proportion or percentage of each category. Frequencies and the percent of the column total can be displayed in contingency tables such as this:

Active Rx Placebo Rx Total
Alive 12 (43%) 14 (61%) 26 (51%)
Dead 16 (57%) 9 (39%) 25 (49%)
Total 28 23 51

How to graph data

In addition to summary statistics, graphs provide an alternative way to summarize a sample of continuous observations.

Histogram

The histogram is a commonly used graph for a single continuous variable. A histogram shows the distribution of continuous data by breaking the data into bins and showing the frequency in each bin.

In this histogram, the data is roughly symmetric:

Histogram skewed

In this histogram the data is skewed to the right:

Histogram skewed

To plot a histogram, you can input your data into this Excel worksheet. If you would like additional information, please contact us.

Scatter Plot

A commonly used graph for displaying the relationship between a pair of continuous variables is the scatter plot. In a scatter plot, each point represents a single patient or animal. The horizontal or x-axis is usually the explanatory variable and the vertical or y-axis is usually the outcome variable.

Here is a scatter plot of the 50th systolic blood pressure percentile of girls in the 50th height percentile for ages 1 through 10. (Data taken from http://www.nhlbi.nih.gov/guidelines/hypertension/child_tbl.pdf)

Scatter plot

To plot a scatter plot, you can input your data into this Excel worksheet. If you would like additional information, please contact us.

How to test for a difference between two groups of samples

Hypothesis Test

A common scenario in clinical research is a study that compares a continuous outcome in two groups. For example, a clinical trial might compare the cholesterol level in patients given a new treatment and patients given a standard of care treatment to determine if the treatment improves cholesterol level. A first step in such an analysis is to calculate the mean cholesterol in each group. If you do not know how to calculate the mean, please see the previous section. If the group means are different, there are four potential reasons for the difference:

  1. Actual difference between the groups: In the example, the treatment caused a change in cholesterol.
  2. Chance: Two groups will often not have the same mean just by chance
  3. Bias:
  4. Confounding:

Although the last two are very important, we will not focus on them here. If you are concerned with bias and confounding in your study, please contact us.

The question in most studies is whether the difference between the groups is statistically significant. This is asking if the observed difference between the groups is unlikely due just to chance. The value that we calculate to assess whether the difference is likely due to chance is called the p-value, which is the chance of the observed group difference (or something more extreme) assuming the true group means were the same. If the p-value is less than 0.05, we say that the group difference is unlikely due to chance and therefore is statistically significant.

Continuous Data

A two-sample t-test allows us to test for a group difference with a continuous outcome. A t-test calculates the difference between the group means and determines how probable the observed difference is if there was truly no difference between the groups. This is the p-value. If the p-value is low enough (i.e. less than 0.05), we conclude that there is a significant difference between the groups. If you would like to complete a two-sample t-test on your data, please use this Excel worksheet. If you have additional questions, please contact us.

Binary Data

A chi-square test allows us to test for a group difference in the relative proportions of a binary outcome. As with the t-test, if the p-value is low enough (i.e. less than 0.05), we conclude that there is a difference in proportions between the groups. If you would like to complete a chi-square test on your data, please use this Excel worksheet. If you have additional questions, please contact us.

How to assess within group correlation of two continuous variables

The Pearson correlation between two continuous variables is a measure of how closely the scatter plot comes to falling on a straight line so that the two variables either increase together or one variable increases as the other variable decreases. It ranges from -1 to 1, with 1 indicating that the scatter plot falls exactly on a line sloping up to the right, -1 indicating a line sloping down to the right, and 0 indicating no linear association whatsoever.

The following pair of variables has a correlation of about 0.9.

Pearson correlation - 0.9

While the following have a correlation of about 0.3.

Pearson correlation - 0.3

If you would like to calculate the Pearson correlation between two variables and the Fisher’s Z-test p-value to test whether it is significantly different from zero, please use this Excel worksheet. If you have additional questions, please contact us.

How to compute sample size

A well designed study needs a sample large enough to minimize the probability of drawing an erroneous conclusion from the study results. In a randomized clinical trial comparing an active treatment to a control treatment, two types of errors are possible.

A Type I error is made by concluding that the active treatment is superior to the control treatment when in fact it is no better. A Type II error is made by concluding that the active treatment is no better than the control treatment when in fact it is superior.

The probability of making a Type I error is denoted by alpha and the probability of making a Type II error is denoted by beta. Power is defined as 1 minus beta.

The sample size needs to be large enough to guarantee high power, usually 0.80 or more, to conclude that an effective active treatment is indeed effective while simultaneously holding alpha to a low level, usually 0.05 or less. This guarantees a low Type I error rate if the active treatment is in fact ineffective.

Two-Sample T-Test of Means with Equal Size Groups

To compute the sample size for a two-sample t-test of means with equal size groups we need to specify:

  1. The alpha level, usually 0.05 or less.
  2. The power, usually 0.80 or more.
  3. The (common) within group standard deviation.
  4. The minimum mean difference between the groups considered to be clinically significant.

Chi-Square Test of Two Proportions with Equal Size Groups

To compute the sample size for a chi-square test of two proportions with equal size treatment groups we need to specify:

  1. The alpha level, usually 0.05 or less.
  2. The power, usually 0.80 or more.
  3. The hypothesized true proportion of successes in the control treatment group.
  4. The hypothesized true proportion of successes in the active treatment group.

Fisher’s Z-Test for Non-Zero Pearson Correlation

To compute the sample size to test for non-zero Pearson correlation between two variables we need to specify:

  1. The alpha level, usually 0.05 or less.
  2. The power, usually 0.80 or more.
  3. The hypothesized true correlation.

If you would like to calculate the approximate sample size for one of the above tests, please use this Excel worksheet. An alternative approach that addresses additional scenarios is our web-based sample size calculator available here. If you have additional questions, please contact us.