MGH Biostatistics Center

MGH Biostatistics Center Archive

The MGH Biostatistics Center has moved to a new website at https://biostatistics.massgeneral.org/.

Various links to the older site may exist. This site provides all of the content that might have been referenced externally.

Technical reports and software

Asymptotic Tests For Location, Variance, and Equivalence of Gene Expression Using the Distance Matrix

by Douglas Hayden, Mark Kon

This technical report summarizes work done to develop asymptotic tests for location, variance, and equivalence of gene expression between two groups of microarrays in order to find an alternative to permutation tests based on the pairwise distance matrix. Given additive errors, the location test is not asymptotically normal without further assumptions. The variance test and equivalence test are asymptotically normal.

Survival Analysis of Longitudinal Microarrays

by Natasa Rajicic

R code for implementing survival analysis of longitudinally collected gene expression data. Methods are described in the linked paper by Rajicic N, Finkelstein DM, and Schoenfeld DA, submitted to Bioinformatics, May 2006.

Production of Deidentified Datasets at the Conclusion of Human Subjects Trials

by A. Korpak, D. Schoenfeld; for MGH Clinical Research Day 2005

This poster, prepared for the MGH Clinical Research Day 2005, discusses some important considerations regarding the preparation of deidentified datasets. It is a requirement of most NIH-sponsored trials that data be provided for distribution to qualified investigators. Studies involving human subjects face the challenge of preparing a dataset that is both useful and protects patient confidentiality. This is a new area, with few established guidelines; the poster looks at some specific issues that have arisen in the context of dataset preparations done for ARDS Network studies.

Analysis of familial aggregation studies with complex ascertainment schemes

by A. Matthews, D. Finkelstein, R. Betensky

This paper proposes an approach to adjust analyses of family studies for complex ascertainment schemes where the sampling is dependent on the disease history of the entire family. This approach extends that of Tosteson et al. 1991 to handle these types of sampling schemes.

Analysis of Co-aggregation of Cancer Based on Registry Data

by A. Matthews et al.

We conducted an exploratory analysis of co-aggregation of cancers in individuals and families utilizing sibships from over 18,000 families who had been recruited to the registry of the NCI-sponsored multi-institutional Cancer Genetics Network. We found statistically significant familial co-aggregation of lung cancer with pancreatic (p<0.0001), prostate (p<0.001), and colorectal cancers (p=0.003). In addition, we found significant familial co-aggregation of pancreatic and colorectal cancers (p=0.022), and co-aggregation of hematopoietic and (non-ovarian) gynecologic cancers (p=0.01).

A set of functions for running Matlab functions in parallel

micaParalize is a set of functions that allow a user to easily run normal Matlab functions on a multi-processor machine by dividing the workload among the several processors. The program is used for simulations, bootstraps, and function maximization problems in biostatistics.

Notes

micaParalize has been superseded by biopara, which is on the software page.

One sample log-rank software

by Dianne M. Finkelstein, Alona Muzikansky, David A. Schoenfeld

1s_logrank.xls computes the one-sample log-rank test and confidence intervals for the SMR, calculates an estimate of survivorship in the matched standard population, and visually compares the survivorship of the sample to that of the standard population, as described in the paper and instructions (both included in the zip file). The paper was published as a commentary in the Journal of the National Cancer Institute, Vol. 95, No. 19, Oct 1 2003, pp. 1434–1439.

Analysis of Failure Time Data From Screening Studies With Missing Observations

by Dianne Finkelstein

"Analysis of Failure Time Data From Screening Studies With Missing Observations" was delivered by Dianne Finkelstein, Ph.D. at the 2003 Joint Statistical Meetings in San Fransisco.

A program for analysis of failure time data with dependent interval censoring

by Dianne Finkelstein and David Schoenfeld

depcen.exe is a program for estimating survival probabilities and probabilities of attending visits as described in the paper "Analysis of Failure Time Data with Dependent Interval Censoring" (Finkelstein D.M., Goggins W.B., and Schoenfeld D.A., Biometrics 2002; 58:298-304). The program was implemented in Matlab and runs as a batch job from a DOS command prompt. The time to blood shedding data from the paper is also included. "interval_censr_data.zip" contains the data in .dat format and the .sas file required for setup. When using this data, please reference the article cited above.

A program for computing sequential boundaries

by David Schoenfeld, Ph. D.

Gen.m is an m-file (Matlab/Octave) for computing sequential boundaries, as described in the paper "A Simple Algorithm for Designing Group Sequential Clinical Trials" (Schoenfeld, Biometrics 57, 972-974; September 2001). If you have Matlab, download gen.m. Gen.m can also be run under Octave, a public-domain m-file interpreter which can be downloaded from the URL below. In addition, sequential.zip contains a compiled version of gen.m which runs on the command line.

Sample Size Program

by David Schoenfeld, Ph. D.

A web-based program for computing various sample sizes.

Fitting the Proportional Hazards Model for Interval Censored data

by William Goggins

This program is a Monte Carlo EM (MCEM) algorithm for fitting the proportional hazards model for interval censored failure time data. The algorithm generates orderings of the failures from their probability distribution under the model. We maximize the average of the log-likelihoods from these completed data sets to obtain updated parameter estimates. As with the standard Cox (1972) model, this algorithm does not require the estimation of the baseline hazard function. The method is described in the paper Goggins W, Finkelstein DM and Zaslavsky A. Applying the Cox Proportional Hazards Model when the Change Time of a Binary Time-Varying Covariate is Interval-Censored. Biometrics 1999;55: 445-451

Notes

The SPARC object file above means that using this software requires access to a SPARC-based Sun workstation with S installed.

Search script for the COSTART Thesaurus of Adverse Reaction Terms

by Richard Morse

costart.html is a JavaScript application for searching the Coding Symbols for Thesaurus of Adverse Reaction Terms (5th Edition). The page can be saved to a local computer so that it can be accessed while that computer is not connected to the internet. To do this from the COSTART page, select Save from the File menu and make sure to save the file as the type "Web page, complete (.htm, .html)" (Internet Explorer).

Pfor, Psimulate, Pbootstrap

by Peter Lazar

Parallel-enabled utilities for use with Mathworks' Distributed Computing Toolbox: a parallel for loop, a parallel simulation, and a parallel bootstrap that make use of the DCT to speed up execution.

biopara — self contained parallel system for the R statistical package

by Peter Lazar

Biopara is a parallel framework designed to allow users of R to distribute execution of coarsely parallel problems over multiple machines for concurrent execution. Biopara is called from within an R session and results are returned to the same session as a list of native R objects that can be directly manipulated without reinterpretation.

Permutation Tests For Location and Variance Using the Distance Matrix

by Douglas Hayden and Peter Lazar

Permutation test to compare variability within and distance between two groups of microarrays using the distance matrix. This software uses R, and is available through CRAN.

Corticosteroid Equivalent Dose Calculator

by Richard Morse

sterconv.html is a JavaScript application which converts milligrams of various corticosteroids to the equivalent dose of Methylprednisolone. The conversion data is adapted from Goodman and Gilman's The Pharmacological Basis of Therapeutics, 9th ed. (1996); Hardman JG and Limbird LE, editors; New York: McGraw-Hill Health Professions Division; page 1466.

Power and Sample Size Calculation for Logistic and Cox Models

This application calculates the power and sample size for the Cox and logistic models. It is written in Matlab/Octave. If you have Matlab (or its free alternative, Octave) on your computer, you can download the source files in the zip file "cox_and_logistic_matlab_programs.zip". The programs to run the models are powl8 for the logistic model and powc9 for the Cox model. The other programs are necessary helper functions which must be on your path. Included is a link to download Octave. If you don't wish to use the Octave/Matlab version, download and run the MCRInstaller first and then use the zip file "logistic_cox_power.zip".

A program to calculate the power or sample size for a random slopes model

by David A. Schoenfeld

This program calculates the power or sample size for a random slopes model. This model assumes that each patient has a linear trajectory with a slope and intercept that have a multivariate normal distribution. The fixed effects are time and a time-by-treatment interaction. The time-by-treatment interaction is the treatment effect, as it measures the extent to which the mean slope differs between the two treatment groups. The program requires the variance-covariance matrix for the random effects and the error variance of observations around a patient's trajectory.
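
Written out as a sketch (the notation below is ours, not taken from the program's documentation), the outcome for patient i at time t_{ij} is

Y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + \beta_2 T_i + b_{1i}) t_{ij} + \epsilon_{ij},

where T_i indicates the treatment group, \beta_2 is the time-by-treatment interaction (the treatment effect), (b_{0i}, b_{1i}) is multivariate normal with the user-supplied variance-covariance matrix, and \epsilon_{ij} has the user-supplied error variance.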

The program is an m-file (Matlab/Octave) and will run under Octave, a public-domain m-file interpreter.

The program has also been ported to R.

Notes

2009 Feb 24; Matlab program updated.

Power calculator for a Sequential Parallel Comparison Design (SPCD)

by David Schoenfeld Ph.D.

An Excel spreadsheet that calculates the power of a sequential parallel comparison design (often abbreviated SPCD), allowing for dropouts, based on the paper by Tamura and Huang, Clinical Trials 2007; 4:309-317.

Analysis of the relationship between longitudinal gene expressions and ordered categorical event data

by Natasa Rajicic, Dianne M. Finkelstein, David A. Schoenfeld and the Inflammation and Host Response to Injury Research Program Investigators

The NIH project “Inflammatory and Host Response to Injury” (Glue) is being conducted to study the changes in the body over time in response to trauma and burn. Patients are monitored for changes in their clinical status, such as the onset of and recovery from organ failure. Blood samples are drawn over the first days and weeks after the injury to obtain gene expression levels over time. Our goal was to develop a method of selecting genes that are differentially expressed in patients who either improved or experienced organ failure. For this, we needed a test for the association between longitudinal gene expressions and the time to the occurrence of ordered categorical outcomes indicating recovery, stable disease, and organ failure. We propose a test in which the relationship between the gene expression and the events is modeled using the cumulative proportional odds model; this is a generalization of the pooled repeated observations method. Given the high dimensionality of the microarray data, it was necessary to control for the multiplicity of the testing. To control the false discovery rate (FDR), we applied both a permutational approach and Efron's empirical estimation method. We explore our method through simulations and provide the analysis of the multi-center, longitudinal study of immune response to inflammation and trauma (http://www.gluegrant.org). This paper was published in Statistics in Medicine 2009; 28:2817–2832; published online 17 July 2009 in Wiley InterScience (www.interscience.wiley.com), DOI: 10.1002/sim.3665.

Effects of Mismodeling on the Proportional Hazards Regression Model

by David Schoenfeld, Ph. D.

This graphical tool shows the effect of non-proportional hazards or a missing covariate on the power of the proportional hazards regression model. One can manipulate the hazard functions using sliders to create different hazard ratios over time keeping either the final survival rates, the mean or the median constant. Further, one can alter the hazard ratio by not considering an important covariate. The power is shown for each hazard ratio function. The graphical user interface shows each hazard function, each survival curve and the hazard ratio.

Notes

To run this program you must have Java and either Matlab or Matlab Compiler Runtime (MCR) v7.13 installed. If you do not have Matlab or MCR, use the links above to download the appropriate installer for your system.

Compute summary statistics for a continuous variable such as age, BMI, or diastolic blood pressure.

by Douglas Hayden

Compare the mean of a continuous variable such as age, BMI, or diastolic blood pressure between two groups.

by Douglas Hayden

Compute the correlation between two continuous variables such as age and BMI or BMI and diastolic blood pressure.

by Douglas Hayden

Compare a proportion, such as proportion of female subjects, between two groups.

by Douglas Hayden

Plot a histogram for one continuous variable such as age, BMI, or diastolic blood pressure.

by Douglas Hayden

Plot a scatter plot for two continuous variables such as age and BMI or BMI and diastolic blood pressure.

by Douglas Hayden

Compute approximate required sample size

by Douglas Hayden

Outcome-Driven Clustering of Microarray Data

by Jessie Hsu

Design and analysis of a clinical trial using previous trials as historical control

by Schoenfeld et al.

For single arm trials, a treatment is evaluated by comparing the results to historically reported summary statistics. Such a historically controlled trial is often analyzed as if summary statistics from previous trials were known without variation. We develop a test of treatment efficacy and sample size calculation for historically controlled trials that considers this variation. Using a Bayesian approach we take into account both variation due to estimation and variation in the estimand from trial to trial. As an example, we apply these methods to a clinical trial for amyotrophic lateral sclerosis (ALS) using data from the placebo groups of sixteen trials. We find that when attempting to detect a small effect size, historically controlled trials would require more patients than concurrently controlled trials. However, when the treatment effect size is large, historically controlled trials can be an efficient approach to demonstrating treatment benefit. We also show that utilizing patient level data for the prognostic covariates can reduce the sample size required for a historically controlled trial.

Seminars

JSM 2014 Session on Challenges

Date/Time:

Tuesday, August 5, 2014 - 8:30am EDT

Speaker(s):

  • Dianne Finkelstein, PhD
  • Andrew Strahs, PhD
  • Ritesh Ramchandani
  • Marc Buyse
  • David Schoenfeld

Summary:

A clinical trial may collect several outcomes that are to be used in deciding whether a new therapy is effective and tolerable for treating a chronic disease such as cancer. Often some of the outcomes (such as mortality) are censored. In oncology, the combined outcome of progression-free survival has been a standard primary outcome for some time, but there can be ambiguous results when survival benefit is not clear. In other diseases, such as in cardiology and neurology, there has been interest in using global tests that combine multiple (possibly hierarchical) outcomes to assess new regimens. This session will discuss considerations for the design and interpretation of a trial with multiple outcomes. Proposed non-parametric tests for a combined outcome will be considered. Specific trials will be discussed as illustrations of the issues.

  1. Introduction: Issues in multiple outcomes in clinical trials when they are ordered or simultaneous and possibly censored (Dianne Finkelstein, Mass General Hospital & Harvard University)
  2. Case Study on Analysis of Progression and Survival in a Trial with Treatment Crossover: the Aveo trial of Tivozanib in Renal Cancer (Andrew Strahs, PhD, Alnylam Inc)
  3. Global Tests for Multiple Outcomes in Chronic Disease Trials (Ritesh Ramchandani, Harvard School of Public Health)
  4. Comparing and Summarizing Multiple Outcomes in a Clinical Trial (Marc Buyse, CluePoints Inc, IDDI, Hasselt University, Belgium)
  5. Discussion: David Schoenfeld

Support

External Resources

* This is a link to an external site. The MGH Biostatistics Center is not responsible for the content at this location and can make no guarantees or promises or claims about its accuracy. If you have any questions, you should consult a statistician.

MGH Biostatistics Center Interactive Introduction to Biostatistics

The Introduction to Biostatistics is designed to provide researchers at MGH with the tools necessary to complete basic statistical analysis. This introduction consists of the following main sections:

Basic Clinical Statistics

In this section, we provide new clinical researchers with a basic introduction to biostatistical concepts and terms by describing how to organize a dataset for analysis, how to summarize or graph data, how to assess whether there is a relationship between variables, and how to determine the sample size required for a future study. If you are new to clinical research, we suggest you start here to learn or review some important concepts. GO »

Key to Statistical Tests

In this section, we provide a step-by-step, "Choose your own adventure"-style approach to choosing the appropriate statistical analysis for your data. In addition to identifying an appropriate analysis, we demonstrate how to complete each statistical analysis in several statistical software programs. If you are new to clinical research, we suggest that you start with this approach to choosing a statistical test. GO »

Table for Statistical Tests

In this section, we provide a table format with a more compact approach to choose the best analysis for your data. As in the Key to Statistical Tests, a demonstration of how to complete each analysis in statistical software is provided. GO »

Power and Sample Size

In this section, we provide a step-by-step approach to perform a power or sample size calculation for several different study designs. In addition, web-based power or sample size calculators that work on the desktop, mobile phone, and tablet are provided for several common scenarios. GO »

Basic Clinical Statistics

How to create an Excel worksheet for analysis

Introduction

Many investigators use Excel spreadsheets or Access as a database for clinical data. In order to analyze the data in a statistical package, it is usually necessary to first convert your Access database or Excel spreadsheets into tables that can be read into statistical packages. By following the guidelines below, investigators can structure their spreadsheets so that the conversion is simple. Click here for a video demonstration of how to construct your Excel sheet.

Datasets for statistical analysis

The most common format of a dataset for statistical analysis is a rectangular grid of rows and columns. Each column represents a data element or variable (corresponding to a field in Access), such as age, gender, or blood pressure, while each row contains the observed values for one patient at one particular point in time. Some sample data from a clinical trial on the effect of caffeine on blood pressure might look like Table 1 and Table 2.

Field names used by SAS and other packages must be no more than 32 characters long, must contain only letters, digits, and underscores, and must not start with a digit. In SAS, uppercase and lowercase letters are equivalent, so two columns named AGE and age would cause confusion. You can use mixed uppercase and lowercase to make the field names more readable, but make sure that each field name is unique (within the table) when converted to all uppercase.

Most statistical packages support at least two data types, numeric and character (text). If a variable is numeric, don't enter any non-numeric characters in that column (except for the field name in row 1). To enter a missing numeric value, just leave the cell blank.

Text or Character fields

Do NOT use quotes, the number sign (#), or commas in your text fields or as codes, because these have special meanings in statistical packages. For example, if you want an abbreviation for inches, use in or spell it out rather than using the quote symbol. For names such as O'Brien, use OBrien.

Patient identifiers and visit identifiers

Use the same field name for a subject id (e.g. ID, PatID, PatientID) in all tables where the id's should match. If you have codes for visits, e.g. baseline, week1, week2, discharge, use the same codes in all tables. Use the same field name for the visit field (e.g. Visit) in all tables.

IMPORTANT: The information that you include on a spreadsheet that will be used by multiple researchers must follow HIPAA guidelines for patient confidentiality.

Coding Dichotomous (yes/no) variables

When possible enter dichotomous variables as numeric with 1=yes and 0=no. This makes them easier to include in analyses such as logistic regression without conversion.

Special Consideration for Excel

Do not use formulas for cell values. The sheet must be simple rows by columns (fields). If you are going to use Excel to compute things, e.g. means or percentages, then make a copy of the file before you do it. Also, make a code book and place it on a separate sheet. This way you have built-in documentation for your data.

Constructing the Code Book

An important additional table to include in your Excel file or Access database is a code book. The code book includes a row for each variable in each table, and four features of each variable should be described. The first feature (TABLE) is the table in which the variable is located. If you have only one table, this feature is not needed. The second feature is the name of the variable in the table (FIELD). The third feature is a descriptive label of the variable (LABEL). Finally, the fourth feature provides a code for any categorical or binary variables (CODE).

Table 1: Blood pressure study: BASELINE Table
PID Age Female TX
101 23 1 1
102 14 0 0
103 25 0 1
Table 2: Blood pressure study: VISIT Table
PID Visit Blood pressure
101 0 121
101 1 123
101 2 124
102 0 142
102 1 147
102 2 145
103 0 161
103 1 165
103 2 168
Table 3: Blood pressure study: CODE BOOK Table
Table Field Label Code
Baseline PID patient id
Baseline Age age (years)
Baseline Female female gender 1=female, 0=male
Baseline TX treatment 1=regular, 0=decaf
Visit PID patient id
Visit Visit visit 0=Baseline,1=6 month, 2=12 month
Visit Blood pressure Systolic blood pressure (mmHg)

How to import data into statistical software

Click here for a video of how to import data into STATA.

How to summarize data

The first step in any data analysis is to investigate the data using summary statistics and graphics. This is the information presented in the description of the study population in Table 1 of most scientific papers. One approach to choosing how to summarize and graph your data is based upon the type of data you are summarizing. Five common types of data are continuous, binary, categorical, ordinal, and time to event. We will focus on the first two here. Click here for a video of how to summarize data in Excel. Click here for a video of how to summarize data in STATA.

Continuous Data

A continuous variable takes any numerical magnitude within the range of biologically possible values (age, BMI). As an example, here are five BMI measurements (18.3, 22.1, 23.3, 23.7, 30.0).

Continuous data are described in terms of the location and variability in the data, most commonly using the mean and the standard deviation. For example, in many papers, the age of study participants is summarized using these measures in Table 1.

Mean

The sample mean is equal to the sum of the observations divided by the number of observations. For the BMI measurements above, we obtain the sum, 18.3 + 22.1 + 23.3 + 23.7 + 30.0 = 117.4, so that the mean is 117.4/5=23.48.

Standard Deviation

The sample standard deviation (SD) is proportional to the square root of the average squared distance of each observation from the mean. For the BMI measurements above, the standard deviation is about 4.22.

If the data are skewed (see the histograms under Graphics tab for an example), authors sometimes choose to summarize continuous data using the median and range rather than the mean and standard deviation.

Median

The sample median is the 50th percentile of the measurements (the midpoint of the ordered values). For the BMI measurements above, the median is 23.3.

Range

The sample range is the interval defined by the maximum and minimum observations. For the BMI measurements above, the range is 18.3 to 30.0.

These summary statistics could be displayed in a table like this:

Variable Mean (SD) Median [Range]
BMI 23.48 (4.22) 23.3 [18.3 - 30.0]

To calculate these summary statistics for your dataset, you can input your data into this Excel worksheet.
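
Alternatively, if you use R, a minimal sketch that reproduces these summary statistics for the five BMI values above is:

# the five BMI measurements from the example above
bmi <- c(18.3, 22.1, 23.3, 23.7, 30.0)
mean(bmi)    # 23.48
sd(bmi)      # about 4.22 (sample standard deviation)
median(bmi)  # 23.3
range(bmi)   # 18.3 30.0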

Binary Data

A binary variable takes two possible values, one value for each of two categories (sex, alive/dead). Binary data is summarized by simply counting the frequency of each category or dividing the frequency by the total number of samples to obtain the proportion or percentage of each category. Frequencies and the percent of the column total can be displayed in contingency tables such as this:

Active Rx Placebo Rx Total
Alive 12 (43%) 14 (61%) 26 (51%)
Dead 16 (57%) 9 (39%) 25 (49%)
Total 28 23 51
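
If you use R, a minimal sketch of this kind of summary (the status and rx variables below are hypothetical, constructed to match the counts in the table) is:

# reconstruct the example data: 26 alive (12 active, 14 placebo), 25 dead (16 active, 9 placebo)
status <- rep(c("Alive", "Dead"), times = c(26, 25))
rx <- c(rep(c("Active", "Placebo"), times = c(12, 14)),
        rep(c("Active", "Placebo"), times = c(16, 9)))
tab <- table(status, rx)
tab                               # frequencies
round(100 * prop.table(tab, 2))   # percent of each column total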

How to graph data

In addition to summary statistics, graphs provide an alternative way to summarize a sample of continuous observations.

Histogram

The histogram is a commonly used graph for a single continuous variable. A histogram shows the distribution of continuous data by breaking the data into bins and showing the frequency in each bin.

In this histogram, the data is roughly symmetric:

Histogram (roughly symmetric data)

In this histogram the data is skewed to the right:

Histogram (right-skewed data)

To plot a histogram, you can input your data into this Excel worksheet.

Scatter Plot

A commonly used graph for displaying the relationship between a pair of continuous variables is the scatter plot. In a scatter plot, each point represents a single patient or animal. The horizontal or x-axis is usually the explanatory variable and the vertical or y-axis is usually the outcome variable.

Here is a scatter plot of the 50th systolic blood pressure percentile of girls in the 50th height percentile for ages 1 through 10. (Data taken from http://www.nhlbi.nih.gov/guidelines/hypertension/child_tbl.pdf)

Scatter plot

To plot a scatter plot, you can input your data into this Excel worksheet.

How to test for a difference between two groups of samples

Hypothesis Test

A common scenario in clinical research is a study that compares a continuous outcome in two groups. For example, a clinical trial might compare the cholesterol level in patients given a new treatment and patients given a standard of care treatment to determine if the treatment improves cholesterol level. A first step in such an analysis is to calculate the mean cholesterol in each group. If you do not know how to calculate the mean, please see the previous section. If the group means are different, there are four potential reasons for the difference:

  1. Actual difference between the groups: In the example, the treatment caused a change in cholesterol.
  2. Chance: Two groups will often not have the same mean just by chance
  3. Bias: a systematic error in how the subjects were selected or how the outcome was measured.
  4. Confounding: a third variable that is associated with both the group and the outcome.

Although the last two are very important, we will not focus on them here.

The question in most studies is whether the difference between the groups is statistically significant. This is asking if the observed difference between the groups is unlikely due just to chance. The value that we calculate to assess whether the difference is likely due to chance is called the p-value, which is the chance of the observed group difference (or something more extreme) assuming the true group means were the same. If the p-value is less than 0.05, we say that the group difference is unlikely due to chance and therefore is statistically significant.

Continuous Data

A two-sample t-test allows us to test for a group difference with a continuous outcome. A t-test calculates the difference between the group means and determines how probable a difference at least as large as the one observed would be if there were truly no difference between the groups. This is the p-value. If the p-value is low enough (i.e. less than 0.05), we conclude that there is a significant difference between the groups. If you would like to complete a two-sample t-test on your data, please use this Excel worksheet.
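
If you use R instead, a minimal sketch of the two-sample t-test (the cholesterol values below are hypothetical) is:

# cholesterol levels in the two treatment groups (hypothetical values)
new_treatment <- c(182, 190, 175, 168, 177, 185)
standard_care <- c(201, 195, 188, 210, 199, 192)
t.test(new_treatment, standard_care)   # Welch two-sample t-test; reports the p-value and 95% CI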

Binary Data

A chi-square test allows us to test for a group difference in the relative proportions of a binary outcome. As with the t-test, if the p-value is low enough (i.e. less than 0.05), we conclude that there is a difference in proportions between the groups. If you would like to complete a chi-square test on your data, please use this Excel worksheet.
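
If you use R instead, a minimal sketch of the chi-square test (with hypothetical counts; Fisher's exact test is shown as the alternative when expected cell counts are small) is:

# 2x2 table of outcome by treatment group (hypothetical counts)
counts <- matrix(c(12, 16, 14, 9), nrow = 2,
                 dimnames = list(status = c("Alive", "Dead"),
                                 rx = c("Active", "Placebo")))
chisq.test(counts)    # chi-square test for a difference in proportions
fisher.test(counts)   # exact alternative when expected cell counts are small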

How to assess within group correlation of two continuous variables

The Pearson correlation between two continuous variables is a measure of how closely the scatter plot comes to falling on a straight line so that the two variables either increase together or one variable increases as the other variable decreases. It ranges from -1 to 1, with 1 indicating that the scatter plot falls exactly on a line sloping up to the right, -1 indicating a line sloping down to the right, and 0 indicating no linear association whatsoever.

The following pair of variables has a correlation of about 0.9.

Pearson correlation - 0.9

The following pair has a correlation of about 0.3.

Pearson correlation - 0.3

If you would like to calculate the Pearson correlation between two variables and the Fisher's Z-test p-value to test whether it is significantly different from zero, please use this Excel worksheet.
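
If you use R instead, a minimal sketch (the age and BMI values below are hypothetical) is:

age <- c(25, 31, 38, 44, 52, 60, 67)               # hypothetical data
bmi <- c(21.0, 22.4, 24.1, 23.8, 26.0, 25.5, 27.2)
r <- cor(age, bmi)                     # Pearson correlation
cor.test(age, bmi)                     # test of zero correlation with a confidence interval
z <- atanh(r) * sqrt(length(age) - 3)  # Fisher's Z statistic
2 * pnorm(-abs(z))                     # two-sided Fisher's Z-test p-value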

How to compute sample size

A well designed study needs a sample large enough to minimize the probability of drawing an erroneous conclusion from the study results. In a randomized clinical trial comparing an active treatment to a control treatment, two types of errors are possible.

A Type I error is made by concluding that the active treatment is superior to the control treatment when in fact it is no better. A Type II error is made by concluding that the active treatment is no better than the control treatment when in fact it is superior.

The probability of making a Type I error is denoted by alpha and the probability of making a Type II error is denoted by beta. Power is defined as 1 minus beta.

The sample size needs to be large enough to guarantee high power, usually 0.80 or more, to conclude that an effective active treatment is indeed effective while simultaneously holding alpha to a low level, usually 0.05 or less. This guarantees a low Type I error rate if the active treatment is in fact ineffective.

Two-Sample T-Test of Means with Equal Size Groups

To compute the sample size for a two-sample t-test of means with equal size groups we need to specify:

  1. The alpha level, usually 0.05 or less.
  2. The power, usually 0.80 or more.
  3. The (common) within group standard deviation.
  4. The minimum mean difference between the groups considered to be clinically significant.

Chi-Square Test of Two Proportions with Equal Size Groups

To compute the sample size for a chi-square test of two proportions with equal size treatment groups we need to specify:

  1. The alpha level, usually 0.05 or less.
  2. The power, usually 0.80 or more.
  3. The hypothesized true proportion of successes in the control treatment group.
  4. The hypothesized true proportion of successes in the active treatment group.

Fisher's Z-Test for Non-Zero Pearson Correlation

To compute the sample size to test for non-zero Pearson correlation between two variables we need to specify:

  1. The alpha level, usually 0.05 or less.
  2. The power, usually 0.80 or more.
  3. The hypothesized true correlation.

If you would like to calculate the approximate sample size for one of the above tests, please use this Excel worksheet. An alternative approach that addresses additional scenarios is our web-based sample size calculator available here.
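
If you use R, the built-in power functions cover the first two calculations, and the Fisher's Z approximation gives the third; a minimal sketch (the alpha, power, and effect sizes below are illustrative choices):

# 1. Two-sample t-test: SD 10, clinically significant difference 5
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)

# 2. Chi-square test of two proportions: control 0.30 vs. active 0.50
power.prop.test(p1 = 0.30, p2 = 0.50, sig.level = 0.05, power = 0.80)

# 3. Fisher's Z-test for a hypothesized correlation of 0.40
r <- 0.40
n <- ((qnorm(1 - 0.05/2) + qnorm(0.80)) / atanh(r))^2 + 3
ceiling(n)   # approximate required sample size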

Key to Statistical Tests

In this section, we provide guidance for choosing an appropriate statistical test when you have independent observations. Analyses involving independent observations usually have a single measurement of the outcome for each person or animal. If you are doing an analysis with multiple measurements of the outcome (e.g. a longitudinal clinical trial with repeated measurements), you will need to use alternative techniques. Please schedule a Catalyst consultation for further assistance if you have such a situation (link).

What type of outcome do you have?

Continuous (ex. Age, BMI, cholesterol)

What type of predictor do you have?

None

What do you want to report?

Hypothesis test for mean difference from specified value - One sample t-test
What is this?

The one sample t-test is used to test whether the mean of a population is equal to a specified value (often but not necessarily zero). It assumes that the data are drawn from a normal distribution, but it can also be used in large samples (N>50) even if the data are not drawn from a normal distribution.

Code examples
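
For example, a minimal R sketch (the BMI values and the hypothesized mean of 21.7 are illustrative; SAS and Stata versions appear in the Table for Statistical Tests):

bmi <- c(18.3, 22.1, 23.3, 23.7, 30.0)   # hypothetical sample
t.test(bmi, mu = 21.7)                    # one-sample t-test; the output also includes the 95% CI for the mean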

Confidence interval for mean
What is this?

The confidence interval for the mean consists of a lower and upper bound that, over repeat experiments, brackets the true mean with a specified probability (usually 95%). It is interpreted as the range of plausible values for the true mean.

Code examples

Additional questions to ask:

Is a t-test inappropriate for your data?

The t-test is inappropriate for small samples (N<50) if the data are not drawn from a normal distribution. Regardless of sample size, it is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, the one sample t-test is inappropriate. If the t-test is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Binary

What do you want to report?

Hypothesis test for group mean difference - Two sample t-test
What is this?

The two sample t-test is used to test whether the difference between the means of two populations is equal to a specified value (often but not necessarily zero). It assumes that each sample of data is drawn from a normal distribution, but it can also be used in large samples (N>50 per group) even if the data are not drawn from a normal distribution.

Code examples

Confidence interval for difference in means
What is this?

The confidence interval for the difference in the means consists of a lower and upper bound that, over repeat experiments, brackets the true mean difference with a specified probability (usually 95%). It is interpreted as the range of plausible values for the true mean difference.

Code examples

Additional questions to ask:

Is a t-test inappropriate for your data?

The t-test is inappropriate for small samples (N<50 per group) if the data are not drawn from a normal distribution in each group. Regardless of sample size, it is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, the t-test is inappropriate. If the t-test is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Do you need to control for confounders?

A confounder is a third variable that is associated with both the continuous outcome and the binary predictor. For example, if the continuous outcome is BMI and the binary predictor is gender then age could be a confounder if the females and males happen to differ substantially in age. In this case, any apparent difference in BMI between females and males could, in fact, be due to the difference in age, not the effect of gender. In this case, ignoring the role of age may lead to incorrectly concluding that BMI depends on gender. A multivariate regression model can correct this problem. For more information, you may schedule a Catalyst consultation for further assistance (link).

Categorical

What do you want to report?

Hypothesis test for mean differences - One way ANOVA

One way ANOVA, or one way analysis of variance, tests for any mean difference among two or more populations. It assumes that each sample is drawn from a normal distribution, and all populations have the same variance. It is also valid in large samples with non-normal distributions (N>50 per sample).

Code examples
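
For example, a minimal R sketch (the data frame and variable names below are hypothetical):

dat <- data.frame(
  bmi  = c(22.1, 24.3, 23.8, 26.0, 25.1, 27.4, 21.9, 23.0, 24.6),   # hypothetical data
  race = factor(rep(c("A", "B", "C"), each = 3))
)
fit <- aov(bmi ~ race, data = dat)
summary(fit)    # overall F test for any difference among the group means
TukeyHSD(fit)   # pairwise mean differences with adjusted confidence intervals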

Confidence intervals for means and mean differences

The confidence interval for the mean or mean difference consists of a lower and upper bound that, over repeat experiments, brackets the true mean or mean difference with a specified probability (usually 95%). It is interpreted as the range of plausible values for the true mean or mean difference.

Code examples

Additional questions to ask:

Is ANOVA inappropriate for your data?

One-way ANOVA is inappropriate for small samples (N<50) if the data are not normally distributed. In addition, one-way ANOVA is inappropriate when the populations have unequal variance. Regardless of sample size, it is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, then one-way ANOVA is inappropriate. If one-way ANOVA is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Do you need to control for confounders?

A confounder is a third variable that is associated with both the continuous outcome and the categorical predictor. For example, if the continuous outcome is BMI and the categorical predictor is race then age could be a confounder if the racial groups happen to differ substantially in age. In this case any apparent difference in BMI between racial groups could, in fact, be due to the difference in age, not the effect of race. In this case ignoring the role of age would lead to incorrectly concluding that BMI depends on race. A multivariate regression model can correct this problem. For more information, you may schedule a Catalyst consultation for further assistance (link).

Continuous

What do you want to report?

Pearson correlation

The Pearson correlation coefficient between two continuous variables is a measure of how closely the scatter plot of the two variables comes to falling on a straight line. It ranges from -1 to 1, with 1 indicating that as one variable increases the other variable also increases, -1 indicating that as one variable increases the other variable decreases and 0 indicating no linear association.

Code examples

Slope - Linear regression

Linear regression is the statistical method used to estimate the equation of a straight line relating a predictor and an outcome. The slope from a linear regression estimates the change in the outcome for every one unit change in the predictor. For example, if BMI is the outcome and age is the predictor measured in years, then the slope is the change in BMI per one year increase in age. If the outcome variable and predictor variable are significantly associated, then the estimated slope is significantly different than 0.

Code examples
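
For example, a minimal R sketch (the age and BMI values below are hypothetical):

age <- c(25, 31, 38, 44, 52, 60, 67)               # hypothetical data
bmi <- c(21.0, 22.4, 24.1, 23.8, 26.0, 25.5, 27.2)
fit <- lm(bmi ~ age)   # straight-line model: bmi = intercept + slope * age
summary(fit)           # slope estimate and test of slope = 0
confint(fit)           # confidence intervals for the intercept and slope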

Additional questions to ask:

Is linear regression inappropriate for your data?

Linear regression with a single continuous predictor assumes independence of the observations, a straight line relationship between the predictor and the outcome, and the deviations from the line are normally distributed with equal spread across the line. If any of these conditions are not met, linear regression may not be appropriate for your analysis.

If the outcome variable is highly skewed or not a continuous variable (i.e. ordinal or dichotomous), linear regression may not be appropriate for your analysis. If the relationship between the outcome and predictor is not a straight line, linear regression may also not be appropriate. Finally, linear regression is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, then linear regression is inappropriate. If linear regression is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Do you need to control for confounders?

A confounder is a third variable that is associated with both the continuous outcome and the continuous predictor. For example, if the continuous outcome is BMI and the continuous predictor is age then cholesterol could be a confounder if it is correlated with both BMI and age. In this case, any apparent association between BMI and age could, in fact, be due to the association between both and cholesterol. In this case, ignoring the role of cholesterol would lead to incorrectly concluding that BMI depends on age. A multivariate regression model can correct this problem. For more information, you may schedule a Catalyst consultation for further assistance (link).

Binary (ex. Sex, dead/alive)

What type of predictor do you have?

None

What do you want to report?

Hypothesis test that proportion differs from specified value - One sample test for a binomial proportion

The one sample test for a binomial proportion tests whether an observed sample proportion is significantly different from a pre-specified hypothesized proportion, p. If N*p*(1-p) >=5 then an approximate large sample method may be used, otherwise an exact binomial test is always valid.

Code examples
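
For example, a minimal R sketch testing 12 successes out of 40 observations (hypothetical counts) against a hypothesized proportion of 0.50:

binom.test(12, 40, p = 0.50)   # exact binomial test with an exact CI for the proportion
prop.test(12, 40, p = 0.50)    # large-sample approximation; here N*p*(1-p) = 10, so it is acceptable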

Confidence interval for proportion

The confidence interval for a sample proportion consists of a lower and upper bound that, over repeat experiments, brackets the true population proportion with a specified probability (usually 95%). It is interpreted as the range of plausible values for the true population proportion.

Code examples

Additional questions to ask:

Is a one sample test for a binomial proportion inappropriate for your data?

Regardless of sample size, it is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, the one sample test for a binomial proportion is inappropriate. If the one sample test for a binomial proportion is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Binary

What do you want to report?

Hypothesis test for unequal proportions - Chi square test/Fisher's exact test

The chi-square test is the most commonly used test for whether the proportions of patients with the binary outcome in the two populations are equal. The most common way to view data of this type is a 2x2 table. If the expected number of observations in any cell of the 2x2 table is less than 5, then the chi-square test is inappropriate and Fisher's exact test should be used.

Code examples

Confidence interval for odds ratios

The odds is defined to be the probability of the event divided by 1 minus the probability of the event, and the odds ratio is one way to compare the binary outcome in two groups. The risk difference and relative risk are two other measures that can be reported in some cases. The confidence interval for the odds ratio consists of a lower and upper bound that, over repeat experiments, brackets the true population odds ratio with a specified probability (usually 95%). It is interpreted as the range of plausible values for the true population odds ratio. Logistic regression can also be used to calculate the confidence interval for the odds ratio.

Code examples
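
For example, a minimal R sketch that uses logistic regression to obtain the odds ratio and its confidence interval (the data frame and variable names below are hypothetical, with diabetes and gender coded 1/0):

dat <- data.frame(
  diabetes = c(1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0),   # hypothetical data
  gender   = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1)
)
fit <- glm(diabetes ~ gender, family = binomial, data = dat)
exp(coef(fit)["gender"])   # odds ratio for gender
exp(confint(fit))          # confidence intervals on the odds ratio scale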

Additional questions to ask:

Do you need to control for confounders?

A confounder is a third variable that is associated with both the binary outcome and the binary predictor. For example, if the binary outcome is diabetes and the binary predictor is gender then BMI could be a confounder if the females and males happen to differ substantially in BMI. In this case any apparent difference in the prevalence of diabetes between females and males could, in fact, be due to the difference in BMI, not the difference in gender. Therefore ignoring the role of BMI would lead to incorrectly concluding that diabetes depends on gender. A multivariate logistic regression model can correct this problem. For more information, you may schedule a Catalyst consultation for further assistance (link).

Is a Chi-square test or logistic regression inappropriate for your data?

Suppose the binary outcome is diabetes and the binary predictor is gender. If the expected number of diabetics or non-diabetics in either gender group is less than 5, then the chi-square test and logistic regression are inappropriate. Regardless of sample size, they are inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, then a chi-square test and logistic regression are inappropriate. If these techniques are inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Categorical

What do you want to report?

Hypothesis test for unequal proportions - Chi-square test

The chi-square test is the most commonly used test for whether the proportions of patients with the binary outcome in more than two populations are equal. The most common way to view data of this type is a 2xc table where c is the number of groups. If the expected number of observations in any cell of the 2xc table is less than 5, then the chi-square test is inappropriate and Fisher's exact test should be used.

Code examples

Confidence interval for odds ratios

The odds is defined to be the probability of the event divided by 1 minus the probability of the event, and the odds ratio is one way to compare the binary outcome in two groups. The risk difference and relative risk are two other measures that can be reported in some cases. The confidence interval for the odds ratio of the outcome between any two of the sub-groups defined by the categorical predictor consists of a lower and upper bound that, over repeat experiments, brackets the true population odds ratio with a specified probability (usually 95%). It is interpreted as the range of plausible values for the true population odds ratio. Logistic regression can be used to calculate the confidence interval for the odds ratio.

Code examples

Additional questions to ask:

Do you need to control for confounders?

A confounder is a third variable that is associated with both the binary outcome and the categorical predictor. For example, if the binary outcome is diabetes and the categorical predictor is race then BMI could be a confounder if the races differ substantially in BMI. In this case any apparent difference in the prevalence of diabetes across racial categories could, in fact, be due to the difference in BMI, not the effect of race. Therefore ignoring the role of BMI would lead to incorrectly concluding that diabetes depends on race. A multivariate logistic regression model can correct this problem. For more information, you may schedule a Catalyst consultation for further assistance (link).

Is a Chi-square test or logistic regression inappropriate for your data?

Suppose the binary outcome is diabetes and the categorical predictor is race. If the expected number of diabetics or non-diabetics in any of the racial categories is less than 5, then the chi-square test and logistic regression are inappropriate. Regardless of sample size, they are inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, then a chi-square test and logistic regression are inappropriate. If these techniques are inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Continuous

What do you want to report?

Hypothesis test for increasing odds ratio - logistic regression

The odds is defined to be the probability of the event divided by 1 minus the probability of the event. It may be of interest to test whether the odds increases (or decreases) with increasing values of a continuous predictor such as age. Logistic regression can be used to estimate the odds ratio for subjects who differ by one unit (or any number of units) on the continuous predictor and test whether the odds ratio differs significantly from one (since an odds ratio of one indicates equal odds and equal probability of the event).

Code examples

Additional questions to ask:

Do you need to control for confounders?

A confounder is a third variable that is associated with both the binary outcome and the continuous predictor. For example, if the binary outcome is diabetes and the continuous predictor is age then BMI could be a confounder if BMI increases with age. In this case any apparent difference in the prevalence of diabetes with increasing age could, in fact, be due to the increase in BMI, not the increase in age. Therefore ignoring the role of BMI would lead to incorrectly concluding that diabetes depends on age. A multivariate logistic regression model can correct this problem. For more information, you may schedule a Catalyst consultation for further assistance (link).

Is a logistic regression inappropriate for your data?

If the binary outcome is diabetes and the continuous predictor is age then logistic regression is inappropriate if there is little or no overlap in age between the diabetics and the non-diabetics. Regardless of sample size, it is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, then logistic regression is inappropriate. If this technique is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Time to event (ex. Time to recurrence of cancer)

What type of predictor do you have?

None

What do you want to report?

Plot of the survival distribution - Kaplan-Meier curve

The Kaplan-Meier curve is a non-parametric estimate of the survival distribution which plots elapsed time on the horizontal axis and the estimated proportion of surviving subjects (that is those subjects who have not yet had the event of interest) on the vertical axis.

Code examples
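
For example, a minimal R sketch with the survival package (the recurr_day and recurrence variables below are hypothetical, coded to match the SAS/Stata examples in the Table for Statistical Tests):

library(survival)
recurr_day <- c(120, 250, 300, 410, 520, 610, 700)   # hypothetical follow-up times (days)
recurrence <- c(1, 1, 0, 1, 0, 1, 0)                  # 1 = event observed, 0 = censored
fit <- survfit(Surv(recurr_day, recurrence) ~ 1)      # Kaplan-Meier estimate
summary(fit)                                          # estimated survival probabilities over time
plot(fit, xlab = "Days", ylab = "Proportion event-free")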

Additional questions to ask:

Is a Kaplan-Meier curve inappropriate for your data?

Regardless of sample size, it is inappropriate if the observations are not independent. For example, repeated measurements from the same subject are not independent. Therefore, if subjects contribute more than one observation, the traditional Kaplan-Meier curve is inappropriate. If the traditional Kaplan-Meier curve is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Binary

What do you want to report?

Plot of the two survival distributions - Kaplan-Meier curve

The Kaplan-Meier curve is a non-parametric estimate of the survival distribution that plots elapsed time on the horizontal axis and the estimated proportion of surviving subjects (that is those subjects who have not yet had the event of interest) on the vertical axis for each group separately. By plotting one curve for each group defined by the binary predictor, it is possible to visually assess any apparent difference in survival.

Code examples

Hypothesis test for group difference - log-rank test

The log-rank test is a non-parametric test to assess whether the survival distributions of the two groups defined by the binary predictor are the same.

Code examples
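
For example, a minimal R sketch with the survival package (the variables below are hypothetical):

library(survival)
recurr_day <- c(120, 250, 300, 410, 520, 610, 700, 150, 330, 480)   # hypothetical data
recurrence <- c(1, 1, 0, 1, 0, 1, 0, 1, 1, 0)
gender     <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
survdiff(Surv(recurr_day, recurrence) ~ gender)   # log-rank test for a difference between the two groups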

Additional questions to ask:

Do you need to control for confounders?

A confounder is a third variable that is associated with both the time to event and the binary predictor. For example, if the outcome is time to cancer recurrence and the binary predictor is gender then diabetes could be a confounder if the proportion of diabetics differs in males and females. In this case any apparent difference in the time to cancer recurrence between males and females could be due to diabetes, not gender. This can be corrected by a log-rank test stratified by diabetes or a multivariate Cox regression model. For more information, you may schedule a Catalyst consultation for further assistance (link).

Is a log-rank test inappropriate for your data?

Censoring in survival studies occurs when patients are observed for a specific amount of time, but they leave the study prior to having the event of interest. The log-rank test requires that the censoring time be independent of the survival time. If the censoring time is not independent, alternative approaches are required. Regardless of sample size, it is inappropriate if the observations are not independent. For example, multiple times to event measured on the same subject are not independent. Therefore, if subjects contribute more than one time to event, then the log-rank test is inappropriate. If this technique is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Categorical

What do you want to report?

Plot of the multiple survival distributions - Kaplan-Meier curve

The Kaplan-Meier curve is a non-parametric estimate of the survival distribution which plots elapsed time on the horizontal axis and the estimated proportion of surviving subjects (that is those subjects who have not yet had the event of interest) on the vertical axis for each group separately. By plotting one curve for each sub-group defined by the categorical predictor, it is possible to visually assess any apparent difference in survival.

Code examples

Hypothesis test for group difference - log-rank test

The log-rank test is a non-parametric test to assess whether the survival distributions of the groups defined by the categorical predictor are the same.

Code examples

Additional questions to ask:

Do you need to control for confounders?

A confounder is a third variable that is associated with both the time to event and the categorical predictor. For example, if the outcome is time to cancer recurrence and the categorical predictor is race then diabetes could be a confounder if the proportion of diabetics differs in the race groups. In this case any apparent difference in the time to cancer recurrence between race groups could be due to diabetes, not race. This can be corrected by a log-rank test stratified by diabetes or a multivariate Cox regression model. For more information, you may schedule a Catalyst consultation for further assistance (link).

Is a log-rank test inappropriate for your data?

Censoring in survival studies occurs when patients are observed for a specific amount of time, but they leave the study prior to having the event of interest. The log-rank test requires that the censoring time be independent of the survival time. If the censoring time is not independent, alternative approaches are required. Regardless of sample size, it is inappropriate if the observations are not independent. For example, multiple times to event measured on the same subject are not independent. Therefore, if subjects contribute more than one time to event, then the log-rank test is inappropriate. If this technique is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Continuous

What do you want to report?

Hypothesis test for increasing hazard - Cox regression

The hazard is defined to be the probability of having the outcome event in a small time interval. It may be of interest to test whether the hazard increases (or decreases) with increasing values of a continuous predictor such as age. Cox regression can be used to estimate the hazard ratio for subjects who differ by one unit (or any number of units) on the continuous predictor and test whether the hazard ratio differs significantly from one (since a hazard ratio of one indicates equal hazards).

Code examples
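
For example, a minimal R sketch with the survival package (the variables below are hypothetical):

library(survival)
recurr_day <- c(120, 250, 300, 410, 520, 610, 700, 150)   # hypothetical data
recurrence <- c(1, 1, 0, 1, 0, 1, 0, 1)
age        <- c(62, 55, 48, 70, 45, 66, 52, 59)
fit <- coxph(Surv(recurr_day, recurrence) ~ age)
summary(fit)   # hazard ratio per one-year increase in age, with a test of hazard ratio = 1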

Additional questions to ask:

Do you need to control for confounders?

A confounder is a third variable that is associated with both the time to event and the continuous predictor. For example, if the outcome is time to cancer recurrence and the continuous predictor is age then diabetes could be a confounder if the proportion of diabetics changes with age. In this case any apparent difference in the time to cancer recurrence due to age could be due to diabetes, not age. This can be corrected by a multivariate Cox regression model. For more information, you may schedule a Catalyst consultation for further assistance (link).

Is a Cox regression inappropriate for your data?

Censoring in survival studies occurs when patients are observed for a specific amount of time, but they leave the study prior to having the event of interest. The Cox regression model requires that the censoring time be independent of the survival time. If the censoring time is not independent, alternative approaches are required. Regardless of sample size, it is inappropriate if the observations are not independent. For example, multiple times to event measured on the same subject are not independent. Therefore, if subjects contribute more than one time to event, then the traditional Cox regression model is inappropriate. If this technique is inappropriate for your data, you may schedule a Catalyst consultation for further assistance (link).

Table for Statistical Tests

Outcome variable Explanatory variable Topic
Continuous Dichotomous T-test
Continuous Categorical One-way ANOVA
Continuous Continuous Pearson Correlation / Linear regression
Dichotomous Dichotomous Chi-square test
Dichotomous Continuous Logistic regression
Ordinal Dichotomous Wilcoxon test
Ordinal Continuous Rank correlation coefficient
Time to event Dichotomous Log-rank test
Time to event Continuous Cox proportional hazards regression

One Sample T-test

SAS:

proc ttest data=t01 h0=21.7;
var bmi;
run;

Worked Example

Stata:

ttest bmi == 21.7

Worked Example

Two Sample T-test

SAS:

proc ttest data=t01;
class gender;
var bmi;
run;
Worked Example

Stata:

ttest bmi, by(gender)
Worked Example

One-way ANOVA

SAS:

proc mixed data=t01;
class race;
model bmi=race /s;
lsmeans race /cl diffs adjust=bon;
run;
Worked Example

Stata:

anova bmi race
margins race, post
Worked Example

Pearson Correlation

SAS:

proc corr data=t01 nosimple;
var age;
with bmi;
run;
Worked Example

Stata:

pwcorr age bmi , sig

Worked Example

Linear Regression

SAS:

proc mixed data=t01;
model bmi=age /s;
run;

Worked Example

Stata:

regress bmi age

Worked Example

One Sample Binomial

SAS:

proc freq data=t01;
table diabetes/ binomial(level='1' p=0.50);
exact binomial;
run;

Worked Example

Stata:

bitest diabetes == 0.50
ci diabetes , binomial

Worked Example

Two Sample Binomial

SAS:

proc freq data=t01;
table diabetes*gender/ chisq;
run;

Worked Example

Stata:

tab diabetes gender, chi2 exact

Worked Example

Two Sample Logistic

SAS:

proc logistic data=t01;
class gender /descending;
model diabetes(event='1')=gender;
run;

Worked Example

Stata:

logistic diabetes gender

Worked Example

Many Sample Binomial

SAS:

proc freq data=t01;
table diabetes*race/ chisq;
run;

Worked Example

Stata:

tab diabetes race, chi2

Worked Example

Many Sample Logistic

SAS:

proc logistic data=t01;
class race /descending;
model diabetes(event='1')=race ;
run;

Worked Example

Stata:

logistic diabetes i.race

Worked Example

Continuous Logistic

SAS:

proc logistic data=t01;
model diabetes(event='1')=age ;
run;

Worked Example

Stata:

logistic diabetes age

Worked Example

One Sample Kaplan Meier

SAS:

proc lifetest data=t01 plots=(s);
time recurr_day*recurrence(0);
run;

Worked Example

Stata:

stset recurr_day, failure(recurrence==1)
sts graph

Worked Example

Two Sample Kaplan Meier

SAS:

proc lifetest data=t01 plots=(s);
time recurr_day*recurrence(0);
strata gender / test=logrank;
run;

Worked Example

Stata:

stset recurr_day, failure(recurrence==1)
sts graph, by(gender)

Worked Example

Two Sample Logrank Test

SAS:

proc lifetest data=t01 plots=(s);
time recurr_day*recurrence(0);
strata gender / test=logrank;
run;

Worked Example

Stata:

stset recurr_day, failure(recurrence==1)
sts graph, by(gender)
sts test gender

Worked Example

Many Sample Kaplan Meier

SAS:

proc lifetest data=t01 plots=(s);
time recurr_day*recurrence(0);
strata race / test=logrank;
run;

Worked Example

Stata:

stset recurr_day, failure(recurrence==1)
sts graph, by(race)

Worked Example

Many Sample Logrank Test

SAS:

proc lifetest data=t01 plots=(s);
time recurr_day*recurrence(0);
strata race / test=logrank;
run;

Worked Example

Stata:

stset recurr_day, failure(recurrence==1)
sts graph, by(race)
sts test race

Worked Example

Cox Regression

SAS:

proc phreg data=t01;
model recurr_day*recurrence(0)=age;
run;

Worked Example

Stata:

stset recurr_day, failure(recurrence==1)
stcox age

Worked Example

Wilcoxon Test

SAS:

proc npar1way data=t01 wilcoxon;
class gender;
var bmi;
run;

Worked Example

Stata:

ranksum bmi, by(gender)

Worked Example

Rank Correlation Coefficient

SAS:

proc corr data=t01 nosimple spearman;
var age;
with bmi;
run;

Worked Example

Stata:

spearman age bmi

Worked Example