Introduction
Many investigators use Excel spreadsheets or Access as a database for clinical data. In order to analyze data in a statistical package, it is usually necessary first convert your Access database or Excel spreadsheets into tables that can be read into statistical packages. By following the guidelines below, investigators can structure their spreadsheets so that the conversion is simple. Click here for a video demonstration of how to construct your Excel sheet.
Datasets for statistical analysis
The most common format of a dataset for statistical analysis is a rectangular grid of rows and columns. Each column represents a data element or variable (corresponding to a field in Access) such as age, gender, blood pressure, etc. while each row is the observed values for one patient at one particular point in time. Some sample data from a clinical trial on the effect of caffeine on blood pressure might look like Table 1 and Table 2.
The field names used by SAS and other packages must be no more than 32 characters, must contain only letters, digits, underscore and must not start with a digit. In SAS, upper case and lower case letters are equivalent so that two columns named AGE and age would cause confusion. You can use mixed uppercase and lowercase to make the field names more readable but make sure that the field name is unique (within the table) when converted to all uppercase.
Most statistical packages support at least two data types, numeric and character (text). If a variable is numeric don't enter any characters in that column (except for the field name in row 1). To enter a missing numeric value just leave the cell blank.
Text or Character fields
Do NOT use quotes, the number sign (#) or commas in your text fields or as codes because these have special meanings in statistical packages. For example, if you want an abbreviation for inches use in or spell it out rather than using the quotes. For names such as O'Brien, use OBrien.
Patient identifiers and visit identifiers
Use the same field name for a subject id (e.g. ID, PatID, PatientID) in all tables where the id's should match. If you have codes for visits, e.g. baseline, week1, week2, discharge, use the same codes in all tables. Use the same field name for the visit field (e.g. Visit) in all tables.
IMPORTANT: The information that you include on a spreadsheet that will be used by multiple researchers must follow HIPAA guidelines for patient confidentiality.
Coding Dichotomous (yes/no) variables
When possible enter dichotomous variables as numeric with 1=yes and 0=no. This makes them easier to include in analyses such as logistic regression without conversion.
Special Consideration for Excel
Do not use formulas for cell values. The sheet must be simple rows by columns (fields). If you are going to use Excel to compute things, e.g. means or percentages, then make a copy of the file before you do it. Also, make a code book and place it on a separate sheet. This way you have built in documentation for your data.
Constructing the Code Book
An important additional table to include in your Excel file or Access database is a code book. The code book includes a row for each variable in each table, and four features of each variable should be described. The first feature (TABLE) is the table in which the variable is located. If you have only one table, this feature is not needed. The second feature is the name of the variable in the table (FIELD). The third feature is a descriptive label of the variable (LABEL). Finally, the fourth feature provides a code for any categorical or binary variables (CODE).
Table 1: Blood pressure study: BASELINE Table
PID |
Age |
Female |
TX |
101 |
23 |
1 |
1 |
102 |
14 |
0 |
0 |
101 |
25 |
0 |
1 |
Table 2: Blood pressure study: VISIT Table
PID |
Visit |
Blood pressure |
101 |
0 |
121 |
101 |
1 |
123 |
101 |
2 |
124 |
102 |
0 |
142 |
102 |
1 |
147 |
102 |
2 |
145 |
103 |
0 |
161 |
103 |
1 |
165 |
103 |
2 |
168 |
Table 3: Blood pressure study: CODE BOOK Table
Table |
Field |
Label |
Code |
Baseline |
PID |
patient id |
|
Baseline |
Age |
age (years) |
|
Baseline |
Female |
female gender |
1=female, 0=male |
Baseline |
TX |
treatment |
1=regular, 0=decaf |
Visit |
PID |
patient id |
|
Visit |
Visit |
visit |
0=Baseline,1=6 month, 2=12 month |
Visit |
Blood pressure |
Systolic blood pressure (mmHg) |
|