Translated R, SAS, STATA & SPSS
Understanding Syntax
The Passion Driven Statistics curriculum is intended to help students perform basic data management and statistical tests across 4 major statistical software platforms (R, SAS, Stata and SPSS). This web page provides a library of basic commands that the user can copy and paste into R, SAS, Stata or SPSS to perform a variety of data management tasks and basic statistical tests. Our goal is to help students use statistical computing as a building block in scientific reasoning and creativity. Rather than producing students who can think about statistics only from a software-specific perspective, these resources are meant to help students move flexibly and confidently between statistical software environments.
Note the following convention for software-specific syntax. Bold text indicates syntax that does not need to change (some of the bolded text could be changed, but the bolding indicates that it does not have to be). Unbolded text indicates syntax that must be adapted to your own project (e.g. the actual name of your data set, your unique variable names, etc.).
Contents needed for every program
- Calling in a data set
SPSS GET FILE='P:\QAC\qac201\Studies\study name\filename.sav'.
Stata use "P:\QAC\qac201\Studies\study name\filename"
SAS LIBNAME in "P:\QAC\QAC201\study name";
DATA new; set in.filename;
R > newdata <- read.table(file = "filename.txt", sep = "\t", header=T)
- Sorting the data
SPSS SORT CASES BY UNIQUE_ID.
Stata sort unique_id
SAS proc sort; by unique_id;
R > title_of_data_set <- title_of_data_set[order(title_of_data_set$unique_id, decreasing=F), ]
Abbreviating a data set to a smaller number of variables (i.e. columns)
- Selecting variables you want to examine
Because many data sets are very large in terms of both observations and variables, any analyses that you conduct could take several minutes. Subsetting or abbreviating the data based on the variables that you will be examining can shorten the analytic time required to run your program. While this will not make a huge difference if you are running a program only a few times, the time you will save can be substantial if you plan to work extensively with the data.
SPSS /KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8. (Must follow the SAVE OUTFILE='dataname' command)
Stata keep var1 var2 var3 var4 var5 var6 var7 var8
SAS KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8;
R > var.keep <- c("VAR1", "VAR2", "VAR3", "VAR4", "VAR5", "VAR6", "VAR7", "VAR8")
> title_of_new_data_set <- title_of_data_set[, var.keep]
- Outputting your abbreviated data set
SPSS SAVE OUTFILE='P:\QAC\qac201\Studies\study name\title_of_new_data_set'
Stata save filename
SAS Data libname.title_of_new_data_set; set dataname; by unique_id;
R > write.table(title_of_data_set, file="filename.txt", sep="\t", row.names=F)
> title_of_data_set <- title_of_data_set[order(title_of_data_set$unique_id, decreasing=F), ]
Data management tasks
- Basic Operations:
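The operators for each software are listed in this order: equal to, greater than or equal to, less than or equal to, greater than, less than, not equal to.
SPSS EQ or =, >= or GE, <= or LE, > or GT, < or LT, NE
Stata ==, >=, <=, >, <, !=
SAS = or EQ, >= or GE, <= or LE, > or GT, < or LT, NE
R ==, >=, <=, >, <, !=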
- Identify missing data
Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value for a particular variable (VAR1), you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.
SPSS RECODE var1 (9=SYSMIS).
Stata replace var1=. if var1==9
SAS if VAR1=9 then VAR1=.;
R > title_of_data_set$VAR1[title_of_data_set$VAR1==9] <- NA
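To illustrate why this recoding matters, here is a minimal R sketch. The small data frame and its values are hypothetical, made up only for this illustration; only the recoding line follows the syntax above.

```r
# Hypothetical toy data; in practice this would be your own data set.
title_of_data_set <- data.frame(
  unique_id = 1:6,
  VAR1 = c(1, 2, 9, 3, 9, 2)   # assumption: 9 is the code for "missing"
)

# If 9 is treated as a real value, it inflates the mean.
mean_with_9 <- mean(title_of_data_set$VAR1)

# Recode 9 to NA so that R recognizes it as missing.
title_of_data_set$VAR1[title_of_data_set$VAR1 == 9] <- NA

# na.rm = TRUE drops the missing values before averaging.
mean_without_9 <- mean(title_of_data_set$VAR1, na.rm = TRUE)

print(c(mean_with_9 = mean_with_9, mean_without_9 = mean_without_9))
```

The two means differ because the two 9s are no longer counted as responses once they are recoded.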
- Recode responses to “no” based on skip patterns
There are a number of skip outs in some data sets. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g. quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g. have you ever smoked marijuana daily for a month or more?), those individuals that never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer to this question regarding daily use is “no”. This would need to be explicitly recoded. Note that we commonly code a “no” as 0 and a “yes” as 1.
SPSS RECODE var1 (SYSMIS=0).
Stata replace var1=0 if var1==.
SAS if VAR1=. then VAR1=0;
R > title_of_data_set$VAR1[is.na(title_of_data_set$VAR1)] <- 0
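The following R sketch shows one possible variation on this recode. It assumes two hypothetical variables, ever_used (the gateway question) and daily_use (the follow-up question), with "no" coded as 0 and "yes" as 1. Unlike the commands above, it recodes only those missing values that arise from the skip pattern, so that other missing answers stay missing.

```r
# Hypothetical toy data: both variable names are assumptions for illustration.
title_of_data_set <- data.frame(
  ever_used = c(1, 0, 1, 0, 1),     # 1 = yes, 0 = no
  daily_use = c(1, NA, 0, NA, NA)   # skipped (NA) when ever_used is 0
)

# Flag follow-up answers that are missing because the respondent never used.
skipped <- is.na(title_of_data_set$daily_use) & title_of_data_set$ever_used == 0

# Recode only those skipped answers to "no" (0).
title_of_data_set$daily_use[skipped] <- 0

print(title_of_data_set)
```

The last respondent reported use but has no answer on the follow-up question, so that value is left as NA rather than assumed to be "no".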
- Recoding string variables into numeric
In most software packages, it is important when preparing to run statistical analyses that all variables have response categories that are numeric rather than "string" or "character" (i.e. response categories are numbers rather than strings of characters and/or symbols). While it is not always needed, it is often recommended that all variables with string responses be recoded into numeric values. These numeric values are known as dummy codes in that they carry no direct numeric meaning.
SPSS RECODE TREE ('Maple'=1) ('Oak'=2) INTO TREE_N.
Stata generate TREE_N=.
replace TREE_N=1 if TREE=="Maple"
replace TREE_N=2 if TREE=="Oak"
OR by using the encode command
encode TREE, gen(TREE_N)
SAS IF TREE='Maple' then TREE_N=1;
else if TREE='Oak' then TREE_N=2;
R (Not necessary in R)
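Although the recode is not necessary in R, numeric codes can still be produced if they are wanted, for example to match output from the other packages. The sketch below uses a hypothetical character vector and base R's factor(). The level order is set explicitly so that Maple=1 and Oak=2, as in the commands above.

```r
# Hypothetical string variable.
TREE <- c("Maple", "Oak", "Oak", "Maple")

# Store the strings as a factor with an explicit level order.
TREE_F <- factor(TREE, levels = c("Maple", "Oak"))

# The integer codes follow the level order: Maple = 1, Oak = 2.
TREE_N <- as.integer(TREE_F)

print(data.frame(TREE, TREE_N))
```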
- Collapsing response categories
If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternately, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. For example, if you have the following categories for geographic region, you may want to collapse some of these categories:
Region: New England=1, Middle Atlantic=2, East North Central=3, West North Central=4, South Atlantic=5, East South Central=6, West South Central=7, Mountain=8, Pacific=9.
New_Region: East=1, West=2.
SPSS COMPUTE new_region=2.
IF (region=1|region=2|region=3|region=5|region=6) new_region=1.
Stata generate new_region=2
replace new_region=1 if region==1|region==2|region==3|region==5|region==6
OR by using the recode command
recode region (1/3 5 6=1) (4 7/9=2), gen(new_region)
SAS if region=1 or region=2 or region=3 or region=5 or region=6 then new_region=1;
else if region=4 or region=7 or region=8 or region=9 then new_region=2;
R > new_region <- rep(NA, nrow(title_of_data_set))
> new_region[title_of_data_set$region == 1 | title_of_data_set$region == 2 | title_of_data_set$region == 3 | title_of_data_set$region == 5 | title_of_data_set$region == 6] <- 1
> new_region[title_of_data_set$region == 4 | title_of_data_set$region == 7 | title_of_data_set$region == 8 | title_of_data_set$region == 9] <- 2
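As a more compact alternative in R, the same collapse can be sketched with the %in% operator. The toy data frame is hypothetical, and the helper vectors east and west are names introduced here only for illustration.

```r
# Hypothetical toy data with one observation per region code.
title_of_data_set <- data.frame(region = 1:9)

# Region codes that make up each collapsed category.
east <- c(1, 2, 3, 5, 6)
west <- c(4, 7, 8, 9)

# Start with NA so that any unlisted or missing code stays missing.
new_region <- rep(NA, nrow(title_of_data_set))
new_region[title_of_data_set$region %in% east] <- 1
new_region[title_of_data_set$region %in% west] <- 2
title_of_data_set$new_region <- new_region

# Cross-tabulate old against new codes to check the recode.
print(table(title_of_data_set$region, title_of_data_set$new_region))
```

Cross-tabulating the original variable against the collapsed one is a quick way to confirm that every category landed where intended.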
- Aggregating variables
In many cases, you will want to combine multiple variables into one. For example, a data set may include a variable for each of several different individual anxiety disorders. You may however be interested in anxiety more generally. In this case you could create a general anxiety variable in which those individuals who received a diagnosis of social phobia, generalized anxiety disorder, specific phobia, panic disorder, agoraphobia, or obsessive compulsive disorder would be coded “yes” and those who were free from all of these diagnoses would be coded “no”.
SPSS IF (socphob=1|gad=1|specphob=1|panic=1|agora=1|ocd=1) anxiety=1.
RECODE anxiety (SYSMIS=0).
Stata gen anxiety=1 if socphob==1|gad==1|specphob==1|panic==1|agora==1|ocd==1
replace anxiety=0 if anxiety==.
SAS if socphob=1 or gad=1 or specphob=1 or panic=1 or agora=1 or ocd=1 then anxiety=1; else anxiety=0;
R > anxiety <- rep(0, nrow(title_of_data_set))
> anxiety[title_of_data_set$socphob == 1 | title_of_data_set$gad == 1 | title_of_data_set$specphob == 1 | title_of_data_set$panic == 1 | title_of_data_set$agora == 1 | title_of_data_set$ocd == 1] <- 1
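The same aggregation can be sketched in R with rowSums(), which avoids writing out each condition. The toy data below are hypothetical and contain no missing values. With real data, missing diagnoses would need to be handled first.

```r
# Hypothetical toy data: 1 = diagnosis present, 0 = absent.
title_of_data_set <- data.frame(
  socphob  = c(0, 1, 0, 0),
  gad      = c(0, 0, 0, 1),
  specphob = c(0, 0, 0, 0),
  panic    = c(0, 0, 0, 0),
  agora    = c(0, 1, 0, 0),
  ocd      = c(0, 0, 1, 0)
)

disorders <- c("socphob", "gad", "specphob", "panic", "agora", "ocd")

# Count the diagnoses per person, then code 1 if there is at least one.
n_diagnoses <- rowSums(title_of_data_set[, disorders] == 1)
anxiety <- as.numeric(n_diagnoses > 0)
title_of_data_set$anxiety <- anxiety

print(title_of_data_set)
```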
- Creating a continuous variable
If you are working with a number of items that represent a single construct, it may be useful to create a composite variable or score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent a single composite score ranging from 0 to 4 (i.e. 4 corresponding to the total number of symptoms measured and summed).
SPSS COMPUTE nd_sum=sum(nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4).
Stata egen nd_sum=rsum(nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4)
SAS nd_sum=sum (of nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4);
R > nd_sum <- title_of_data_set$nd_symptom1 + title_of_data_set$nd_symptom2 + title_of_data_set$nd_symptom3 + title_of_data_set$nd_symptom4
> title_of_data_set$nd_sum <- nd_sum
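As a worked illustration in R, the sketch below builds the composite score for three hypothetical respondents whose symptoms are already coded yes=1 and no=0. It uses rowSums() in place of adding the columns one by one. The items vector is a name introduced here for illustration.

```r
# Hypothetical toy data: each symptom is coded 1 = yes, 0 = no.
title_of_data_set <- data.frame(
  nd_symptom1 = c(1, 0, 1),
  nd_symptom2 = c(1, 0, 0),
  nd_symptom3 = c(1, 0, 1),
  nd_symptom4 = c(1, 0, 0)
)

# Names of the items that make up the composite.
items <- paste0("nd_symptom", 1:4)

# Sum across the four items for each respondent.
title_of_data_set$nd_sum <- rowSums(title_of_data_set[, items])

print(title_of_data_set)
```

The resulting scores fall between 0 and 4, the number of symptoms measured.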
- Renaming variables
Given the often cryptic names that variables are given in some data sets, it can be useful to rename them to something you find meaningful (i.e. easier to remember or type).
SPSS RENAME VARIABLES (var1=newvarname).
Stata rename var1 newvarname
SAS RENAME var1=newvarname;
R > names(title_of_data_set)[names(title_of_data_set)=="VAR1"] <- "newvarname"
- Subsetting data to a particular set of observations (i.e. rows)
It can also be necessary to subset the data so that you are including only those observations (i.e. rows of data) that assist in answering your particular research question. For example, if you are interested in identifying demographic predictors of depression, but only among Type II diabetes patients, you would need to subset the data to observations endorsing Type II Diabetes (i.e. diabetes2=1 or "yes").
SPSS /SELECT=diabetes2 EQ 1 (must be added as a command option)
Stata if diabetes2==1 (put this at the end of the command)
SAS if diabetes2=1; (put in the data step before sorting the data)
R > title_of_subsetted_data <- title_of_data_set[title_of_data_set$diabetes2==1, ]
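In R, rows with a missing value on the subsetting variable need some care. The sketch below uses hypothetical toy data and wraps the condition in which(), so that observations with a missing diabetes2 value are dropped rather than returned as empty rows.

```r
# Hypothetical toy data: one respondent has a missing diabetes2 value.
title_of_data_set <- data.frame(
  unique_id = 1:5,
  diabetes2 = c(1, 0, 1, NA, 0)
)

# which() returns only the row numbers where the condition is TRUE,
# so the row with NA is excluded from the subset.
keep_rows <- which(title_of_data_set$diabetes2 == 1)
title_of_subsetted_data <- title_of_data_set[keep_rows, ]

print(title_of_subsetted_data)
```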
Descriptive statistics (one variable at a time)
Descriptive statistics are used to describe the basic features of individual variables. Also known as univariate analysis, descriptive statistics summarize one variable at a time, across the observations in your data set.
- Displaying frequency tables
SPSS FREQUENCIES VARIABLES=var1 var2 var3
/ORDER=ANALYSIS.
Stata tab1 var1 var2 var3
SAS PROC FREQ; tables var1 var2 var3;
R > library(descr)
> freq(as.ordered(title_of_data_set$VAR1))
> freq(as.ordered(title_of_data_set$VAR2))
> freq(as.ordered(title_of_data_set$VAR3))
- Central tendency
SPSS DESCRIPTIVES VARIABLES=var1 var2 var3
/STATISTICS=MEAN STDDEV.
Stata summarize var1 var2 var3
SAS proc means; var var1 var2 var3;
R > library(descr)
> freq(as.ordered(title_of_data_set$var1))
> freq(as.ordered(title_of_data_set$var2))
> freq(as.ordered(title_of_data_set$var3))
(Or for the mean and standard deviation:)
> summary(title_of_data_set$var1)
> sd(title_of_data_set$var1, na.rm=TRUE)
Descriptive statistics (comparing two variables)
Descriptive statistics can also show one variable in the context of a second (i.e. bivariate).
- One categorical IV and one quantitative DV
SPSS MEANS TABLES=DV by IV
/CELLS MEAN COUNT STDDEV.
Stata bys IV: summarize DV
SAS proc sort; by IV;
proc means; var DV; by IV;
R > DV.byIV <- by(title_of_data_set$DV, title_of_data_set$IV, mean)
> DV.byIV # for table
> barplot(DV.byIV, beside=T) # for plot
- One categorical IV and one categorical DV
SPSS CROSSTABS
/TABLES=DV by IV
/CELLS=COUNT ROW COLUMN TOTAL.
Stata tab DV IV, row column cell chi2
SAS Proc freq; tables DV*IV;
R > table(title_of_data_set$DV, title_of_data_set$IV) # for table
> prop.table(table(title_of_data_set$DV, title_of_data_set$IV)) # for cell %ages
> prop.table(table(title_of_data_set$DV, title_of_data_set$IV),1) # for row %ages
> prop.table(table(title_of_data_set$DV, title_of_data_set$IV),2) # for column %ages
> barplot(prop.table(table(title_of_data_set$DV, title_of_data_set$IV),2)[rows,]) # for plots of column percentages
Note: If your IV is continuous, for graphing purposes, create meaningful categories and then use the code above.
Descriptive statistics (adding a third variable)
- One categorical IV, one quantitative DV, and a categorical third variable
SPSS MEANS TABLES=DV BY IV BY THIRD_VAR
/CELLS MEAN COUNT STDDEV.
Stata bys IV third_var: summarize DV
SAS proc sort; by IV THIRD_VAR;
proc means; var DV; by IV THIRD_VAR;
R > ftable(by(title_of_data_set$DV, list(title_of_data_set$IV, title_of_data_set$THIRD_VAR), mean)) # to get table
> barplot(by(title_of_data_set$DV, list(title_of_data_set$IV, title_of_data_set$THIRD_VAR), mean), beside=T) # to get plot
- One categorical IV, one categorical DV, and a categorical third variable
SPSS CROSSTABS
/TABLES=DV BY IV BY THIRD_VAR.
Stata bys IV third_var: tab DV
SAS proc sort; by THIRD_VAR;
proc freq; tables DV*IV; by THIRD_VAR;
R > ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR) # for table
> prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR)) # for cell %ages
> prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),1) # for row %ages
> prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),2) # for column %ages
> barplot(prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),2)[rows,]) # for plots of column percentages
Note: If your 3rd variable is continuous, create meaningful categories and then use the code above.
Bivariate statistical tests
- T-test
TBA
- Analysis of Variance (ANOVA)
SPSS ONEWAY QUAN_DV BY CAT_IV
/STATISTICS DESCRIPTIVES.
Stata oneway quan_DV cat_IV, tabulate
SAS proc anova;
class CAT_IV;
model QUAN_DV = CAT_IV;
means CAT_IV;
R > summary(aov(DV ~ IV, data=title_of_data_set))
- Pearson Correlation
A Pearson correlation coefficient evaluates the degree of linear relationship between two quantitative variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. In other words, knowing the value of one variable, you can perfectly predict the value of the second. A correlation of -1 means that there is a perfect, negative, linear relationship, and a correlation of 0 means that there is no linear relationship.
SPSS CORRELATIONS
/VARIABLES= QUANIV QUANDV
/STATISTICS DESCRIPTIVES.
Stata pwcorr quan_IV quan_DV, sig
SAS Proc corr; var QUAN_IV QUAN_DV;
R > cor.test(title_of_data_set$DV, title_of_data_set$IV)
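To make the range of the coefficient concrete, the R sketch below uses small hypothetical vectors. It shows a perfect positive and a perfect negative relationship, and then checks base R's cor() against the textbook formula for a case in between.

```r
# Hypothetical values chosen so the relationships are easy to see.
x <- c(1, 2, 3, 4, 5)
y_perfect <- 2 * x            # rises exactly with x
y_noisy <- c(2, 1, 4, 3, 5)   # rises with x, but not perfectly

r_positive <- cor(x, y_perfect)    # perfect positive relationship
r_negative <- cor(x, -y_perfect)   # perfect negative relationship
r_noisy <- cor(x, y_noisy)         # somewhere between 0 and 1

# Pearson's r by hand: the sum of cross-products of the deviations,
# divided by the square root of the product of the sums of squares.
dx <- x - mean(x)
dy <- y_noisy - mean(y_noisy)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(c(r_positive = r_positive, r_negative = r_negative,
        r_noisy = r_noisy, r_manual = r_manual))
```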
- Chi-Square Test of Independence
A Chi-Square Test of Independence compares frequencies of one categorical variable for different values of a second categorical variable. The null hypothesis is that the relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. The alternative hypothesis is that the relative proportions of one variable are associated with the second variable.
SPSS CROSSTABS
/TABLES= CAT_DV by CAT_IV
/STATISTICS=CHISQ.
Stata tab cat_dv cat_iv, row col chi2
SAS Proc freq; tables CAT_DV*CAT_IV/ chisq;
R > chisq.test(title_of_data_set$DV, title_of_data_set$IV)
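The logic of the test can be sketched in R by computing the statistic by hand for a hypothetical 2 x 2 table of counts. Under the null hypothesis, the expected count in each cell is (row total x column total) / grand total. The result is compared with chisq.test(), with the continuity correction turned off so that the two agree.

```r
# Hypothetical observed counts (rows = DV categories, columns = IV categories).
observed <- matrix(c(30, 10, 20, 40), nrow = 2,
                   dimnames = list(DV = c("no", "yes"),
                                   IV = c("group1", "group2")))

# Expected counts if DV and IV were independent.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: squared differences scaled by the expected counts.
chi_sq_manual <- sum((observed - expected)^2 / expected)

# Built-in test; correct = FALSE disables the Yates continuity correction.
result <- chisq.test(observed, correct = FALSE)

print(expected)
print(c(manual = chi_sq_manual, builtin = unname(result$statistic)))
```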
Multivariate statistical tests
- Multiple Regression
Multiple regression is used when the DV (aka outcome variable) is quantitative.
SPSS REGRESSION
/DEPENDENT QUAN_DV
/METHOD ENTER IV THIRDVAR1 THIRDVAR2.
Stata reg quan_DV IV THIRDVAR1 THIRDVAR2
SAS Proc reg; model QUAN_DV=IV THIRDVAR1 THIRDVAR2;
R > summary(lm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set))
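For a runnable illustration in R, the sketch below simulates a quantitative DV from two predictors and fits the model with lm(). The data, the sample size, and the true coefficients (3 for IV and -1 for THIRDVAR1) are assumptions chosen only for this example.

```r
# Simulated data; all values here are hypothetical.
set.seed(1)
n <- 200
IV <- rnorm(n)
THIRDVAR1 <- rnorm(n)
DV <- 2 + 3 * IV - 1 * THIRDVAR1 + rnorm(n, sd = 0.5)
title_of_data_set <- data.frame(DV, IV, THIRDVAR1)

# Fit the multiple regression and show coefficients, p-values and R-squared.
fit <- lm(DV ~ IV + THIRDVAR1, data = title_of_data_set)
print(summary(fit))

# The estimates should land close to the values used in the simulation.
estimates <- coef(fit)
print(estimates)
```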
- Logistic Regression
Logistic regression is used when the DV (aka outcome variable) is binary/dichotomous. Note that if the dependent variable is categorical, with more than two levels, it must be dichotomized (i.e. made into a two level variable), so that logistic regression can be used.
SPSS LOGISTIC REGRESSION BINARY_DV with IV THIRDVAR1 THIRDVAR2.
Stata logistic binary_DV IV thirdvar1 thirdvar2
SAS Proc logistic; class IV THIRDVAR1 THIRDVAR2; (include the class statement when these variables are categorical) model BINARY_DV=IV THIRDVAR1 THIRDVAR2;
R > library(rms) # the rms package replaces the older Design package
> my.ddist <- datadist(title_of_data_set)
> options(datadist = "my.ddist")
> lrm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set) # for p-values
> summary(lrm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set)) # for odds ratios
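If an add-on package is not available, a logistic regression can also be sketched with base R's glm(). This is an alternative to the lrm() approach above, not the same function. The simulated data and the true coefficients below are assumptions made only for this illustration.

```r
# Simulated data; all values here are hypothetical.
set.seed(2)
n <- 500
IV <- rnorm(n)
THIRDVAR1 <- rbinom(n, 1, 0.5)

# plogis() converts the linear predictor into a probability of DV = 1.
DV <- rbinom(n, 1, plogis(-0.5 + 1.2 * IV + 0.8 * THIRDVAR1))
title_of_data_set <- data.frame(DV, IV, THIRDVAR1)

# family = binomial requests a logistic regression.
fit <- glm(DV ~ IV + THIRDVAR1, family = binomial, data = title_of_data_set)
print(summary(fit)$coefficients)   # estimates and p-values

# Exponentiating the coefficients gives odds ratios.
odds_ratios <- exp(coef(fit))
print(odds_ratios)
```

An odds ratio above 1 indicates that higher values of the predictor are associated with higher odds of the outcome being 1.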