Translated R, SAS, STATA & SPSS

Understanding Syntax

The Passion Driven Statistics curriculum is intended to help students perform basic data management and statistical tests across four major statistical software platforms (R, SAS, Stata and SPSS). This web page provides a library of basic commands that the user can copy and paste into R, SAS, Stata or SPSS to perform a variety of data management tasks and basic statistical tests. Our goal is to help students use statistical computing as a building block in scientific reasoning and creativity. Rather than producing students who can think about statistics only from a software-specific perspective, these resources are meant to help students move flexibly and confidently between statistical software environments.

We use the following convention when presenting software-specific syntax. Bold text indicates syntax that does not need to change (some of the bolded text could be changed, but bold means that it does not have to be). Unbolded text indicates syntax that needs to be adapted to your own project (e.g. the actual name of your data set, your unique variable names, etc.).

Contents needed for every program

Calling in a data set

SPSS	GET FILE='P:\QAC\qac201\Studies\study name\filename.sav'.
Stata	use "P:\QAC\qac201\Studies\study name\filename"
SAS	LIBNAME in "P:\QAC\QAC201\study name"; DATA new; set in.filename;
R	> newdata <- read.table(file = "filename.txt", sep = "\t", header=TRUE)

Sorting the data

SPSS	SORT CASES BY UNIQUE_ID.
Stata	sort unique_id
SAS	proc sort; by unique_id;
R	> title_of_data_set <- title_of_data_set[order(title_of_data_set$unique_id, decreasing=FALSE),]

Abbreviating a data set to a smaller number of variables (i.e. columns)

Selecting variables you want to examine

Because many data sets are very large in terms of both observations and variables, any analyses that you conduct could take several minutes. Subsetting or abbreviating the data based on the variables that you will be examining can shorten the analytic time required to run your program. While this will not make a huge difference if you are running a program only a few times, the time you will save can be substantial if you plan to work extensively with the data.

SPSS	/KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8. (Must follow the SAVE OUTFILE='dataname' command)
Stata	keep var1 var2 var3 var4 var5 var6 var7 var8
SAS	KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8;
R	> var.keep <- c("VAR1", "VAR2", "VAR3", "VAR4", "VAR5", "VAR6", "VAR7", "VAR8")
	> title_of_new_data_set <- new.data[,var.keep]

Outputting your abbreviated data set

SPSS	SAVE OUTFILE='P:\QAC\qac201\Studies\study name\title_of_new_data_set'
Stata	save filename
SAS	Data libname.title_of_new_data_set; set dataname; by unique_id;
R	> write.table(title_of_data_set, file="filename.txt", sep="\t", row.names=FALSE)


Data management tasks

Basic Operations:

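Software	Equal to	Greater than or equal to	Less than or equal to	Greater than	Less than	Not equal to
SPSS	EQ or =	>= or GE	<= or LE	> or GT	< or LT	NE
Stata	==	>=	<=	>	<	!=
SAS	EQ or =	>= or GE	<= or LE	> or GT	< or LT	NE
R	==	>=	<=	>	<	!=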

Identify missing data

Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value for a particular variable (VAR1), you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.

SPSS	RECODE var1 (9=SYSMIS).
Stata	replace var1=. if var1==9
SAS	if VAR1=9 then VAR1=.;
R	> title_of_data_set$VAR1[title_of_data_set$VAR1==9] <- NA
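To see why this recode matters, here is a minimal R sketch using a small made-up data frame (the name toy and its values are hypothetical, not from any course data set). It compares the mean of VAR1 before and after the 9s are set to missing.

```r
# Hypothetical toy data: 9 is the code for "missing" in VAR1.
toy <- data.frame(VAR1 = c(1, 2, 9, 3, 9))

# Before recoding, the 9s are treated as real values and inflate the mean.
mean_with_9 <- mean(toy$VAR1)

# Recode 9 to NA, the missing-data value that R recognizes.
toy$VAR1[toy$VAR1 == 9] <- NA

# na.rm = TRUE tells mean() to skip the missing values.
mean_without_9 <- mean(toy$VAR1, na.rm = TRUE)

print(c(before = mean_with_9, after = mean_without_9))
```

Note that most R summary functions return NA when missing values are present unless you ask them to drop the missing values, as na.rm = TRUE does here.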

Recode responses to “no” based on skip patterns

Many data sets contain skip patterns. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g. quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g. have you ever smoked marijuana daily for a month or more?), those individuals who never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer to this question regarding daily use is “no”. This would need to be explicitly recoded. Note that we commonly code a “no” as 0 and a “yes” as 1; in the syntax below, replace the 7 with whatever value represents “no” in your data.

SPSS	RECODE var1 (SYSMIS=7).
Stata	replace var1=7 if var1==.
SAS	if VAR1=. then VAR1=7;
R	> title_of_data_set$VAR1[is.na(title_of_data_set$VAR1)] <- 7
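The skip-pattern logic described above can also be written so that only respondents who were actually skipped get recoded. The following R sketch assumes two hypothetical variables, ever_used (the screening question) and daily_use (the follow-up question). Respondents who used marijuana but did not answer the follow-up are left as missing.

```r
# Hypothetical toy data: ever_used is the screener, daily_use the follow-up.
toy <- data.frame(
  ever_used = c(1, 0, 1, 0, 1),
  daily_use = c(1, NA, 0, NA, NA)
)

# Flag respondents who were skipped: they said "no" to the screener
# and therefore have no answer to the follow-up.
skipped <- toy$ever_used == 0 & is.na(toy$daily_use)

# Recode only the skipped respondents to "no" (coded 0).
toy$daily_use[skipped] <- 0

# The last respondent used marijuana but did not answer, so stays NA.
print(toy)
```

Recoding every missing value to “no” is simpler, but it would also turn true non-responses into “no” answers, so check how missingness arises in your own data before choosing.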

Recoding string variables into numeric

In most software packages, it is important when preparing to run statistical analyses that all variables have response categories that are numeric rather than “string” or “character” (i.e. response categories are numbers rather than strings of characters and/or symbols). While it is not always needed, it is often recommended that all variables with string responses be recoded into numeric values. These numeric values are known as dummy codes in that they carry no direct numeric meaning.

SPSS	RECODE TREE ('Maple'=1) ('Oak'=2) INTO TREE_N.
Stata	generate TREE_N=.
	replace TREE_N=1 if TREE=="Maple"
	replace TREE_N=2 if TREE=="Oak"
	OR by using the encode command
	encode TREE, gen(TREE_N)
SAS	IF TREE='Maple' then TREE_N=1; else if TREE='Oak' then TREE_N=2;
R	(Not necessary in R)
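One reason the recode is usually unnecessary in R is that R can store string responses as a factor, which keeps the labels and attaches integer codes behind the scenes. If you do want explicit numeric codes, a minimal sketch (with a made-up TREE vector) looks like this:

```r
# Hypothetical string responses.
TREE <- c("Maple", "Oak", "Oak", "Maple")

# Convert to a factor; the order of levels sets the numeric codes.
TREE_F <- factor(TREE, levels = c("Maple", "Oak"))

# Extract the underlying integer codes: 1 = Maple, 2 = Oak.
TREE_N <- as.integer(TREE_F)

print(data.frame(TREE, TREE_N))
```

Stating the levels explicitly is a design choice: without it, R orders the levels alphabetically, which may not match the coding you intend.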

Collapsing response categories

If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. For example, if you have the following categories for geographic region, you may want to collapse some of these categories:

Region: New England=1, Middle Atlantic=2, East North Central=3, West North Central=4, South Atlantic=5, East South Central=6, West South Central=7, Mountain=8, Pacific=9.

New_Region: East=1, West=2.

SPSS	COMPUTE new_region=2.
	IF (region=1 | region=2 | region=3 | region=5 | region=6) new_region=1.
Stata	generate new_region=2
	replace new_region=1 if region==1 | region==2 | region==3 | region==5 | region==6
	OR by using the recode command
	recode region (1/3 5 6=1) (4 7/9=2), gen(new_region)
SAS	if region=1 or region=2 or region=3 or region=5 or region=6 then new_region=1; else if region=4 or region=7 or region=8 or region=9 then new_region=2;
R	> new_region <- rep(NA, nrow(title_of_data_set))
	> new_region[title_of_data_set$region == 1 | title_of_data_set$region == 2 | title_of_data_set$region == 3 | title_of_data_set$region == 5 | title_of_data_set$region == 6] <- 1
	> new_region[title_of_data_set$region == 4 | title_of_data_set$region == 7 | title_of_data_set$region == 8 | title_of_data_set$region == 9] <- 2
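In R, long chains of comparisons joined by | can be shortened with the %in% operator. The sketch below is one possible alternative, using a made-up data frame (toy) with one row per region code, and the same East/West split as the example above.

```r
# Hypothetical toy data: one row for each of the nine region codes.
toy <- data.frame(region = 1:9)

# Region codes that collapse into East (coded 1); all others become West (2).
east <- c(1, 2, 3, 5, 6)
toy$new_region <- ifelse(toy$region %in% east, 1, 2)

# %in% treats a missing region as "not East", so restore missingness explicitly.
toy$new_region[is.na(toy$region)] <- NA

# Cross-tabulate old against new codes to check the recode.
print(table(toy$region, toy$new_region))
```

Cross-tabulating the original variable against the collapsed one is a quick way to confirm that every category landed where you intended.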

Aggregating variables

In many cases, you will want to combine multiple variables into one. For example, a data set may include a variable for each of several different individual anxiety disorders. You may however be interested in anxiety more generally. In this case you could create a general anxiety variable in which those individuals who received a diagnosis of social phobia, generalized anxiety disorder, specific phobia, panic disorder, agoraphobia, or obsessive compulsive disorder would be coded “yes” and those who were free from all of these diagnoses would be coded “no”.

SPSS	IF (socphob=1 | gad=1 | specphob=1 | panic=1 | agora=1 | ocd=1) anxiety=1.
	RECODE anxiety (SYSMIS=0).
Stata	gen anxiety=1 if socphob==1 | gad==1 | specphob==1 | panic==1 | agora==1 | ocd==1
	replace anxiety=0 if anxiety==.
SAS	if socphob=1 or gad=1 or specphob=1 or panic=1 or agora=1 or ocd=1 then anxiety=1; else anxiety=0;
R	> anxiety <- rep(0, nrow(title_of_data_set))
	> anxiety[title_of_data_set$socphob == 1 | title_of_data_set$gad == 1 | title_of_data_set$specphob == 1 | title_of_data_set$panic == 1 | title_of_data_set$agora == 1 | title_of_data_set$ocd == 1] <- 1

Creating a continuous variable

If you are working with a number of items that represent a single construct, it may be useful to create a composite variable or score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent a single composite score ranging from 0 to 4 (i.e. 4 corresponding to the total number of symptoms measured and summed).

SPSS	COMPUTE nd_sum=SUM(nd_symptom1, nd_symptom2, nd_symptom3, nd_symptom4).
Stata	egen nd_sum=rowtotal(nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4)
SAS	nd_sum=sum(of nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4);
R	> nd_sum <- title_of_data_set$nd_symptom1 + title_of_data_set$nd_symptom2 + title_of_data_set$nd_symptom3 + title_of_data_set$nd_symptom4
	> title_of_data_set$nd_sum <- nd_sum
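When you add items with + in R, any respondent with a missing item gets a missing composite score. The sketch below, using made-up symptom data, contrasts that behavior with rowSums(), which can skip missing items. Which behavior you want depends on your research question, so treat this as an illustration rather than a recommendation.

```r
# Hypothetical toy data: four symptom items coded 1 = yes, 0 = no.
toy <- data.frame(
  nd_symptom1 = c(1, 0, 1),
  nd_symptom2 = c(1, 0, NA),
  nd_symptom3 = c(0, 0, 1),
  nd_symptom4 = c(1, 0, 0)
)
items <- c("nd_symptom1", "nd_symptom2", "nd_symptom3", "nd_symptom4")

# Adding with + gives NA whenever any item is missing.
toy$nd_sum_strict <- toy$nd_symptom1 + toy$nd_symptom2 +
  toy$nd_symptom3 + toy$nd_symptom4

# rowSums() with na.rm = TRUE counts only the items that were answered.
toy$nd_sum <- rowSums(toy[, items], na.rm = TRUE)

print(toy)
```

The sum functions in the other packages may also treat missing items differently from one another, so it is worth checking a few rows by hand when you move between software environments.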

Renaming variables

Given the often cryptic names that variables are given in some data sets, it can often be useful to rename them into something you find meaningful (i.e. easier to remember or type).

SPSS	RENAME VARIABLES (var1=newvarname).
Stata	rename var1 newvarname
SAS	RENAME var1=newvarname;
R	> names(title_of_data_set)[names(title_of_data_set)=="VAR1"] <- "newvarname"

Subsetting data to a particular set of observations (i.e. rows)

It can also be necessary to subset the data so that you are including only those observations (i.e. rows of data) that assist in answering your particular research question. For example, if you are interested in identifying demographic predictors of depression, but only among Type II diabetes patients, you would need to subset the data to observations endorsing Type II diabetes (i.e. diabetes2=1 or “yes”).

SPSS	/SELECT=diabetes2 EQ 1 (must be added as a command option)
Stata	if diabetes2==1 (put this at the end of the command)
SAS	if diabetes2=1; (put in the data step before sorting the data)
R	> title_of_subsetted_data <- title_of_data_set[title_of_data_set$diabetes2==1,]
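One thing to watch for when subsetting rows in R is missing data in the selection variable. A comparison against NA returns NA, and R then adds an all-missing row to the result. The sketch below uses a made-up data frame (toy) to show the issue and one common way around it, wrapping the condition in which().

```r
# Hypothetical toy data: the third respondent has a missing diabetes2 value.
toy <- data.frame(id = 1:5, diabetes2 = c(1, 0, NA, 1, 0))

# A plain logical condition keeps an extra all-NA row for the missing value.
with_na_rows <- toy[toy$diabetes2 == 1, ]

# which() keeps only the rows where the condition is definitely TRUE.
clean <- toy[which(toy$diabetes2 == 1), ]

print(with_na_rows)
print(clean)
```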

Descriptive statistics (one variable at a time)

Descriptive statistics are used to describe the basic features of individual variables. Also known as univariate analysis, descriptive statistics summarize one variable at a time, across the observations in your data set.

Displaying frequency tables

SPSS	FREQUENCIES VARIABLES=var1 var2 var3 /ORDER=ANALYSIS.
Stata	tab1 var1 var2 var3
SAS	PROC FREQ; tables var1 var2 var3;
R	> library(descr)
	> freq(as.ordered(title_of_data_set$VAR1))
	> freq(as.ordered(title_of_data_set$VAR2))
	> freq(as.ordered(title_of_data_set$VAR3))

Central tendency

SPSS	DESCRIPTIVES VARIABLES=var1 var2 var3 /STATISTICS=MEAN STDDEV.
Stata	summarize var1 var2 var3
SAS	proc means; var var1 var2 var3;
R	> library(descr)
	> freq(as.ordered(title_of_data_set$var1))
	> freq(as.ordered(title_of_data_set$var2))
	> freq(as.ordered(title_of_data_set$var3))
	(Or for the mean and standard deviation:)
	> summary(title_of_data_set$var1)
	> sd(title_of_data_set$var1, na.rm=TRUE)

Descriptive statistics (comparing two variables)

Descriptive statistics can also show one variable in the context of a second (i.e. bivariate).

One categorical IV and one quantitative DV

SPSS	MEANS TABLES=DV BY IV /CELLS MEAN COUNT STDDEV.
Stata	bys IV: summarize DV
SAS	proc sort; by IV; proc means; var DV; by IV;
R	> DV.byIV <- by(title_of_data_set$DV, title_of_data_set$IV, mean)
	> DV.byIV # for table
	> barplot(DV.byIV, beside=T) # for plot

One categorical IV and one categorical DV

SPSS	CROSSTABS /TABLES=DV BY IV /CELLS=COUNT ROW COLUMN TOTAL.
Stata	tab DV IV, row column cell chi2
SAS	Proc freq; tables DV*IV;
R	> table(title_of_data_set$DV, title_of_data_set$IV) # for table
	> prop.table(table(title_of_data_set$DV, title_of_data_set$IV)) # for cell %ages
	> prop.table(table(title_of_data_set$DV, title_of_data_set$IV),1) # for row %ages
	> prop.table(table(title_of_data_set$DV, title_of_data_set$IV),2) # for column %ages
	> barplot(prop.table(table(title_of_data_set$DV, title_of_data_set$IV),2)[rows,]) # for plots of column percentage

Note: If your IV is continuous, for graphing purposes, create meaningful categories and then use the code above.

Descriptive statistics (adding a third variable)

One categorical IV, one quantitative DV, and a categorical third variable

SPSS	MEANS TABLES=DV BY IV BY THIRD_VAR /CELLS MEAN COUNT STDDEV.
Stata	bys IV third_var: summarize DV
SAS	proc sort; by IV THIRD_VAR; proc means; var DV; by IV THIRD_VAR;
R	> ftable(by(title_of_data_set$DV, list(title_of_data_set$IV, title_of_data_set$THIRD_VAR), mean)) # to get table
	> barplot(by(title_of_data_set$DV, list(title_of_data_set$IV, title_of_data_set$THIRD_VAR), mean), beside=T) # to get plot

One categorical IV, one categorical DV, and a categorical third variable

SPSS	CROSSTABS /TABLES=DV BY IV BY THIRD_VAR.
Stata	bys IV third_var: tab DV
SAS	proc sort; by THIRD_VAR; proc freq; tables DV*IV; by THIRD_VAR;
R	> ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR) # for table
	> prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR)) # for cell %ages
	> prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),1) # for row %ages
	> prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),2) # for column %ages
	> barplot(prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),2)[rows,]) # for plots of column percentage

Note: If your 3rd variable is continuous, create meaningful categories and then use the code above.

Bivariate statistical tests

T-test
TBA

Analysis of Variance (ANOVA)

SPSS	ONEWAY QUAN_DV BY CAT_IV /STATISTICS DESCRIPTIVES.
Stata	oneway quan_DV cat_IV, tabulate
SAS	proc anova; class CAT_IV; model QUAN_DV = CAT_IV; means CAT_IV;
R	> summary(aov(DV ~ IV, data=title_of_data_set))

Pearson Correlation

A Pearson correlation coefficient evaluates the degree of linear relationship between two quantitative variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. In other words, knowing the value of one variable, you can perfectly predict the value of the second. A correlation of -1 means a perfect negative linear relationship, and a correlation of 0 means no linear relationship.

SPSS	CORRELATIONS /VARIABLES= QUANIV QUANDV /STATISTICS DESCRIPTIVES.
Stata	pwcorr quan_IV quan_DV, sig
SAS	Proc corr; var QUAN_IV QUAN_DV;
R	> cor.test(title_of_data_set$DV, title_of_data_set$IV)
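To make the range of the correlation coefficient concrete, the R sketch below builds made-up variables that are exact linear functions of one another, plus one noisy variable. All names and values are hypothetical.

```r
# Hypothetical toy data.
x <- c(1, 2, 3, 4, 5)
y_pos <- 2 * x + 1      # perfect positive linear relationship
y_neg <- -3 * x + 10    # perfect negative linear relationship
y_noisy <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # close to linear, but not exact

r_pos <- cor(x, y_pos)  # correlation at the upper limit of the range
r_neg <- cor(x, y_neg)  # correlation at the lower limit of the range

# cor.test() reports the coefficient together with a significance test.
noisy_test <- cor.test(x, y_noisy)

print(c(r_pos = r_pos, r_neg = r_neg))
print(noisy_test$estimate)
print(noisy_test$p.value)
```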

Chi-Square Test of Independence

A Chi-Square Test of Independence compares frequencies of one categorical variable for different values of a second categorical variable. The null hypothesis is that the relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. The alternate hypothesis is that the relative proportions of one variable are associated with the second variable.

SPSS	CROSSTABS /TABLES= CAT_DV by CAT_IV /STATISTICS=CHISQ.
Stata	tab cat_dv cat_iv, row col chi2
SAS	Proc freq; tables CAT_DV*CAT_IV / chisq;
R	> chisq.test(title_of_data_set$DV, title_of_data_set$IV)
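The null hypothesis described above can be explored by comparing observed counts with the counts expected under independence. The following R sketch uses made-up responses for 100 hypothetical subjects in two groups. Note that for a 2 x 2 table, chisq.test() applies a continuity correction by default.

```r
# Hypothetical toy data: group "a" has 30 yes / 20 no, group "b" has 10 yes / 40 no.
dv <- c(rep("yes", 30), rep("no", 20), rep("yes", 10), rep("no", 40))
iv <- c(rep("a", 50), rep("b", 50))

# Observed counts for each combination of DV and IV.
tab <- table(dv, iv)

# Chi-square test of independence on the observed table.
result <- chisq.test(tab)

print(tab)              # observed counts
print(result$expected)  # counts expected if DV and IV were independent
print(result$p.value)   # small values are evidence against independence
```

Here the proportion answering “yes” differs sharply between the two groups, so the observed counts sit far from the expected ones.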

Multivariate statistical tests

Multiple Regression

Multiple regression is used when the DV (aka outcome variable) is quantitative.

SPSS	REGRESSION /DEPENDENT QUAN_DV /METHOD ENTER IV THIRDVAR1 THIRDVAR2.
Stata	reg quan_DV IV THIRDVAR1 THIRDVAR2
SAS	Proc reg; model QUAN_DV=IV THIRDVAR1 THIRDVAR2;
R	> summary(lm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set))

Logistic Regression

Logistic regression is used when the DV (aka outcome variable) is binary/dichotomous. Note that if the dependent variable is categorical, with more than two levels, it must be dichotomized (i.e. made into a two level variable), so that logistic regression can be used.

SPSS	LOGISTIC REGRESSION BINARY_DV WITH IV THIRDVAR1 THIRDVAR2.
Stata	logistic binary_DV IV thirdvar1 thirdvar2
SAS	Proc logistic; class IV THIRDVAR1 THIRDVAR2; (include the class statement when these variables are categorical)
	model BINARY_DV=IV THIRDVAR1 THIRDVAR2;
R	> library(rms) # rms is the successor to the Design package
	> my.ddist <- datadist(title_of_data_set)
	> options(datadist = "my.ddist")
	> lrm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set) # for p-values
	> summary(lrm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set)) # for odds ratios
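If you would rather not install an add-on package, base R's glm() function can also fit a logistic regression. This is a different function from the lrm() call shown above, and the sketch below uses made-up data with a single predictor. Exponentiating the coefficients converts them from log-odds to odds ratios.

```r
# Hypothetical toy data: a binary outcome and one quantitative predictor.
toy <- data.frame(
  DV = c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1),
  IV = 1:10
)

# family = binomial requests a logistic regression.
fit <- glm(DV ~ IV, data = toy, family = binomial)

# Coefficients (log-odds scale), standard errors, and p-values.
print(summary(fit)$coefficients)

# Exponentiate the coefficients to express them as odds ratios.
odds_ratios <- exp(coef(fit))
print(odds_ratios)
```

An odds ratio above 1 for IV indicates that higher values of the predictor go with higher odds of the outcome being 1, which is how this toy data was constructed.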
