Quantitative Analysis Center

Understanding Syntax

The Passion Driven Statistics curriculum is intended to help students perform basic data management and statistical tests across 4 major statistical software platforms (R, SAS, Stata and SPSS). This web page provides a library of basic commands that the user can copy and paste into R, SAS, Stata or SPSS to perform a variety of data management tasks and basic statistical tests. Our goal is to help students use statistical computing as a building block in scientific reasoning and creativity. Rather than producing students who can think about statistics only from a software-specific perspective, these resources are meant to help students move flexibly and confidently between statistical software environments.

We use the following convention when presenting software-specific syntax. Bold text indicates syntax that does not need to change (some of the bolded text could be changed, but the bold indicates that it does not have to be). Unbolded text indicates syntax that you must adapt to your own project (e.g. the actual name of your data set, your unique variable names, etc.).

Contents needed for every program

  • Calling in a data set
    • SPSS

      GET FILE='P:\QAC\qac201\Studies\study name\filename.sav'.

      Stata

      use "P:\QAC\qac201\Studies\study name\filename"

      SAS

      LIBNAME in "P:\QAC\QAC201\study name";
      DATA new; set in.filename;

      R

      > newdata <- read.table(file = "filename.txt", sep = "\t", header=T)

  • Sorting the data
    • SPSS

      SORT CASES BY UNIQUE_ID.

      Stata

      sort unique_id

      SAS

      proc sort; by unique_id;

      R

      > title_of_data_set <- title_of_data_set[order(title_of_data_set$unique_id,decreasing=F),]

Abbreviating a data set to a smaller number of variables (i.e. columns)

  • Selecting variables you want to examine
    • Because many data sets are very large in terms of both observations and variables, any analyses that you conduct could take several minutes. Subsetting or abbreviating the data based on the variables that you will be examining can shorten the analytic time required to run your program. While this will not make a huge difference if you are running a program only a few times, the time you will save can be substantial if you plan to work extensively with the data.

      SPSS

      /KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8. (Must follow the SAVE OUTFILE='dataname' command)

      Stata

      keep var1 var2 var3 var4 var5 var6 var7 var8

      SAS

      KEEP VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8;

      R

      > var.keep <- c("VAR1", "VAR2", "VAR3", "VAR4", "VAR5", "VAR6", "VAR7", "VAR8")

      > title_of_new_data_set <- new.data[,var.keep]



  • Outputting your abbreviated data set
    • SPSS

      SAVE OUTFILE='P:\QAC\qac201\Studies\study name\title_of_new_data_set'

      Stata

      save filename

      SAS

      Data libname.title_of_new_data_set; set dataname; by unique_id;

      R

      > write.table(title_of_data_set, file="filename.txt", sep="\t", row.names=F)

Data management tasks

  • Basic Operations:
    • SPSS

      EQ or =

      >= or GE

      <= or LE

      > or GT

      < or LT

      NE

      Stata

      ==

      >=

      <=

      >

      <

      !=

      SAS

      EQ or =

      >= or GE

      <= or LE

      > or GT

      < or LT

      NE

      R

      ==

      >=

      <=

      >

      <

      !=

  • Identify missing data
    • Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value for a particular variable (VAR1), you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.

      SPSS

      RECODE var1 (9=SYSMIS).

      Stata

      replace var1=. if var1==9

      SAS

      if VAR1=9 then VAR1=.;

      R

      > title_of_data_set$VAR1[title_of_data_set$VAR1==9] <- NA
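
As a rough illustration of why this recode matters, the R sketch below uses a small made-up data frame (the values are hypothetical, not taken from any course data set) in which 9 stands for a missing response. It compares the mean of VAR1 before and after the recode.

```r
# Hypothetical data: VAR1 is a rating in which 9 was entered for "missing"
title_of_data_set <- data.frame(VAR1 = c(1, 2, 9, 3, 9, 4))

# Mean before recoding: the 9s are treated as real values
mean_before <- mean(title_of_data_set$VAR1)

# Recode 9 to NA, as in the R syntax above
title_of_data_set$VAR1[title_of_data_set$VAR1 == 9] <- NA

# Mean after recoding: na.rm = TRUE leaves the missing values out
mean_after <- mean(title_of_data_set$VAR1, na.rm = TRUE)

print(c(before = mean_before, after = mean_after))
```

In this toy example the mean drops from about 4.67 to 2.5 once the two 9s are no longer counted as real responses.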

  • Recode responses to “no” based on skip patterns
    • Some data sets contain a number of skip patterns. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g. quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g. have you ever smoked marijuana daily for a month or more?), those individuals who never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer to this question regarding daily use is “no”. This would need to be explicitly recoded. Note that we commonly code a “no” as 0 and a “yes” as 1. The syntax below recodes missing values to 7; substitute whatever value represents “no” in your own data (commonly 0).

      SPSS

      RECODE var1 (SYSMIS=7).

      Stata

      replace var1=7 if var1==.

      SAS

      if VAR1=. then VAR1=7;

      R

      > title_of_data_set$VAR1[is.na(title_of_data_set$VAR1)] <- 7



  • Recoding string variables into numeric
    • In most software packages, it is important when preparing to run statistical analyses that all variables have response categories that are numeric rather than “string” or “character” (i.e. response categories are numbers rather than strings of characters and/or symbols). While it is not always needed, it is often recommended that all variables with string responses be recoded into numeric values. These numeric values are known as dummy codes in that they carry no direct numeric meaning.

      SPSS

      RECODE TREE ('Maple'=1) ('Oak'=2) INTO TREE_N.

      Stata

      generate TREE_N=.
      replace TREE_N=1 if TREE=="Maple"
      replace TREE_N=2 if TREE=="Oak"
      OR by using the encode command
      encode TREE, gen(TREE_N)

      SAS

      IF TREE='Maple' then TREE_N=1;
      else if TREE='Oak' then TREE_N=2;

      R

      (Not necessary in R)
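
R does not require the recode because it can store string responses as a factor. If numeric codes are still wanted, one possible approach, sketched here with a made-up TREE variable, is to set the factor levels explicitly and read off the underlying integer codes.

```r
# Hypothetical string variable
title_of_data_set <- data.frame(TREE = c("Oak", "Maple", "Oak", "Maple"),
                                stringsAsFactors = FALSE)

# Store the strings as a factor; the order of 'levels' fixes which code each string gets
title_of_data_set$TREE <- factor(title_of_data_set$TREE, levels = c("Maple", "Oak"))

# as.integer() returns the position of each value in the levels (Maple = 1, Oak = 2)
title_of_data_set$TREE_N <- as.integer(title_of_data_set$TREE)

print(title_of_data_set)
```

Most R modelling functions accept the factor directly, so TREE_N is only needed when a function specifically expects numbers.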

  • Collapsing response categories
    • If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations in one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. For example, if you have the following categories for geographic region, you may want to collapse some of these categories:

      Region: New England=1, Middle Atlantic=2, East North Central=3, West North Central=4, South Atlantic=5, East South Central=6, West South Central=7, Mountain=8, Pacific=9.

      New_Region: East=1, West=2.

      SPSS

      COMPUTE new_region=2.
      IF (region=1| region=2|region=3| region=5|region=6) new_region=1.

      Stata

      generate new_region =2
      replace new_region=1 if region==1| region==2|region==3| region==5|region==6
      OR by using the recode command
      recode region (1/3 5 6=1) (4 7/9=2), gen(new_region)

      SAS

      if region=1 or region=2 or region=3 or region=5 or region=6 then new_region=1;
      else if region=4 or region=7 or region=8 or region=9 then new_region=2;

      R

      > new_region <- rep(NA, nrow(title_of_data_set))
      > new_region[title_of_data_set$region == 1 | title_of_data_set$region == 2 | title_of_data_set$region == 3 | title_of_data_set$region == 5 | title_of_data_set$region == 6] <- 1
      > new_region[title_of_data_set$region == 4 | title_of_data_set$region == 7 | title_of_data_set$region == 8 | title_of_data_set$region == 9] <- 2
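
The long chains of == and | in the R syntax above can be shortened with the %in% operator. This is only a sketch on a made-up data frame with one row per region code plus one missing value; the East/West grouping follows the example above.

```r
# Hypothetical data: one observation per region code 1-9, plus one missing value
title_of_data_set <- data.frame(region = c(1:9, NA))

# Start with NA so that missing or unexpected codes stay missing
new_region <- rep(NA, nrow(title_of_data_set))

# %in% tests membership in a set of codes and returns FALSE (not NA) for missing values
new_region[title_of_data_set$region %in% c(1, 2, 3, 5, 6)] <- 1  # East
new_region[title_of_data_set$region %in% c(4, 7, 8, 9)] <- 2     # West

title_of_data_set$new_region <- new_region

# Cross-tabulate old against new codes to check the recode
print(table(title_of_data_set$region, title_of_data_set$new_region, useNA = "ifany"))
```

Cross-tabulating the original and collapsed variables, as in the last line, is a quick check that every code landed in the intended category.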

  • Aggregating variables
    • In many cases, you will want to combine multiple variables into one. For example, a data set may include a variable for each of several different individual anxiety disorders. You may however be interested in anxiety more generally. In this case you could create a general anxiety variable in which those individuals who received a diagnosis of social phobia, generalized anxiety disorder, specific phobia, panic disorder, agoraphobia, or obsessive compulsive disorder would be coded “yes” and those who were free from all of these diagnoses would be coded “no”.

      SPSS

      IF (socphob=1|gad=1|specphob=1| panic=1|agora=1|ocd=1) anxiety=1.
      RECODE anxiety (SYSMIS=0).

      Stata

      gen anxiety=1 if socphob==1|gad==1|specphob==1| panic==1|agora==1|ocd==1
      replace anxiety=0 if anxiety==.

      SAS

      if socphob=1 or gad=1 or specphob=1 or panic=1 or agora=1 or ocd=1 then anxiety=1; else anxiety=0;

      R

      > anxiety <- rep(0, nrow(title_of_data_set))

      > anxiety[title_of_data_set$socphob == 1 | title_of_data_set$gad == 1 | title_of_data_set$specphob == 1 | title_of_data_set$panic == 1 | title_of_data_set$agora == 1 | title_of_data_set$ocd == 1] <- 1
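
When many diagnosis variables are involved, an alternative in R is to count the "yes" responses across columns with rowSums(). The sketch below assumes a small hypothetical data frame and, like the syntax above, treats a missing diagnosis as "no".

```r
# Hypothetical data: four people, six diagnosis variables coded 1 = yes, 0 = no
title_of_data_set <- data.frame(
  socphob  = c(0, 1, 0, 0),
  gad      = c(0, 0, 0, NA),
  specphob = c(0, 0, 0, 0),
  panic    = c(0, 0, 0, 0),
  agora    = c(0, 0, 0, 0),
  ocd      = c(0, 0, 1, 0)
)

dx_vars <- c("socphob", "gad", "specphob", "panic", "agora", "ocd")

# Count how many diagnoses equal 1 in each row, ignoring missing values,
# then code anxiety as 1 if that count is above zero and 0 otherwise
n_dx <- rowSums(title_of_data_set[, dx_vars] == 1, na.rm = TRUE)
anxiety <- as.numeric(n_dx > 0)

title_of_data_set$anxiety <- anxiety
print(title_of_data_set)
```

Whether a person with some missing diagnoses and no positive ones should be coded 0 or left missing is a judgment call for your own project.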

  • Creating a continuous variable
    • If you are working with a number of items that represent a single construct, it may be useful to create a composite variable or score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent a single composite score ranging from 0 to 4 (i.e. 4 corresponding to the total number of symptoms measured and summed).

      SPSS

      COMPUTE nd_sum=sum(nd_symptom1, nd_symptom2, nd_symptom3, nd_symptom4).

      Stata

      egen nd_sum=rsum(nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4)

      SAS

      nd_sum=sum (of nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4);

      R

      > nd_sum <- title_of_data_set$nd_symptom1 + title_of_data_set$nd_symptom2 + title_of_data_set$nd_symptom3 + title_of_data_set$nd_symptom4

      > title_of_data_set$nd_sum <- nd_sum
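
One caveat with the R syntax above: adding columns with + gives NA for any observation that is missing one or more items, whereas SUM in SPSS and sum() in SAS add up the non-missing items. The sketch below, on hypothetical data, shows both behaviours so that you can choose the one you intend; rowSums() with na.rm=TRUE is one way to mimic the SPSS and SAS result.

```r
# Hypothetical data: three people, four symptom items coded 1 = yes, 0 = no
title_of_data_set <- data.frame(
  nd_symptom1 = c(1, 0, 1),
  nd_symptom2 = c(1, 0, NA),
  nd_symptom3 = c(0, 0, 1),
  nd_symptom4 = c(1, 0, 0)
)

items <- c("nd_symptom1", "nd_symptom2", "nd_symptom3", "nd_symptom4")

# Adding with + returns NA for any row that has a missing item
sum_plus <- with(title_of_data_set,
                 nd_symptom1 + nd_symptom2 + nd_symptom3 + nd_symptom4)

# rowSums() with na.rm = TRUE adds up only the non-missing items
sum_rows <- rowSums(title_of_data_set[, items], na.rm = TRUE)

print(data.frame(sum_plus, sum_rows))
```

For the third person, who is missing one item, the first approach returns NA and the second returns 2.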

  • Renaming variables
    • Given the often cryptic names that variables are given in some data sets, it can be useful to rename them into something you find meaningful (i.e. easier to remember or type).

      SPSS

      RENAME VARIABLES (var1=newvarname).

      Stata

      rename var1 newvarname

      SAS

      RENAME var1=newvarname;

      R

      > names(title_of_data_set)[names(title_of_data_set)=="VAR1"] <- "newvarname"

  • Subsetting data to a particular set of observations (i.e. rows)
    • It can also be necessary to subset the data so that you are including only those observations (i.e. rows of data) that assist in answering your particular research question. For example, if you are interested in identifying demographic predictors of depression, but only among Type II diabetes patients, you would need to subset the data to observations endorsing Type II diabetes (i.e. diabetes2=1 or “yes”).

      SPSS

      /SELECT=diabetes2 EQ 1 (must be added as a command option)

      Stata

      if diabetes2==1 (put this at the end of the command)

      SAS

      if diabetes2=1; (put in the data step before sorting the data)

      R

      > title_of_subsetted_data <- title_of_data_set[title_of_data_set$diabetes2 == 1,]
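
If the subsetting variable contains missing values, a logical row index in R returns an extra all-NA row for each of them. Wrapping the condition in which() is one common way to avoid this. A minimal sketch with made-up data (unique_id and the values are illustrative):

```r
# Hypothetical data: five people, one with a missing diabetes2 value
title_of_data_set <- data.frame(unique_id = 1:5,
                                diabetes2 = c(1, 0, NA, 1, 0))

# A bare logical index keeps the missing case as a row filled with NA
with_na_row <- title_of_data_set[title_of_data_set$diabetes2 == 1, ]

# which() returns only the positions where the condition is TRUE
title_of_subsetted_data <-
  title_of_data_set[which(title_of_data_set$diabetes2 == 1), ]

print(nrow(with_na_row))              # 3 rows, one of them all NA
print(title_of_subsetted_data)        # 2 rows: unique_id 1 and 4
```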

Descriptive statistics (one variable at a time)

Descriptive statistics are used to describe the basic features of individual variables. Also known as univariate analysis, descriptive statistics summarize one variable at a time, across the observations in your data set.

  • Displaying frequency tables
    • SPSS

      FREQUENCIES VARIABLES=var1 var2 var3
      /ORDER=ANALYSIS.

      Stata

      tab1 var1 var2 var3

      SAS

      PROC FREQ; tables var1 var2 var3;

      R

      > library(descr)

      > freq(as.ordered(title_of_data_set$VAR1))

      > freq(as.ordered(title_of_data_set$VAR2))

      > freq(as.ordered(title_of_data_set$VAR3))

  • Central tendency
    • SPSS

      DESCRIPTIVES VARIABLES=var1 var2 var3
      /STATISTICS=MEAN STDDEV.

      Stata

      summarize var1 var2 var3

      SAS

      proc means; var var1 var2 var3;

      R

      > summary(title_of_data_set$var1) # for minimum, quartiles, median and mean

      > mean(title_of_data_set$var1, na.rm=T) # for mean

      > sd(title_of_data_set$var1, na.rm=T) # for standard deviation

Descriptive statistics (comparing two variables)

Descriptive statistics can also show one variable in the context of a second (i.e. bivariate).

  • One categorical IV and one quantitative DV
    • SPSS

      MEANS TABLES=DV BY IV
      /CELLS MEAN COUNT STDDEV.

      Stata

      bys IV: summarize DV

      SAS

      proc sort; by IV;
      proc means; var DV; by IV;

      R

      > DV.byIV <- by(title_of_data_set$DV, title_of_data_set$IV, mean)
      > DV.byIV # for table
      > barplot(DV.byIV, beside=T) # for plot

  • One categorical IV and one categorical DV
    • SPSS

      CROSSTABS
      /TABLES=DV by IV
      /CELLS=COUNT ROW COLUMN TOTAL.

      Stata

      tab DV IV, row column cell chi2

      SAS

      Proc freq; tables DV*IV;

      R

      > table(title_of_data_set$DV, title_of_data_set$IV) # for table
      > prop.table(table(title_of_data_set$DV, title_of_data_set$IV)) # for cell %ages
      > prop.table(table(title_of_data_set$DV, title_of_data_set$IV),1) # for row %ages
      > prop.table(table(title_of_data_set$DV, title_of_data_set$IV),2) # for column %ages
      > barplot(prop.table(table(title_of_data_set$DV, title_of_data_set$IV),2)[rows,]) # for plots of column percentages

      Note: If your IV is continuous, for graphing purposes, create meaningful categories and then use the code above.

Descriptive statistics (adding a third variable)

  • One categorical IV, one quantitative DV, and a categorical third variable
    • SPSS

      MEANS TABLES=DV BY IV BY THIRD_VAR
      /CELLS MEAN COUNT STDDEV.

      Stata

      bys IV third_var: summarize DV

      SAS

      proc sort; by IV THIRD_VAR;

      proc means; var DV; by IV THIRD_VAR;

      R

      >ftable(by(title_of_data_set$DV, list(title_of_data_set$IV, title_of_data_set$THIRD_VAR), mean)) # to get table

      > barplot(by(title_of_data_set$DV, list(title_of_data_set$IV, title_of_data_set$THIRD_VAR), mean), beside=T) # to get plot

  • One categorical IV, one categorical DV, and a categorical third variable
    • SPSS

      CROSSTABS
      /TABLES=DV BY IV BY THIRD_VAR.

      Stata

      bys IV third_var: tab DV

      SAS

      proc sort; by THIRD_VAR;
      proc freq; tables DV*IV; by THIRD_VAR;

      R

      > ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR) # for table
      > prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR)) # for cell %ages
      > prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),1) # for row %ages
      > prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),2) # for column %ages
      > barplot(prop.table(ftable(title_of_data_set$DV, title_of_data_set$IV, title_of_data_set$THIRD_VAR),2)[rows,]) # for plots of column percentages

      Note: If your 3rd variable is continuous, create meaningful categories and then use the code above.

Bivariate statistical tests

  • T-test
    • TBA
  • Analysis of Variance (ANOVA)
    • SPSS

      ONEWAY QUAN_DV BY CAT_IV
      /STATISTICS DESCRIPTIVES.

      Stata

      oneway quan_DV cat_IV, tabulate

      SAS

      proc anova;
      class CAT_IV;
      model QUAN_DV = CAT_IV;
      means CAT_IV;

      R

      > summary(aov(DV ~ IV, data=title_of_data_set))

  • Pearson Correlation
    • A Pearson correlation coefficient evaluates the degree of linear relationship between two quantitative variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables; a correlation of -1 means a perfect, negative, linear relationship; and a correlation of 0 means no linear relationship. When the correlation is perfect, knowing the value of one variable lets you perfectly predict the value of the second.

      SPSS

      CORRELATIONS
      /VARIABLES= QUANIV QUANDV
      /STATISTICS DESCRIPTIVES.

      Stata

      pwcorr quan_IV quan_DV, sig

      SAS

      Proc corr; var QUAN_IV QUAN_DV;

      R

      > cor.test(title_of_data_set$DV, title_of_data_set$IV)
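
To make the +1 and -1 endpoints concrete, the R sketch below builds made-up variables that are exact linear functions of each other, plus one noisier variable. All names and values are hypothetical.

```r
# Hypothetical quantitative variables
x <- c(1, 2, 3, 4, 5)
y_pos <- 2 * x + 3           # exact increasing linear function of x
y_neg <- -2 * x + 3          # exact decreasing linear function of x
y_noisy <- c(2, 1, 4, 3, 5)  # generally increases with x, but not perfectly

r_pos <- cor(x, y_pos)       # perfect positive relationship
r_neg <- cor(x, y_neg)       # perfect negative relationship

# cor.test() returns the coefficient together with a p-value
ct <- cor.test(x, y_noisy)

print(c(r_pos = r_pos, r_neg = r_neg, r_noisy = unname(ct$estimate)))
print(ct$p.value)
```

The first two correlations come out at +1 and -1, while the noisier pair gives 0.8.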

  • Chi-Square Test of Independence
    • A Chi-Square Test of Independence compares frequencies of one categorical variable for different values of a second categorical variable. The null hypothesis is that the relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. The alternative hypothesis is that the relative proportions of one variable are associated with the second variable.

      SPSS

      CROSSTABS
      /TABLES= CAT_DV by CAT_IV
      /STATISTICS=CHISQ.

      Stata

      tab cat_dv cat_iv,  row col chi2

      SAS

      Proc freq; tables CAT_DV*CAT_IV/ chisq;

      R

      > chisq.test(title_of_data_set$DV, title_of_data_set$IV)
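
The following R sketch walks through the test on made-up data: 100 hypothetical observations, two groups of 50 on the IV, and a yes/no DV. It also prints the expected counts, which are the frequencies implied by the null hypothesis of equal proportions.

```r
# Hypothetical data: group "a" has 30 yes / 20 no, group "b" has 15 yes / 35 no
title_of_data_set <- data.frame(
  IV = rep(c("a", "b"), each = 50),
  DV = c(rep(c("yes", "no"), times = c(30, 20)),
         rep(c("yes", "no"), times = c(15, 35)))
)

# Observed frequencies (DV in rows, IV in columns)
observed <- table(title_of_data_set$DV, title_of_data_set$IV)

# For a 2x2 table chisq.test() applies a continuity correction by default
result <- chisq.test(observed)

print(observed)
print(result$expected)  # counts expected if DV were independent of IV
print(result)
```

Here the expected count of "yes" is 22.5 in each group, so the observed counts of 30 and 15 depart from independence and the test gives a small p-value.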

Multivariate statistical tests

  • Multiple Regression
    • Multiple regression is used when the DV (aka outcome variable) is quantitative.

      SPSS

      REGRESSION
      /DEPENDENT QUAN_DV
      /METHOD ENTER IV THIRDVAR1 THIRDVAR2.

      Stata

      reg quan_DV IV THIRDVAR1 THIRDVAR2

      SAS

      Proc reg; model QUAN_DV=IV THIRDVAR1 THIRDVAR2;

      R

      > summary(lm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set))

  • Logistic Regression
    • Logistic regression is used when the DV (aka outcome variable) is binary/dichotomous. Note that if the dependent variable is categorical, with more than two levels, it must be dichotomized (i.e. made into a two level variable), so that logistic regression can be used.

      SPSS

      LOGISTIC REGRESSION BINARY_DV with IV THIRDVAR1 THIRDVAR2.

      Stata

      logistic binary_DV IV thirdvar1 thirdvar2

      SAS

      Proc logistic; class IV THIRDVAR1; model BINARY_DV=IV THIRDVAR1 THIRDVAR2; (include a variable in the class statement only when it is categorical)

      R

      > library(rms)

      > my.ddist <- datadist(title_of_data_set)

      > options(datadist = "my.ddist")

      > lrm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set) # for p-values

      > summary(lrm(DV ~ IV + THIRDVAR1 + THIRDVAR2, data=title_of_data_set)) # for odds ratios
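
Logistic regression can also be fit with glm(), which is part of base R, so no extra package is needed. This is offered as an alternative to the lrm() syntax above, not a replacement for it; the data below are made up and use a single predictor for brevity (further predictors are added with +, as in the formula above).

```r
# Hypothetical data: binary DV (0/1) and one quantitative IV
title_of_data_set <- data.frame(
  DV = c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1),
  IV = 1:10
)

# family = binomial requests a logistic regression
fit <- glm(DV ~ IV, data = title_of_data_set, family = binomial)

# Coefficients are on the log-odds scale; summary() gives their p-values
print(summary(fit))

# Exponentiating the coefficients gives odds ratios
odds_ratios <- exp(coef(fit))

# Wald confidence intervals, also moved to the odds-ratio scale
or_ci <- exp(confint.default(fit))

print(cbind(odds_ratio = odds_ratios, or_ci))
```

An odds ratio above 1 for IV indicates that the odds of DV=1 increase as IV increases; with only ten made-up observations the estimates are for illustration only.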