Lecture: More on SAS Procedures

General Comments

  • There is an application called Jupyter notebooks (formerly iPython), that has some functionality similar to R Markdown and can execute SAS code. However, it is still very glitchy for SAS and not worth the hassle in this class.
  • Manually Entering Data can also be challenging in SAS, next is a similar example to what you worked on in lab.
In [1]:
* Manual Input of Data;
DATA Favorites;
    INPUT Names $ Food $ Activity $;
        DATALINES;
    Andy Milkshakes Golf
    Eleanor Cauliflower Skiing
    Georgiana Anything Gymnastics
        ;
RUN;

PROC PRINT DATA=Favorites;
RUN;
Out[1]:
SAS Output

The SAS System

Obs Names Food Activity
1 Andy Milkshak Golf
2 Eleanor Cauliflo Skiing
3 Georgian Anything Gymnasti
In [2]:
* Manual Input of Data;
DATA Favorites;
    INPUT Names :$15. Food :$15. Activity :$15.;
        DATALINES;
    Andy Milkshakes Golf
    Eleanor Cauliflower Skiing
    Georgiana Anything Gymnastics
        ;
RUN;

PROC PRINT DATA=Favorites;
RUN;
Out[2]:
SAS Output

The SAS System

Obs Names Food Activity
1 Andy Milkshakes Golf
2 Eleanor Cauliflower Skiing
3 Georgiana Anything Gymnastics

Using PROC MEANS

In [3]:
* SET LIBNAME (Note this is slightly different in SAS OnDemand, use the link we have seen before);
libname mydata '/folders/myfolders/';
Out[3]:

45   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
46
47 * SET LIBNAME (Note this is slightly different in SAS OnDemand, use the link we have seen before);
48 libname mydata '/folders/myfolders/';
NOTE: Libref MYDATA was successfully assigned as follows:
Engine: V9
Physical Name: /folders/myfolders
49
50 ods html5 close;ods listing;

51
In [4]:
PROC MEANS DATA = mydata.housing;
RUN;
Out[4]:
SAS Output

The SAS System

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum
Living_Sq_Ft
Closing_Price
2000
2000
1840.32
244146.72
853.2350143
319263.47
488.0000000
20000.00
6963.00
7650000.00

PROC MEANS OPTIONS

  • MAXDEC = n (number of decimal places to display)
  • MAX
  • MIN
  • MEAN
  • MEDIAN
  • MODE
  • N (number of non-missing values)
  • NMISS (number of missing values)
  • RANGE
  • STDDEV
  • SUM
In [5]:
PROC MEANS DATA = mydata.housing MAXDEC=0 MEAN RANGE STDDEV N MODE;
RUN;
Out[5]:
SAS Output

The SAS System

The MEANS Procedure

Variable Mean Range Std Dev N Mode
Living_Sq_Ft
Closing_Price
1840
244147
6475
7630000
853
319263
2000
2000
1008
125000

Specifying Variables in Proc Means

  • The default setting is to compute means for all numeric values in the data set.
  • To specify which variable to include use a VAR statement within PROC MEANS.
In [6]:
PROC MEANS DATA = mydata.housing MAXDEC=0 MEAN RANGE STDDEV N MODE;
    VAR Living_Sq_Ft;
RUN;
Out[6]:
SAS Output

The SAS System

The MEANS Procedure

Analysis Variable : Living_Sq_Ft
Mean Range Std Dev N Mode
1840 6475 853 2000 1008

Writing Proc Means Output to New Data Set

In [7]:
PROC MEANS DATA = mydata.housing MAXDEC=0 MEAN RANGE STDDEV N MODE;
    OUTPUT OUT = Summaries;
RUN;
Out[7]:
SAS Output

The SAS System

The MEANS Procedure

Variable Mean Range Std Dev N Mode
Living_Sq_Ft
Closing_Price
1840
244147
6475
7630000
853
319263
2000
2000
1008
125000
In [8]:
PROC PRINT DATA= Summaries;
    TITLE 'DATA FILE FROM PROC MEANS';
RUN;
Out[8]:
SAS Output

DATA FILE FROM PROC MEANS

Obs _TYPE_ _FREQ_ _STAT_ Living_Sq_Ft Closing_Price
1 0 2000 N 2000 2000
2 0 2000 MIN 488 20000
3 0 2000 MAX 6963 7650000
4 0 2000 MEAN 1840.32 244146.7215
5 0 2000 STD 853.23501428 319263.46852

Proc Means Exercise 1

  1. Use the bikes data set (this can be accessed using the course specific libname statement).
  2. Include only days that are specified as holidays.
  3. Summarize the Count variable using Proc Means to compute: the mean, median, mode, and range of the variable Count.
In [9]:
Data Bikes_Holiday;
	set mydata.bikes;
	if Holiday = 1;
RUN;

Proc Means DATA = Bikes_Holiday MEAN MEDIAN MODE RANGE;
	VAR Count;
RUN;
Out[9]:
SAS Output

DATA FILE FROM PROC MEANS

The MEANS Procedure

Analysis Variable : count
Mean Median Mode Range
185.8778135 133.0000000 4.0000000 711.0000000

PROC FREQ

PROC FREQ is a procedure for creating frequency tables. Each row in a frequency table corresponds to a level of a categorical variable.

In [10]:
PROC FREQ DATA = mydata.housing;
    TABLES State;
RUN;
Out[10]:
SAS Output

DATA FILE FROM PROC MEANS

The FREQ Procedure

State Frequency Percent Cumulative
Frequency
Cumulative
Percent
CA 180 9.00 180 9.00
CO 80 4.00 260 13.00
CT 141 7.05 401 20.05
DC 20 1.00 421 21.05
FL 140 7.00 561 28.05
GA 60 3.00 621 31.05
HI 20 1.00 641 32.05
IA 40 2.00 681 34.05
ID 20 1.00 701 35.05
IL 39 1.95 740 37.00
IN 60 3.00 800 40.00
KS 20 1.00 820 41.00
KY 20 1.00 840 42.00
MA 123 6.15 963 48.15
ME 20 1.00 983 49.15
MI 60 3.00 1043 52.15
MN 40 2.00 1083 54.15
MO 20 1.00 1103 55.15
MT 20 1.00 1123 56.15
NC 160 8.00 1283 64.15
NE 40 2.00 1323 66.15
NY 60 3.00 1383 69.15
OH 120 6.00 1503 75.15
OK 20 1.00 1523 76.15
OR 20 1.00 1543 77.15
PA 137 6.85 1680 84.00
RI 20 1.00 1700 85.00
TX 120 6.00 1820 91.00
UT 20 1.00 1840 92.00
VA 80 4.00 1920 96.00
WA 60 3.00 1980 99.00
WY 20 1.00 2000 100.00

More on Data Statements

  • What is the new data set and what is the old data set from this code.
DATA housing;
    SET mydata.housing;
RUN;
  • Which of these data sets are permanent files.

What do these two data steps do?

DATA virginia;
    Set mydata.housing;
    IF state = 'VA';
RUN;

DATA colorado;
    Set mydata.housing;
    IF state = 'CO';
RUN;
In [11]:
DATA virginia;
    Set mydata.housing;
    IF state = 'VA';
RUN;

DATA colorado;
    Set mydata.housing;
    IF state = 'CO';
RUN;
Out[11]:

113  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
114
115 DATA virginia;
116 Set mydata.housing;
117 IF state = 'VA';
118 RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.VIRGINIA has 80 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

119
120 DATA colorado;
121 Set mydata.housing;
122 IF state = 'CO';
123 RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.COLORADO has 80 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

124 ods html5 close;ods listing;

125

Stacking Data Sets

In R to combine (or vertically stack two data frames) we used the rbind() function.

This can also be done in a standard SAS DATA STEP.

In [12]:
DATA VA_CO;
    SET virginia colorado;
RUN;
Out[12]:

127  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
128
129 DATA VA_CO;
130 SET virginia colorado;
131 RUN;
NOTE: There were 80 observations read from the data set WORK.VIRGINIA.
NOTE: There were 80 observations read from the data set WORK.COLORADO.
NOTE: The data set WORK.VA_CO has 160 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

132
133 ods html5 close;ods listing;

134
In [13]:
PROC PRINT DATA=VA_CO;
RUN;
Out[13]:
SAS Output

DATA FILE FROM PROC MEANS

Obs City State Zip_Code Living_Sq_Ft Closing_Price
1 PROVIDENC VA 23140 1621 224900
2 PROVIDENC VA 23140 1972 195196
3 PROVIDENC VA 23140 1796 159000
4 PROVIDENC VA 23140 2718 409950
5 PROVIDENC VA 23140 1482 57900
6 PROVIDENC VA 23140 2759 40000
7 PROVIDENC VA 23140 2522 370000
8 PROVIDENC VA 23140 2072 179900
9 PROVIDENC VA 23140 1212 90000
10 PROVIDENC VA 23140 3604 494369
11 PROVIDENC VA 23140 3180 125000
12 PROVIDENC VA 23140 2635 450000
13 PROVIDENC VA 23140 3087 118900
14 PROVIDENC VA 23140 2632 352282
15 PROVIDENC VA 23140 3072 379990
16 PROVIDENC VA 23140 3882 284700
17 PROVIDENC VA 23140 2352 520000
18 PROVIDENC VA 23140 1365 306553
19 PROVIDENC VA 23140 1425 135000
20 PROVIDENC VA 23140 5986 849000
21 MIDDLETOW VA 22645 2259 480123
22 MIDDLETOW VA 22645 3084 511000
23 MIDDLETOW VA 22645 1200 189900
24 MIDDLETOW VA 22645 1031 145000
25 MIDDLETOW VA 22645 960 42500
26 MIDDLETOW VA 22645 2480 120000
27 MIDDLETOW VA 22645 1217 350000
28 MIDDLETOW VA 22645 2251 582000
29 MIDDLETOW VA 22645 1895 90000
30 MIDDLETOW VA 22645 1932 265500
31 MIDDLETOW VA 22645 2128 483000
32 MIDDLETOW VA 22645 1337 209950
33 MIDDLETOW VA 22645 2618 229500
34 MIDDLETOW VA 22645 1090 155000
35 MIDDLETOW VA 22645 1739 135000
36 MIDDLETOW VA 22645 3588 468553
37 MIDDLETOW VA 22645 3096 484107
38 MIDDLETOW VA 22645 1889 301000
39 MIDDLETOW VA 22645 3826 310000
40 MIDDLETOW VA 22645 2578 424000
41 HENRICO VA 23238 1980 240000
42 HENRICO VA 23238 5976 1075000
43 HENRICO VA 23238 2107 315000
44 HENRICO VA 23238 1562 234950
45 HENRICO VA 23238 1616 254500
46 HENRICO VA 23238 3001 475000
47 HENRICO VA 23238 1760 262500
48 HENRICO VA 23238 1980 329000
49 HENRICO VA 23238 1572 162500
50 HENRICO VA 23238 1370 174950
51 HENRICO VA 23238 2943 280000
52 HENRICO VA 23238 1320 177000
53 HENRICO VA 23238 2207 253000
54 HENRICO VA 23238 1809 329000
55 HENRICO VA 23238 1080 155000
56 HENRICO VA 23238 2176 275000
57 HENRICO VA 23238 1360 153500
58 HENRICO VA 23238 1397 199950
59 RICHMOND VA 23238 4638 815000
60 HENRICO VA 23238 1360 234500
61 RICHMOND VA 23236 3413 365000
62 RICHMOND VA 23236 1396 104100
63 RICHMOND VA 23236 2450 261250
64 RICHMOND VA 23236 1902 43670
65 RICHMOND VA 23236 2230 200000
66 RICHMOND VA 23236 1766 192400
67 RICHMOND VA 23236 3317 410000
68 RICHMOND VA 23236 1304 126500
69 RICHMOND VA 23236 2115 220000
70 RICHMOND VA 23236 1448 77500
71 RICHMOND VA 23236 1254 155000
72 RICHMOND VA 23236 1548 171350
73 RICHMOND VA 23236 4082 489500
74 RICHMOND VA 23236 1764 199950
75 RICHMOND VA 23236 1446 115000
76 RICHMOND VA 23236 1528 185000
77 RICHMOND VA 23236 2170 279000
78 RICHMOND VA 23236 3556 275000
79 RICHMOND VA 23236 2860 328000
80 RICHMOND VA 23236 1340 182000
81 PEYTON CO 80831 2459 52000
82 PEYTON CO 80831 1491 178900
83 PEYTON CO 80831 2878 400000
84 PEYTON CO 80831 2552 209000
85 PEYTON CO 80831 2170 259900
86 PEYTON CO 80831 1680 217895
87 PEYTON CO 80831 2277 283000
88 PEYTON CO 80831 3134 350000
89 PEYTON CO 80831 2476 216000
90 PEYTON CO 80831 3334 245000
91 PEYTON CO 80831 3080 255000
92 PEYTON CO 80831 3382 70938
93 PEYTON CO 80831 3016 244800
94 PEYTON CO 80831 2448 278750
95 PEYTON CO 80831 2523 276000
96 PEYTON CO 80831 2958 258850
97 PEYTON CO 80831 2280 223008
98 PEYTON CO 80831 3055 295000
99 PEYTON CO 80831 2128 140000
100 PEYTON CO 80831 2865 323000
101 TABERNASH CO 80478 1080 214500
102 TABERNASH CO 80478 3296 555000
103 TABERNASH CO 80478 1272 287000
104 TABERNASH CO 80478 1575 349900
105 TABERNASH CO 80478 3290 890000
106 TABERNASH CO 80478 2410 685000
107 TABERNASH CO 80478 1956 224000
108 TABERNASH CO 80478 5126 1335000
109 TABERNASH CO 80478 1054 164000
110 TABERNASH CO 80478 1387 299900
111 TABERNASH CO 80478 3000 355000
112 TABERNASH CO 80478 3469 601000
113 TABERNASH CO 80478 2968 619000
114 TABERNASH CO 80478 2466 463600
115 TABERNASH CO 80478 1600 265000
116 TABERNASH CO 80478 1742 259000
117 TABERNASH CO 80478 3914 132500
118 TABERNASH CO 80478 2768 740000
119 TABERNASH CO 80478 2194 395000
120 TABERNASH CO 80478 2473 120000
121 CENTENNIA CO 80122 4843 601000
122 CENTENNIA CO 80122 1850 200000
123 CENTENNIA CO 80122 2018 198000
124 CENTENNIA CO 80122 1714 248000
125 CENTENNIA CO 80122 2737 208000
126 CENTENNIA CO 80122 2240 252500
127 CENTENNIA CO 80122 2535 289000
128 CENTENNIA CO 80122 1886 275000
129 LITTLETON CO 80122 2621 310000
130 CENTENNIA CO 80122 3160 451000
131 CENTENNIA CO 80122 2939 271000
132 CENTENNIA CO 80122 2236 270000
133 CENTENNIA CO 80122 2068 262500
134 CENTENNIA CO 80122 1968 249900
135 CENTENNIA CO 80122 2596 359925
136 CENTENNIA CO 80122 1680 200000
137 LITTLETON CO 80122 1465 172000
138 CENTENNIA CO 80122 1936 257450
139 CENTENNIA CO 80122 3385 682000
140 CENTENNIA CO 80122 4776 486250
141 SNOWMASS CO 81615 2342 2500000
142 SNOWMASS CO 81615 3423 613299
143 SNOWMASS CO 81615 2476 955000
144 SNOWMASS CO 81615 5489 6550000
145 SNOWMASS CO 81615 5598 7650000
146 SNOWMASS CO 81615 1938 3341250
147 SNOWMASS CO 81615 1328 336262
148 SNOWMASS CO 81615 1217 1085100
149 SNOWMASS CO 81615 1520 1700000
150 SNOWMASS CO 81615 894 450000
151 SNOWMASS CO 81615 3923 2100000
152 SNOWMASS CO 81615 1206 620000
153 SNOWMASS CO 81615 4811 2575000
154 SNOWMASS CO 81615 1563 1883300
155 SNOWMASS CO 81615 1278 765000
156 SNOWMASS CO 81615 2316 1760000
157 SNOWMASS CO 81615 1935 300000
158 SNOWMASS CO 81615 1973 495000
159 SNOWMASS CO 81615 1218 769200
160 SNOWMASS CO 81615 676 908400

Merging Data Sets

The DATA STEP can also be used to merge data sets using the following format:

DATA newdata;
    MERGE data1 data2;
    BY ID_VAR;
RUN;
In [14]:
DATA middlenames;
    INPUT Names :$15. MiddleName :$15. ;
        DATALINES;
    Eleanor Larson
    Georgiana Otelia
        ;
RUN;
Out[14]:

143  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
144
145 DATA middlenames;
146 INPUT Names :$15. MiddleName :$15. ;
147 DATALINES;
NOTE: The data set WORK.MIDDLENAMES has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

150 ;
151 RUN;
152 ods html5 close;ods listing;

153
In [15]:
DATA COMBINED;
    MERGE FAVORITES MIDDLENAMES;
    BY NAMES;
RUN;
Out[15]:

155  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
156
157 DATA COMBINED;
158 MERGE FAVORITES MIDDLENAMES;
159 BY NAMES;
160 RUN;
NOTE: There were 3 observations read from the data set WORK.FAVORITES.
NOTE: There were 2 observations read from the data set WORK.MIDDLENAMES.
NOTE: The data set WORK.COMBINED has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

161 ods html5 close;ods listing;

162
QUESTION:

What does the following code return?

PROC PRINT DATA = COMBINED;
    TITLE 'Hoegh Family';
RUN;
In [16]:
PROC PRINT DATA = COMBINED;
    TITLE 'Hoegh Family';
RUN;
Out[16]:
SAS Output

Hoegh Family

Obs Names Food Activity MiddleName
1 Andy Milkshakes Golf  
2 Eleanor Cauliflower Skiing Larson
3 Georgiana Anything Gymnastics Otelia

Exercise 2: Merging and Splitting Data

Using the shark attacks data set (attacks):

  1. create two new data sets one for attacks in the USA (Country = USA) and another for attacks in Australia.
  2. Create a new data set that combines these two data sets by stacking them.
  3. Use Proc Freq to get a frequency table for activity type for attacks in USA and Australia.

Writing to Multiple Data Sets Using Output

A single DATA step can be used to create multiple data sets.

In [17]:
DATA virginia colorado others;
    SET mydata.housing;
    IF state = 'CO' THEN OUTPUT colorado;
        ELSE IF state = 'VA' THEN OUTPUT virginia;
        ELSE OUTPUT others;
RUN;
Out[17]:

172  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
173
174 DATA virginia colorado others;
175 SET mydata.housing;
176 IF state = 'CO' THEN OUTPUT colorado;
177 ELSE IF state = 'VA' THEN OUTPUT virginia;
178 ELSE OUTPUT others;
179 RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.VIRGINIA has 80 observations and 5 variables.
NOTE: The data set WORK.COLORADO has 80 observations and 5 variables.
NOTE: The data set WORK.OTHERS has 1840 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

180 ods html5 close;ods listing;

181

Using DO statements for creating DATA sets

Often in R we used the function seq() particularly when thinking about creating graphics. A similar approach in R takes advantage of the DO command.

In [18]:
DATA quadratic;
    DO x=1 TO 10;
        y= x ** 2;
        OUTPUT;
    END;
RUN;
Out[18]:

183  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
184
185 DATA quadratic;
186 DO x=1 TO 10;
187 y= x ** 2;
188 OUTPUT;
189 END;
190 RUN;
NOTE: The data set WORK.QUADRATIC has 10 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

191 ods html5 close;ods listing;

192
In [19]:
PROC PRINT DATA=quadratic;
TITLE;
RUN;
Out[19]:
SAS Output
Obs x y
1 1 1
2 2 4
3 3 9
4 4 16
5 5 25
6 6 36
7 7 49
8 8 64
9 9 81
10 10 100

SAS Data Set Options

Data set options can be used in both DATA and PROC statements. To use a data set option, put it between parentheses directly following the data set. The syntax follows as:

DATA STEP
DATA newdata;
    SET olddata (options here);
RUN;
PROC STEP
PROC PRINT DATA = dataset (options here);
RUN;

Data Set Options

  • KEEP = variable-list (specifies which variables to keep)
  • DROP = variable-list (specifies which variables to drop)
  • RENAME = (oldvar = newvar) (specifies which variables to rename)
  • FIRSTOBS = n (tells SAS to start reading at observation n)
  • OBS = n (tells SAS to stop reading at observation n)
  • WHERE = condition (selects observations that meet a specified criteria)
In [20]:
DATA HoeghFamily;
    SET COMBINED (KEEP = Food Activity MiddleName);
RUN;
Out[20]:

202  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
203
204 DATA HoeghFamily;
205 SET COMBINED (KEEP = Food Activity MiddleName);
206 RUN;
NOTE: There were 3 observations read from the data set WORK.COMBINED.
NOTE: The data set WORK.HOEGHFAMILY has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

207 ods html5 close;ods listing;

208

Consider the following code, what will this print?

PROC PRINT DATA= HoeghFamily (Firstobs = 2);
RUN;
In [21]:
PROC PRINT DATA= HoeghFamily (Firstobs = 2);
RUN;
Out[21]:
SAS Output
Obs Food Activity MiddleName
2 Cauliflower Skiing Larson
3 Anything Gymnastics Otelia

SAS Automatic Variables

SAS has built in variables that can be quite useful.

  • N (Indicates the number of times SAS has looped through the data set. This is generally the observation number, but can be different when using conditional statements.)
  • FIRST.variable (where variable is the name of the variable you are referencing. This requires using a by statement and will produce a 1 for the first variable on 0 otherwise.
  • LAST.variable (same idea as first.variable, but applies to last instance of a variable)
In [22]:
PROC SORT DATA=mydata.housing OUT=housing;
    BY State;
RUN;
Out[22]:

217  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
218
219 PROC SORT DATA=mydata.housing OUT=housing;
220 BY State;
221 RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.HOUSING has 2000 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

222 ods html5 close;ods listing;

223
In [23]:
DATA HousingSample;
    SET housing;
    BY STATE;
    IF FIRST.STATE =1;
RUN;
Out[23]:

225  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
226
227 DATA HousingSample;
228 SET housing;
229 BY STATE;
230 IF FIRST.STATE =1;
231 RUN;
NOTE: There were 2000 observations read from the data set WORK.HOUSING.
NOTE: The data set WORK.HOUSINGSAMPLE has 32 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

232 ods html5 close;ods listing;

233
In [24]:
PROC PRINT DATA=HousingSample;
RUN;
Out[24]:
SAS Output
Obs City State Zip_Code Living_Sq_Ft Closing_Price
1 BODEGA BA CA 94923 1237 680000
2 PEYTON CO 80831 2459 52000
3 TOLLAND CT 60842 2083 298900
4 WASHINGTO DC 20012 2345 626000
5 LAKELAND FL 33812 1713 105000
6 MCDONOUGH GA 30252 3066 369400
7 HONOLULU HI 96826 1200 83900
8 FAIRFIELD IA 52556 1740 111800
9 CALDWELL ID 83607 1666 172225
10 JOLIET IL 60433 1400 91000
11 FORT WAYN IN 46805 1289 86773
12 TOPEKA KS 66608 1431 104500
13 BARDSTOWN KY 40004 2494 23700
14 LOWELL MA 18501 3229 400000
15 YORK ME 39095 2304 310230
16 FLINT MI 48504 1358 50600
17 LAKE CITY MN 55041 1463 299900
18 SAINT LOU MO 63125 1048 165000
19 LEWISTOWN MT 59457 2464 150920
20 CHERRYVIL NC 28021 3362 47000
21 LINCOLN NE 68524 1458 139000
22 DRYDEN NY 13053 1104 114000
23 MAUMEE OH 43537 3301 314500
24 CLAREMORE OK 74019 1400 106500
25 CAVE JUNC OR 97523 1440 51000
26 PITTSBURG PA 15234 1450 119000
27 CRANSTON RI 29051 1259 259000
28 ALLEN TX 75002 1780 148500
29 HUNTSVILL UT 84317 2910 520695
30 PROVIDENC VA 23140 1621 224900
31 VANCOUVER WA 98682 1508 175000
32 ROCK SPRI WY 82901 1820 191893

SAS Data Exercise #2

In [25]:
Data USA;
	set mydata.attacks;
	if Country = 'USA';
RUN;

Data Australia;
	set mydata.attacks;
	if Country = 'AUSTRALIA';
RUN;

DATA Combined;
	set USA Australia;
RUN;

Proc Freq Data = Combined;
	TABLES Activity;
RUN;
Out[25]:
SAS Output

The FREQ Procedure

Activity Frequency Percent Cumulative
Frequency
Cumulative
Percent
Bathing 82 4.46 82 4.46
Diving 56 3.04 138 7.50
Fishing 220 11.96 358 19.46
Snorkeling 44 2.39 402 21.85
Spearfishing 118 6.41 520 28.26
Standing 66 3.59 586 31.85
Surfing 704 38.26 1290 70.11
Swimming 438 23.80 1728 93.91
Wading 112 6.09 1840 100.00

STATISTICAL PROCEDURES

Next we will focus on statistical procedure available in SAS.

  • PROC UNIVARIATE
  • PROC MEANS (will not cover in detail again)
  • PROC TTEST
  • PROC REG
  • PROC ANOVA

PROC UNIVARIATE

PROC UNIVARIATE produces statistics and graphs describing the distribution of a single variable.

The syntax follows as:

PROC UNIVARIATE DATA = yourdata;
    VAR variblename;
RUN;
In [26]:
PROC UNIVARIATE DATA=housing;
    VAR LIVING_SQ_FT;
RUN;
Out[26]:
SAS Output

The UNIVARIATE Procedure

Variable: Living_Sq_Ft

Moments
N 2000 Sum Weights 2000
Mean 1840.32 Sum Observations 3680640
Std Deviation 853.235014 Variance 728009.99
Skewness 1.77833485 Kurtosis 4.87450208
Uncorrected SS 8228847374 Corrected SS 1455291969
Coeff Variation 46.363405 Std Error Mean 19.0789149
Basic Statistical Measures
Location Variability
Mean 1840.320 Std Deviation 853.23501
Median 1625.000 Variance 728010
Mode 1008.000 Range 6475
    Interquartile Range 944.50000
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 96.45832 Pr > |t| <.0001
Sign M 1000 Pr >= |M| <.0001
Signed Rank S 1000500 Pr >= |S| <.0001
Quantiles (Definition 5)
Level Quantile
100% Max 6963.0
99% 4838.5
95% 3423.5
90% 2938.0
75% Q3 2194.5
50% Median 1625.0
25% Q1 1250.0
10% 1019.0
5% 904.5
1% 704.0
0% Min 488.0
Extreme Observations
Lowest Highest
Value Obs Value Obs
488 631 6046 977
540 639 6506 1766
576 122 6640 1647
594 158 6695 1812
596 637 6963 610

Graphics from PROC UNIVARIATE

We will go in depth on SAS graphics next week, but many of the statistical procedure have built in options for creating graphics.

Syntax follows as:

PROC UNIVARIATE DATA=yourdata;
    VAR yourvariable;
    plot-request yourvariable/ options;
RUN;
Plot Requests
  • CDFPLOT (cumulative distribution function plot)
  • HISTOGRAM
  • QQPLOT (quantile-quantile plot)
In [27]:
PROC UNIVARIATE DATA=housing noprint;
    VAR LIVING_SQ_FT;
    HISTOGRAM LIVING_SQ_FT / Normal; * Normal creates a normal curve;
RUN;
Out[27]:
SAS Output

The UNIVARIATE Procedure

Histogram for Living_Sq_Ft

The UNIVARIATE Procedure

Fitted Normal Distribution for Living_Sq_Ft

Parameters for Normal Distribution
Parameter Symbol Estimate
Mean Mu 1840.32
Std Dev Sigma 853.235
Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.1094403 Pr > D <0.010
Cramer-von Mises W-Sq 10.0771813 Pr > W-Sq <0.005
Anderson-Darling A-Sq 59.7937631 Pr > A-Sq <0.005
Quantiles for Normal Distribution
Percent Quantile
Observed Estimated
1.0 704.000 -144.601
5.0 904.500 436.873
10.0 1019.000 746.855
25.0 1250.000 1264.822
50.0 1625.000 1840.320
75.0 2194.500 2415.818
90.0 2938.000 2933.785
95.0 3423.500 3243.767
99.0 4838.500 3825.241
In [28]:
PROC UNIVARIATE DATA=housing noprint;
    VAR LIVING_SQ_FT;
    HISTOGRAM LIVING_SQ_FT / Lognormal; * Normal creates a normal curve;
RUN;
    
Out[28]:
SAS Output

The UNIVARIATE Procedure

Histogram for Living_Sq_Ft

The UNIVARIATE Procedure

Fitted Lognormal Distribution for Living_Sq_Ft

Parameters for Lognormal Distribution
Parameter Symbol Estimate
Threshold Theta 0
Scale Zeta 7.428118
Shape Sigma 0.413559
Mean   1832.862
Std Dev   791.5927
Goodness-of-Fit Tests for Lognormal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.03801564 Pr > D <0.010
Cramer-von Mises W-Sq 0.71048795 Pr > W-Sq <0.005
Anderson-Darling A-Sq 4.20760256 Pr > A-Sq <0.005
Quantiles for Lognormal Distribution
Percent Quantile
Observed Estimated
1.0 704.000 642.930
5.0 904.500 852.245
10.0 1019.000 990.412
25.0 1250.000 1273.057
50.0 1625.000 1682.638
75.0 2194.500 2223.994
90.0 2938.000 2858.682
95.0 3423.500 3322.135
99.0 4838.500 4403.704

Additional Options for PROC MEANS

In addition the options outlined earlier, here are a few more options built into PROC MEANS.

  • CLM (two-sided confidence limits)
  • P1 (1 percent quantile)
  • p10 (10 percent quantile)
  • CV (coefficient of variation)
  • LCLM (lower confidence limit)
  • UCLM (upper confidence limit)

PROC TTEST

Q: Describe a t-test and give an example of when it might be used (ideally in the context of one of the data sets we have seen in this course).

One Sample Comparison

This is useful to test whether the mean is significantly different from a hypothesis $H_0$ value of n.

PROC TTEST DATA=yourdata H0 = n options;
    VAR variable;
RUN;

Options include:

  • ALPHA = n (alpha level, default is 0.05)
  • H0 = n (requests a test of $H_0=n$, default value is 0)
  • SIDES = type (options are 2, the default, and L for lower one-sided, and U for upper one-sided)
In [29]:
PROC TTEST DATA=HOUSING H0 = 2500;
    VAR LIVING_SQ_FT;
RUN;
Out[29]:
SAS Output

The TTEST Procedure

Variable: Living_Sq_Ft

N Mean Std Dev Std Err Minimum Maximum
2000 1840.3 853.2 19.0789 488.0 6963.0
Mean 95% CL Mean Std Dev 95% CL Std Dev
1840.3 1802.9 1877.7 853.2 827.6 880.5
DF t Value Pr > |t|
1999 -34.58 <.0001
Summary Panel for Living_Sq_Ft
Q-Q Plot for Living_Sq_Ft

Two independent sample comparisons

To compare differences between two groups use this procedure:

PROC TTEST DATA=yourdata options;
    CLASS variable;
    VAR variable;
RUN;

Class is the categorical variable that you like to test differences.

In [30]:
PROC TTEST DATA=VA_CO;
    CLASS State;
    VAR Closing_Price;
RUN;
Out[30]:
SAS Output

The TTEST Procedure

Variable: Closing_Price

State N Mean Std Dev Std Err Minimum Maximum
CO 80 716647 1196747 133800 52000.0 7650000
VA 80 281179 182367 20389.3 40000.0 1075000
Diff (1-2)   435468 855997 135345    
State Method Mean 95% CL Mean Std Dev 95% CL Std Dev
CO   716647 450324 982970 1196747 1035728 1417514
VA   281179 240595 321763 182367 157830 216009
Diff (1-2) Pooled 435468 168149 702787 855997 771122 962032
Diff (1-2) Satterthwaite 435468 166256 704680      
Method Variances DF t Value Pr > |t|
Pooled Equal 158 3.22 0.0016
Satterthwaite Unequal 82.667 3.22 0.0018
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 79 79 43.06 <.0001
Summary Panel for Closing_Price
Q-Q Plots for Closing_Price

Paired t-test

Another type of t-test involves paired data.

PROC TTEST DATA=yourdata options;
    PAIRED variable1 * variable2;
RUN;

PROC TTEST GRAPHICS

The ttest procedure automatically creates a few graphics, but that can be controlled using the following syntax.

PROC TTEST DATA=yourdata PLOTS = (plot-request)

The plot requests are:

  • ALL (produce all plots)
  • BOXPLOT (creates box plots)
  • HISTOGRAM
  • INTERVALPLOT (creates plots of confidence interval of means)
  • NONE (suppresses all plots)
  • QQPLOT
  • SUMMARYPLOT (creates one plot that contains histrograms and box plots)

Exercise 3: Proc Univariate

  1. Use Proc Univariate to summarize the registered and casual variables in the bikes dataset.
  2. Include a histogram with a normal density superimposed.
  3. Does the normal density match the data? Why or why not?
In [31]:
Proc Univariate Data = mydata.bikes;
	Var Registered Casual;
	HISTOGRAM Registered Casual / Normal; * Normal creates a normal curve;
run;
Out[31]:
SAS Output

The UNIVARIATE Procedure

Variable: registered

Moments
N 10886 Sum Weights 10886
Mean 155.552177 Sum Observations 1693341
Std Deviation 151.039033 Variance 22812.7895
Skewness 1.52480459 Kurtosis 2.626081
Uncorrected SS 511720093 Corrected SS 248317214
Coeff Variation 97.0986301 Std Error Mean 1.44762152
Basic Statistical Measures
Location Variability
Mean 155.5522 Std Deviation 151.03903
Median 118.0000 Variance 22813
Mode 3.0000 Range 886.00000
    Interquartile Range 186.00000
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 107.4536 Pr > |t| <.0001
Sign M 5435.5 Pr >= |M| <.0001
Signed Rank S 29547378 Pr >= |S| <.0001
Quantiles (Definition 5)
Level Quantile
100% Max 886
99% 697
95% 464
90% 354
75% Q3 222
50% Median 118
25% Q1 36
10% 7
5% 4
1% 1
0% Min 0
Extreme Observations
Lowest Highest
Value Obs Value Obs
0 4012 833 9585
0 3890 839 9897
0 3867 857 9298
0 3725 857 9753
0 3318 886 9346

The UNIVARIATE Procedure

Histogram for registered

The UNIVARIATE Procedure

Fitted Normal Distribution for registered

Parameters for Normal Distribution
Parameter Symbol Estimate
Mean Mu 155.5522
Std Dev Sigma 151.039
Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.151715 Pr > D <0.010
Cramer-von Mises W-Sq 59.670109 Pr > W-Sq <0.005
Anderson-Darling A-Sq 383.449398 Pr > A-Sq <0.005
Quantiles for Normal Distribution
Percent Quantile
Observed Estimated
1.0 1.00000 -195.8172
5.0 4.00000 -92.8849
10.0 7.00000 -38.0121
25.0 36.00000 53.6779
50.0 118.00000 155.5522
75.0 222.00000 257.4265
90.0 354.00000 349.1165
95.0 464.00000 403.9893
99.0 697.00000 506.9215

The UNIVARIATE Procedure

Variable: casual

Moments
N 10886 Sum Weights 10886
Mean 36.0219548 Sum Observations 392135
Std Deviation 49.9604766 Variance 2496.04922
Skewness 2.4957484 Kurtosis 7.55162931
Uncorrected SS 41294965 Corrected SS 27169495.8
Coeff Variation 138.694518 Std Error Mean 0.47884219
Basic Statistical Measures
Location Variability
Mean 36.02195 Std Deviation 49.96048
Median 17.00000 Variance 2496
Mode 0.00000 Range 367.00000
    Interquartile Range 45.00000
Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 75.2272 Pr > |t| <.0001
Sign M 4950 Pr >= |M| <.0001
Signed Rank S 24504975 Pr >= |S| <.0001
Quantiles (Definition 5)
Level Quantile
100% Max 367
99% 241
95% 141
90% 94
75% Q3 49
50% Median 17
25% Q1 4
10% 1
5% 0
1% 0
0% Min 0
Extreme Observations
Lowest Highest
Value Obs Value Obs
0 10866 356 7687
0 10844 357 6729
0 10842 361 7686
0 10840 362 9652
0 10839 367 6730

The UNIVARIATE Procedure

Histogram for casual

The UNIVARIATE Procedure

Fitted Normal Distribution for casual

Parameters for Normal Distribution
Parameter Symbol Estimate
Mean Mu 36.02195
Std Dev Sigma 49.96048
Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.235452 Pr > D <0.010
Cramer-von Mises W-Sq 171.574837 Pr > W-Sq <0.005
Anderson-Darling A-Sq 948.029130 Pr > A-Sq <0.005
Quantiles for Normal Distribution
Percent Quantile
Observed Estimated
1.0 0.000 -80.20349
5.0 0.000 -46.15572
10.0 1.000 -28.00497
25.0 4.000 2.32413
50.0 17.000 36.02195
75.0 49.000 69.71978
90.0 94.000 100.04888
95.0 141.000 118.19963
99.0 241.000 152.24740

PROC REG

Similar to lm() in R, we can also easily run regression in SAS.

PROC REG DATA=yourdata;
    MODEL dependent = independent;
RUN;

Note including categorical variables can take a bit of extra work to create dummy variables (or use PROC GLM).

In [32]:
PROC REG DATA=housing;
    MODEL CLOSING_PRICE = LIVING_SQ_FT;
RUN;
Out[32]:
SAS Output

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price

Number of Observations Read 2000
Number of Observations Used 2000
Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 3.287656E13 3.287656E13 384.41 <.0001
Error 1998 1.708798E14 85525443800    
Corrected Total 1999 2.037564E14      
Root MSE 292447 R-Square 0.1614
Dependent Mean 244147 Adj R-Sq 0.1609
Coeff Var 119.78344    
Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 -32459 15550 -2.09 0.0370
Living_Sq_Ft 1 150.30316 7.66607 19.61 <.0001

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price

Panel of fit diagnostics for Closing_Price.
Scatter plot of residuals by Living_Sq_Ft for Closing_Price.
Scatterplot of Closing_Price by Living_Sq_Ft overlaid with the fit line, a 95% confidence band and lower and upper 95% prediction limits.

PROC REG GRAPHICS

Graphics can easily be created from PROC REG using the following syntax:

PROC REG DATA=yourdata PLOTS(options) = (plot request list);
    MODEL dependent = independent;
RUN;

For the options trailing PLOTS use (ONLY) if you wish to omit the default plots.

Plot requests

  • FITPLOT (scatter plot with regression line and confidence bands)
  • RESIDUALS ( residuals plotted against independent variables)
  • DIAGNOSTICS (diagnostics panel)
  • COOKSD (Cook's D statistic)
  • OBSERVEDBYPREDICTED (dependent variable by predicted)
  • QQPLOT
  • RESIDUALBYPREDICTED
  • RESIDUALHISTOGRAM
  • RFPLOT (residual fit plot)
In [33]:
PROC REG DATA=housing PLOTS(ONLY) = DIAGNOSTICS;
    MODEL CLOSING_PRICE = LIVING_SQ_FT;
RUN;
Out[33]:
SAS Output

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price

Number of Observations Read 2000
Number of Observations Used 2000
Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 1 3.287656E13 3.287656E13 384.41 <.0001
Error 1998 1.708798E14 85525443800    
Corrected Total 1999 2.037564E14      
Root MSE 292447 R-Square 0.1614
Dependent Mean 244147 Adj R-Sq 0.1609
Coeff Var 119.78344    
Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 -32459 15550 -2.09 0.0370
Living_Sq_Ft 1 150.30316 7.66607 19.61 <.0001

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price

Panel of fit diagnostics for Closing_Price.

PROC ANOVA

This procedure allows you to run an analysis of variance procedure with the following syntax:

PROC ANOVA DATA =yourdata;
    CLASS classvariable;
    MODEL dependent = effects;
RUN;
In [34]:
DATA ANDYLIVED;
    SET HOUSING;
    IF STATE in ('IA', 'MT','CO','MN','CA','VA');
RUN;
Out[34]:
SAS Output
In [35]:
PROC ANOVA DATA=ANDYLIVED;
    CLASS STATE;
    MODEL CLOSING_PRICE = STATE;
RUN;
Out[35]:
SAS Output

The ANOVA Procedure

Class Level Information
Class Levels Values
State 6 CA CO IA MN MT VA
Number of Observations Read 440
Number of Observations Used 440

The ANOVA Procedure

Dependent Variable: Closing_Price

Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 1.43091E13 2.8618199E12 9.81 <.0001
Error 434 1.2654988E14 291589585117    
Corrected Total 439 1.4085898E14      
R-Square Coeff Var Root MSE Closing_Price Mean
0.101585 143.7899 539990.4 375541.3
Source DF Anova SS Mean Square F Value Pr > F
State 5 1.43091E13 2.8618199E12 9.81 <.0001
Distribution of Closing_Price by State

Exercise 4: Proc Reg / Anova

  1. Using the bikes dataset fit a regression model to predict the count of bikes.
  2. What is the best adjusted - R-squared value you can acheive? (without using casual and registered).
  3. Does the model treat season correctly?
In [36]:
Proc REG Data = mydata.bikes;
	MODEL Count = season holiday temp workingday weather;
RUN;
Out[36]:
SAS Output

The REG Procedure

Model: MODEL1

Dependent Variable: count

Number of Observations Read 10886
Number of Observations Used 10886
Analysis of Variance
Source DF Sum of
Squares
Mean
Square
F Value Pr > F
Model 5 61261467 12252293 450.49 <.0001
Error 10880 295911447 27198    
Corrected Total 10885 357172914      
Root MSE 164.91738 R-Square 0.1715
Dependent Mean 191.57413 Adj R-Sq 0.1711
Coeff Var 86.08541    
Parameter Estimates
Variable DF Parameter
Estimate
Standard
Error
t Value Pr > |t|
Intercept 1 32.97402 6.68752 4.93 <.0001
season 1 11.16378 1.46727 7.61 <.0001
holiday 1 -8.24676 9.80449 -0.84 0.4003
temp 1 8.61542 0.21053 40.92 <.0001
workingday 1 1.09982 3.50658 0.31 0.7538
weather 1 -31.15689 2.49998 -12.46 <.0001

The REG Procedure

Model: MODEL1

Dependent Variable: count

Panel of heat maps of residuals by regressors for count.