Lecture: More on SAS Procedures¶

General Comments¶

There is an application called Jupyter notebooks (formerly iPython), that has some functionality similar to R Markdown and can execute SAS code. However, it is still very glitchy for SAS and not worth the hassle in this class.
Manually Entering Data can also be challenging in SAS, next is a similar example to what you worked on in lab.

In [1]:

* Manual Input of Data;
DATA Favorites;
    INPUT Names $ Food $ Activity $;
        DATALINES;
    Andy Milkshakes Golf
    Eleanor Cauliflower Skiing
    Georgiana Anything Gymnastics
        ;
RUN;

PROC PRINT DATA=Favorites;
RUN;

Out[1]:

SAS Output

The SAS System


Obs	Names	Food	Activity
1	Andy	Milkshak	Golf
2	Eleanor	Cauliflo	Skiing
3	Georgian	Anything	Gymnasti

In [2]:

* Manual Input of Data;
DATA Favorites;
    INPUT Names :$15. Food :$15. Activity :$15.;
        DATALINES;
    Andy Milkshakes Golf
    Eleanor Cauliflower Skiing
    Georgiana Anything Gymnastics
        ;
RUN;

PROC PRINT DATA=Favorites;
RUN;

Out[2]:

SAS Output

The SAS System


Obs	Names	Food	Activity
1	Andy	Milkshakes	Golf
2	Eleanor	Cauliflower	Skiing
3	Georgiana	Anything	Gymnastics

Using PROC MEANS¶

In [3]:

* SET LIBNAME (Note this is slightly different in SAS OnDemand, use the link we have seen before);
libname mydata '/folders/myfolders/';

Out[3]:

45   ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
46   
47   * SET LIBNAME (Note this is slightly different in SAS OnDemand, use the link we have seen before);
48   libname mydata '/folders/myfolders/';
NOTE: Libref MYDATA was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: /folders/myfolders
49   
50   ods html5 close;ods listing;

51

In [4]:

PROC MEANS DATA = mydata.housing;
RUN;

Out[4]:

SAS Output

The SAS System

The MEANS Procedure


Variable	N	Mean	Std Dev	Minimum	Maximum
Living_Sq_Ft Closing_Price	2000 2000	1840.32 244146.72	853.2350143 319263.47	488.0000000 20000.00	6963.00 7650000.00

PROC MEANS OPTIONS¶

MAXDEC = n (number of decimal places to display)
MAX
MIN
MEAN
MEDIAN
MODE
N (number of non-missing values)
NMISS (number of missing values)
RANGE
STDDEV
SUM

In [5]:

PROC MEANS DATA = mydata.housing MAXDEC=0 MEAN RANGE STDDEV N MODE;
RUN;

Out[5]:

SAS Output

The SAS System

The MEANS Procedure


Variable	Mean	Range	Std Dev	N	Mode
Living_Sq_Ft Closing_Price	1840 244147	6475 7630000	853 319263	2000 2000	1008 125000

Specifying Variables in Proc Means¶

The default setting is to compute means for all numeric values in the data set.
To specify which variable to include use a VAR statement within PROC MEANS.

In [6]:

PROC MEANS DATA = mydata.housing MAXDEC=0 MEAN RANGE STDDEV N MODE;
    VAR Living_Sq_Ft;
RUN;

Out[6]:

SAS Output

The SAS System

The MEANS Procedure


Analysis Variable : Living_Sq_Ft
Mean	Range	Std Dev	N	Mode
1840	6475	853	2000	1008

Writing Proc Means Output to New Data Set¶

In [7]:

PROC MEANS DATA = mydata.housing MAXDEC=0 MEAN RANGE STDDEV N MODE;
    OUTPUT OUT = Summaries;
RUN;

Out[7]:

SAS Output

The SAS System

The MEANS Procedure


Variable	Mean	Range	Std Dev	N	Mode
Living_Sq_Ft Closing_Price	1840 244147	6475 7630000	853 319263	2000 2000	1008 125000

In [8]:

PROC PRINT DATA= Summaries;
    TITLE 'DATA FILE FROM PROC MEANS';
RUN;

Out[8]:

SAS Output

DATA FILE FROM PROC MEANS


Obs	_FREQ_	_STAT_	Living_Sq_Ft	Closing_Price
1	2000	N	2000	2000
2	2000	MIN	488	20000
3	2000	MAX	6963	7650000
4	2000	MEAN	1840.32	244146.7215
5	2000	STD	853.23501428	319263.46852

Proc Means Exercise 1¶

Use the bikes data set (this can be accessed using the course specific libname statement).
Include only days that are specified as holidays.
Summarize the Count variable using Proc Means to compute: the mean, median, mode, and range of the variable Count.

In [9]:

Data Bikes_Holiday;
	set mydata.bikes;
	if Holiday = 1;
RUN;

Proc Means DATA = Bikes_Holiday MEAN MEDIAN MODE RANGE;
	VAR Count;
RUN;

Out[9]:

SAS Output

DATA FILE FROM PROC MEANS

The MEANS Procedure


Analysis Variable : count
Mean	Median	Mode	Range
185.8778135	133.0000000	4.0000000	711.0000000

PROC FREQ¶

PROC FREQ is a procedure for creating frequency tables. Each row in a frequency table corresponds to a level of a categorical variable.

In [10]:

PROC FREQ DATA = mydata.housing;
    TABLES State;
RUN;

Out[10]:

SAS Output

DATA FILE FROM PROC MEANS

The FREQ Procedure


State	Frequency	Percent	Cumulative Frequency	Cumulative Percent
CA	180	9.00	180	9.00
CO	80	4.00	260	13.00
CT	141	7.05	401	20.05
DC	20	1.00	421	21.05
FL	140	7.00	561	28.05
GA	60	3.00	621	31.05
HI	20	1.00	641	32.05
IA	40	2.00	681	34.05
ID	20	1.00	701	35.05
IL	39	1.95	740	37.00
IN	60	3.00	800	40.00
KS	20	1.00	820	41.00
KY	20	1.00	840	42.00
MA	123	6.15	963	48.15
ME	20	1.00	983	49.15
MI	60	3.00	1043	52.15
MN	40	2.00	1083	54.15
MO	20	1.00	1103	55.15
MT	20	1.00	1123	56.15
NC	160	8.00	1283	64.15
NE	40	2.00	1323	66.15
NY	60	3.00	1383	69.15
OH	120	6.00	1503	75.15
OK	20	1.00	1523	76.15
OR	20	1.00	1543	77.15
PA	137	6.85	1680	84.00
RI	20	1.00	1700	85.00
TX	120	6.00	1820	91.00
UT	20	1.00	1840	92.00
VA	80	4.00	1920	96.00
WA	60	3.00	1980	99.00
WY	20	1.00	2000	100.00

More on Data Statements¶

What is the new data set and what is the old data set from this code.

DATA housing;
    SET mydata.housing;
RUN;

Which of these data sets are permanent files.

What do these two data steps do?

DATA virginia;
    Set mydata.housing;
    IF state = 'VA';
RUN;

DATA colorado;
    Set mydata.housing;
    IF state = 'CO';
RUN;

In [11]:

DATA virginia;
    Set mydata.housing;
    IF state = 'VA';
RUN;

DATA colorado;
    Set mydata.housing;
    IF state = 'CO';
RUN;

Out[11]:

113  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
114  
115  DATA virginia;
116      Set mydata.housing;
117      IF state = 'VA';
118  RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.VIRGINIA has 80 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
      
119  
120  DATA colorado;
121      Set mydata.housing;
122      IF state = 'CO';
123  RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.COLORADO has 80 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
124  ods html5 close;ods listing;

125

Stacking Data Sets¶

In R to combine (or vertically stack two data frames) we used the rbind() function.

This can also be done in a standard SAS DATA STEP.

In [12]:

DATA VA_CO;
    SET virginia colorado;
RUN;

Out[12]:

127  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
128  
129  DATA VA_CO;
130      SET virginia colorado;
131  RUN;
NOTE: There were 80 observations read from the data set WORK.VIRGINIA.
NOTE: There were 80 observations read from the data set WORK.COLORADO.
NOTE: The data set WORK.VA_CO has 160 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
132  
133  ods html5 close;ods listing;

134

In [13]:

PROC PRINT DATA=VA_CO;
RUN;

Out[13]:

SAS Output

DATA FILE FROM PROC MEANS


Obs	City	State	Zip_Code	Living_Sq_Ft	Closing_Price
1	PROVIDENC	VA	23140	1621	224900
2	PROVIDENC	VA	23140	1972	195196
3	PROVIDENC	VA	23140	1796	159000
4	PROVIDENC	VA	23140	2718	409950
5	PROVIDENC	VA	23140	1482	57900
6	PROVIDENC	VA	23140	2759	40000
7	PROVIDENC	VA	23140	2522	370000
8	PROVIDENC	VA	23140	2072	179900
9	PROVIDENC	VA	23140	1212	90000
10	PROVIDENC	VA	23140	3604	494369
11	PROVIDENC	VA	23140	3180	125000
12	PROVIDENC	VA	23140	2635	450000
13	PROVIDENC	VA	23140	3087	118900
14	PROVIDENC	VA	23140	2632	352282
15	PROVIDENC	VA	23140	3072	379990
16	PROVIDENC	VA	23140	3882	284700
17	PROVIDENC	VA	23140	2352	520000
18	PROVIDENC	VA	23140	1365	306553
19	PROVIDENC	VA	23140	1425	135000
20	PROVIDENC	VA	23140	5986	849000
21	MIDDLETOW	VA	22645	2259	480123
22	MIDDLETOW	VA	22645	3084	511000
23	MIDDLETOW	VA	22645	1200	189900
24	MIDDLETOW	VA	22645	1031	145000
25	MIDDLETOW	VA	22645	960	42500
26	MIDDLETOW	VA	22645	2480	120000
27	MIDDLETOW	VA	22645	1217	350000
28	MIDDLETOW	VA	22645	2251	582000
29	MIDDLETOW	VA	22645	1895	90000
30	MIDDLETOW	VA	22645	1932	265500
31	MIDDLETOW	VA	22645	2128	483000
32	MIDDLETOW	VA	22645	1337	209950
33	MIDDLETOW	VA	22645	2618	229500
34	MIDDLETOW	VA	22645	1090	155000
35	MIDDLETOW	VA	22645	1739	135000
36	MIDDLETOW	VA	22645	3588	468553
37	MIDDLETOW	VA	22645	3096	484107
38	MIDDLETOW	VA	22645	1889	301000
39	MIDDLETOW	VA	22645	3826	310000
40	MIDDLETOW	VA	22645	2578	424000
41	HENRICO	VA	23238	1980	240000
42	HENRICO	VA	23238	5976	1075000
43	HENRICO	VA	23238	2107	315000
44	HENRICO	VA	23238	1562	234950
45	HENRICO	VA	23238	1616	254500
46	HENRICO	VA	23238	3001	475000
47	HENRICO	VA	23238	1760	262500
48	HENRICO	VA	23238	1980	329000
49	HENRICO	VA	23238	1572	162500
50	HENRICO	VA	23238	1370	174950
51	HENRICO	VA	23238	2943	280000
52	HENRICO	VA	23238	1320	177000
53	HENRICO	VA	23238	2207	253000
54	HENRICO	VA	23238	1809	329000
55	HENRICO	VA	23238	1080	155000
56	HENRICO	VA	23238	2176	275000
57	HENRICO	VA	23238	1360	153500
58	HENRICO	VA	23238	1397	199950
59	RICHMOND	VA	23238	4638	815000
60	HENRICO	VA	23238	1360	234500
61	RICHMOND	VA	23236	3413	365000
62	RICHMOND	VA	23236	1396	104100
63	RICHMOND	VA	23236	2450	261250
64	RICHMOND	VA	23236	1902	43670
65	RICHMOND	VA	23236	2230	200000
66	RICHMOND	VA	23236	1766	192400
67	RICHMOND	VA	23236	3317	410000
68	RICHMOND	VA	23236	1304	126500
69	RICHMOND	VA	23236	2115	220000
70	RICHMOND	VA	23236	1448	77500
71	RICHMOND	VA	23236	1254	155000
72	RICHMOND	VA	23236	1548	171350
73	RICHMOND	VA	23236	4082	489500
74	RICHMOND	VA	23236	1764	199950
75	RICHMOND	VA	23236	1446	115000
76	RICHMOND	VA	23236	1528	185000
77	RICHMOND	VA	23236	2170	279000
78	RICHMOND	VA	23236	3556	275000
79	RICHMOND	VA	23236	2860	328000
80	RICHMOND	VA	23236	1340	182000
81	PEYTON	CO	80831	2459	52000
82	PEYTON	CO	80831	1491	178900
83	PEYTON	CO	80831	2878	400000
84	PEYTON	CO	80831	2552	209000
85	PEYTON	CO	80831	2170	259900
86	PEYTON	CO	80831	1680	217895
87	PEYTON	CO	80831	2277	283000
88	PEYTON	CO	80831	3134	350000
89	PEYTON	CO	80831	2476	216000
90	PEYTON	CO	80831	3334	245000
91	PEYTON	CO	80831	3080	255000
92	PEYTON	CO	80831	3382	70938
93	PEYTON	CO	80831	3016	244800
94	PEYTON	CO	80831	2448	278750
95	PEYTON	CO	80831	2523	276000
96	PEYTON	CO	80831	2958	258850
97	PEYTON	CO	80831	2280	223008
98	PEYTON	CO	80831	3055	295000
99	PEYTON	CO	80831	2128	140000
100	PEYTON	CO	80831	2865	323000
101	TABERNASH	CO	80478	1080	214500
102	TABERNASH	CO	80478	3296	555000
103	TABERNASH	CO	80478	1272	287000
104	TABERNASH	CO	80478	1575	349900
105	TABERNASH	CO	80478	3290	890000
106	TABERNASH	CO	80478	2410	685000
107	TABERNASH	CO	80478	1956	224000
108	TABERNASH	CO	80478	5126	1335000
109	TABERNASH	CO	80478	1054	164000
110	TABERNASH	CO	80478	1387	299900
111	TABERNASH	CO	80478	3000	355000
112	TABERNASH	CO	80478	3469	601000
113	TABERNASH	CO	80478	2968	619000
114	TABERNASH	CO	80478	2466	463600
115	TABERNASH	CO	80478	1600	265000
116	TABERNASH	CO	80478	1742	259000
117	TABERNASH	CO	80478	3914	132500
118	TABERNASH	CO	80478	2768	740000
119	TABERNASH	CO	80478	2194	395000
120	TABERNASH	CO	80478	2473	120000
121	CENTENNIA	CO	80122	4843	601000
122	CENTENNIA	CO	80122	1850	200000
123	CENTENNIA	CO	80122	2018	198000
124	CENTENNIA	CO	80122	1714	248000
125	CENTENNIA	CO	80122	2737	208000
126	CENTENNIA	CO	80122	2240	252500
127	CENTENNIA	CO	80122	2535	289000
128	CENTENNIA	CO	80122	1886	275000
129	LITTLETON	CO	80122	2621	310000
130	CENTENNIA	CO	80122	3160	451000
131	CENTENNIA	CO	80122	2939	271000
132	CENTENNIA	CO	80122	2236	270000
133	CENTENNIA	CO	80122	2068	262500
134	CENTENNIA	CO	80122	1968	249900
135	CENTENNIA	CO	80122	2596	359925
136	CENTENNIA	CO	80122	1680	200000
137	LITTLETON	CO	80122	1465	172000
138	CENTENNIA	CO	80122	1936	257450
139	CENTENNIA	CO	80122	3385	682000
140	CENTENNIA	CO	80122	4776	486250
141	SNOWMASS	CO	81615	2342	2500000
142	SNOWMASS	CO	81615	3423	613299
143	SNOWMASS	CO	81615	2476	955000
144	SNOWMASS	CO	81615	5489	6550000
145	SNOWMASS	CO	81615	5598	7650000
146	SNOWMASS	CO	81615	1938	3341250
147	SNOWMASS	CO	81615	1328	336262
148	SNOWMASS	CO	81615	1217	1085100
149	SNOWMASS	CO	81615	1520	1700000
150	SNOWMASS	CO	81615	894	450000
151	SNOWMASS	CO	81615	3923	2100000
152	SNOWMASS	CO	81615	1206	620000
153	SNOWMASS	CO	81615	4811	2575000
154	SNOWMASS	CO	81615	1563	1883300
155	SNOWMASS	CO	81615	1278	765000
156	SNOWMASS	CO	81615	2316	1760000
157	SNOWMASS	CO	81615	1935	300000
158	SNOWMASS	CO	81615	1973	495000
159	SNOWMASS	CO	81615	1218	769200
160	SNOWMASS	CO	81615	676	908400

Merging Data Sets¶

The DATA STEP can also be used to merge data sets using the following format:

DATA newdata;
    MERGE data1 data2;
    BY ID_VAR;
RUN;

In [14]:

DATA middlenames;
    INPUT Names :$15. MiddleName :$15. ;
        DATALINES;
    Eleanor Larson
    Georgiana Otelia
        ;
RUN;

Out[14]:

143  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
144  
145  DATA middlenames;
146      INPUT Names :$15. MiddleName :$15. ;
147          DATALINES;
NOTE: The data set WORK.MIDDLENAMES has 2 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
150          ;
151  RUN;
152  ods html5 close;ods listing;

153

In [15]:

DATA COMBINED;
    MERGE FAVORITES MIDDLENAMES;
    BY NAMES;
RUN;

Out[15]:

155  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
156  
157  DATA COMBINED;
158      MERGE FAVORITES MIDDLENAMES;
159      BY NAMES;
160  RUN;
NOTE: There were 3 observations read from the data set WORK.FAVORITES.
NOTE: There were 2 observations read from the data set WORK.MIDDLENAMES.
NOTE: The data set WORK.COMBINED has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
      
161  ods html5 close;ods listing;

162

QUESTION:¶

What does the following code return?

PROC PRINT DATA = COMBINED;
    TITLE 'Hoegh Family';
RUN;

In [16]:

PROC PRINT DATA = COMBINED;
    TITLE 'Hoegh Family';
RUN;

Out[16]:

SAS Output

Hoegh Family


Obs	Names	Food	Activity	MiddleName
1	Andy	Milkshakes	Golf
2	Eleanor	Cauliflower	Skiing	Larson
3	Georgiana	Anything	Gymnastics	Otelia

Exercise 2: Merging and Splitting Data¶

Using the shark attacks data set (attacks):

create two new data sets one for attacks in the USA (Country = USA) and another for attacks in Australia.
Create a new data set that combines these two data sets by stacking them.
Use Proc Freq to get a frequency table for activity type for attacks in USA and Australia.

Writing to Multiple Data Sets Using Output¶

A single DATA step can be used to create multiple data sets.

In [17]:

DATA virginia colorado others;
    SET mydata.housing;
    IF state = 'CO' THEN OUTPUT colorado;
        ELSE IF state = 'VA' THEN OUTPUT virginia;
        ELSE OUTPUT others;
RUN;

Out[17]:

172  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
173  
174  DATA virginia colorado others;
175      SET mydata.housing;
176      IF state = 'CO' THEN OUTPUT colorado;
177          ELSE IF state = 'VA' THEN OUTPUT virginia;
178          ELSE OUTPUT others;
179  RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.VIRGINIA has 80 observations and 5 variables.
NOTE: The data set WORK.COLORADO has 80 observations and 5 variables.
NOTE: The data set WORK.OTHERS has 1840 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
180  ods html5 close;ods listing;

181

Using DO statements for creating DATA sets¶

Often in R we used the function seq() particularly when thinking about creating graphics. A similar approach in R takes advantage of the DO command.

In [18]:

DATA quadratic;
    DO x=1 TO 10;
        y= x ** 2;
        OUTPUT;
    END;
RUN;

Out[18]:

183  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
184  
185  DATA quadratic;
186      DO x=1 TO 10;
187          y= x ** 2;
188          OUTPUT;
189      END;
190  RUN;
NOTE: The data set WORK.QUADRATIC has 10 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
191  ods html5 close;ods listing;

192

In [19]:

PROC PRINT DATA=quadratic;
TITLE;
RUN;

Out[19]:

SAS Output


Obs	x	y
1	1	1
2	2	4
3	3	9
4	4	16
5	5	25
6	6	36
7	7	49
8	8	64
9	9	81
10	10	100

SAS Data Set Options¶

Data set options can be used in both DATA and PROC statements. To use a data set option, put it between parentheses directly following the data set. The syntax follows as:

DATA STEP¶

DATA newdata;
    SET olddata (options here);
RUN;

PROC STEP¶

PROC PRINT DATA = dataset (options here);
RUN;

Data Set Options¶

KEEP = variable-list (specifies which variables to keep)
DROP = variable-list (specifies which variables to drop)
RENAME = (oldvar = newvar) (specifies which variables to rename)
FIRSTOBS = n (tells SAS to start reading at observation n)
OBS = n (tells SAS to stop reading at observation n)
WHERE = condition (selects observations that meet a specified criteria)

In [20]:

DATA HoeghFamily;
    SET COMBINED (KEEP = Food Activity MiddleName);
RUN;

Out[20]:

202  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
203  
204  DATA HoeghFamily;
205      SET COMBINED (KEEP = Food Activity MiddleName);
206  RUN;
NOTE: There were 3 observations read from the data set WORK.COMBINED.
NOTE: The data set WORK.HOEGHFAMILY has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
207  ods html5 close;ods listing;

208

Consider the following code, what will this print?

PROC PRINT DATA= HoeghFamily (Firstobs = 2);
RUN;

In [21]:

PROC PRINT DATA= HoeghFamily (Firstobs = 2);
RUN;

Out[21]:

SAS Output


Obs	Food	Activity	MiddleName
2	Cauliflower	Skiing	Larson
3	Anything	Gymnastics	Otelia

SAS Automatic Variables¶

SAS has built in variables that can be quite useful.

N (Indicates the number of times SAS has looped through the data set. This is generally the observation number, but can be different when using conditional statements.)
FIRST.variable (where variable is the name of the variable you are referencing. This requires using a by statement and will produce a 1 for the first variable on 0 otherwise.
LAST.variable (same idea as first.variable, but applies to last instance of a variable)

In [22]:

PROC SORT DATA=mydata.housing OUT=housing;
    BY State;
RUN;

Out[22]:

217  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
218  
219  PROC SORT DATA=mydata.housing OUT=housing;
220      BY State;
221  RUN;
NOTE: There were 2000 observations read from the data set MYDATA.HOUSING.
NOTE: The data set WORK.HOUSING has 2000 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
222  ods html5 close;ods listing;

223

In [23]:

DATA HousingSample;
    SET housing;
    BY STATE;
    IF FIRST.STATE =1;
RUN;

Out[23]:

225  ods listing close;ods html5 file=stdout options(bitmap_mode='inline') device=png; ods graphics on / outputfmt=png;
NOTE: Writing HTML5 Body file: STDOUT
226  
227  DATA HousingSample;
228      SET housing;
229      BY STATE;
230      IF FIRST.STATE =1;
231  RUN;
NOTE: There were 2000 observations read from the data set WORK.HOUSING.
NOTE: The data set WORK.HOUSINGSAMPLE has 32 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      
232  ods html5 close;ods listing;

233

In [24]:

PROC PRINT DATA=HousingSample;
RUN;

Out[24]:

SAS Output


Obs	City	State	Zip_Code	Living_Sq_Ft	Closing_Price
1	BODEGA BA	CA	94923	1237	680000
2	PEYTON	CO	80831	2459	52000
3	TOLLAND	CT	60842	2083	298900
4	WASHINGTO	DC	20012	2345	626000
5	LAKELAND	FL	33812	1713	105000
6	MCDONOUGH	GA	30252	3066	369400
7	HONOLULU	HI	96826	1200	83900
8	FAIRFIELD	IA	52556	1740	111800
9	CALDWELL	ID	83607	1666	172225
10	JOLIET	IL	60433	1400	91000
11	FORT WAYN	IN	46805	1289	86773
12	TOPEKA	KS	66608	1431	104500
13	BARDSTOWN	KY	40004	2494	23700
14	LOWELL	MA	18501	3229	400000
15	YORK	ME	39095	2304	310230
16	FLINT	MI	48504	1358	50600
17	LAKE CITY	MN	55041	1463	299900
18	SAINT LOU	MO	63125	1048	165000
19	LEWISTOWN	MT	59457	2464	150920
20	CHERRYVIL	NC	28021	3362	47000
21	LINCOLN	NE	68524	1458	139000
22	DRYDEN	NY	13053	1104	114000
23	MAUMEE	OH	43537	3301	314500
24	CLAREMORE	OK	74019	1400	106500
25	CAVE JUNC	OR	97523	1440	51000
26	PITTSBURG	PA	15234	1450	119000
27	CRANSTON	RI	29051	1259	259000
28	ALLEN	TX	75002	1780	148500
29	HUNTSVILL	UT	84317	2910	520695
30	PROVIDENC	VA	23140	1621	224900
31	VANCOUVER	WA	98682	1508	175000
32	ROCK SPRI	WY	82901	1820	191893

SAS Data Exercise #2¶

In [25]:

Data USA;
	set mydata.attacks;
	if Country = 'USA';
RUN;

Data Australia;
	set mydata.attacks;
	if Country = 'AUSTRALIA';
RUN;

DATA Combined;
	set USA Australia;
RUN;

Proc Freq Data = Combined;
	TABLES Activity;
RUN;

Out[25]:

SAS Output

The FREQ Procedure


Activity	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Bathing	82	4.46	82	4.46
Diving	56	3.04	138	7.50
Fishing	220	11.96	358	19.46
Snorkeling	44	2.39	402	21.85
Spearfishing	118	6.41	520	28.26
Standing	66	3.59	586	31.85
Surfing	704	38.26	1290	70.11
Swimming	438	23.80	1728	93.91
Wading	112	6.09	1840	100.00

STATISTICAL PROCEDURES¶

Next we will focus on statistical procedure available in SAS.

PROC UNIVARIATE
PROC MEANS (will not cover in detail again)
PROC TTEST
PROC REG
PROC ANOVA

PROC UNIVARIATE¶

PROC UNIVARIATE produces statistics and graphs describing the distribution of a single variable.

The syntax follows as:

PROC UNIVARIATE DATA = yourdata;
    VAR variblename;
RUN;

In [26]:

PROC UNIVARIATE DATA=housing;
    VAR LIVING_SQ_FT;
RUN;

Out[26]:

SAS Output

The UNIVARIATE Procedure

Variable: Living_Sq_Ft


Moments
N	2000	Sum Weights	2000
Mean	1840.32	Sum Observations	3680640
Std Deviation	853.235014	Variance	728009.99
Skewness	1.77833485	Kurtosis	4.87450208
Uncorrected SS	8228847374	Corrected SS	1455291969
Coeff Variation	46.363405	Std Error Mean	19.0789149


Basic Statistical Measures
Location		Variability
Mean	1840.320	Std Deviation	853.23501
Median	1625.000	Variance	728010
Mode	1008.000	Range	6475
		Interquartile Range	944.50000


Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	96.45832	Pr > \|t\|	<.0001
Sign	M	1000	Pr >= \|M\|	<.0001
Signed Rank	S	1000500	Pr >= \|S\|	<.0001


Quantiles (Definition 5)
Level	Quantile
100% Max	6963.0
99%	4838.5
95%	3423.5
90%	2938.0
75% Q3	2194.5
50% Median	1625.0
25% Q1	1250.0
10%	1019.0
5%	904.5
1%	704.0
0% Min	488.0


Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
488	631	6046	977
540	639	6506	1766
576	122	6640	1647
594	158	6695	1812
596	637	6963	610

Graphics from PROC UNIVARIATE¶

We will go in depth on SAS graphics next week, but many of the statistical procedure have built in options for creating graphics.

Syntax follows as:

PROC UNIVARIATE DATA=yourdata;
    VAR yourvariable;
    plot-request yourvariable/ options;
RUN;

Plot Requests¶

CDFPLOT (cumulative distribution function plot)
HISTOGRAM
QQPLOT (quantile-quantile plot)

In [27]:

PROC UNIVARIATE DATA=housing noprint;
    VAR LIVING_SQ_FT;
    HISTOGRAM LIVING_SQ_FT / Normal; * Normal creates a normal curve;
RUN;

Out[27]:

SAS Output

The UNIVARIATE Procedure

The UNIVARIATE Procedure

Fitted Normal Distribution for Living_Sq_Ft


Parameters for Normal Distribution
Parameter	Symbol	Estimate
Mean	Mu	1840.32
Std Dev	Sigma	853.235


Goodness-of-Fit Tests for Normal Distribution
Test	Statistic		p Value
Kolmogorov-Smirnov	D	0.1094403	Pr > D	<0.010
Cramer-von Mises	W-Sq	10.0771813	Pr > W-Sq	<0.005
Anderson-Darling	A-Sq	59.7937631	Pr > A-Sq	<0.005


Quantiles for Normal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	704.000	-144.601
5.0	904.500	436.873
10.0	1019.000	746.855
25.0	1250.000	1264.822
50.0	1625.000	1840.320
75.0	2194.500	2415.818
90.0	2938.000	2933.785
95.0	3423.500	3243.767
99.0	4838.500	3825.241

In [28]:

PROC UNIVARIATE DATA=housing noprint;
    VAR LIVING_SQ_FT;
    HISTOGRAM LIVING_SQ_FT / Lognormal; * Normal creates a normal curve;
RUN;

Out[28]:

SAS Output

The UNIVARIATE Procedure

Fitted Lognormal Distribution for Living_Sq_Ft


Parameters for Lognormal Distribution
Parameter	Symbol	Estimate
Threshold	Theta	0
Scale	Zeta	7.428118
Shape	Sigma	0.413559
Mean		1832.862
Std Dev		791.5927


Goodness-of-Fit Tests for Lognormal Distribution
Test	Statistic		p Value
Kolmogorov-Smirnov	D	0.03801564	Pr > D	<0.010
Cramer-von Mises	W-Sq	0.71048795	Pr > W-Sq	<0.005
Anderson-Darling	A-Sq	4.20760256	Pr > A-Sq	<0.005


Quantiles for Lognormal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	704.000	642.930
5.0	904.500	852.245
10.0	1019.000	990.412
25.0	1250.000	1273.057
50.0	1625.000	1682.638
75.0	2194.500	2223.994
90.0	2938.000	2858.682
95.0	3423.500	3322.135
99.0	4838.500	4403.704

Additional Options for PROC MEANS¶

In addition the options outlined earlier, here are a few more options built into PROC MEANS.

CLM (two-sided confidence limits)
P1 (1 percent quantile)
p10 (10 percent quantile)
CV (coefficient of variation)
LCLM (lower confidence limit)
UCLM (upper confidence limit)

PROC TTEST¶

Q: Describe a t-test and give an example of when it might be used (ideally in the context of one of the data sets we have seen in this course).

One Sample Comparison¶

This is useful to test whether the mean is significantly different from a hypothesis $H_0$ value of n.

PROC TTEST DATA=yourdata H0 = n options;
    VAR variable;
RUN;

Options include:

ALPHA = n (alpha level, default is 0.05)
H0 = n (requests a test of $H_0=n$, default value is 0)
SIDES = type (options are 2, the default, and L for lower one-sided, and U for upper one-sided)

In [29]:

PROC TTEST DATA=HOUSING H0 = 2500;
    VAR LIVING_SQ_FT;
RUN;

Out[29]:

SAS Output

The TTEST Procedure

Variable: Living_Sq_Ft


N	Mean	Std Dev	Std Err	Minimum	Maximum
2000	1840.3	853.2	19.0789	488.0	6963.0


Mean	95% CL Mean		Std Dev	95% CL Std Dev
1840.3	1802.9	1877.7	853.2	827.6	880.5


DF	t Value	Pr > \|t\|
1999	-34.58	<.0001

Two independent sample comparisons¶

To compare differences between two groups use this procedure:

PROC TTEST DATA=yourdata options;
    CLASS variable;
    VAR variable;
RUN;

Class is the categorical variable that you like to test differences.

In [30]:

PROC TTEST DATA=VA_CO;
    CLASS State;
    VAR Closing_Price;
RUN;

Out[30]:

SAS Output

The TTEST Procedure

Variable: Closing_Price


State	N	Mean	Std Dev	Std Err	Minimum	Maximum
CO	80	716647	1196747	133800	52000.0	7650000
VA	80	281179	182367	20389.3	40000.0	1075000
Diff (1-2)		435468	855997	135345


State	Method	Mean	95% CL Mean		Std Dev	95% CL Std Dev
CO		716647	450324	982970	1196747	1035728	1417514
VA		281179	240595	321763	182367	157830	216009
Diff (1-2)	Pooled	435468	168149	702787	855997	771122	962032
Diff (1-2)	Satterthwaite	435468	166256	704680


Method	Variances	DF	t Value	Pr > \|t\|
Pooled	Equal	158	3.22	0.0016
Satterthwaite	Unequal	82.667	3.22	0.0018


Equality of Variances
Method	Num DF	Den DF	F Value	Pr > F
Folded F	79	79	43.06	<.0001

Paired t-test¶

Another type of t-test involves paired data.

PROC TTEST DATA=yourdata options;
    PAIRED variable1 * variable2;
RUN;

PROC TTEST GRAPHICS¶

The ttest procedure automatically creates a few graphics, but that can be controlled using the following syntax.

PROC TTEST DATA=yourdata PLOTS = (plot-request)

The plot requests are:

ALL (produce all plots)
BOXPLOT (creates box plots)
HISTOGRAM
INTERVALPLOT (creates plots of confidence interval of means)
NONE (suppresses all plots)
QQPLOT
SUMMARYPLOT (creates one plot that contains histrograms and box plots)

Exercise 3: Proc Univariate¶

Use Proc Univariate to summarize the registered and casual variables in the bikes dataset.
Include a histogram with a normal density superimposed.
Does the normal density match the data? Why or why not?

In [31]:

Proc Univariate Data = mydata.bikes;
	Var Registered Casual;
	HISTOGRAM Registered Casual / Normal; * Normal creates a normal curve;
run;

Out[31]:

SAS Output

The UNIVARIATE Procedure

Variable: registered


Moments
N	10886	Sum Weights	10886
Mean	155.552177	Sum Observations	1693341
Std Deviation	151.039033	Variance	22812.7895
Skewness	1.52480459	Kurtosis	2.626081
Uncorrected SS	511720093	Corrected SS	248317214
Coeff Variation	97.0986301	Std Error Mean	1.44762152


Basic Statistical Measures
Location		Variability
Mean	155.5522	Std Deviation	151.03903
Median	118.0000	Variance	22813
Mode	3.0000	Range	886.00000
		Interquartile Range	186.00000


Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	107.4536	Pr > \|t\|	<.0001
Sign	M	5435.5	Pr >= \|M\|	<.0001
Signed Rank	S	29547378	Pr >= \|S\|	<.0001


Quantiles (Definition 5)
Level	Quantile
100% Max	886
99%	697
95%	464
90%	354
75% Q3	222
50% Median	118
25% Q1	36
10%	7
5%	4
1%	1
0% Min	0


Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
0	4012	833	9585
0	3890	839	9897
0	3867	857	9298
0	3725	857	9753
0	3318	886	9346

The UNIVARIATE Procedure

The UNIVARIATE Procedure

Fitted Normal Distribution for registered


Parameters for Normal Distribution
Parameter	Symbol	Estimate
Mean	Mu	155.5522
Std Dev	Sigma	151.039


Goodness-of-Fit Tests for Normal Distribution
Test	Statistic		p Value
Kolmogorov-Smirnov	D	0.151715	Pr > D	<0.010
Cramer-von Mises	W-Sq	59.670109	Pr > W-Sq	<0.005
Anderson-Darling	A-Sq	383.449398	Pr > A-Sq	<0.005


Quantiles for Normal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	1.00000	-195.8172
5.0	4.00000	-92.8849
10.0	7.00000	-38.0121
25.0	36.00000	53.6779
50.0	118.00000	155.5522
75.0	222.00000	257.4265
90.0	354.00000	349.1165
95.0	464.00000	403.9893
99.0	697.00000	506.9215

The UNIVARIATE Procedure

Variable: casual


Moments
N	10886	Sum Weights	10886
Mean	36.0219548	Sum Observations	392135
Std Deviation	49.9604766	Variance	2496.04922
Skewness	2.4957484	Kurtosis	7.55162931
Uncorrected SS	41294965	Corrected SS	27169495.8
Coeff Variation	138.694518	Std Error Mean	0.47884219


Basic Statistical Measures
Location		Variability
Mean	36.02195	Std Deviation	49.96048
Median	17.00000	Variance	2496
Mode	0.00000	Range	367.00000
		Interquartile Range	45.00000


Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	75.2272	Pr > \|t\|	<.0001
Sign	M	4950	Pr >= \|M\|	<.0001
Signed Rank	S	24504975	Pr >= \|S\|	<.0001


Quantiles (Definition 5)
Level	Quantile
100% Max	367
99%	241
95%	141
90%	94
75% Q3	49
50% Median	17
25% Q1	4
10%	1
5%	0
1%	0
0% Min	0


Extreme Observations
Lowest		Highest
Value	Obs	Value	Obs
0	10866	356	7687
0	10844	357	6729
0	10842	361	7686
0	10840	362	9652
0	10839	367	6730

The UNIVARIATE Procedure

The UNIVARIATE Procedure

Fitted Normal Distribution for casual


Parameters for Normal Distribution
Parameter	Symbol	Estimate
Mean	Mu	36.02195
Std Dev	Sigma	49.96048


Goodness-of-Fit Tests for Normal Distribution
Test	Statistic		p Value
Kolmogorov-Smirnov	D	0.235452	Pr > D	<0.010
Cramer-von Mises	W-Sq	171.574837	Pr > W-Sq	<0.005
Anderson-Darling	A-Sq	948.029130	Pr > A-Sq	<0.005


Quantiles for Normal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	0.000	-80.20349
5.0	0.000	-46.15572
10.0	1.000	-28.00497
25.0	4.000	2.32413
50.0	17.000	36.02195
75.0	49.000	69.71978
90.0	94.000	100.04888
95.0	141.000	118.19963
99.0	241.000	152.24740

PROC REG¶

Similar to lm() in R, we can also easily run regression in SAS.

PROC REG DATA=yourdata;
    MODEL dependent = independent;
RUN;

Note including categorical variables can take a bit of extra work to create dummy variables (or use PROC GLM).

In [32]:

PROC REG DATA=housing;
    MODEL CLOSING_PRICE = LIVING_SQ_FT;
RUN;

Out[32]:

SAS Output

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price


Number of Observations Read	2000
Number of Observations Used	2000


Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	3.287656E13	3.287656E13	384.41	<.0001
Error	1998	1.708798E14	85525443800
Corrected Total	1999	2.037564E14


Root MSE	292447	R-Square	0.1614
Dependent Mean	244147	Adj R-Sq	0.1609
Coeff Var	119.78344


Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	1	-32459	15550	-2.09	0.0370
Living_Sq_Ft	1	150.30316	7.66607	19.61	<.0001

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price

Panel of fit diagnostics for Closing_Price.

Scatter plot of residuals by Living_Sq_Ft for Closing_Price.

Scatterplot of Closing_Price by Living_Sq_Ft overlaid with the fit line, a 95% confidence band and lower and upper 95% prediction limits.

PROC REG GRAPHICS¶

Graphics can easily be created from PROC REG using the following syntax:

PROC REG DATA=yourdata PLOTS(options) = (plot request list);
    MODEL dependent = independent;
RUN;

For the options trailing PLOTS use (ONLY) if you wish to omit the default plots.

Plot requests

FITPLOT (scatter plot with regression line and confidence bands)
RESIDUALS ( residuals plotted against independent variables)
DIAGNOSTICS (diagnostics panel)
COOKSD (Cook's D statistic)
OBSERVEDBYPREDICTED (dependent variable by predicted)
QQPLOT
RESIDUALBYPREDICTED
RESIDUALHISTOGRAM
RFPLOT (residual fit plot)

In [33]:

PROC REG DATA=housing PLOTS(ONLY) = DIAGNOSTICS;
    MODEL CLOSING_PRICE = LIVING_SQ_FT;
RUN;

Out[33]:

SAS Output

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price


Number of Observations Read	2000
Number of Observations Used	2000


Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	1	3.287656E13	3.287656E13	384.41	<.0001
Error	1998	1.708798E14	85525443800
Corrected Total	1999	2.037564E14


Root MSE	292447	R-Square	0.1614
Dependent Mean	244147	Adj R-Sq	0.1609
Coeff Var	119.78344


Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	1	-32459	15550	-2.09	0.0370
Living_Sq_Ft	1	150.30316	7.66607	19.61	<.0001

The REG Procedure

Model: MODEL1

Dependent Variable: Closing_Price

PROC ANOVA¶

This procedure allows you to run an analysis of variance procedure with the following syntax:

PROC ANOVA DATA =yourdata;
    CLASS classvariable;
    MODEL dependent = effects;
RUN;

In [34]:

DATA ANDYLIVED;
    SET HOUSING;
    IF STATE in ('IA', 'MT','CO','MN','CA','VA');
RUN;

Out[34]:

SAS Output

In [35]:

PROC ANOVA DATA=ANDYLIVED;
    CLASS STATE;
    MODEL CLOSING_PRICE = STATE;
RUN;

Out[35]:

SAS Output

The ANOVA Procedure


Class Level Information
Class	Levels	Values
State	6	CA CO IA MN MT VA


Number of Observations Read	440
Number of Observations Used	440

The ANOVA Procedure

Dependent Variable: Closing_Price


Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	5	1.43091E13	2.8618199E12	9.81	<.0001
Error	434	1.2654988E14	291589585117
Corrected Total	439	1.4085898E14


R-Square	Coeff Var	Root MSE	Closing_Price Mean
0.101585	143.7899	539990.4	375541.3


Source	DF	Anova SS	Mean Square	F Value	Pr > F
State	5	1.43091E13	2.8618199E12	9.81	<.0001

Exercise 4: Proc Reg / Anova¶

Using the bikes dataset fit a regression model to predict the count of bikes.
What is the best adjusted - R-squared value you can acheive? (without using casual and registered).
Does the model treat season correctly?

In [36]:

Proc REG Data = mydata.bikes;
	MODEL Count = season holiday temp workingday weather;
RUN;

Out[36]:

SAS Output

The REG Procedure

Model: MODEL1

Dependent Variable: count


Number of Observations Read	10886
Number of Observations Used	10886


Analysis of Variance
Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	5	61261467	12252293	450.49	<.0001
Error	10880	295911447	27198
Corrected Total	10885	357172914


Root MSE	164.91738	R-Square	0.1715
Dependent Mean	191.57413	Adj R-Sq	0.1711
Coeff Var	86.08541


Parameter Estimates
Variable	DF	Parameter Estimate	Standard Error	t Value	Pr > \|t\|
Intercept	1	32.97402	6.68752	4.93	<.0001
season	1	11.16378	1.46727	7.61	<.0001
holiday	1	-8.24676	9.80449	-0.84	0.4003
temp	1	8.61542	0.21053	40.92	<.0001
workingday	1	1.09982	3.50658	0.31	0.7538
weather	1	-31.15689	2.49998	-12.46	<.0001

The REG Procedure

Model: MODEL1

Dependent Variable: count

Panel of heat maps of residuals by regressors for count.