Exploring SAS® Viya®
SAS® Viya® Fundamentals
This series is based on content from SAS® Viya® Enablement,
a free course available from SAS Education. You can follow along
with the examples in real time by watching the videos if you prefer.
Topics covered illustrate the features and capabilities of SAS Viya.
SAS Viya extends the SAS platform to enable everyone –
data scientists, business analysts, developers, and executives alike –
to collaborate and realize innovative results faster.
Visit sas.com/books for additional books and resources.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies. © 2019 SAS Institute Inc. All rights reserved. M1913158 US.0419
About This Book
SAS Viya is designed to coexist with SAS 9.4 solutions and the SAS 9 environment. While SAS 9 and
SAS Viya are two run-time environments built for different use cases, you can make your SAS 9.4 data
available to SAS Viya. These environments also share some functionality. For example, SAS 9 uses the
SAS programming language, and SAS Viya uses the next generation of SAS programming with the new
CAS programming language. The CAS language is very similar to the SAS language. Some procedures are
available in both SAS 9 and SAS Viya, so some existing SAS code can be run in SAS Viya. However,
SAS Viya also contains new procedures that take advantage of the open, distributed environment. As a
result, some SAS 9 procedures do not exist in SAS Viya.
It is easy to connect to SAS Viya’s CAS to submit code. To write and run SAS code through your web
browser, you can use the SAS Studio interface. With SAS Studio, you can access your data files, libraries,
and existing programs and write new programs. SAS Viya uses PROC CAS to run CAS actions in SAS
Cloud Analytic Services. You can use the REST APIs for any client language to access SAS analytics,
data, and services. You can also use programming interfaces for Python, Java, and Lua to access this CAS
functionality. In addition, you can continue to submit SAS code in batch mode.
The content in this book is based on SAS® Viya® Enablement, a free course available from SAS Education.
This book covers how to access data files, libraries, and existing code in SAS Studio. You will also learn
about new procedures in SAS Viya, how to write new code, as well as how to use some of the pre-installed
tasks that come with SAS Visual Data Mining and Machine Learning. In the last chapter, you will learn
how to use the features in SAS Data Preparation to perform data management tasks using SAS Data
Explorer, SAS Data Studio, and SAS Lineage Viewer.
SAS Viya extends the SAS Platform to enable everyone – data scientists, business analysts, developers and
executives alike – to collaborate and realize innovative results faster. If you are curious about SAS Viya
and want to learn more about some of its features and capabilities, then this book is also for you.
This book includes tutorials for you to follow to gain hands-on experience with SAS Viya and SAS 9.4M5.
Wherever possible, the source of the sample data is provided in a link. Some features shown may only be
available if your site has licensed that feature in SAS Viya. Therefore, the options in your version of SAS
may look different.
SAS Press books are written by SAS Users for SAS Users. Please visit sas.com/books to sign up to request
information on how to become a SAS Press author.
Learn about new books and exclusive discounts. Sign up for our new books mailing list today at
https://support.sas.com/en/books/subscribe-books.html.
Chapter 1: SAS Viya Deployment
Introduction
Deployment
Topologies
  Single Machine Deployment
  Multiple Machine Deployment
Hadoop
  Co-located Deployment
  Remote Deployment
Connecting to the CAS Server in SAS 9 Clients
  SAS Studio Example
  SAS Enterprise Guide Example
Resources
Introduction
The high-performance processing power of the SAS Viya platform is provided by SAS Cloud Analytic
Services (CAS). CAS is an in-memory engine that can dramatically accelerate data management and
analytics with SAS. Some of the benefits of CAS include:
In this chapter, we look at the tools and utilities for deploying SAS Viya products, several possible
topologies, and some deployment options with Hadoop.
Deployment
Deployment of SAS Viya uses industry-standard deployment software such as Ansible and yum. SAS
provides software as RPM packages, and uses the Linux utility yum to install the RPM packages in your
environment.
[Figure: yum installs the RPM packages into your environment.]
Ansible automates a series of yum commands to install the RPM packages on the machines that you
designate. It uses a configuration management script called a playbook that maps a machine (or groups of
machines) to well-defined roles, which associate groups of services to specific machines. To support
Ansible, SAS provides a utility to generate a playbook that you customize for your environment, as shown
in Figure 1.1.
The machine where Ansible is installed is called the Ansible controller machine. Ansible might be on the
same machine as SAS Viya, or on a separate machine. (See Figure 1.2.) Either way, a multi-machine
deployment is simpler because Ansible needs to be installed on only one machine (the Ansible controller
machine).
Ansible can use SSH to deliver instructions and retrieve results from the other machines in the installation,
called managed nodes.
Topologies
There are several possible topologies that you can use to deploy SAS Viya. Let’s look at some examples.
SAS Cloud Analytic Services (CAS) includes the in-memory run-time server for SAS Viya products. You
can potentially improve the performance of analytical processing by deploying CAS to its own machine or
machines. In a simple scenario, a single machine handles all CAS operations, which include in-memory
run-time analytics and supporting services. This is symmetric multiprocessing (SMP) architecture as
shown in Figure 1.3.
CAS can also be configured in distributed mode. This is a massively parallel processing (MPP) architecture,
which is a core design feature of SAS Viya. This scenario provides optimal processing capabilities. The
CAS controller distributes work to each of the CAS worker nodes, and the worker nodes send the results
of the computations back to the CAS controller. (See Figure 1.4.)
In this configuration, the node labeled SAS Viya Applications provides infrastructure support for SAS
products, such as reporting and administrative services for web applications.
Hadoop
In both SMP and MPP architecture, the CAS server is multi-threaded for high performance. CAS servers
are optimized to work jointly with Hadoop. You can connect to a Hadoop cluster in two ways: a co-located
deployment and a remote deployment.
Co-located Deployment
Co-located deployments install the CAS in-memory run-time server onto an existing Hadoop cluster. The
CAS controller software is installed on the Hadoop NameNode, and the CAS worker software is installed
on the Hadoop DataNodes as shown in Figure 1.5. Notice that other SAS Viya applications are deployed to
a separate machine.
Remote Deployment
Remote deployments pair CAS controller nodes and worker nodes on one set of machines with name and
data nodes on a remote Hadoop cluster. Similar to the co-located deployment configuration, SAS Viya
applications are deployed to a separate machine as shown in Figure 1.6.
SAS Viya Embedded Processes can run on Hadoop or Teradata machines to provide a computational
engine near the data. This reduces unnecessary data movement and speeds up model scoring. SAS plug-ins
for Hadoop provide connection and configuration information and can vary based on the Hadoop
distribution that is used.
Connecting to the CAS Server in SAS 9 Clients
The following SAS 9 clients can connect to the CAS server:
● SAS Studio
● SAS Enterprise Guide
● SAS Windowing Environment
● SAS Data Integration Studio
Even if you are running an earlier version of SAS, you can still use SAS/CONNECT technology to
remotely execute code and transfer data to and from SAS Viya.
Let’s look at how easy it is to connect to SAS Cloud Analytic Services in SAS Viya to submit code that will
load and process data in-memory.
Open your SAS Studio session. Remember, SAS 9 is the default server. If you want to take advantage of
SAS Viya and the CAS server, you can start a new CAS session. To do so, find the New CAS Session
snippet under Snippets – SAS Viya Cloud Analytics Services and double-click it
to send the code to the code window without any changes. See Figure 1.7.
Submit the code. The log confirms that you have successfully connected to the CAS server.
Use the following CASLIB statement to access the CAS libraries within SAS 9.
caslib _all_ assign;
This statement will assign all of the CAS libraries that are available to your user ID and make them visible
in SAS. In your environment, you might already have data loaded in your assigned caslibs. Because CAS
processes only in-memory tables, we have to load tables into memory before we can use them in CAS. You
will learn more about caslibs and loading data in the next chapter.
To terminate your CAS session, enter the following statement before exiting SAS Studio.
cas mysession terminate;
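The statements in this example can be read as one session life cycle. The following is a minimal sketch, assuming that the snippet starts a session named mysession (the name is taken from the TERMINATE statement above; the actual snippet text in your SAS Studio release may differ).

```sas
/* Start a CAS session; the session name mysession is an assumption */
cas mysession;

/* Make every caslib available to your user ID visible as a SAS library */
caslib _all_ assign;

/* ... submit code that loads and processes in-memory tables here ... */

/* End the session before exiting SAS Studio */
cas mysession terminate;
```

After the TERMINATE statement runs, tables that were loaded only for this session are no longer available, so save anything that you want to keep first.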
Whichever SAS programming interface you use (SAS Studio, SAS Enterprise Guide, or the SAS Windowing
Environment), you can now submit code to SAS 9 or SAS Viya. In the next chapter,
we will look at foundational programming in SAS Viya and the differences between SAS 9 code and SAS
Viya code that takes advantage of CAS.
Resources
This chapter is based on the “Introduction to SAS Viya” videos in SAS® Viya® Enablement, a free course
available from SAS Education.
You may find the following documentation helpful as you learn more about deployment of SAS® Viya®:
To stay informed about SAS Viya development, please refer to the SAS Viya Community website.
Chapter 2: Foundational Programming in SAS Viya
Introduction
Accessing Data
  A Quick-Start Guide to Loading Data in CAS
  Differences Between SAS 9 and SAS Viya
DATA Step
  Saving Modified Tables
  Differences Between SAS 9 and SAS Viya
BY Statement for Processing Data in Groups
  BY-group Processing in SAS 9
  BY-group Processing in SAS Viya
  PARTITION= and ORDERBY=
FORMAT Procedure
  Formats in SAS 9
  Formats in SAS Viya
Code Snippets
  Pre-installed Code Snippets
  Create New Snippets
Resources
Introduction
The first thing you should know about SAS Viya is that you can use all of your existing SAS programming
knowledge in this new, high-powered SAS environment. With your SAS experience and the capabilities of
SAS Viya, you will be able to more efficiently and effectively analyze your data.
SAS offers a collection of new, high-performance CAS procedures as well as SAS procedures that will be
familiar to users of SAS 9 and run in CAS with familiar syntax. The DATA step, DS2, and FedSQL all run
in CAS as well. However, some aspects of the SAS programming language are not compatible with a
multi-threaded approach. For example, sometimes you might want to run a DATA step in multiple threads in CAS, and
other times you need the DATA step to process the entire table sequentially on a single thread on either
the CAS server or the workspace server. To address this, SAS Viya provides not only the CAS server but
also a single-threaded SAS workspace server, so that you can choose.
There are three options for writing your programs in SAS Viya:
● SAS Studio provides a SAS programming environment for developing and submitting programs
to the server.
● Batch submission is also still an option.
● Open-source languages such as Python, Lua, and Java can submit code to the CAS server.
In this chapter, you will learn how to access your data in SAS Viya, including how to load programs and
access libraries. Then we will look at some simple DATA step programs to show how code in SAS Viya
differs from SAS 9 code. We will also look at PROC FORMAT to show how to create and apply user-
defined formats in SAS Viya. Finally, we will briefly look at Code Snippets, a feature in SAS Studio that
allows you to access pre-defined code and save your own code for later use.
Accessing Data
Accessing your data is a critical part of any SAS program. It might not be the most exciting part of your
data analysis, but you won’t get far without it! In the previous chapter, we learned how to connect to the
CAS server. In this section, we will look at some key concepts for accessing data in SAS Viya and then
compare and contrast the code in SAS 9 and SAS Viya.
Let’s look at the log. We can see that sashelp.cars was successfully added to my active caslib as
MYCARS.
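The program that produced this log is not reproduced here. As a minimal sketch, assuming an active CAS session, a load of sashelp.cars into the active caslib under the name MYCARS could look like this:

```sas
proc casutil;
   /* DATA= names the SAS data set to load; CASOUT= names the in-memory CAS table */
   load data=sashelp.cars casout="mycars" replace;
run;
```

The REPLACE option is included only so that the step can be rerun without an error if MYCARS is already in memory.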
At the top of the results in Output 2.2, there is table information including memory allocation followed by
column information for MYWORLDDATA. If you go to the log, you will see that a table,
MYWORLDDATA has been created and has been loaded into your active caslib.
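The source file behind MYWORLDDATA is not identified in this section. The following sketch assumes a CSV file on the SAS client machine; the file path is a hypothetical placeholder. The CONTENTS statement is what would produce table information followed by column information.

```sas
proc casutil;
   /* FILE= loads a client-side file; the path is a hypothetical placeholder */
   load file="/path/to/worlddata.csv" casout="myworlddata" replace;

   /* Report table-level and column-level metadata for the in-memory table */
   contents casdata="myworlddata";
run;
```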
List Tables
In the logs of both Programs 2.1 and 2.2 we can see that we have loaded both MYCARS and
MYWORLDDATA. But there is another way to tell which tables have been loaded. We can use the LIST
TABLES statement in Program 2.3 to list the in-memory tables.
Let’s look at the results in Output 2.3. Notice that the active caslib is CASUSER(viyauser) and that the loaded
tables are MYCARS and MYWORLDDATA, just as we would expect from this example. Both of these
tables are available as CAS tables in your active caslib, and they will remain available until you end your
CAS session.
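A minimal sketch of the LIST TABLES statement, assuming an active CAS session with both tables already loaded:

```sas
proc casutil;
   /* List the in-memory tables in the active caslib */
   list tables;
run;
```

A companion statement, LIST FILES, lists the files in the caslib's data source instead of the in-memory tables.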
Save Tables
An alternative way to save tables is shown in Program 2.4. You can use a SAVE statement to create a
permanent copy of these tables saved to the data source associated with the caslib as SASHDAT files.
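A minimal sketch of the SAVE statement, assuming that MYCARS and MYWORLDDATA are in memory in the active caslib. With no CASOUT= option, the saved file takes its name from the table.

```sas
proc casutil;
   /* Write each in-memory table to the caslib's data source as a SASHDAT file */
   save casdata="mycars" replace;
   save casdata="myworlddata" replace;

   /* Confirm that the files now exist in the data source */
   list files;
run;
```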
Assign a Libref
Let’s learn how to associate a SAS libref with tables on the CAS server so that you can access them with
procedures and DATA steps. In Program 2.5, the active caslib is used. After the code is run, the libref
MYCAS points to the active caslib and gives access to its tables.
In addition, you can add a file directory structure at the conclusion of the LIBNAME statement.
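A minimal sketch of the libref assignment, assuming an active CAS session. The PROC PRINT step is only an illustration of using the libref, and it assumes that a table named MYCARS is in memory.

```sas
/* Bind the libref mycas to the active caslib through the CAS engine */
libname mycas cas;

/* To name a caslib explicitly instead, add the CASLIB= option: */
/* libname mycas cas caslib=casuser;                            */

/* Reference in-memory tables as libref.tablename */
proc print data=mycas.mycars(obs=5);
run;
```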
Accessing Libraries
If you are working in SAS Studio, you can go to the libraries section of the Navigation pane to expand the
mycas library. You will notice in Figure 2.1 that there are slightly different icons that distinguish the caslib
and the CAS tables from the SAS libraries and SAS data sets.
In the log, you will see the names of any active sessions that need to be shut down. In this example, the
name of the active session is CASAUTO, so we can shut down that active session using the code in
Program 2.7.
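A minimal sketch of these two steps, using the session name CASAUTO from this example:

```sas
/* Write the names of all active CAS sessions to the log */
cas _all_ list;

/* Shut down the session named CASAUTO */
cas casauto terminate;
```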
In SAS 9, we use the familiar LIBNAME statement to establish a connection with the data. In the
LIBNAME statement in Program 2.8, the libref orion creates a logical reference or “shortcut” to a physical
location. Then we can take advantage of the orion libref throughout the rest of the program to identify the
location of the tables we reference.
data orion.cars;
set sashelp.cars;
Average_MPG=mean(MPG_City, MPG_Highway);
Keep Make Model Type MSRP Average_MPG;
run;
In SAS Viya, we also need to identify the data that we want to access and analyze with SAS Cloud
Analytic Services (CAS). A caslib is how the CAS server accesses data. The best way to describe a caslib
is that it is a container. The container has two areas where data is referenced: a physical space that includes
the source data or files, and an in-memory space that makes the data available for CAS processing. The
container also holds connection information for the source data and access controls governing who can
access the caslib and what they can do with it.
To process data in CAS, the source data must be loaded into memory, as shown in Figure 2.2. The in-
memory data is referred to as a CAS table. Additional tables can be loaded into memory from the caslib’s
data source or data can be loaded into the caslib’s in-memory space from areas other than the caslib’s
source path. For example, we can load CSV, Excel, or SAS data sets into CAS. When we are finished with
a CAS table, it can be dropped from the in-memory space. If we make changes to an in-memory table, they
can be saved back to the caslib’s data source so that the changes are available the next time you load the
table to memory.
Now that we have illustrated the caslib conceptually, let’s look at what this means for a SAS program.
Program 2.9 shows how to load a table into CAS.
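/* Start a CAS session with casuser as the active caslib */
cas mysess sessopts=(caslib=casuser);
/* Bind the libref mycas to the casuser caslib through the CAS engine */
libname mycas cas caslib=casuser;
proc casutil;
load data=sashelp.cars replace;
run;
data mycas.cars;
set mycas.cars;
Average_MPG=mean(MPG_City, MPG_Highway);
Keep Make Model Type MSRP Average_MPG;
run;
proc casutil;
save casdata="cars" replace;
droptable casdata="cars";
run;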
Before we add a caslib, we need to start a CAS session. A CAS statement initiates a CAS session that in
Program 2.9 is called mysess. The SESSOPTS= option is used with the CASLIB= session option to
ensure that the casuser personal caslib is set as the active caslib. Much like the traditional sasuser
library in SAS 9, casuser is your own personal caslib assigned to your credentials.
Now we need to associate a libref with my casuser caslib to be able to use it in the program as
libref.tablename. The libref will be mycas, and it will use the CAS engine and connect to the casuser
caslib.
If you have a SAS 9 data set that you want to process on the CAS server, you can use the new procedure
PROC CASUTIL to load that table into memory in your caslib. In the LOAD statement, the DATA=
option identifies the table that we want to load. The REPLACE option will overwrite the table if it
already exists in the caslib.
At this point, we can take advantage of the DATA step or many other CAS-enabled procedures to
manipulate or analyze our data. In the next section, we will continue working with the same program and
look at the DATA step.
DATA Step
The familiar DATA step code that you use in SAS 9 can also work in SAS Viya. Compare
the code in Programs 2.8 and 2.9 side by side below. Notice in Program 2.9 that we
use standard SAS naming conventions to reference the data, with mycas as the libref and cars as the table.
proc casutil;
save casdata="cars" replace;
droptable casdata="cars";
run;
After running the DATA step in Program 2.9, the log confirms that it ran in CAS. This is the case when
both the DATA and SET statements reference a CAS table.
If we run the final step in Program 2.9 and look at the log, we see that CAS saved cars.sashdat in the
casuser caslib and that the cars table was dropped from memory.
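data bigcars;
set sashelp.cars;
do i=1 to 100000; /* 428 rows x 100,000 copies = 42.8 million rows, matching the log */
output;
end;
run;
data bigcars_score;
set bigcars;
length myscore 8;
myscore=0.3*Invoice/(MSRP-Invoice)
+ 0.5*(EngineSize+Horsepower)/Weight + 0.2*(MPG_City+MPG_Highway);
run;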
In the first DATA step, we are simply creating a large version of the sashelp.cars data set that has 42.8
million rows. In the second DATA step, we are creating a new variable named myscore that is computed from a formula
based on values within the data. If we run this program in SAS Studio with SAS 9.4 running behind the
scenes, then the log will show that the scoring DATA step takes about 23 seconds real time and almost 11
seconds CPU time. Not bad for such a large table, but keep in mind that both the data and the code are fairly simple,
and that the program ran on a single machine with a single processor, on data that is stored on disk.
NOTE: There were 42800000 observations read from the data set WORK.BIGCARS.
NOTE: The data set WORK.BIGCARS_SCORE has 42800000 observations and 17 variables.
NOTE: DATA statement used (Total process time):
real time 23.01 seconds
cpu time 10.88 seconds
Now let’s take advantage of SAS Cloud Analytic Services (CAS) in SAS Viya. When Program 2.10 is run in
CAS, SAS takes advantage of multiple machines and processors in your cloud environment to load the data
into memory and partition the DATA step execution out among multiple worker nodes. With each of these
nodes running its piece of the DATA step in parallel, this divide-and-conquer approach can
mean dramatic performance improvements for complex programs and large data sets.
When we switch to a version of SAS Studio that is running SAS Viya behind the scenes, we need to add a
few statements to the code to ensure it is running in CAS.
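/* Start the CAS session and bind the libref mycas to a CAS library */
cas mysess sessopts=(caslib=casuser);
libname mycas cas;
proc casutil;
load data=sashelp.cars replace;
run;
data mycas.bigcars;
set mycas.cars;
do i=1 to 1000000;
output;
end;
run;
data mycas.bigcars_score;
set mycas.bigcars;
length myscore 8;
myscore=0.3*Invoice/(MSRP-Invoice)
+ 0.5*(EngineSize+Horsepower)/Weight + 0.2*(MPG_City+MPG_Highway);
/* Record which thread processed each row */
Thread=_threadid_;
run;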
The first two statements start our CAS session then bind a traditional SAS libref, mycas, to a CAS
library, which enables the loading and reading of data stored in memory. Note that the library
references are updated in the rest of the program so that all data sets now reference the mycas library.
The PROC CASUTIL statement loads the sashelp.cars table to the in-memory CAS library.
The final addition is an assignment statement, creating a new column called Thread. This column will
indicate on which thread each row was processed.
Again, we can compare the code side by side to see the differences.
Let’s run Program 2.11 and examine the log. We can see our CAS session has started and that we have 14
workers. Each worker has 16 CPU cores, which gives us a total of 224 threads.
Now, looking at the lines of the log for the DATA step where the scoring took place, notice that it ran in
1.93 seconds real time and practically no CPU time. That’s quite an improvement!
This is just one example of the power and simplicity of SAS Cloud Analytic Services for data
manipulation with relatively simple data and code. Imagine what SAS Viya can do with more complex
data and programs.
When Program 2.12 is submitted in SAS Viya, it runs on the SAS workspace server in a single thread –
same code, same results as running it in SAS 9. The table viewer in Output 2.12 shows that cars2 is
ordered by Make.
What if instead we would like to sort by Type, and then sort within Type by MSRP? Of course, that
requires a PROC SORT step before the DATA step. We will use first. and last. variables to flag the high
and low values of MSRP within Type. We add the BY statement in the DATA step to create the first./last.
variables, and conditional logic to assign values to our new variables, as shown in Program 2.13.
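/* Sort by Type, then by MSRP within Type; sashelp.cars as the input is an assumption */
proc sort data=sashelp.cars out=sort_cars;
by Type MSRP;
run;
data cars2;
set sort_cars;
Average_MPG=mean(MPG_City, MPG_Highway);
keep Make Model Type Average_MPG MSRP LowMSRP HighMSRP;
by Type;
if first.Type then LowMSRP=1;
else LowMSRP=0;
if last.Type then HighMSRP=1;
else HighMSRP=0;
run;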
Again, when we run this code in SAS Viya, the DATA step is running in SAS, single-threaded, with all the
rows being processed sequentially, just as in SAS 9. See Output 2.13.
proc casutil;
load data=sashelp.cars replace;
run;
data mycas.cars2;
set mycas.cars;
Average_MPG=mean(MPG_City, MPG_Highway);
keep Make Model Type Average_MPG MSRP;
by Type;
run;
When we run Program 2.14, we can see that the data is grouped by Type, but not in sorted sequence.
Hybrid is toward the bottom of the data.
Remember in the SAS 9 Program 2.13 we sorted the data within Type by MSRP. One way to accomplish
this is simply to add MSRP to the BY statement, as shown in Program 2.15. We will also include the
statements that use first. and last. variables to create our new high and low MSRP columns.
After running the program, we see in Output 2.15 that the data is grouped by Type, although not in sorted
order, but MSRP is sorted within each group so that the low and high MSRP columns are created correctly.
When more than one variable is listed in the BY statement, CAS groups and distributes the rows based on
only the values of the first BY variable, and then orders the data within the groups by any subsequent
variables.
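The listing for Program 2.15 is not reproduced here. The following is a sketch of what the description implies, assuming the same session, libref, and loaded table as Program 2.14: MSRP is added to the BY statement, and the first. and last. flags are keyed on Type.

```sas
data mycas.cars2;
   set mycas.cars;
   Average_MPG=mean(MPG_City, MPG_Highway);
   keep Make Model Type Average_MPG MSRP LowMSRP HighMSRP;
   /* CAS groups rows by Type and orders them by MSRP within each group */
   by Type MSRP;
   /* Flag the lowest and highest MSRP within each Type */
   if first.Type then LowMSRP=1;
   else LowMSRP=0;
   if last.Type then HighMSRP=1;
   else HighMSRP=0;
run;
```

No PROC SORT step is needed, because CAS performs the grouping and ordering when the DATA step runs.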
PARTITION= and ORDERBY=
Two data set options control how CAS groups and orders rows:
● PARTITION=
● ORDERBY=
The PARTITION= data set option specifies one or more partitioning variables. CAS then uses the
formatted values of the partitioning variables to divide and distribute the rows of the input table across
threads. The ORDERBY= data set option orders the rows within each partition. See Program 2.16 for an
example of code using these options.
Program 2.16: SAS Viya Code with PARTITION= and ORDERBY= Options
data mycas.cars2 (partition=(type) orderby=(MSRP));
set mycas.cars;
Average_MPG=mean(MPG_City, MPG_Highway);
keep Make Model Type Average_MPG MSRP LowMSRP HighMSRP;
by Type;
if first.Type then LowMSRP=1;
else LowMSRP=0;
if last.Type then HighMSRP=1;
else HighMSRP=0;
run;
Although the PARTITION= option is used in Program 2.16, the BY statement is still required to create the
first. and last. variables. When we run Program 2.16, notice in Output 2.16 that the values of the
PARTITION variable, Type, are in sorted sequence and within each Type, MSRP is sorted as well.
We have seen that SAS Viya uses practically the same syntax as SAS 9 to process data in groups using the
BY statement in the DATA step. But given the fact that SAS Viya allows us to run the DATA step in CAS,
we can potentially realize great gains in efficiency.
FORMAT Procedure
SAS formats have always provided a powerful way to look at your data in different ways. SAS offers a rich
collection of formats and the FORMAT procedure allows you to create custom formats. In this section, we
will look at how to create and apply user-defined formats in SAS Viya.
Formats in SAS 9
Let’s start by looking at code that works in SAS 9. In Program 2.17, we use PROC FORMAT to create a
user-defined format named PRICERANGE_SAS that groups the cost of cars into four categories. Creating
a format won’t do much good until we apply it, which is done by the FORMAT statement in the DATA
step that follows.
data cars_formatted;
set sashelp.cars;
format MSRP pricerange_sas.;
keep Make Model MSRP MPG_Highway;
run;
After running Program 2.17, we see MSRP displayed as the formatted values in Output 2.17.
In the SAS 9 world, formats are stored in catalogs in SAS libraries. In Program 2.17, the
PRICERANGE_SAS format was saved in the default WORK library and FORMATS catalog.
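The PROC FORMAT step of Program 2.17 is not reproduced here. The following is a sketch of a four-category format, to be run before the DATA step that applies it; the cutoff values and labels are illustrative assumptions, not the values used in the original program.

```sas
/* With no LIBRARY= option, the format is stored in WORK.FORMATS */
proc format;
   value pricerange_sas
      low   -< 20000 = "Economy"     /* cutoffs and labels are assumptions */
      20000 -< 40000 = "Mid-range"
      40000 -< 70000 = "Premium"
      70000 -  high  = "Luxury";
run;
```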
CAS-enabled procedures can also take advantage of user-defined formats. But when the code and data run
in CAS, the user-defined formats must be available via a CAS format library. Program 2.18 shows how the
code will change when creating and applying formats to in-memory CAS tables. The end goal is to use a
CAS procedure PROC MDSUMMARY to compute summary statistics on an in-memory table for each
formatted value of MSRP.
Program 2.18
cas mysess sessopts=(caslib=casuser);
libname mycas cas;
data mycas.cars_formatted;
set sashelp.cars;
format MSRP pricerange_cas.;
keep Make Model MSRP MPG_Highway;
run;
This code starts the CAS session and assigns a libref to your caslib.
Here we use PROC FORMAT to build the custom format, PRICERANGE_CAS. The only difference
between this step and the SAS 9 example in Program 2.17 is the addition of the CASFMTLIB= option.
This option is used to add the format to a CAS format library named casformats. Now this
PRICERANGE_CAS format will be available in the CAS session to use with in-memory data.
In previous programs (Programs 2.11 and 2.14), we used the CASUTIL procedure to load data into CAS.
This time we are using an alternative method – the DATA step. In the DATA step, we use a FORMAT
statement to apply the PRICERANGE_CAS format to MSRP. Notice that this DATA step is exactly
the same as Program 2.17 with one small but significant difference: the DATA statement names a
CAS table as the output source. So this step executes on the SAS workspace server, reads the
sashelp.cars table from disk, assigns a CAS format, and loads the data in CAS. In future steps when
this cars_formatted CAS table is referenced, the assigned PRICERANGE_CAS format can be pulled
from the CAS format library.
The results of Program 2.18 are shown in Output 2.18. The cars_summary table created by PROC
MDSUMMARY shows the statistics based on the formatted values of MSRP.
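The PROC FORMAT and PROC MDSUMMARY steps described above might look like the following minimal sketch. The price ranges and labels are assumptions made for illustration and are not taken from Program 2.18. The PROC FORMAT step would need to run before the DATA step that applies the format.

```sas
/* Sketch only: the range boundaries and labels are hypothetical. */
/* Assumes the CAS session and mycas libref from Program 2.18.    */
proc format casfmtlib="casformats";   /* add the format to a CAS format library */
   value pricerange_cas
      low   -  25000 = "Low"
      25000 <- 50000 = "Mid"
      50000 <- high  = "High";
run;

/* Summarize MPG_Highway for each formatted value of MSRP, in CAS. */
proc mdsummary data=mycas.cars_formatted;
   var MPG_Highway;
   groupby MSRP / out=mycas.cars_summary;
run;
```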
The notes in the log also confirm that MDSUMMARY was processed by CAS. If we returned to Program
2.18 and changed the format to PRICERANGE_SAS, the format created by the SAS workspace server,
instead of PRICERANGE_CAS, the log would show that the DATA step ran successfully on the workspace
server. But PROC MDSUMMARY would fail because CAS cannot load the PRICERANGE_SAS format.
Steps that run in CAS must reference formats in a CAS format library.
The PRICERANGE_CAS format that we created is temporary and will persist only for the duration of our
CAS session. You can save permanent formats for personal use or promote formats so that they can be
used by other people or sessions.
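As a hedged sketch of what saving and promoting could look like, the CAS statement offers SAVEFMTLIB and PROMOTEFMTLIB options. The table name pricefmts is hypothetical, and you should check the CAS statement documentation for your release before relying on these options.

```sas
/* Sketch only: pricefmts is a hypothetical table name.              */
/* Save the casformats format library to a table in the active caslib */
cas mysess savefmtlib fmtlibname=casformats table=pricefmts replace;

/* Promote the format library so that other sessions can use it */
cas mysess promotefmtlib fmtlibname=casformats replace;
```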
Code Snippets
Has there ever been a block of code that you use so infrequently that you always seem to forget the options
that you need? Conversely, has there ever been a block of code that you use so frequently that you grow
tired of typing it all the time? Code snippets can greatly assist with both of these scenarios. In this section,
we discuss using pre-installed code snippets and creating new code snippets within SAS Viya.
If you double-click a pre-installed code snippet, or if you click and drag the snippet into the code editor
panel, then the snippet will appear in the panel.
Snippets can range from very simple to very complex. Some contain comments. Some contain macro
variables. Some might be only a couple of lines of code. That is the advantage of snippets. They can be
anything that you want them to be.
A window will appear that asks you to name the snippet. Naming the snippet then saves it into the My
Snippets area in the left navigation panel for future use.
Remember that snippets are extremely flexible. The code that you save does not have to be fully
executable. Instead of supplying the data source in your code, you may instead include notes or comments
about what needs to be added, which makes the code more general, but it is still a very useful snippet.
To use one of your saved snippets, simply navigate to the My Snippets area, then double-click on your
snippet or drag it into the code window.
Resources
This chapter is based on the “An Introduction to SAS Viya Programming for SAS 9 Programmers” videos
in “SAS® Viya® Enablement,” a free course available from SAS Education.
You may find the following documentation helpful as you learn more about programming in SAS® Viya®:
Introduction
As you have seen in the previous chapter, you can use all of your existing SAS programming knowledge in
SAS Viya to more efficiently and effectively analyze your data. SAS Viya also offers several new
procedures that are available for sites with SAS Viya installed. You can see a complete list of the
procedures with explanations in the documentation.
In this chapter, we will look at several new procedures that are available in SAS Viya to aid in your data
wrangling and statistical programming. At the same time, we will look at the tasks in SAS Studio that can
help you code some of these procedures and compare the features available in SAS Viya with the features
of procedures you might already be using in SAS 9.
A predefined task in SAS Viya is an XML and Apache Velocity code file that generates SAS code and
formats results for you. Tasks include SAS procedures from simple data listings to complex analytical
procedures. You might already be familiar with tasks in SAS Studio that use SAS 9. If you license SAS
Visual Analytics, SAS Visual Data Mining and Machine Learning, or any other SAS Viya products, you
will have access to certain predefined tasks in SAS Studio.
This chapter is by no means a comprehensive guide to all of the new procedures, tasks, or features in SAS
Viya. It is a gentle introduction to using SAS Viya for your statistical programming so that you can begin
to take advantage of the scalability, elasticity, and flexibility of this modern analytics platform.
Here are the procedures and SAS Studio tasks that will be discussed in this section:
The data source used in the examples contains a list of donors to an organization and has a target variable,
TARGET_B, which is a binary variable that has the value 1 for a person who has donated during a mailing.
The example data is available to download from the SAS Enterprise Miner 5.3 documentation at this link.
Data Exploration
To begin, in SAS Studio click on Tasks and Utilities on the left menu. Expand the Tasks and Prepare and
Explore folders, as shown in Figure 3.1. Your menu may look different depending on which SAS Viya
products your site has installed.
Summary
First, we will extract variable summary information from our data. To accomplish this, we will use the
Summary task. Double-click Summary from the Prepare and Explore folder in the left menu.
Chapter 3: Statistical Programming in SAS Viya 27
Then, select your data in the Data tab that opens in the Summary window. On the Options tab, we will
select Use custom value and leave the default at 20. Under Statistics, select All Statistics, as shown in
Figure 3.2.
Then, we will set the number of report observations to 500 in the Number of levels to show field. To run
this task, click on the Running Person at the top. To view your results in a larger window, you can click the
Maximize View option. The Variable Summary Report is shown in Figure 3.3. Scroll down to view the
Level Details report.
A level recommendation is provided for each variable. This recommendation is useful as a starting point.
For example, the recommended level for the variable MONTHS_SINCE_ORIGIN is Interval. On the
Level Details report, the column TARGET_B has levels 0 and 1, and level 1 is 25% of the total.
If you want to manually edit the code, click the Code tab to view the code. Then click Edit to edit the
code.
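The Summary task generates its code for you, so the exact program depends on your selections. As a rough sketch, a comparable variable summary could be requested with PROC CARDINALITY. The table names (mycas.donors, mycas.card, mycas.details) are hypothetical, and we assume that MAXLEVELS=20 corresponds to the custom value entered on the Options tab.

```sas
/* Sketch only: table names are hypothetical; assumes a mycas libref to CAS. */
proc cardinality data=mycas.donors
                 outcard=mycas.card        /* one row per variable, with a level recommendation */
                 outdetails=mycas.details  /* one row per level of each low-cardinality variable */
                 maxlevels=20;
run;

proc print data=mycas.card;
run;
```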
Sampling
Now, we will learn how to sample your data. To accomplish this, we will use the Sampling task.
Double-click Sampling from the Prepare and Explore folder in the left menu.
Select your data on the Data tab. In this example, we want to perform stratified sampling on TARGET_B,
so under Sampling Method, we will select Stratified Sampling and then add the stratification variable,
TARGET_B. Under Output Data Set, we will enter the output table name. Notice that the PROC
PARTITION code is being created as you select your options, as shown in Figure 3.4.
On the Options tab, we will use the default 50% sampling rate and arbitrarily enter 5000 for the random
seed. After you run the task, notice that in the Frequency report shown in Figure 3.5, 50% of the data has
been sampled and that TARGET_B has the same distribution as in the original data, which is a 25/75 split.
You can manually edit the program by clicking on the Code tab above the Frequency Report, then clicking
Edit.
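The code that the task writes will vary with your choices, but a stratified 50% sample along these lines could be requested roughly as follows. This is a sketch under stated assumptions: the table names mycas.donors and mycas.donors_sample are hypothetical.

```sas
/* Sketch only: table names are hypothetical; assumes a mycas libref to CAS. */
proc partition data=mycas.donors samppct=50 seed=5000;
   by TARGET_B;                                      /* stratify on the target */
   output out=mycas.donors_sample copyvars=(_all_);  /* keep all input columns */
run;
```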
Partitioning
Lastly, we will learn how to partition data. To accomplish this, we will use the Partitioning task.
Double-click Partitioning from the Prepare and Explore folder in the left menu. Select your data in the Data tab.
In this example, we will be performing stratified sampling on TARGET_B, so we will select Stratified
Sampling under the Sampling Method and then select the appropriate column. In the Partitions section, we
will enter 2 in the Number of Partitions field, and a 50% split in the Sampling percentage field. We will
arbitrarily enter 5000 for the random seed, as shown in Figure 3.6.
Under the Options tab, enter your output table name. Notice that the default name for the partition values is
given to you as _Partind_. After you run the task, notice that the column _PARTIND_ has the value 0 or 1,
as shown in Figure 3.7. The value 1 is for the training data, and 0 is for the validation data. Also notice the
50/50 split for the partition data and the 50/50 split for TARGET_B, which is what we specified.
You can manually edit the program by clicking on the Code tab above the Report, then clicking Edit.
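For comparison with the sampling example, a partition indicator can be requested instead of a subset. The following is a minimal sketch, assuming hypothetical table names and the PARTIND option of PROC PARTITION; the PROC FREQ step simply checks the split.

```sas
/* Sketch only: table names are hypothetical; assumes a mycas libref to CAS. */
proc partition data=mycas.donors partind samppct=50 seed=5000;
   by TARGET_B;                                    /* stratify on the target           */
   output out=mycas.donors_part copyvars=(_all_);  /* adds the _PartInd_ column (0 or 1) */
run;

/* Cross-tabulate the partition indicator against the target */
proc freq data=mycas.donors_part;
   tables _PartInd_*TARGET_B;
run;
```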
Data Transformation
In this section, you will learn how to use SAS Viya to perform data transformation tasks such as variable
imputation and variable binning. We will continue to use the same data source used in the previous
examples, which contains a list of donors to an organization. This source has a target variable,
TARGET_B, which is a binary variable with a value of 1 for a person who donated during a mailing.
Imputation
First, we will learn how to impute missing values. To accomplish this, we will use the Imputation task.
Double-click Imputation from the Prepare and Explore folder in the left menu.
Then, select your data in the Data tab that opens in the Summary window. Next, you need to select
variables for mean imputation. In this example, we will select DONOR_AGE for the mean imputation by
clicking on the + sign in the upper right corner of the first box in the Roles group. In the next box, use the
+ sign to select variables for median imputation. In this example, we are selecting INCOME_GROUP,
WEALTH_RATING, and MONTHS_SINCE_LAST_PROM_RESP, each of which contains missing
values.
On the Output tab, choose a name for the data set in the Data set name box. In this example, we will select
All variables to include in the input data set. The results after clicking Run are shown in Figure 3.8.
Notice that the generated variables are named by prepending IM_ to the original variable names. You can
manually edit the program by clicking on the Code tab above the Report, then clicking Edit.
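The selections above correspond roughly to a PROC VARIMPUTE step such as the following sketch. The table names are hypothetical; the variable names and the choice of mean versus median come from the example.

```sas
/* Sketch only: table names are hypothetical; assumes a mycas libref to CAS. */
proc varimpute data=mycas.donors;
   input DONOR_AGE / ctech=mean;                    /* mean imputation   */
   input INCOME_GROUP WEALTH_RATING
         MONTHS_SINCE_LAST_PROM_RESP / ctech=median; /* median imputation */
   output out=mycas.donors_imputed copyvars=(_all_); /* imputed columns get the IM_ prefix */
run;
```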
Binning
Now we will learn how to bin interval variables. To accomplish this, we will use the Binning task.
Double-click Binning from the Prepare and Explore folder in the left menu.
Select your data in the Data tab that opens in the Summary window. Next, you need to select the variables
that you want to bin in the Roles section. Click on the + sign to select variables. In this example, we are
selecting three variables: MONTHS_SINCE_ORIGIN, MONTHS_SINCE_LAST_GIFT, and
IM_DONOR_AGE.
Click the Options tab. The default method is Bucket Binning to bin each variable into 16 bins. Next, click
the Output tab. If you want to create a data set from your binned data, click the check box to turn on that
feature. Then specify the name of your new data set. In this example, we will select the options to include
all variables and also show output, as shown in Figure 3.9.
After you run the task, you will see the Bin Details report, which gives information about each of the 16
bins for each variable. Below that, you can scroll down to view the output table. Notice that the three binned
variables that we selected earlier have bin_ prepended to their names and that they contain the bin ID
values.
You can manually edit the program by clicking on the Code tab above the Report, then clicking Edit.
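A bucket-binning request like the one configured in the task could be written along these lines. This is a sketch, assuming hypothetical table names and that the imputed table from the previous example is the input.

```sas
/* Sketch only: table names are hypothetical; assumes a mycas libref to CAS. */
proc binning data=mycas.donors_imputed numbin=16 method=bucket;
   input MONTHS_SINCE_ORIGIN MONTHS_SINCE_LAST_GIFT IM_DONOR_AGE;
   output out=mycas.donors_binned copyvars=(_all_);  /* binned columns hold the bin IDs */
run;
```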
A common reason for performing variable reduction is that the size of your sample cannot support the
number of effects available for modeling. Eliminating redundancy in your effects before modeling is the
goal. Principal Component Analysis is an option, but it is a dimension reduction technique, not a variable
reduction technique. If you need to be able to interpret the model inputs, principal components are not a
good choice. Another technique called variable clustering is available in SAS 9. It is performed by the
VARCLUS procedure.
In SAS Viya, you can use the VARREDUCE procedure, one of the SAS Visual Data Mining and Machine
Learning statistics procedures. PROC VARREDUCE uses an entirely different algorithm to perform
unsupervised variable reduction. It can perform supervised variable reduction as well. Let’s look at an
example of unsupervised variable reduction that shows how you can achieve the same goals that you can
with PROC VARCLUS.
SAS 9 Code
For this example, we will use the getStarted data set, which is available from the SAS documentation. The
variables created are the generically named X1 through X10 for the numeric inputs, Y for the binary target,
and C for the categorical input variable. The binary target is needed only when you perform supervised
data reduction.
Program 3.1 shows an example of code for variable reduction using PROC VARCLUS in SAS 9.
This code is fairly simple. We are specifying two stopping criteria. One is the MAXEIGEN criterion,
which stops the algorithm when all clusters at a step have second eigenvalues below the specified
value. The other is the MAXCLUSTERS criterion, which stops the algorithm when the specified
number of clusters is reached.
The output from running this code shows a summary of the number of clusters along with the members of
each cluster. The One minus R-squared table helps you select the most representative variable from each
cluster.
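A PROC VARCLUS step with the two stopping criteria described above might look like this minimal sketch. The data set name work.getStarted and the threshold values 0.7 and 4 are assumptions for illustration, not the values used in Program 3.1.

```sas
/* Sketch only: the data set name and both threshold values are hypothetical. */
proc varclus data=work.getStarted
             maxeigen=0.7      /* stop when every cluster's second eigenvalue is below 0.7 */
             maxclusters=4;    /* or stop when four clusters have been formed              */
   var x1-x10;                 /* numeric inputs only */
run;
```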
We specify that we would like to use variance analysis to select variables. Using this method, PROC
VARREDUCE continues to add variables until a specified proportion of the total variance is reached.
The CLASS statement creates dummy variables for the variable C. Unlike PROC VARCLUS, PROC
VARREDUCE accepts both categorical and interval inputs.
Next is the REDUCE statement, which specifies that we are performing unsupervised variable
reduction. Two stopping criteria are also mentioned here. The MAXEFFECTS option stops the
algorithm when the specified number of effects are added. The VAREXP option stops the algorithm
when the proportion of total variance is achieved.
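The statements described above can be sketched as follows. The values MAXEFFECTS=5 and VAREXP=0.9 follow the discussion of the output; the table name mycas.getStarted is an assumption.

```sas
/* Sketch only: assumes the getStarted data has been loaded as mycas.getStarted. */
proc varreduce data=mycas.getStarted technique=var;   /* variance analysis */
   class c;                                           /* categorical input */
   reduce unsupervised x1-x10 c / maxeffects=5 varexp=0.9;
run;
```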
There isn’t much output from the procedure. The selection summary table in Output 3.2 shows the
progression of proportion of variance explained as each parameter is added. Notice that dummy
variables for C were added individually. The proportion of variance never reached 0.9 because the
algorithm already added the maximum of five effects. Other summary fit statistics are reported in the
selection summary table. In addition to accepting categorical inputs, PROC VARREDUCE can also assess
interaction, polynomial, and nested effects.
The first statement of Program 3.3 lets us view the data. The data we have in this example consists of a
binary label, ten numeric features, and one categorical feature, as shown in Output 3.3a.
The next statement in the code lets us view the categorical feature in more detail. Our categorical feature
has nine levels, as seen in Output 3.3b.
We can use the categorical feature in the VARREDUCE procedure without needing to explicitly create
a separate encoding for it as a prior step. The VARREDUCE procedure automatically creates a one hot
encoding of this feature during processing.
After invoking the VARREDUCE procedure, we name the input data set and specify the technique that
we want to employ for dimension reduction. In this instance, we used variance analysis.
The CLASS statement indicates the features for which special encoding is to be conducted before the
application of the reduction technique.
The REDUCE statement specifies the type of selection to conduct – unsupervised in this case – and the
features to be considered in the feature selection process. The REDUCE options, which are found after
the forward slash in the statement, control the number of variables to be selected. Here we indicate a
desire for either a maximum of 5 features or the number of variables that explain at least 95% of the
variance in the data, whichever comes first. The output for this procedure is shown in Output 3.3c.
A summary of the selection process is presented in Output 3.3c. We can see that five variables have been
selected.
We invoke the VARREDUCE procedure, as we did in the unsupervised case in Program 3.3. But
here we switch the technique employed to discriminant analysis.
We add our binary label to the CLASS statement so that encoding is conducted.
In the REDUCE statement, we specify that the type of selection to use is now supervised and the binary
label is the outcome that we are interested in modeling.
We add an ODS OUTPUT statement to save summary statistics in an output data set.
Let’s run Program 3.4 and view the output in Output 3.4.
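A supervised request along the lines just described might look like the following sketch. The table name mycas.getStarted, the MAXEFFECTS value, and the output data set name work.summary are assumptions; SelectionSummary is assumed to be the name of the ODS table that holds the selection summary.

```sas
/* Sketch only: table and data set names are hypothetical. */
proc varreduce data=mycas.getStarted technique=dsc;   /* discriminant analysis */
   class y c;                                         /* encode the binary label and the categorical input */
   reduce supervised y = x1-x10 c / maxeffects=5;
   ods output SelectionSummary=work.summary;          /* save summary statistics for later plotting */
run;
```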
A summary of the selection process is presented in Output 3.4 and we can see that 5 variables have been
selected. We can make use of the output summary statistics by creating a graphic that reveals the amount
of variation explained at each iteration of the selection process. The code to create this graphic is given in
Program 3.5 and the graphic is shown in Output 3.5.
Unsupervised Learning
In this section, you will learn how to use SAS Viya to perform unsupervised learning tasks, including
Clustering and Principal Component Analysis (PCA).
Here are the procedures included in the SAS Studio tasks that will be discussed in this section:
K-Means Clustering
In this section, you will learn about k-means clustering, which falls under the umbrella of unsupervised
learning. The goal of clustering is to group observations into categories such that within-cluster or group
variability is minimized and between-cluster variability is maximized. In the end, every observation
belongs to exactly one cluster.
Let’s look at this technique more closely with an example. For this example, we will use the cars data set in
the SASHELP library. It contains information about cars, including their makes, models, manufacturers,
and other characteristics. The first step is to load the cars data set from the SASHELP library into the
MYCASLIB library. Note that dummy variables are created for the nominal inputs, Origin and DriveTrain.
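One way to carry out this loading step is sketched below. The dummy variable names and the contents of the input_vars macro variable are assumptions made for illustration; only the libref name, the source table, and the two nominal inputs come from the text.

```sas
/* Sketch only: assumes an active CAS session; dummy names are hypothetical. */
libname mycaslib cas caslib=casuser;

data mycaslib.cars;
   set sashelp.cars;
   /* 0/1 dummy variables for the nominal inputs */
   origin_asia   = (Origin = 'Asia');
   origin_europe = (Origin = 'Europe');
   origin_usa    = (Origin = 'USA');
   drive_all     = (DriveTrain = 'All');
   drive_front   = (DriveTrain = 'Front');
   drive_rear    = (DriveTrain = 'Rear');
run;

/* Hypothetical list of interval inputs for later procedure calls */
%let input_vars = MSRP Invoice EngineSize Cylinders Horsepower MPG_City
                  MPG_Highway Weight Wheelbase Length
                  origin_asia origin_europe origin_usa
                  drive_all drive_front drive_rear;
```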
To do k-means clustering, you can use a SAS Studio task or run code. We will look at both methods,
starting with the task.
Select the cars data set in the MYCASLIB library, then select all available interval inputs in the data by
clicking on the + sign next to the interval inputs window and selecting all variables.
Go to the Options tab and perform these actions as shown in Figure 3.11: select the Replace Missing Values
with the Mean check box, select Standard Deviation from the Standardization method menu, and specify 4 as
the number of clusters. In SAS Studio, the code window on the right generates the necessary SAS code to
perform the k-means clustering. Run the task by clicking the running person icon.
Running the task creates four clusters with various descriptive model and cluster statistics on the Results
tab, as shown in Figure 3.12.
For this example, we will use the data set INPDATA created from the SAS Viya documentation for PROC
KCLUS. The names of the variables are generic, with the quantitative variables X and Y. To work with
this data in SAS Viya, add a caslib and load the data into CAS using a DATA step, or whichever method
you prefer.
SAS 9 Code
Before we look at the code for PROC KCLUS, let’s consider Program 3.6, which is an example of k-means
clustering using PROC FASTCLUS in SAS 9.
We often want to standardize input variables before clustering. In SAS 9, we have to do this outside of
PROC FASTCLUS. We can use the STDIZE procedure to perform range standardization and then use
PROC FASTCLUS on the standardized data.
By default, PROC FASTCLUS uses Euclidean distance and a maximum of one iteration. We are asking
for a maximum of 10 iterations in this code.
Another default option is to use the first k complete data points as initial seeds for clustering. Here,
however, we ask for random data values to act as cluster seeds, using the REPLACE=RANDOM option.
In this code, we use a random seed of 1, using the option RANDOM=1.
The OUT= option in PROC FASTCLUS requests an output data set containing all variables from the
input data set, plus a cluster assignment and a distance to cluster variable.
The VAR statement lists X and Y as the input variables.
The FREQ statement names the variable containing the frequency represented by a row of data.
After clustering is performed on the standardized data in PROC FASTCLUS, the clustering variables
should be unstandardized. This can be done using PROC STDIZE one more time with options for
unstandardizing.
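The SAS 9 workflow described above can be sketched as three steps. The data set names, the frequency variable name count, and MAXCLUSTERS=3 are assumptions for illustration and are not taken from Program 3.6.

```sas
/* Sketch only: data set names, the count variable, and MAXCLUSTERS=3 are hypothetical. */

/* 1. Range-standardize the inputs and keep the statistics for later */
proc stdize data=work.inpData method=range
            out=work.stdData outstat=work.stdStats;
   var x y;
run;

/* 2. Cluster the standardized data */
proc fastclus data=work.stdData maxclusters=3 maxiter=10
              replace=random random=1 out=work.clusOut;
   var x y;
   freq count;
run;

/* 3. Return the clustering variables to their original scale */
proc stdize data=work.clusOut method=in(work.stdStats) unstdize
            out=work.clusFinal;
   var x y;
run;
```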
You don’t need to standardize the data before you use the KCLUS procedure. Instead, you can request
standardization using the STANDARDIZE option. The only options available are no standardization,
range standardization, and standard deviation standardization.
By default, PROC KCLUS selects initial seeds using random selection from the input data. Here we are
using the same randomization seed of 1 that we used in the PROC FASTCLUS code.
PROC KCLUS has an INPUT statement instead of a VAR statement.
The same coding for the FREQ statement is used here as in the PROC FASTCLUS code.
An output data set is created using a SCORE statement. To retain all of the variables from the input data
set in the scored data set, use the COPYVARS= option.
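A PROC KCLUS step that follows this description might look like the sketch below. As before, the table names, the frequency variable name count, and MAXCLUSTERS=3 are assumptions.

```sas
/* Sketch only: table names, the count variable, and MAXCLUSTERS=3 are hypothetical. */
proc kclus data=mycas.inpData standardize=range seed=1 maxclusters=3;
   input x y;                                   /* INPUT replaces the VAR statement */
   freq count;
   score out=mycas.clusOut copyvars=(_all_);    /* keep all input variables         */
run;
```

Note that standardization happens inside the procedure, so the two extra PROC STDIZE steps from the SAS 9 workflow are not needed.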
The output of the PROC KCLUS code in SAS Viya looks very similar to the PROC FASTCLUS output.
The frequencies of the observations assigned to each cluster are very similar to what we would get from
PROC FASTCLUS using similar options.
You can produce plots using output data from PROC KCLUS and the same statistical graphics procedures
that you might be familiar with in SAS 9, such as PROC SGPLOT, as we will see in the example in the
next section. Be careful if you are working with very large data sets, because producing plots requires a
great deal of processing.
One of the main unknowns in the clustering algorithm is K, or the number of clusters to create. If you
have previous experience with your data and know how many clusters to create, then specify that
number using the MAXCLUSTERS= option. Otherwise, use NOC=ABC to request the aligned box
criterion (ABC) for the initial estimate of K. The ABC technique has additional parameters to fine-tune,
so read the documentation for more details.
When using ABC to figure out K, the ABCResults and ABCStats ODS tables show the statistics used to
choose the best K. The ABC statistics plot in Output 3.8 suggests the number of clusters that will be a good
starting point.
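A request for the ABC estimate might be sketched as follows. This example assumes that NOC=ABC can be specified without suboptions, that MAXCLUSTERS= then acts as the upper bound on K, and that the cars data is available as mycaslib.cars; the list of inputs is a hypothetical subset.

```sas
/* Sketch only: the input list and the upper bound of 10 are hypothetical. */
proc kclus data=mycaslib.cars standardize=std impute=mean
           noc=abc maxclusters=10;
   input MSRP Horsepower EngineSize Weight MPG_City MPG_Highway;
run;
```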
You can also try different Ks for a more complete coverage or until you solve the business problem at
hand. In Program 3.9, we will pick a specific K and rerun PROC KCLUS to get various cluster statistics
that can be later used for generating plots.
Program 3.9
proc kclus data=mycaslib.cars
standardize=STD impute=MEAN distance=EUCLIDEAN
maxiters=50 maxclusters=6;
input &input_vars;
output out=mycaslib.cars_scored copyvars=(_all_);
data mylib.clus_clustersum;
set mylib.clus_clustersum;
clusterLabel = catx(' ', 'Cluster', cluster);
run;
proc template;
define statgraph simplepie;
begingraph;
entrytitle "Cluster Frequency";
layout region;
piechart category=clusterLabel response=frequency;
endlayout;
endgraph;
end;
run;
The CODE statement saves SAS DATA step score code to the cluster1.sas file. This can later be used to
score new data.
The pie chart in Output 3.9a shows the cluster frequency and the bar chart in Output 3.9b shows the profile
of car types: hybrid, sedan, SUV, sports, truck, and so on, in each of the clusters.
You can draw similar plots to profile other characteristics, and you can re-adjust K until you are satisfied
with the solution.
Let’s look at this technique more closely with an example. For this example, we will use the cars data set in
the SASHELP library. It contains information about cars, including their makes, models, manufacturers,
and other characteristics. The first step is to load the cars data set from the SASHELP library into the
MYCASLIB library. Note that dummy variables are created for the nominal inputs, Origin and DriveTrain.
To perform PCA, you can use a SAS Studio task or run code. We will look at both methods, starting with
the task.
Select the cars data set in the MYCASLIB library, then select all available interval inputs in the data by
clicking on the + sign next to the interval inputs window and selecting all variables. Go to the Options tab
and perform these two actions as shown in Figure 3.13: select Use Custom Value from the Number of
Components menu and then specify 7 for the Custom Number of Components. In SAS Studio, the code
window on the right generates the necessary SAS code to perform PCA.
Run the task by clicking the running person icon. This creates seven principal components with various
descriptive and model statistics on the Results tab. The number 7 was chosen for this example because the
first 7 principal components explain more than 95% of the variance in this data.
SAS 9 Code
In this example, we will use the Heart data set from the Sashelp library of data sets that are included in
SAS. The numeric variables for the analysis are both demographic and health measures. Take a look at
Program 3.10, which uses PROC PRINCOMP in SAS 9.
The programming syntax is fairly simple. We name the data set and specify an output data set to
contain our component scores.
Here we list the input variables. The VAR statement uses a shorthand notation for the list of all
variables in creation order from AgeAtStart to Smoking.
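A PROC PRINCOMP step matching this description could be sketched as follows. The output data set name work.heart_scores is an assumption; the double-dash list is the shorthand for all variables in creation order between the two named variables.

```sas
/* Sketch only: the output data set name is hypothetical. */
proc princomp data=sashelp.heart out=work.heart_scores;
   var AgeAtStart--Smoking;   /* all variables from AgeAtStart through Smoking */
run;
```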
We read the Heart data from our active caslib. We can ask for a scree plot and a pattern profile plot
using the PLOTS=ALL option.
The VAR statement is identical to the one used in the SAS 9 code.
One difference between PROC PCA and PROC PRINCOMP is that the output data set is requested in
an OUTPUT statement in PROC PCA, whereas it is requested in the PROC statement in PROC
PRINCOMP. Another difference is that by default, PROC PCA does not include variables from the
input data set. However, they can be requested using the COPYVARS option.
In addition, we are including a CODE statement. The CODE statement produces a file of code for
creating component scores for new data in a SAS DATA step. With PROC PRINCOMP, such a file
would have to be created manually using output from the procedure.
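The PROC PCA statements described above might be combined as in this sketch. The table names, the score file path, and the DATA step used to load the Heart data are assumptions for illustration.

```sas
/* Sketch only: table names and the file path are hypothetical; assumes a mycas libref to CAS. */
data mycas.heart;
   set sashelp.heart;          /* load the Heart data into CAS */
run;

proc pca data=mycas.heart plots=all;
   var AgeAtStart--Smoking;
   output out=mycas.heart_scores copyvars=(_all_);  /* input variables are not copied by default */
   code file="/tmp/pca_score.sas";                  /* DATA step score code for new data          */
run;
```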
The output from PROC PCA looks very much like PROC PRINCOMP output. You will get a table of
simple statistics and a correlation matrix table. Next, you will see the scree plot, followed by the
eigenvalues table and the table of eigenvectors. The pattern profile plot is the same as the one produced by
PROC PRINCOMP.
It is worth noting that although PROC PRINCOMP is limited to principal component analysis using the
method of eigen decomposition, PROC PCA allows for three other methods as well. Another difference is
that PROC PCA can produce only a limited set of plots compared with PROC PRINCOMP. However, you
can produce other plots using output data sets and statistical graphics procedures, such as PROC SGPLOT.
var &input_vars;
run;
This statement uses PROC PCA to perform principal component analysis using Eigen decomposition
of the correlation matrix. The PLOTS= option draws various plots including the scree and variance
explained plots.
Using the scree plot and the eigenvalues output table, you can decide how many principal components to
keep. In this example, we will choose seven principal components because they explain 95% of the
cumulative variance in the data, as shown in Output 3.12.
What Program 3.12 accomplished was to reduce the original 16 input variables to 7 without much loss of
information.
In Program 3.13, we will re-run PCA, this time to create only 7 principal components.
var &input_vars;
input &input_vars;
run;
Supervised Learning
In this section, you will learn how to use SAS Viya to perform supervised learning tasks. We will not be
looking at the tasks in SAS Studio; instead, we will focus on the differences between the new SAS Viya
procedures and the previous SAS 9 code. Here are the procedures that will be discussed in this section:
Linear Regression
In this section, you will learn how to build regression models using PROC REGSELECT, which is one of
the SAS Visual Data Mining and Machine Learning statistical procedures in SAS Viya. If you have used
PROC REG or PROC GLMSELECT in SAS 9 to perform linear regression, then keep reading to see how
the new procedure compares.
For this example, we will use the simulated data set ANALYSISDATA created from the SAS Viya
documentation. The names of the variables are generic, with the quantitative variables X1 through X20 and
categorical inputs C1 through C3. The target, or response variable, is named Y. We want to fit and validate
a model for a target, Y, using training and validation data. PROC REGSELECT will partition the data into
training and validation samples. PROC GLMSELECT in SAS 9 has the same functionality. However,
PROC REG does not have a CLASS statement for creating dummy variables for categorical inputs, nor
does it have functionality for using validation data. Let’s compare some code for a predictive regression
model in PROC GLMSELECT in SAS 9 with equivalent code in PROC REGSELECT in SAS Viya.
The PROC statements use only a DATA= option and are identical.
The PARTITION statements use exactly the same options for partitioning the data, in this case using a
variable named Role in the data set.
The CLASS statements are identical. The options for parameterizations are also the same.
The MODEL statements start out identical as well. The model we are specifying includes all main
effects and all two-way interactions. One difference is how the selection criteria are specified in the
two procedures. In PROC GLMSELECT, the selection options are requested in the MODEL
statement, whereas in PROC REGSELECT, they are selected in a separate statement – the
SELECTION statement.
In PROC REGSELECT, the SELECTION statement options for selection, stopping, significance levels,
and choosing the final model are the same as the MODEL statement selection options in PROC
GLMSELECT.
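The comparison above can be sketched with a pair of programs. The Role values 'TRAIN' and 'VAL', the table names, and the choice of forward selection are assumptions for illustration; the 0.1 entry level follows the discussion of the results.

```sas
/* Sketch only: Role values, table names, and the selection method are hypothetical. */

/* SAS 9: selection options go in the MODEL statement */
proc glmselect data=work.analysisData;
   partition rolevar=Role(train='TRAIN' validate='VAL');
   class c1 c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10|
             x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
         / selection=forward(select=sl stop=sl sle=0.1 choose=validate);
run;

/* SAS Viya: the same options move to a SELECTION statement */
proc regselect data=mycas.analysisData;
   partition rolevar=Role(train='TRAIN' validate='VAL');
   class c1 c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10|
             x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2;
   selection method=forward(select=sl stop=sl sle=0.1 choose=validate);
run;
```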
After you run Program 3.16 to get the results of the PROC REGSELECT procedure, at the top of the
results, you will see information about the model, the selection options, the number of observations in each
data partition, the CLASS statement variables, and the number of effects and parameters. A selection
summary comes next. By default, no details are displayed for each step in the selection, although those
details can be requested. Eighteen effects were entered based on the 0.1 significance level criterion. The
asterisk by the Validation Average Squared Error value shows that the model at step 10 has the lowest
value and is, therefore, the winner.
The next set of tables shows details about the selected model. The information is similar to the information
provided by PROC GLMSELECT. The last table in the PROC REGSELECT output is a task timing table.
This table is not in the PROC GLMSELECT output. Because SAS Viya is intended for use with very large
data sets, this information might be of interest to you.
In SAS Viya, plots and diagnostic plots are not available within the procedures themselves. However, you
can produce plots using output data from PROC REGSELECT and the same statistical graphics procedures
that you might have used in SAS 9, such as PROC SGPLOT.
Logistic Regression
In this section, you will learn how to perform logistic regression model building using PROC
LOGSELECT, which is one of the SAS Visual Data Mining and Machine Learning statistical procedures
in SAS Viya. If you have used PROC LOGISTIC in SAS 9 to perform logistic regression, then keep
reading to see how the new procedure compares.
For this example, we will use the simulated data set GETSTARTED created from the SAS Viya
documentation. The names of the variables are rather generic, with quantitative variables X1 through X10
and one categorical input, C. The target, or response variable, is named Y and it is binary, coded as 0 or 1.
In contrast to PROC LOGISTIC in SAS 9, PROC LOGSELECT, like all procedures in SAS Viya, can
perform calculations in multiple threads and is designed to run on a cluster of machines that distributes the
data and calculation. However, the programming and results are very similar between the two.
Let’s look at an example of coding for a logistic regression model using PROC LOGISTIC and code that
produces similar results in PROC LOGSELECT. We want to build a model using forward selection that
predicts Y using both quantitative and categorical inputs.
PROC LOGSELECT also allows selection based on information criteria and LASSO. In
addition, PROC LOGSELECT allows the user to partition the data set into training, validation, and
test data sets, and to use validation error to select the best model in a sequence. To request the p-value
criteria for model selection in the SELECTION statement, we add suboptions to the
METHOD=FORWARD option. Those are SELECT=SL and STOP=SL, along with SLENTRY=.05,
which matches PROC LOGISTIC’s default entry criterion.
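As a minimal sketch of the comparison described above, the two calls might look like the following. The session name MYSESS, the libref MYCAS, and the location WORK.GETSTARTED are assumptions for illustration, and PARAM=GLM is added to the PROC LOGISTIC CLASS statement only so that the coding of C matches the PROC LOGSELECT default.

```sas
/* Assumed setup: start a CAS session and assign a CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

/* Assumed: GETSTARTED already exists in WORK; copy it into CAS */
data mycas.getstarted;
   set work.getstarted;
run;

/* SAS 9 style: forward selection with PROC LOGISTIC.
   SLENTRY=0.05 is the default entry criterion; it is written out for clarity. */
proc logistic data=work.getstarted;
   class C / param=glm;
   model Y(event='1') = C X1-X10 / selection=forward slentry=0.05;
run;

/* SAS Viya: comparable forward selection with PROC LOGSELECT,
   using significance levels to select effects and to stop */
proc logselect data=mycas.getstarted;
   class C;
   model Y(event='1') = C X1-X10;
   selection method=forward(select=sl stop=sl slentry=0.05);
run;
```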
After you submit the code for PROC LOGSELECT, you can look at the results. The descriptive
information in the first five tables is identical to the information displayed in PROC LOGISTIC. Model
selection details are reported in the Selection Summary table. By default, no details are displayed for each
step, although the details can be requested.
Two effects entered based on the 0.05 p-value criterion. PROC LOGSELECT reports only one test of the
global null hypothesis, the Likelihood Ratio test. PROC LOGISTIC adds the Score and Wald tests. PROC
LOGSELECT adds the AICC to the list of fit statistics provided by PROC LOGISTIC.
The Parameter Estimates table is identical to the one that PROC LOGISTIC produces. The last table in the
PROC LOGSELECT output is a task timing table. This table is not in the PROC LOGISTIC output.
Because SAS Viya is intended for use with very large data sets, this information might be of interest to
you.
A few more differences between PROC LOGSELECT and PROC LOGISTIC are helpful to know about.
Because SAS Viya procedures are used with very large data sets, fewer plots are available within the
statistical procedures. However, you can produce plots using output data from PROC LOGSELECT and
the same statistical graphics procedures you might have used in SAS 9, such as PROC SGPLOT. Also,
PROC LOGSELECT does not include options for post-fitting, such as ESTIMATE and CONTRAST
statements. PROC LOGSELECT, compared with PROC LOGISTIC, also uses different optimization
methods by default.
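One way to produce a plot from PROC LOGSELECT output is sketched below. It assumes the MYCAS libref and the GETSTARTED table from the earlier example; the output table name LOGSELECT_OUT and the variable name P_HAT are hypothetical.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

/* Write predicted probabilities to a CAS table, keeping the target Y */
proc logselect data=mycas.getstarted;
   class C;
   model Y(event='1') = C X1-X10;
   output out=mycas.logselect_out copyvars=(Y) pred=P_HAT;
run;

/* Plot the predicted probabilities by observed outcome with PROC SGPLOT */
proc sgplot data=mycas.logselect_out;
   vbox P_HAT / category=Y;
run;
```

Because PROC SGPLOT reads the output table rather than the model itself, the same pattern applies to any quantity that the OUTPUT statement can write.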
For this example, we will use the simulated data set GETSTARTED created from the SAS Viya
documentation. The names of the variables are rather generic, with five categorical inputs C1 through C5.
The target, or response variable, is named Y and it represents counts.
PROC GENSELECT, like all procedures in SAS Viya, can perform calculations in multiple threads and is
designed to run on a cluster of machines that distributes the data and calculations. However, the
programming and results are very similar to what you are used to in PROC GENMOD in SAS 9.
Before we start looking at the code for PROC GENSELECT, let’s look at an example of coding for a
Poisson regression model using PROC GENMOD, and then see how they compare. We want to build a
model for predicting Y, a count variable, using all of the categorical inputs.
The CLASS statements in the two procedures are identical. All parameterizations available in one
procedure are available in the other. We are using the default parameterization – GLM
parameterization – a less than full rank parameterization. This is also the default in PROC GENMOD.
The MODEL statements are nearly identical. A slight difference is that the distribution option in
PROC GENMOD is DIST, whereas you can use the full name, DISTRIBUTION, in PROC
GENSELECT.
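A minimal sketch of the two Poisson regression calls being compared might look like this. The libref MYCAS and the location WORK.GETSTARTED are assumptions; the variable names follow the description of the data set above.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

/* Assumed: GETSTARTED already exists in WORK; copy it into CAS */
data mycas.getstarted;
   set work.getstarted;
run;

/* SAS 9 style: Poisson regression with PROC GENMOD */
proc genmod data=work.getstarted;
   class C1-C5;
   model Y = C1-C5 / dist=poisson;
run;

/* SAS Viya: the same model with PROC GENSELECT */
proc genselect data=mycas.getstarted;
   class C1-C5;
   model Y = C1-C5 / distribution=poisson;
run;
```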
After you submit the code for PROC GENSELECT, the first three tables in the results, down to the Class
Level information table, are nearly identical to those displayed from PROC GENMOD, and they report
descriptive information. The Dimensions table comes next and describes the design matrix. The Fit
Statistics table is somewhat different from the one that’s produced by PROC GENMOD. There are no
reported deviance or Pearson chi-square fit statistics.
PROC GENMOD reports the value of the log likelihood and the full log likelihood, whereas PROC
GENSELECT reports the -2 log likelihood. The -2 log likelihood is simply -2 times the value of the full
log likelihood reported in PROC GENMOD.
The fit statistics tables from both procedures report the information criteria, AIC, AICC, and SBC. The
Parameter Estimates table from PROC GENSELECT does not show the confidence limits for the
parameter estimates, and it displays the parameter values differently than the table from PROC GENMOD.
Otherwise, the tables show the same results except for the task timing table.
As the name implies, PROC GENSELECT can also be used for model selection. PROC GENMOD doesn’t
have that functionality. In PROC GENSELECT, you can use many different selection methods and
stopping criteria for arriving at a best subset model. These include stepwise methods based on p-values or
fit statistics and final selection based on average squared error in the validation sample.
PROC GENSELECT can also be used to partition a data set into Training, Validation, and Test data sets. A
few more differences between PROC GENSELECT and PROC GENMOD are helpful to know about. In
SAS Viya, plots are not available within the procedures themselves. However, you can produce plots using
output data from PROC GENSELECT and the same statistical graphics procedures that you might have
used in SAS 9, such as PROC SGPLOT. Also, PROC GENSELECT does not include options for post-
fitting such as ESTIMATE and CONTRAST statements.
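The model selection and partitioning features described above can be sketched as follows. The libref MYCAS, the 30% validation fraction, and the seed are assumptions chosen for illustration.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

proc genselect data=mycas.getstarted;
   class C1-C5;
   model Y = C1-C5 / distribution=poisson;
   /* Hold out 30% of the rows as a validation sample (illustrative values) */
   partition fraction(validate=0.3 seed=12345);
   /* Stepwise selection; the final model is chosen by validation error */
   selection method=stepwise(choose=validate);
run;
```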
Resources
This chapter is based on the “An Introduction to SAS Viya Programming for SAS 9 Programmers” videos
in “SAS® Viya® Enablement,” a free course available from SAS Education.
You may find the following documentation helpful as you learn more about programming in SAS® Viya®:
Introduction
In this chapter you will learn about several applications that you can use to manage your data in SAS Viya.
First, we will look at SAS Data Explorer, which enables you to load, import, and profile your data. Then, we
will discuss SAS Data Studio, which helps you cleanse, prepare, and transform your data. Finally, we will
explore SAS Lineage Viewer, which enables you to manage and govern data assets and their relationships.
Connect to Data
In this section, you will learn how to use SAS Data Explorer to manage and discover data in the SAS Viya
environment using related tasks such as loading and analyzing data.
When you log into SAS Data Explorer through the Manage Data action, you will see the Choose Data
window. On the left, there are three tabs, as shown in Figure 4.1:
● Available – the Available tab shows you a list of all the tables that have been loaded into CAS
memory.
● Data Sources – the Data Sources tab shows you all of the CAS servers and data connections
available to you in your SAS Viya environment.
● Import – the Import tab enables you to import local files from your browser; social media such as
Twitter, Facebook, Google Analytics, or YouTube feeds; and Geo Enrichment data from Esri.
You will be presented with a Connection Settings wizard as shown in Figure 4.3. In the Type field, select
“File System,” and then choose the source type. If you want your connection to persist after you leave the
SAS Viya environment, select “Persist this connection beyond the current session.”
Chapter 4: Data Management in SAS Viya 55
Then, specify a Path that is local to your CAS Server where you can access the data. Click “Test
Connection” to make sure that the CAS Server can access that location. If the connection is successful, you
can click “Save.” When you return to the Data Sources tab, the data set connection should be added under the
server that you specified.
If you navigate into that connection by clicking on the arrow at the right, you can see the data sets in that
file system path location. A location can include SAS data sets, text files, Excel spreadsheets, and
delimited files.
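For readers who prefer code, a rough programmatic counterpart to a file system connection is a path-based caslib. This is a sketch only: the caslib name MYDATA and the path /data/shared/ are hypothetical, and the path must be readable by the CAS server.

```sas
/* Assumed setup: start a CAS session */
cas mysess;

/* Define a path-based caslib; by default it lasts only for this session */
caslib mydata datasource=(srctype="path") path="/data/shared/";

/* List the files that CAS can see in that location */
proc casutil;
   list files incaslib="mydata";
quit;
```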
When you click on a data set to select that table, you will be presented with the table information on a
Details tab that gives you the metadata about the columns including the names of the columns and their
data types, as shown in Figure 4.4.
If you select the Sample Data tab, you will see a sample of 100 rows of data so that you can review and see
what your data looks like and notice any inconsistencies that may need to be addressed before it can be
used for analytics. The Profile tab enables you to generate some basic metrics about the table.
In the Type field on the wizard, select “Database.” In the Select source type field, you will see a list of
databases. Choose the one that corresponds to the type of database that you want to import, as shown in
Figure 4.5. For example, to select an Oracle database, select Oracle from the list.
At the bottom of the wizard, fill out the fields to add your connection information. The connection
information is unique to each database. If you select another database from the list, then you will get
connection information specific to that database access.
After you have added your connection information, click “Test Connection.” Once the connection is
successful, click “Save.” When you return to the Data Sources tab, you will see your new database
connection listed.
When you double-click on the database connection, you will see the tables that are in your schema.
Clicking on a table will give you the option to view basic details about the columns and their data types as
well as sample data of the first 100 rows.
Import Data
Now that you have created a connection to your data, we will learn how to import SAS data sets and local
files into the SAS Viya environment and load them into memory.
After you select “Add to Import,” you will see the Import window on the right side of your screen. You can
specify your target location in the CAS environment where you want the distributed data to reside by using
the Target Destination option. You can change the Target table name as well. If you want to run this as a
job later on, you can select “Replace file” so that you can create a job and schedule it for later.
Because we are loading a SAS data set, you will get certain file specifications such as encryption
information. You can also use the Filter Rows so that instead of bringing in all of the data, you apply filters
to the data coming in as shown in Figure 4.7.
You can also select the column filter if you enable column selection in the “Select Columns” tab. By
default, all of the columns are selected. You can double-click columns or use the navigation arrows to pull
columns back to the available columns area, which removes them from the import, as shown in Figure 4.8.
The number of columns that you will be importing is indicated in the “Select Columns” header.
After you have finished filtering, select “Import Item.” The import loads the data into your SAS Viya
environment in distributed format, and also loads the data into your Viya memory. After the import is
complete, you can return to the Available tab on the left side of the wizard and you will see the table that
you imported with a lightning bolt, which indicates it is an in-memory table.
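The effect of an import with row and column filters can be approximated in code, although this sketch is not what the wizard runs internally. The librefs SRC and MYCAS, the path, the table name CUSTOMERS, and the columns CUST_ID, NAME, and STATE are all hypothetical.

```sas
/* Assumed setup: CAS session, CAS engine libref, and a Base SAS libref */
cas mysess;
libname mycas cas caslib=casuser;
libname src "/data/shared";

/* Load only selected columns and rows into an in-memory CAS table */
data mycas.customers;
   set src.customers(keep=CUST_ID NAME STATE where=(STATE="NC"));
run;

/* Confirm that the table is now loaded in the caslib */
proc casutil;
   list tables incaslib="casuser";
quit;
```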
In this example, we are importing an Excel spreadsheet, so the Import tab will show certain file
specifications that are available for Excel spreadsheets, as depicted in Figure 4.10. The options in the top
half of the tab are the same as the previous example of importing a SAS data set, but you can also specify
which sheets you want to import, indicate a header row, and limit the range of imported columns. When
you have finished choosing all of your options, select “Import Item.”
Perform Actions
After you import an item successfully, you will see a green notification at the top of the screen. If you click
on “Action” in the notification, you will see options for other operations that you can do in your SAS Viya
environment, including
● Prepare Data – allows you to cleanse and wrangle data in SAS Data Studio
● Explore and Visualize Data – opens SAS Visual Analytics
● Build Models – lets you build models from the data
● Explore Lineage – opens SAS Lineage Viewer to explore object relationships
You can also access these features from the Actions drop-down menu in the upper right corner of your
screen. From this one SAS Data Explorer application, you can see the interoperability in the SAS Viya
Environment with applications that make seamless navigation possible. You will learn more about SAS
Data Studio and SAS Lineage Viewer later in this chapter.
Create Jobs
In a previous section, we mentioned creating jobs that can be scheduled in a production environment. To
create a job, go to the Import tab and right-click on any import job that you have created. Select “Create
job” as shown in Figure 4.11.
In the Create Job window that appears, you will see a default job name with a unique ID at the end. You
have the ability to change this job name if you want and enter in a description. After you click OK to create
the job, you can then go into the SAS Viya Environment Manager under the scheduling page and schedule
that job to run based on specific triggers. Later on, you can go into SAS Job Monitor and check the status
of those jobs as they run.
Profile Data
In this section, you will learn how to use SAS Data Explorer to profile data in a SAS Viya environment to
determine data anomalies and inconsistencies. We will be using and reviewing related profile reports such
as descriptive measures and frequency distributions.
In a previous section, you learned how to import data. For each table, on the Details tab you are able to see
metadata about that table. On the Sample Data tab, you can see a sample of the first 100 rows. On the
Profile tab, you can run a report to see the columns and generate table metrics such as Unique, Null,
and Blank counts, Pattern counts, and statistics including mean, median, mode, standard
deviation, and standard error. See Figure 4.12.
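Some of the same measures can be requested directly from CAS. The sketch below uses actions from the simple action set; the table CUSTOMERS in the CASUSER caslib and the column STATE are hypothetical, and the output is not identical to the Profile report.

```sas
/* Assumed setup: start a CAS session */
cas mysess;

proc cas;
   /* Distinct and missing value counts for each column */
   simple.distinct / table={caslib="casuser", name="customers"};

   /* Descriptive statistics for the numeric columns */
   simple.summary / table={caslib="casuser", name="customers"};

   /* Frequency distribution for a single column */
   simple.freq / table={caslib="casuser", name="customers"},
                 inputs={"STATE"};
quit;
```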
If you want to review some of the data further, you can click on a column name in the Profile tab and bring
up column descriptions with descriptive metrics including Mode, Min, and Max as shown in Figure 4.13. If
you have redundancy in your data, then that may indicate issues. In the Pattern Distribution section, you
can see the various patterns that currently exist in your data. The Frequency Distribution graph allows you
to hover over the different bars to see certain values. On the far right-hand side, you will see some basic
column information such as data type and data length. Clicking on the “Report” link at the top will take
you back to the original Profile tab.
As you cleanse the data and create business rules, you may want to run the profile again with new business
rules in place that clean your data on the fly through machine learning or manual data cleansing processes.
SAS Data Explorer gives you the ability to compare reports on the same table. Click on the circular arrow
icon on the far right of the profile tab to view and compare versions of each report.
When you first enter SAS Data Studio, you have the option to create a New Plan or open an existing plan.
After you create a New Plan and choose your data from available tables loaded in CAS memory, on the left
side of the Data Studio window you will be presented with a list of Column Transforms, Custom
Transforms for submitting custom code such as CASL or DATA steps, Data Quality Transforms, Multi-
input Transforms, and Row Transforms. See Figure 4.14.
Column Transforms
Let’s look at an example of a column transform that can be performed in SAS Data Studio.
Split
If you have a table with a column that contains email addresses, you may want to break the email addresses
into two separate fields: sender and domain. In order to do this, we will use the Split option under Column
Transforms. Double-click “Split” to add the Split to your work palette, as shown in Figure 4.15.
You can then select the column in the table that you want to perform the split on in the Source Column
Field. You can split the data on a delimiter or other options in the Split data field. You can then choose the
delimiter in the Delimiter field. In this example, we would choose the Source Column Email, split the data
on a delimiter, and choose Other as the type of the delimiter, then specify the @ sign as the delimiter. The
split will create two new columns, one for the left and one for the right. If you want to change the names of
the new columns, click on the “Options for new columns” link where you can change the name, data type,
length, or apply a label or SAS format.
When you have chosen all of your options, click “Run” in the upper right. You will see your changes
applied in the table at the bottom of the screen. At this point, you have been wrangling the data in-memory
without having to create a new table.
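The logic of the Split transform can be illustrated with a short DATA step that uses the SCAN function. This is a sketch under assumptions: the libref MYCAS, the table CUSTOMERS, and the new column names EMAIL_1 and EMAIL_2 are hypothetical, and unlike the transform, this code writes a new table.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

data mycas.customers_split;
   set mycas.customers;
   length EMAIL_1 EMAIL_2 varchar(100);
   /* Text to the left of the @ delimiter */
   EMAIL_1 = scan(EMAIL, 1, '@');
   /* Text to the right of the @ delimiter */
   EMAIL_2 = scan(EMAIL, 2, '@');
run;
```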
Custom Transforms
You can continue to transform your data by adding another step. In this section, we will look at the two
types of custom transforms that are available in Data Studio.
Calculated Column
In this example, we will add another transformation by clicking on “Calculated column” and copying a
SAS IFC function into the Expression box, as shown in Figure 4.16. The IFC function evaluates a
logical expression, here whether CUST_ID is greater than 250, and returns one of two character values
depending on the result.
We need to create a new column to house the new data, so select the “Create new column” option below
the Expression field and re-name the column “Customer_Status”.
After you select Run, the Split Transform will run first, then the Calculated Column will run. You will see
the new column added to the table at the bottom of the screen.
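To show how an IFC expression behaves, here is a small sketch in DATA step form. The returned values "New" and "Existing" are hypothetical, as are the libref MYCAS and the table CUSTOMERS; in Data Studio, only the IFC expression itself would go in the Expression box.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

data mycas.customers_status;
   set mycas.customers;
   length Customer_Status varchar(10);
   /* IFC(condition, value-if-true, value-if-false) returns a character value */
   Customer_Status = ifc(CUST_ID > 250, "New", "Existing");
run;
```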
SAS Code
What if you have existing SAS DATA step code or new CAS language programming code (CASL) you
want to apply? You can double-click on Code in the Custom Transforms section. It adds another step that
allows you to specify DATA step code or CASL to call CAS action sets, such as those that impute missing values.
In this example, we will use DATA step code by copying and pasting existing code into the editor window
as shown in Figure 4.17. The code adds a new column called ACCT_DESC based on some criteria
from ACCT_TYPE.
As a brief aside, if you look at the code in Program 4.1, you will see that everything from the LENGTH
statement down to the RUN statement is the same as in SAS 9. What is different is the first two lines of code. In the first line, you
specify the variables for the output table and your CAS library that is being referenced. In the second line,
you specify the variables for the table you are reading in. These variables are needed because CAS works
in a distributed environment and it creates intermediate tables. Data Studio will keep track of all of these
intermediate tables for you.
Program 4.1
data {{_dp_outputTable}} (caslib={{_dp_outputCaslib}});
   set {{_dp_inputTable}} (caslib={{_dp_inputCaslib}});
   /* Declare the new column before assigning values to it */
   length ACCT_DESC varchar(20);
   /* Map each account type code to a description */
   if ACCT_TYPE="" then ACCT_DESC="Unknown";
   else if ACCT_TYPE="SAV" then ACCT_DESC="Savings";
   else if ACCT_TYPE="CHK" then ACCT_DESC="Checking";
   else if ACCT_TYPE="MM" then ACCT_DESC="Money Market";
run;
If you have existing DATA step code from SAS 9 that you want to leverage in Data Studio, just use the
variables in place of what you are currently using in your existing code.
After you click run, you can review the results in the table at the bottom of the screen. As seen in Figure
4.18, a new column called ACCT_DESC was created based on the criteria specified in Program 4.1.
Standardize
The Standardize data quality transform allows you to use the SAS Quality Knowledge Base to standardize
your data into a consistent form. When you profiled your data earlier in SAS Data Explorer, you may have
noticed that your data has some anomalies that need to be corrected. In this example, the data for State has
both 2-byte codes and full state names.
Double-click Standardize in the Data Quality Transforms section. Then you can select your source column.
The next field allows you to specify any name you want for your new column. In the next field, Locale,
you can choose the locale for the SAS Quality Knowledge Base you want to use. The SAS Quality
Knowledge Base, with its definitions and algorithms, supports over 40 different locales from around the
world.
In the next field, Definitions, there are a lot of different definitions that are available for different data
types. For this example, we will select State/Province (Abbreviation) because we want all of our State
values to be abbreviated. You can also specify length and options for new columns, similar to the options
in other transforms previously discussed.
After you click Run and review the results, you can see that in the new column, STATE_STND, the values
for STATE have been standardized into the 2-byte code, as shown in Figure 4.19.
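A comparable standardization can be sketched in code with the DQSTANDARDIZE function. This assumes that SAS Data Quality is licensed, that a SAS Quality Knowledge Base with the ENUSA locale is available to the session, and that the libref MYCAS and table CUSTOMERS (both hypothetical) exist.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

data mycas.customers_std;
   set mycas.customers;
   length STATE_STND varchar(20);
   /* Apply the State/Province (Abbreviation) standardization definition */
   STATE_STND = dqStandardize(STATE, 'State/Province (Abbreviation)', 'ENUSA');
run;
```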
Also in Data Studio, you can look at the Profile and Metadata column information about this table and
about the new fields that you have created.
Parsing
Sometimes you want to break data down into its various semantic components or “parse” it so that you can
analyze the individual components. For example, you may want to only analyze the ZIP codes from a full
address or only analyze the year from a DD/MM/YY date.
The Parse transform is similar to the Standardize transform because it also uses the SAS Quality
Knowledge Base to extract information. In this example, we will parse data in a column called PHONE.
Double-click on Parsing in the Data Quality Transforms section. Select the column name from the Source
Column field, as well as the Locale you want to use. In the definition, we will select Phone because we are
parsing a phone number.
Next, you will see the individual tokens into which your data type can be parsed. For a phone number, we
will select Area Code and Base Number from the list of tokens, as shown in Figure 4.20. This will break
the data down into two fields.
After you click Run, you will see two new columns titled PHONE_AreaCode and PHONE_BaseNumber
that contain the parsed information.
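The same parse can be sketched with the DQPARSE and DQPARSETOKENGET functions. As before, this assumes a SAS Data Quality license and an available ENUSA locale; the libref MYCAS, the table CUSTOMERS, and the work variable PARSED are hypothetical.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

data mycas.customers_parsed;
   set mycas.customers;
   length parsed varchar(200) PHONE_AreaCode PHONE_BaseNumber varchar(20);
   /* Parse the phone number into a delimited string of tokens */
   parsed = dqParse(PHONE, 'Phone', 'ENUSA');
   /* Extract the two tokens of interest from the parsed string */
   PHONE_AreaCode   = dqParseTokenGet(parsed, 'Area Code', 'Phone', 'ENUSA');
   PHONE_BaseNumber = dqParseTokenGet(parsed, 'Base Number', 'Phone', 'ENUSA');
   drop parsed;
run;
```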
Gender Analysis
Let’s look at one final example of a data quality transform. The Gender Analysis transform allows you to
know the gender associated with the name of an individual based on the information in the SAS Quality
Knowledge Base. The algorithm will determine whether a data value is male, female, or unknown. This
may be useful in marketing or clinical trial scenarios.
Double-click on Gender Analysis in the Data Quality Transforms section. Select the column name from the
Source Column field, choose the name of your new column, and choose the Locale you want to use. In the
definition, we will select Name. Select Run to complete the transform. As you can see in Figure 4.21, a
new column is added with F, M, and U values based on the names in the NAME column.
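In code, a similar result might be obtained with the DQGENDER function, under the same licensing and locale assumptions as the other data quality functions. The libref MYCAS, the table CUSTOMERS, and the new column name GENDER are hypothetical.

```sas
/* Assumed setup: CAS session and CAS engine libref */
cas mysess;
libname mycas cas caslib=casuser;

data mycas.customers_gender;
   set mycas.customers;
   length GENDER varchar(1);
   /* Returns M, F, or U based on the Name gender analysis definition */
   GENDER = dqGender(NAME, 'Name', 'ENUSA');
run;
```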
Saving Results
Once you have completed all of your transforms and your data is how you expect it to be, you can save this
data preparation plan and the results to an actual table.
Click on the Save button in the top right next to the Run button. You can give your data plan a title and
also save the table with a new name or replace an existing table by using the options as shown in Figure
4.22. You can also specify where you want to save the actual CAS table.
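Saving an in-memory table can also be done with PROC CASUTIL. In this sketch the table name CUSTOMERS_CLEAN and the CASUSER caslib are hypothetical; the statement writes a copy of the in-memory table to the caslib's data source so that it can be reloaded later.

```sas
/* Assumed setup: start a CAS session */
cas mysess;

proc casutil;
   /* Write the in-memory table to a file in the caslib's data source */
   save casdata="customers_clean" incaslib="casuser"
        casout="customers_clean.sashdat" outcaslib="casuser" replace;
quit;
```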
If you want to save and then create a job from this data preparation plan, you can click on the three vertical
dots next to the Save button and select “Create job” from the drop-down menu. This will enable you to
create a reusable job that you can run through the scheduling interface and environment manager.
The first step in exploring lineage is to search for subjects. Click File in the upper left
corner of the Lineage Viewer screen and select “Search for subjects.” This will enable you to search the
lineage repository for data objects and assets. In a previous section, we created a data preparation plan
called “Cleanse and Aggregate,” so in this example, we will search for that data preparation plan as shown
in Figure 4.23 by typing into the search box and clicking the magnifying glass search button.
The search results will show you any data objects such as data preparation plans, tables, data sources,
schemas, and files that meet your search criteria. Clicking on a results entry will open the visual
environment to enable you to view relationships, govern the data, and understand what happens if you
remove a table.
Details Tab
In the visual environment, you can click on an object and then open the details tab on the right side of your
screen. The Details tab tells you what the object type is, the URI, the data source, when it was modified, as
well as the relationships of that object to other tables and files, as shown in Figure 4.24.
Another way to view the relationships between objects is to click on the + sign in the upper right corner of
an object. This expands the relationships in the visual interface. If you want even more information about
the relationships between objects, you can select the “Highlight Path” icon at the top of the screen.
You can create specialized views by clicking on the New button, which opens a dialog to create a new
view based on the relationship types and object types that you want to be displayed in the view. After you
create a new view, you can use the checkmarks in the Manage View Filters tab to change the custom view
to be the default view.
If you want to see only the relationships for a specific object, select
that object and click the “View Lineage” icon at the top of the screen. This will create a new view just for
that specific object. An easy way to reset the view is to click on the View drop-down menu in the upper left
of the screen and choose “Reset layout.”
Resources
This chapter is based on the “SAS Data Management” videos in “SAS® Viya® Enablement,” a free course
available from SAS Education.
You may find the following documentation helpful as you learn more about data management in SAS®
Viya®:
Access free tutorials and other data preparation training resources from the SAS Data Preparation Learning
Center.