Loading...

Improving Query Performance in Advanced SAS

There are several ways to improve query performance, including the following:
·         using indexes and composite indexes
·         using the keyword ALL in set operations when you know that there are no duplicate rows, or  when it does not matter if you have duplicate rows in the result table
·         omitting the ORDER BY clause when you create tables and views
·         using in-line views instead of temporary tables (or vice versa)
·         using joins instead of subqueries
·         using WHERE expressions to limit the size of result tables that are created with joins
·         using either PROC SQL options, SAS system options, or both to replace a PUT function in a query with a logically equivalent expression
·         replacing references to the DATE, TIME, DATETIME, and TODAY functions in a query with their equivalent constant values before the query executes
·         disabling the remerging of data when summary functions are used in a query
Query Performance in SAs

Using Indexes to Improve Performance
Indexes are created with the CREATE INDEX statement in PROC SQL or with the MODIFY and INDEX CREATE statements in the DATASETS procedure. Indexes are stored in specialized members of a SAS library and have a SAS member type of INDEX. The values that are stored in an index are automatically updated if you make a change to the underlying data.
Indexes can improve the performance of certain classes of retrievals. For example, if an indexed column is compared to a constant value in a WHERE expression, then the index will likely improve the query's performance. Indexing the column that is specified in a correlated reference to an outer table also improves a subquery's (and hence, query's) performance. Composite indexes can improve the performance of queries that compare the columns that are named in the composite index with constant values that are linked using the AND operator. For example, if you have a compound index in the columns CITY and STATE, and the WHERE expression is specified as WHERE CITY='xxx' AND STATE='yy', then the index can be used to select that subset of rows more efficiently. Indexes can also benefit queries that have a WHERE clause in this form:
... where var1 in (select item1 from table1) ...
The values of VAR1 from the outer query are found in the inner query by using the index. An index can improve the processing of a table join, if the columns that participate in the join are indexed in one of the tables. This optimization can be done for equijoin queries only--that is, when the WHERE expression specifies that table1.X=table2.Y.
Using the Keyword ALL in Set Operations
Set operators such as UNION, OUTER UNION, EXCEPT, and INTERSECT can be used to combine queries. Specifying the optional ALL keyword prevents the final process that eliminates duplicate rows from the result table. You should use the ALL form when you know that there are no duplicate rows or when it does not matter whether the duplicate rows remain in the result table.
Omitting the ORDER BY Clause When Creating Tables and Views
If you specify the ORDER BY clause when a table or view is created, then the data is always displayed in that order unless you specify another ORDER BY clause in a query that references that table or view. As with any sorting procedure, using ORDER BY when retrieving data has certain performance costs, especially on large tables. If the order of your output is not important for your results, then your queries will typically run faster without an ORDER BY clause.
Using In-Line Views versus Temporary Tables
It is often helpful when you are exploring a problem to break a query down into several steps and create temporary tables to hold the intermediate results. After you have worked through the problem, combining the queries into one query by using in-line views can be more efficient. However, under certain circumstances it is more efficient to use temporary tables. You should try both methods to determine which is more efficient for your case.
Comparing Subqueries with Joins
Many subqueries can also be expressed as joins. Generally, a join is processed at least as efficiently as the subquery. PROC SQL stores the result values for each unique set of correlation columns temporarily, thereby eliminating the need to calculate the subquery more than once.
Using WHERE Expressions with Joins
When joining tables, you should specify a WHERE expression. Joins without WHERE expressions are often time-consuming to evaluate because of the multiplier effect of the Cartesian product. For example, joining two tables of 1,000 rows each without specifying a WHERE expression or an ON clause, produces a result table with one million rows.
PROC SQL executes and obtains the correct results in unbalanced WHERE expressions (or ON join expressions) in an equijoin, as shown here, but handles them inefficiently:
where table1.columnA-table2.columnB=0
It is more efficient to rewrite this clause to balance the expression so that columns from each table are on alternate sides of the equals condition:
where table1.columnA=table2.columnB
PROC SQL sequentially processes joins that do not have an equijoin condition evaluating each row against the WHERE expression: that is, joins without an equijoin condition are not evaluated using sort-merge or index-lookup techniques. Evaluating left and right outer joins is generally comparable to, or only slightly slower than, a standard inner join. A full outer join usually requires two passes over both tables in the join, although PROC SQL tries to store as much data as possible in buffers. Thus for small tables, an outer join might be processed with only one physical read of the data.

VirtualNuggets 4973440053164035657

Post a Comment

emo-but-icon

Home item

Blog Archive

Popular Posts

Random Posts

Flickr Photo