The Process of Tuning DB2 for AIX - IBM

Understanding the DB2 UDB architecture and all its implications is quite challenging. Moreover, IBM provides over eighty parameters that impact the tuning of ...


un extrait du document



L.L. Bean Inc.
DB2 Performance Recommendations
Version 0.1,  TIME \@ "MMMM d, yyyy" April 13, 2009

































Shelton Reese Database Consultant
Table of Contents

 TOC \o "1-2" \h \z  HYPERLINK \l "_Toc182885741" Table of Contents  PAGEREF _Toc182885741 \h 2
 HYPERLINK \l "_Toc182885742" Introduction  PAGEREF _Toc182885742 \h 2
 HYPERLINK \l "_Toc182885743" The Process of Tuning LL Bean’s DB2 for AIX 5.5  PAGEREF _Toc182885743 \h 3
 HYPERLINK \l "_Toc182885744" Operating system settings  PAGEREF _Toc182885744 \h 4
 HYPERLINK \l "_Toc182885745" Performance fundamentals  PAGEREF _Toc182885745 \h 4
 HYPERLINK \l "_Toc182885746" Updating catalog statistics  PAGEREF _Toc182885746 \h 7
 HYPERLINK \l "_Toc182885747" Monitoring and tuning database configuration parameters  PAGEREF _Toc182885747 \h 9
 HYPERLINK \l "_Toc182885748" Buffer pool size  PAGEREF _Toc182885748 \h 9
 HYPERLINK \l "_Toc182885749" Log buffer size (LOGBUFSZ)  PAGEREF _Toc182885749 \h 12
 HYPERLINK \l "_Toc182885750" Application heap size (APPLHEAPSZ)  PAGEREF _Toc182885750 \h 13
 HYPERLINK \l "_Toc182885751" Sort heap size (SORTHEAP) and sort heap threshold (SHEAPTHRES)  PAGEREF _Toc182885751 \h 13
 HYPERLINK \l "_Toc182885752" Number of agents (MAXAGENTS, NUM_POOLAGENTS and NUM_INITAGENTS)  PAGEREF _Toc182885752 \h 15
 HYPERLINK \l "_Toc182885753" Locks (LOCKLIST, MAXLOCKS and LOCKTIMEOUT)  PAGEREF _Toc182885753 \h 17
 HYPERLINK \l "_Toc182885754" Maximum number of active applications (MAXAPPLS)  PAGEREF _Toc182885754 \h 19
 HYPERLINK \l "_Toc182885755" Number of asynchronous page cleaners (NUM_IOCLEANERS)  PAGEREF _Toc182885755 \h 20
 HYPERLINK \l "_Toc182885756" Number of I/O servers (NUM_IOSERVERS)  PAGEREF _Toc182885756 \h 22
 HYPERLINK \l "_Toc182885757" Number of commits to group (MINCOMMIT)  PAGEREF _Toc182885757 \h 22
 HYPERLINK \l "_Toc182885758" AIX Disk Layout Considerations  PAGEREF _Toc182885758 \h 24
 HYPERLINK \l "_Toc182885759" Conclusion  PAGEREF _Toc182885759 \h 25
 HYPERLINK \l "_Toc182885760" Appendix: DB2_STATUS (KORN SHELL Script to Perform the Configuration Update)  PAGEREF _Toc182885760 \h 25



Introduction
L.L.Bean, Inc. is a privately held mail-order and retail company based in Freeport, Maine, specializing in clothing and outdoor recreation equipment. Its annual sales were USD $1.47 billion in 2005. LL Bean’s Internet interface requires use of multiple technology layers. Keeping these applications performing optimally is an intense challenge because of the scope and technological diversity of the application.
Understanding the DB2 UDB architecture and all its implications is quite challenging. Moreover, IBM provides over eighty parameters that impact the tuning of the database manager or the underlying databases. Because the parameters interact in complex ways, it is easy to adjust a parameter to address a particular performance issue, only to have unforeseen detrimental results in another area of the database. At LL Bean the key problem is Database Configuration problems causing the overall performance to be negatively affected. All numbers used in this document were collected from the ECOMP DB2 instance. All software has been delivered via the support directory on the production server.
IBM was brought in to resolve the performance issues that LL Bean was experiencing. As part of the effort to resolve the performance problems, I installed and executed seven pieces of software on the production “ecomp” server at LL Bean, as follows.
db2_status.ksh -d --> this script waits for the sampling time and then gathers all the DB2 snapshot information into the IBM snapshot tables. The script provides the following data for reviw:
The hit ratio for the data and index pages of every bufferpool.
The hit ratio for the catalog cache.
The hit ratio for the package cache.
The number of sorts serviced and number of sorts rejected.
The average elapsed time for each sort.
The update command to update the configuration to proper settings
aix_check.ksh . This script reads the AIX configuration and reports on any problem that might be found with possible suggestions.


The Process of Tuning LL Bean’s DB2 for AIX 5.5
This feedback is organized in three sections of this document, as shown below. Within these sections, I will provide the approach I used to identify problems and the actions I suggest for solving them.
 HYPERLINK \l "_Operating_system_settings" Fundamentals to improve LL Bean’s database performance based on operating system settings, which provides essential recommendations based on the analysis of documents and snapshots taken 11/11/07 and 11/14/07. All documentation is stored on the LL Bean ~/support/snapout directory. Based on runtime timestamp for reference.
 HYPERLINK \l "_Updating_catalog_statistics" Updating catalog statistics, which emphasizes the importance of collecting and maintaining up-to-date database statistics, the lack of which is often the source of many of the performance problems.
HYPERLINK \l "_Monitoring_and_tuning_database conf"Monitoring  HYPERLINK \l "_Monitoring_and_tuning_database conf" and tuning  HYPERLINK \l "_Monitoring_and_tuning_database conf" LL Bean’s database  HYPERLINK \l "_Monitoring_and_tuning_database conf" configuration  HYPERLINK \l "_Monitoring_and_tuning_database conf" parameters, which describes a list of database manager parameters and database parameters in the order of importance. Normally it's not necessary to go through the whole list to achieve the performance goals. I only recommend that several of them be adjusted to gain performance improvement.
With these procedures, IBM can get LL Bean’s Internet application up and running with verifiable and reasonably good performance.

Operating system settings
Recommendation
The following settings are recommended to allow maximum resource usage of AIX by DB2.
CPU, Memory & Kernel
ulimit
'ulimit' defines file descriptors limits. It specifies the number of open files permitted. If this value is set too low, a memory allocation error will occur on AIX and a too many files open will be logged to the to the stderr log file. Set this value higher than the default system value. For large multi-processor machines, set to unlimited.
How to change the parameter
# ulimit -a // to query # ulimit -n // to set to a new value
ulimit -n -1 // to set it to unlimited
Maximum Number of Processes Per User
'maxuproc' defines the max number of processes each AIX user can 'fork'. If it is set too low, you may not be able to create high number of database agents necessary to handle connections from your application. So it is recommended to set this to at least 4096.
How to change the parameter

Set the AIX maxuproc (maximum number of processes per user) device attribute to 4096 by entering the command:

chdev -l sys0 -a maxuproc=’4096’
Investigative Steps
The value set by AIX (1024) has been found to be too low for some large scale database environments, which causes DB2 to generate an SQL error message "SQL1402N - Unable to authenticate user due to unexpected system error."
To check the value run:
lsattr -l sys0 -E | grep maxuproc
Shared Memory Segments
The 'EXTSHM' environment variable defines the max number of memory segments shared by all user-mode applications. Some database servers, such as IBM DB2, relies on this environment variable to support large workloads. To set it, add the following lines to a user's profile:
EXTSHM=ON export EXTHSM

Network
In high volume environment, TCP layer on AIX should be tuned to the similar values listed below:
Investigative Steps
To display a list of current tunable values, use:
/usr/sbin/no -a
To make TCP changes permanent, add the following lines into the /etc/rc.net file:
/usr/sbin/no -o sb_max=6192000
/usr/sbin/no -o tcp_sendspace=4096000
/usr/sbin/no -o tcp_recvspace=4096000
/usr/sbin/no -o udp_sendspace=65536
/usr/sbin/no -o udp_recvspace=655360
/usr/sbin/no -o rfc1323=1
/usr/sbin/no -o ipqmaxlen=150
/usr/sbin/no -o clean_partial_conns=true

Ethernet Device Driver Properties
To display a list of current ethernet adapter device driver settings such as MTU, IPs, etc., use:
lsattr -E -l en2
where en2 is the adapter name - chances are you have several, en0, en1, etc.
To change the adapter network settings, one way is to run the command line tool chdev:
# ifconfig en0 detach
chdev -l ent0 -a tx_que_size=8192
# ifconfig en0 hostname up where tx_que_size is 8192

Disk I/O
Tune FILE SYSTEM Performance
The following file system mount option settings allow you to take advantage of caching, throughput optimization, and performance of UNIX attached external storage device, like a SAN.
Update the /etc/filesystems, and modify the wanted file system section(s) to use 'dio':
/mydb2:
dev = /dev/fwvol
vfs = jfs2
log = vjfsxxx
mount = true
options = dio,rw
account = false

Enabling file system direct I/O, on AIX, allows the file system buffer cache to be bypassed which eliminates the double buffering of data which can adversely affect file system I/O performance. For changes to take effect, you may run 'mount -a', if the file system you are modifying is not mounted. If the file system is mounted it will need to be unmounted and then remounted or a system reboot will need to be executed.

Performance fundamentals
Following are the findings of my analysis regarding the state of LL Bean’s system and the parameters that are affecting performance.
There is enough memory at the current amount of 45 GB.
ECOMP a 64-bit system, uses at least 3.7 GB of RAM per CPU, up to 8 GB per machine, to support the buffer pools, DB2 agents, and other shared memory objects required for a large number of concurrent users. (See Buffer Pool Size (BUFFPAGE) later in this document for more information on buffer pools.)
In the VMSTAT Report there are three columns of particular interest. The key figures for CPU utilization (in percentage) are shown VMSTAT entitled us, sy, and id, with "id" being the percentage of CPU that is idle. A value below 10% is great. For LL Bean’s multiple CPUs, If this average is over 20% then the db2 configuration would DBHEAP would be updated because memory may be needed to support applications that run locally without stored procedures.
The amount of memory required by each user connection into the database (that is, a DB2 agent) depends on the nature of the SQL statements performed by the LL Bean application -- such as the number of concurrent cursors opened and the amount of sorting and temp space required. For the LL Bean application, there should be less sorting and temp space required and only a handful of concurrent cursors opened at a time.
Rule of thumb: Use a minimum of 1 MB for AIX for each DB2 agent. If fenced stored procedures are used, then each user connection has two DB2 agents, in addition to the memory required to run the stored procedure application.
LL Bean has sufficient I/O handling capability.
There must be enough disk arms to ensure sufficient I/O parallelism to support a high volume of concurrent transactions. There should be at least 5 to 10 disks per CPU for a moderate workload, and 20 disks for a high-I/O OLTP workload. The operating system (including paging space), DB2 logs, and DB2 table spaces should each have their own dedicated disks. There should be multiple disks for the DB2 logs, for tables, and for indexes. Use the AIX_CHECK script to check how the disk bandwidth is doing.
The proper way to estimate the I/O handling capability that is needed for good performance is actually to prototype the transactions and find out how many I/Os are required per transaction, and how many transactions per second are required. Then find out the I/O rate for the disk controller and the disk subsystem to help determine how many controllers and disks are required. . Perhaps the most two most important columns in the AIX_CHECK report are the %tm_act and %tps columns. %tps is the percentage of time spent waiting for transactions to complete, and is a measure of how many jobs are contending to use the hard disk (eg. high is bad), while %tm_act is the percent of time the hard disk is busy (eg. one high device relative to the other devices is bad). At LL Bean “hdisk 35” had reports above 70%
There is sufficient network bandwidth.
There must be enough network bandwidth to support the workload. Make sure that the network or any intermediate hubs are not a bottleneck. This is especially significant when supporting access from remote location. For example, a T1 line supports 1.544 Mbit/sec, which is only 0.193 MB/sec, whereas a typical 10 Mbit/sec Ethernet LAN can support 6x the throughput at 1.25 MB/sec. I used the command netstat to monitor the traffic volumes on the connections. I am not an expert in all the AIX parameters, but a quick look at the error parameters such as Collision Errors gives me an idea of the overall health of the LL Bean network. From the NMON I suggested a number of TCPIP changes.
Index your table columns appropriately.
Ensure that columns that are joined in queries have indexes.
It can improve performance if columns involved in ORDER BY and GROUP BY are indexed.
Frequently accessed data can also be included within an index as INCLUDED columns.
Use the Index Advisor (also known as Index Wizard from the DB2 Control Center) to help determine a good set of indexes to use, based on the tables and SQL statements that you use.
Ensure that your application holds locks for the shortest time possible.
When a user operation involves multiple interactions, each interaction should have its own transaction to commit and should free up all locks before returning activity to the user. Keep the duration of a transaction as short as possible by starting its first SQL statement (which starts a transaction) as late as possible, and its updates (inserts, updates, and deletes, which use exclusive locks) as close to the commit stage as possible.
Use of the DB2 registry parameter DB2_RR_TO_RS can improve concurrency by not locking the next key of the row that was inserted or updated. This can be used if the isolation level RR (Repeatable Read) is not used by any programs that operate on the same set of tables. Use DB2 Snapshot to monitor the number of deadlocks and lock waits. Use the DB2_STATUS.KSH script to keep up to date in this area.
Use stored procedures or compound SQL to minimize the network cost.
Minimizing the number of network trips for your SQL statements will save on network latency and context switches, which can result in the application holding onto locks for a shorter period of time. Generally, a stored procedure should be used when an SQL transaction has more than 4 or 5 statements.
On the other hand, if there is some complicated CPU-intensive processing involved in the application logic, leaving this in a stored procedure running on the database server can use up excessive CPU cycles on the database server at the expense of database operations. In this case, either do not use a stored procedure, or execute part of the logic in the client side and execute the rest in a stored procedure.
Use SQL efficiently.
In general, don't use multiple SQL statements where one will do. When you provide more detailed search conditions by having more predicates in a query, the optimizer has a chance to make better choices. You should also be selective in your query so that the database does not return more rows and columns than you need. For example, use SQL to filter the rows you want; don't return all the rows and then require the application to do the filtering.
Analyze the access plan.
Use Visual Explain or db2exfmt to analyze each SQL statement. Make sure appropriate indexes are used to minimize the rows that have to be fetched internally when selecting and joining tables.



Updating catalog statistics
Background
The RUNSTATS utility updates statistics in the system catalog tables to help with the query optimization process. Without these statistics, the database manager could make a decision that would adversely affect the performance of an SQL statement. The RUNSTATS utility allows you to collect statistics on the data contained in the tables, indexes, or both tables and indexes. Use the RUNSTATS utility to collect statistics based on both the table and the index data to provide accurate information to the access plan selection process in the following situations:
When a table has been loaded with data, and the appropriate indexes have been created.
When a table has been reorganized with the REORG utility.
When there have been extensive updates, deletions, and insertions that affect a table and its indexes. ("Extensive" in this case may mean that 10 to 20 percent of the table and index data has been affected.)
Before binding application programs whose performance is critical.
When you want to compare new statistics with previous statistics. Running statistics on a periodic basis enables you to discover performance problems at an early stage.
When the prefetch quantity is changed.
When you have used the REDISTRIBUTE NODEGROUP utility.
When optimizing SQL queries, the decisions made by the SQL compiler are heavily influenced by the optimizer's model of the database contents. This data model is used by the optimizer to estimate the costs of alternative access paths that can be used to resolve a particular query. A key element in the data model is the set of statistics gathered about the data contained in the database and stored in the system catalog tables. This includes statistics for tables, nicknames, indexes, columns, and user-defined functions (UDFs). A change in the data statistics can result in a change in the choice of access plan selected as the most efficient method of accessing the desired data.
Examples of the statistics available which help define the data model to the optimizer include:
The number of pages in a table and the number of pages that are not empty.
The degree to which rows have been moved from their original page to other (overflow) pages.
The number of rows in a table.
Statistics about individual columns such as the number of distinct values in a column.
The degree of clustering of an index; that is, the extent to which the physical sequence of rows in a table follows an index.
Statistics about the index such as the number of index levels and the number of leaf pages in each index.
The number of occurrences of frequently used column values.
The distribution of column values across the range of values present in the column.
Cost estimates for user-defined functions (UDFs) in this case you don’t have them.
RUNSTATS can help you determine how performance is related to changes in your database. The statistics show the data distribution within a table. When used routinely, RUNSTATS provides data about tables and indexes over a period of time, thereby allowing performance trends to be identified for your data model as it evolves over time. Rebind applications that use static SQL after using RUNSTATS so that the query optimizer can choose the best access plan given the new statistics. However, for applications using dynamic SQL (e.g. most vendor applications) rebinding is not necessary since the statement will be optimized based on the statistics at run time. When statistical information about tables is not accurate, it may cause performance problems. In a worst-case scenario, a particular SQL statement may cause DB2 to use a table scan instead of an index scan. The statistics will be updated in the db2service program included with the weekend maintenance scripts.
How to update the statistics
Statistics for objects are updated in the system catalog tables only when explicitly requested. There are several ways to update some or all of the statistics:
Using the RUNSTATS (run statistics) utility.
Using LOAD, with statistics collection options specified.
Coding SQL UPDATE statements that operate against a set of predefined catalog views.
Using the "reorgchk update statistics" command.
When you do not exactly know all the table names, or there are too many, the easiest way to do RUNSTATS is to use the "db2 reorgchk update statistics" command. The exact script looks like this with ECOMP substituted for DB_NAME:
#!/bin/ksh
for i in `db2 list db directory|grep -B 4 -i Indirect|grep name|grep Database| awk '{print $4}'`
do
echo 'DBname = ' $i
DBname=$i
db2 connect to $i
db2 -x "select rtrim(TABSCHEMA) || '.' || TABNAME from syscat.tables where stats_time < (current timestamp - 7 days) and type='T'" > runstats.temp
if [ "$?" -eq 1 ];then
echo "No tables for database $DBname"
continue
fi
while read line_by_line
do
db2 "reorgchk update statistics on table ${A}"
db2 commit
done < runstats.temp
rm -f runstats.temp
db2rbind $DBname -l bind.log
done
echo Runstats ended at `date`
The script above does not require table names. This one command performs RUNSTATS on all tables.
Note: If you know the name of the table and to avoid having large numbers of tables that may take a long time to complete, it's preferable to do RUNSTATS on each table one at a time. The command looks like the following: db2 -v runstats on table TAB_NAME and indexes all This will collect statistics by table and all indexes (basic level).
Checking to see if RUNSTATS has been run
One quick way to see whether RUNSTATS has been performed on your database is to query some system catalog tables. For example, as shown in the script above, you can run this command:
db2 -v "select tbname, nleaf, nlevels, stats_time from sysibm.sysindexes"
If RUNSTATS has not yet been run, you will see "-1" for the nleaf and nlevels columns, and a "-" for the stats_time column. These columns contain real numbers if RUNSTATS has been run, and the stats_time column will contain the timestamp when RUNSTATS ran. If you think the time shown in stats_time is too old, it's time to do runstats again. Based on the analysis for runstats attached to this document, I re