
Tuesday, August 13, 2013

Potential Bottlenecks: SQL Performance



·         How much Page File does my Server actually use?
·         When to add more CPU cores?
·         Do we suffer from Disk latency?
·         Or maybe SQL slowness is due to application issues…?
In this post I detail several perfmon counters that will help you locate performance bottlenecks, whether they come from hardware or from the application.
BTW, here is my ranked list of reasons for performance issues:
1. Lack of a DBA.
2. DBA with lack of understanding of SQL Server concepts.
3. Lack of communication between DBA and Developers.
4. Lack of tuning. It only takes one bad query to bring an entire system to its knees.
5. TempDB
6. Poor security – a server with 30 sysadmins eager to do ‘work’
7. Storage: disk latency due to lack of spindles, misconfigured RAID, poor controllers, disk misalignment, etc.
8. Memory shortage
9. Lack of CPU cores

What is the minimum amount of memory that SQL Server needs?  Unfortunately, there is no way to calculate the ideal amount of RAM for a given instance of SQL Server, since SQL Server is designed to cache data in the buffer pool and will typically use as much memory as you can give it.  One thing to keep in mind when you are looking at reducing the RAM allocated to a SQL Server instance is that you will eventually reach a point where lower memory gets traded for higher disk I/O.
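
Where you do decide to cap an instance's memory, the cap is set with sp_configure. A minimal T-SQL sketch; the 12288 MB value here is an arbitrary example, not a recommendation:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- 'max server memory (MB)' caps the buffer pool; choose the value from testing
EXEC sp_configure 'max server memory (MB)', 12288;
RECONFIGURE;
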
If you need to figure out the ideal memory configuration for a SQL Server environment that has been over-provisioned, the best way to go about it is to start with a baseline of the environment and its current performance metrics.  Counters to begin monitoring would include:
·         SQL Server:Buffer Manager\Page Life Expectancy
·         SQL Server:Buffer Manager\Page reads/sec
·         Physical Disk\Disk Reads/sec
Typically, if the environment has excess memory for the buffer pool, the Page Life Expectancy value will continue to increase.  At the same time, SQL Server:Buffer Manager\Page reads/sec will be very low, which will also correspond to a low value for Physical Disk\Disk Reads/sec.
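
The SQL Server counters can also be sampled from T-SQL. Keep in mind that the per-second counters exposed in sys.dm_os_performance_counters are cumulative since startup, so you need two samples and a delta; a sketch, with an arbitrary 10-second interval:

DECLARE @r1 bigint, @r2 bigint;
SELECT @r1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page reads/sec';
WAITFOR DELAY '00:00:10';
SELECT @r2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page reads/sec';
-- divide the delta by the sample interval to get the true per-second rate
SELECT (@r2 - @r1) / 10.0 AS page_reads_per_sec;

The Physical Disk counters are OS-level and still have to come from perfmon.
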
* One comment regarding the Page Life Expectancy threshold: the value recommended by Microsoft is > 300. The problem with this fixed value is that it was determined roughly 10 years ago, when servers generally had 4GB of RAM installed. This performance counter tells you the number of seconds, at the current point in time, that a page will remain in memory without being referenced (so a value of 300 means your buffer pool is flushing every 5 minutes).
Instead, use an adaptive formula like (DataCacheSizeInGB / 4GB * 300). For example, for SQL Server configured to use 12GB of RAM, the PLE value should be: 12GB / 4GB * 300 = 900.
Be aware: in many cases the issue is not lack of RAM but a specific index using all of the buffer pool, causing buffer pool flushing to occur on a constant basis and resulting in a low Page Life Expectancy for the server.
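
You can apply the adaptive formula without opening perfmon. A sketch that reads the current PLE and the data cache size from sys.dm_os_performance_counters ('Database pages' is the number of 8KB pages currently in the data cache):

DECLARE @ple bigint, @cacheGB decimal(10,2);
SELECT @ple = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page life expectancy';
SELECT @cacheGB = cntr_value * 8.0 / 1024 / 1024   -- 8KB pages -> GB
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Database pages';
SELECT @ple AS ple_seconds,
       @cacheGB / 4.0 * 300 AS adaptive_threshold_seconds,
       CASE WHEN @ple < @cacheGB / 4.0 * 300
            THEN 'Possible memory pressure' ELSE 'OK' END AS verdict;
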

Further fine-tuning can be done at the NTFS file system level:
NTFS File System Performance Best Practices for SQL Server
  • Disable Indexing on disk volumes
  • Disable generation of 8.3 names (command: FSUTIL BEHAVIOR SET DISABLE8DOT3 1)
  • Disable last file access time tracking (command: FSUTIL BEHAVIOR SET DISABLELASTACCESS 1)
  • Keep some space empty (let us say 15%, for reference) on the drive if possible
  • Use SSDs for data disks (not log disks), as they provide high performance for random I/O operations.  One thing to keep in mind with SSDs is that when they fail, your data is GONE! There is no recovery like there is with magnetic disks, so make sure you use some sort of RAID.
! Note: Before applying any of the above suggestions, consult your IT infrastructure admin; applying them without proper testing can damage your system.

The information below is to be used as a guide:
Memory Bottleneck Analysis:
Object: - Memory
Counter: - Available Mbytes
Preferred Value: - > 20MB
Description: - Amount of physical memory, in megabytes, immediately available for allocation to a process or for system use.

Object: - Memory
Counter: - Pages/Sec
Preferred Value: - < 50
Description: - Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays.
Object: - Paging File
Counter: - %Usage
Preferred Value: - < 70%
Description: - The percentage of the page file instance currently in use.

Object: - SQL Server:Buffer Manager
Counter: - Page Life Expectancy *
Preferred Value: - > 300
Description: - This performance monitor counter tells you, on average, how long data pages are staying in the buffer. If this value gets below 300 seconds, this is a potential indication that your SQL Server could use more memory in order to boost performance.

Object: - SQL Server:Buffer Manager
Counter: - Lazy Writes/Sec
Preferred Value: - < 20
Description: - This counter tracks how many times a second the Lazy Writer process moves dirty pages from the buffer to disk in order to free up buffer space. If it is zero, your SQL Server's buffer cache is sufficiently large and SQL Server doesn't have to free up dirty pages. If this value is consistently high, it indicates a need for more memory.
Object: - SQL Server:Buffer Manager
Counter: - Page reads/sec **
Preferred Value: - < 90
Description: - Number of physical database page reads issued per second. 80 – 90 per second is normal; anything above that indicates an indexing or memory constraint.

Object: - SQL Server:Buffer Manager
Counter: - Page writes/sec
Preferred Value: - < 90
Description: - Number of physical database page writes issued per second. 80 – 90 per second is normal; anything more, check the Lazy Writes/sec and Checkpoint pages/sec counters, and if those are also relatively high, it's a memory constraint.

Object: - SQL Server:Buffer Manager
Counter: - Buffer Cache hit ratio
Preferred Value: - > 90%
Description: - Percentage of pages that were found in the buffer pool without having to incur a read from disk.
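
If you read this counter from sys.dm_os_performance_counters rather than perfmon, remember that ratio counters are stored as a value/base pair that you must divide yourself; a sketch:

SELECT 100.0 *
       (SELECT cntr_value FROM sys.dm_os_performance_counters
        WHERE object_name LIKE '%Buffer Manager%'
          AND counter_name = 'Buffer cache hit ratio') /
       NULLIF((SELECT cntr_value FROM sys.dm_os_performance_counters
               WHERE object_name LIKE '%Buffer Manager%'
                 AND counter_name = 'Buffer cache hit ratio base'), 0)
       AS buffer_cache_hit_ratio_pct;
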

Object: - SQL Server:Memory Manager
Counter: - Target Server Memory(KB)
Preferred Value: -
Description: - Total amount of dynamic memory the server can consume.

Object: - SQL Server:Memory Manager
Counter: - Total Server Memory(KB)
Description: - Total amount of dynamic memory (in kilobytes) that the server is currently using.
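
A quick sketch comparing the two from T-SQL: once Total has reached Target, the buffer pool has fully ramped up, while a persistently large gap can indicate external memory pressure.

SELECT RTRIM(counter_name) AS counter_name,
       cntr_value / 1024 AS value_mb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND counter_name LIKE 'T%Server Memory%';  -- Target and Total only
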

Disk Bottleneck Analysis:
Object: - PhysicalDisk
Counter: - Avg. Disk Sec/Read
Preferred Value: - < 8ms
Description: - Measure of disk latency. Avg. Disk sec/Read is the average time, in seconds, of a read of data from the disk.
More Info:
Excellent < 08 Msec ( .008 seconds )
Good < 12 Msec ( .012 seconds )
Fair < 20 Msec ( .020 seconds )
Poor > 20 Msec ( .020 seconds )

Object: - PhysicalDisk
Counter: - Avg. Disk sec/Write
Preferred Value: - < 8ms
Description: - Measure of disk latency. Avg. Disk sec/Write is the average time, in seconds, of a write of data to the disk.
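
The perfmon counters show latency per volume; sys.dm_io_virtual_file_stats shows cumulative average latency per database file since startup, which helps tie a slow volume to a specific database. A sketch:

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;
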

Object: - PhysicalDisk
Counter: - Avg. Disk Read Queue Length
Preferred Value: - < 2 * spindles
Description: - Avg. Disk Read Queue Length is the average number of read requests that were queued for the selected disk during the sample interval.
More Info:
< (2+ no of spindles) Excellent
< (2*no of spindles) Good
< (3* no of spindles) Fair
Note: If the array has, say, 20 disks and it is RAID 10, then no. of spindles = 20/2 = 10. If it is RAID 5, then no. of spindles = no. of disks = 20.

Object: - PhysicalDisk
Counter: - Avg. Disk Write Queue Length
Preferred Value: - < 2 * spindles
Description: - Avg. Disk Write Queue Length is the average number of write requests that were queued for the selected disk during the sample interval.

Processor Bottleneck Analysis:

Object: - Processor
Counter: - %Processor Time
Preferred Value: - < 80%
Description: - % Processor Time is the percentage of elapsed time that the processor spends executing non-idle threads.

Object: - System
Counter: - Processor Queue Length
Preferred Value: - < 4 per CPU
Description: - The number of threads waiting in the processor queue; a sustained queue suggests CPU pressure.

<= 4 per CPU Excellent
< 8 per CPU Good
< 12 per CPU Fair
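
Alongside the queue length, the share of signal waits in sys.dm_os_wait_stats is a common CPU-pressure check: signal wait time is time a task spent waiting for a CPU after its resource was already available, and a ratio persistently above roughly 25% is often read as CPU pressure. A sketch:

SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
            AS decimal(5,2)) AS pct_signal_waits
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;
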

Object: - SQLServer:Access Methods
Counter: - Full Scans/sec ***
Preferred Value: - < 1
Description: - If we see high CPU, we need to investigate this counter; if the full scans are on small tables, we can ignore it. Values greater than 1 or 2 indicate table/index page scans, and we need to analyze how they can be avoided.

Object: - SQLServer:Access Methods
Counter: - Page Splits/sec
Preferred Value: - < 20
Description: - An interesting counter that can point back to our table/index design. This value needs to be as low as possible. If you find that the number of page splits is high, consider reviewing your indexes; one common fix is sketched below.
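
If the page splits trace back to one hot, fragmenting index, a common mitigation is rebuilding it with some free space left on each leaf page. A sketch; the index and table names are hypothetical, and 90 is just an example fill factor:

ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
REBUILD WITH (FILLFACTOR = 90);   -- leave ~10% free space per leaf page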

Overall SQL Server Bottleneck Analysis:
Object: - SQLServer:General Statistics
Counter: - User Connections
Preferred Value: -
Description: - The number of users currently connected to the SQL Server.
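
The same number is available from T-SQL if you prefer a query to perfmon; a sketch:

SELECT COUNT(*) AS user_connections
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;
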
Object: - SQLServer:SQL Statistics
Counter: - Batch Requests/Sec
Preferred Value: - < 1000
Description: - Over 1000 batch requests per second indicates a very busy SQL Server.

Transaction Management:

Object: - SQL Server:Locks
Counter: - Number of Deadlocks/sec
Preferred Value: - < 1
Description: - The number of lock requests that resulted in a deadlock.
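
The counter only tells you deadlocks are happening; to see why, one long-standing option is logging the deadlock graph to the SQL Server error log with trace flag 1222 (the -1 enables it globally):

DBCC TRACEON (1222, -1);   -- write detailed deadlock graphs to the error log
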

Object: - SQL Server:Locks
Counter: - Lock Requests/sec
Preferred Value: - < 1000
Description: - Number of requests for a type of lock per second. Lock Requests/sec > 1000 indicates that queries are accessing very large numbers of rows and can be an indication of blocking.

Object: - SQL Server:Locks
Counter: - Average Wait Time (ms)
Preferred Value: - < 500
Description: - This is the average wait time in milliseconds to acquire a lock. The lower the value, the better. If the value goes higher than 500, there may be blocking going on.
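
When the average wait time climbs, a quick way to see who is blocking whom right now is sys.dm_exec_requests; a sketch:

SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;
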

* Use the adaptive formula described above: DataCacheSizeInGB / 4GB * 300 (for example, for SQL Server configured to use 12GB: 12GB / 4GB * 300 = 900).
Be aware: a Buffer Cache Hit Ratio of 99+ combined with a very low Page Life Expectancy indicates lots of I/O with the I/O subsystem keeping up. Check for tempdb issues and backup/restore operations, and consider moving as much data as possible out to history tables/databases.
** This counter represents actual physical page reads from disk. Because physical I/O is expensive, you may be able to minimize the cost either by using a larger data cache, intelligent indexes, and more efficient queries, or by changing the database design. Missing indexes can result in table scans that drive up CPU usage and I/O.
*** High value for Full Scans/sec? Missing indexes and/or too many rows requested. Queries with missing indexes or with too many rows requested will show a large number of logical reads and increased CPU time.
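
To follow up on the missing-index suspicion, SQL Server records suggestions in the missing-index DMVs (reset at each restart); treat the output as hints to evaluate, not commands to run. A sketch:

SELECT TOP (10)
       d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
  ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
  ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
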

The more (appropriate) indexes you have, the fewer full scans you need.


NOTE: Credit for this post belongs to Ariel Fridman.
