· How much Page File does my Server actually use?
· When to add more CPU cores?
· Do we suffer from Disk latency?
· Or maybe SQL slowness is due to application issues…?
In this post, I detail several perfmon counters that will help you locate performance bottlenecks, if any, whether they stem from hardware or from the application.
BTW, here is my ranked list of common reasons for performance issues:
1. Lack of a DBA.
2. A DBA who lacks understanding of SQL Server concepts.
3. Lack of communication between the DBA and the developers.
4. Lack of tuning. It only takes one bad query to bring an entire system to its knees.
5. TempDB.
6. Poor security – a server with 30 sysadmins eager to do ‘work’.
7. Storage: disk latency due to lack of spindles, misconfigured RAID, poor controllers, disk misalignment, etc.
8. Memory shortage.
9. Lack of CPU cores.
What is the minimum amount of memory that SQL Server needs? Unfortunately, there is no way to calculate the ideal amount of RAM for a given instance of SQL Server, since SQL Server is designed to cache data in the buffer pool and will typically use as much memory as you can give it. One thing to keep in mind when you are looking at reducing the RAM allocated to a SQL Server instance is that you will eventually reach a point where the lower memory is traded off for higher disk I/O.
If you need to figure out the ideal memory configuration for SQL Server in an environment that has been over-provisioned, the best way to go about it is to start with a baseline of the environment and its current performance metrics. Counters to begin monitoring include:
· SQL Server:Buffer Manager\Page Life Expectancy
· SQL Server:Buffer Manager\Page reads/sec
· Physical Disk\Disk Reads/sec
Typically, if the environment has excess memory for the buffer pool, the Page Life Expectancy value will continue to increase. At the same time, SQL Server:Buffer Manager\Page reads/sec will be very low, which will also correspond to a low value for Physical Disk\Disk Reads/sec.
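If you prefer to sample the SQL Server-side counters from inside the engine rather than in perfmon, sys.dm_os_performance_counters exposes them. Note that the "per second" counters in this DMV are cumulative since the last restart, so getting an actual rate requires two samples. A minimal sketch for Page reads/sec (the 10-second delay is an arbitrary sample window):

-- Sample the cumulative "Page reads/sec" counter twice to turn the running
-- total into an actual per-second rate.
DECLARE @reads1 BIGINT, @reads2 BIGINT;

SELECT @reads1 = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page reads/sec';

WAITFOR DELAY '00:00:10';   -- arbitrary 10-second sample window

SELECT @reads2 = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page reads/sec';

SELECT (@reads2 - @reads1) / 10.0 AS page_reads_per_sec;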
* One comment regarding Page Life Expectancy threshold: the recommended
value, by Microsoft is >300. Theproblem with this fixed value is that
it was determined, ~10 years ago, when servers generally had 4GB of RAM
installed in them. This performance counter tells you the number of seconds, at
the current point in time, a page will remain in memory without being referenced
(so a value of 300 means your buffer pool is flushing every 5
minutes).
Instead, use an
adaptive formula like (DataCacheSizeInGB/4GB*300) (for example: for SQL configured to use 12GB RAM the PLE
value should be: 12GB/4GB*300=900.
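A minimal sketch of that adaptive calculation from inside SQL Server, assuming the data cache size can be taken from the Buffer Manager's "Database pages" counter (8 KB pages) in sys.dm_os_performance_counters:

-- Compare current Page Life Expectancy with the adaptive threshold
-- (data cache size in GB / 4 * 300); "Database pages" * 8 KB = data cache size.
;WITH bm AS
(
    SELECT counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Buffer Manager%'
)
SELECT
    ple.cntr_value                                            AS current_ple_sec,
    CAST(dp.cntr_value * 8.0 / 1024 / 1024 AS DECIMAL(10,2))  AS data_cache_gb,
    CAST(dp.cntr_value * 8.0 / 1024 / 1024 / 4 * 300 AS INT)  AS adaptive_ple_threshold_sec
FROM bm AS ple
CROSS JOIN bm AS dp
WHERE ple.counter_name = 'Page life expectancy'
  AND dp.counter_name  = 'Database pages';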
Be aware: in many cases the issue is not lack of RAM, but a specific index using all of the buffer pool and causing it to flush on a constant basis, resulting in a low Page Life Expectancy for the server.
Further fine-tuning can be done at the NTFS file system level:
NTFS File System Performance Best Practices for SQL Server
- Disable indexing on disk volumes
- Disable generation of 8.3 names (command: FSUTIL BEHAVIOR SET DISABLE8DOT3 1)
- Disable last file access time tracking (command: FSUTIL BEHAVIOR SET DISABLELASTACCESS 1)
- Keep some space empty (say 15%, as a reference point) on the drive if possible
- Use SSDs on data disks (not log disks), as they deliver high performance for random I/O operations. One thing to keep in mind with SSDs is that when they fail, your data is GONE! There is no recovery like there is with magnetic disks, so make sure you use some sort of RAID.
! Note: Before applying any of the above suggestions, consult your IT infrastructure admin; applying them without proper testing can damage your system.
The information below is to be used as a guide:

Memory Bottleneck Analysis:

Object: - Memory
Counter: - Available Mbytes
Preferred Value: - > 20MB
Object: - Memory
Counter: - Pages/Sec
Preferred Value: - < 50
Description: - Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays.
Object: - Paging File
Counter: - %Usage
Preferred Value: - < 70%
Description: - The amount of the Page File instance in use, in percent.
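For a quick cross-check of the OS-level memory counters from inside SQL Server, sys.dm_os_sys_memory exposes roughly the same signals. A sketch (scaling to MB is only for readability):

-- OS-memory snapshot as seen by SQL Server.
SELECT
    available_physical_memory_kb / 1024 AS available_mb,          -- compare to Memory\Available Mbytes
    total_page_file_kb / 1024           AS total_page_file_mb,
    available_page_file_kb / 1024       AS available_page_file_mb,
    system_memory_state_desc                                       -- e.g. 'Available physical memory is high'
FROM sys.dm_os_sys_memory;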
Object: - SQL Server:Buffer Manager
Counter: - Page Life Expectancy *
Preferred Value: - > 300
Description: - This performance monitor counter tells you, on average, how long data pages are staying in the buffer. If this value gets below 300 seconds, this is a potential indication that your SQL Server could use more memory in order to boost performance.
Object: - SQL Server:Buffer Manager
Counter: - Lazy Writes/Sec
Preferred Value: - < 20
Description: - This counter tracks how many times per second the Lazy Writer process is moving dirty pages from the buffer to disk in order to free up buffer space. If it is zero, your SQL Server's buffer cache is plenty big and SQL Server doesn't have to free up dirty pages. If this value is high, a need for more memory is indicated.
Object: - SQL Server:Buffer Manager
Counter: - Page reads/sec **
Preferred Value: - < 90
Description: - Number of physical database page reads issued. 80–90 per second is normal; anything above that indicates an indexing or memory constraint.
Object: - SQL Server:Buffer Manager
Counter: - Page writes/sec
Preferred Value: - < 90
Description: - Number of physical database page writes issued. 80–90 per second is normal; anything more and we need to check the Lazy Writes/sec and checkpoint counters. If these counters are also relatively high, it's a memory constraint.
Object: - SQL Server:Buffer Manager
Counter: - Buffer Cache hit ratio
Preferred Value: - > 90%
Description: - Percentage of pages that were found in the buffer pool without having to incur a read from disk.
Object: - SQL Server:Buffer Manager
Counter: - Target Server Memory(KB)
Preferred Value: -
Description: - Total amount of dynamic memory the server can consume.

Object: - SQL Server:Buffer Manager
Counter: - Total Server Memory(KB)
Description: - Total amount of dynamic memory (in kilobytes) that the server is currently using.
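A sketch for pulling Target and Total Server Memory together from sys.dm_os_performance_counters; the filter is deliberately loose on counter names, since their exact spelling (and the owning object) can vary slightly between versions:

-- How much memory SQL Server wants (Target) vs. how much it has committed (Total).
-- Total staying well below Target for long periods can indicate external memory pressure.
SELECT
    RTRIM(counter_name) AS counter_name,
    cntr_value / 1024   AS value_mb
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE 'Target Server Memory%'
   OR counter_name LIKE 'Total Server Memory%';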
Disk Bottleneck Analysis:

Object: - PhysicalDisk
Counter: - Avg. Disk sec/Read
Preferred Value: - < 8ms
Description: - Measure of disk latency. Avg. Disk sec/Read is the average time, in seconds, of a read of data from the disk.
More Info:
Excellent: < 8 msec (.008 seconds)
Good: < 12 msec (.012 seconds)
Fair: < 20 msec (.020 seconds)
Poor: > 20 msec (.020 seconds)
Object: - PhysicalDisk
Counter: - Avg. Disk sec/Write
Preferred Value: - < 8ms
Description: - Measure of disk latency. Avg. Disk sec/Write is the average time, in seconds, of a write of data to the disk.
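Perfmon shows latency per disk; if you want the same picture per database file from inside SQL Server, sys.dm_io_virtual_file_stats accumulates read/write stalls since the last restart. A sketch (the averages are since startup, not a live rate):

-- Average read/write latency per database file; compare to the
-- Avg. Disk sec/Read and Avg. Disk sec/Write thresholds above.
SELECT
    DB_NAME(vfs.database_id)                               AS database_name,
    mf.physical_name,
    vfs.num_of_reads,
    CASE WHEN vfs.num_of_reads = 0 THEN 0
         ELSE vfs.io_stall_read_ms / vfs.num_of_reads END  AS avg_read_latency_ms,
    vfs.num_of_writes,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;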
Object: - PhysicalDisk
Counter: - Avg. Disk Read Queue Length
Preferred Value: - < 2 * spindles
Description: - Avg. Disk Read Queue Length is the average number of read requests that were queued for the selected disk during the sample interval.
More Info:
Excellent: < (2 + no. of spindles)
Good: < (2 * no. of spindles)
Fair: < (3 * no. of spindles)
Note: If the array has, say, 20 disks and it is RAID 10, then the no. of spindles = 20/2 = 10. If it is RAID 5, then the no. of spindles = no. of disks = 20.
Object: - PhysicalDisk
Counter: - Avg. Disk Write Queue Length
Preferred Value: - < 2 * spindles
Description: - Avg. Disk Write Queue Length is the average number of write requests that were queued for the selected disk during the sample interval.
Processor Bottleneck Analysis:

Object: - Processor
Counter: - %Processor Time
Preferred Value: - < 80%
Description: - % Processor Time is the percentage of elapsed time that the processor spends executing a non-idle thread.
Object: - System
Counter: - Processor Queue Length
Preferred Value: - < 4 per CPU
More Info:
Excellent: <= 4 per CPU
Good: < 8 per CPU
Fair: < 12 per CPU
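An in-engine analogue of the Processor Queue Length is the number of runnable tasks per scheduler in sys.dm_os_schedulers; a sustained, non-trivial runnable queue across most schedulers points at CPU pressure. A sketch:

-- Runnable tasks per scheduler: tasks that are ready to run but waiting for CPU time.
SELECT
    scheduler_id,
    cpu_id,
    current_tasks_count,
    runnable_tasks_count,     -- waiting for CPU
    work_queue_count,
    pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';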
Object: - SQLServer:Access Methods
Counter: - Full Scans/sec ***
Preferred Value: - < 1
Description: - If we see high CPU, we need to investigate this counter; otherwise, if the full scans are on small tables, we can ignore it. Values greater than 1 or 2 indicate that we are having table/index page scans. We need to analyze how this can be avoided.
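The same Access Methods counters are visible in sys.dm_os_performance_counters; as with the other "per second" counters in that DMV, the values are running totals, so compare two samples (or watch perfmon) rather than reading them as rates. A sketch that also pulls Index Searches/sec for context:

-- Cumulative Access Methods counters; the useful signal over time is the ratio
-- of Full Scans/sec to Index Searches/sec, not the raw totals shown here.
SELECT
    RTRIM(counter_name) AS counter_name,
    cntr_value          AS cumulative_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Access Methods%'
  AND counter_name IN ('Full Scans/sec', 'Index Searches/sec', 'Page Splits/sec');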
Object: - SQLServer:Access Methods
Counter: - Page Splits/sec
Preferred Value: - < 20
Description: - Interesting counter that can point us at our table/index design. This value needs to be as low as possible. If you find that the number of page splits is high, consider fixing the indexes.

Overall SQL Server Bottleneck Analysis:
Object: - SQLServer:General Statistics
Counter: - User Connections
Preferred Value: -
Description: - The number of users currently connected to the SQL Server.
Object: - SQLServer:SQL Statistics
Counter: - Batch Requests/Sec
Preferred Value: - < 1000
Description: - Over 1000 batch requests per second indicates a very busy SQL Server.
Transaction Management:

Object: - SQL Server:Locks
Counter: - Number of Deadlocks/sec
Preferred Value: - < 1
Description: - The number of lock requests that resulted in a deadlock.
Object: - SQL Server:Locks
Counter: - Lock Requests/sec
Preferred Value: - < 1000
Description: - Number of requests for a type of lock per second. Lock Requests/sec > 1000 indicates that queries are accessing a large number of rows, and can be an indication of blocking.
Object: - SQL Server:Locks
Counter: - Average Wait Time (ms)
Preferred Value: - < 500
Description: - This is the average wait time in milliseconds to acquire a lock. The lower the value, the better. If the value goes higher than 500, there may be blocking going on.
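When the lock counters look suspicious, a quick way to see whether blocking is happening right now is sys.dm_exec_requests; a minimal sketch listing blocked requests and who is blocking them:

-- Currently blocked requests and the sessions blocking them.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time            AS wait_time_ms,
    r.command,
    DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;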
* Use an adaptive formula like (DataCacheSizeInGB / 4GB * 300). For example, for SQL Server configured to use 12GB of RAM, the PLE threshold should be 12GB / 4GB * 300 = 900.
Be aware: a buffer cache hit ratio of 99+ combined with a very low page life expectancy indicates lots of I/O, but the I/O subsystem is keeping up. Check for tempdb issues and backup/restore operations, and consider moving as much data as possible out to history tables/databases.
** This counter represents actual physical page reads from disk. Because physical I/O is expensive, you may be able to minimize the cost either by using a larger data cache, intelligent indexes, and more efficient queries, or by changing the database design. The cause can also be missing indexes that result in table scans, driving up CPU usage and I/O.
*** High value for Full Scans/sec? Missing indexes and/or too many rows requested. Queries with missing indexes, or that request too many rows, will have a large number of logical reads and increased CPU time. The more indexes you have, the fewer full scans you need.
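As a starting point for that analysis, the missing-index DMVs list indexes the optimizer wished it had; treat the output as hints to investigate, not as a script to apply blindly. A sketch (the benefit score is an ad hoc ranking formula, not an official metric):

-- Top missing-index suggestions, ranked by a rough benefit estimate.
SELECT TOP (20)
    DB_NAME(mid.database_id) AS database_name,
    mid.statement            AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact,    -- estimated % improvement for the affected queries
    migs.avg_total_user_cost * migs.user_seeks * (migs.avg_user_impact / 100.0) AS rough_benefit_score
FROM sys.dm_db_missing_index_details     AS mid
JOIN sys.dm_db_missing_index_groups      AS mig  ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs ON migs.group_handle = mig.index_group_handle
ORDER BY rough_benefit_score DESC;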
NOTE: Credit for this post belongs to Ariel Fridman.