We are currently running SLES 11 SP4 with KVM in an HA cluster.
One of the DomUs is a Windows 2012 R2 Standard server with MS SQL 2014 (mirrored to another DomU Windows 2012 R2 standard server with MS SQL).
Only on Friday mornings, between 1:00am and 8:00am, the MS SQL database goes offline for a few seconds (once or several times) and then comes back online.
I checked the backups, scheduled tasks and SQL plans, and nothing is running at the time the MS SQL DB goes offline and comes back online.
I'm wondering if there is a setting that gives the DomU a higher memory, processor and storage priority, so that the DomU keeps the SQL DB up.
I’m wondering what precisely you mean by “memory, processor, storage priority”.
If it’s about assigning more memory / CPU to the guest dynamically, probably one of the results from http://www.lmgtfy.com/?q=KVM+increase+CPU+and+RAM+during+runtime will help. Storage, otoh, is a different story - that would depend on the backend capabilities and it’s rather unlikely that you’ll find an easy way to prioritize storage access per guest.
What do you see, other than “the MS SQL database go offline for a short time”? Is there a high load on the host or significant I/O during that period, is the guest reachable via ICMP during those DB outages, and how exactly do you see that the DB is offline? IOW: What other bottleneck analysis results are there?
It’s easy enough to add memory, processors (CPUs) and storage to the DomU. What I’m looking for is a way to set the priority of the DomU so that its memory, processor and storage (and networking) get a higher priority / higher performance on the host than the other DomUs, so this one DomU is never starved of any of the resources it uses.
When I say “the MS SQL database go offline for a short time”, I mean that the Windows event log in the DomU shows MS SQL reporting “SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file”. The clients show "Logger Event Acct: 9999999998 Stn: 99 Err: (103) Database Err " for 1-3 seconds, then they reconnect and all is well until the next time the MS SQL server hesitates.
So I would like to set a priority for this one DomU to be higher than the other DomUs so the DomU does not hesitate and the MS SQL database does not go offline for a short time.
Except for manual prioritization of the virtualization process on the Dom0, I’m not aware of any way to alter the priority of a DomU, especially not concerning I/O.
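To illustrate what I mean by manual prioritization, here is a rough Python sketch (not a tested recipe). It assumes Python 3.3 or later on the host, root privileges, and that the guest's QEMU process was started with `-name NAME` or `-name guest=NAME,...`. The function names and the guest name `sqlvm` are made up for the example. It only changes the CPU nice value, so it does nothing for disk I/O.

```python
import os


def is_guest_process(cmdline_bytes, guest_name):
    """True if a NUL-separated /proc/<pid>/cmdline belongs to the named QEMU guest."""
    args = cmdline_bytes.decode(errors="replace").split("\0")
    for i, arg in enumerate(args[:-1]):
        if arg == "-name":
            # Assumption: the name is passed as "-name NAME" or "-name guest=NAME,..."
            first = args[i + 1].split(",")[0]
            if first in (guest_name, "guest=" + guest_name):
                return True
    return False


def find_guest_pids(guest_name, proc_root="/proc"):
    """Return the PIDs whose command line matches the given guest name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                if is_guest_process(f.read(), guest_name):
                    pids.append(int(entry))
        except OSError:
            continue  # process exited meanwhile or is not readable
    return sorted(pids)


def raise_cpu_priority(pid, niceness=-5, proc_root="/proc"):
    """Lower the nice value of every thread of the process (needs root).

    On Linux the nice value is per thread, so all entries below
    /proc/<pid>/task are walked to cover the vCPU threads as well.
    """
    for tid in os.listdir(os.path.join(proc_root, str(pid), "task")):
        os.setpriority(os.PRIO_PROCESS, int(tid), niceness)
```

Something like `for pid in find_guest_pids("sqlvm"): raise_cpu_priority(pid)` would then be run as root on the host. The setting is lost whenever the guest is restarted or migrated, and it will not help if the delay comes from storage rather than CPU.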
Have you had a look at the I/O (and CPU) utilization of Dom0 during that period? It’d be interesting to see which resources are actually delaying the DomU’s I/O request.
If you don’t have any monitoring of these parameters active, e.g. SNMP-based, I’d recommend starting a run of both “vmstat” and “iostat”, per cron or manually, logging the output and checking for correlations between high iowait and the symptoms you see in the DomU. “iostat” may help to identify which I/O resource (disk) is actually loaded during that phase, at least as long as it is represented by a Dom0 block device (local disks, SAN LUNs).
This may help to pinpoint the actual root cause of the high delay experienced by your DomU - be it one of the DomUs, the Dom0 or some other component.
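If you'd rather have timestamped numbers than raw “vmstat” output, the iowait share can also be computed directly from the aggregate `cpu` line of `/proc/stat`. The following is only a minimal sketch of that idea. The function names, the 10-second interval and the sample count are my own choices, and it does not replace “iostat” for finding out which disk is busy.

```python
import time


def parse_cpu_line(line):
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' line samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user, so it is not added again.
    total = sum(b[:8]) - sum(a[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (b[4] - a[4]) / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


def log_iowait(interval=10, samples=6, out=print):
    """Emit one timestamped iowait percentage per interval."""
    prev = read_cpu_line()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_cpu_line()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        out("%s iowait=%.1f%%" % (stamp, iowait_percent(prev, cur)))
        prev = cur
```

Calling `log_iowait()` from a cron job that covers the Friday 1:00am to 8:00am window, with the output redirected to a file, gives timestamps that can be compared against the SQL Server event log entries.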
If the delaying resources are remote to your Dom0, you may also need to verify the remote service offering the resource(s) and the access path to that resource.
If you find that it’s some local disk that’s under heavy load, you might want to look into a faster storage solution (SSD caching / faster disks / …, always depending on the actual use case) - but those discussions would have to wait until the actual bottleneck is identified.
Oh, and a shot in the dark: You’re not by chance running some (MD-)RAID5/6 underneath the DomU’s virtual disk, which is getting scrubbed Friday mornings?
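Should there be an MD RAID underneath, a running scrub shows up in `/proc/mdstat` as a progress line containing `check = N%`. As a small sketch (the function name is made up, and it only covers Linux MD, not hardware RAID controllers), this could be polled during the Friday window:

```python
import re

# An array header looks like "md0 : active raid5 sdb1[0] ..."
ARRAY_RE = re.compile(r"^(md\S*)\s*:")
# A progress line looks like "[=>....]  check =  5.3% (...) finish=..."
SYNC_RE = re.compile(r"\b(check|resync|recovery|reshape)\s*=\s*([\d.]+)%")


def running_sync_actions(mdstat_text):
    """Return (array, action, percent_done) for every array with a sync action in progress."""
    results = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        progress = SYNC_RE.search(line)
        if progress and current is not None:
            results.append((current, progress.group(1), float(progress.group(2))))
    return results
```

On the host this would be called as `running_sync_actions(open("/proc/mdstat").read())`. An empty list means no check, resync, recovery or reshape was in progress at that moment.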
Initially the MS SQL mirror was set to high availability with automatic failover. Then, on a Friday morning, MS SQL did an automatic failover. Since the client software doesn’t know how to fail over, the MS SQL mirror was set to high safety (manual failover required), but on the next Friday morning the MS SQL system still hesitated enough for the client to lose its connection for a few seconds. Finally I removed the MS SQL mirroring, and this Friday morning the MS SQL system did not hesitate and the client stayed online with the SQL DB.
The app connects to several MS SQL DBs: 2 main ones that are constantly accessed (these are the DBs which get disconnected and logged in the app log file) and 5 other support DBs.
So it seems that the VM with MS SQL mirroring enabled puts more load on the system than it can handle without trouble.