Misbehaving RT applications

Hi,

We have an HP ProLiant DL980 G7 (4x 6-core CPUs) server running SLES 11 SP2 + SLERT. The OS is configured to use CPU sets: the first 4 cores are reserved for the OS, and the application runs on the remaining 20 cores.

If we have an application running with real-time priority, it is possible for the application to cause the server to stop responding. We can still ‘ping’ the server but it is not possible to interact with the server via the console and it is not possible to SSH into the machine.

My impression was that placing the application on different CPUs from the OS would prevent the application from hindering the OS.

What am I missing?

Thanks, Jason

It’s possible for an application to effectively inhibit a system’s ability
to do anything else even if it is not consuming all CPUs directly, at
least on non-RT systems, and I suspect the same is true on RT systems. If
the four cores dedicated to the OS are busy doing work that supports the
application, for example, then that may be what is happening.

I’ve seen cases where a runaway process took all of the RAM and then
started consuming swap as well, on a system with too much swap. Even
though the process was single-threaded (so it only got one core out of
sixteen), the system was effectively useless until the Out-Of-Memory
(OOM) killer took over and nuked the lousy thing. Tying up the hard drive
is probably one of the easiest, minimal-effort things I can do to lock up
a system, particularly if I happen to be doing it while using up the
majority of system memory. For this reason, things like ulimit exist to
keep a process from using more resources than necessary, as do cgroups,
and perhaps this is what your application is using.
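
As a rough illustration of the ulimit idea, here is a minimal Python sketch
that starts a command with a cap on its virtual address space (the same
limit that "ulimit -v" sets in a shell). The helper names
make_memory_limiter and run_limited are made up for this example, and the
sketch assumes a Linux system. It does not cover cgroups.

```python
import resource
import subprocess


def make_memory_limiter(max_bytes):
    """Return a function that caps the address space of the calling process.

    Hypothetical helper for illustration; it is meant to be passed as
    preexec_fn so the limit applies only to the child process.
    """
    def apply_limit():
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never try to raise an existing hard limit; only lower it.
        if hard == resource.RLIM_INFINITY:
            new_limit = max_bytes
        else:
            new_limit = min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (new_limit, new_limit))
    return apply_limit


def run_limited(argv, max_bytes, **kwargs):
    """Run argv with its virtual address space capped at max_bytes."""
    return subprocess.run(
        argv, preexec_fn=make_memory_limiter(max_bytes), **kwargs
    )
```

With a cap in place, an allocation that would push the process past the
limit fails inside that process instead of driving the whole machine into
swap. The limit is per process, so it will not stop many small processes
from adding up.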

If you are already SSH’d into the system, or if you go to the system
console, can you still interact that way, meaning it is just new
connections that fail? Could you create a script to watch system
resources (I/O, memory usage, CPU utilization) while the application is
running so you can gather statistics even while disconnected or unable to
interact with the system?
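
Something along these lines might do as a starting point for such a
script. It is only a sketch, assuming a Linux /proc filesystem and Python 3.
The function names, the log format and the 5-second interval are my own
choices, not anything SLERT provides. It logs the 1-minute load average,
free memory, free swap and per-CPU utilization. I/O shows up only
indirectly (through the load average), so you may want to add
/proc/diskstats yourself.

```python
import os
import time


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def parse_cpu_times(text):
    """Return {cpu_name: (busy, total)} in jiffies from /proc/stat-style text."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue
        # user nice system idle iowait irq softirq steal (pad old formats)
        nums = ([int(x) for x in fields[1:9]] + [0] * 8)[:8]
        idle = nums[3] + nums[4]  # count iowait as not busy
        total = sum(nums)
        times[fields[0]] = (total - idle, total)
    return times


def cpu_utilization(before, after):
    """Return {cpu_name: percent busy} between two parse_cpu_times snapshots."""
    usage = {}
    for name, (busy1, total1) in after.items():
        if name not in before:
            continue
        busy0, total0 = before[name]
        delta = total1 - total0
        usage[name] = 100.0 * (busy1 - busy0) / delta if delta > 0 else 0.0
    return usage


def read_file(path):
    with open(path) as handle:
        return handle.read()


def monitor(log_path, interval=5.0, samples=None):
    """Append one line per interval to log_path; runs forever if samples is None."""
    previous = parse_cpu_times(read_file("/proc/stat"))
    count = 0
    with open(log_path, "a") as log:
        while samples is None or count < samples:
            time.sleep(interval)
            current = parse_cpu_times(read_file("/proc/stat"))
            mem = parse_meminfo(read_file("/proc/meminfo"))
            load1 = read_file("/proc/loadavg").split()[0]
            usage = cpu_utilization(previous, current)
            cpus = " ".join(
                "%s=%.0f%%" % (name, pct) for name, pct in sorted(usage.items())
            )
            log.write("%s load1=%s MemFree=%skB SwapFree=%skB %s\n" % (
                time.strftime("%Y-%m-%d %H:%M:%S"), load1,
                mem.get("MemFree", "?"), mem.get("SwapFree", "?"), cpus))
            # Push each line to disk so it survives a forced reset.
            log.flush()
            os.fsync(log.fileno())
            previous = current
            count += 1
```

Start it before launching the application, for example by calling
monitor("/var/tmp/rt-watch.log") from a small wrapper (the path is just an
example), and read the log after the machine has been recovered. If the
log stops being written at the moment the server hangs, that in itself
tells you the OS cores were not getting to run ordinary processes.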


Good luck.
