Starvation:

Some explanations:

“Starvation is a multitasking-related problem, where a process is perpetually denied necessary resources. Without those resources, the program can never finish its task.

The fault lies in the scheduling algorithm. The scheduling algorithm, which is part of the kernel, is supposed to allocate resources equitably; that is, the algorithm should allocate resources so that no process perpetually lacks necessary resources.

Starvation is related to deadlock. Deadlock occurs when two programs each hold resources the other needs to finish, and neither is willing to give them up. Starvation occurs when one program holds resources the other needs, but is unwilling to give them up.”

“Starvation is when the Java runtime (JVM) doesn't allocate time to a thread to execute. This may be due to a poor scheduling algorithm (like green threads under Solaris, where a for loop from 1 to 1 million doing something CPU-intensive wouldn't yield the CPU under Solaris but would under Windows), poor programming practice (not returning from the paint() method in an applet), or a hostile attack (like hitting a host with a denial-of-service attack that keeps the CPU busy outside the Java process).”

“A more common reason for starvation is poor thread priority strategies. If one thread has a high priority, is very busy, and never blocks, then all threads with a lower priority will starve.

To avoid thread starvation, a useful (if somewhat counter-intuitive) rule of thumb is: give threads with a lot to do (e.g. calculator threads) low priorities; and give threads with a little to do (e.g. IO-bound threads) high priorities. That way the high-priority threads wake up, do their thing, and go back to sleep, allowing the low-priority threads time to grind through their tasks. Using the Java IO libraries, this blocking and waking happens automatically, so all you have to do is set the priority for your threads at the outset.”

Deadlock:

A state in which a set of processes are each waiting for an event that only another process in the set can cause, so none of them can proceed and (part of) the system stops running.

There are three ingredients necessary to make a deadlock possible:

1) Mutual exclusion: only one process at a time can use a specific resource.

2) Hold and wait: a process can keep the resources already allocated to it while waiting for other resources to become available.

3) No preemption: a resource will only be released if the process holding it does so voluntarily (there is no way to force a release).

If these ingredients are present in the system, a deadlock will arise in one specific scenario: a circular wait, in which each process in a chain holds a resource that the next process in the chain is waiting for, and the chain closes into a cycle.

What to do:

Take away one of the ingredients.

Make sure the scenario cannot arise.

Avoid deadlock by not granting access to a resource if this can lead to a deadlock.

Detection: always grant resource access but periodically check for deadlocks and resolve them when present.
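The detection strategy can be sketched in code: record, for each blocked process, which process holds the resource it is waiting for (a "wait-for" graph), then look for cycles. The following Python sketch is illustrative only; the function and process names are invented:

```python
def find_deadlocked(wait_for):
    """wait_for maps each blocked process to the process it waits on.
    Returns the set of processes stuck in a wait cycle (deadlocked)."""
    deadlocked = set()
    for start in wait_for:
        path = []
        p = start
        # Walk the wait-for chain until it ends or revisits a node.
        while p in wait_for and p not in path:
            path.append(p)
            p = wait_for[p]
        if p in path:                          # the walk closed into a cycle
            deadlocked.update(path[path.index(p):])
    return deadlocked

# A waits on B, B waits on C, C waits on A: a cycle.
# D waits on A, but D is not itself part of the cycle.
cycle = find_deadlocked({"A": "B", "B": "C", "C": "A", "D": "A"})
no_cycle = find_deadlocked({"A": "B", "D": "B"})
```

Once a cycle is found, a resolver would break it by preempting or aborting one of the processes in the cycle.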

The success of these last three options depends on whether you have thought of all possible situations.

Checking every possible state of the system is the only way to be sure that a deadlock situation will not arise. But, given that the number of states of a parallel system usually grows exponentially with the number of processes, this task quickly becomes too great to handle. Experimental tests to try to find deadlock would reveal any obvious problems, but there might be deadlocks that require many years of running time to appear which we would never detect. You could attempt to construct a mathematical proof of deadlock-freedom, but you would soon discover that, even for small programs, this is often extremely difficult and time-consuming. The problem with all these approaches is that the deadlock issue has been left to the end of the software development process, when it is really too late. Design rules are needed, which may be applied a priori: rules which guarantee deadlock-freedom, are not too restrictive, and are easy to follow.

Deadlock example:

Five philosophers sit around a table. Each has a fork to his left. An everlasting bowl of spaghetti is placed in the middle of the table. A philosopher spends most of his time thinking, but whenever he is hungry he picks up the fork to his left and plunges it into the bowl. As the spaghetti is very long and tangled he requires another fork to carry it to his mouth, so he picks up the fork to his right as well. If, on attempting to pick up either fork, he should find that it is already in use he simply waits until it becomes available again. When he has finished eating he puts down both forks and continues to think.

There is a serious flaw in this system, which is only revealed when all the philosophers become hungry at the same time. They each pick up their left-hand fork and then reach out for their right-hand fork, which is not there: a clear case of deadlock, as shown in Figure 1 below:

Solution:

Rules of varying complexity have been devised to tackle this problem. The simplest is to allocate to each resource a unique integer priority. Then deadlock may be avoided by ensuring that no user process ever tries to acquire a resource with higher priority than one it already holds. In the case of the Dining Philosophers we could label the forks from zero to four, clockwise around the table. Four out of the five philosophers would then have a fork of higher priority to their left than their right, and so their behaviour would conform to the rule. The fifth, however, would have fork number zero to his left and fork number four to his right, so he would break the rule. If he were to modify his behaviour to always pick up the fork to his right first, the risk of deadlock would be removed. This example illustrates the power of using design rules to prevent pathological behaviour.
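The labelling rule translates directly into code. In the Python sketch below (the philosopher count and number of servings are arbitrary choices for illustration), every philosopher acquires the higher-numbered of his two forks first, so no one ever requests a fork with higher priority than one he already holds; the run therefore always terminates:

```python
import threading

N = 5
forks = [threading.Lock() for _ in range(N)]   # forks labelled 0..4
meals = [0] * N

def philosopher(i, servings=3):
    left, right = i, (i + 1) % N
    # The priority rule: always pick up the higher-numbered fork first.
    first, second = max(left, right), min(left, right)
    for _ in range(servings):
        with forks[first]:
            with forks[second]:
                meals[i] += 1                  # eat one serving
        # think (nothing to do in this sketch)

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                   # terminates: no circular wait
```

Any single consistent global ordering of the forks works; what matters is that the fifth philosopher no longer reverses the order, so the wait cycle can never close.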

Race condition:

A race condition (or race hazard) is a flaw in a system or process where the output exhibits an unexpected critical dependence on the relative timing of events. The term originates with the idea that two signals are racing each other to be the first to affect the output. A common example is two processes that try simultaneously to change shared data. Information is lost and the outcome of the operations cannot be predicted.

Race condition examples:

Global variable: counter

Process 1:               Process 2:
variable temp1;          variable temp2;
temp1 = counter;         temp2 = counter;
change_value1(temp1);    change_value2(temp2);
counter = temp1;         counter = temp2;

To prevent this, the two processes must be synchronized, or access to the shared data must be regulated, for example with a semaphore.
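The counter scenario can be sketched in Python. Without the lock, an unlucky interleaving of the read-modify-write sequence can lose updates (CPython's global interpreter lock makes this harder to observe, but does not make the sequence atomic); with the lock, the final count is exact:

```python
import threading

counter = 0                         # the shared global from the example
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:                  # critical section: read-modify-write
            temp = counter          # read the shared value
            counter = temp + 1      # write back the changed value

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock, both increments of 100,000 are preserved and `counter` ends at exactly 200,000.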

Example 2:

If one process writes to a file while another is reading from the same location then the data read may be the old contents, the new contents or some mixture of the two depending on the relative timing of the read and write operations.

A common remedy in this kind of race hazard is file locking. A more cumbersome remedy is to reorganize the system such that a certain process (running a daemon or the like) is the only process that has access to the file, and all other processes that need to access the data in that file do so only via interprocess communication with that one process.
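Within a single process, the locking remedy can be sketched with one lock serializing every access to the file, so a reader never sees a half-written mixture. Between separate processes you would use OS-level file locking instead (e.g. fcntl.flock on Unix or msvcrt.locking on Windows). The file path and function names below are invented for illustration:

```python
import os
import tempfile
import threading

file_lock = threading.Lock()
path = os.path.join(tempfile.gettempdir(), "shared_example.txt")

def write_data(text):
    with file_lock:                 # writer holds the lock for the whole write
        with open(path, "w") as f:
            f.write(text)

def read_data():
    with file_lock:                 # reader waits until any write completes
        with open(path) as f:
            return f.read()

write_data("new contents")
```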

Priority Inversion

When tasks share resources, as they often do, strange things can and will happen. Priority inversions can be particularly difficult to anticipate. Here's an introduction to priority inversions and a pair of techniques you can use to avoid them.

Most commercial real-time operating systems (RTOSes) employ a priority-based preemptive scheduler. These systems assign each task a unique priority level. The scheduler ensures that of those tasks that are ready to run, the one with the highest priority is always the task that is actually running. To meet this goal, the scheduler may preempt a lower-priority task in mid-execution.

Because tasks share resources, events outside the scheduler's control can prevent the highest priority ready task from running when it should. If this happens, a critical deadline could be missed, causing the system to fail. Priority inversion is the term for a scenario in which the highest-priority ready task fails to run when it should.

Resource sharing

Tasks need to share resources to communicate and process data. This aspect of multi-threaded programming is not specific to real-time or embedded systems.

Any time two tasks share a resource, such as a memory buffer, in a system that employs a priority-based scheduler, one of them will usually have a higher priority. The higher-priority task expects to be run as soon as it is ready. However, if the lower-priority task is using their shared resource when the higher-priority task becomes ready to run, the higher-priority task must wait for the lower-priority task to finish with it. We say that the higher-priority task is pending on the resource. If the higher-priority task has a critical deadline that it must meet, the worst-case "lockout time" for all of its shared resources must be calculated and taken into account in the design. If the cumulative lockout times are too long, the resource-sharing scheme must be redesigned.

Since worst-case delays resulting from the sharing of resources can be calculated at design time, the only way they can affect the performance of the system is if no one properly accounts for them.

Priority inversions

The real trouble arises at run-time, when a medium-priority task preempts a lower-priority task using a shared resource on which the higher-priority task is pending. If the higher-priority task is otherwise ready to run, but a medium-priority task is currently running instead, a priority inversion is said to occur.


Figure 1 Priority inversion timeline

This dangerous sequence of events is illustrated in Figure 1. Low-priority Task L and high-priority Task H share a resource. Shortly after Task L takes the resource, Task H becomes ready to run. However, Task H must wait for Task L to finish with the resource, so it pends. Before Task L finishes with the resource, Task M becomes ready to run, preempting Task L. While Task M (and perhaps additional intermediate-priority tasks) runs, Task H, the highest-priority task in the system, remains in a pending state.

Many priority inversions are innocuous or, at most, briefly delay a task that should run right away. But from time to time a system-critical priority inversion takes place. Such an event occurred on the Mars Pathfinder mission in July 1997. The Pathfinder mission is best known for the little rover that took high-resolution color pictures of the Martian surface and relayed them back to Earth.

The problem was not in the landing software, but in the mission software run on the Martian surface. In the spacecraft, various devices communicated over a MIL-STD-1553 data bus. Activity on this bus was managed by a pair of high-priority tasks. One of the bus manager tasks communicated through a pipe with a low-priority meteorological science task.

On Earth, the software mostly ran without incident. On Mars, however, a problem developed that was serious enough to trigger a series of software resets during the mission. The sequence of events leading to each reset began when the low-priority science task was preempted by a couple of medium-priority tasks while it held a mutex related to the pipe. While the low-priority task was preempted, the high-priority bus distribution manager tried to send more data to it over the same pipe. Because the mutex was still held by the science task, the bus distribution manager was made to wait. Shortly thereafter, the other bus scheduler became active. It noticed that the distribution manager hadn't completed its work for that bus cycle and forced a system reset.

This problem was not caused by a mistake in the operating system, such as an incorrectly implemented semaphore, or in the application. Instead, the software exhibited behavior that is a known "feature" of semaphores and intertask communication. In fact, the RTOS used on Pathfinder featured an optional priority-inversion workaround; the scientists at JPL simply hadn't been aware of that option. Fortunately, they were able to recreate the problem on Earth, remotely enable the workaround, and complete the mission successfully.

Workarounds

Research on priority inversion has yielded two solutions. The first is called priority inheritance. This technique mandates that a lower-priority task inherit the priority of any higher-priority task pending on a resource they share. This priority change should take place as soon as the high-priority task begins to pend; it should end when the resource is released. This requires help from the operating system.

The second solution, priority ceilings, associates a priority with each resource; the scheduler then transfers that priority to any task that accesses the resource. The priority assigned to the resource is the priority of its highest-priority user, plus one. Once a task finishes with the resource, its priority returns to normal.

A beneficial feature of the priority ceiling solution is that tasks can share resources simply by changing their priorities, thus eliminating the need for semaphores:

void TaskA(void)
{
    ...
    SetTaskPriority(RES_X_PRIO);
    // Access shared resource X.
    SetTaskPriority(TASK_A_PRIO);
    ...
}

While Task A's priority is elevated (and it is accessing shared resource X), it should not pend on any other resource. The higher-priority user will only become the highest-priority ready task when the lower-priority task is finished with their shared resource.

While not all of us are writing software for missions to Mars, we should learn from past mistakes and implement solutions that don't repeat them. Many commercial RTOSes include support for either priority inheritance or priority ceilings. Just make sure you enable one.
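A toy discrete-time scheduler (entirely invented for illustration, not any real RTOS API) makes the benefit of priority inheritance concrete. Task L (low priority) holds a resource R that task H (high priority) needs, while task M (medium priority) just burns CPU; each tick, the highest-priority ready task runs:

```python
def simulate(inheritance):
    work = {"L": 5, "M": 4, "H": 2}       # remaining work units per task
    base_prio = {"L": 1, "M": 2, "H": 3}  # higher number = higher priority
    holder = "L"                          # L is inside its critical section
    critical_left = 3                     # units until L releases R
    timeline = []
    while any(work.values()):
        prio = dict(base_prio)
        h_pending = work["H"] > 0 and holder == "L"
        if h_pending and inheritance:
            prio["L"] = prio["H"]         # L inherits H's priority
        ready = [t for t, w in work.items()
                 if w > 0 and not (t == "H" and h_pending)]
        run = max(ready, key=lambda t: prio[t])
        timeline.append(run)
        work[run] -= 1
        if run == "L":
            critical_left -= 1
            if critical_left == 0:
                holder = None             # L releases R; H may run now
    return timeline

inverted = simulate(inheritance=False)    # M preempts L while H is pending
inherited = simulate(inheritance=True)    # L finishes the critical section first
```

Without inheritance, M runs to completion before the boosted H ever gets the CPU (the Figure 1 scenario); with inheritance, L clears the critical section immediately and H runs as soon as R is free.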

Priority inversion example: Priority Inversion and the Mars Pathfinder

The Mars Pathfinder mission was widely proclaimed as "flawless" in the early days after its July 4th, 1997 landing on the Martian surface. Successes included its unconventional "landing" -- bouncing onto the Martian surface surrounded by airbags, deploying the Sojourner rover, and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web. But a few days into the mission, not long after Pathfinder started gathering meteorological data, the spacecraft began experiencing total system resets, each resulting in losses of data. The press reported these failures in terms such as "software glitches" and "the computer was trying to do too many things at once".

Pathfinder contained an "information bus", which you can think of as a shared memory area used for passing information between different components of the spacecraft. A bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes).

The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus to publish its data. When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and if the information bus thread then attempted to acquire this same mutex in order to retrieve published data, this would cause it to block on the mutex, waiting until the meteorological thread released the mutex before it could continue.

The spacecraft also contained a communications task that ran with medium priority.

Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium priority) communications task to be scheduled during the short interval while the (high priority) information bus thread was blocked waiting for the (low priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running. After some time had passed, a watchdog timer would go off, notice that the data bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.

This scenario is a classic case of priority inversion.

Temporal Logic Assertions for the Detection of Priority Inversion

Let:

HIGHPriorityTaskBlocked() represent a situation where the information bus thread is blocked by the low priority meteorological data gathering task.