Memory Consistency Introduction Tutorial 3/19/2004
Memory Consistency Tutorial
This is the stuff that causes poor little programmers' brains to turn to mush and makes them want to go sit out in a cow field all day.
I think that memory consistency came about as follows:
Head design engineer: “So we’ve finished the new super-fast DSM supercomputer and now we’re ready to go into production. Just in time, we should make millions.”
Student designer: “I think we forgot something. You know how when memory gets written to we have to make sure all the processors know about it before any of them continues, because they're so fast they might use the old data? Well, I think we forgot to put an acknowledge signal on the bus lines. We've got to go back to redesign and testing.”
Head design engineer: “Hmmm, we've run out of time; any more extensions on the deadlines and we might as well sell it as a Windows clone. I'll tell you what we'll do: wrap it up, ship it out, and we'll just call it a feature.”
The ideal of memory consistency is that, in a multiprocessor environment where memory could be physically anywhere, a memory location should be seen as holding the same data value everywhere at any particular time. The reality is that different processors could be using the same piece of memory in different ways at the same moment unless some kind of checking mechanism is in place to maintain an order that all processors can respect. The trick is how to do this and still get an execution rate that is competitive. So the subject of memory consistency is really about how certain things can be allowed to happen out of sequential order, but only to the extent that the important or critical items are not affected.
At the beginning of the course we compared a pipelined system with an unpipelined one. Then we found that the ideal case had certain hazards that slowed things down a bit. The processor still executed a program in a sequential fashion from the point of view of the user, but internally there were a bunch of checks and flags to keep track of the order of the data, and sometimes an instruction was actually prevented from continuing until the proper data was available. We even got to the point of looking at computers that did out-of-order instruction execution, where later instructions in the stream actually finished before earlier ones. There is a similarity here with the multiprocessor consistency models. The whole idea of having multiple processors is to get as much concurrency as possible. The sequential aspect of programming still exists to the extent that the programmer needs a sensible order for maintaining data that is affected by various processors, but the tendency of a processor is just to keep processing unless something causes it to pause. Added to this is the possibility of a processor suspending its operation for some unrelated reason (such as in a multi-process environment, or due to some I/O situation that arises). All of this means the processes executing on the various processors cannot be guaranteed to run in the same relative instruction order from one run to the next, so each run of the program could produce a different result, often a completely wrong one.
The individual processors still work in much the same way as we have already studied in this course. A program is executed in a sequential fashion even if, underneath, it is taking advantage of advanced techniques. In a uniprocessor it is possible to keep track of the data flow with data forwarding techniques, but this becomes much more complicated when the producer and consumer instructions are on different processors. Sometimes instructions need to be ordered so that a write from one processor is read by another processor only after the data has been produced; even communicating when the data has been produced can be a challenge.
So some kind of synchronization mechanism must be instituted to provide order in the sequence of global multiprocessor events so that data integrity can be maintained. This is the purpose of the locks/unlocks, semaphores, barriers and other synchronization techniques that most students at this level have confronted from time to time. In the simplest sense we can write a program that runs as various processes on separate processors and, wherever there is a chance that data could be affected by more than one source, put in a synchronization mechanism or define an area of the program as a critical section, thereby not allowing any other process to change the critical data during that time. This is enough to get a bit of concurrency out of an algorithm and hence some speedup on the overall execution time of the complete task.
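As a minimal sketch of that idea (my own illustration in C++, not something from the course notes; the names such as shared_counter and worker are made up for the example), two threads stand in for two processes, and a lock marks the critical section so that only one of them can change the shared data at a time:

#include <iostream>
#include <mutex>
#include <thread>

// Shared data that more than one process/processor could affect.
static long shared_counter = 0;
static std::mutex counter_lock;   // synchronization mechanism guarding the critical section

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        // Critical section: while the lock is held, no other thread
        // is allowed to change shared_counter.
        std::lock_guard<std::mutex> guard(counter_lock);
        shared_counter = shared_counter + 1;
    }
}

int main() {
    std::thread t1(worker, 100000);
    std::thread t2(worker, 100000);
    t1.join();
    t2.join();
    // With the lock the total is always 200000; without it, each run
    // could give a different (wrong) total.
    std::cout << shared_counter << "\n";
    return 0;
}

Only the increment is protected here, which is exactly the point: the rest of each process runs concurrently, and the synchronization is confined to the data that more than one source can affect.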
With a global shared memory space we have the opportunity to squeeze out a bit more concurrency by allowing many memory operations to proceed concurrently, as long as they do not seriously affect the execution of the global task at hand. This little bit of squeezing actually translates into huge time savings, because not everything has to wait until previous operations have fully completed and are recognized as complete on every processor.
In my investigation of this topic three aspects presented themselves. The first was the conceptual side of memory consistency, which I had a rough grasp of from when I first went through the course, although primarily only with respect to the sequential model and the release consistency model. On further investigation two questions kept appearing: how does the computer actually implement the concepts, and how is the programmer affected by the details.
Memory consistency, in a sense, is exactly what the models do not provide. The concept is based on maintaining consistency in a wild environment, but the hardware takes shortcuts that undermine confidence in the data. It is the programmer or operating system that has to put in safety features to ensure the integrity of the critical data. A set of procedures and hardware instructions (hardware primitives) is made available through the compiler to prevent mistakes from occurring.
The textbook describes a number of models, two of which we pay particular attention to. The diagram summarizing the models, Figure 8.40 on page 717, is only in the second edition; the third edition also had a lot of the description removed. Be aware that the sequential model is one step removed from the strict consistency model, which is not mentioned in either edition but is the most basic kind of memory consistency.
The models are well described in the reference http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
The strict consistency model is what you would expect memory consistency to be: a read of a data location always returns the value stored by the most recent write, no matter which processor performed that write. This corresponds to the cache coherence protocol studied previously: when a write occurs all the other caches are invalidated, and the next read on another processor will be a read miss that obtains the newly written value. There is still the possibility that a read happens before the write when it was meant to happen after it, so synchronization still has to be used by the programmer for sections that must be done in a desired order. The disadvantage of this simple model is that every memory reference carries an ordering and acknowledgement structure whether it needs it or not. This leads to unnecessary delays that multiply as the number of processors increases.
The reference uses these examples:
W is write, R is read, x is a data location, and the number is the value read or written. Time runs from left to right along each processor's line.

P1: W(x)1
P2:         R(x)1     R(x)1
(valid)

P1:               W(x)1
P2: R(x)0               R(x)1
(valid -- the old value is read before the write)

P1: W(x)1
P2:         R(x)0     R(x)1
(not valid -- the old value is read after the write)
The first two are valid and the last one is not, because when the write happens all the processors must recognize the new value immediately (with respect to the temporal sequence of events, which means a long delay may be incurred to ensure it happens this way). The third case, where the old value still exists and is read by another processor after the writing processor has made the change, cannot occur. The processors still operate asynchronously, but the memory does not allow events to be overlapped.
The sequential consistency model changes the rules slightly by allowing memory operations to overlap, but only as long as the operations are seen in the same order by all processors. So the examples above are all valid under this model, because in the third example the first read of P2 simply occurred before the change made by P1 had propagated to P2. It may seem weird, but if the execution of the global task is not upset by this inconsistency the time savings can be considerable. The examples in the reference show the effects with 4 processors:
P1: W(x)1
P2:         R(x)1     R(x)2
P3:         R(x)1     R(x)2
P4: W(x)2
(valid)

P1: W(x)1
P2:         R(x)2     R(x)1
P3:         R(x)2     R(x)1
P4: W(x)2
(valid)

P1: W(x)1
P2:         R(x)1     R(x)2
P3:         R(x)2     R(x)1
P4: W(x)2
(not valid)
I interpret this to mean that, whichever of P1 or P4 does its write first, all processors must recognize that write as having come first, and they must all agree on the same order regardless of when the other write was done. I think the reference shows it this way to emphasize that the writes themselves are asynchronous. It may be easier to see by putting a slight order on the writes.
P1: W(x)1
P2:               R(x)1     R(x)2
P3:               R(x)1     R(x)2
P4:       W(x)2
(valid)
Or
P1:       W(x)1
P2:               R(x)2     R(x)1
P3:               R(x)2     R(x)1
P4: W(x)2
(valid)
Seeing the writes in the same order does not necessarily mean every intermediate value has to be observed; for example, this could also happen:
P1: W(x)1
P2:               R(x)2     R(x)2
P3:               R(x)2     R(x)2
P4:       W(x)2
(valid)
The reference's examples always use the same data location. The value of the model may be easier to see if different locations are used:
P1: W(x)1
P2:               R(x)1     R(y)2
P3:               R(x)1     R(y)2
P4:       W(y)2
(valid)
Here both readers see the change to y only after the change to x has gone through. The value of this becomes clearer in the frequently cited reference paper http://citeseer.nj.nec.com/adve95shared.html
Here an example is used that shows the complication nicely:

P1:          P2:                P3:
A = 1
             IF (A == 1)
                 B = 1
                                IF (B == 1)
                                    Register = A
By sequential consistency rules, if P3's test of B == 1 succeeds, P3 must see the value of A as being what P1 changed it to.
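As a rough translation of that example into code (my own sketch using C++11 sequentially consistent atomics, not something given in the paper), the assert in p3 can never fire: if p3 observes B == 1, sequential consistency forces it to also observe the value of A written by p1.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A(0), B(0);   // both locations start at 0

void p1() {
    A.store(1);                // P1: A = 1   (sequentially consistent by default)
}

void p2() {
    if (A.load() == 1)         // P2: IF (A == 1)
        B.store(1);            //         B = 1
}

void p3() {
    if (B.load() == 1) {       // P3: IF (B == 1)
        int reg = A.load();    //         Register = A
        // Under sequential consistency the single global order is
        // A=1, then B=1, then this read, so reg must be 1.
        assert(reg == 1);
    }
}

int main() {
    std::thread t1(p1), t2(p2), t3(p3);
    t1.join(); t2.join(); t3.join();
    return 0;
}

On a weaker model the same program would need explicit synchronization (or acquire/release orderings on the stores and loads) to give the same guarantee.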
It's worth mentioning at this point that the reference http://citeseer.nj.nec.com/81907.html notes that some hardware enhancements that have become popular are not allowed under this model. One such technique is buffering writes in a processor so that reads can get priority on the bus. This makes the processor seem faster, because writes are habitually just housekeeping functions, and if a read needs a value that is still in the write buffer the processor can easily check whether the address is in its own list. With one processor this enhancement causes no problem, but with many processors and a distributed global memory the buffered writes can reach memory in an order other processors do not expect: a location may have a desired order according to one processor, while another processor with a shorter write buffer performs its write out of order. Searching across all those buffers also becomes much more complex as the number of processors grows.
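The classic litmus test for this is sketched below (my own illustration, not taken from the reference). Each thread writes one flag and then reads the other. Under sequential consistency at least one of them must see the other's write, so r1 == 0 and r2 == 0 together is impossible; once a write can sit in a buffer while a later read goes ahead of it, that outcome becomes possible. The relaxed atomics are just a portable way of modelling what the buffered hardware is allowed to do.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x(0), y(0);
int r1 = 0, r2 = 0;

void p1() {
    x.store(1, std::memory_order_relaxed);   // the write may linger in a write buffer...
    r1 = y.load(std::memory_order_relaxed);  // ...while this read goes to memory first
}

void p2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(p1), t2(p2);
    t1.join(); t2.join();
    // Sequential consistency forbids r1 == 0 && r2 == 0;
    // buffered (reordered) writes allow it.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}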
Another hardware enhancement is data pre-fetching. This is a look-ahead capability in the instruction pipeline that allows operand addresses to be calculated and the data obtained before it is needed. This is fine for one processor, but in a multiprocessor environment, if another processor changes the value after the original instruction has already obtained the data from that location, you either have to put in extra cancellation operations or hold the prefetched data back for that instance. Either way it becomes complicated, and rather than lose the enhancement by adhering strictly to the consistency model it may be better to relax the model even more and just use synchronization instructions to protect critical data. It is an interesting note that these are hardware enhancements we did not cover in this course, yet they are easily explainable because of the background you have developed during the study of the subject. At the beginning of the course pipelining was a vague concept, and now both its value and the problems that come with it are second nature.
What really makes these consistency models usable is that there exist a number of synchronization instructions and procedures that allow the programmer to force certain program orders to be maintained. These hardware primitives, which the reference http://citeseer.nj.nec.com/adve95shared.html calls a safety net, provide locks and barriers so that the programmer or operating system can manage the flow of information. There are a number of techniques, such as the test-and-set instruction and acquire/release operations, which can enforce order as it applies to the particular memory consistency model. Be aware that the consistency model is an inherent part of the computer system; the user's choice is to exploit the advantages of the model or to ignore them and force synchronization on everything. The idea is to get the multiprocessor running a task effectively so it doesn't resemble a snail hitching a ride on top of a turtle. The textbook also discusses various hardware primitives that are commonly used. Figure 8.39 on page 717 of the old (second edition) book lists the consistency model used by various computers in shared-memory multiprocessor environments.
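To make that last point a little more concrete, here is a sketch (my own, not from the textbook) of how a test-and-set primitive and acquire/release semantics fit together: a simple spin lock. Acquiring uses test-and-set with acquire ordering so that reads inside the critical section cannot move above the lock; releasing uses release ordering so that writes inside it are visible before any other processor can see the lock as free.

#include <atomic>
#include <thread>

static std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;  // the lock variable
static int shared_value = 0;                           // the critical data it protects

void acquire_lock() {
    // test_and_set atomically sets the flag and returns its previous value;
    // keep spinning until the previous value was clear (the lock was free).
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // busy wait
    }
}

void release_lock() {
    lock_flag.clear(std::memory_order_release);
}

void worker() {
    for (int i = 0; i < 1000; ++i) {
        acquire_lock();
        shared_value += 1;   // critical section
        release_lock();
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    return 0;
}

The consistency model built into the machine decides how much reordering is allowed everywhere else; primitives like this are the safety net the programmer reaches for around the data that actually matters.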