Handling races: non-atomic messages (cont.)
Last time, we saw how to deal with read requests when another message (e.g., an invalidation) arrived while the read was being processed.
Now we want to consider what happens when a write request arrives.
Case 1: ReadX to a block in state U
Home-centric approach
- Requester sends ReadX request
- Home responds with data
- Requester sends Ack
- Home closes transaction.
Requester-assisted approach
- Requester sends ReadX request
- Home sends ______
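The home-centric handshake above can be sketched as a small message-driven simulation. This is a hypothetical illustration, not code from the text; the message names (ReadX, ReplyD, Ack) follow these notes.

```python
# Hypothetical sketch: home-centric handling of a ReadX to a block in
# state U (uncached). The home closes the transaction only after the
# requester's Ack arrives.
from collections import deque

def home_centric_readx_unowned():
    """Simulate the four-step handshake; return the message trace."""
    trace = []
    net = deque()
    net.append(("ReadX", "Requester", "Home"))            # 1. request
    while net:
        msg, src, dst = net.popleft()
        trace.append((msg, src, dst))
        if msg == "ReadX":
            net.append(("ReplyD", "Home", "Requester"))   # 2. data reply
        elif msg == "ReplyD":
            net.append(("Ack", "Requester", "Home"))      # 3. acknowledge
        elif msg == "Ack":
            trace.append(("close", "Home", "Home"))       # 4. close transaction
    return trace
```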
Case 2: ReadX to block in state S
Home-centric approach
- Requester sends ReadX request
- Home enters transient state and sends Inv msgs.
- InvAcks must be either
  - collected at Requester, which notifies Home, or
  - collected at Home
- Home closes transaction
Requester-assisted approach
- Requester sends ReadX request to home.
- Home sends Invs and closes the transaction
- InvAcks collected ______
- ______
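The home-centric variant of Case 2 can be sketched as follows. This is a hypothetical illustration (the set-of-pending-acks structure is an assumption), showing why the home must sit in a transient state until every sharer's InvAck is in.

```python
# Hypothetical sketch: home-centric ReadX to a Shared block, with the
# InvAcks collected at the home. The home stays in a transient state
# until every sharer has acknowledged; only then does it close.

def home_centric_readx_shared(sharers):
    state = "transient"
    pending = set(sharers)                      # InvAcks still outstanding
    trace = [("ReadX", "Requester", "Home")]
    for s in sharers:
        trace.append(("Inv", "Home", s))        # one Inv per sharer
    for s in sharers:                           # acks may arrive in any order
        trace.append(("InvAck", s, "Home"))
        pending.discard(s)
    assert not pending                          # all sharers invalidated
    state = "closed"                            # home closes the transaction
    return state, trace
```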
Case 3: ReadX to EM block
Home-centric approach
- Requester sends ReadX request to home
- Home enters transient state and sends Inv message.
- InvAck must be either
  - awaited at Requester, which notifies home, or
  - awaited at Home.
- Owner flushes block to home and requester.
- Upon receiving the block from owner, home closes transaction
Requester-assisted approach
- Requester sends ReadX request to home.
- Home sends Inv message to owner and closes transaction.
- Owner flushes block to requester.
- Requester buffers/Nacks new requests
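The requester-assisted flow for Case 3 can be sketched as below. A hypothetical illustration, not code from the text: it highlights that the home closes immediately after forwarding the Inv, so the requester must Nack (or buffer) anything that arrives before the owner's flush does.

```python
# Hypothetical sketch: requester-assisted ReadX to an E/M block. The
# home forwards one Inv to the owner and closes at once; the requester
# Nacks requests that race with the owner's Flush.

def requester_assisted_readx_em(incoming_while_waiting):
    trace = [("ReadX", "Requester", "Home"),
             ("Inv", "Home", "Owner"),
             ("close", "Home", "Home")]          # home is already done
    nacked = []
    for req in incoming_while_waiting:           # race window at the requester
        nacked.append(req)
        trace.append(("Nack", "Requester", req))
    trace.append(("Flush", "Owner", "Requester"))  # block finally arrives
    return trace, nacked
```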
Case 4: ReadX to EM block with data race
Is this different from Case 3 for home-centric approach?
For the requester-assisted approach?
- What if the current owner no longer has the block?
- Either it had it in state M and ______,
- or it had it in state E and ______.
- Home cannot close the transaction yet, as it may have to supply the block.
- Hence, it can close the transaction late, after it receives Ack from the owner.
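The late-close rule for Case 4 can be sketched as follows. This is a hypothetical illustration: the two branches show why the home waits for the owner's Ack, since only then does it know whether it must supply the block itself.

```python
# Hypothetical sketch of the Case-4 fix: the home cannot close when it
# forwards the Inv, because the owner may no longer have the block. It
# closes only after the owner's Ack tells it whether it must supply
# the data itself.

def home_close_late(owner_still_has_block):
    trace = [("ReadX", "Requester", "Home"),
             ("Inv", "Home", "Owner")]
    if owner_still_has_block:
        trace.append(("Flush", "Owner", "Requester"))   # owner supplies data
        trace.append(("Ack", "Owner", "Home"))
    else:
        trace.append(("Ack", "Owner", "Home"))          # "I no longer have it"
        trace.append(("ReplyD", "Home", "Requester"))   # home supplies data
    trace.append(("close", "Home", "Home"))             # always after the Ack
    return trace
```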
Dealing with Imprecise Directory Information
[§10.5.1] Why does directory information get stale over time?
What’s wrong with notifying the directory?
What problems does stale directory information cause?
1. Increased trouble (power consumption, latency) in locating a block
2. Storage overhead
3. Extra blocks invalidated when directory gets full
4. Increase in invalidation traffic
Let’s consider these in order.
Problem 1 is caused by three evictions. After the evictions, list the tags of the directory entries that are incorrect.
When a core C1 wants to fetch block C, it hedges, because the directory info might be incorrect.
- If the directory info is correct, where should it get the data from?
- If the directory info is incorrect, where should it go for the data?
- Which choice has the least latency?
- Which choice takes the least power?
- Which steps in the diagram illustrate the hazard (in terms of power and/or latency) in making the wrong choice?
- A “compromise” is to look in both places. Is this better from the standpoint of latency and/or power?
Problem 2 (storage overhead) is caused by unneeded directory entries occupying space in the directory. Which directory entries above are unneeded at the end of Step 9?
Problem 3 is illustrated by which steps in the above diagram?
How can it cause an unnecessary cache miss?
Problem 4 is higher invalidation traffic. But why might traffic not be higher when stale directory entries are allowed?
Solihin suggests two antidotes.
- Aggregating notification messages on clean-block purges.
- Predicting when directory blocks are invalid, based on # of cache misses from a particular LLC.
Accelerating thread migration
Ordinarily, a directory keeps track of which processor has cached a copy of a block.
If a thread moves from one processor to another, it will suffer a lot of cold misses.
What are the steps in servicing such a miss?
- P3 references block A, but A is not in its cache (C3).
- So P3 consults the directory, and finds that the block is cached in C1.
- It sends a request to C1, which responds by sending a ReplyD to the requester, C3.
Is there a way to avoid repeated references to the directory for each cache block that needs to move?
Solihin suggests adding a level of indirection to the directory.
- Instead of saying the block is cached in C1, it would say that it's cached in ______.
- Initially, V1 is set to point to ______, because that's where the block is cached.
- When the thread migrates to a new processor, the OS adds the new processor's cache to ______.
- This effectively says that any block cached in C1 can also be cached in ______.
- When a miss occurs, the corresponding line in C1 is consulted, and ______ transfers the block.
- This saves ______ message(s) per miss.
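The indirection scheme can be sketched with two small tables. A hypothetical illustration: the dict-based layout is an assumption, and the names V1, C1, C3 follow these notes.

```python
# Hypothetical sketch of Solihin's extra level of indirection: the
# directory names a pointer vector (here "V1") instead of a cache.
# Migrating a thread updates the vector once, rather than rewriting
# the directory entry of every block the thread had cached.

directory = {"A": "V1", "B": "V1"}   # block tag -> pointer-vector name
vectors   = {"V1": ["C1"]}           # vector -> caches that may hold the block

def migrate(vec, new_cache):
    """OS hook on thread migration: one update covers every block
    whose directory entry names this vector."""
    vectors[vec].append(new_cache)

def lookup(block):
    """Resolve a miss by following the indirection."""
    return vectors[directory[block]]
```

With these tables, moving the thread's blocks from C1's processor to C3's is a single `migrate("V1", "C3")`: blocks A and B are both findable in C1 afterward, with no per-block directory update.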
Lecture 24: Architecture of Parallel Computers