Synchronization Methods FOR SCRAMNet+ REplicated

SHARED-Memory Systems

by

Stephen Frank Menke

BEE, Georgia Institute of Technology, 1993

Submitted to the Graduate Faculty of

Arts and Sciences in partial fulfillment

of the requirements for the degree of

Master of Science

University of Pittsburgh

1999
This thesis was presented

by

Stephen Frank Menke

It was defended on

April 28, 1999

and approved by

Rami Melhem, Professor of Computer Science, Committee Member

Mark Moir, Assistant Professor of Computer Science, Thesis Advisor

Daniel Mossé, Associate Professor of Computer Science, Committee Member


Copyright by Stephen Frank Menke

1999
Synchronization Methods FOR SCRAMNet+ Replicated

Shared-memory Systems

Stephen Frank Menke, MS

University of Pittsburgh, 1999

SCRAMNet+ (Shared Common Random Access Memory Network) is a communications network that transparently provides replicated shared-memory via a high-speed fiber-optic ring topology. Such systems combine the ease of programming of shared-memory multiprocessor systems with the distance and heterogeneity of message-passing networks. These features are ideal for a variety of distributed real-time applications.

This thesis explores both blocking and non-blocking synchronization methods in such systems. We first develop a mutual exclusion algorithm, the most common blocking synchronization method, by exploiting unique features of the SCRAMNet+ hardware. Through theoretical and experimental analysis we compare our algorithm to a mutual exclusion algorithm suggested by the manufacturer, Systran Corp. The analysis concludes that our algorithm is both scalable and fair Systran’s algorithm is not. Our algorithm also has faster execution times for any size SCRAMNet+ network.

Although mutual exclusion is the most common method for synchronization, non-blocking methods overcome a number of problems caused by the use of mutual exclusion, such as deadlock. It is well known that strong primitives such as compare and swap (CAS) or load-linked/store-conditional (LL/SC) are required for general non-blocking synchronization. We therefore present and evaluate a CAS algorithm for SCRAMNet+ systems. We validate the algorithm by incrementing a shared-memory counter with the CAS operation. More significantly, we use the CAS algorithm to construct lock-free and wait-free large shared large objects, which are designed to overcome the problems associated with mutual exclusion. We experiment with both lock-free and wait-free versions of a queue to validate the large object implementation on a real system.

Although we used a real system to perform experiments on all the algorithms, it was limited to only two nodes. Therefore, we also built a simulator, based on Augmint, which can model any size SCRAMNet+ network. We used experiments to validate our simulation against our real-world results, which then allowed us to extend our analysis to systems with more than two nodes.

Acknowledgements

1

First and foremost, I would like to thank my future wife Carolyn for her love, patience and support. Both work and school demanded many long hours. However, her encouragement and smile always kept me going.

I would also like to thank my advisor, Mark Moir, for his flexibly and guidance. He has balanced my knowledge by adding the theoretical. I truly believe this will attribute to my career, wherever it may lead.

Thanks, too, to the rest of my committee: Rami Melhem and Daniel Mossé. I am grateful for their flexibility in arranging their schedules for my defense. This also includes the help from Daniel’s group in setting up RT-Mach.

Finally, I would like to thank Systran Corp. for supplying the hardware, software and documentation necessary to complete this thesis. Most importantly, Chris Fought from technical support, whose assistance was key to developing the driver for RT-Mach.

1

Table of Contents

1

1Introduction......

2SCRAMNet+ Hardware......

2.1General Purpose Counter / Global Timer......

2.2Error Correction......

2.3Interrupts......

2.4Write-Me-Last Mode......

3Blocking Synchronization......

3.1Systran’s Mutual Exclusion Algorithm......

3.1.1Acquire......

3.1.2Release......

3.2Our Mutual Exclusion Algorithm......

3.2.1Acquire......

3.2.2Release......

3.2.3ISR0

3.3Theoretical Comparison......

3.4System Experiments......

3.4.1No Contention......

3.4.2Contention......

3.5Simulation Experiments......

3.5.1No Contention......

3.5.2Contention......

3.5.3Polling......

3.5.4Heavy Contention......

3.6Conclusions and Future Work......

4Non-blocking Synchronization......

4.1Compare and Swap......

4.1.1CAS......

4.1.2Read......

4.1.3Analysis......

4.1.4Experiments......

4.2Large Objects......

4.2.1Experiments......

4.2.2Conclusions and Future Work......

5Simulation......

5.1Compile-Time......

5.2Run-Time......

5.2.1Events......

5.2.2Data Movement......

5.2.3Tasks......

5.2.4Threads......

5.2.5Backend......

5.2.6Execution......

5.3SCRAMNet+ Backends......

5.3.1Memory Model......

5.3.2User Events......

5.3.3Write-Me-Last Backend......

5.3.4Interrupt Backend......

5.3.5Polling Backend......

5.4Simulation Parameters......

5.4.1Transit Time......

5.4.2Access Times......

5.4.3Context Switch Time......

5.5Conclusions and Future Work......

6Summary and Conclusions......

Appendix A......

A.1SCRAMNet+ Driver......

A.2SCRAMNET+ API......

A.2.1scr_mem_mm......

A.2.2get_base_mem......

A.2.3scr_csr_read......

A.2.4scr_csr_write......

A.2.5scr_id_mm......

A.2.6scr_acr_read......

A.2.7scr_acr_write......

Appendix B......

B.1Syntax......

B.1.1Augmint Parameters......

B.1.2Backend Parameters......

B.1.3Simulation Parameters......

B.2Experiments......

Bibliography......

1

List of Tables

Table 1 Comparison of average execution times for pair of acquire/relaease operations when the maximum number of nodes equals 256 (s)

Table 2 Average execution time for a read operation (s)......

Table 3 Average execution time for CAS operation (s)......

Table 4 Average execution time for pair of enqueue/dequeue operations for lock-free construction of large objects (s)

Table 5 Average execution time for pair of enqueue/dequeue operations for wait-free construction of large objects (s)

Table 6 Simulator executable directories......

Table 7 Scripts to run simulation experiments......

List of Figures

Figure 1 Systran’s mutual exclusion algorithm......

Figure 2 Our mutual exclusion algorithm......

Figure 3 Comparison of ME algorithms without contention on a real system......

Figure 4 Close-up comparison of ME algorithms without contention on a real system......

Figure 5 Timing sequence of our algorithm’s acquire procedure without contention......

Figure 6 Comparison of ME algorithms with contention on a real system......

Figure 7 Close-up comparison of ME algorithms with contention on a real system......

Figure 8 Comparison of ME algorithms without contention on a simulated system......

Figure 9 Close-up comparison of ME algorithms without contention on a simulated system......

Figure 10 Comparison of ME algorithms with contention on a simulated system......

Figure 11 Close-up comparison of ME algorithms with contention on a simulated system......

Figure 12 Comparison of polling and interrupt versions without contention on a
simulated system......

Figure 13 Comparison of polling and interrupt versions with contention on a
simulated system......

Figure 14 Comparison of all ME algorithms under heavy contention on a simulated system......

Figure 15 Close-up comparison of all ME algorithms under heavy contention on a
simulated system......

Figure 16 Comparison of average execution times for each node under heavy contention......

Figure 17 Semantics of compare and swap......

Figure 18 Compare and swap algorithm......

Figure 19 Timing diagram of our algorithm’s acquire procedure without contention......

1

1Introduction

1

1

This thesis presents and evaluates synchronization mechanisms for SCRAMNet+ (Shared Common Random Access Memory Network) systems. SCRAMNet+ is a communications network geared toward real-time applications, and based on a replicated shared-memory concept [12]. By combining the advantages of shared-memory multi-processors and message passing systems, SCRAMNet+ offers distributed shared-memory with reliable, deterministic and low-latency updates. Thus, SCRAMNet+ has proven to be ideal for many real-time applications [19].

SCRAMNet+ systems offer the benefits of shared-memory multiprocessor, namely ease of programming, low-latency communications, and little or no software overhead for communications [4]. A SCRAMNet+ network consists of up to 256 computers (nodes) each with a SCRAMNet+ network card. The network cards are interconnected through fiber-optic cables in a serial-ring topology. Each network card has dual-ported RAM (Random Access Memory) that can be mapped into the address space of any process on a node. Any write to the dual-ported RAM is transparently replicated to each node, and hence every process, in the network.

In addition to providing a shared-memory abstraction to applications, SCRAMNet+ systems also have the advantages of a message passing system. First, processors can be connected at distances of hundreds or even thousands of meters [4]. In contrast, a typical multiprocessor system is limited to only a few meters. SCRAMNet+ networks can also connect machines with different architectures or operating systems. This might be an advantage, for example, in an industrial control system where the data acquisition and control are run on distributed embedded processors and the graphical interface runs on standard PCs.

Typical distributed systems, such as industrial control systems, require concurrent access to shared data. Usually some synchronization is required to protect the consistency of the data. The most common method used are mutual exclusion algorithms, which protect shared data through the access to a critical section. The semantics of mutual exclusion prevent more that one process from entering a critical section at time, therefore limiting access to the data. The manufacturer, Systran Corp., presents a mutual exclusion algorithm for SCRAMNet+ memory systems in [15]. However, this algorithm has several shortcomings.

  • Its performance is drastically affected by the number of nodes in the system;
  • The solution is not starvation free. It is theoretically possible for one process to repeatedly attempt to acquire the lock but never succeed; and
  • The solution necessarily prioritizes the processes, but does not make any concrete guarantees. Furthermore, the prioritization mechanism is unavoidable and leads to starvation of lower priority nodes.

In this thesis we present our own mutual exclusion algorithm for SCRAMNet+ systems. This algorithm exploits special hardware features of the SCRAMNet+ network and is both fair and starvation free. We compared the two algorithms using both real-system experiments and simulations that compute the average execution time for a pair of acquire/release operations. The results demonstrate that our algorithm has faster execution times both with and without contention, regardless of the network’s size. Although both algorithms are sufficient for synchronization, when one process enters the critical section, any other process desiring access to the shared data must wait indefinitely for that process to exit the critical section.

Recently, significant progress has been made toward efficient lock-free and wait-free implementation of shared objects (e.g. [2, 3, 6, 7, 8, 9]). A shared object is a shared data structure and associated operations. A lock-free implementation of a shared object guarantees that after a finite number of steps of a process p’s operation, some process (not necessarily p) completes an operation on the object. A wait-free implementation guarantees that each operation of a process p completes after a finite number of p’s steps. The result is fault tolerance, meaning some process (lock-free) or the actual process (wait-free) will continue to progress, regardless of the failure of any other process. A mutual exclusion algorithm cannot be either lock-free or wait-free because if a process never exits the critical section, no other process can continue.

In [6] Herlihy defines universal objects that can construct any wait-free object. He assigns a consensus number to each object, where a consensus number of n can implement any wait-free object for up to n processes. Herlihy also proved that CAS (Compare and Swap) is universal and has a consensus number of infinity. Therefore, CAS is an important primitive to implement in a shared-memory system that requires wait-free objects. Given this, we have implemented and evaluated a CAS algorithm for SCRAMNet+ systems. By conducting a simple experiment that used a CAS to increment a shared counter concurrently, we validated the correctness of this algorithm. We also compared experiments with and without contention, and found that our algorithm performs well under contention. However, the contention experiments were only run with two nodes and therefore further testing is needed. Now that we have created an effective CAS primitive for SCRAMNet+ systems, we can construct wait-free objects for such systems.

Herlihy extended his work in [6] by suggesting lock-free and wait free constructions for large shared objects. However, the implementation is inefficient due to the large amount of data being copied – especially when much of the copying may be unnecessary. In [2] Anderson and Moir present a more efficient implementation of lock-free and wait-free constructions for large shared objects. In [5], Filachek furthers their work by implementing and testing their algorithms in simulations. We have furthered the study by porting and testing the algorithms to a SCRAMNet+ system. Our main objective was to validate the operation of the algorithms. We accomplished this task by testing concurrent access to lock-free and wait-free implementations of a queue on an actual system and verifying the consistency of the queue.

The original evaluation of all our algorithms was performed on an actual system consisting of two 266 MHz Pentium II PCs running the RT-Mach operating system. Each PC was equipped with a SCRAMNet+ network card with 2MB RAM interconnected with single-mode fiber optic cables. However, due to the availability and cost of the hardware, we were only able to construct a system with two nodes. This was sufficient for testing the algorithms without contention, but provided little insight to situations with many nodes and heavy contention. Therefore, we designed SCRAMNet+ simulators using Augmint.

Augmint is a fast, execution-driven multiprocessor simulator for Intel x86 architectures [16]. Augmint allows the modification of a library called the backend to implement various memory models. We created three different backend libraries to model different configurations of the SCRAMNet+ system, and then duplicated the original experiments for mutual exclusion in order to compare the simulations to our real world results. The comparison verified the accuracy of our simulators allowing us to continue the simulations with confidence in the results. We then used the simulators to evaluate the mutual exclusion algorithms under heavy contention. The results of these experiments show that Systran’s algorithm fails to guarantee its prioritization scheme. It also shows that by modifying our algorithm to use polling techniques versus interrupts, the resulting algorithm will outperform Systran’s algorithm with heavy contention regardless of the number of nodes in the network.

The remainder of this thesis is organized as follows. We provide an overview of the SCRAMNet+ hardware in Section 2. Section 3 covers blocking synchronization methods for SCRAMNet+ systems. It contains a detailed description of Systran’s and our mutual exclusion algorithms and an analysis of the experiments performed on the real system and on simulations. Section 4 covers non-blocking synchronization methods for SCRAMNet+ systems. It describes a CAS algorithm and analyzes the results of real world experiments. It then presents results of experiments for lock-free and wait-free objects implemented with the CAS algorithm. Section 5 contains an overview of Augmint and a full description of the simulation implementation. In Section 6, we summarize the overall results and conclusions.

2SCRAMNet+ Hardware

1

SCRAMNet+ cards have many configurable features. This section describes the features of interest to this thesis. For more detailed information or a complete listing and explanation of all features, see [12]. To understand the algorithms in this thesis it is first necessary to understand how the SCRAMNet+ network operates.

A SCRAMNet+ node updates the shared-memory on all other nodes by inserting a message on the ring for every write to shared-memory. The message contains the memory offset and value of the word written. When the message is received by another node, the write is replicated by writing the same value to its memory. When the originating node receives its own message, the message is removed from the ring.

Although SCRAMNet+ uses a ring topology, it is essentially a point-to-point network in a ring orientation. That is, a message must be received and retransmitted by each intermediate node to traverse the ring. This introduces a minimum delay of 247 nanoseconds at each node [12]. For our experiments we used the fixed size packet configuration, which according to [12] has a maximum delay of 800 nanoseconds at each node. Therefore, our two-node system should have a round-trip transit time between 494 and 1600 nanoseconds.

2.1General Purpose Counter / Global Timer

SCRAMNet+ cards provide a General Purpose Counter / Global Timer that can measure the round-trip transit time of a message with a resolution of 26.66 nanoseconds. Using this timer the transit time on our two-node network was measured as 1270 nanoseconds, which is within the expected range.

2.2Error Correction

The SCRAMNet+ network has a bit error rate of 10-15, meaning that an error might occur once every 76 days of continuos, 24 hour, 100% bandwidth-saturated network utilization [20]. Although rare, these errors must still be handled. We configured the SCRAMNet+ card in PLATINUM mode to detect and handle any errors. PLATINUM mode can detect and correct two types of errors. First, bit errors are detected with a bit-by-bit comparison of the message once it has returned back to the originating node. Second, a configurable time-out can detect the loss of any originated message. If either type of error occurs, they are corrected by automatically re-transmitting the original message until it is received correctly. Also, once an error has been detected, any new messages from that node are stored in a transmit FIFO and not sent until the message that was in error is received correctly. Therefore, PLATINUM mode guarantees that every message is eventually delivered correctly. SCRAMNet+ cards can also be configured to generate an interrupt on the host whenever an error occurs. We used this interrupt in all of our experiments to generate an error message, however it never occurred.

2.3Interrupts

In addition to interrupting on errors, the SCRAMNet+ network cards can be configured to generate an interrupt whenever a given 32-bit memory word is written. Each 32-bit word in SCRAMNet+ memory has an associated ACR (Auxiliary Control RAM) location that is used to configure this feature. Each ACR can be configured to send interrupts, receive interrupts or both. Although the memory of the cards is replicated on every node, the ACRs are not. Therefore, the interrupt configuration for each word can be different on every node.

Whenever a node writes a 32-bit memory word, the ACR for that word on that node is checked. If it is configured to send interrupts, an interrupt message is generated containing the memory offset of the word written. Whenever a node receives an interrupt message, the ACR for the word written is also checked. If the ACR is configured to receive interrupts, the memory offset for that word is stored in a FIFO (First-In, First-Out data buffer) for the ISR (Interrupt Service Routine) to interrogate. The first entry into the FIFO generates an interrupt on the host and disables the interrupt hardware until re-enabled by the ISR. Any subsequent interrupt messages are inserted in the FIFO without generating an interrupt. The ISR then continually processes the interrupt FIFO until it is empty. This allows the ISR to process multiple interrupts with only one context switch. Once the ISR detects the FIFO is empty, it re-enables the interrupt and exits.