Remote Procedure Mechanism on eCos

Project Report

CSE-237B, Fall 2004

Abhijit Sachdev

Yenny Rusli

Table of Contents

1. Introduction
2. Background
   2.1 RPC
   2.2 eCos
3. Project Goals
4. System Design
5. Implementation
   5.1 Transport (TCP vs. UDP)
   5.2 Single Threaded
   5.3 Stub and Skeleton Generation
   5.4 Implementation Environment
       Platform Description
       Compiler & Toolchain
       Host Platform & Synthetic Target
6. Results & Discussion
   6.1 Micro-Benchmarks
   6.2 TCP vs. RPM Performance Benchmarks
7. Conclusions & Future Work
8. References
Appendix A. RedBoot Bootloader
Appendix B. eCos Kernel & Clock
   Kernel
   Kernel Startup
   Clock

1. Introduction

Current trends in computing include increases in both distribution and wireless connectivity, leading to highly dynamic, complex environments on top of which applications must be built. The task of designing and ensuring the correctness of applications in these environments is becoming more and more complex. Building a distributed application over embedded devices is hard enough; building it with goals of reliability, scalability, and efficiency within budget and timing constraints is even harder. Add to this the inherent complexity of using low-level and error-prone networking APIs.

The goal of much of the research in distributed embedded systems is to provide higher-level abstractions of complex low-level concepts to application programmers, easing the design and implementation of applications. A new and growing class of applications for wireless sensor networks requires similar complexity encapsulation. Over the last decade, several middleware implementations have been used in traditional systems to bridge this gap between the OS (a low-level component) and the application, easing the development of distributed applications. While middleware is very useful in traditional systems, DCOM, CORBA, and other conventional distributed middleware are heavyweight in nature and usually not appropriate for embedded applications with limited resources (energy, computing power, memory, communication bandwidth, etc.), such as those of sensor networks.

In this project, we propose a simple implementation of remote procedure calls for networked embedded devices (such as wireless sensor networks). We characterize our implementation of the remote procedure mechanism using eCos on an iPAQ and show its effectiveness compared with a simple application implemented directly over the network transport layer.

2. Background

2.1 RPC

We propose an RPC-like mechanism (RPM) due to its simplicity in invoking remote procedures. With RPC, procedures can reside on multiple machines and communicate with each other in a location-transparent manner.

In general, RPC makes distributed computing easier to implement by providing a higher level of abstraction than raw TCP/UDP streams. RPC lets programmers focus on the core complexity of distributed computing rather than networking issues such as connection establishment, connection management, multiplexing, dispatching, and marshaling, to name a few.

RPC provides client applications with transparent access to a server. Services that the server provides to the client could include computational services, access to a file system, access to part of an operating system in a distributed operating system, or most other functions used in a typical client/server framework. RPC was designed at Sun Microsystems, where widely used applications such as the Network File System (NFS) were built on top of an RPC library.

The crucial requirement of an RPC system is that it provide a client with reliable, transparent access to a server. A remote procedure call to a server looks the same to the client application as a local procedure call. This is a very powerful concept: the application programmer does not need to be aware that a remote computer is being accessed when executing an RPC function.

2.2 eCos

Our choice of RTOS for the design and implementation of a simple RPC mechanism is eCos, an open-source real-time operating system developed by Red Hat and the user community. Like other conventional operating systems, it seeks to reduce the burden of application development by providing convenient abstractions of physical devices and highly tuned implementations of common functions.

eCos has been designed so that systems with a small resource footprint can be constructed. It is extremely configurable and allows developers to select the components that satisfy basic application needs and to configure the OS for the specific implementation requirements of the application. eCos uses compile-time control methods, along with selective linking (provided by the GNU linker), to give the developer control over its behavior, allowing the implementation itself to be built for the specific application for which it is intended.

The prototype platform used in this implementation is the iPAQ 3650, to which eCos has already been ported.

Thus, the small footprint, configurability, and portability of eCos make it the RTOS of choice for our implementation.

3. Project Goals

The goal of this project is to implement a simple remote procedure mechanism for networked embedded devices. As a proof of concept, we prototype our implementation on an iPAQ, using eCos as the RTOS. The prototype implementation addresses the following issues:

  • Show that this implementation has a reasonable footprint size
  • Characterize the performance impact of using the remote procedure mechanism and show its effectiveness
  • Maintain complete transparency for the client application
  • Provide a proof of concept and motivation for future work on more complete implementations for embedded devices

4. System Design

Figure 1 illustrates the architecture of our Remote Procedure Mechanism for networked applications. This architecture is based upon the Sun RPC design.

Figure 1 – Remote Procedure Mechanism Architecture

The RPM generator is a logical component that generates stubs and skeletons for the server function. For the scope of this project, however, the stubs and skeletons are hand-crafted (not automatically generated) due to time and resource constraints. Ideally, this component would generate them automatically.

The Stub has the following responsibilities:

  • Provide location transparency to the client
  • Marshaling of the function arguments
  • Unmarshaling the return value of the function obtained from the skeleton
  • Request multiplexing on the transport layer

And the Skeleton provides the following functionality:

  • Unmarshaling the function arguments
  • Dispatching the request to the corresponding server function
  • Marshaling the response back to the client

Figure 2 illustrates the flow of an RPM call. In the first step, the client application makes a procedure call exactly as it would make any local procedure call. The call is passed to the stub, which flattens the arguments, packs them with additional information needed by the server, and passes this entire packet to the client network interface. It is important to note that the client application need not know that this stub exists. The stub provides the illusion of a local procedure call; as far as the client is concerned, the stub is actually executing the procedure call.

Figure 2 – Remote Procedure Mechanism Call Flow

The network interface then executes a protocol to reliably transfer the stub-generated packet to the appropriate server. In step 4, the server's network interface calls the server skeleton, which unpacks the received request and executes the remote procedure on the server. The returned data is then passed back to the server skeleton, which packages it and has the server network interface transmit it back to the client network interface (step 8). The client network interface then passes the packet to the client stub, which is responsible for unpacking the received packet and returning the desired data to the client application.
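
To make the call flow concrete, the following sketch shows what a hand-crafted client stub might look like for the cube() server function used in our experiments (Section 6.2). The transport helper rpm_send_request() and the operation id RPM_OP_CUBE are illustrative assumptions, not part of the actual implementation.

    /* Hypothetical client stub for the remote cube() function. */
    #include <stdint.h>

    #define RPM_OP_CUBE 1           /* illustrative operation id */

    /* Illustrative transport helper: marshals a request, blocks for
     * the reply, and unmarshals the result into 'reply'. */
    extern int rpm_send_request(uint8_t op,
                                const void *args, uint32_t arglen,
                                void *reply, uint32_t replylen);

    /* To the client this looks exactly like a local procedure call. */
    int cube(int x)
    {
        int32_t arg = (int32_t)x;   /* marshal: flatten the argument */
        int32_t ret = 0;

        rpm_send_request(RPM_OP_CUBE, &arg, sizeof(arg),
                         &ret, sizeof(ret));
        return (int)ret;            /* unmarshal the return value */
    }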

5. Implementation

5.1 Transport (TCP vs. UDP)

The most important requirement of the RPM network interfaces is that they provide reliable operation. When RPM is implemented on UDP (an unreliable, connectionless protocol), this becomes quite challenging: UDP can re-order, drop, duplicate, and/or corrupt packets, and it is the job of the RPC layer to detect and correct such errors.

Given the timing and resource constraints of this project, it was decided to implement RPM on top of TCP rather than UDP. TCP is a reliable protocol, and its use as the underlying transport ensures that RPC does not need to detect or correct the errors mentioned above.

However, it should be noted that a carefully designed implementation of RPM on top of UDP could avoid much of the overhead associated with network operations that use TCP.
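
For reference, the client-side transport amounts to little more than opening a blocking TCP connection. The sketch below assumes the eCos BSD-compatible socket API (network package enabled); the function name rpm_connect() and the port number are illustrative.

    /* Minimal sketch of the client-side TCP transport on eCos. */
    #include <network.h>            /* eCos BSD-style sockets */
    #include <unistd.h>
    #include <string.h>

    #define RPM_SERVER_PORT 5000    /* illustrative port number */

    int rpm_connect(const char *server_addr)
    {
        struct sockaddr_in sa;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        memset(&sa, 0, sizeof(sa));
        sa.sin_family      = AF_INET;
        sa.sin_port        = htons(RPM_SERVER_PORT);
        sa.sin_addr.s_addr = inet_addr(server_addr);

        /* TCP provides reliability and ordering, so the RPM layer
         * need not detect or correct transport errors itself. */
        if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }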

5.2 Single Threaded

The second major decision was whether the RPM implementation should be multi-threaded. On the client side, a single thread is clearly sufficient, since the client should block until the RPM call returns. Things are not as clear on the server side. Here it would be nice to have one server thread that reads all the data from the network, while another thread examines the incoming data and spawns a new thread depending on the action to be performed. This would be the thread-per-request model; a thread-per-connection model is similarly possible.

Such a multi-threaded server would clearly be more efficient than a single-threaded implementation, because the functions the server performs are mostly blocking instructions or system calls with much idle time (due to I/O delays). A uni-processor, and especially a multi-processor server, would thus benefit from a multi-threaded implementation. However, we chose to make the server a single-threaded process because it makes the design significantly simpler: it is not clear to us whether, or how, a single socket used to communicate with clients can be shared among several threads, and other issues, such as how server state is shared among the threads, would also need to be studied. A sketch of the resulting server loop follows.
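
The sketch below shows the shape of such a single-threaded server loop. The helper rpm_serve_one(), which reads one request, dispatches it to the skeleton, and writes the response, is an illustrative assumption.

    /* Sketch of the single-threaded RPM server loop. Each connection
     * is served to completion before the next one is accepted. */
    #include <network.h>
    #include <unistd.h>

    /* Illustrative: handles one request-response on 'fd'; returns
     * non-zero when the client closes the connection. */
    extern int rpm_serve_one(int fd);

    void rpm_server_loop(int listen_fd)
    {
        for (;;) {
            int client = accept(listen_fd, NULL, NULL);
            if (client < 0)
                continue;

            /* No other client can be served in the meantime. */
            while (rpm_serve_one(client) == 0)
                ;
            close(client);
        }
    }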

5.3 Stub and Skeleton Generation

For this project, the stubs were hand-crafted, i.e., written manually. Automatic stub generation is an interesting and complex research area and beyond the scope of this project.

5.4 Implementation Environment

Platform Description

The prototype platform used in the implementation was the iPAQ 3650. The iPAQ has an Intel StrongARM 1110 rev 8 32-bit RISC processor running at 206 MHz, with 16 MB of flash memory and 32 MB of RAM. The SA-1110 uses only two crystals, 32.768 kHz and 3.6864 MHz, to generate all the frequencies needed. The hardware clock thus has a granularity of about 271 ns.

For detailed hardware specifications of the iPAQ 3600 series, refer to [8].

Compiler & Toolchain

The iPAQ runs the RedBoot bootloader, which is used to load eCos applications. The compiler used is the GCC cross-compiler arm-elf-gcc 3.2.1. The eCos version used is 2.1 beta (downloaded from the CVS repository, dated 03/27/2003). The options passed to the compiler are "-O2 -static". The clock generates an interrupt every 10 ms; this is referred to as the eCos kernel tick. The eCos scheduler uses a timeslice of 50 ms.

Host Platform & Synthetic Target

Much of the behavior of the target can be emulated on a typical PC running Linux. Instead of running the embedded application being developed on a target board, it can be run as a Linux process: the processor is the PC's own x86 processor, and the memory is the process's address space. Some I/O facilities are emulated directly through system calls. For example, clock hardware is emulated by setting up a SIGALRM signal, which causes the process to be interrupted at regular intervals. This emulation of real hardware is not accurate; the number of CPU cycles available to the eCos application between clock ticks varies widely depending on what else is running on the PC, but for much development work it is good enough [7].

The host platform is an x86 PC running Red Hat 9 with a 2.4.20 kernel. This is also the synthetic target.

6. Results & Discussion

This section describes various micro-benchmarking tests as well as performance numbers for the RPM implementation. The micro-benchmarks provide a baseline for comparison with other RTOSes. The following metrics were used:

  • Latency: End-to-end latency is defined as the average delay seen by a client from the time it makes a request to the time it completely receives the response (or return value). High latency directly affects the system by degrading communication
  • Variance: Variance of the latency is its deviation across a series of requests. Large variance complicates the computation of the WCET and increases the unpredictability of the system
  • Footprint Measurement: Stringent constraints on the available memory in embedded systems impose a severe limit on the footprint of RPM applications. It is imperative to keep the stubs & skeletons small; hence the need to determine the "on-target" size of RPM

All result data is sampled over 1000 iterations and then averaged to obtain the average latency. The min and max values are also shown in the figures below. A sketch of the measurement harness follows.
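
As an illustration, the measurement harness is essentially the following loop. This is a sketch only: now_hwticks() is an assumed timing helper (one possible implementation is sketched in Section 6.1), and cube() is the RPM stub under test.

    /* Sketch of the benchmarking loop: 1000 iterations, tracking the
     * min, max, and total latency in hardware clock ticks (~271 ns). */
    #include <cyg/infra/cyg_type.h>

    extern cyg_uint64 now_hwticks(void);  /* assumed timing helper */
    extern int cube(int x);               /* RPM stub under test   */

    void rpm_benchmark(void)
    {
        cyg_uint64 total = 0, min = (cyg_uint64)-1, max = 0;
        cyg_uint64 t0, dt, avg;
        int i;

        for (i = 0; i < 1000; i++) {
            t0 = now_hwticks();
            (void)cube(i);                /* one request-response  */
            dt = now_hwticks() - t0;

            total += dt;
            if (dt < min) min = dt;
            if (dt > max) max = dt;
        }
        avg = total / 1000;               /* average latency       */
        (void)avg;                        /* reporting elided here */
    }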

6.1 Micro-Benchmarks

Figure 3 – Micro-benchmarks

Figure 3 illustrates the various micro-benchmarks. The HW clock latency is 1 tick, which implies that it takes at most about one tick of the hardware clock to read the hardware clock itself, while the latency of reading the system clock is about 6.21 microseconds. Timing measurements for all RPM tests were performed using both the system clock (with a granularity of 10 ms) and the hardware clock (with a granularity of ~271 ns). Note that the hardware register used to read the hardware clock wraps every 36864 hardware ticks, i.e., every 10 ms.
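
A fine-grained timestamp can be obtained by combining the kernel tick count with the hardware clock register, as in the sketch below. It assumes the standard eCos HAL_CLOCK_READ() macro, which on this platform yields the number of 3.6864 MHz hardware ticks (~271 ns each) elapsed since the last 10 ms kernel tick.

    /* Sketch: elapsed time since boot, in hardware clock ticks. */
    #include <cyg/kernel/kapi.h>
    #include <cyg/hal/hal_intr.h>     /* HAL_CLOCK_READ() */

    cyg_uint64 now_hwticks(void)
    {
        cyg_tick_count_t ticks;
        cyg_uint32 hw;

        ticks = cyg_current_time();   /* 10 ms kernel ticks     */
        HAL_CLOCK_READ(&hw);          /* 0..36863 within a tick */

        /* Note: a kernel tick may occur between the two reads; a
         * robust version would re-read and reconcile the values. */
        return (cyg_uint64)ticks * 36864 + hw;
    }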

6.2 TCP vs. RPM Performance Benchmarks

These experiments measure the latency of using RPM stubs & skeletons and compare the results with raw TCP. The TCP latency numbers form the baseline against which our RPM implementation is compared.

For these experiments, a simple server function was implemented, which takes an integer and returns its cube. The TCP client and server are each implemented as a single thread with no locks; the two threads run within one process and communicate over the loopback device.

The RPM stubs and skeletons in this case use a 17-byte header, as shown in Figure 4.

Figure 4 -- Remote Procedure Mechanism Headers

Note that in this case the payload is 4 bytes: the integer argument to the server function.
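
For illustration, a 17-byte header of this kind can be declared as a packed C struct. The field names and breakdown below are hypothetical (the actual layout is the one shown in Figure 4); only the overall size and the 13-byte prefix carrying the remaining length, discussed below, come from the text.

    /* Hypothetical packing of the 17-byte RPM header; the real field
     * layout is the one shown in Figure 4. The first 13 bytes form
     * the prefix that is read first and carry, among other things,
     * the length of the rest of the message. */
    #include <stdint.h>

    struct rpm_header {
        uint8_t  version;      /* illustrative                        */
        uint32_t request_id;   /* illustrative: pairs reply with call */
        uint32_t op;           /* illustrative: server function id    */
        uint32_t length;       /* bytes remaining after the prefix    */
        uint32_t reserved;     /* illustrative: bytes 14..17          */
    } __attribute__((packed)); /* 1+4+4+4 = 13-byte prefix, +4 = 17   */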

Several ways to optimize the latency of the RPM stubs and skeletons were considered:

  • Minimizing data copying: Multiple data copies tend to degrade the performance of a networking application. In our case, the RPM stubs and skeletons are carefully designed to eliminate unnecessary data copying.
  • No dynamic allocation: With a single-threaded client and server, the stubs and skeletons can be designed to eliminate any need for dynamic memory allocation. Thus, all request and response buffers are allocated in static storage in the stubs and skeletons.
  • Single read optimization: The skeleton typically needs two reads on the network socket to read a client request: the first read gets the 13-byte prefix, which gives the size of the remaining payload. This can be optimized by always reading the maximum number of bytes (e.g., one Ethernet MTU) in a single read and then parsing the first 13 bytes to determine the payload size as before (see the sketch following this list).
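
A sketch of the single read optimization follows. It assumes the illustrative prefix layout above, with the remaining length in the prefix's last four bytes, and a receive buffer sized to one Ethernet MTU; all names are hypothetical.

    /* Sketch of the single read optimization: one large recv() will
     * usually return the header and payload together, so a second
     * read is needed only when the message arrives fragmented. */
    #include <stdint.h>
    #include <string.h>
    #include <network.h>

    #define RPM_MAX_MSG 1500              /* one Ethernet MTU */

    static uint8_t rxbuf[RPM_MAX_MSG];    /* static: no dynamic allocation */

    int rpm_read_request(int fd, uint32_t *msg_len)
    {
        uint32_t rest;
        int got = 0, n;

        /* Read at least the 13-byte prefix (often the whole message). */
        while (got < 13) {
            n = recv(fd, rxbuf + got, sizeof(rxbuf) - got, 0);
            if (n <= 0)
                return -1;
            got += n;
        }

        /* Hypothetical: remaining length sits in prefix bytes 9..12;
         * memcpy avoids an unaligned access on ARM. */
        memcpy(&rest, rxbuf + 9, sizeof(rest));
        if (rest > sizeof(rxbuf) - 13)
            return -1;

        /* Read whatever part of the message has not yet arrived. */
        while ((uint32_t)got < 13 + rest) {
            n = recv(fd, rxbuf + got, sizeof(rxbuf) - got, 0);
            if (n <= 0)
                return -1;
            got += n;
        }
        *msg_len = 13 + rest;
        return 0;
    }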

Figure 5 – Remote Procedure Mechanism Overhead

Figure 5 shows the latency of a TCP roundtrip versus the RPM request-response mechanism over the local loopback. The client and server each run in their own thread within the same process and communicate over the local loopback. Note that the TCP payload for the RPM stubs/skeletons is 17 bytes as described earlier; hence, the RPM latency should be compared against the latency of a TCP request-response with a 17-byte payload (the second item in the bar graph). The overhead of using RPM is about 7.3% over a similar TCP implementation. Note that implementing the single read optimization further reduces this overhead to just about 3.4%. This is a welcome result and suggests that a well-designed RPC implementation (with just the right set of features) is extremely lightweight over its transport.

It was also observed that re-ordering the RPM header fields did not lead to any substantial latency improvement.

Figure 6 shows the footprint measurements for the above-mentioned cases. As is evident from the figure, the footprint "bloat" introduced by our customized RPM over a regular TCP application is minimal.

Figure 6 – Footprint Measurements

The memory footprint was computed using the GNU size utility (arm-elf-size), which reports the total size as well as the segment sizes (text, data & BSS) of the application.

7. Conclusions & Future Work

The results in the previous section show that the latency of our RPM implementation is only minimally higher than that of TCP. The following observations are noteworthy:

  • The latency overhead of using RPM over TCP is about 7.3% without the single read optimization. With this optimization, the overhead reduces to a meager 3.4%.
  • The increase in footprint of RPM vs. TCP is minimal

This suggests that a well-designed and well-implemented Remote Procedure Mechanism with just the right number of features has minimal performance and footprint impact for embedded network applications. This is a welcome result for embedded and sensor network systems, since developers have traditionally held the notion that an RPC mechanism is not apt for embedded devices. Our conclusions suggest instead that a flexible RPC mechanism allowing compile-time feature selection (by #define statements, as sketched below) can be optimized for the application at hand. This has the added benefit of providing a high-level API to application developers, leveraging previous work and reducing project risk, which in turn helps keep projects on time and under budget.
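
As a closing illustration, such compile-time feature selection might look like the following, in the spirit of eCos's own configuration style; all option names and helper functions here are hypothetical.

    /* Sketch of #define-based feature selection for an RPM library. */
    #include <stdint.h>

    #define RPMCFG_SINGLE_READ_OPT 1   /* fold prefix+payload reads */

    /* Illustrative request readers for the two strategies. */
    extern int rpm_read_request_single(int fd, uint32_t *len);
    extern int rpm_read_request_two(int fd, uint32_t *len);

    int rpm_read_request_cfg(int fd, uint32_t *len)
    {
    #if RPMCFG_SINGLE_READ_OPT
        return rpm_read_request_single(fd, len);  /* one large recv()     */
    #else
        return rpm_read_request_two(fd, len);     /* 13-byte prefix first */
    #endif
    }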