Single Threaded Applications Performance on Symmetric Multiprocessing Cpus

CCSE TECHNICAL REPORT KFUPM-CCSE-91-##/ICS

SINGLE THREADED APPLICATIONS PERFORMANCE ON SYMMETRIC MULTIPROCESSING CPUS

Khalid Salah, Raed AlShaikh and Abdulaziz M. Al-Dharrab

Department of Computer Science

COLLEGE OF COMPUTER SCIENCE & ENGINEERING

King Fahd University of Petroleum and Minerals

Dhahran 31261, Saudi Arabia

Email: {salah, g199607190, g200145130}@kfupm.edu.sa

Abstract

In today's computing environment, commercial off-the-shelf small-scale SMP (symmetric multiprocessing) CPUs are becoming the standard on desktops. It is therefore important to study how SMP machines perform when running sequential legacy applications designed for UP (uniprocessor) systems. In this paper, we conduct two experiments using two benchmarks, the CPU-bound simplex benchmark and UNIX-bench, under UP and SMP configurations. We show that a single-threaded application may run slightly faster under UP than under SMP. Nevertheless, we also demonstrate that SMP configurations benefit significantly from the system's implementation of parallel processing and thus outperform UP systems under today's multiprocessing operating systems, and we offer several interpretations of these results.

1. Introduction

Major processor manufacturers today, such as Intel and AMD, keep pace with Moore's law by increasing the number of cores per chip. Symmetric multiprocessing (SMP) is a multiprocessor computer architecture in which two or more identical processors connect to a single shared main memory. In multi-core processors, the SMP architecture applies to the cores, which are treated as separate processors.

While the workstation market is moving from single processor to small-scale shared memory multi-core processors, many legacy applications are still single threaded. Therefore, it is essential to examine the performance of the SMP machines when running sequential or legacy applications.

Given the nature of single-threaded applications, the common hypothesis is that such an application uses a single core when running on an SMP machine and leaves the remaining cores essentially unused. As a result, running on a multi-core system brings no speed benefit. On the contrary, the application may run slightly slower on SMP because the system spends some resources coordinating multi-CPU activity, such as scheduling across cores and handling CPU affinity (where the OS binds specific processes to specific CPUs). Further, extra OS processes are started under SMP specifically to manage the multi-core system, and these extra processes consume additional system resources.

To validate this hypothesis, we conducted two different experiments by running two types of benchmarks, namely the CPU-bound simplex benchmark and UNIX-bench, on two sets of SMP machines. For each benchmark, we compared the performance indices of the SMP runs with those of the UP (uniprocessor) runs. The results are discussed later in the paper.

2. Experimental Setup

Fedora 11 Linux with kernel 2.6.29.4 was used to conduct all the benchmark tests. Before running the tests in SMP and UP modes, the systems were prepared so that as little CPU and memory as possible was taken by other work, in order to obtain accurate results. The most important service to turn off is cpuspeed, a CPU speed control service that adjusts the CPU frequency dynamically based on the demand for processing power; leaving it on affects the benchmark results. All unneeded services were turned off, as shown in Figure 1.

Figure 1: All unneeded services are turned off

Also, the Linux run level [6] was set to 3 (i.e. running with no GUI) so that the GUI does not consume system resources. This was set in the /etc/inittab configuration file, as shown in Figure 2.

Figure 2: Run level 3 (Multi-User mode, console logins only)
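As an illustration of the run-level setting described above, the following is a minimal sketch that reads the default run level from an inittab-style file. It assumes the classic SysV line format id:<runlevel>:initdefault: and the helper name default_runlevel is ours, not part of the original setup.

```shell
#!/bin/sh
# Print the default run level from an inittab-style file.
# Assumption: the file uses the SysV format "id:<runlevel>:initdefault:".
default_runlevel() {
    # Skip comment lines; the third colon-separated field names the action.
    awk -F: '!/^#/ && $3 == "initdefault" { print $2; exit }' "$1"
}

# Usage on a SysV-init system (the path is the one named in the text).
if [ -r /etc/inittab ]; then
    echo "default run level: $(default_runlevel /etc/inittab)"
fi
```

On a system configured as described, the reported value would be 3; on systems without a SysV inittab the function prints nothing.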

3.Experimental Results

First, the CPU-bound sequential simplex program was used to benchmark the first SMP system, and the results were compared with those of the UP mode of the same system. The system configuration is as follows:

System configuration:
Intel Core 2 Duo @ 2.2 GHz (2 cores)
2 GB RAM
Running Fedora 11 with user-compiled 2.6.29.4 kernel

1. Simplex Benchmark

Simplex is a CPU-bound application. It implements the simplex method, a commonly used nonlinear numerical search algorithm for optimizing multi-dimensional unconstrained problems. In particular, we used a modified version of the downhill simplex method code [1]. The program is computationally heavy and performs no disk or network I/O. The execution time is measured in microseconds.

1.1 Running Simplex Benchmark on the SMP Environment

The first tests were conducted with the SMP option turned on. Specifically, the SMP feature was enabled in the .config file (generated by make menuconfig) when compiling the kernel. Figure 3 shows the modification to the configuration file.

Figure 3: SMP option is set in the .config file
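The check on the configuration file can be sketched as follows. CONFIG_SMP is the kernel's SMP option; the helper name kernel_mode and the path in the usage comment are illustrative assumptions, not taken from the original setup.

```shell
#!/bin/sh
# Report whether a kernel .config file enables SMP.
# An enabled option appears as "CONFIG_SMP=y"; a disabled one is left
# as the comment line "# CONFIG_SMP is not set".
kernel_mode() {
    if grep -q '^CONFIG_SMP=y' "$1"; then
        echo "SMP"
    else
        echo "UP"
    fi
}

# Usage (hypothetical path to the kernel source tree):
#   kernel_mode /usr/src/linux/.config
```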

After compilation, the uname -a command confirms that the kernel is indeed SMP-enabled, as shown in Figure 4.

Figure 4: Kernel Details (SMP mode)

To make sure that Fedora sees all cores, the cat /proc/cpuinfo command shows how many CPUs/cores are detected in the system. The /proc directory contains virtual directories and files that represent the current state of the kernel. Figure 5 shows this output.

Figure 5: CPU info in SMP mode
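A minimal sketch of the same check in script form is given below. It assumes the x86 layout of /proc/cpuinfo, where each logical CPU contributes one line beginning with "processor"; the helper name count_cpus is ours.

```shell
#!/bin/sh
# Count the logical CPUs listed in a cpuinfo-style file.
# Assumption: one line starting with "processor" per logical CPU (x86 layout).
count_cpus() {
    grep -c '^processor' "$1"
}

# Usage on a live Linux system.
if [ -r /proc/cpuinfo ]; then
    echo "logical CPUs: $(count_cpus /proc/cpuinfo)"
fi
```

Under the SMP kernel on the dual-core test machine this count would be 2, and under the UP kernel it would be 1.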

Finally, a simple script was created to run the simplex benchmark 50 times for each test input and take the average. The runs were repeated to obtain a more reliable result. Figure 6 shows our simple iterative script.

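#!/bin/bash
# Run the simplex benchmark 50 times and average the reported elapsed time (µs).
runs=50
sum=0
for i in $(seq 1 "$runs")
do
    # Field 8 of the output line containing "microse" holds the elapsed time.
    result=$(/work/simplex.out 20000 | grep microse | awk '{print $8}')
    sum=$((sum + result))
done
average=$((sum / runs))
echo "Average elapsed time: $average microseconds"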

Figure 6: Script used to run and collect the results
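Since the differences reported later are small, it can help to look at the spread of the individual runs and not only their mean. The following is a sketch of such a summary, assuming one elapsed-time value per line on standard input; the helper name summarize_times and the sample values are illustrative and are not measurements from this report.

```shell
#!/bin/sh
# Summarize elapsed times (one value in microseconds per line) read from stdin.
summarize_times() {
    awk '
        {
            v = $1 + 0                      # force numeric interpretation
            if (NR == 1) { min = v; max = v }
            sum += v
            if (v < min) min = v
            if (v > max) max = v
        }
        END {
            if (NR > 0)
                printf "runs=%d mean=%.0f min=%d max=%d\n", NR, sum / NR, min, max
        }
    '
}

# Example with three made-up samples.
printf '%s\n' 100 200 300 | summarize_times
```

In practice the per-run values printed by the benchmark loop could be piped into this function in place of the sample values.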

Table 1 shows the SMP runs with 10,000, 20,000, 30,000, 40,000 and 50,000 iterations; each was run 50 times and averaged. Figure 7 shows a sample of our runs.

No. of iterations / SMP Elapsed Time (µs)
10,000 / 14,921,090
20,000 / 29,844,342
30,000 / 44,429,388
40,000 / 59,240,104
50,000 / 74,889,757

Table 1: Simplex run results for SMP mode

Figure 7: Simplex results sample for SMP mode

1.2 Running Simplex Benchmark on the UP Environment

Similarly, the Linux kernel was recompiled with the UP configuration to run our benchmarks on a single processor. After compilation, the uname -a output confirms that it is a single-CPU kernel, as in Figure 8.

Figure 8: Kernel Details (UP mode)

And cat /proc/cpuinfo | grep processor indeed shows one CPU only, as in Figure 9.

Figure 9: CPU info in UP mode

As in the SMP run, we conducted the same iterations using the Simplex benchmark. Table 2 below shows the runs with 10,000, 20,000, 30,000, 40,000 and 50,000 iterations; each was run 50 times and averaged:

No. of iterations / UP Elapsed Time (µs)
10,000 / 14,748,610
20,000 / 29,492,310
30,000 / 44,329,012
40,000 / 58,993,596
50,000 / 73,129,463

Table 2: Simplex run results for UP mode

1.3 Results and Discussion of Simplex Performance

To compare the performance of SMP and UP using the Simplex benchmark, Figure 10 and Table 3 below show the difference between the UP and SMP runs in percentages.

Figure 10: Simplex run results (SMP VS UP Mode)

No. of iterations / SMP (µs) / UP (µs) / % difference
10,000 / 14,921,090 / 14,748,610 / 1.16%
20,000 / 29,844,342 / 29,492,310 / 1.19%
30,000 / 44,429,388 / 44,329,012 / 0.23%
40,000 / 59,240,104 / 58,993,596 / 0.4%
50,000 / 74,889,757 / 73,129,463 / 2.4%

Table 3: Simplex run results (SMP VS UP Mode)

From the figure and table above, we confirm that the difference between the UP and SMP runs is minimal (between 0.23% and 2.4% only), and that the UP kernel slightly outperforms the SMP kernel when running sequential applications.
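The percentage column can be reproduced with a short calculation. The sketch below assumes the difference is taken relative to the UP time, i.e. (SMP - UP) / UP x 100, which the table does not state explicitly; the helper name pct_diff is ours. With this baseline and two-decimal rounding, some rows may differ from the table in the last digit.

```shell
#!/bin/sh
# Relative slowdown of SMP versus UP, in percent.
# Assumption: UP is the baseline, i.e. (smp - up) / up * 100.
pct_diff() {
    awk -v smp="$1" -v up="$2" 'BEGIN { printf "%.2f\n", (smp - up) / up * 100 }'
}

# Elapsed times (µs) of the 30,000-iteration run, taken from Table 3.
pct_diff 44429388 44329012   # prints 0.23
```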

Consistent with the earlier hypothesis, the UP kernel slightly outperforms the SMP kernel for the sequential Simplex runs because, under SMP, the system spends some resources coordinating multi-CPU activity (e.g. SMP scheduling).

Furthermore, when the kernel runs with SMP support, the OS starts additional (i.e. duplicated) processes to handle the SMP mode. This was clearly shown by counting the number of running processes for both SMP and UP using the command ps -ef | wc -l. The count was 134 in SMP mode but only 119 in UP mode.

Obviously, running more processes in SMP consumes more system resources (CPU load, memory, etc.). Examples of these extra processes on SMP are shown in Table 4:

Process / Process explanation / UP / SMP
events / events is a per-CPU kernel thread for handling low-level requests that need to run asynchronously. All sorts of notifications run through it (driver-to-kernel messages, for example). / Only one instance, events/0 / events/0 and events/1, one process per core
watchdog / watchdog is a per-CPU kernel thread that detects soft lockups, i.e. a CPU stuck in kernel code without scheduling other tasks / Only one instance, watchdog/0 / watchdog/0 and watchdog/1, one process per core
ksoftirqd / ksoftirqd is a per-CPU kernel thread that runs when the machine is under heavy soft-interrupt load. / Only one instance, ksoftirqd/0 / ksoftirqd/0 and ksoftirqd/1, one process per core
aio / aio implements asynchronous I/O using the means available to Linux / Only one instance, aio/0 / aio/0 and aio/1, one process per core

Table 4: Extra processes on SMP
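The per-core duplicates can be picked out of a process listing with a simple filter. The sketch below assumes the name/N convention for per-CPU kernel threads seen in Table 4 and reads one process name per line; the helper name extra_percpu_threads and the sample input are illustrative, not a captured process list.

```shell
#!/bin/sh
# From a list of process names (one per line on stdin), keep the per-CPU
# kernel threads that belong to CPUs other than CPU 0 (names like "events/1").
extra_percpu_threads() {
    grep -E '^[^/]+/[1-9][0-9]*$' | sort
}

# Example input mimicking names from Table 4.
printf '%s\n' events/0 events/1 watchdog/0 watchdog/1 init | extra_percpu_threads
```

On a live system, the input could come from ps -eo comm, which prints only the command name of each process.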

2. UNIX Diagnostics Commands

In this section, we use other advanced Linux commands to gain deeper insight into the system behavior in both SMP and UP modes, and analyze the system and kernel internals while running the Simplex benchmark.

2.1 top command

During the SMP run, the system was monitored to confirm that the benchmark runs on one core only. This was done with the top command (pressing f for advanced options, then j to show the CPU column), as shown in Figure 11 below.

Figure 11: ‘Top’ command capture in SMP mode

The “P” column above, which is shown only when enabled through the advanced options, indicates on which core or CPU a process is running. As illustrated, the Simplex benchmark was running on CPU1 while some other programs were running on CPU0. The program was not observed to switch from core to core, so no time is wasted in migration.

When running the top command on the UP kernel, all processes clearly run on one CPU only, as illustrated in Figure 12.

Figure 12: ‘Top’ command capture in UP mode

The earlier SMP top figure shows around 50% aggregate CPU usage, meaning the other CPU is idle. In the UP version, the top command shows 99.7% aggregate CPU usage, since there is only 1 CPU and it is all taken by the Simplex benchmark.

2.2 strace command

Further, the strace -Tc /work/simplex.out 50000 command was run to show how many system calls this benchmark makes and how much time each system call takes. As indicated in Figure 13 below, the Simplex benchmark makes 34 system calls, and the execve system call takes 0.000171 seconds of system time. All other system calls take less than 1 microsecond (and are thus shown as 0).

This test gives the same result with 10,000 or 100,000 iterations, indicating that regardless of the number of iterations Simplex performs, the system call count stays the same (34), while 100% of the reported system-call time is attributed to the execve system call that starts the program.

Figure 13: Number of system calls the simplex test made

2.3 uptime command

During the Simplex benchmark run, the uptime command was run, as shown in Figure 14.

Figure 14: uptime during simplex benchmark

The figure above shows that the load averages of this server over the last 1, 5 and 15 minutes are 1.16, 1.03 and 0.81 respectively. These figures represent the average number of processes running or waiting to run at the same time. Since a CPU/core can run only one process at any given time, a load average above the number of CPUs means that processes are waiting in the queue, i.e. the server is overloaded. For example, with the UP kernel (where only 1 CPU is detected), the load relative to capacity is (1.16 / 1 CPU) x 100% - 100% = 16%. That is, the single CPU is 16% overloaded and has some processes in the queue.

In the SMP case (2 CPUs), we got the same load average, so the result is (1.16 / 2 CPUs) x 100% - 100% = -42%, which means the system is 42% underutilized (i.e. the other CPU, or half of the system, is doing almost nothing). This is as expected, since simplex.out occupies one CPU and the other CPU is idle.
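The two calculations above can be written as a small helper. This is only a sketch of the arithmetic used in the text; the function name load_headroom is ours.

```shell
#!/bin/sh
# Load relative to capacity, in percent: (load / cpus) * 100 - 100.
# A positive result means processes are queued; a negative one means spare capacity.
load_headroom() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f\n", load / cpus * 100 - 100 }'
}

load_headroom 1.16 1   # UP case from the text: prints 16
load_headroom 1.16 2   # SMP case from the text: prints -42
```

On a live Linux system, the first field of /proc/loadavg gives the 1-minute load average that could be passed as the first argument.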

2.4 dstat command

Now we present the output of the dstat -C 0,1 command while running Simplex. dstat is a new command in Red Hat 5.x and Fedora Core 8+ that combines the functionality of vmstat, iostat, mpstat, df, free and sar in a single tool. Figure 15 below shows dstat for the UP kernel run. Only one CPU is detected and it is almost 100% utilized. The number of context switches (csw) is relatively high, since with only one CPU the system has to switch the other processes onto this CPU to let them run.

Other values like disk reads/writes and network are all zeros since Simplex is CPU-bound only.

Figure 15: dstat for UP kernel run

Figure 16 below is for the SMP kernel run. CPU0 was totally idle while CPU1 was 100% utilized during the whole run. The context switching is lower than in UP because, with 2 CPUs, the system minimizes context switching: while one CPU is busy, other processes are placed on the other CPU.

Figure 16: dstat for SMP kernel run

2.5 mpstat command

The mpstat command shows performance statistics for all processors (and cores) in the system. Figure 17 shows the single-CPU case, where the CPU is 100% busy ("all" covers only one CPU in this case) running the user program (i.e. the Simplex benchmark). Figure 18 shows the SMP run (2 CPUs), where mpstat -P 0,1 2 reports the load on each CPU at 2-second intervals. CPU0 is idle while CPU1 is fully loaded running Simplex.

Figure 17: mpstat output in UP mode

Figure 18: mpstat output in SMP mode

2.6 iostat command

Like dstat, iostat reports CPU statistics and I/O activity for the system. For the UP kernel (Figure 19), the results are the same: 100% CPU utilization and no hard disk activity, since the Simplex benchmark is purely CPU-bound (sda is the first storage device, which is the hard disk). The %sys (system load) is minimal and covers context switching and interrupt handling at the kernel level. Figure 20 illustrates the 2-CPU (SMP) case. The system shows 50% aggregate utilization, since one CPU is idle.

Figure 19: iostat output in UP mode

Figure 20: iostat output in SMP mode

2.7 sar command

As with the other monitoring commands, the sar -I ALL command gathers statistical data about the system. Figure 21 shows the SMP kernel while the system is busy running Simplex. CPU0 is idle while CPU1 is fully utilized. The %sys (system load) is minimal and covers context switching and interrupt handling at the kernel level, as detailed for the dstat command earlier.

Figure 21: sar command output in SMP mode

2.8 time command

The time command reports how long a program takes to execute in terms of:

- real time (the wall-clock time the program takes to execute).

- user CPU time (time used by the program itself and any library subroutines it calls).

- system CPU time (time spent in system calls invoked by the program).

Important note: some Linux distributions have two different time commands: the shell built-in time command, which provides basic functionality [4], and the time application, which is an executable separate from the shell. To use the advanced command, you need to specify its fully qualified path (e.g. /usr/bin/time).

First we ran the Simplex benchmark using the UP kernel and measured its real, user and sys times. As indicated in Figure 22, the system time (sys) is minimal and most of the time is spent running the application itself. The real time is the wall-clock time that the application took. As expected, Simplex takes slightly more time in the SMP setup, as indicated in Figure 23.

Figure 22: time command output in UP mode

Figure 23: time command output in SMP mode

Next, we used the advanced time command (/usr/bin/time) to measure other parameters. Figure 24 below shows the time command reporting additional parameters, such as I/O (input/output), which is 0 since simplex.out is purely CPU-bound, and page faults (a minor page fault occurs when the data in the page is still valid but the system tables must be updated). Swapping is again 0 since the program is CPU-bound. In Figure 25, we used the advanced time command with another option (/usr/bin/time -f %P) to calculate the percentage of the CPU that this job got, computed as (%U user + %S system) / %E elapsed.

Figure 24: Advanced time command calculating IO activities

Figure 25: Advanced time command calculating CPU percentage
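The CPU percentage formula can be illustrated with a short calculation. The sketch below takes user, system and elapsed times in seconds; the helper name cpu_percent and the timings in the example are made up for illustration and are not measurements from this report.

```shell
#!/bin/sh
# CPU share of a job in percent: (user + system) / elapsed * 100.
cpu_percent() {
    awk -v u="$1" -v s="$2" -v e="$3" 'BEGIN { printf "%.0f%%\n", (u + s) / e * 100 }'
}

# Hypothetical job: 4.9 s user, 0.1 s system, 10 s elapsed.
cpu_percent 4.9 0.1 10
```

A purely CPU-bound job that has a core to itself would be expected to score close to 100%, since almost all of its elapsed time is spent in user and system CPU time.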

3. UnixBench Tests

UnixBench 5 [2] is a general-purpose benchmark designed to provide a basic evaluation and indicators of the performance of a Unix-like system. Multiple tests are used to test various aspects of the system's performance. These test results are then compared to the scores from a baseline system to produce an index value, which is generally easier to handle than the raw scores. The entire set of index values is then combined to make an overall index for the system. It auto-detects and supports both single- and multi-CPU systems.