Software performance enhancement using multithreading and architectural considerations

Archer

Introduction

The Archer application is intended to recover a lost password used to garble (encrypt) an ARJ archive. Due to the brute-force nature of the algorithm, such a process naturally involves many heavy calculations and memory operations, leading to high CPU consumption. The program also contains a heuristic element, so there is room for algorithmic optimization as well.

Taking into consideration the above facts we've chosen this application for our project.

Download

The program source is available for free download from the SourceForge download page.

General Algorithm Description

The ARJ archive program uses a Huffman-table search-and-substitute algorithm, which is expected to shrink the size of the file being archived. The following description does not delve into the principles of this operation, as that is beyond the scope of this article; there is plenty of material on Huffman tables and trees on the web. Here we present only the relevant parts of the program, where the optimizations took place, and cover those pieces of code in detail.

The program is written in plain C, and its original version is implemented in 16-bit code without any threading or SIMD instructions. Furthermore, certain pieces of the program were completely rewritten in order to employ more efficient algorithmic practices.

The program operates as follows:

1) The input file is read.

2) The smallest garbled file inside the archive is selected.

3) Passwords are tried iteratively until the CRC32 of the decoded data matches the CRC32 stored for the file.

The ARJ archive program uses the following technique to garble the produced archive:

1) Compress the file(s) as usual.

2) XOR the resulting contents with the password, which is chained (repeated) as necessary to match the length of the compressed data.

An important point is that several compression methods can be encountered in ARJ archives. The first three (the -m1, -m2 and -m3 command line options) differ only in the depth of the produced Huffman tables and therefore lead to the same unpacking procedure; for simplicity, all three will be referred to here as the 1st method. A very important property of this 1st method is that it can be processed using certain heuristics at a very high speed in terms of password trials. Soon after the decode procedure starts, it is often possible to discover that some inner data structures, such as the lengths of the tables or the tree depth, do not fall within the allowed bounds, which immediately leads to skipping to the next password. Thus, for the 1st method and the majority of passwords, only the beginning of the compressed file is analyzed; the remainder is accessed only in the rare cases when the decoding "goes well".

There is also the 4th method (the -m4 command line option), which corresponds to the "fast" compression method of ARJ. It is faster than the first three (and produces a larger result) because it uses fixed table sizes and does not try to determine the dictionary length dynamically. The impact on password guessing is that, for each password, the entire contents of the compressed file must be processed, and only then can the CRC32 be used to decide whether the password was correct. When the given archive contains only relatively large files, the program's operation slows down dramatically. Although there is no way to employ the "shortcut" technique of the 1st method, our optimization addresses this case as well.

Optimization

32-bit variables

Because the original program used 16-bit variables only, it suffered performance degradation on Pentium processors. We converted the majority of the variables to their 32-bit signed/unsigned analogs, which leads to better performance (see graphs). The only variables and buffers that remain 16-bit are those that the algorithm inherently requires to be so, for example where a variable is assumed to overflow at a certain point.

Power buffer unwinding

As we discovered, two buffers created dynamically during the program's operation are actually filled with the same values: certain combinations of powers of 2. We therefore hard-coded these tables into the program and greatly simplified the procedure that created them. Instead of passing extra parameters and producing the tables at run time, we created two versions of this procedure and call them as appropriate. The speedup is shown in the graphs below.

Cache alignment

All program buffers used for the various tables and for packed and unpacked data were modified to be cache-aligned, giving faster memory access because the data is aligned to the cache line size.

Threading

The main drawback of the original solution is the lack of balanced load on a multi-core machine. As a short inspection of the program makes clear, there is virtually no dependence between individual password trials, so as many threads can be launched as the hardware allows.

Our solution determines the number of processors/cores on the host machine and launches the optimal number of threads accordingly. The only interaction point between the running worker threads is synchronization on the password increment, which is done for overall consistency. Apart from that, every worker thread has its own local variables and buffers for producing its result, and all the common read-only data, such as the compressed source data and the archive file header, are global variables shared by all the threads.

Our original threading scheme looks like this (in this diagram, a solid line represents concurrent execution of a thread, while an oval shape represents a synchronized operation):

The workers are totally independent except at the point where they need to retrieve the next password; there we made them synchronize. The rest of the time they perform the decoding and try to figure out the password, and the only sections that must not interleave are the "Increment password" sections.

The main thread does all the initialization: it loads the file, determines the number of workers to launch, decides whether to use SSE instructions, etc. It then launches the workers and the show-progress thread and waits for all of them to finish.

The show-progress thread does not synchronize with the workers; instead it takes snapshots of the relevant information and displays them periodically. This approach improves performance because no synchronization objects are needed. There are also some subtle points connected with this, which lead to the appearance of some spurious data race reports; these are resolved by the semantics of the program. See details in the Thread Checker section below.

Remark: As it turns out, when the password trial rate passes a certain point (this is relevant to the 1st method only), above roughly 1 million passwords per second, synchronizing on the password increment becomes a drawback, since this operation itself consumes significant CPU time relative to the decompression trial routine. We therefore decided to disconnect the worker threads at this point too, by properly initializing the starting password value for each worker. Since every worker knows how many threads are running in total, it simply advances its own password by the appropriate step, so complete coverage of the password space is achieved. This change allows utilizing the full CPU power at high password trial rates; otherwise we experienced performance degradation due to thread interference.

Thus, our revised thread model looks like this:

For the 4th method this is not necessary because of the much lower password rate. Besides, the synchronization ensures monotonic coverage of the password space, which is desirable when the password trial rate is relatively low.

Here you can see snapshots that show the equal load on both worker threads:

Method 1, 2, 3:

Method 4:

Using SIMD instructions and registers

The algorithm for decrypting methods 1, 2 and 3 works in a way that prevents using the SIMD approach. The main problem is that decryption can be stopped at any time, when the sanity check on the internal data structures fails after decrypting the current byte. These abandoned decryptions happen frequently for methods 1, 2 and 3, so using SIMD there would actually slow down execution.

The only place to use SIMD is method 4. This method does not use internal data structure sanity checks; instead, the file is decrypted from beginning to end for each password tried. The decryption itself is very simple: a bytewise addition of the data with the current password and an XOR with a special byte value saved in the header.

In the original code this operation takes place each time a new decrypted byte is needed. If the size of the encrypted file is N bytes, the original algorithm performs N decryptions (addition and XOR). The SIMD approach uses SSE2 instructions and the 128-bit (16-byte) XMM registers to reduce the number of decryption operations: using SSE2 instructions we perform N/16 of them.

The overhead of this approach comes from storing the decrypted file in a temporary buffer from which the decrypted bytes are read later. This temporary storage requires allocating a memory buffer of the file's size. In addition, the whole decryption then requires N/16 extra loads and stores.

Another performance boost can be obtained by smart maintenance of the password chaining during SSE2 decoding. The original code keeps the password in memory and loops a pointer over its bytes again and again while decoding each next byte. In our optimization we keep the whole password maintenance in XMM registers, saving the now-obsolete loads and stores.

How does it work? Say the password is "abc". First, we chain it with itself to form a 16-byte value in memory, giving "abcabcabcabcabca". At the start, our decryption function loads this value into an XMM register. Now the whole password is in the register, and we do not need to load it from memory over and over as in the original code. The only thing that remains is to shift it appropriately when the next 16 bytes of data come in for decryption. This shift is a function of the password's length. An example of how the algorithm works on the password "abc" is shown below:

An additional optimization was added: when 16 % password_length equals zero, no action is required at all.

Summarizing the SIMD optimization explained above, the figure below shows the pseudo code of the original source alongside the optimized pseudo code.

Cache Misses

An additional penalty that came with the SIMD optimization was cache misses, caused by the large memory consumption of the temporary buffers allocated for the decrypted files. Each worker thread allocates a temporary buffer equal to the size of the original file. Even for modest files of 1-2 MB this amount of data cannot fit in the cache, so every load or store to these buffers causes cache misses. The solution applied was to use small constant-sized buffers that are refilled whenever the data they contain is no longer required. The size of these buffers has a critical influence on performance. To support refilling, the function that decrypts the file must store the current 16-byte value of the chained password at the end of every invocation, and the current byte position in the buffer must be checked each time to detect when a refill is needed. If the buffers are too small, the refill overhead becomes significant; on the other hand, large buffers cause cache misses. The buffer size is a tunable parameter, and we tested it empirically to find the optimal value: after many tests on a Pentium D machine, we found it to be 128 KB.

Optimizing CRC32

As we determined, the calculation of CRC32 over the resulting data is used quite heavily, especially with the 4th compression method. Therefore, the CRC32 calculation routine was rewritten to use pre-generated values of the polynomial. These values fit into 4 basic tables, since the CRC32 calculation is an iterative operation over the given data. Each byte of the data influences the corresponding bits of the CRC32 value, so the calculation can be done in buckets of 4 bytes: the corresponding pre-calculated intermediate CRC32 values are looked up and combined directly with the current value. This replaces creating a single CRC table at runtime and iteratively calculating the CRC for each byte separately. This change roughly doubled the speed of the CRC32 calculation alone.

Thread Checker

Below we present an analysis of the potential data races detected by the Intel® Thread Checker, describe their nature, and show that they have no impact on the program's validity. They are not real data races because of the program's semantics: technically, race conditions are indeed present, but the program was deliberately designed this way to improve performance, and these "race conditions" are benign because the program's functionality does not depend on them. Details follow:

As you can see, there are seven reported data races; we describe each one below:

1.

Here we see a special variable called "foundpassword", whose purpose is to store the found password globally. Initially this variable is set to zeros. Once one of the worker threads discovers the password, it copies it into this variable (line 197). Meanwhile, all the other workers periodically check whether this variable is still zero (line 166). If they see it has been changed to a non-zero value, they conclude that someone has discovered the password and terminate so the program can shut down gracefully.

The Intel® Thread Checker reports a potential data race because there is no synchronization between these reads and the write. But according to the program's semantics there is no problem here: the variable is initially zero, and all that matters is when its value changes to non-zero. That is, there is no data integrity requirement, just a kind of "flag set" detection. So even if the read and the write happen "simultaneously", no invalid condition arises.

This solution was chosen to improve the program's performance; otherwise a synchronization object would have been needed, which would degrade performance. In some cases (like the one described in the Threading section) this degradation could become very significant.

2.

This potential data race is caused by the same issue as the previous one, but in the reverse direction: a Write->Read data race where the previous one was Read->Write. The meaning and the analysis are exactly the same.

3.

Here we use a Boolean variable called "exitprogress", which signals the thread responsible for showing progress to shut down (upon the program's termination). We wanted all the threads (both the workers and the show-progress thread) to shut down gracefully rather than being terminated forcibly at the end.

This variable behaves exactly like "foundpassword" in the previous two "data races": it is used as a flag, initially set to false and raised to true when the program should terminate. Again, this avoids introducing a synchronization object and degrading the program's performance.

On the other hand, data integrity has no meaning here, because the only event of interest to the reader is when this flag becomes true. Hence this is not a real data race, just like the previous ones described above.

4.

Here the potential data race is reported because each worker thread increments the global counter of tried passwords. This counter is maintained only for showing progress.