Change to pen
Good morning ladies and gentlemen, I’m Andrew Noske,and my thesis is entitle“Efficient Algorithms for Molecular Dynamics Simulations and Other Dynamic Spatial Join Queries”.
Firstly I’ll recap on what my thesis is about. This work is part of the “Towards Molecular Structure Kinetics” project. My role was to create an efficient “engine” layer for interacting particles; which can be built on by future studentsperforming particle simulations.
Molecular dynamicsis a computer simulation technique whereby, at intervals of some timestep you calculated forces between all atom pairs, and then use this to move all atoms.
[1:00] [2:00]
Comparing all particles toall other particles is obviously not scalable, so it is common practise in thesesimulations to introduce a cut-off radius. Forces are now only computed on all pairs of particles within the cut-off radius of each other (we call these neighbour atoms), and this problem is actually a dynamic self-spatial join query.
A range search is: finding all points within some radius of a given point.
A self-spatial join query is:given a set of points, retrieve all unique pairs of points, such that their separation is less than or equal to some given cutoff radius. This problem has widespread application, not just in bioinformatics, but in computer graphics, databases, computer games and geographic information systems to mention a few.
The primary aim of my thesis was to investigate several new and existing techniques to optimisethis self-spatial join process, since it takes the bulk of processor time in molecular dynamics simulations.However the thesis has practical value in many fields.
[1:00] [2:00]
For the purposes of testing I used a very common pair-wiseinteraction modelcalled the Lennard-Jones pair potential model. This is the equation for it. If particles are far away have negligible attraction, as they get closer the attraction grows, but if they get too close they have strong repulsion.I’ve been using this model to simulate atoms wobbling about in a liquid.
[0:20] [2:20]
In order to simulate a bulk liquid I also used a Periodic Bounding Condition, whereby a cubic box is effectively replicatedinto an infinite lattice. Boundaries wrap-around, so that forces and particles which leave one face will enter through the opposite face.(yes, it’s just like the classic game asteroids, and if you’ve never played asteroid you lived a deprived childhood).
[0:20] [2:40]
The most efficient known techniques to perform this type of simulation are the verlet neighbour list techniqueand use of a fixed grid data structure to execute range searches.
So naturally these were the two techniques I build on.
In a fixed grid, the box containing all atoms is partitioned into a grid with a fixed number of equal sized cells per side. You can determinewhich cell an atom belongs to in constant time by simply dividing the atoms position by the cell length.A fixed grid is the optimalspatial data structure for conducting range search queries; unless points are highly skewed… but in a liquid atoms we’ll always have about the same number of atoms in every cell.
A popular method I reviewed is called a cell list. In a cell list, you setup the grid such that the size of each cell side is equalor a bit less than to the cutoff radius. And that basically means, for any atom in this cell any atom in this list of adjacent cells is a potential neighbour.
Andthis is where my investigation effectively began.I will introduce other techniques, and refreshing your memory of techniques from my last seminar, as I present results, but first I figured I should revise the scientific testing process I used.
[1:30] [4:00]
The process in each simulation was as follows; first the grid was setup ; atoms were placed into a lattice formation and given a random offset and random velocity. Then, each iteration, the neighbor list was built , and used to calculate forces between atoms. Then atoms are moved, and wrapped back the box if they travelled outside it. Very standard for molecular dynamics simulations.
[1:00] [5:00] SKIP?
To perform simulations I used a console application programmed in Microsoft Visual C++ 6 and would set up a grid and atom system of atoms with the parameters I wanted, and then batch process a number of simulations. The most annoying part of the project was each batch Iexecuted a hundred or more simulations in a loop, and could take hours to finish. In fact I probably spent over 70% of my time on this thesis just modifying loops and waiting for simulations to finish. I put each simulation through enough iterations for decent accuracy, and results from each were appended to CSV files, which I opened and analysed using Excel. The main metric of performance I’m usedwas the average number of CPU tics per iteration, but I also tallied and graphed numerous other variables including the number of distance calculations performed, performance breakdowns, # of cells search, just to name a few.
[1:00] [6:00]
To perform a range search in a fixed grid, an alternative to the cell list is: for each atom, work out which cells lie within plus and minus the cutoff radius in each dimension and check all these cells, I decided to call this acubicatom list.A disadvantage of atom lists is every iteration you have to determine which cells are in range of each atom, whereas in a cell list you already have a list of these cells.
A very important value through this thesis was the number of cell sides the cutoff radius spans. I call this value cellSidesPerCutoffRadius. For example …
Often it is necessary to make a finer grid, and the cellSidesPerCutoffRadius will be greater than one. In this case, instead of building a cubic cell list, we should refine this list to eliminate any cells outside cutoff radiusof the root cell. I call this the minimal cell list. It essentially describes a cube with rounded edges. As you can see, its volume is greater than the cutoff sphere, but the advantage is you only need to generate this thing once at the start of the simulation, and use if like a template. Also, as cellSidesPerCutoffRadiusgrows the difference in volume(between the cutoff sphere and my minimum cell list) reduces.
I’ll show you which of these approaches is better soon enough, so please contain your excitement Marion.
[1:30] [8:30]
Once you’ve generated a template for a minimum cell list, you have an important (storage vs., performance)implementation decision. For each cell, do you pre-compute values, so it contains pointers to the cells in its cell list, or do you not store values and recompute them each iteration. I called these techniques the loaded cell list and unloaded cell list respectively.
Here are the results.
Most of my graphs actually follow this same format. Each of these dots represents a simulation and I show the parameters I’ve used here. In this one a fixed radius, but I’ve increased the number of cells per side. X axis shows the cellSidesPerCutoffRadius. The y axis is the average cost of each iteration in processor clock tics, and on the machine I used in testing 2000 clock tics was approximately 30 seconds.
Costs per iteration using these cell list jumps up and down a lot. You’ll see why soon.
These results show loading indexes give much better performance, but is actually a catch. Depending on the size of the cell list, loading this many pointers into EVERY cell can take a lot of storage space, and it ALSO took a lot of time to load these values at the start of the simulation. This time grows exponentially as cellSidesPerCutoffRadiusincreases, and so this represents a huge disadvantage for a loaded cell list.
[2:00] [9:20]
One of the most successful techniques I proposed was something I call the half-range search technique. If, for each atom, instead of searching for neighbors within a sphere, I search for neighbors within, the upper hemisphere, I still capture all neighbors.
If I’m checking I, and J is belowI, then J is disqualified, because I know it will be captured when I’m considering J to I.
Now, instead of generating a full minimum cell list I can generate a halfminimum cell list.
In fact this approach is also better in the respect that I don’t capture each neighbor pair twice… therefore I don’t have to check for duplicates.
[0:40] [10:00]
This table shows the number of cells included in a half minimum cell list, a half cubic cell list… and also the full lists. Notice that the number of cells keep jumping up a notch every time it gets to the next number… this is why the last graph I showed you was so irregular. I will now zoom out, and this shows that as cellSidesPerCutoffRadiusincreases, the difference between the minimum lists and cubic lists eventually approaches about 52%.
[0:20] [11:00]
This graph shows performance improvements using the half-range search technique for a loaded minimum cell list. Improvements grow as cellSidesPerCutoffRadiusincrease, up to about 50% when the cutoff radius spans 4 cell lengths.
[0:20] [12:00]
This graphs improvements of the half-range searchtechnique using an atom list. The improvements are significantly better for the atom list. The reason for this is the difference in the number of cells searched using an atom list increases faster.
These results show the half-range search technique is very effective; however raises another question… is the loaded cell list technique better than the atom list technique?
[0:20] [12:30]
My investigation found that using the optimal number of cell per side, the loaded minimum cell list technique was USUALLY better than the atom list technique, but not always, and never by much.Most importantly, you can see the trend for the atom list is much smoother. This means that if you didn’t chose the optimal number of cells per side, you’re more likely to get better performance with the atom list, and unlike the loaded cell list, you can actually change the number of cells per side with minimal cost to performance – you only have to change the number of cells; you don’t have to load anything.
[0:40] [13:10]
A very different method of improve performance is trying to improve cache hits through better spatial locality.
The spatial search algorithms I use processesall atoms in one cell thenpoints in nearby cell in sequence . If the array storingmy atoms is unsorted two nearbypoints are probably far apart in memory, and this will result in many cache misses.To improve spatial locality, I can group points in cells togetherthenarrange these using a space-filling curve.
To revise: aspace-filling curve is a line passing through every point in a space, using an algorithm.Row major ordering is the simplest to achieve, but the z-order curveandHilbert curveexhibit better spatial locality. In fact I read about improvements of up to 50% using these techniques, but after eventually figuring out how to program the curves I was pretty disappointed by these results.
[0:50] [15:00]
In this graph Z-order curve only improved on random ordering by about 2% and Hilbert curve not much better, which I found very disappointing considering and certainly not worth the effort of coding. However, when I ran simulations withmore than 200,000 atoms I found there were suddenly big improvements. Compared to a random order, Z-order performed almost twice as fast and row ordering performed almost as well. I figured this was because suddenly the whole atom list couldn’t fit into memory at once, and the operation system had to use virtual paging. Unfortunately, I also found that, when this occurred results were inconsistent and to get more accurate results could have taken forever.
[0:40] [15:30]
Something else I did in my thesis was analyse optimal number of cells per side. For a cell list, the optimalvalue was very dependent on cellSidesPerCutoffRadius. Notice here the optimal number of cells per side migrates from 1 to 2 as the number of atoms increases. On the other hand, for an atom list, the optimal number of cell sides was not so much dependent on cellSidesPerCutoffRadius, but instead dependent on the average number of atoms per cell. Best performance was almost always when there were just one or two atoms per cell.
[0:30] [16:00]
Since the atom list has a smooth gradient and minimal cost in dynamically changing the number of cells per side, I also wrote a algorithm to dynamically find the optimal number of cells per side. Basically, it makes a first best guess, steps up or down in increments. It averages the time tics periterations using different values, and based on this knows whether to increment or decrement the value, and eventually finds a minimum. I found the algorithm was fairly successful at finding the optimal value, and finding it quickly, although often it would get stuck on a local minimum.
[0:30] [16:00]
The fixed grid is a the most efficient technique for conducting self-spatial joins, but in a moving point environment, we can build on the fixed grid and get big time savings by using the verlet neighbour list technique. I DID talk about this technique during my first seminar, but I figured you would have forgotten by now, so I’ll recap.
In the verlet neighbor list technique, the cut-off sphere, around each molecule is surrounded by a larger sphere, called a ‘skin’. During the first timestep you use a fixed grid to build a large neighbours list, containing all pairs of neighbours within the verlet skin radius of each other (instead of the cutoff radius). Over the next few time steps, the neighbours list doesn’t need to be rebuilt, but instead the list is updated, by recalculating all distances and seeing which of these verlet neighbours are within the actual cut-off radius. Updating the list is much cheaper than rebuilding it.The speed at which atoms move determines how frequently we have to rebuild.
[1:00] [17:00]
This little line here is the performance without using the verlet technique; this line is the performance using the verlet list technique. As we increase the verlet radius relative to the cutoff radius, the number of iterations between rebuilds increases linearly, and we get optimal improvement when the verlet radius is 1.2 times the cutoff radius in this simulation. This next graph briefly shows that the optimal verlet radius depends on how fast atoms are travelling. The slower they travel, the smaller the optimal verlet radius. This other line shows the actual size of the verlet neighbour list increasing exponentially as the verlet radius increases.
[1:00] [18:00]
I proposed using a dynamic algorithm for finding the optimal verlet radius which was very similar to the algorithm I used to find the optimal number of cells per side. However, in the cells per side algorithm when my values fluctuates up and back like this, I detect it, and then just stop… but for finding the optimal verlet radius, I let the algorithm just keep going, because, in a simulation particles might suddenly speed up or slow down therefore the optimal verlet radius may change over time.
[0:40] [19:00]
The final techniques I tested were aimed at actually reducing the cost of verlet updates.For this verlet neighbour, which is well outside the cutoff radius, it’s unlikely to pierce the cutoff radius for at least a few timesteps therefore it does seem like a waste to recalculate the distance between these every single timestep.This graphs shows the average fraction of neighbours in a verlet neighbour list which are within cutoff radiusof each other, showing that even for a verlet radius 1.2 time the cutoff radius, 40% of atoms are outside cutoff radius.
To reduce the cost of checking atoms outside the cutoff radius these I proposed a technique called “selective checking of verlet neighbours using atom displacements”.Foreach neighbour pair I store a safe distance. Each iteration I calculate the displacement of all atoms since the last iteration, and for each neighbour, if the safeDistance is positive, I subtract the sum of the displacement of this atom and this atom. When the safeDistance becomes negative, it’s possible (if the atoms have travelled directly towards each other) that the atoms are within the cutoff radius of each other, so I recompute the distance and the safeDistance.Obviously, the smaller the safe distance becomes, the more frequently I check atoms.
[2:00] [21:30]
The coding for this was actually very easy, for a nice change results were almost as good as I expected. For this simulation it shows a performance improvement of about 7%, but the bigger advantages is the fact it reduces the cost of checking the list, therefore the curve increases slower, which is good, because you can’t assume the user, or even my little algorithm, will always pick an optimal value, but with this technique at least you’ll always have closer to optimal performance.
[1:00] [22:30]