Wednesday, November 11, 2009

More Implementation Strategy patterns

Shared Queues
This pattern covers the common practice of using a shared queue in concurrent program. The authors talk about the major decisions being a centralized vs decentralized queue and simple protocols vs. more concurrency.

I have implemented a shared queue, where a pool of threads pulled tasks out of a shared queue. Since the accessing the queue represented a fractional amount of the task, it used what the authors call a simple protocol, using a mutex to protect the data structure. It was also centralized, since the application was running on a single system. If the program had spent more time in the dequeue or queueing operations, I may have chose to implement in a way that allowed greater concurrency.

Speculation
I found this pattern rather neat, as I hadn't thought of applying the idea which I was aware of to my software before. Rather than worrying a serialized task must run serially, one can still divide it up, detect if the result is correct, then rerun the parts that are incorrect serially.

It seems to me that problems that could fall under this are not too common as it needs several preconditions. First, the tasks must be long enough to to warrant the overhead and dividing up of speculations. Second, the task must be at least potentially dividable. The example given of parsing HTML was great as the page can be divided up. However, if the each task truly requires a previous task to complete, it will need rerun anyway, unless the result of the previous task can be guessed.

Circuits
This pattern was not what I expected when I read the name. Instead of being about digital circuits, it was really on how to parallelize bitwise operations over large blocks. A help in some contexts, but not one I'd tend to see in my work.

Monday, November 9, 2009

Implementation Strategy Patterns

These patterns cover some of the ones available when implementing algorithms for a parallel program.

Loop Parallelism
This pattern covered some of the techniques available for parallelizing loops, or allowing the loop to be processed concurrently over different bounds. The author first recommends profiling the application to identify what loops are used the most and use those as candidates. If one has a target increase desired, using Amdahl's law shows how much code needs to be improved. This could present a large percentage of the loops.

Task Queue Implementation
Tasks can be used for several different algorithms, such as graph traversal or part of loop parallelism. When implementing this pattern, first a task granularity must be decided upon, as discussed in other patterns. Next, the structure for the queue, centralized or decentralized must be chosen, with pros and cons for either choice.

It was nice to see code examples, although writing them in Python may not be the most universal way to document a pattern, as it is not as popular as c++ or Java. Even with out the examples, the concepts in this pattern were not as complex as others, which would have benefited more from examples.

Graph Partitioning
This pattern covered some of the many different ways a graph may be partitioned. As this is an NP-complete problem, different heuristical approaches were presented. I have not been in a position where I needed to do this, but having it documented may be useful for others in the academic or scientific domain.

PRES

PRES is a neat tool that helps developers reproduce concurrency bugs found in the field. Some tools try to search every path, but can take many replays (1000+) before being able to reproduce the problem. This is due to the many sources of nondeterminism in a multiprocessor system. Threads can be interleaved in any order. The other approach is to capture run time data when the system is deployed to reproduce any problem, but this comes at a high cost of performance, sometimes slowing down the system 10X.

PRES takes a middle ground of recording some of the run time data, such as the order functions are called. When a developer needs to reproduce a bug, he gathers this data and PRES uses it to limit the number of replays required in order to reproduce a bug. It guides the replay system automatically to not go down paths that did not happen in the field.

I thought it also made a wise choice to start with the possible race conditions closest to the time the bug occurred and work outward. This makes it more likely to find the race condition as usually there is some locality to a bug. This does not have to be the case, in which case it would be neat to offer other heuristics, such as targeting specific threads or data for race conditions.

I often develop multi-threaded applications for Windows in both user and kernel mode. While a system like PRES could be useful sometimes, I have not needed it due to other tools working well enough. Between static analysis tools (PREfast and Static Driver Verifier) and dynamic analysis tools (Driver Verifier, Application Verifier, UMDH), I can typically be warned of race conditions without waiting for them to occur. When a bug occurs, a dump of the process (available on any Vista+ box) can usually pinpoint the problem. Kernel level code typically bug checks the system quickly after a problem, also producing a dump that can be used.

That said, PRES can still have its place for some heisenbugs that occur in the field.

Wednesday, November 4, 2009

More strategy patterns

The patterns in this post cover different ways to dividing up data in an algorithm for processing. I have not had the opportunity to work with these patterns personally yet.

Geometric Decomposition
This pattern covers dividing data up into chunks to be processed concurrently. The one thing that struck me as interesting was the necessity in many of the examples for optimizing for hardware. Deciding the size of chunks and number can both be affected by the underlying hardware for some algorithms.

Pipeline
In this pattern, data is stuffed down a pipeline where different stages can work on different objects at the same time. Think of an assembly line or a CPU pipeline. While it is used most frequently at a low level in hardware where one part of the pipeline focuses on a specific task, it can be used at a higher level as well. The assembly line is one example. This pattern would be useful any time it is faster to have individual threads focusing on one task in a system. I think this could be easier to implement, even if the speed difference is minimal, than having each thread perform everything.

Data Parallelism
This pattern was pretty much non-existent, with only a paragraph currently written.

Armstrong Chapter 6

This chapter dealt with the practical aspects of creating applications in Erlang. A program can consist of numerous processes, performing a worker or supervisor role. I found his approach straightforward and the modules easy to understand due to intentional programming.

I also found it interesting how a small set of behaviors abstracted out the majority of the types needed for the entire system. Client/server, event handling and finite-state machine modules covered the majority of functionality required in a parallel system. It is nice to be able to abstract this out, as it reduces the amount of complex code that needs to be tested, and helps with the fault tolerance of the system.

Between these behaviors and the supervisor abstraction, a nice API is available for the developers.

Monday, November 2, 2009

More Patterns

We read three more patters:

Task Parallelism
This pattern covers a strategy for implementing algorithms. Many algorithms can benefit in a parallel system by performing distinct tasks in parallel. For example, performing computations on every vertex in a graph presents the parallel opportunity of running a task on each vertex.

I have the most experience is running OS level processes and threads in parallel computing. This pattern requires finding the best granularity for a task and where to run a task, but I don't think you need to be a hardware expert to determine this. One only needs to know the relative costs of running tasks, such as over the network vs. locally.

Recursive Splitting
This strategy pattern covers algorithms that use recursion to find a solution. I thought this paper was the best of the three, as it had great examples and illustrations. The key to this pattern is producing more than one task per recursive task, as it allows for parallelism.

Discrete Event Pattern
The discrete event pattern is another strategy pattern covering using events in an asynchronous system. The order may matter, as events may cause other events through the system. This paper read and looked like a rough draft of a pattern, and as such was harder to follow.

Armstrong Chapter 5

This chapter covers building a fault tolerant system. First, supervisors are used to detect when a process crashes. The supervisor can by an AND type, where if one child crashes, all are restarted, or an OR type, where only the faulty child is retried. I can see the OR model being common, as sub tasks hopefully do not need to know about other sub tasks. AND is useful in the cases where the tasks are dependent on the success of the other tasks.

Armstrong then covers the differences between exceptions, errors and failures. Exceptions occur whenever the run-time does not know what the do. Programmers then decide if these are truly errors or not by catching them and handling if necessary. Actually, the decision is not to be made by the programmers, but by the designers according to Armstrong. The specification for a function should tell what to do in the error cases, otherwise the programmer should just error out. This gives functions well defined behavior, but at the cost of needing to have good specifications. Problems are then specification problems, and are easier to recognize and fix. I think this is quite beneficial as a result, at the cost of more time up front spent in design.