while(true) {
    Start A();
    Start B();
    Wait for A() to finish;
    Wait for B() to finish;
    Any processing needed to put everything together;
}
Essentially, yes, you can do this IF you start off by designing your entire application with such parallelization in mind.
At the beginning of the frame you need to assign work to every thread. Then you need to make sure that the world state is not written to by any thread while they're running in parallel. All results computed by a thread need to be stored until all threads are finished; the results are then applied in a world write pass at the end of the frame, run serially on the coordinating thread.
In complete and unrealistically simple pseudo-code, it looks like this:
PROC DO_FRAME:
    WORK = DIVIDE_WORK(WORLD, THREAD_COUNT)
    RESULTS = EXECUTE_AND_WAIT_FOR_WORKER_THREADS(WORLD, WORK)
    WRITE_RESULTS_AND_VERIFY(WORLD, RESULTS)
NOTE: The big problem here arises when computations inside worker threads depend on intermediate results calculated by other threads. With this scheme, one must use algorithms that depend only on the last frame's data. That is a pretty unrealistic assumption for a game that wasn't coded with parallel execution in mind...
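For anyone who wants something they can actually run, here is a minimal Python sketch of the same frame structure. Everything in it is made up for illustration: the world is just a dict of entity id to a number, and the "update" is a hypothetical `compute_chunk` that adds 1. Also note that in standard CPython the global interpreter lock keeps threads from speeding up pure-Python computation, so this only shows the shape of the scheme, not a performance win.

```python
from concurrent.futures import ThreadPoolExecutor

def divide_work(world, thread_count):
    """Deal entity ids out to the workers round-robin (illustrative policy)."""
    ids = sorted(world)
    return [ids[i::thread_count] for i in range(thread_count)]

def compute_chunk(world, chunk):
    """Worker: only READS last frame's world and returns its own result dict."""
    return {eid: world[eid] + 1 for eid in chunk}  # hypothetical update rule

def do_frame(world, thread_count=4):
    work = divide_work(world, thread_count)
    # Parallel stage: nobody writes to `world` while the workers run.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        results = list(pool.map(lambda chunk: compute_chunk(world, chunk), work))
    # Serial write pass on the coordinating thread, with a simple verification.
    written = set()
    for partial in results:
        for eid, value in partial.items():
            if eid in written:
                raise RuntimeError(f"two workers wrote entity {eid}")
            written.add(eid)
            world[eid] = value
    return world

world = {"urist": 3, "bomrek": 7, "kadol": 0}
print(do_frame(world, thread_count=2))  # {'urist': 4, 'bomrek': 8, 'kadol': 1}
```

The important part is that `compute_chunk` never touches `world` except to read it; all writes are deferred to the loop at the end of `do_frame`.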
As for assigning work to each thread: that's not a trivial task. You want the workload to be as equal as possible. If one thread takes a long time to compute its assigned work, all the other threads will sit idle waiting for it to finish. In game terms, you can try to assign certain tile ranges to various threads (though I don't think this'd work very well, given that a fort is concentrated in a small area of the map). You can also try splitting the object update lists (of 100 dwarves, thread A computes 33, B 33, and C 34).
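Splitting an update list into near-equal parts could look roughly like the sketch below. The function name `split_evenly` and the dwarf names are just placeholders, and keep in mind that an equal number of objects is not the same thing as an equal amount of work.

```python
def split_evenly(items, thread_count):
    """Cut `items` into `thread_count` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), thread_count)
    chunks, start = [], 0
    for i in range(thread_count):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(items[start:start + size])
        start += size
    return chunks

dwarves = [f"dwarf_{i}" for i in range(100)]
chunks = split_evenly(dwarves, 3)
print([len(c) for c in chunks])  # [34, 33, 33]
```

This gives the same 33/33/34 split as in the example, just with the larger chunk first.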
Note that there's an important non-parallel stage: application of results. This stage must also resolve conflicts to maintain internal consistency (example: "Dwarf A tries to marry dwarf B in the same frame as dwarf C tries to marry A", assuming monogamy). Furthermore, in these kinds of games, a huge amount of data may be modified (for instance, water may have flowed all over the map, affecting every tile). As Amdahl's law tells us, the entire frame computation will be delayed by this stage if it's slow, so a large set of results may ruin performance.
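To make the marriage example concrete, here is one possible way the serial write pass could resolve such conflicts. The rule used here (process proposals in sorted order, first valid one wins) is purely an assumption picked to keep the outcome deterministic; a real game would want its own policy.

```python
def apply_marriages(spouse, proposals):
    """Serial write pass: accept proposals in a fixed order and reject conflicting ones.

    `spouse` maps each dwarf to their partner (or None); `proposals` is the list of
    (proposer, target) pairs collected from the worker threads.
    """
    rejected = []
    for proposer, target in sorted(proposals):  # fixed order, independent of thread timing
        both_free = spouse.get(proposer) is None and spouse.get(target) is None
        if both_free and proposer != target:
            spouse[proposer] = target
            spouse[target] = proposer
        else:
            rejected.append((proposer, target))
    return rejected

spouse = {"A": None, "B": None, "C": None}
proposals = [("C", "A"), ("A", "B")]  # both computed from the same last-frame state
rejected = apply_marriages(spouse, proposals)
print(spouse)    # {'A': 'B', 'B': 'A', 'C': None}
print(rejected)  # [('C', 'A')]
```

Because the proposals are sorted before being applied, the result does not depend on which worker happened to finish first.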
Then there's the option which I see discussed a lot above: to "outsource" computations like pathing to worker threads, doing the main updating in a single thread. Of course, that means timing will primarily depend on that single thread, but if 90% of computations are pathing-related, that'd still mean a great improvement.
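For a rough feel of what "a great improvement" could mean, Amdahl's law can be evaluated directly. The sketch below takes the hypothetical 90% figure from above and additionally assumes the pathing work splits perfectly across workers with no overhead, which is optimistic.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

for n in (2, 4, 8):
    print(n, round(amdahl_speedup(0.9, n), 2))
# 2 1.82
# 4 3.08
# 8 4.71
```

No matter how many workers you add, the speedup in this scenario stays below 10x, because the remaining 10% still runs on the single main thread.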

...double buffering... While that's nice in its own way, it implies you don't have a separate result buffer for each thread; instead, each thread must obtain a lock when writing to the common output buffer. These locks will be acquired in an arbitrary order... Threads trying to acquire the lock at the same time will also have to wait to output their results; then again, there's no special world write stage in this scheme, which may compensate. An advantage is that you use the same type of data structure for output as for input, and don't have to write special structures for partially buffered per-thread output.
Writes occurring in an arbitrary order are not impossible to deal with, but there are definitely downsides as well as upsides to this approach.
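A bare-bones sketch of the double-buffered variant might look like this. The buffer names `front` and `back` and the "+1" update are invented for the example, and a single lock around every write is the simplest possible choice, not necessarily what anyone would ship.

```python
import threading

def run_frame(front, thread_count=4):
    """Workers read `front` (last frame) and write into a shared `back` buffer under a lock."""
    back = dict(front)       # output uses the same data structure as the input
    lock = threading.Lock()
    keys = sorted(front)

    def worker(chunk):
        for key in chunk:
            value = front[key] + 1  # hypothetical update, reads only the front buffer
            with lock:              # writers take turns, in whatever order they arrive
                back[key] = value

    threads = [threading.Thread(target=worker, args=(keys[i::thread_count],))
               for i in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return back                     # the caller swaps: `back` becomes next frame's `front`

front = {"tile_0": 1, "tile_1": 2, "tile_2": 3}
front = run_frame(front, thread_count=2)
print(front)  # {'tile_0': 2, 'tile_1': 3, 'tile_2': 4}
```

Here every key is written by exactly one thread, so the arbitrary lock order is harmless. If two threads could write the same key, whichever acquired the lock last would win, and that is exactly the case that needs extra care.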