me?
In all truthfulness, I'm not qualified to say. Because I consider software engineering my ultimate career, I try to stay informed =P but really, the effectiveness of multithreading for a program like DF will depend a lot on the architecture of the code and how amenable it is to parallelization. Let me provide an extended example. I'm prone to writing walls of text so you can skip this, probably.
Suppose you have 3 adjacent tiles on the map, and each square has some magma in it. we want to determine what the 3 squares are going to look like on the next tick of the game. how much magma will be left? what will each square's temperature be? should any stupid miners be on fire?
There are so many factors here to take into account. Let's supersimplify the situation to just temperature. If we assume temperature does not spread between squares, this problem is easy. We look at each square, how hot the magma is, and shazam. Temperature. Since each of these calculations is independent of the other two squares, we could try to do this on one thread (execute one square, then the next, then the third - 3 time units) or break it into three threads, and we will assume we have infinite threading capabilities in which case we've tripled the speed of calculation. sick.
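To make the independent case concrete, here's a minimal Python sketch. Everything in it is made up for illustration (the `magma_temps` values, the `tile_temperature` rule); it's not how DF does anything. It just shows that when no tile looks at any other tile, the one-thread loop and the three-thread version give the same answer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tiles: each entry is the temperature of the magma on that tile.
magma_temps = [1200.0, 1150.0, 1300.0]

def tile_temperature(magma_temp):
    # Made-up rule: the tile simply takes on the temperature of its own magma.
    return magma_temp

# One thread: visit the tiles one after another (3 time units in the example).
sequential = [tile_temperature(t) for t in magma_temps]

# Three threads: no tile depends on another, so all three can be handed out at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(tile_temperature, magma_temps))

print(sequential == parallel)
```

Caveat: in CPython the GIL keeps CPU-bound threads from truly running at the same time, so this only shows the shape of the idea, not a real speedup.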
But it's not so simple, is it? Let's now suppose that the squares are dependent not only on their own contents, but also the contents next to one another. We'll assume that the squares outside our 3-space system are "temperatureless", they just don't factor on the calculations at all. But now we have a problem. We can't calculate any one of the squares on its own, because that temperature is not the final value; we need to take into account the squares next to it. There are a variety of treatments for how the temperatures would be combined, that would affect our processing method. Let's suppose that an adjacent square has an "after-the-fact" effect - that is, after any temperature effects on a tile originating from the contents of that tile, only then do we take into account the effects of contents on adjacent tiles. So we can first calculate the pre-affected temperature of each tile. As with above, this process is conducive to threading - we've used one unit of time to calculate the 3 squares' pre-adjacent temperatures. But now we have to take into account those after-the-fact effects. We first attempt to calculate the leftmost square, by factoring in the magma on the middle square. Done. Now we calculate the rightmost square, by factoring in the magma on the middle square. Also done. Now we can calculate the middle square by factoring in the left and right squares. Done. That pushed our pretty one time unit up to four. Ouch.
Can we speed this up a little bit? Sure. Since the left and rightmost tiles are not interdependent, we can calculate them simultaneously across two threads. We're now down to three time units. not bad, not bad! But what if our code isn't smart enough to know that the left and right tiles are independent of each other? We could let it go about by trial and error until it finds a stable configuration - probably quite a waste of time cycles, since there are 6 permutations of the 3 tiles (we could go left right center, left center right, or right center left, etc etc), and we'd have to run through every permutation to check for the one that preserves independence across tiles, an independence that we can convert into threading gain. If we do this, we've basically thrown away any advantage to threading this situation at all - no matter how many cycles we save, we're almost guaranteed to have wasted more in the process of finding a threadable solution. So this is an ineffective path.
Let's instead assume we are veritable coding GODS. We'll spend a measly one time cycle to analyze the configuration of these three tiles, and boom! We know instantly that the left and right tiles are independent, we thread those two calculations up, and we're done. That gets us to our ideal 3 time units, yeah?
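The "analyze the configuration" step can be sketched as grouping tasks into waves, where everything in a wave can run at once. The task names and the `deps` table below are my own encoding of the three-tile example (an assumption, not anything from DF's code), and `schedule` is a hypothetical helper.

```python
# Hypothetical task graph for the three-tile example: task -> tasks it must wait on.
deps = {
    "pre_left": [], "pre_mid": [], "pre_right": [],
    "left": ["pre_left", "pre_mid"],     # left tile factors in the middle magma
    "right": ["pre_right", "pre_mid"],   # right tile factors in the middle magma
    "mid": ["left", "right"],            # middle tile goes last, as in the example
}

def schedule(deps):
    """Group tasks into waves; every task in a wave can run at the same time."""
    level = {}

    def level_of(task):
        # A task's wave is one past the latest wave among the tasks it waits on.
        if task not in level:
            level[task] = 1 + max((level_of(d) for d in deps[task]), default=0)
        return level[task]

    waves = {}
    for task in deps:
        waves.setdefault(level_of(task), []).append(task)
    return [waves[k] for k in sorted(waves)]

waves = schedule(deps)
for unit, tasks in enumerate(waves, start=1):
    print(unit, tasks)
```

With unlimited threads the number of waves is the number of time units, which is three here. Note that `schedule` itself is extra work that the single-threaded version never has to do, and that's the cost the example keeps running into.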
What happens if we make the situation even more complicated? Consider this: Final calculations to the middle tile's temperature caused that temperature to change, right? Well, wouldn't that affect the temperatures of the left and right tiles? What if we actually have to develop an equilibrium of temperature between the three tiles, before we can move to the next tick? hoo boy, now we're in for it. If we're inefficient, we could simply simulate the temperature changes over and over and over again, and if our code was right, the system would naturally approach this equilibrium state. funky, but extremely inefficient - how many iterations would we have to do to make this work!? With every calculation taking one cycle, we'd be burning time like nobody's business.
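Here's a rough sketch of the "simulate over and over until it settles" approach. The averaging rule, the tolerance and the starting temperatures are all invented for illustration; the point is only to count how many passes it takes.

```python
def relax(temps, tolerance=0.01, max_steps=10_000):
    """Repeat a made-up averaging rule until no tile changes by more than tolerance."""
    steps = 0
    while steps < max_steps:
        new = []
        for i in range(len(temps)):
            # A tile plus whichever neighbours actually exist (edges have only one).
            window = temps[max(i - 1, 0):i + 2]
            new.append(sum(window) / len(window))
        steps += 1
        biggest_change = max(abs(a - b) for a, b in zip(new, temps))
        temps = new
        if biggest_change < tolerance:
            break
    return temps, steps

final, steps = relax([1200.0, 300.0, 900.0])
print(steps, [round(t, 1) for t in final])
```

Within one pass every tile reads only the old values, so the tiles in that pass could be threaded. The passes themselves can't be: pass N+1 needs the results of pass N, so the whole thing is a chain of small parallel steps stuck end to end.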
What you can basically see happening here is that as the system grows more complicated (or even if we ignore the added theoretical complexity and simply scale up from 3 tiles to 4, or 5, or 6, or 2000), the returns we get from multithreading diminish. The work we have to do to set up an efficient multithreaded system and find the independent calculations grows as the system becomes more complex, but the benefits from threading off those independent calculations are not growing as fast. With each increase in the system's size or depth of calculation, it takes more and more work to find out which calculations are threadable, but the number of threadable calculations is not increasing to match (and it's the independent calculations that let us do things simultaneously for greater time savings). The problem is that ultimately, the system has to recombine somewhere - we can't do EVERYTHING in parallel. Eventually we have to stop, put all the calculations together, and do a few things one after another before we can start splitting off again, and to start splitting off we have to do work to figure out what we can split off, and so on and so on. The end result is that the more complicated the system gets, the less useful multithreading becomes.
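The "has to recombine somewhere" part is close to what's usually called Amdahl's law: if only a fraction of the work can be split across threads, the serial remainder caps the speedup no matter how many threads you throw at it. A small sketch, with fractions I picked out of thin air (I have no idea what DF's real split is):

```python
def speedup(parallel_fraction, threads):
    """Amdahl's law: overall speedup when only part of the work can be threaded."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)

# Made-up fractions, purely for illustration.
for fraction in (1.0, 0.9, 0.5):
    print(fraction, [round(speedup(fraction, n), 2) for n in (1, 2, 4, 1000)])
```

If half the work is stuck in the serial "put it all back together" step, even a thousand threads can't quite double the speed. And this doesn't even charge for the work of finding the threadable parts in the first place.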
There are ways around this, of course. The problem we faced was that as our system became more complex, it became complex in ways that were interdependent; ie the complications weren't making the system MORE parallel. I'm sure that if we spent a few months drilling through Toady's code, perhaps we could find optimizations that would streamline these mechanical issues. But ultimately the result is still the same. The entire game state has to come together in one place before we can move to the next tick. No matter how we try, we can't handle parts of the fortress separately between ticks and keep them on permanently separated threads; always we must reconverge, then reseparate, converge, separate. And the programming overhead of doing so would dig deep into the benefits of multithreading at all.
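The converge, separate, converge pattern looks something like this in code. The regions, the `simulate_region` rule and the "find the hottest tile" merge step are all hypothetical stand-ins; the only point is that every tick ends with a step that needs the whole game state in one place.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_region(region):
    # Hypothetical per-region work: every tile cools by one degree per tick.
    return [t - 1.0 for t in region]

# Three made-up regions of the fortress, two tiles each.
regions = [[100.0, 90.0], [80.0, 70.0], [60.0, 50.0]]
log = []

with ThreadPoolExecutor(max_workers=len(regions)) as pool:
    for tick in range(3):
        # Separate: each region is simulated on its own thread.
        # list(...) waits here until every region has finished.
        regions = list(pool.map(simulate_region, regions))
        # Converge: one thread looks at the whole game state before the next tick.
        hottest = max(max(region) for region in regions)
        log.append((tick, hottest))

print(log)
```

No region can run ahead into the next tick while the others are still working, so the slowest region sets the pace for everybody on every single tick.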
The solution, as kohaku has been saying, does not lie in boosting the processing power at the program's disposal. For every Moore's law there is a software developer striving to overwhelm it (Moore's law is the observation that the number of transistors on a chip doubles roughly every two years, often quoted as every 18 months; it has held up well historically, but of course code is becoming more complicated way faster than that). The solution is really about finding more clever ways to use the processing power we already have, and that's something Toady has already taken into account. There was already an example on this thread about how army battles are handled as individual unit vs individual unit. Well, a whole unit is quite a complicated entity from a memory and code standpoint. Ultimately we hope to be able to streamline battles so that a unit is just a single number in the sea, so that even under the same computing constraints, our programming improves, and we can handle bigger battles. I'm pretty sure this is the kind of stuff that Toady has lying around on the army arc.
tl;dr multithreading could have benefits, but i suspect the processing needed to facilitate them will counterbalance or outweigh any gains. true optimization lies not in improving the amount of power available to your program, but improving the way your program uses the power it has. I think those optimizations are things Toady has already planned.
EDIT:
Its not as bad as "rewriting all the code". If you was sane with mallocs and other manual pointers/address space handling, switching to 64bits is very straightforward and easy - mostly just changing compiler settings and throwing some macros around.
Fair enough, but that still ignores the ripple effect of releasing a new compile. bugs, offset shifts and plain old sh!t happens are bound to occur somewhere, so even if a 64-bit release comes out in a month, the player base will probably take 2-3 months or more to start using such a release. I'd rather the sh!t happened in the process of improving a play mechanic in the game. Besides, it still doesn't solve the other problem I mentioned: switching from one address size to another doesn't instantiate memory. If there was memory waiting to be used, it might unlock it, but the physical limitation of the computer is still there. The benefits of moving to a new address system will be slim when the memory limitation of the user still exists.
Consider my world constraints right now. I top out at around 300yr on 257x257 worlds. As the world grows older, its complexity grows as well - and probably geometrically, if not faster. In other words, if I doubled the age of a world, the memory it occupies would probably do far more than double; my guess is something like the square of the previous figure in GB. Let's suppose I had 4GB of RAM and DF could only use 2GB of it because it's 32bit atm. A 64-bit build would effectively double the RAM DF can use. But if the 300yr world is taking up all the usable RAM (ie 2GB), a 600yr world is probably gonna take up over 4GB. ofc I'm just guessing here as to the memory growth rate of a world, but my point is that being able to address an extra gigabyte or so of RAM is really not going to add much to the history of a world. If a world of x years takes up 3GB, 2x years would probably be more than 9GB, etc etc. 1GB starts looking pretty small.
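For what it's worth, here's that guess as arithmetic. The "memory in GB squares every time the age doubles" rule is purely my guess from above, not a measurement, and `projected_memory_gb` is just a throwaway helper to show how fast it runs away.

```python
def projected_memory_gb(start_gb, doublings):
    """Apply the guessed rule: each doubling of world age squares the GB figure."""
    gb = start_gb
    for _ in range(doublings):
        gb = gb ** 2
    return gb

print(projected_memory_gb(2, 1))  # 300yr world at 2GB -> 600yr world
print(projected_memory_gb(3, 1))  # world at 3GB -> twice the age
print(projected_memory_gb(2, 2))  # 300yr world at 2GB -> 1200yr world
```

Since the guess was "more than" the square, these are lower bounds. Either way, an extra gigabyte of address space buys very little world age under any growth rate like this.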