Bay 12 Games Forum


Author Topic: Dispelling common myths about DF performance and optimization  (Read 14301 times)

Biopass

  • Bay Watcher
  • Human
    • View Profile

I've seen a lot of people still saying that single-core speed is the most important thing in DF performance. Can we start clearing up some of these old myths? They might have been truer back when we were playing 40d16, but our understanding of how DF works, combined with the advent of much more powerful technology, necessitates a discussion on what is fact and fiction in the world of DF optimization.

I'll start off here. Correct me if I'm wrong on anything; this is all coming from my understanding and pooled forum knowledge.

  • Dwarf Fortress does not need multicore support to remain playable. Processor architectures are continually getting better, and while multicore support would lead to a fairly large framerate increase (capped by RAM channels and speed), modern processors can run 5x5 embarks with 200+ dwarves without massive performance decreases.
  • The most important thing about your CPU is its architecture and real-world benchmarks, not necessarily its clock speed. Clock speed is virtually useless for comparing processors across architectures; the only fair way to compare modern processors is to benchmark them on the same standard. The architecture is vital - a 65nm Pentium 4 running at 3.8 GHz pales before a 32nm i5 clocked at 2.8 GHz.
  • Reserving a single core for DF in a multi-core processor is NOT always a good idea. Even reserving 2 cores, one for DF and one for graphics, is pointless. Modern operating systems are excellent at shifting a process from one core to another; this is why you'll generally never see DF maxing out a single core - it's being continually shifted from core to core with very high efficiency. This also aids in heat distribution, which prolongs the life of your processor.
  • RAM latency is key to playing DF at a high framerate. Dwarf Fortress is unusual in this respect: memory latency has little effect in most games, but the sheer amount of data processing Dwarf Fortress does means it makes far more RAM operations than most games do. Each operation takes longer with slow RAM, and that adds up substantially. As a result, fast RAM is vital.
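The latency point above is easy to demonstrate outside DF. Here's a hypothetical Python sketch (nothing DF-specific, just an illustration): it walks the same million-element linked list twice, once laid out sequentially and once shuffled, so most hops in the shuffled walk are cache misses that stall on main-memory latency.

```python
import random
import time

def chase(n, shuffled):
    """Walk a length-n linked list stored as an index array.

    Shuffled order defeats the hardware prefetcher, so each hop
    tends to stall on memory latency; sequential order mostly hides it.
    """
    order = list(range(n))
    if shuffled:
        random.shuffle(order)
    # next_idx[i] holds the successor of slot i, forming one big cycle
    next_idx = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        next_idx[a] = b
    start = time.perf_counter()
    i = 0
    for _ in range(n):
        i = next_idx[i]
    return time.perf_counter() - start

n = 1_000_000
seq = chase(n, shuffled=False)
rnd = chase(n, shuffled=True)
print(f"sequential: {seq:.3f}s  shuffled: {rnd:.3f}s")
```

On most machines the shuffled walk comes out noticeably slower - the same effect slow RAM has on DF's scattered data accesses.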

Questions we still have:
  • What is the effect of dual/triple channel RAM in a DF machine? I'd imagine a substantial speed increase would result, but I honestly don't know.
  • How significant is the speed increase that results from overclocking?
  • What sort of cap does RAM latency put on DF speed? And when will you start to encounter that cap?
  • What architectures are currently the best for DF? And what's a good CPU benchmark for DF?
  • What are the most cost-efficient DF-oriented computer builds out there?
  • Cats?
Logged
500 fps vanilla 4x4 embark. intel master race.

NW_Kohaku

  • Bay Watcher
  • [ETHIC:SCIENCE_FOR_FUN: REQUIRED]
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #1 on: April 20, 2012, 10:04:54 pm »

Oh, hey, one of these threads again...

Well, start with Wikipedia's article on your RAM's CAS latency.

Very fast RAM can be pretty cheap, provided you aren't buying too much of it at once.  You can buy up to 32 or even 64 gigs of RAM, but you're going to be pretty hard-pressed to find an actual use for that much RAM, even among 64-bit programs. 

If you don't already know, spare RAM does nothing.  RAM is just storage capacity, and having excess RAM storage capacity is like paying for a whole warehouse even though you only are storing enough stuff in it to fill a small room.  That doesn't mean that if you're building your computer to last for 8 years or something that you might not find a reason to fill it eventually, but just for the record, RAM tends to get cheaper all the time, and DDR4 is going to come out in about three more years. 

Basically, to compare RAM latency, you have to look up its general clock speed (measured in megahertz, which for those illiterate in tech speak, means "it can do things a million times per second") and then its CAS latency, measured in clocks (which is how many clock cycles the RAM can't do anything waiting for the signal to travel from the CPU to the RAM).  Start with 1,000 nanoseconds, divide by the RAM's speed in megahertz, and then multiply by your CAS latency, and you have the amount of time a theoretical bit of data will take to be fetched out of RAM and into the cache.
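That arithmetic is easy to mechanize - a quick Python sketch of the rule of thumb above. One subtlety worth flagging: for DDR modules the advertised figure (e.g. the "1600" in DDR3-1600) is the transfer rate, and CAS latency is counted against the I/O clock, which is half of that.

```python
def fetch_time_ns(clock_mhz, cas_latency):
    """Real-time memory fetch latency: 1000 ns / clock (MHz) gives
    nanoseconds per clock cycle; multiply by the CAS latency in cycles.
    For DDR sticks, pass the actual I/O clock, which is half the
    advertised transfer rate (DDR3-1600 runs an 800 MHz clock)."""
    return 1000.0 / clock_mhz * cas_latency

# Hypothetical DDR3-1600 module (800 MHz I/O clock) at CL9:
print(fetch_time_ns(800, 9))   # 11.25 ns
```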

I'll quote something I said in another of these threads:

For example, the products on this page are all only 8 gigs of RAM, which isn't all that much nowadays, but they'd have a response time of around 5 nanoseconds on a memory fetch.  Plus they're all under $100. 

Meanwhile, this monster is going to respond at the same rate with 32 gigs of RAM, but has a price around $650 or $700.

And here we have 4 gigs of RAM at a theoretical fetch time of 3 nanoseconds for $140.
Logged
Personally, I like [DF] because after climbing the damned learning cliff, I'm too elitist to consider not liking it.
"And no Frankenstein-esque body part stitching?"
"Not yet"

Improved Farming
Class Warfare

alexandertnt

  • Bay Watcher
  • (map 'list (lambda (post) (+ post awesome)) posts)
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #2 on: April 21, 2012, 05:33:20 am »

I second this post. Especially the third point ("Reserving a single core for DF in a multi-core processor is NOT always a good idea").

I think what we need is some sort of benchmark save, a ready-to-download fortress designed to stress test DF on the computer. This should allow for easier benchmarking across systems.

Also, what about overclocking bridge/memory (however it works with these no-north-bridge processors)?
Logged
This is when I imagine the hilarity which may happen if certain things are glitchy. Such as targeting your own body parts to eat.

You eat your own head
YOU HAVE BEEN STRUCK DOWN!

blue sam3

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #3 on: April 21, 2012, 07:30:16 am »

As far as the core reservation goes, I've noticed (on a bit of an antique processor, it has to be said) that rather than limiting DF to a single core, I get a bigger FPS increase from letting DF run wherever it likes and having one or more cores that everything that isn't DF (or at least, those things that suck up a lot of processor time) is banned from using. That way, DF gets to use whatever is available, and if my antivirus decides it really needs to do a scan RIGHT NOW, without asking me first, DF still has a core that it can run on that isn't being monopolised by the antivirus.
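The "ban everything else from a core" trick is just an affinity mask applied to the other processes. Here's a minimal sketch using Python's Linux-only os.sched_setaffinity (on Windows the equivalent is Task Manager's "Set affinity" or `start /affinity`); the function name and arguments are purely illustrative:

```python
import os

def ban_from_core(pid, reserved_core):
    """Forbid a process from running on reserved_core, leaving that
    core free for DF (Linux-only).  pid 0 means the calling process."""
    allowed = os.sched_getaffinity(pid) - {reserved_core}
    if allowed:  # guard: never strip a process of its last allowed CPU
        os.sched_setaffinity(pid, allowed)

# Ban this process itself from core 0, then show the resulting mask:
ban_from_core(0, reserved_core=0)
print(sorted(os.sched_getaffinity(0)))
```

Applied to the antivirus and other background hogs (rather than to DF itself), this reproduces the setup described above: DF roams freely while everything else stays off the reserved core.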
Logged

alexandertnt

  • Bay Watcher
  • (map 'list (lambda (post) (+ post awesome)) posts)
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #4 on: April 21, 2012, 07:36:43 am »

Quote from: blue sam3
As far as the core reservation goes, I've noticed (on a bit of an antique processor, it has to be said) that rather than limiting DF to a single core, I get a bigger FPS increase from letting DF run wherever it likes and having one or more cores that everything that isn't DF (or at least, those things that suck up a lot of processor time) is banned from using. That way, DF gets to use whatever is available, and if my antivirus decides it really needs to do a scan RIGHT NOW, without asking me first, DF still has a core that it can run on that isn't being monopolised by the antivirus.

Have you tried increasing process priority? I have a feeling increasing that might yield similar results (Might test if I get time).
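For reference, on Unix-likes "priority" is the niceness value, adjustable from within a process via the stdlib (os.nice is Unix-only; Windows uses priority classes via Task Manager or SetPriorityClass). A hedged sketch - note the safer direction is to renice the background tasks *down* rather than push DF up:

```python
import os

def yield_cpu_to_foreground(increment=5):
    """Raise this process's niceness so a foreground program such as
    DF wins CPU contention.  Unprivileged processes can only raise
    their niceness, never lower it back.  Returns the new value."""
    return os.nice(increment)

print(yield_cpu_to_foreground(1))  # the new (higher) niceness
```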
Logged
This is when I imagine the hilarity which may happen if certain things are glitchy. Such as targeting your own body parts to eat.

You eat your own head
YOU HAVE BEEN STRUCK DOWN!

blue sam3

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #5 on: April 21, 2012, 01:46:13 pm »

Quote from: blue sam3
As far as the core reservation goes, I've noticed (on a bit of an antique processor, it has to be said) that rather than limiting DF to a single core, I get a bigger FPS increase from letting DF run wherever it likes and having one or more cores that everything that isn't DF (or at least, those things that suck up a lot of processor time) is banned from using. That way, DF gets to use whatever is available, and if my antivirus decides it really needs to do a scan RIGHT NOW, without asking me first, DF still has a core that it can run on that isn't being monopolised by the antivirus.

Quote from: alexandertnt
Have you tried increasing process priority? I have a feeling increasing that might yield similar results (Might test if I get time).

Makes it a tad too unstable on this heap of junk, unfortunately.
Logged

King Mir

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #6 on: April 21, 2012, 04:13:12 pm »

Since RAM speed is a more limiting factor than CPU speed in DF, overclocking the RAM will likely have a greater benefit than overclocking the CPU.

Biopass

  • Bay Watcher
  • Human
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #7 on: April 21, 2012, 08:50:42 pm »

Quote from: King Mir
Since RAM speed is a more limiting factor than CPU speed in DF, overclocking the RAM will likely have a greater benefit than overclocking the CPU.
Latency, not frequency, my good sir. Overclocking your RAM won't do anything to your latency. You just have to buy really good RAM.
Logged
500 fps vanilla 4x4 embark. intel master race.

kaenneth

  • Bay Watcher
  • Catching fish
    • View Profile
    • Terrible Web Site
Re: Dispelling common myths about DF performance and optimization
« Reply #8 on: April 22, 2012, 01:53:48 am »

Or, as I mentioned in another thread, a CPU with bigger caches reduces the time the CPU spends waiting on RAM.

(WTB 2 gigs of on-die cache...)
Logged
Quote from: Karnewarrior
Jeeze. Any time I want to be sigged I may as well just post in this thread.
Quote from: Darvi
That is an application of trigonometry that never occurred to me.
Quote from: PTTG??
I'm getting cake.
Don't tell anyone that you can see their shadows. If they hear you telling anyone, if you let them know that you know of them, they will get you.

Miuramir

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #9 on: April 23, 2012, 01:33:36 pm »

Try this for a start:
Even well-researched absolute statements about DF performance are still wrong for some situations.  Most direct statements about DF performance are wrong for a lot of situations.  (Including this one, of course.) 

Optimizing DF is not fundamentally different from optimizing any other heavy-duty program; it seems to have a considerable similarity to ordinary real-world tasks such as running large matrix operations in Matlab, or to automated circuit-routing optimization on complex circuit boards.  However, many (if not most) reviews and comparisons you'll find on the Internet are made against workloads with considerably different characteristics, such as 3D-heavy games, or stream-heavy video transcoding. 

At any given combination of gross system resources, competing resource demands, net system resources, and target program state there will be one element that is the limiting factor (sometimes referred to as the "long pole", as in when packing a tent).  Improvements to other factors without changing the long pole will have reduced, if any, effect.  Improving the long pole factor will have far more direct effect.  After improvements, it may *still* be the long pole, in which case the situation remains similar.  However, at some point, the long pole will have been made enough shorter that something else is now the long pole; further improvements to the original long pole will drop off in increased effectiveness, possibly quite sharply. 

In the most general terms, optimization comes from getting as much of the actively-manipulated problem space onto the fastest available handlers as practical.  These are usually in clusters, and would typically look sort of like the following, in decreasing order of performance:
* actual registers, which on some designs may have multiple internal levels
* On-chip cache, usually dedicated to a core and frequently with layers such as L1 and L2
* On-die cache, in modern designs likely to be L3 and shared between cores
* Main RAM, usually these days operating at a unified speed
* "Fast" swap if present (still far slower than RAM), typically an expensive, fast, small hard drive (either high RPM spinning-media or a SSD), but sometimes RAM connected over a slow bus
* Main storage, typically a spinning-media hard drive
* Slow storage, such as network-mounted hard drives, hard drives limited by an external interface such as USB, etc.
* Archival storage, such as tape drives, swappable hard drives, etc.

Even a small DF embark is too large to fit into L3/on-die cache; so the fastest storage that can handle everything it needs to do when it needs to handle large-area calculations is main RAM.  If your computer is not capable of loading DF's program code, the data block of your chosen embark, your operating system overhead, and anything else you choose to run at the same time into main RAM, the insufficient quantity of main RAM is almost certainly your long pole.  Even a fast hard drive is more than an order of magnitude slower than most RAM, and more commonly two orders of magnitude slower or worse.  Other than keeping an eye on memory usage, one other statistic you can use to get a handle on this problem is page faults; you want these to be low / few.  "An average hard disk has an average rotational latency of 3ms, a seek-time of 5ms, and a transfer-time of 0.05ms/page. So the total time for paging comes in near 8ms (8 000 000 ns). If the memory access time is 200ns, then the page fault would make the operation about 40,000 times slower."
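The "about 40,000 times slower" figure in that quote checks out if you just multiply it through:

```python
# Reproducing the paging arithmetic quoted above:
rotational = 3e-3     # 3 ms average rotational latency
seek       = 5e-3     # 5 ms average seek time
transfer   = 0.05e-3  # 0.05 ms transfer time per page
page_fault = rotational + seek + transfer  # 8.05 ms total
ram_access = 200e-9   # 200 ns memory access time

print(page_fault / ram_access)  # ~40,250x slower
```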

TL;DR: If you don't have enough memory (RAM) to run your embark, the most significant thing you can do is fix that.  Run fewer other programs, use a smaller embark, or buy more RAM.

Once you have *enough* memory, however, there is little gain from adding more.  A little bit of headroom may offer some moderate improvement, as the OS has to do less work to manage assignments and garbage collection, and you're less likely to have fragmentation problems; in modern systems I'd guess somewhere between 10% and 25%.  Beyond that, it simply doesn't matter.  You then need to look at a bunch of far more complex factors; the biggest are likely to be memory channels, memory bus speed, memory CAS latency, CPU architecture (including L1, L2, and L3 cache), and CPU clock speed.  In modern designs, several of these are also affected by traditionally "external" factors, such as cooling; most CPUs these days have adaptive performance based on real-time thermal monitoring, and a common problem in laptops and cheap desktops is real-world performance well under theoretical maximums due to insufficient cooling. 

Obviously, at some level, having both a wider pipe and a more responsive pipe is good.  Unfortunately, in the real world, these parameters are somewhat mutually exclusive.  One of the more interesting places to try Science! would be to try to characterize DF's memory access behavior.  Real-time vs. clock-related CAS latency, how important the burst transfer versus line transfer rate is, and other elements can sometimes be difficult to reduce to easy-to-explain numbers. 
Logged

King Mir

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #10 on: May 27, 2012, 11:18:01 am »

Quote from: King Mir
Since RAM speed is a more limiting factor than CPU speed in DF, overclocking the RAM will likely have a greater benefit than overclocking the CPU.
Quote from: Biopass
Latency, not frequency, my good sir. Overclocking your RAM won't do anything to your latency. You just have to buy really good RAM.
You want both.

Thief^

  • Bay Watcher
  • Official crazy person
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #11 on: May 28, 2012, 05:52:54 am »

RAM latency is measured in clock cycles, so overclocking your RAM will also improve latency. Not by as much as just buying lower-latency RAM, however.
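In numbers, for a hypothetical DDR3 stick (real latency = CL × 1000 / clock-MHz, in nanoseconds):

```python
stock  = 9 * 1000 / 800  # DDR3-1600 CL9 (800 MHz clock): 11.25 ns
ocd    = 9 * 1000 / 933  # overclocked to ~1866, still CL9: ~9.65 ns
low_cl = 7 * 1000 / 800  # same stock clock but bought as CL7: 8.75 ns
print(stock, ocd, low_cl)
```

So the overclock does shave real latency, but at these example numbers the lower-CL stick still wins.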
Logged
Dwarven blood types are not A, B, AB, O but Ale, Wine, Beer, Rum, Whisky and so forth.
It's not an embark so much as seven dwarves having a simultaneous strange mood and going off to build an artifact fortress that menaces with spikes of awesome and hanging rings of death.

calrogman

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #12 on: May 28, 2012, 06:45:19 am »

Quote from: NW_Kohaku
If you don't already know, spare RAM does nothing.  RAM is just storage capacity, and having excess RAM storage capacity is like paying for a whole warehouse even though you only are storing enough stuff in it to fill a small room.
Which is why, under modern operating systems, the free RAM in the system is constantly being used to cache commonly used files.
This is most obvious when you run free on a Linux system (and maybe Mac, I'm not sure how complete the command line environment is in OS X).

Here's some sample output.  As you can see, only 332 MB of the 3032 MB of RAM are completely unused, while only about 725 MB are actually being used by programs.  145 MB are being used by the kernel for buffers and 1828 MB for caching files.  The buffers and cache can be reclaimed for use by programs at any time.
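(The sample output itself didn't survive the copy; reconstructed from the figures described above, the `free -m` listing would have looked roughly like this:)

```
$ free -m
             total       used       free     shared    buffers     cached
Mem:          3032       2700        332          0        145       1828
-/+ buffers/cache:        727       2305
Swap:         2047          0       2047
```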

In this way, "spare" RAM is always being used for something and will improve system performance, though with the gradual decline in rapidly rotating magnetic platter media, the advantages are becoming less noticeable.
Logged

vertinox

  • Bay Watcher
    • View Profile
    • My Let's Play Dwarf Fortress (tutorial) Videos
Re: Dispelling common myths about DF performance and optimization
« Reply #13 on: May 29, 2012, 01:02:05 pm »

Believe it or not, I can run Dwarf Fortress with a 50-dwarf population on an EEE netbook PC at about 40 fps.

Draco18s

  • Bay Watcher
    • View Profile
Re: Dispelling common myths about DF performance and optimization
« Reply #14 on: May 30, 2012, 09:00:51 am »

Quote from: NW_Kohaku
If you don't already know, spare RAM does nothing.  RAM is just storage capacity, and having excess RAM storage capacity is like paying for a whole warehouse even though you only are storing enough stuff in it to fill a small room.

It's more complicated than that: if you have "exactly as much RAM as you need", the computer spends a lot of time shuffling allocations around so that each program has a contiguous block.  And as each program's appetite for space changes over time, your RAM quickly becomes fragmented, applications slow down, and the computer spends still more time trying to shuffle things around.

You do need extra space, but you don't need a lot of it.  50% more than you use is generally plenty.
Logged