The order of paragraphs is somewhat random and they're poorly edited, but I finally have some time and energy now.
Hm, this quantile normalisation would be an answer to a problem with weighting entrance exams that I saw a few years ago (there were 5 tests with 50 questions each, everyone took 2, then they multiplied the results, divided them by the means for the tests you took and multiplied that by 2500, which gave the final score).
I do not think that quantile normalisation is a good way of sorting INSIDE a role (as it loses information and just gives the order), but if you want to compare roles to each other (weighting them by priority, so that soldiers are more important than farmers), then yes, I think this would work to a degree, but it has its anomalies, especially for low population. If there is a big gap in skill (say 1+ really good at it and 1+ really bad), that gap won't show in the results. This is why I would prefer normalising everyone's score in a category to 0 - 100, where 100 is the current best and 0 is literally 0 (with no one being this bad). That way I would see a gap (if any) and make a more informed decision.
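To show what I mean by the gap disappearing, here is a rough Python sketch. The skill values are made up, and rank_percent is only a simplified rank-based stand-in for quantile normalisation (it ignores ties), not the formula actually used in the tool.

```python
def scale_to_best(scores):
    """Scale raw scores to 0-100: 100 is the current best, 0 is a literal 0."""
    best = max(scores)
    if best <= 0:
        return [0.0 for _ in scores]
    return [100.0 * s / best for s in scores]


def rank_percent(scores):
    """Rank-only score: position in sorted order mapped to 0-100.

    Simplified stand-in for a rank-based normalisation; ties are not handled.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    out = [0.0] * n
    for pos, i in enumerate(order):
        out[i] = 100.0 * pos / (n - 1) if n > 1 else 100.0
    return out


raw = [2, 3, 4, 18, 20]      # hypothetical skills: three weak, two strong
print(scale_to_best(raw))    # the jump between 4 and 18 stays visible
print(rank_percent(raw))     # evenly spaced, the jump is gone
```

The first list keeps the distance between the weak group and the strong group, the second one only keeps the order.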
Quantile normalisation might work as part of aggregating skills, attributes and traits into a final score, I guess.
Another consideration is that quantile normalisation seems made for comparing a small number of data sets. You have so many that a rapid change of values within one (such as a squad dying) won't really show in the means for each rank.
For things like traits / preferences, I think you'll just need a vector of arbitrary weights for each role, then do a dot product and factor that into the overall rating. It's far from ideal, but it should kinda work. In other words treat each trait as a 0 or 1 (or if you prefer 0 or 100%), then multiply them by arbitrary weights (defined for each role, most of them 0, they can be negative for undesired traits), sum up the results and add that to the final score with a smallish weight.
If you want, you can even sort of scale this "dot product of traits and weights" to 0-100% by taking -sum{abs(weight)} as 0% and sum{abs(weight)} as 100%. Once you have that, just aggregate it with the skills and attributes into the final score.
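A minimal Python sketch of that trait score. The trait names and weights are invented for illustration, and returning 0.5 for a role with no trait weights is my own assumption.

```python
def trait_score(traits, weights):
    """Dot product of 0/1 trait flags and per-role weights, rescaled to 0..1.

    -sum(|w|) maps to 0 and +sum(|w|) maps to 1.
    traits:  {trait_name: bool}   (hypothetical names)
    weights: {trait_name: float}  (arbitrary, may be negative)
    """
    dot = sum(weights.get(name, 0.0) for name, present in traits.items() if present)
    span = sum(abs(w) for w in weights.values())
    if span == 0:
        return 0.5  # assumption: no trait weights means a neutral score
    return (dot + span) / (2 * span)


role_weights = {"brave": 2.0, "lazy": -1.0, "patient": 1.0}
dwarf_traits = {"brave": True, "lazy": True, "greedy": True}
print(trait_score(dwarf_traits, role_weights))  # (2 - 1 + 4) / 8 = 0.625
```

Since traits are 0 or 1, the exact 0% and 100% ends are only reachable when all weights of a role have the same sign, so most Dwarves will land somewhere in the middle.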
if a dwarf has a preference, then it's % value = 1/(# of preferences in role - # of preferences that don't exist in population) % for each preference he has. This can be used as a simple additive/subtractive value from the role calculation used for ((attributes * weight) + (skills * weight) + (traits * weight)) / (sum of weights) [...]
Ah, you've figured out something like that already.
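For reference, the weighted aggregate from the quote could look roughly like this in Python. The component scores and weights are placeholders, and I assume each component is already scaled to 0..1.

```python
def role_rating(parts):
    """Weighted mean of (score, weight) pairs.

    Mirrors ((attributes * weight) + (skills * weight) + (traits * weight))
    / (sum of weights); each score is assumed to be in 0..1 already.
    """
    total_weight = sum(w for _, w in parts)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in parts) / total_weight


# hypothetical values: attributes, skills, traits
print(role_rating([(0.5, 2), (1.0, 1), (0.25, 1)]))  # (1.0 + 1.0 + 0.25) / 4 = 0.5625
```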
Max-min conversion = (x - min)/(max-min)
For attributes, it would be easy to scale them to the current highest, or to the maximum highest for the species (min{5k, 2*max_starting_attribute}) or just 5k. Then just give everyone an "attribute rank" of att/max, which would always be bigger than 0 (except odd cases in mods).
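Both conversions side by side as a small Python sketch. The function and parameter names are mine, and the attribute values in the example are made up.

```python
def min_max(x, lo, hi):
    """Max-min conversion: (x - min) / (max - min)."""
    return (x - lo) / (hi - lo) if hi != lo else 0.0


def attribute_rank(att, max_starting_attribute=None, hard_cap=5000):
    """att / max, where max is min(5000, 2 * max_starting_attribute) or just 5000."""
    if max_starting_attribute is None:
        cap = hard_cap
    else:
        cap = min(hard_cap, 2 * max_starting_attribute)
    return att / cap


print(min_max(1500, 1000, 2000))                            # 0.5
print(attribute_rank(1500, max_starting_attribute=2000))    # 1500 / 4000 = 0.375
print(attribute_rank(1000))                                 # 1000 / 5000 = 0.2
```

The max-min version always produces a 0 and a 1 somewhere in the population, while att/max only reaches 0 for an attribute that is actually 0.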
My method is better because if the population is oscillating around maximum (say 1200-3000 range), they will all get high scores (66 - 100%),
which will not hide information when aggregating with other criteria and result in a more accurate final score and I don't think it can be improved upon.
To me it seems that you fiddle with the numbers until you get something you like, without giving deeper thought to how or why. Looking at the last 10 or so posts, it is obvious to me that the results are mangled and I shouldn't trust them. This makes me much less eager to suggest anything. I agree with everything that tussock said about it.
Another reason I don't contribute more is that while the problem is certainly interesting and I learned some maths, optimisation and operational research during my studies, I don't have the energy nor the time to work on it. It took me 10 days to even answer. But then I admit that skewed or not, a working labour optimiser would be a useful thing. Especially for newbies or 150+ Dwarves where you don't really care anymore.
But a high Agility skill 15 miner will easily beat a high Strength skill 25 miner to the job and also mine faster than them until their endurance and persistence comes into play and various ones head off for a drink before finishing.
I've conclusively proven through tests that speed (factoring in agility, strength and the SPEED token) improves the speed of working at workshops. My conclusion is that (almost) all actions take a certain amount of "turns" and speed lowers the delay between those turns.
So speed should be a factor unto itself for any role and improve all roles equally.
Like, if my best dwarf is only 37% of the estimated maximum combination of numbers, that could be handy to know.
Yes, this. Also, if I have someone at 235% I don't really care about it being higher than 100%, just that this guy is really good.
http://en.wikipedia.org/wiki/Udarnik (That was a joke.)
Normalising everything, it just seems like you're taking what little accurate data you do have and hiding it under more layers of ... I struggle to find a kind word.
Yep. Just keep it simple and scale by (value / max_value) for every number for everyone. Aggregating this will be more accurate than the complicated thing you're trying to do. Sure, you might like having a 100% and a 0% everywhere, but I'd find it more informative to know that all my guys kinda suck at this job, so I can pick the least useless ones, or that I have a group of good candidates, then a gap, then poor ones.
Your typical response to criticism is "but we're working hard and our formulae work, how dare you say they have no merit". Well, my answer is that is not how I would go about solving this problem, but you're free to try. Yelling at you would only confuse you and make me look like an asshole.
Communication with you is also difficult. For example, your graphs and streams of pre-processed numbers aren't adequately described and I often can't even guess what I'm looking at. A graph of 1-0 on the y axis and what I assume to be the spreadsheet row number on the x axis tells me almost nothing. Well, if those are ordered, then ecdf is at least monotonic, while the orange line looks like random noise. But this is really not the level of information I'd like to derive from a graph, and I can't really have an informed discussion without understanding what I'm looking at and what you're doing. At the very least add a 3rd line that's basically (value / highest_value_everywhere).
I do admire your dedication, though. I have trouble working on one project for a week.
Finally, too much maths can be a bad thing. It introduces more work and more possible errors where a simpler method would suffice.
Almost each and every question I've asked on stats.stackexchange I have failed miserably to communicate properly, but I have found my own answers.
The problem is how ecdf returns a %. It works fine when the data is somewhat distinct uniformly. [Distinct] Meaning more or less a majority of distinct comparable values. When you have a set of data that are all the same, or a majority are the same, and if these values are 0 or if these values are either low in the distribution or high in the distribution, this can have an effect on the %.
Yep, this is pretty much my argument why this skews the results as compared to simple (value / max).
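A quick Python sketch of the skew on tied data. I'm assuming a mid-rank style ecdf here (ties count as half), which may not be the exact variant used in the tool, and the skill values are invented.

```python
def midrank_ecdf(values):
    """Share of the population below each value, counting ties as half.

    Assumption: a mid-rank ecdf, not necessarily the variant discussed above.
    """
    n = len(values)
    return [
        (sum(v < x for v in values) + 0.5 * sum(v == x for v in values)) / n
        for x in values
    ]


def value_over_max(values):
    """Plain (value / max); everything is 0 when the best value is 0."""
    best = max(values)
    return [v / best if best > 0 else 0.0 for v in values]


nobody_skilled = [0, 0, 0, 0]
one_novice = [0, 0, 0, 1]
print(midrank_ecdf(nobody_skilled), value_over_max(nobody_skilled))
print(midrank_ecdf(one_novice), value_over_max(one_novice))
```

With the ecdf, four unskilled Dwarves all sit at 0.5, and a single skill point moves one of them to 0.875 and drops the rest to 0.375. With (value / max) the unskilled ones stay at 0 in both cases.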
Rank() in excel returns the ordinal position of a value from a set of values in a list. If there is a tie, it returns the earliest position, and skips the next position as it has been taken up by the tied value. So... if 2 values are ranked at rank 3, the next rank displayed is rank 5. Another example is if two values are ranked at rank 36, the next rank would be 38; if 3 values were tied at rank 36, the next rank position reported would be rank 39.
Well then, learn some real math software or a scripting language and stop using Excel. Matlab / Scilab / Octave should work pretty well.
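The tie handling described in the quote (ties share the earliest position, the following positions are skipped) takes only a few lines in a scripting language. A Python sketch, with a function name of my own choosing and made-up values:

```python
def competition_rank(values, descending=True):
    """Tied values share the earliest position; the next positions are skipped."""
    ordered = sorted(values, reverse=descending)
    first_position = {}
    for pos, v in enumerate(ordered, start=1):
        first_position.setdefault(v, pos)  # keep only the earliest position
    return [first_position[v] for v in values]


print(competition_rank([50, 40, 30, 30, 20]))  # [1, 2, 3, 3, 5]
```

Two values tied at rank 3 are followed by rank 5, as in the quoted example.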
It also means the labor optimizer will treat a [starting embark] population with no shearer skills (a skill only role at the time of this writing) as a 50% drawn value vs 0%. It's basically saying, this person is neither bad, nor good at this job compared to the rest of the population (as in they are all tied). This is an important distinction in the behavior of the labor optimizer, as before no skill meant 50%. However, as soon as a dwarf starts to improve in that skill, you'll notice a 100% value and a ~<50% value for the rest. This means during labor optimization, those who are considered truly bad at a job compared to the rest of the population will be scored lower than these neutral values. Which means the labor optimizer will assign neutral jobs before bad jobs. It also means when looking at the screen 50% = good, and your labor optimizer shouldn't be overexhausted to assign values below 50% (as in trying to assign too many labors).
I find this highly undesirable behaviour and it is not how my formula worked at all. For a skill, it was supposed to give 100% only for level 20, then lower (but never quite reaching 0) values as the skill and its learning rate drop. Then 0 for 0 skill and 0 learning rate.
Telling me everybody has 50% to begin with, then differentiating the values (but always giving me a 0% and a 100% from that point on) is highly confusing. I'd much rather see 5-20% fit on unskilled Dwarves, then the values eventually increasing as they skill up. I mean if a role has a 12.4%, an 8.7% and 2 times 5% (4 Dwarves total for simplicity), that gives me a reasonably good idea of the situation, when I compare to another role. For example I might have someone really talented with spears to the point that it's worth it to make another squad. If all military roles are 0 for current lowest and 100 for current highest, I won't notice it.
Hm, I guess one idea would be to display a score based on (value / max), but highlight them green / red based on how good a Dwarf is compared to other Dwarves within the same role, so that in my earlier example the 12.4% guy gets a green and 2 times 5% get a red.
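A small Python sketch of that highlighting idea. The rule itself (best in role gets green, worst gets red, everyone else stays uncoloured) is just one possible choice and my own assumption.

```python
def highlight(percent_scores):
    """Colour each (value / max) score by its standing within the role.

    Assumed rule: best in role -> "green", worst -> "red", others -> "none".
    If everyone is tied, nobody gets a colour.
    """
    hi, lo = max(percent_scores), min(percent_scores)
    colours = []
    for p in percent_scores:
        if hi == lo:
            colours.append("none")
        elif p == hi:
            colours.append("green")
        elif p == lo:
            colours.append("red")
        else:
            colours.append("none")
    return colours


role = [12.4, 8.7, 5.0, 5.0]  # the four Dwarves from my earlier example
for score, colour in zip(role, highlight(role)):
    print(f"{score:5.1f}%  {colour}")
```

The displayed number stays comparable across roles, and the colour carries the "compared to the others in this role" part.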
For a labour optimiser, some roles should be more important than others (either by listing them in order or giving them weights). There should be a cap on how many roles a Dwarf can have (or a max counter with roles having different scores). There should also be a counter for how many Dwarves are needed for a role... and a lot of other variables, but from what I see, you just display the suggestions, not autocommit them to DF.
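To make those three variables concrete, here is a very naive greedy sketch in Python. It is not how the existing optimiser works as far as I know; all names, scores and counts are made up, and a real solution would need more than a greedy pass.

```python
def suggest_labours(scores, role_priority, needed, max_roles_per_dwarf):
    """Greedy suggestion: fill roles in priority order with the best-scoring
    Dwarves that still have a free role slot.

    scores:        {role: {dwarf: score}}
    role_priority: list of roles, most important first
    needed:        {role: number of Dwarves wanted}
    """
    load = {}        # how many roles each Dwarf already has
    suggestion = {}
    for role in role_priority:
        ranked = sorted(scores[role].items(), key=lambda kv: kv[1], reverse=True)
        picked = []
        for dwarf, _ in ranked:
            if len(picked) == needed.get(role, 0):
                break
            if load.get(dwarf, 0) < max_roles_per_dwarf:
                picked.append(dwarf)
                load[dwarf] = load.get(dwarf, 0) + 1
        suggestion[role] = picked
    return suggestion


scores = {
    "soldier": {"Urist": 0.9, "Bomrek": 0.6, "Kadol": 0.3},
    "farmer": {"Urist": 0.8, "Bomrek": 0.2, "Kadol": 0.5},
}
print(suggest_labours(scores, ["soldier", "farmer"], {"soldier": 1, "farmer": 2}, 1))
```

With a cap of one role per Dwarf, Urist goes to the higher priority soldier role even though he is also the best farmer, and the farming slots go to Kadol and Bomrek.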