  1. They are very bad.

    They are the kind of folks who will go in and modify a finely tuned system with an “obvious” fix, and screw everything up horribly, with the absolutely best intentions.

    To give you some idea, another excellent engineer I know, Jeffrey Hsu, was working at ClickArray (now known as Array Networks), and got me hired in there because he needed another “big gun” type to get some really performance critical work done.

    And we got it done.

    On 1.3GHz Pentium 4 systems (it was 2001).

    We got a reverse proxy cache up to something over 38,700 TCP connections a second — which doesn’t sound impressive until you realize that, on a BSD or Linux based system, there are only 16,384 usable ports for INADDR_ANY unless you substantially modify the TCP/IP stack.

    We had not modified the stack — it was a BSD stack — so it meant we were also responding to initial page load requests in that same time frame.
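
    To make the arithmetic behind that explicit (a back-of-the-envelope figure, on the assumption that each connection ties up one of those ports for its whole lifetime, including teardown):

$$
\frac{16{,}384\ \text{ports}}{38{,}700\ \text{connections/second}} \approx 0.42\ \text{seconds per port}
$$

    That 0.42 seconds has to cover binding the port, serving the request, closing the connection, and freeing the port for reuse, which is why an unmodified stack meant the responses were going out in that same window.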

    And then we were on to the next problem.

    About half a month into the next problem, apparently someone finally made time to do some performance testing on the changes they had been making to the rest of the code in the cache.

    We got to watch the panic around the office build for a couple of days, asked several times what the problem was, and were told not to worry about it and to keep working on what we were working on.

    We did, because we were fixing the finite state machine to be an actual finite state machine, and restructuring the cache code at the time. It’s one of the things that got the company their next round of funding, in fact.

    Finally, their hotshots called us in to see if we could fix their problem.

    They were getting around 6,300 connections per second.

    They had lost about a factor of six in performance, and they couldn’t figure out where.

    It took us a couple of hours, and we finally tracked it down to a “make optimization for multiple CPU machines” commit.

    We undid it, and we were back to about 35,000 connections per second, retuned a couple of hot code paths, and by the end of the day we were back up to the old numbers.

    To understand the “optimization” that the “hot shot programmer” thought he was making, you have to understand that threading was not really much of a thing at the time.

    Even had it been, we were still working through the state machine, which had to be done before the global state could possibly have been moved into a single state object to make it per thread and keep instances from interfering with each other.

    So the scaling model was to have multiple “work to do” engines as processes, and then a “gatekeeper” process that would let a process go to work on a request as it came in. It saw all requests, and then “let the process go”; if there were more requests, the kqueue in the gatekeeper would “let another process go”, and so on.
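
    Roughly, that gatekeeper has the shape sketched below. This is a hypothetical reconstruction rather than the ClickArray code: the worker count, the one-byte wake-up pipes, and names like let_a_process_go() are assumptions made purely for illustration; only the overall shape (a kqueue sees the pending requests and releases one idle process per request) comes from the description above.

```c
/* Sketch of a pre-forked "gatekeeper" dispatcher using BSD kqueue.
 * Hypothetical reconstruction for illustration; error handling omitted. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

#define NWORKERS 8

static int idle_fds[NWORKERS];  /* wake-up pipe fds of workers with no work to do */
static int nidle;               /* how many workers are currently idle */

/* Called when a worker reports back that it has finished a request. */
void mark_idle(int wake_fd)
{
    idle_fds[nidle++] = wake_fd;
}

/* "Let a process go": pick an idle worker and write one byte to its
 * wake-up pipe; the worker then accept()s and services the request. */
static void let_a_process_go(void)
{
    if (nidle == 0)
        return;                        /* everyone is busy; the request waits */
    int wake_fd = idle_fds[--nidle];   /* which worker to pick is the LIFO/FIFO
                                          question discussed further down */
    char go = 1;
    (void)write(wake_fd, &go, 1);
}

/* Gatekeeper main loop: watch the listening socket with kqueue and
 * release one idle worker per pending connection.  Worker "done" pipes
 * would be registered with the same kqueue and routed to mark_idle();
 * that bookkeeping is omitted here. */
void gatekeeper(int listen_fd)
{
    int kq = kqueue();
    struct kevent ev;

    EV_SET(&ev, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
    kevent(kq, &ev, 1, NULL, 0, NULL);

    for (;;) {
        struct kevent got;
        if (kevent(kq, NULL, 0, &got, 1, NULL) <= 0)
            continue;
        /* For a listening socket, got.data is the number of pending
         * connections in the backlog. */
        for (int i = 0; i < (int)got.data; i++)
            let_a_process_go();
    }
}
```

    Under this model, the interesting policy decision is which idle worker to release when a request comes in, and that is where the LIFO comes in below.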

    Our “hot shot” had noticed that one process was doing almost all the request serving, while the other processes were (mostly) idle.

    This is because I had intentionally used a LIFO instead of a FIFO for the processes waiting for “work to do”.

    So he “fixed” it by changing it to a FIFO.

    And the performance went to hell.

    The reason I had used a LIFO in the first place, even though I knew it would not necessarily share the load out between the cores, was that I knew some things the “hot shot” did not know.

    I knew that:

    1. As a network appliance, we were almost always going to be I/O, rather than CPU, bound, unless we were loading in a module to do something like whitespace removal or content rewriting for advertising purposes.
    2. More to the point, the last process to get in line was going to have all of its pages in core, while the one which had been sitting idle the longest probably would not.
    3. Further, there would be TLB cache collisions, since all user space processes were mapped at the same address range, resulting in flushes.
    4. Which meant there would have to be cache reloads, which minimally meant going out to at least L1, and probably L2, cache.
    5. Combined, this would result in additional instruction pipeline stalls, because there was no L3 cache on the P4 architecture.
    6. And so I intentionally traded what I knew was probably going to be idle CPU cycles on the additional cores for what I knew would be the best I/O bound performance.

    So I had intentionally chosen a LIFO order, and verified a performance improvement, using a technique I’d first used in 1994 or so, called (by me, since I had invented the technique) “hot engine scheduling”.
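
    The whole difference between the two behaviors is the selection policy over the gatekeeper’s idle list. A minimal sketch of the two policies, again with hypothetical names, plugging into the let_a_process_go() sketch above rather than the real code:

```c
/* "Hot engine scheduling" (LIFO) vs. "fair" scheduling (FIFO), as a
 * selection policy over the idle-worker list.  Illustration only. */

/* LIFO: pop the worker that went idle most recently.  Its pages are
 * still in core and its TLB and cache entries are still warm, so on an
 * I/O-bound appliance it can service the request with the fewest
 * page-ins and pipeline stalls. */
int pick_worker_lifo(int *idle_fds, int *nidle)
{
    return idle_fds[--*nidle];
}

/* FIFO: take the worker that has been idle the longest.  This spreads
 * CPU time evenly across processes and cores, which looks better in a
 * CPU-usage readout, but the chosen process has likely had pages evicted
 * and its TLB and cache lines replaced, so every dispatch pays the
 * reload cost before it does any useful work. */
int pick_worker_fifo(int *idle_fds, int *nidle)
{
    int fd = idle_fds[0];
    for (int i = 1; i < *nidle; i++)   /* shift the rest of the queue down */
        idle_fds[i - 1] = idle_fds[i];
    (*nidle)--;
    return fd;
}
```

    Under the LIFO pick, light load means the same one or two workers do nearly all of the serving, which is exactly the “one process doing almost everything” symptom that looked like a bug to someone who did not know the reasons above.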

    So this “hot shot” had destroyed the performance with his change.

    And then, after making the alleged “optimization”, he failed to run a performance test to verify that it was in fact an optimization.

    The only thing he’d checked was that, under load, the processes were all racking up about the same total CPU usage over time, which he thought meant he was going to get better multicore utilization.

    The absolutely worst software engineers are the ones who are good enough to be dangerous, but not good enough to recognize when they are making a bad decision.

    They are made even worse when they don’t go back afterwards and measure the results of their changes against some impartial ruler, to find out whether the decision was in fact a bad one.
