My HPC journey

From 1979 to 1986 I studied mathematics and a little bit of computer science at the Technical University of Brunswick/Germany.

Math was always *my* topic: I like its eternal truths; to me it is clear how it "works", and you do not have to memorize a lot. Since I am basically a lazy guy, math was the obvious choice when selecting my discipline. In addition, one had to choose a subsidiary subject: I decided on computer science (or informatics, as it is still called in Germany) since I thought it would be a good complement to the very theoretical, so-called pure mathematics.

After my pre-diploma in 1981 I had to choose an area to focus on: I decided on applied mathematics and numerics, again because I wanted to concentrate on the more practical aspects of math, and also to increase my chances of getting a job after my studies.

It was around 1982 or 1983 when I attended a colloquium on the use of vector computers at Prakla-Seismos, a seismic exploration company based in Hannover/Germany. It was given by Werner Butscher. He explained why supercomputers (which, at that time, were vector computers, in that case a Cyber 205) are necessary to cope with the demanding problems in seismology. This was my "moment of enlightenment". Werner and I later became colleagues (at KSR – do you remember that one?) and friends: I still consider him my "fatherly advisor".

I attended lectures on numerics on vector computers, which were quite revolutionary at that time, and wrote my diploma thesis on the vectorization of a multigrid code.

It was a lucky coincidence that in the mid-Eighties the Suprenum project was started in Germany. Its goal was to (co-)develop a German parallel supercomputer, based on Motorola microprocessors and Weitek vector chips. One rationale was to become independent from the US manufacturers of super/vector computers. From the beginning it was planned not only to build the hardware, but also to develop software for the fundamentally new system in parallel, so to speak. My professor (Manfred Feilmeier) had just founded a company (for financial and insurance mathematics) in Munich and managed to get a Suprenum sub-project for the development of a parallel linear algebra library. I got the offer to join the company and to work on this project. It did not take me long to accept this offer. (The alternatives were a job at the KFA Juelich, the nuclear research institution in Germany (now called FZJ, "Forschungszentrum Juelich"), which had one of the first Cray machines in Germany, and Volkswagen (I was born in Wolfsburg, where the headquarters of VW are located).)

So, in 1986, after I had finished my studies, I moved to Munich to work on SLAP, the Suprenum Linear Algebra Package, together with Dr. Wolfgang Roensch, a former research assistant who had also accepted my professor's job offer.

So, from May 1986 to the summer of 1990, we explored the newly developed parallel algorithms for Gaussian elimination and the like and implemented them using SuSi, the Suprenum Simulator, since the hardware was not ready at that time (co-development!). One fundamental finding was that the data distribution is essential for the efficiency of the algorithms. (Suprenum was a distributed memory computer with up to 256 nodes.) We came in contact with Christian Bischof from Argonne, a native German who did research on exactly that topic (parallel linear algebra). I met Chris again much later, after he had moved back to Germany and become professor and computing center director at Aachen university.
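To illustrate why the data distribution mattered so much, here is a toy sketch of my own (in Python, not actual SLAP code): in Gaussian elimination, columns are eliminated left to right, so with a plain block distribution the processes owning the leftmost columns run out of work early, while a cyclic distribution keeps all processes busy until the end.

```python
# Toy illustration of column-to-process mapping on a distributed memory
# machine. The function names are mine, purely for this sketch.

def owner_block(col: int, ncols: int, nprocs: int) -> int:
    """Block distribution: contiguous chunks of columns per process."""
    block = -(-ncols // nprocs)   # ceiling division gives the block size
    return col // block

def owner_cyclic(col: int, nprocs: int) -> int:
    """Cyclic distribution: columns dealt out round-robin."""
    return col % nprocs

# After eliminating the first 4 of 8 columns on 4 processes, the columns
# still holding work are 4..7:
print([owner_block(c, 8, 4) for c in range(4, 8)])   # [2, 2, 3, 3]
print([owner_cyclic(c, 4) for c in range(4, 8)])     # [0, 1, 2, 3]
```

With the block distribution, only processes 2 and 3 still have work in the second half of the elimination; the cyclic distribution keeps all four busy.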

A problem in the whole Suprenum project (besides issues with hardware stability: Krupp-Atlas was the associated h/w company; they were strong in military electronics but had no real experience/expertise in high-end computer technology) was that the development of the Fortran compiler was outsourced to an American company called Compass that was not able to deliver its product on time. (Because Suprenum was a distributed memory architecture, the programming paradigm was message passing. Long before PVM and MPI, the decision was made to develop a Fortran compiler supporting language extensions like SEND and RECEIVE instead of using a library.)

At the end of the project, and after spending about 180 million Deutschmarks, a big reference machine (full-blown with 256 nodes and ? GB of memory) was installed at the GMD (Gesellschaft für Mathematik und Datenverarbeitung, the federal German research institute for mathematics and computer science, and the project lead). As far as I remember, 2 or 3 other, much smaller systems were installed at other project partners. I think one machine with 16 or 32 nodes went to Erlangen university.

Overall, Suprenum was a mixed success: the architecture was visionary, the co-development approach also, but there were a lot of hardware problems, i.e. the system was terribly unstable, and, probably most important: the system was (very) late to the market. In the meantime, parallel systems like the iPSC from Intel (also very unstable) and the first Thinking Machines (does anybody know what happened to Daniel Hillis?) had appeared, and the microprocessor revolution had started, with RISC being the dominant CPU architecture (Sun SPARC [SuSi (see above) ran on a Sun-1 workstation], IBM RIOS (which later became Power), MIPS, and HP's PA-RISC being the major contenders).

So, commercially Suprenum was a disaster, BUT, as I always said: it was a big funding program for the US-American computer industry, as many co-workers started working at big and small companies like Intel, TM, Cray (you name them), where they could use the experience they had gained in the Suprenum project. I myself joined nCUBE, one of the real pioneers in parallel computing. (That was after a short intermezzo at my first employer Feilmeier: after the end of Suprenum, for some months I was to investigate the potential of building a consulting business on HPC. I also looked into the potential of neural networks, but the time was not ripe for this. So, the conclusion of my investigation: there was little chance to establish a profitable HPC consulting business. But I am still thankful to my old boss that he gave me the opportunity to work that independently. I am afraid this would be impossible nowadays, given the "quarterly thinking" in most economies.)

So, I looked for a new job/employer and was lucky again: nCUBE had just opened their European operations in Munich. I applied for a job as systems engineer, and guess what: I got it. I remember my boss Peter Wuesten, who was really a funny guy. He came from Siemens, the German counterpart of GE, which also built and sold computers (all sorts, from PCs to super/vector computers from Fujitsu, at that time). I remember the following anecdote: we were in final negotiations with a prospect when Peter announced in a team meeting that he had made an offer that the potential customer could not resist. He said he had offered a discount (10% off the list price). Later it came out that the contract said the customer would have to pay 10% of the list price. You notice the missing "f". Sorry, Peter, for disclosing this secret.

IMHO, nCUBE had one of the best system designs of that time. The most successful model, the nCUBE-2, consisted of up to 1024 nodes arranged in a hypercube topology, hence the company name. The node CPU was a custom-designed chip developed by the founder, Steve Colley. He had been with Intel before, where he could not finalize the design of "his" CPU, the iAPX 432 processor. So, he decided to found his own company where he could use his knowledge to design a complete parallel system. (On another note: nCUBE was a company privately held by Larry Ellison of Oracle. He had the idea of using highly parallel machines to run databases. I think nCUBE was never profitable and could not have survived that long without Larry's deep pockets.)

The nCUBE CPU ran at 20 MHz(!), had a VAX-like instruction set and an on-board communication engine (similar to the Transputer), offering 10 channels to build a 10-dimensional hypercube, hence the maximum of 1024 nodes. It ran a micro-kernel, also a very advanced approach at that time. A single node was a PCB a little larger than a credit card (I still have one in my drawer) with the CPU and up to 4 MB(!) of DRAM memory. Up to 64 nodes were arranged on a large board which was put into a very nice black cabinet (designed by Hartmut Esslinger of Frog Design). The system was sometimes called "Darth Vader" because of its resemblance to the "Dark Side". Later, nodes with up to 16 MB of memory (wow!) became available, but those were double-sized, hence limiting the maximum system size to 512 nodes.
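The arithmetic behind the hypercube topology is elegant, and a small sketch (my own illustration, nothing nCUBE-specific) shows it: node IDs are d-bit numbers, and two nodes are directly connected exactly when their IDs differ in one bit, so d channels per node give 2^d nodes.

```python
# Sketch: in a d-dimensional hypercube, node IDs are d-bit numbers, and
# each node is directly linked to the d nodes whose ID differs from its
# own in exactly one bit (flipping bit b is an XOR with 1 << b).
def hypercube_neighbors(node: int, dim: int) -> list[int]:
    """Return the IDs of the direct neighbors of `node` in a dim-cube."""
    return [node ^ (1 << bit) for bit in range(dim)]

# With 10 channels per node (dim = 10) you get 2**10 = 1024 nodes,
# and any two nodes are at most 10 hops apart.
print(hypercube_neighbors(0, 3))        # [1, 2, 4]
print(len(hypercube_neighbors(5, 10)))  # 10
```

This is also why the 16 MB double-sized nodes capped the machine at 512 nodes: halving the node count removes one dimension of the cube.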

I joined nCUBE in November 1990. In retrospect, I have to say that the following 2 years were the period I learned the most during my professional career. As I said: I was a systems engineer (1 of 2 in the beginning); the whole Munich office consisted of the general manager Peter Wuesten, the technical (and my direct) manager Horst Gietl (who also came from Siemens and is now one of the organizers of ISC), a secretary… pardon me: office manager, a sales rep, and us 2 systems engineers. We did everything: benchmarks, pre- and post-sales support, even hardware installations. One other person I have to mention in this context is Matthew Hall. He was delegated to Munich from the headquarters in California to support us "newbies" with his technical expertise. Being a Brit and a theoretical physicist, Matthew is one of the smartest people I have ever met. He was great at benchmarking and software optimization. (One piece of advice: try to work with smart people, if possible smarter than you. Matthew was definitely one of those.)

Another anecdote: the whole office team usually went to lunch together. One day, we did not go to the nearby mall as usual but to a typical Bavarian restaurant, where Matthew ordered a "Schweinsbraten" or a "Schweinshaxe", so a big piece of pork with dumplings and a dark brown heavy sauce, no vegetables. He also had a beer (or two) with his lunch. After returning to the office he almost fell asleep. Later, he complained that he had gained a lot of pounds during his time in Munich.

Well, but we were pretty successful: during my 2 years with nCUBE we sold about 10 systems, among others one to Prof. Hertwig of IPP Garching (who had also bought the first Cray-1 in Germany, without a working OS). Likely the biggest success was that we sold a system to BMW. Imagine: parallel processing was still in its infancy, and the makers of the Ultimate Driving Machines decided on nCUBE to run the CFD code FIRE on our system.
We won this deal against Intel with their Paragon. This would not have been possible without Matthew.

Well, you might ask me: if you were so happy and successful, why did you leave nCUBE? The answer is: I thought there might be something even better, namely an even more promising architecture: distributed shared memory. (In addition, I also had some problems with my direct manager.) It so happened that the nCUBE sales rep Wolfgang Carl had already accepted an offer from KSR: does anybody remember Kendall Square Research? He asked me if I would join him to establish the German subsidiary of KSR. Again, it did not take me very long to accept the offer. Again, I was part of a small team: the General Manager (Wolfgang Carl), an additional salesman (Werner Butscher in Hannover, see above), a secretary… pardon me: office manager, a hardware engineer (for system installations and hardware maintenance), and me, a software engineer, responsible for benchmarking and pre-sales support. And well, the story started very promisingly: within one year we sold about 10 systems. We had to close a distribution contract with Siemens (again!) because our small team could not handle all the projects.

The biggest deal was LRZ: they got a relatively small KSR-1 with 32(?) nodes, with the option to buy a very large KSR-2 with 256(?) nodes. While there were minor issues with the first test system, the real problems started with the system that was meant to become the new production machine at LRZ. The system never achieved acceptance, since, for example, even my very experienced colleague responsible for hardware, Peter Immerz, "burnt" several boards when putting them into the cabinet. The problem was the connectors to the backplane. In addition, there were problems (of course) with the software, especially the operating system, which was based on a Mach micro-kernel. I remember working 42 hours in a row at KSR HQ in Waltham/MA to fulfill the benchmarks for acceptance, but the system kept crashing again and again.

In addition, there were major management faults: the upper management, the so-called "Gang of Four", became greedy: in order to get the stock price up, they applied "innovative/creative revenue recognition". E.g. I remember a deal in the US where a KSR-1 had been sold to a customer, and then the salesman promised to take the old machine back and to install a newer KSR-2 for just a few bucks. In the books, however, the new machine was registered at almost the full list price. Through this kind of behaviour of upper management, the whole company lost the faith of the community and of potential customers. As a consequence, in Germany, for example, we could not sell any system in the second year, after having sold about 10 in the first. So, that was the beginning of the end: there was no money to solve the hardware and software issues, and the company filed for Chapter 11 in the summer of 1994. In Germany, all employees but me were fired. I had to stay another 3 months or so, until I had to turn off the light.

So, at the end of October I was unemployed but not unprepared. I had looked for job opportunities, among others at Convex, the maker of so-called Crayettes. Do you remember their slogan: one quarter of the performance of a Cray for one tenth of the price? They had just started shipping their new SPP machines (for Scalable Parallel Processor), another distributed shared memory architecture, pretty similar to KSR. I applied for a job as software engineer, and guess what: I got the job, after just 1 month of being unemployed. I started fresh, but soon I had to realize that the SPP was almost as unstable as the KSR. Although we sold a few systems (among others to Erlangen university, which was/is the "competitor" of LRZ in Bavaria), the SPP was not a commercial success. We burnt money because the older vector machines did not sell well anymore (the Gallium-Arsenide-based Convex-4 never really got ready; the Convex-3 was selling okay but was no big hit either). So, it was a big relief when HP announced that they would buy Convex for 150M$. The main reason was that HP was looking for an extension of their product portfolio with a real parallel machine: the biggest HP machine was the 4-CPU K-class at that time. And the Convex SPP was based on HP's PA-RISC CPU, in contrast to KSR, which had a custom CPU (running at 25 MHz and 40 MHz in the KSR-1 and KSR-2, respectively). (Btw: this was another reason for the failure of Kendall Square Research: its performance could not keep up with that of the RISC CPUs; e.g. about 8 KSR CPUs were necessary to compete performance-wise with 1 Power CPU.)

Back to Convex and HP: I remember a morning in the Munich office of Convex, hearing a "YIPPEE" out of the room of our branch manager, Lothar Wieland aka "LoWie". He had just read the announcement that HP would buy Convex. I am sure that those 150M$ were the best investment HP has ever made. (Just think of the billions of dollars paid for the acquisitions of Compaq and EDS, not to mention Autonomy.) (Well, I do not know how much HP paid for Apollo, the workstation pioneer.)

Being part of HP, Convex was able to further develop their SPP architecture. The first machines after the acquisition were still running SPP OS, which kept on crashing. Later, the S-class came out (we managed to sell an S-class to Mercedes 😉), which was finally replaced by the V-class running HP-UX. This was the breakthrough, since the system finally became stable (enough). It was a huge effort, though, to port HP-UX to the SPP architecture. This architecture was a so-called hyperplex consisting of hypernodes of 8 or 16 CPUs connected by a crossbar. These hypernodes were connected in a ring topology. Over this ring, an SCI-based protocol was run (SCI: Scalable Coherent Interface). (In contrast, the KSR architecture implemented a ring of rings. As far as I remember, the protocol was also SCI-based. The whole architecture was called COMA, for Cache-Only Memory Architecture. The memories of the single CPUs were considered caches, with more or less latency depending on where the memory is located: local to the CPU, in the same ring, or in a remote ring. In other words: NUMA.) (Oh man: hypernode, hyperplex, COMA: marketing people were much more inventive in those days than nowadays.)

After having lost deal after deal, especially against SGI with their Origin (there was a predecessor SGI product line whose name I do not remember), Convex/HP was finally able to compete. The next breakthrough was the HP SuperDome (nice wording again), which was still based on the original SPP architecture but featured the PA-8000 CPU and ran fully-fledged HP-UX. With this machine we were able to replace the dozens of SGI Origins at BMW. Yes, again BMW played an important role in my professional career. The basis was the HP "dream team" consisting of the HP sales rep responsible for BMW, Peter Schumacher, and me, the pre-sales guy. We sold about 10 SuperDomes to BMW, mainly to run their crash simulations (ESI Pamcrash). Later, we replaced some of the SuperDomes by smaller multiprocessors (N/L-class), which were bus-based systems. With this mix we were able to run BMW's technical workload consisting of Pamcrash, various CFD codes (Fluent, ExaFlow (for their Formula 1 team), and still FIRE (Flow In Reciprocating Engines)), and MSC Nastran for NVH (noise, vibration, harshness) and the like. (Btw: HP also sold many SuperDomes for running Oracle at BMW, even more than those for the technical workload. So, again, Larry Ellison played a certain role in my life…)

But then the next wave in HPC appeared on the horizon: Beowulf clusters. When we were still selling SuperDomes to BMW I was warning internally: we should get prepared for this new paradigm. But, as usual in large companies, my voice was not heard, for a long time. When BMW asked HP to make an offer for an HPC cluster, we had nothing: no product, just the components: pizza boxes, storage devices, network gear, but no integrated offering. So, we offered and finally delivered a first batch of equipment to the BMW computing center. It was a disaster: everything came in single boxes, no preparation or integration. It took very long until we had the first small demo cluster running at BMW. For the next procurement we were a little bit wiser: we planned the integration at the HP factory in Herrenberg near Stuttgart. But we still had no cluster *product*, hence we built what I call a "Turkish cluster": a collection of Intel-based workstations put into racks, together with some Ethernet switches, hidden behind covers.

But then something totally unexpected happened: I was working with the guys from MSC (they were our project partners and had their own Linux distribution for building a Beowulf cluster) in the factory when a colleague came around and asked us if we knew what was just happening in NYC: it was 9/11. We watched the TV transmission on a notebook. I could not believe what I saw: the towers of the World Trade Center collapsing. We were all totally shocked. The person most affected was David Lombard of MSC, the father of the Linux distribution mentioned. He said he had relatives or good friends working in the World Trade Center. We could hardly finish our work for the day. On my way back to the hotel, in my car, I said to myself: Bush junior will probably send his bombers, and this might be the beginning of WWIII. It did not come to that.

Back to business: we finished integrating our "Turkish cluster" and shipped it to BMW. The customer was relatively satisfied with what we did, and we got some more orders for compute clusters. But the competition was stiff: there were companies on the market that were able to deliver fully integrated solutions; e.g. I remember Siemens/Fujitsu (again!) with their hpcLine. And even HP finally realized the business potential: they came out with the CP product line, an integrated HPC cluster solution with various options, e.g. different CPUs (Intel/AMD), storage devices, networking (Ethernet/Myrinet/InfiniBand), complemented by software like CMU (Cluster Management Utility, developed at HP Grenoble/France) and third-party products like Platform LSF. (As Thomas Sterling said in one of his keynotes at ISC: HP is the leader in following.) Later came Lustre and its HP version called SFS (Scalable File Share), which was a Lustre distribution with some additional tools around it to make the management easier. But as so often: the revenue coming from SFS was not high enough, so HP pulled the plug on SFS after just one year.

This company had lost its direction when Carly Fiorina became CEO. The total focus was on shareholder value, the employees were disregarded (in contrast to what is written in the book "The HP Way" by the company founders), and the majority of the workforce got (very) frustrated by several waves of WFR (work force reduction) and no more salary increases, while Carly was talking about a journey. (The joke of the year was that Carly gave every employee 10 stock options for free; later, when they could be exercised, they were worthless.) Then came Mark Hurd, who was hired to further reduce costs: e.g. more and more people were put into ever less office space, and investments in new technologies and products (like SFS) were cut. At that time, I was very unhappy at HP, although they had saved Convex.
In 2010, I finally accepted an offer to sign an agreement to terminate my working contract. To be fair: HP was very generous. I got a (very) good compensation, and the opportunity to work in a "transfer company" where I could look for, and was trained to get, a new job while being paid 80% of my previous salary for one year.

However, after just two months I had a new job: Business Development Manager Germany at T-Platforms, the Russian HPC company, which had just opened their operation in Hannover. (This was the kind of job I was looking for: a somewhat business-related job in which I could use my technical knowledge as well as my connections in the community. It was also a natural development in my professional career: from a technical background to a more sales-oriented engagement.) Things looked very promising again: the company seemed to have a competitive product. Their lighthouse account was Lomonosov university in Moscow, where they had installed a very big machine with several thousands of CPUs, which was huge at that time. The peak performance was several petaflops, which gave a top ten(?) ranking in the TOP500 list. And they claimed they had new products in the pipeline to be shipped in about a year, e.g. one of the first water-cooled machines on the market. (The Lomonosov system was air-cooled, producing a very high noise level in the computer room: one had to wear ear protection.) So, I was very busy using the very comprehensive network in the HPC community, especially in Germany, that I had gathered over the years. We participated in various procurements, mainly at the high end. But it turned out that our offering was always way too expensive, usually 30% above that of the competitors. So, we lost deal after deal. Also, it came out that the plans for new systems were all just paperware, years away from possibly being produced. The most promising-looking machine was the T-Racks system. The system design was based on racks, not single servers that are put into racks. That would save a lot of sheet metal and allow a very compact system. (You understand the play on words, T-Racks sounds like T-Rex? Again: great marketing, but as I said: just paperware.)

The only real deal I could close (well: almost) was QPACE2, a special-purpose system for QCD. This was a joint development project by FZJ and the universities of Regensburg and Wuppertal. The goal was to develop the successor of the QPACE system, which was one of the HPC systems with the best power efficiency. The whole system was a custom design, one of the strengths of T-Platforms, I thought, and pretty similar to T-Racks: completely water-cooled, with GPUs and all the bells and whistles. One more time, I met very smart people in this project: Tilo Wettig of Regensburg university and Dirk Pleiter of FZJ. They were/are real experts in their field (quantum chromodynamics), but also very knowledgeable about computer systems. We were in the final system design phase, but then came the blacklisting:

T-Platforms was accused by the US government of having delivered systems to evil states and/or nuclear, i.e. atomic weapons, research institutes in Russia. And that was of course the end: since T-Platforms could no longer buy components like CPUs from the US companies Intel and AMD, the basis was gone. I felt compelled to inform Dirk and Tilo that T-Platforms would not be able to finish the project work, although I was not allowed to do so. (I was in a similar situation with LRZ and KSR in 1994: although I was not allowed by the company to tell the customer the whole truth, I felt obliged to do so. My slogan is: "the customer comes first". Btw: the Convex slogan was similar; in every office there was a flag showing the question "What have you done for your customer today?") And T-Platforms shut down its German and European operations.

In the end, I was not unhappy about this development: the company had cheated in several ways. When being hired, I was told that I would build a complete team in Germany, with pre- and post-sales support, additional sales reps, and so on. The only person who was hired on my recommendation was a former colleague from HP: he was a bid manager. When he, too, was fired, I felt guilty. But he obviously got a good new job at a company that has nothing to do with HPC. Another person I have to mention in this context is Gilbert Lauber: he became a friend, and had tried to hire me before, when I was at HP and he at Quadrics (if I remember correctly). At T-Platforms he was the European sales manager; he hired me after I ran into his arms on the exhibition floor of ISC 2010. He defended me when my problems with the company culture of T-Platforms became obvious: the salaries and expense reimbursements came later and later, resulting in a heated incident at T-Platforms' German Xmas party in January 2013. I said to the worldwide sales manager that delaying payments would have consequences. The lady said, "Are you threatening me?", with a very angry face.
Later I learned from Gilbert that the lady had asked to have me fired, but Gilbert gave me a last respite. We negotiated about new goals, which I could not accept, however. But then came, as already mentioned, the blacklisting, and everything was superfluous. In summary, the lesson I learned from my time at T-Platforms is that it is hard to work for a company coming from a different "culture". E.g. when the company was already in financial trouble, the founder and CEO Vsevolod "Seva" Opanasenko was still residing in five-star-plus hotels when traveling all around Europe, whereas his engineers were allowed to spend 40 or 50 Euros max per day for a hotel, which is hard to find in cities like Munich. Fyi: the price for a room per night at the "Vier Jahreszeiten" (Seva's favorite hotel in Munich) is 400€ and up.

So, in June of 2013 I was unemployed again… and really pissed. I said to myself: next time, be more careful when choosing an employer.

Then a new period of my life started: because of several private and health problems I could not apply for a full-time job. I tried to work part-time as a freelancer, but my private business never took off. I am still very interested in HPC and associated topics (AI, ML/DL, quantum computing…): I was and still am attending SC, ISC, and smaller events like PASC (a very good conference!) and the semi-annual German ZKI AK SC meetings, and I have tried to keep in contact with former customers, partners, etc.

Then, in November of last year, I signed a contract with a company offering a private cloud solution (based on OpenStack, Kubernetes, Ceph, OpenShift). I was working part-time, using my comprehensive network, as a business development consultant. This was a perfect solution (in principle) for me: working just 1-2 days a week was what I was able to do given my health problems, and, probably most important: I was finally working in the HPC software business. – Didn't I tell you: although I always considered myself a software guy (s/w, i.e. applications, is what really counts; it is closer to the problems than h/w), I have worked for hardware companies up to now (except in my very first job at Feilmeier). But soon after I started my new job I realized: my new boss (his name is not worth mentioning) is an a**hole (pardon me!). I do not want to go into details.

And then came Corona: on April 1st (not an April Fools' joke) I got my termination notice, effective April 15th. Although I made several proposals for a compromise, we did not come to a mutual agreement.

So, it looks as if I am at the end of my professional career.

Nevertheless, I am still very interested in HPC and related topics. I will keep on following new technology (and business) developments.

Let me summarize:

  • HPC is *my* topic.
  • The slogan of an SC show about ten years ago was: HPC matters. This slogan is valid more than ever in these times: e.g. Tabor Research recently published a list of projects where HPC plays a role in the fight against Corona.
  • I would not want to have missed the opportunities HPC gave me: working with some very smart people (I just realized that three of the best HPC system architects I had the honor to meet have the same first name: Steve Colley (nCUBE), Steve Frank (KSR), Steve Wallach (Convex)), getting insight into various fields of application (weather forecasting, climate research, industrial product design (CFD, crash simulation and its little brother stamping, FEM), gene sequencing, QCD, just to name a few), and, last but not least: I have seen some places around the world (mainly in the US, attending the SC conferences and company meetings, but also in some European countries and in Asia), meeting people from similar and different cultures.

—Henry Strauss, in May 2020