Massive communication failure on Milleotto
No fewer than four pass-thru modules have failed since yesterday, affecting many parallel jobs
As announced earlier, there has been a recurring problem with failing pass-thru modules in the blade centres of Milleotto. When one module fails, parallel communication to 14 nodes stops, which usually hangs or kills the parallel jobs using those nodes. The cause appears to be a cooling problem, and once a module has overheated, it is permanently damaged. The failure rate has now escalated, with four modules shutting down within about a day. One of them has been replaced and spares for the rest will arrive tomorrow, but much of the damage to parallel jobs has already been done.
Manufacturing of a redesigned module that is expected to solve the problem permanently has started, and a batch is on its way here to replace all our modules. It is a great pity that it did not arrive before this massive blackout.
We are sorry for the inconvenience and can only offer the hope that this type of problem will soon be gone.

