
Parallel Problems on Milleotto

Pass-thru modules keep failing, killing parallel jobs, but a solution is in sight.

Since the summer, pass-thru modules in the blade centres of Milleotto have been failing repeatedly. When a module fails, parallel communication to 14 nodes stops, which usually hangs or kills any parallel job using those nodes. The cause appears to be a cooling problem: once a module has overheated, it is permanently damaged. The current practice at Lunarc is to check frequently for faulty modules, remove the affected nodes from the queue, restart hanging jobs, and then replace the faulty module.

As a result of this problem, the modules have been redesigned and a new version is going into production. The first units are expected to come off the production line and be shipped to us in about 6 weeks. When they arrive, we will schedule a stop to replace all the old modules, and we hope the system will be more stable after that.

 

We are sorry for the inconvenience and ask for your patience.
