Channel: HECC User Support (Updates on Issues)
Browsing all 9 articles

Intermittent Job Failures Due to InfiniBand Issues

Problem: Since the March 21 dedicated time, Pleiades users have reported new MPI/InfiniBand errors.
Status: Resolved
Actions: 05.09.12 - The subnet_timeout parameter for ib0 was reverted to 18....

Slow Response to the "ls -l" Command

Problem: Users have experienced a slow response to the Unix command ls -l, often waiting for minutes for the command to complete.
Status: Resolved
Actions: Updated 02.03.12 - As of December 14th, 2011,...

Jobs Alternate Between Running and Queued States

Problem: Users experience unexplained job behavior, where a job alternates between Running and Queued states.
Status: Resolved
If you are still experiencing issues with this problem, contact the NAS...

MPI Program Fails or Hangs

Problem: MPI program fails or hangs due to network communication problems.
Status: Resolved
Actions: Updated 02.06.12 - Our systems staff continue to replace bad or unreliable cables as they are...

Files Fail to Open

Problem: Users experience errors opening or inquiring about existing files using Intel Fortran on Lustre filesystems.
Status: Resolved
If you are still experiencing issues related to this problem,...

Backlog in the Pleiades PBS Queues

Problem: Pleiades users have experienced longer wait times for PBS jobs due to heavy loads on the queues served by pbspl1.
Status: Under Investigation
Actions: Updated 04.02.12 - On March 21st, 24...

MPT Startup Failures

Problem: Some PBS jobs are terminated due to mpiexec startup errors.
Status: Under Investigation
Workaround: A wrapper script called several_tries is available in the /u/scicon/tools/bin directory. The...

InfiniBand QP Errors

Problem: Some PBS jobs were terminated due to InfiniBand-related queue pair (QP) errors.
Status: Resolved
Mellanox, the vendor for the InfiniBand cards in each Pleiades node, has developed a...

Problem with /nobackupnfs2 filesystem server may cause data loss or corruption

Problem: Due to a server problem, any data sent to the /nobackupnfs2 filesystem during the period before the server hangs or reboots (at least 30 minutes, and up to several hours) is held in memory and not written...
