Intermittent Job Failures Due to InfiniBand Issues
Problem: Since the March 21 dedicated time, Pleiades users have reported new MPI/InfiniBand errors. Status: Resolved Actions: 05.09.12 - The subnet_timeout parameter for ib0 was reverted to 18....
Slow Response to the "ls -l" Command
Problem: Users have experienced a slow response to the Unix command ls -l, often waiting for minutes for the command to complete. Status: Resolved Actions Updated 02.03.12 - As of December 14th, 2011,...
Jobs Alternate Between Running and Queued States
Problem: Users experience unexplained job behavior, where a job alternates between Running and Queued states. Status: Resolved If you are still experiencing issues with this problem, contact the NAS...
MPI Program Fails or Hangs
Problem: MPI program fails or hangs due to network communication problems. Status: Resolved Actions Updated 02.06.12 - Our systems staff continue to replace bad or unreliable cables as they are...
Files Fail to Open
Problem: Users experience errors opening or inquiring about existing files using Intel Fortran on Lustre filesystems. Status: Resolved If you are still experiencing issues related to this problem,...
Backlog in the Pleiades PBS Queues
Problem: Pleiades users have experienced longer wait times for PBS jobs due to heavy loads on the queues served by pbspl1. Status: Under Investigation Actions Updated 04.02.12 - On March 21st, 24...
MPT Startup Failures
Problem: Some PBS jobs are terminated due to mpiexec startup errors. Status: Under Investigation Workaround: A wrapper script called several_tries is available in the /u/scicon/tools/bin directory. The...
InfiniBand QP Errors
Problem: Some PBS jobs were terminated due to InfiniBand-related queue pair (QP) errors. Status: Resolved Mellanox, the vendor for the InfiniBand cards that are in each Pleiades node, has developed a...
Problem with /nobackupnfs2 filesystem server may cause data loss or corruption
Problem: Due to a server problem, any data sent to the /nobackupnfs2 filesystem within at least 30 minutes (up to several hours) before the server hangs/reboots is being held in memory and not written...