NTRS - NASA Technical Reports Server

The SGI/Cray T3E: Experiences and Insights

The NASA Goddard Space Flight Center is home to the fifth most powerful supercomputer in the world, a 1024-processor SGI/Cray T3E-600. The original 512-processor system was placed at Goddard in March 1997 as part of a cooperative agreement between the High Performance Computing and Communications Program's Earth and Space Sciences Project (ESS) and SGI/Cray Research. The goal of this system is to facilitate achievement of the Project milestones of 10, 50 and 100 GFLOPS sustained performance on selected Earth and space science application codes. The additional 512 processors were purchased in March 1998 by the NASA Earth Science Enterprise for the NASA Seasonal to Interannual Prediction Project (NSIPP). These two "halves" still operate as a single system, and must satisfy the unique requirements of both aforementioned groups, as well as guest researchers from the Earth, space, microgravity, manned space flight and aeronautics communities. Few large scalable parallel systems are configured for capability computing, so models are hard to find. This unique environment has created a challenging system administration task, and has yielded some insights into the supercomputing needs of the various NASA Enterprises, as well as insights into the strengths and weaknesses of the T3E architecture and software.

The T3E is a distributed memory system in which the processing elements (PEs) are connected by a low-latency, high-bandwidth bidirectional 3-D torus. Because of the focus on high-speed communication between PEs, the T3E requires PEs to be allocated contiguously per job. Further, jobs will only execute on the user-specified number of PEs, and PE timesharing is possible but impractical. With a job mix that varies widely in both size and runtime, the result is PE fragmentation and an inability to achieve near 100% utilization.

SGI/Cray has provided several scheduling and configuration tools to minimize the impact of fragmentation. These tools include PScheD (the political scheduler), GRM (the global resource manager) and NQE (the Network Queuing Environment). Features and impact of these tools will be discussed, as will resulting performance and utilization data. As a distributed memory system, the T3E is designed to be programmed through explicit message passing. Consequently, certain assumptions related to code design are made by the operating system (UNICOS/mk) and its scheduling tools. With the exception of HPF, which does run on the T3E, albeit poorly, alternative programming styles have the potential to impact the T3E in unexpected and undesirable ways. Several examples will be presented (preceded by the disclaimer, "Don't try this at home! Violators will be prosecuted!").
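The fragmentation effect described in the abstract can be illustrated with a small toy model. The sketch below is not the T3E's actual allocator, nor GRM or PScheD. It assumes a hypothetical 16-PE machine laid out as a simple 1-D line (the real interconnect is a 3-D torus) and a first-fit placement rule, and the helper names `first_fit`, `allocate` and `release` are invented for illustration. It shows only how a contiguity requirement can leave a job waiting even though enough PEs are idle in total.

```python
from typing import List, Optional


def first_fit(free: List[bool], size: int) -> Optional[int]:
    """Return the start index of the first run of `size` free PEs, or None."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == size:
            return i - size + 1
    return None


def allocate(free: List[bool], start: int, size: int) -> None:
    """Mark PEs start..start+size-1 as busy."""
    for i in range(start, start + size):
        free[i] = False


def release(free: List[bool], start: int, size: int) -> None:
    """Mark PEs start..start+size-1 as free again."""
    for i in range(start, start + size):
        free[i] = True


# Hypothetical 16-PE machine, modeled as a 1-D line of PEs (an assumption).
free = [True] * 16

# Four 4-PE jobs fill the machine: PEs 0-3, 4-7, 8-11 and 12-15.
starts = []
for _ in range(4):
    start = first_fit(free, 4)
    allocate(free, start, 4)
    starts.append(start)

# The first and third jobs finish, leaving two separate 4-PE holes.
release(free, starts[0], 4)
release(free, starts[2], 4)

idle = sum(free)                 # total number of idle PEs
blocked = first_fit(free, 8)     # an 8-PE job needs 8 contiguous PEs
utilization = 1 - idle / len(free)

print(f"idle PEs: {idle}, 8-PE job start: {blocked}, utilization: {utilization:.0%}")
```

In this toy run eight PEs are idle, yet the 8-PE job cannot start because the idle PEs form two separate 4-PE blocks, so half the machine sits unused until a neighboring job finishes.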
Document ID: 19990107382
Acquisition Source: Goddard Space Flight Center
Document Type: Reprint (Version printed in journal)
Authors: Bernard, Lisa Hamet (NASA Goddard Space Flight Center, Greenbelt, MD, United States)
Date Acquired: August 19, 2013
Publication Date: January 1, 1998
Subject Category: Computer Operations And Hardware
Meeting Information
Meeting: High Performance Computing and Communications Computational Aerosciences
Location: Mountain View, CA
Country: United States
Start Date: August 25, 1998
End Date: August 27, 1998
Sponsors: NASA Ames Research Center
Distribution Limits: Public
Copyright: Other
