Research Computation Facility for GOSAT-2

 

Background

 

The Greenhouse gases Observing SATellite (GOSAT, also known as IBUKI) was launched in January 2009 as the world’s first satellite dedicated to measuring greenhouse gases from space. The large volume of GOSAT observational data has been processed and distributed through the GOSAT Data Handling Facility (GOSAT DHF). To support research and development of the data processing algorithms, the National Institute for Environmental Studies (hereinafter referred to as “NIES”) installed the GOSAT Research Computation Facility (hereinafter referred to as “RCF”) in March 2010.

 

During its six years of operation, RCF processed a total of 58 years' worth of short-wavelength infrared data. Based on the results of this research processing, the processing algorithms for the short-wavelength infrared data were revised, which greatly improved the accuracy of the column-averaged dry-air mole fractions of carbon dioxide and methane.

 

This achievement demonstrated the need for a computing facility dedicated to algorithm research and development, which led to the introduction of the Research Computation Facility for GOSAT-2 (hereinafter referred to as “RCF2”). The main purpose of RCF2 is to support steady research and development of the GOSAT-2 data processing algorithms across the whole GOSAT-2 project, building on the GOSAT data.


RCF was used mainly by researchers at NIES, whereas RCF2 is also available to external researchers involved in the GOSAT-2 project.

 

 

 

 

RCF2 operation

   

RCF2 is operated by the GOSAT-2 project of the NIES Satellite Observation Center. Its operational history is as follows:

 

  March 2016   installed RCF2
  September 2016   started service for users at NIES
  December 2016   started service for users outside NIES

 

Specifications of RCF2

 

 

The specifications of RCF2 are as follows (those of RCF are shown in parentheses):

  • Number of compute nodes: 120 nodes (160 nodes)
  • Total number of cores: 2880 cores (1280 cores)

 

Main components

  • CPU Xeon E5-2650 v4 (Xeon E5530)
  • GPU NVIDIA Pascal (NVIDIA Fermi)
  • DISK DDN SFA 14K (DDN S2A 9900)

 

Theoretical peak performance

  • CPU 101 TFLOPS (12 TFLOPS)
  • GPU 900 TFLOPS (165 TFLOPS)
  • Total  1 PFLOPS (177 TFLOPS)
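
The CPU figures above can be cross-checked with the usual peak-performance formula (nodes × sockets × cores × clock × FLOPs per cycle). The sketch below is only a plausibility check: the dual-socket layout, the base clocks (2.2 GHz and 2.4 GHz), and the double-precision FLOPs per cycle (16 and 4) are assumptions that do not appear in this page, and the function name is hypothetical.

```python
def cpu_peak_tflops(nodes, sockets_per_node, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical CPU peak in TFLOPS: nodes x sockets x cores x GHz x FLOPs/cycle."""
    gflops = nodes * sockets_per_node * cores_per_socket * ghz * flops_per_cycle
    return gflops / 1000.0


# RCF2: 120 nodes. ASSUMED: 2 sockets x 12 cores, 2.2 GHz, 16 DP FLOPs/cycle.
rcf2_cpu = cpu_peak_tflops(120, 2, 12, 2.2, 16)

# RCF: 160 nodes. ASSUMED: 2 sockets x 4 cores, 2.4 GHz, 4 DP FLOPs/cycle.
rcf_cpu = cpu_peak_tflops(160, 2, 4, 2.4, 4)

# Core counts implied by the assumed layout (compare with 2880 and 1280 above).
rcf2_cores = 120 * 2 * 12
rcf_cores = 160 * 2 * 4

print(f"RCF2 CPU peak: {rcf2_cpu:.1f} TFLOPS, cores: {rcf2_cores}")
print(f"RCF  CPU peak: {rcf_cpu:.1f} TFLOPS, cores: {rcf_cores}")
```

Under these assumptions the results round to the listed 101 TFLOPS and 12 TFLOPS, and the core counts match the listed totals.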

 

Energy efficiency 

  • 9796 MFLOPS/W (636 MFLOPS/W)

      * Ranked No. 8 on the Green500 list of energy-efficient supercomputers (as of June 2017).

        https://www.top500.org/green500/lists/2017/06/

 

 

Shared storage capacity

  • Usable capacity 2 PB (0.1 PB)

 

Interconnect performance

  • Bandwidth 100 Gbps (32 Gbps)
  • Standard InfiniBand EDR (InfiniBand QDR)

 

Characteristics of RCF2

 

RCF2 runs EcoManager2, the successor to EcoManager, a function originally developed for RCF. EcoManager was initially designed only to save energy in coordination with simple jobs. Based on experience gained during RCF operation, functions such as timing adjustment for compute node startup were added to EcoManager. EcoManager2 adds further functions: auto-balancing of compute node utilization, auto-confirmation of compute node soundness, and assignment of compute node redundancy.

 

New functions added to EcoManager2 are as follows:

 

  • Auto-balancing of compute node utilization

With EcoManager, compute nodes were statically linked to job queues. As a result, nodes linked to a frequently used job queue accumulated more utilization time and more start/stop cycles than average.

  

In general, the number of failures increases with utilization time and the number of start/stop cycles. EcoManager2 therefore removes the static links between job queues and compute nodes. Instead, it allocates compute nodes to each job dynamically based on past utilization, which balances compute node utilization automatically.
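
The idea of allocating nodes according to past utilization can be illustrated with a minimal sketch. This is not EcoManager2's actual algorithm, which this page does not describe. The function name, the node names, and the "wear" score (utilization hours plus a weighted start/stop count) are all hypothetical.

```python
import heapq


def allocate_nodes(usage_hours, start_counts, n_requested, start_weight=1.0):
    """Return the n_requested nodes with the lowest wear score.

    wear = utilization hours + start_weight x number of start/stop cycles
    (a hypothetical score; ties are broken by node name).
    """
    wear = {
        node: usage_hours[node] + start_weight * start_counts.get(node, 0)
        for node in usage_hours
    }
    return heapq.nsmallest(n_requested, wear, key=lambda node: (wear[node], node))


# Illustrative history for four nodes (made-up values).
usage = {"n01": 500.0, "n02": 120.0, "n03": 300.0, "n04": 120.0}
starts = {"n01": 40, "n02": 10, "n03": 90, "n04": 5}

chosen = allocate_nodes(usage, starts, 2)
print(chosen)  # the two least-worn nodes
```

Because the choice depends only on accumulated history and not on which queue a job came from, heavily used queues no longer wear out a fixed subset of nodes.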

 

  • Auto-confirmation of compute node soundness

When EcoManager started a compute node, failures such as unexecuted jobs and irregular stops occasionally occurred because of startup failures and other problems. In each case, operators isolated the cause of the failure and restarted the node manually. EcoManager2 automatically checks compute node soundness immediately after a compute node starts, which was previously the first step in isolating the cause of a failure manually.
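
A post-start soundness check of this kind might look like the sketch below. The specific checks EcoManager2 performs are not listed on this page, so the check names (`network`, `storage_mount`) and the stub functions are purely hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]


def check_node(node: str, checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every check against a node; return (is_sound, names of failed checks)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check(node)
        except Exception:
            # A check that crashes is treated as a failed check.
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


# Hypothetical stub checks; real ones would probe the node itself.
stub_checks = {
    "network": lambda node: True,
    "storage_mount": lambda node: node != "n07",
}

print(check_node("n01", stub_checks))  # sound node
print(check_node("n07", stub_checks))  # node with a failed storage mount
```

Returning the names of the failed checks gives operators a starting point for isolating the cause, which is the step the automatic check is meant to take over.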

 

  • Assigning compute node redundancy

In RCF, some spare compute nodes were kept as static cold standbys. In RCF2, EcoManager2 implements a dynamic hot standby function that starts more compute nodes than the requested number and assigns those that start normally.
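
One way to read the hot standby behavior is sketched below: start the requested number of nodes plus a few spares, then hand the job the first nodes that come up normally. This is an interpretation, not the documented EcoManager2 logic. The function name, the `start_node` callback, and the spare count are hypothetical.

```python
def start_with_spares(idle_nodes, n_requested, n_spare, start_node):
    """Start n_requested + n_spare nodes and assign the first n_requested that start normally.

    start_node(node) is a hypothetical callback returning True on a normal start.
    Returns (assigned_nodes, remaining_hot_standbys).
    """
    candidates = idle_nodes[: n_requested + n_spare]
    healthy = [node for node in candidates if start_node(node)]
    if len(healthy) < n_requested:
        raise RuntimeError(
            f"only {len(healthy)} of {n_requested} requested nodes started normally"
        )
    return healthy[:n_requested], healthy[n_requested:]


# Illustrative run: node n02 fails to start, so the spare n04 covers for it.
idle = ["n01", "n02", "n03", "n04", "n05"]
assigned, standby = start_with_spares(idle, 3, 1, lambda node: node != "n02")
print(assigned, standby)
```

Compared with a cold standby, the spare is already running when a failure is detected, so the job does not wait for a replacement node to boot.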

 

* The photo on the top banner: Interconnect switch of RCF2

 

 

 

  

Updated: June 20, 2017