High-Throughput Computing as a Service (HTCaaS)
High-Throughput Computing (HTC) consists of running many loosely coupled, independent tasks (no communication is needed between them) that together require a large amount of computing power over a relatively long period of time. Middleware systems such as Condor and BOINC have successfully delivered tremendous computing power by harnessing large numbers of computing resources. However, as the number of jobs and the complexity of scientific applications increase, it becomes a challenge for traditional middleware systems, which typically employ a single type of resource (e.g., clusters of workstations or desktop machines over the Internet), to solve a given scientific problem within a reasonable amount of time. In addition, emerging applications that require millions or even billions of tasks with relatively short per-task execution times have led traditional HTC to expand into Many-Task Computing (MTC).
Therefore, to effectively support complex and demanding scientific applications, it is necessary to harness as many computing resources as possible, including supercomputers, Grids, and even Clouds. However, it is challenging for researchers to effectively utilize resources that are controlled by independent resource providers, especially as the number of jobs that must be submitted at once increases dramatically (as in parameter sweeps or N-body calculations).
We designed and implemented the HTCaaS (High-Throughput Computing as a Service) system, which hides the heterogeneity and complexity of leveraging different computing resources from users and efficiently submits a large number of jobs at once by effectively managing and exploiting all available computing resources.
Our design philosophy is as follows:
- Ease of Use: We minimize the user overhead of handling large numbers of jobs and computing resources.
- Intelligent Resource Selection: HTCaaS automatically selects the more responsive and effective resources and adapts to the current load by dynamically adjusting the resources it has acquired.
- Pluggable Interface to Resources: We adopt GANGA's plugin mechanism to access heterogeneous computing resources without hardcoding.
- Support for Many Client Interfaces: A wide range of client interfaces is supported, including a native WS-interface, a Java API, and client tools (CLI, GUI).
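The pluggable-interface idea can be illustrated with a small registry pattern. This is only a sketch under stated assumptions: it does not reproduce GANGA's plugin API or the HTCaaS code, and the names `BACKENDS`, `register`, `ResourceBackend`, `submit_agents`, and `get_backend` are hypothetical. It shows how a new resource type might be added by registering a class rather than editing the code that uses it.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a backend name to its plugin class.
BACKENDS = {}


def register(name):
    """Class decorator that adds a backend plugin to the registry."""
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap


class ResourceBackend(ABC):
    """Interface that every resource plugin is assumed to implement."""

    @abstractmethod
    def submit_agents(self, count):
        """Describe a request for `count` agents on this resource."""


@register("loadleveler")
class LoadLevelerBackend(ResourceBackend):
    def submit_agents(self, count):
        # A real plugin would talk to the batch scheduler here.
        return f"{count} agent(s) requested via LoadLeveler"


@register("pbs")
class PBSBackend(ResourceBackend):
    def submit_agents(self, count):
        return f"{count} agent(s) requested via PBS"


def get_backend(name):
    """Look up a plugin by name; callers never reference concrete classes."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"no plugin registered for {name!r}") from None


if __name__ == "__main__":
    for name in sorted(BACKENDS):
        print(get_backend(name).submit_agents(4))
```

In this arrangement, supporting another resource type means adding one decorated class, while the calling code that goes through `get_backend` stays unchanged.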
System Architecture & Components
The HTCaaS system consists of five server-side modules (Account Manager, User Data Manager, Job Manager, Agent Manager, and Monitoring Manager) and two client-side tools (a Command-Line Interface and a Graphical User Interface).
A job in our system is the data and associated profile that describe a computation to be performed. Since users may want to submit a large number of jobs through parameter sweeps or N-body calculations, HTCaaS introduces the concept of a Meta-Job, a higher-level job description based on the OGF JSDL standard. Once a Meta-Job is submitted, HTCaaS automatically splits it into many jobs and inserts them into the Job Queue (implemented with ActiveMQ), which is managed by the Job Manager. All required input data and produced results are stored in the User Data Manager. Once jobs have been submitted, agents (implemented in Java) are dispatched by the Agent Manager and process the jobs on supercomputers, Grids, and Clouds. HTCaaS employs agent-based multi-level scheduling and streamlined job dispatching: a first-level request to a batch scheduler (e.g., LoadLeveler on PLSI supercomputers, gLite for Grids, PBS for Amazon EC2) reserves resources by submitting agents as batch jobs, and each agent then proactively pulls tasks from the Job Manager, which implements a lightweight and fast job dispatching mechanism.
Therefore, users of HTCaaS can submit and execute hundreds of thousands of jobs (which can be expressed in a single JSDL script) through an automated process, monitor them effectively, and process the final results. For those who are not familiar with XML-style scripting, we also provide an easy-to-use GUI tool that automatically generates a JSDL script from the user's input so that it can be submitted to our system. The overall steps of job submission and execution in the HTCaaS system are as follows:
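The pull-based second level of this scheduling scheme can be sketched in a few lines. The example below is an illustration under stated assumptions, not the HTCaaS implementation: Python threads stand in for agents that would normally be started as batch jobs, an in-process `queue.Queue` stands in for the ActiveMQ-backed Job Queue, and the function names `agent` and `run` are hypothetical.

```python
import queue
import threading


def agent(agent_id, job_queue, results, lock):
    """Pull jobs until the queue is empty, then exit."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return  # no work left, so the agent releases its resource
        outcome = job * job  # stand-in for running the real computation
        with lock:
            results[job] = (agent_id, outcome)


def run(num_jobs, num_agents):
    """Fill the queue, start the agents, and wait for them to finish."""
    job_queue = queue.Queue()
    for job in range(num_jobs):
        job_queue.put(job)

    results = {}
    lock = threading.Lock()
    agents = [
        threading.Thread(target=agent, args=(i, job_queue, results, lock))
        for i in range(num_agents)
    ]
    for t in agents:
        t.start()
    for t in agents:
        t.join()
    return results


if __name__ == "__main__":
    done = run(num_jobs=1000, num_agents=8)
    print(len(done), "jobs processed")
```

Because each agent asks for the next job only when it is free, faster agents naturally process more jobs, and no central component has to assign individual jobs to individual resources.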
- The user logs in to HTCaaS and uploads input data through the User Data Manager.
- The user submits a Meta-Job (written in JSDL), which can be composed of multiple tasks.
- HTCaaS automatically divides the Meta-Job into multiple tasks based on its specification and inserts them into the Job Queue.
- The Agent Manager dispatches agents based on job requirements and resource availability.
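The step in which a Meta-Job is divided into tasks can be sketched for the parameter-sweep case. This is a minimal illustration under stated assumptions: a plain Python dictionary stands in for the JSDL description, and the function `expand_meta_job`, the executable name `simulate`, and the parameters `seed` and `temperature` are hypothetical examples that do not come from HTCaaS.

```python
from itertools import product


def expand_meta_job(executable, sweep):
    """Yield one task per combination of the swept parameter values.

    `sweep` maps a parameter name to the list of values it should take.
    """
    names = sorted(sweep)  # fixed order so task ids are reproducible
    combos = product(*(sweep[name] for name in names))
    for task_id, values in enumerate(combos):
        yield {
            "id": task_id,
            "executable": executable,
            "args": dict(zip(names, values)),
        }


if __name__ == "__main__":
    # One compact description expands into 3 x 2 = 6 independent tasks.
    meta_job = {"seed": [1, 2, 3], "temperature": [300, 350]}
    for task in expand_meta_job("simulate", meta_job):
        print(task)
```

The number of tasks is the product of the value counts of the swept parameters, which is how a single short description can stand for a very large number of jobs.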
Publications & Technical Presentations
- Jik-Soo Kim, Sangwan Kim, Seokkyoo Kim, Seoyoung Kim, Seungwoo Rho, Ok-Hwan Byeon, and Soonwook Hwang, Towards a Next Generation Distributed Middleware System for Many-Task Computing, To appear at International Journal of Software Engineering and Its Applications, 2013.
- Jik-Soo Kim, Sangwan Kim, Seokkyoo Kim, Seoyoung Kim, Seungwoo Rho, Ok-Hwan Byeon, and Soonwook Hwang, From High-Throughput Computing to Many-Task Computing: Challenges, Systems and Applications, To appear at The 2nd International Conference on Software Technology (SoftTech 2013), April 2013.
- Sehoon Lee, Seokkyoo Kim, Seungwoo Rho and Soonwook Hwang, HTCaaS (HTC as a Service): A Large-scale HTC Problem Solving Environment Using Distributed and Heterogeneous Infrastructures, 2012 International Symposium on Grids and Clouds (ISGC), Feb 2012 (PDF)
- Sangwan Kim, Seoyoung Kim, Seungwoo Rho, Seokkyoo Kim, Jik-Soo Kim and Soonwook Hwang, HTCaaS, a Viable Choice for Efficient and Simplified Large-Scale Scientific Computing, Research Poster at YongPyong International Winter Conference on Particle Physics (YongPyong-2013), February 2013.
- Seungwoo Rho, Seoyoung Kim, Sangwan Kim, Seokkyoo Kim, Jik-Soo Kim and Soonwook Hwang, HTCaaS: A Large-Scale High-Throughput Computing by Leveraging Grids, Supercomputers and Cloud, Research Poster at IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), November 2012 (PDF)
- Jik-Soo Kim, Efficient and Simplified Large-Scale High-Throughput Computing over Grids and Supercomputers, at The First KIISE-KOCSEA HPC SIG Joint Workshop on High Performance and Throughput Computing, November 2012
- Seokkyoo Kim, HTCaaS on PLSI, Korea Supercomputing Conference (KSC) 2012, Oct 2012
- Jik-Soo Kim, HTCaaS: A Large-Scale High-Throughput Computing by Leveraging Grids, Supercomputers and Cloud, at 2012 Korean Institute of Information Scientists and Engineers (KIISE) HPC Special Interest Group Workshop, August 2012