John Brevik, Daniel Nurmi, and Rich Wolski (2006)
Predicting Bounds on Queueing Delay for Batch-scheduled Parallel Machines
In: Principles and Practice of Parallel Programming (PPoPP 2006), ACM.
Most space-sharing resources presently operated by high-performance computing centers employ some sort of batch queueing system to manage resource allocation to multiple users. In this work, we explore a new method for providing end users with predictions of the bounds on the queueing delay that individual jobs will experience when waiting to be scheduled to a machine partition. We evaluate this method using scheduler logs that cover a 9-year period from 7 large HPC centers. Our results show that it is possible to predict delay bounds with specified confidence levels for jobs in different queues, and for jobs requesting different ranges of processor counts.