Apache Spark is quickly being adopted at Facebook and now powers an important portion of Facebook’s batch ETL workload. While Spark is typically more efficient than Hive, we continue to search for opportunities to further reduce hardware costs. Recently, we have started an effort to apply custom resource oversubscription for every unique Spark job. We often observe that system resources (e.g. CPU / memory) tend to be underutilized in Spark jobs.
For example, when the data can not fit in memory, disk spill may dominate overall performance. Since the default configuration allocates one CPU core per Spark task, this can lead to situations where Spark jobs are not using all the CPU resources allocated to them and as a result, the overall cluster CPU utilization can remain low even under peak scheduling conditions. Similarly, for memory, depending on the input data and shuffle sizes, the job might not fully utilize its reserved memory and a significant amount of cluster memory can be wasted. In this session, we will describe the CPU and memory oversubscription technique we use to increase overall resource utilization of our Spark clusters. We leverage a historical stats based resource oversubscription algorithm that considers the historical resource usage of each unique Spark job and predicts the ideal resource allocation to minimize unused resources and increase the cluster utilization.
We’ll also explain the changes necessary to Spark and the resource manager in order to support oversubscription. Finally, we will conclude by sharing our results and the future direction for this project.