Face detection is a critical function in many embedded applications, such as computer vision and security. Although face detection has been well studied, detecting a large number of faces with different scales and excessive variations (pose, expression, or illumination) usually involves computationally expensive classification algorithms. These algorithms may divide an image into sub-windows at different scales, evaluate a large set of features for each sub-window, and determine the presence and location of a face. Even with state-of-the-art CPUs, it is still challenging to perform real-time face detection with sufficiently high energy efficiency and accuracy. In this paper, we propose a suite of acceleration techniques to enable such a capability on the CPU-FPGA platform, based on a state-of-the-art face detection algorithm that employs a large number of simple classifiers. We first map the algorithm using the integrated OpenCL environment for FPGA. Matching the structure of the algorithm, a nested architecture is proposed to speed up both memory access and the computing iterations. This multi-layer architecture distributes parallel computing cores with the memory. The physical aspects of the nested architecture, such as the core size and the number of cores, are further optimized to achieve real-time face detection, under realistic hardware constraints.