State-of-the-art deep convolutional neural networks (CNNs) are widely used in current AI systems, and achieve remarkable success in image/speech recognition and classification. A number of recent efforts have attempted to design custom inference engine based on various approaches, including the systolic architecture, near memory processing, and processing-in-memory (PIM) approach with emerging technologies such as resistive random access memory (RRAM). However, a comprehensive comparison of these various approaches in a unified framework is missing, and the benefits of new designs or emerging technologies are mostly based on qualitative projections. In this paper, we evaluate the energy efficiency and frame rate for a VGG-like CNN inference accelerator on CIFAR-10 dataset across the technological platforms from CMOS to post-CMOS, with hardware resource constraint, i.e. comparable on-chip area. We also investigate the effects of off-chip memory DRAM access and interconnect during data movement, which are the bottlenecks of CMOS platforms. Our quantitative analysis shows that the peripheries (ADCs) dominate in energy consumption and area (rather than memory array) in digital RRAM-based parallel readout PIM architecture. Despite presence of ADCs, this architecture shows >2.5× improvement in energy efficiency (TOPS/W) over systolic arrays or near memory processing, with a comparable frame rate due to reduced DRAM access, high throughput and optimized parallel read out. Further >10× improvements can be achieved by implementing bit-count reduced XNOR network and pipelining.