Deep learning inference is increasingly run at the edge. As the programming and system stack support becomes mature, it enables acceleration opportunities in a mobile system, where the system performance envelope is scaled up with a plethora of programmable co-processors. Thus, intelligent services designed for mobile users can choose between running inference on the CPU or any of the co-processors in the mobile system, and exploiting connected systems such as the cloud or a nearby, locally connected mobile system. By doing so, these services can scale out the performance and increase the energy efficiency of edge mobile systems. This gives rise to a new challenge-deciding when inference should run where. Such execution scaling decision becomes more complicated with the stochastic nature of mobile-cloud execution environment, where signal strength variation in the wireless networks and resource interference can affect real-time inference performance and system energy efficiency. To enable energy efficient deep learning inference at the edge, this paper proposes AutoScale, an adaptive and lightweight execution scaling engine built on the customdesigned reinforcement learning algorithm. It continuously learns and selects the most energy efficient inference execution target by considering characteristics of neural networks and available systems in the collaborative cloud-edge execution environment while adapting to stochastic runtime variance. Real system implementation and evaluation, considering realistic execution scenarios, demonstrate an average of 9.8x and 1.6x energy efficiency improvement over the baseline mobile CPU and cloud offloading, respectively, while meeting the real-time performance and accuracy requirements.