Voice activity detection (VAD) is among the first preprocessing steps in most speech processing applications. While there are several very low-power analog solutions, the more recent deep neural network (DNN) based solutions have superior VAD performance in even complex noisy backgrounds at the expense of increase in computations. In this paper, we propose a computationally-efficient network architecture, ResCap+, for high performance VAD. ResCap+ operates on small-sized sequences and is built with residual blocks in a convolutional neural network to encode the characteristics of the input spectrum, and a capsule network with LSTM cells to capture the temporal relationship between these sequences. We evaluate the model using the AMI meeting corpus and show that it outperforms a state-of-the-art DNN-based model on accuracy with ≈55× less computation cost. We also present initial hardware performance results on a low-power programmable architecture, Transmuter, and show that it can process every 40ms input audio sequence with a delay of 15.17ms resulting in real-time performance.