2's complement number system imposes a fundamental limitation on the power and performance of arithmetic circuits, due to the fundamental need of cross-datapath carry propagation. Residue Number System (RNS) breaks free of these bonds by decomposing a number into parts and performing arithmetic operations in parallel, significantly reducing the breadth of carry propagation. Consequently, RNS arithmetic has been proposed as a solution to improve the power-efficiency of arithmetic hardware. However, limitations of the expressiveness of RNS in terms of arithmetic operations together with overheads related to interaction with 2's complement arithmetic make programmable processor design that takes advantage of these benefits challenging. In this paper we meet this challenge by multi-tier synergistic co-design of architecture, micro-architecture, hardware components, as well as compilation techniques. Our experiments not only demonstrate simultaneous improvement of up to 30% in performance and 57% reduction in functional unit power consumption, but also that most of these benefits can be exploited with automatically generated code.