The Long Short-Term Memory (LSTM) recurrent neural network (RNN) is well known for its ability to model sequence-learning tasks such as language modeling. However, due to their large number of parameters and compute-intensive operations, existing FPGA implementations of LSTMs are not sufficiently energy-efficient: they require large area and exhibit high power consumption. This work describes a substantially different hardware implementation of an LSTM that includes several architectural innovations to achieve high throughput and energy efficiency: (1) an improved approximate multiplier (AM) design and its integration with the compute-intensive units of the LSTM; (2) control mechanisms to handle the variable-cycle (data-dependent) multiply operations; and (3) hierarchical pipelining at multiple levels of the design to maximize the overlap of these variable-cycle computations. In addition, this work applies post-training, range-based, linear quantization to the model parameters to further improve performance and energy efficiency. A Python framework is also developed that allows analysis and fine-tuning of the input parameters before the design is mapped to hardware. This paper extensively explores the design trade-offs and demonstrates the advantages for one common application, language modeling. Implemented on a Xilinx Zynq XC7Z030 FPGA, the design achieves improvements over three recently published works of up to 27.86X, 7.69X, and 11.06X in throughput, and up to 45.26X, 14.76X, and 16.97X in energy efficiency, respectively.
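The abstract does not specify the exact quantizer used; as a point of reference, a minimal NumPy sketch of a generic symmetric, range-based, linear post-training quantizer (all function names and the bit width here are illustrative assumptions, not the paper's implementation) might look like:

```python
import numpy as np

def quantize_linear(w, n_bits=8):
    """Symmetric range-based linear quantization (post-training).

    The scale factor is derived from the maximum absolute value of the
    tensor, so the full observed range maps onto the integer grid.
    (Illustrative sketch; the paper's actual quantizer may differ.)
    """
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = max(np.max(np.abs(w)) / qmax, 1e-12)  # guard against all-zero w
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to floating point."""
    return q.astype(np.float32) * scale

# Example: quantize a small weight matrix and bound the reconstruction error
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_linear(w)
w_hat = dequantize(q, scale)
max_err = np.max(np.abs(w - w_hat))
# Rounding error is at most half a quantization step
assert max_err <= scale / 2 + 1e-6
```

In a hardware mapping such as the one described above, the integer tensor `q` would be stored on-chip while the per-tensor `scale` is folded into downstream arithmetic, which is what reduces both memory footprint and multiplier width.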