Stream processing has emerged as an important model of computation in the context of multimedia and communication sub-systems of embedded System-on-Chip (SoC) architectures. The dataflow nature of streaming applications allows them to be most naturally expressed as a set of kernels iteratively operating on continuous streams of data. The kernels are computationally intensive and exhibit large amounts of data-level and instruction-level parallelism. Streaming applications are mainly characterized by real-time constraints that demand high throughput and data bandwidth with limited global data reuse. Conventional architectures fail to meet these demands due to their poorly matched execution models and the overheads associated with instruction and data movement. We present StreamEngine, an embedded architecture for energy-efficient computation of stream kernels. StreamEngine introduces an instruction locking mechanism that exploits the iterative nature of the kernels and enables fine-grain instruction reuse. We also adopt a Context-aware Dataflow Execution model to exploit instruction-level and data-level parallelism within the stream kernels. Each instruction in StreamEngine is locked to a Reservation Station (RS) and maintains a context that is updated upon execution; thus instructions never retire from the RS. The entire kernel is hosted in RS banks close to the functional units for energy-efficient instruction and operand delivery. We evaluate the performance and energy efficiency of our architecture on stream kernel benchmarks by implementing it in a TSMC 45nm process and comparing it against an embedded RISC processor.
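The instruction-locking idea above can be illustrated in software. The following is a minimal, hypothetical sketch (not the paper's implementation, and all names are illustrative): each static instruction is pinned once to a reservation-station slot, carries a per-iteration context that is overwritten on every firing, and is reapplied to each new stream element instead of being refetched and retired.

```python
# Hypothetical model of instruction locking with per-instruction contexts.
# Each LockedInstruction stays resident in its RS slot; its context is
# updated on every execution and the instruction never retires.

class LockedInstruction:
    def __init__(self, name, op, src_names):
        self.name = name            # name of the value this instruction produces
        self.op = op                # function implementing the operation
        self.src_names = src_names  # names of the source operands (contexts)
        self.context = None         # last produced value, updated in place

    def fire(self, contexts):
        # Read operands from the contexts of inputs / other locked instructions.
        args = [contexts[s] for s in self.src_names]
        self.context = self.op(*args)     # update this instruction's context
        contexts[self.name] = self.context
        return self.context

def run_kernel(rs_bank, stream):
    """Apply the locked kernel to each stream element.
    Instructions remain resident in the RS bank across iterations."""
    out = []
    for x in stream:
        contexts = {"in": x}
        for instr in rs_bank:   # instructions assumed listed in dataflow order
            instr.fire(contexts)
        out.append(rs_bank[-1].context)
    return out

# Example kernel: y = (in * 3) + 1, expressed as two locked instructions.
rs_bank = [
    LockedInstruction("t0", lambda a: a * 3, ["in"]),
    LockedInstruction("y",  lambda a: a + 1, ["t0"]),
]
print(run_kernel(rs_bank, [1, 2, 3]))  # → [4, 7, 10]
```

In hardware, the per-instruction context and RS residency would avoid repeated instruction fetch and operand movement; this sketch only mimics that behavior functionally, one stream element at a time.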