The prevalence of stream applications in signal processing, multimedia, and network processing domains has resulted in a new trend of programming and architecture design. Several languages and multicore architectures have been developed to support streaming applications. In many of these multicore architectures, scratchpad memories (SPMs) have replaced caches because of their lower power consumption. Performance optimization on SPM-based architectures requires the programmer or compiler to manage the limited local memory efficiently. This paper addresses the problem of compiling stream programs onto multicore architectures that incorporate SPMs. We propose a retiming technique that maximizes throughput under a memory constraint with a user-specified number of software pipeline stages. Our technique intensively explores the trade-offs between double buffering and code overlay to achieve the best performance. We evaluated the efficiency of our technique by compiling several stream applications for the IBM Cell BE and comparing the results against existing approaches.