Many real-time vision applications require accurate estimation of optical flow. This problem is challenging because of its extremely high computation and memory-bandwidth requirements. This paper presents a parallel block-based optical flow algorithm along with an optimized multicore hardware architecture. The algorithm is based on neighbor-guided semiglobal matching (NG-fSGM), a dynamic programming algorithm that aggressively prunes the search space using the flow vectors of neighboring pixels. In the block-based NG-fSGM, the image is divided into overlapping blocks, and the blocks are processed in parallel for high throughput. While a large overlap between blocks improves accuracy, it also requires more memory and increases computational complexity. To minimize the overlap among blocks with minimal effect on accuracy, we use temporal prediction to guide flow vectors along the block boundaries. A pseudo-random flow candidate selection technique is also introduced to reduce memory access bandwidth and computation requirements. The proposed algorithm is mapped onto a multicore architecture in which each core has a high degree of internal parallelism and implements a prefetching technique to improve throughput and hide memory latency. The proposed hardware-efficient algorithm and the corresponding architecture achieve significant gains in throughput, latency, and power efficiency with only 1.25% accuracy degradation compared to the original NG-fSGM when evaluated on the Middlebury dataset.
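
As a rough illustration of the block partitioning and its overlap cost, the following minimal Python sketch tiles an image into blocks that are each extended by a fixed margin on every side. The function names, the square block shape, and the symmetric margin are assumptions made for illustration only; the sketch is not the partitioning scheme of the proposed architecture.

```python
from typing import List, Tuple

# (top, left, bottom, right) with end-exclusive bottom/right; hypothetical layout.
Block = Tuple[int, int, int, int]


def overlapping_blocks(height: int, width: int, block: int, overlap: int) -> List[Block]:
    """Tile a height x width image into blocks, each grown by `overlap` pixels per side.

    Blocks are clipped at the image border, so edge blocks are smaller.
    """
    if block <= 0 or overlap < 0:
        raise ValueError("block must be positive and overlap non-negative")
    blocks: List[Block] = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            blocks.append((
                max(top - overlap, 0),
                max(left - overlap, 0),
                min(top + block + overlap, height),
                min(left + block + overlap, width),
            ))
    return blocks


def padded_pixel_count(blocks: List[Block]) -> int:
    """Total pixels touched across all blocks, counting overlapped pixels once per block."""
    return sum((bottom - top) * (right - left) for top, left, bottom, right in blocks)
```

For an 8 x 8 image with 4 x 4 blocks, `padded_pixel_count(overlapping_blocks(8, 8, 4, 0))` is 64, while an overlap of 1 raises it to 100. The extra pixels are the memory and computation cost that a smaller overlap avoids.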