Computation-intensive applications usually consist of multiple nested or flattened loops. These loops are the main building blocks of such applications and embody specific execution patterns. To reduce their running time, developers need to analyze the loops in their code and parallelize them on hardware accelerators such as GPUs, TPUs, and FPGAs, which are increasingly available in the cloud. Unfortunately, a limited understanding of loop characteristics, and of how well each accelerator handles a given type of loop, prevents developers from choosing the right platform for their cloud applications. Moreover, developing and optimizing code for a specific accelerator is time-consuming. To address these issues, this paper studies the effectiveness of different processors in accelerating common loop patterns. It identifies five important types of loops that commonly occur in real-world applications and presents Loopy, a set of implementations of these loops optimized for different architectures. Using Loopy, the paper evaluates how well different hardware accelerates each loop pattern. The results reveal architectural differences among accelerators with regard to different loop patterns and provide insights that help developers choose the right accelerators for their applications. The current version of Loopy supports both FPGAs and GPUs, which are the most versatile and widely available accelerators.