The integration of rich sensory technologies into critical applications, such as gesture recognition and building energy optimization, has highlighted the importance of intelligent time-series analytics. To accommodate this demand, uni-variate approaches have been extended to multi-variate scenarios, but naive extensions have led to a deterioration in model performance because of their limited ability to capture both the information recorded in the different variates and the evolution of complex multi-variate time series patterns over time. Furthermore, real-world time series are often contaminated with noise. In this paper, we note that a time series often carries robust localized temporal events that could help improve model performance by highlighting the relevant information; however, the lack of sufficient data to train for these events makes it impossible for neural architectures to identify and make use of them. We therefore argue that a companion process that helps identify salient events in the input time series and drives the model's attention to the associated salient sub-sequences can help with learning a high-performing network. Relying on this observation, we propose a novel Saliency-Driven Mutual Cross Attention (SDMA) framework that extracts localized temporal events and generates a saliency series to complement the input time series. We further propose an architecture that accounts for the mutual cross-talk between the input and saliency series branches, in which the input and saliency series attend to each other. Experiments show that the proposed mutual cross attention framework can offer significant boosts in model performance when compared against non-attentioned, conventionally attentioned, and conventionally cross-attentioned models.
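
The notion of two branches attending to each other can be illustrated with a minimal NumPy sketch. This is not the SDMA architecture itself, whose details are not given in this abstract; it assumes plain scaled dot-product attention with no learned projections, and the names `cross_attention` and `mutual_cross_attention` are hypothetical. The saliency series is taken as a given input, and the absolute first difference used in the usage lines is only a stand-in, not the event extraction the paper proposes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention: rows of `queries` attend over rows of `context`.

    queries: array of shape (T_q, d); context: array of shape (T_c, d).
    Returns the attended output (T_q, d) and the attention weights (T_q, T_c).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


def mutual_cross_attention(x, s):
    """Assumed simplification: each series attends to the other, without learned weights."""
    x_attended, w_xs = cross_attention(x, s)  # input series attends to saliency series
    s_attended, w_sx = cross_attention(s, x)  # saliency series attends to input series
    return x_attended, s_attended, w_xs, w_sx


# Usage: a random 4-variate series of length 50 and a stand-in saliency series.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
s = np.abs(np.diff(x, axis=0, prepend=x[:1]))  # hypothetical saliency proxy
x_att, s_att, w_xs, w_sx = mutual_cross_attention(x, s)
```

In this sketch each row of `w_xs` is a distribution over the time steps of the saliency series, and each row of `w_sx` is a distribution over the time steps of the input series. A trainable model would normally add learned query, key, and value projections in each branch, which are omitted here for brevity.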