Emotion analysis and recognition has become an active research topic in the computer vision community. In this paper, we first present the emoF-BVP database of multimodal (face, body gesture, voice and physiological signals) recordings of actors enacting various expressions of emotions. The database consists of audio and video sequences of actors displaying three different intensities of expression for 23 different emotions, along with facial feature tracking, skeletal tracking and the corresponding physiological data. Next, we describe four deep belief network (DBN) models and show that they learn robust multimodal features for emotion classification in an unsupervised manner. Our experimental results show that the DBN models outperform state-of-the-art methods for emotion recognition. Finally, we propose convolutional deep belief network (CDBN) models that learn salient multimodal features of expressions of emotions. Our CDBN models achieve better recognition accuracies than state-of-the-art methods when recognizing low-intensity, subtle expressions of emotions.
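The unsupervised multimodal feature learning described above can be sketched with classic greedy layer-wise DBN pretraining: stacking restricted Boltzmann machines so that each layer learns features of the layer below, on an early fusion of the modalities. This is a minimal illustration, not the paper's architecture; the layer sizes, hyperparameters, and synthetic stand-in data for the face and audio modalities are all assumptions for the example.

```python
# Sketch of DBN-style unsupervised multimodal feature learning via
# greedy layer-wise stacking of RBMs. Synthetic data and layer sizes
# are illustrative only, not the configuration used in the paper.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
# Stand-ins for two modalities (e.g. facial and audio features), in [0, 1].
face = rng.rand(200, 64)
audio = rng.rand(200, 32)
fused = np.hstack([face, audio])  # early fusion of modalities

features = fused
layers = []
for n_hidden in (48, 24):  # greedy layer-wise pretraining
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=10, random_state=0)
    # Each RBM is trained unsupervised on the previous layer's output.
    features = rbm.fit_transform(features)
    layers.append(rbm)

# The top-layer activations serve as the joint multimodal representation,
# which a downstream classifier can use for emotion recognition.
print(features.shape)
```

The resulting `(200, 24)` feature matrix plays the role of the learned multimodal representation; in practice a supervised classifier would then be trained (or the stack fine-tuned) on these features.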