Research has shown that caregiver implementation of pivotal response treatment (PRT) with a child with autism spectrum disorder (ASD) helps the child develop social and communication skills. In training programs and research studies, caregiver fidelity to PRT is evaluated from video probes that depict the caregiver interacting with the child. Behavior analysts review these video probes and extract the data metrics manually. Multimodal data processing techniques and machine learning could reduce the human cost of evaluating the video probes by automating data analysis tasks.

Creating an 'Opportunity to Respond' is one of the categories used to evaluate caregiver fidelity to PRT implementation. A caregiver is judged to have successfully created an opportunity to respond when they deliver an appropriate instruction while they have the child's attention. Automatically determining when the caregiver has correctly provided an opportunity to respond requires classifying both the audio and the video data from the probes. The modalities can be combined into a single classification task using either feature fusion or decision fusion methods. Two decision fusion configurations and one feature fusion model were evaluated. The decision fusion models achieved higher accuracy; however, the feature fusion model had a higher average F1 score, indicating more reliable prediction capability.
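The distinction between the two fusion strategies, and the way accuracy and F1 can rank models differently, can be illustrated with a minimal sketch. The sketch is not the implementation evaluated in this work: the function names, the equal modality weights, the 0.5 threshold, and the toy labels are all assumptions chosen for illustration, and F1 is computed for the positive class only rather than averaged.

```python
from typing import List, Sequence


def feature_fusion(audio: Sequence[float], video: Sequence[float]) -> List[float]:
    """Feature (early) fusion: join the per-modality feature vectors into
    one vector that a single classifier would consume."""
    return list(audio) + list(video)


def decision_fusion(audio_prob: float, video_prob: float,
                    audio_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Decision (late) fusion: each modality has its own classifier; their
    output probabilities are combined here by a weighted average
    (the weights and threshold are hypothetical)."""
    fused = audio_weight * audio_prob + (1.0 - audio_weight) * video_prob
    return fused >= threshold


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy, imbalanced labels (invented for illustration, not study data):
# 1 = an opportunity to respond was created, 0 = it was not.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                      # never detects the event
over_predicting = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # finds both events, 3 false alarms

if __name__ == "__main__":
    for name, pred in [("always_negative", always_negative),
                       ("over_predicting", over_predicting)]:
        print(name, accuracy(y_true, pred), f1_score(y_true, pred))
```

On these toy labels the predictor that never fires has the higher accuracy but an F1 of zero, while the predictor that finds both positive cases has the lower accuracy and the higher F1. This is the kind of divergence that makes F1 informative when positive events are comparatively rare.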