There is good reason to believe that humans use some kind of recursive grammatical structure when recognizing and performing complex manipulation activities. We have built a system that automatically constructs a tree structure from observations of an actor performing such activities. The resulting activity trees form a framework for search and understanding, tying action to language. We explore and evaluate the system through experiments on a novel complex-activity dataset captured with synchronized Kinect and SR4000 Time-of-Flight cameras. Processing the combined 3D and 2D image data provides the terminals and events needed to build the tree from the bottom up. Experimental results highlight the contribution of the action grammar in: 1) providing a robust structure for complex activity recognition over real data, and 2) disambiguating interleaved activities within the same sequence.
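
To make the bottom-up construction concrete, the following is a minimal sketch, not the authors' implementation: it assumes hypothetical terminal labels (reach, grasp, move, release) and hypothetical grammar rules (acquire, transport), and shows how detected terminals could be shifted onto a stack and reduced into higher-level activity nodes whenever a rule's right-hand side matches.

```python
# Illustrative sketch of bottom-up activity-tree construction.
# Event labels and grammar rules here are hypothetical examples,
# not the rules or terminals used in the paper.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

# Each rule maps a sequence of child labels to a parent label.
RULES = {
    ("reach", "grasp"): "acquire",
    ("acquire", "move", "release"): "transport",
}

def build_tree(events):
    """Shift detected terminal events onto a stack; greedily reduce
    any stack suffix matching a rule's right-hand side into a parent
    node, yielding one or more (partial) activity trees."""
    stack = []
    for label in events:
        stack.append(Node(label))
        reduced = True
        while reduced:
            reduced = False
            for rhs, parent in RULES.items():
                n = len(rhs)
                if len(stack) >= n and tuple(c.label for c in stack[-n:]) == rhs:
                    children = stack[-n:]
                    del stack[-n:]
                    stack.append(Node(parent, children))
                    reduced = True
                    break
    return stack

def show(node, depth=0):
    print("  " * depth + node.label)
    for child in node.children:
        show(child, depth + 1)

if __name__ == "__main__":
    # Terminals as they might be emitted by the perception front end.
    for tree in build_tree(["reach", "grasp", "move", "release"]):
        show(tree)
```

Running this sketch on the four example terminals prints a single tree rooted at "transport", illustrating how low-level detections can compose into a recursive activity structure.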