We propose to develop a comprehensive framework for the joint analysis of audio-visual signals obtained from spatially distributed microphones and cameras. We seek solutions to the audio-visual sensing problem that scale to an arbitrary number of cameras and microphones and can address challenging environments in which there are multiple speech and nonspeech sound sources and multiple moving people and objects. It has recently become relatively inexpensive to deploy tens or even hundreds of cameras and microphones in an environment, and many applications could benefit from the ability to sense in both modalities. There are two levels at which joint audio-visual analysis can take place. At the signal level, the challenge is to develop representations that capture the rich dependency structure in the joint signal and deal successfully with issues such as variable sampling rates and varying temporal delays between cues. At the spatial level, the challenge is to compensate for the distortions introduced by sensor location and to pool information across sensors to recover 3-D information about the spatial environment. For many applications, it is highly desirable that the solution method be self-calibrating, so that it does not require an extensive manual calibration process every time a new sensor is added or an old sensor is moved or replaced. Removing the burden of manual calibration also makes it possible to exploit ad hoc sensor networks, which could arise, for example, from wearable microphones and cameras. We propose to address the following four research topics: 1. Representations and learning methods for signal-level fusion. 2. Volumetric techniques for fusing spatially distributed audio-visual data. 3. Self-calibration of distributed microphone-camera systems. 4. Applications of audio-visual sensing. For example, this proposal includes considerable work on lip and facial analysis to improve voice communications.