DeepFake Detection Challenge (DFDC)

DFDC as shown on Facebook AI’s landing page.

While at Facebook, I was the tech lead for building the DeepFake Detection Challenge Kaggle competition. The competition awarded $1M in prizes and was designed to advance the state of the art in automatic DeepFake detection. 2,265 teams participated over the course of three months, with access to a dataset of over 100,000 training clips. At the start of the competition, this was the largest DeepFake dataset ever constructed, and one of the largest public video datasets overall.

Deep Submodular Functions

In general, learning submodular functions is provably hard. Previous approaches to learning with submodularity used weighted mixtures of submodular functions, but this does not take into account possible interactions between the constituent functions themselves. We expand upon previous approaches for learning with submodularity by repeatedly composing submodular functions into a deep feedforward structure.

In our 2016 NIPS paper, we propose and study a new class of submodular functions called deep submodular functions (DSFs). We define DSFs and situate them within the broader landscape of submodular function classes, relating them both to various matroid rank functions and to sums of concave functions composed with modular functions (SCMs). Notably, we find that DSFs constitute a strictly broader class than SCMs, thus motivating their use, but that they do not comprise all submodular functions. Interestingly, some DSFs can be seen as special cases of certain deep neural networks (DNNs), hence the name. Finally, we provide a method to learn DSFs in a max-margin framework, and offer preliminary results on both synthetic and real-world data.
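To make the layered structure concrete, here is a toy sketch of how a two-layer DSF might be evaluated: the bottom layer is a non-negative modular function of the chosen set, and each subsequent layer applies a concave function (square root here) to non-negative combinations of the previous layer's outputs. The names, weights, and shapes below are illustrative placeholders, not the construction or code from the paper.

```python
import numpy as np

def dsf_value(A, feature_weights, layer1_weights, layer2_weights):
    """Evaluate a toy two-layer DSF f(A) for a set A of item indices.

    feature_weights: (n_items, n_features) non-negative modular weights
    layer1_weights:  (n_features, n_hidden) non-negative mixing weights
    layer2_weights:  (n_hidden,)            non-negative mixing weights
    """
    # Modular bottom layer: sum item features over the chosen set A.
    modular = feature_weights[list(A)].sum(axis=0)      # (n_features,)
    # First concave layer: sqrt of non-negative combinations.
    hidden = np.sqrt(modular @ layer1_weights)          # (n_hidden,)
    # Second concave layer: a single concave unit at the root.
    return float(np.sqrt(hidden @ layer2_weights))

# Example usage with random non-negative weights.
rng = np.random.default_rng(0)
W0 = rng.random((10, 5))   # 10 items, 5 modular features
W1 = rng.random((5, 3))
W2 = rng.random(3)
print(dsf_value({0, 2, 7}, W0, W1, W2))
```

Because each layer is a concave function of non-negative combinations of submodular inputs, the composed function remains submodular, which is what makes this deep construction useful.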

Efficient Structured Prediction

Structured learning involves predicting values that belong to an exponentially large output space, rather than a single label. For instance, the label corresponding to the pose of an object in 3-dimensional space is a 6-dimensional vector, one dimension for each degree of freedom. Inference techniques for producing a structured output can be very slow, as we must search over the space of possible labelings. Even with intelligent reductions of some models, inference often cannot be run in real time.

I am investigating the tradeoffs between complexity and accuracy in structured prediction systems. In other words, can we sacrifice an acceptable amount of accuracy for the speedup necessary to run these systems in real time?
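As a rough illustration of the kind of tradeoff I mean (a toy chain-structured model, not any particular system of mine): exact Viterbi decoding costs O(TK²) per sequence, while a greedy decoder that ignores transition scores costs only O(TK) but can return a worse labeling.

```python
import numpy as np

def viterbi(unary, transition):
    """Exact MAP decoding. unary: (T, K) scores, transition: (K, K)."""
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition      # (K_prev, K_curr)
        back[t] = cand.argmax(axis=0)           # best previous state per label
        score = cand.max(axis=0) + unary[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def greedy(unary):
    """Approximate decoding: per-position argmax, ignores transitions."""
    return unary.argmax(axis=1).tolist()

rng = np.random.default_rng(0)
unary = rng.normal(size=(6, 4))
transition = rng.normal(size=(4, 4))
print("exact:", viterbi(unary, transition))
print("fast :", greedy(unary))
```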

Eye Gaze Estimation with the Microsoft Kinect

Eye gaze estimation is a field with a long history of applications ranging from psychological experiments to accessibility. Most tracking systems either require a user to wear an intrusive head unit or consist of a large (and expensive) unit that sits on a table. We hope to bring eye gaze estimation to the general public through the use of the Microsoft Kinect, a relatively inexpensive and widely-available consumer device.

This research is still in progress, but we have gathered data from over 150 subjects and have started building the gaze estimator. Video of its capabilities will be posted shortly.

Future applications include interfacing with a larger HCI system and providing accessibility to patients who have difficulty interacting with a computer through traditional means. The Kinect's non-intrusive form factor and low cost will make gaze estimation a simple and affordable venture.

Instrumentation Identification using the Million Song Dataset

Most music recommendation systems use metadata to make song suggestions. A major limiting factor of services like Pandora is that they cannot recommend songs for which metadata is not available. This research project focuses on content-based analysis of audio to identify the musical ensemble that produced a piece of music. Using this system, you could, for instance, create one playlist consisting of songs produced by a traditional four-piece rock band and another of songs produced by a jazz quartet.

This project uses a Universal Background Model (traditionally used for speaker verification) trained on the Million Song Dataset to predict which ensemble produced a particular song. You can download my thesis on the topic here.
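For readers unfamiliar with the GMM-UBM recipe, the sketch below shows its general shape: fit a universal background model on pooled audio features, MAP-adapt the component means toward each ensemble class, and score a song by its log-likelihood ratio against the UBM. The feature extraction, relevance factor, and number of components here are placeholders, not the values used in the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ubm(background_features, n_components=64):
    """Fit the universal background model on pooled frame-level features."""
    ubm = GaussianMixture(n_components, covariance_type="diag", max_iter=200)
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, class_features, relevance=16.0):
    """Adapt only the component means (standard relevance-MAP update)."""
    resp = ubm.predict_proba(class_features)              # (N, K) responsibilities
    n_k = resp.sum(axis=0)                                 # soft counts per component
    ex_k = resp.T @ class_features / np.maximum(n_k[:, None], 1e-8)
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * ex_k + (1.0 - alpha) * ubm.means_

def score(ubm, adapted_means, song_features):
    """Log-likelihood ratio of the adapted class model vs. the UBM."""
    adapted = GaussianMixture(ubm.n_components, covariance_type="diag")
    adapted.weights_, adapted.covariances_ = ubm.weights_, ubm.covariances_
    adapted.means_ = adapted_means
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    return adapted.score(song_features) - ubm.score(song_features)
```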


Interfacing with Sound

Percussion Simulation 

Most mobile percussion applications don't provide a realistic simulation of striking an actual instrument. Apps usually use the accelerometer to measure a drum stroke, but typical peak detection on the accelerometer signal produces the sound too late, which mars the perception of striking a virtual drum. I carried out user studies to measure the acceleration profiles of various drum strikes, and then developed a system that causally predicts when a sound should be produced to provide the best user experience. The paper can be downloaded here.
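The sketch below illustrates the general idea of causal triggering (the threshold and refractory window are invented for illustration, not values from the user studies): fire the drum sample on a rising edge of the accelerometer magnitude rather than waiting for the peak, so the sound lines up with the perceived moment of impact.

```python
def causal_strike_detector(samples, threshold=1.5, refractory=10):
    """Yield sample indices at which to trigger a drum sound.

    samples: iterable of accelerometer magnitudes arriving in real time.
    """
    prev = 0.0
    cooldown = 0
    for i, a in enumerate(samples):
        if cooldown > 0:
            cooldown -= 1                  # still inside the previous stroke
        elif a > threshold and a > prev:   # rising edge above threshold
            yield i                        # trigger before the true peak
            cooldown = refractory          # ignore the rest of this stroke
        prev = a
```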

Science of Jazz

As part of the 2012 Philadelphia Science Festival, the METlab hosted a jazz concert featuring Marc Cary and Wil Calhoun. Accompanying their performance were a number of visualizations of the live audio, including a live 3D spectrogram and a concentric "Chroma Wheel." In addition, concertgoers could download a mobile application that let them select their own visualization. The app also let attendees take part in a crowd-sourced visualization: it recorded the sound level from each device's microphone, and the levels from all devices were projected onto the main screen to form a "sound map" of the concert venue during the performance.
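For reference, the kinds of features the visualizations were built on can be computed offline with a few lines of librosa (my own illustration with a placeholder filename, not the app's real-time code):

```python
import numpy as np
import librosa

# Load a recording of the performance (placeholder path).
y, sr = librosa.load("performance.wav")

# Magnitude spectrogram, the basis of the live 3D spectrogram display.
spectrogram = np.abs(librosa.stft(y, n_fft=2048))     # (freq_bins, frames)

# Per-frame chroma vectors, the kind of data a "Chroma Wheel" could display.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # (12, frames)
print(chroma[:, 0])                                   # energy per pitch class
```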

Excerpt from a performance for the 2012 Philadelphia Science Festival, demonstrating the Drexel University Music Entertainment Technology (MET) lab's PhilaSciFest App, which displays real-time visualizations of audio features extracted from a live performance. Full video coming soon!