Alex Berg: Action Recognition and Synthesis

Alex Berg Home : Action Recognition and Synthesis

Action Recognition and Synthesis

The overarching goal of this work is to recognize human actions in video with enough precision to drive a synthesis algorithm for video of human actions. Underlying this goal are difficult problems such as building priors on human actions, smooth interpolation between video of articulated figures, video matting, and figure tracking with occlusion and clutter. The two papers described below cover some of this work. Recognizing Action at a Distance presents a descriptor for motion similarity that is used in a simple nearest neighbor framework for action recognition. In addition the paper presents simple techniques for using this machinery to produce novel video by splicing together sections of video with desired motions. Video Based Motion Synthesis by Splicing and Morphing goes into more detail about how to find good sections of video to splice together, and how to smooth the transitions between splices when figures are at high enough resolution to resolve limbs.

Recognizing Action at a Distance [pdf] [ps]
Alexei A. Efros, Alexander C. Berg, Gregory P. Mori, Jitendra Malik
International Conference on Computer Vision (ICCV) 2003, Nice, pp 726-733.

Example NTSC frame with one player enlarged.

Our goal is to recognize human actions at a distance, at resolutions where a whole person may be, say, 30 pixels tall. We introduce a novel motion descriptor based on optical flow measurements in a spatio-temporal volume for each stabilized human figure, and an associated similarity measure to be used in a nearest-neighbor framework. Making use of noisy optical flow measurements is the key challenge, which is addressed by treating optical flow not as precise pixel displacements, but rather as a spatial pattern of noisy measurements which are carefully smoothed and aggregated to form our spatio-temporal motion descriptor. To classify the action being performed by a human figure in a query sequence, we retrieve nearest neighbor(s) from a database of stored, annotated video sequences. We can also use these retrieved exemplars to transfer 2D/3D skeletons onto the figures in the query sequence, as well as two forms of data-based action synthesis Do as I Do and Do as I Say . Results are demonstrated on ballet, tennis as well as football datasets.

Our motion descriptor is based on optical flow
split into channels and blurred.

Best matches for classification (ballet, tennis, football). The top row of each set shows a sequence of input frames, the bottom row shows the best match for each of the frames. Our method is able to match between frames of people performing the same action yet with substantial difference in appearance. (Click to enlarge.)

Videos:
Best Matching Motions
Football Action Classification
Tennis Action Classification
Do as I do Tennis
Do as I say Tennis 1
Do as I say Tennis 2
Best match to video of a single player, no smoothing.
Greg in the World Cup

Video Based Motion Synthesis by Splicing and Morphing [pdf] [ps]
Gregory P. Mori, Alexander C. Berg, Alexei A. Efros, Ashley Eden, Jitendra Malik
U.C. Berkeley Technical Report UCB/CSD-4-1337

In this paper we present a method for synthesizing videos of human motion by splicing together clips of input video. There are two main contributions in this work. The first is developing a method for kinematically correct morphing of images of human figure, which is used to splice together the clips in input video in a manner that produces smooth output sequences. The second contribution of this work is the application of activity recognition algorithms to our input data in order to automatically extract action labels, which allow us to control the synthesized video by issuing high-level action commands. We present results of synthetic sequences on two domains: ballet and tennis.

Morphing articulated figures. (a,c) Original frames. (b) Synthetic morphed frame halfway (in body parameters) between original frames. (d,f) Skeletons from original frames. (e) Target skeleton for morphed frame (b). (g) Cross-fade between original frames. (click to enlarge.)

Some Videos