Entropy Driven Hierarchical Search for 3D Human Pose Estimation
Author(s) - Ben Daubney, Xianghua Xie
Publication year - 2011
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5244/c.25.31
3D human pose estimation from a single monocular image is an extremely difficult problem. There are currently two main approaches: the first learns a direct mapping from image features to 3D pose [1]; the second first extracts 2D pose as an intermediate stage and then 'lifts' it to 3D [2]. The limitation of both approaches is that they apply only to poses similar to those represented in the original training set, e.g. walking, and they are unlikely to scale to arbitrary 3D poses. By contrast, in the domain of 2D pose estimation, current state-of-the-art methods have been shown capable of detecting much more varied poses [3]. This has been achieved using generative models built around the Pictorial Structures representation, which decomposes pose estimation into a search over individual parts [4].

In this paper we present a generative method to extract 3D pose from single images using a part-based representation. The method is stochastic, though in contrast to methods used for 3D tracking (e.g. the particle filter), where the search space in each frame is tightly constrained by previous observations, in single-image pose estimation the search space is much larger. To permit a search over this space, a generative prior model is learnt from motion capture data. Stochastic samples are used to approximate this prior and to facilitate its update; in effect, the initial prior is iteratively deformed into the posterior distribution.

The body is represented by a set of ten parts; each part has a fixed length, and connected parts are forced to join at fixed locations. The conditional distribution between two connected parts is modeled by first learning a joint distribution using a GMM, p(x_i, x_j | θ_ij), where x_i and x_j are the states of the i-th and j-th parts respectively and θ_ij is the set of model parameters.
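The iterative deformation of a sample-based prior toward the posterior can be sketched, in a deliberately simplified one-dimensional form, as importance reweighting followed by resampling with a small jitter. The likelihood function and all numerical values below are illustrative assumptions of ours, not the paper's actual image likelihood or search procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def likelihood(x):
    # Stand-in for an image-based likelihood (illustrative assumption);
    # sharply peaked at x = 2.
    return np.exp(-0.5 * (x - 2.0) ** 2 / 0.1)

# Samples approximating a broad prior learnt offline.
samples = rng.normal(0.0, 2.0, size=1000)

# Iteratively deform the prior toward the posterior: weight each sample
# by the likelihood, resample in proportion to those weights, then add a
# small jitter so the sample set does not collapse to duplicates.
for _ in range(3):
    w = likelihood(samples)
    w = w / w.sum()
    idx = rng.choice(len(samples), size=len(samples), p=w)
    samples = samples[idx] + rng.normal(0.0, 0.1, size=len(samples))

# The surviving samples now concentrate near the likelihood peak.
```
The point of the jitter step is practical: pure resampling only ever reuses existing sample values, whereas the added noise lets the set explore locally around high-likelihood poses.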
As each model is represented using a GMM, the model parameters are defined as θ_ij = {λ^k_ij, μ^k_ij, Σ^k_ij}_{k=1}^{K}, where K is the number of components in the model and λ^k_ij, μ^k_ij, Σ^k_ij are the k-th component's weight, mean and covariance respectively. For efficiency, all covariances used to represent limb conditionals are diagonal and can be partitioned such that Σ^k_ij = diag(Λ^k_ii, Λ^k_jj), and likewise μ^k_ij = (μ^k_i, μ^k_j). Given a value for x_j (e.g. a sample), the conditional distribution p(x_i | x_j, θ_ij) is first calculated from the joint distribution p(x_i, x_j | θ_ij), after which a sample x_i can be drawn from it. The conditional distribution p(x_i | x_j, θ_ij) is also a GMM, with parameters {λ^k_i, μ^k_i, Λ^k_ii}_{k=1}^{K}. The component weights are proportional to the marginal distribution, λ^k_i ∝ p(x_j | θ^k_ij), which is calculated from the normal distribution p(x_j | θ^k_ij) = λ^k_ij N(x_j; μ^k_j, Λ^k_jj). Note that this conditional model differs from the typical approximation in which the conditional is modeled by p(x_ij | θ_ij), where x_ij is the value of x_i in the local frame of reference of x_j [3]. A benefit of learning a full conditional model between neighboring parts is that different GMM components learnt in quaternion space correspond to different spatial locations in R^3. This is illustrated in Fig. 1, where it can clearly be seen that this representation captures multiple modes.
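The conditional sampling step above can be sketched numerically. The toy parameters below are made up (one-dimensional parts, K = 2 components) and the variable names are ours, but the mechanics follow the text: given x_j, each component weight is rescaled by λ^k_ij N(x_j; μ^k_j, Λ^k_jj), a component is selected, and x_i is drawn from that component's Gaussian. Because the covariances are block-diagonal, the conditional mean of x_i is simply μ^k_i, independent of x_j.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint GMM over (x_i, x_j) with K = 2 components and 1-D part states.
# Block-diagonal covariances, as in the text:
#   Sigma^k_ij = diag(Lambda^k_ii, Lambda^k_jj),  mu^k_ij = (mu^k_i, mu^k_j).
K = 2
lam = np.array([0.5, 0.5])        # component weights lambda^k_ij
mu_i = np.array([0.0, 3.0])       # mu^k_i
mu_j = np.array([0.0, 3.0])       # mu^k_j
var_i = np.array([0.5, 0.5])      # Lambda^k_ii (diagonal, one scalar here)
var_j = np.array([0.5, 0.5])      # Lambda^k_jj

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def sample_conditional(x_j):
    """Draw x_i ~ p(x_i | x_j, theta_ij): reweight each component by its
    marginal lambda^k_ij N(x_j; mu^k_j, Lambda^k_jj), select a component,
    then sample from its Gaussian over x_i (block-diagonality means the
    conditional mean of x_i is mu^k_i, unshifted by x_j)."""
    w = lam * normal_pdf(x_j, mu_j, var_j)   # unnormalised lambda^k_i
    w = w / w.sum()
    k = rng.choice(K, p=w)
    return rng.normal(mu_i[k], np.sqrt(var_i[k]))

# Conditioning near the second mode selects component k = 1 almost surely,
# so the drawn x_i values concentrate around mu_i[1] = 3.
samples = np.array([sample_conditional(3.0) for _ in range(500)])
```
This also illustrates the multi-modality the text highlights: conditioning on an x_j near a different mode would reweight the components the other way and move the samples of x_i to the other spatial location.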
