## 3D Facial Reconstruction (part 1)

I have a week to report on this topic, and I find it interesting to follow a survey’s footsteps before actually having it in hand. The survey I am following is “State-of-the-art of 3D facial reconstruction methods for face recognition based on a single 2D training image per person”. I don’t have it yet, but I’m trying to get the Kyoto University Library to provide it for me. Its abstract points to several papers, so I will study them as soon as possible and extract useful information to familiarize myself with the domain. While searching, I stumbled upon the publication page of Professor Thomas Vetter, who is cited in these papers several times, and I’d like to read his other papers to gain more insight into this topic. I also found some other papers, which I’ll mention throughout this review.

3D facial reconstruction systems attempt to reconstruct 3D facial models of individuals from their 2D photographic images or video sequences. Currently published face recognition systems, which exhibit well-known deficiencies, are largely based on 2D facial images, although 3D image capture systems can better encapsulate the 3D geometry of the human face. Accordingly, face recognition research is gradually shifting from the legacy 2D domain to the more sophisticated 2D to 3D or 2D/3D hybrid domain.

Currently there exist four methods for 3D facial reconstruction. These are:

**Stochastic Newton Optimization method (SNO)**

- Blanz, V., Vetter, T., 1999. A morphable model for the synthesis of 3D faces. In: Proc. 26th Annu. Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH. pp. 187–194; (dedicated page)
- Blanz, V., Vetter, T., 2003. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Machine Intell. 25(9), 1063–1074;
- Blanz, V., 2001. Automatische Rekonstruktion der dreidimensionalen Form von Gesichtern aus einem Einzelbild. Ph.D. Thesis, Universität Tübingen, Germany. (Take a look at his astonishing page)

**Inverse compositional image alignment algorithm (ICIA)**

**Linear shape and texture fitting algorithm (LiST)**

**Shape alignment and interpolation method correction (SAIMC)**

The first three, SNO, ICIA + 3DMM, and LiST, can be classified as “analysis-by-synthesis” techniques, and SAIMC can be separately classified as a “3D supported 2D model”. In the survey, the authors introduce, discuss and analyze the differences between these two frameworks. They begin by presenting the 3D morphable model (3DMM; Blanz and Vetter, 1999), which forms the foundation of all four of the reconstruction techniques described here. This is followed by a review of the basic “analysis-by-synthesis” framework and a comparison of the three methods that employ this approach. Next they review the “3D supported 2D model” framework and introduce the SAIMC method, comparing it to the other three. The characteristics of all four methods are summarized in a table that should facilitate further research on this topic. Since I don’t have the survey paper yet, though, I can’t promise to cover that table!

## A morphable model for the synthesis of 3D faces

**Limitations** of automated techniques for face synthesis, face animation, or general changes in the appearance of an individual face

- Problem of finding corresponding **feature locations** in different faces
  - Crucial for all morphing techniques, both for the application of motion-capture data to pictures or 3D face models, and for most 3D face reconstruction techniques from images.
  - A limited number of labeled feature points marked in one face must be located precisely in another face.
    - e.g. the tip of the nose, the eye corner, and less prominent points on the cheek
  - The number of manually labeled feature points varies from application to application, but usually ranges from 50 to 300.
  - Only a correct alignment of all these points allows
    - Acceptable intermediate morphs
    - A convincing mapping of motion data from the reference to a new model
    - The adaptation of a 3D face model to 2D images for ‘video cloning’
  - Human knowledge and experience are necessary
    - To compensate for the variations between individual faces
    - To guarantee a valid location assignment in the different faces
  - Automated matching techniques can be utilized for prominent feature points
    - Corners of eyes (needs ref)
    - Corners of the mouth (needs ref)
- Problem of separating **realistic faces** from faces that could never appear in the real world
  - Human knowledge is even more critical
  - Many applications involve the design of completely new, natural-looking faces that can occur in the real world but which have no “real” counterpart. Others require the manipulation of an existing face according to changes in age or body weight, or simply to emphasize the characteristics of the face.
  - Such tasks usually require time-consuming manual work combined with the skills of an artist.

**Model Features**

- **Category**: A parametric face modeling technique
- **Generality**: Assists in both problems (i.e. feature location and realistic faces)
- **Procedure**
  - Arbitrary human faces can be created while simultaneously controlling the likelihood of the generated faces.
  - The system is able to compute correspondence between new faces.
- **Dataset**: Large dataset of 3D face scans with geometric and textural data, recorded using Cyberware™
- **Idea**: Exploiting the statistics of a large dataset → A morphable face model → Recover domain knowledge about face variations by applying pattern classification methods → An algorithm that adjusts the model parameters automatically for an optimal reconstruction of the target, requiring only a minimum of manual initialization.
- **Mathematical Representation**: A multidimensional 3D morphing function that is based on the linear combination of a large number of 3D face scans.
- **Problem 2 – Avoid unlikely faces**: By computing the average face and the main modes of variation in our dataset, a probability distribution is imposed on the morphing function.
- **Extras**: Parametric descriptions of face attributes such as gender, distinctiveness, “hooked” noses or the weight of a person → evaluating the distribution of exemplar faces for each attribute within the face space.
- **Problem 1 – Correspondence Problem**: Parametric face model that is able to generate almost any face → The correspondence problem turns into a mathematical optimization problem → New faces, images or 3D face scans, can be registered by minimizing the difference between the new face and its reconstruction by the face model function → The output of the matching procedure is a high quality 3D face model that is in full correspondence with our morphable face model → Consequently, all face manipulations parameterized in the model function can be mapped to the target face.
- **Brags**
  - The prior knowledge about the shape and texture of faces in general that is captured in our model function is sufficient to make reasonable estimates of the full 3D shape and texture of a face even when only a single picture is available.
  - When applying the method to several images of a person, the reconstructions reach almost the quality of laser scans.
- When applying the method to several images of a person, the reconstructions reach almost the quality of laser scans.
**Database**

- **Count of Subjects**: Laser scans (Cyberware™) of 200 heads of young adults → 100 male and 100 female
- **Representation**: Head structure data in a cylindrical representation → r(h,deg) + R(h,deg) + G(h,deg) + B(h,deg)
  - Surface points sampled at 512 equally spaced angles and at 512 equally spaced vertical steps
  - The RGB color values were recorded at the same spatial resolution and stored in a texture map with 8 bits per channel.
- **Appearance of Subjects**: All faces were without makeup, accessories, and facial hair. The subjects were scanned wearing bathing caps, which were removed digitally.
- **Pre-processing**: A vertical cut behind the ears, a horizontal cut to remove the shoulders, and a normalization routine that brought each face to a standard orientation and position in space
- **Samples**: The resultant faces were represented by approximately 70,000 vertices and the same number of color values

**Morphable 3D Face Model**

- **Geometry**: X, Y and Z of n vertices → (X1,Y1,Z1,X2,…,Yn,Zn)
- **Texture**: R, G and B color values of those n vertices → (R1,G1,B1,R2,…,Gn,Bn)
  - Simplification → the number of valid texture values in the texture map is equal to the number of vertices
- **Model** → Set of faces (S(a),T(b)) parametrized by the coefficients a and b
  - Shape vector: geometry vectors of the example faces weighted by a
  - Texture vector: texture vectors of the example faces weighted by b
  - Parameters: a and b control shape and texture
  - The probability distribution for the coefficients a(i) and b(i) is estimated from our example set of faces → this distribution enables us to control the likelihood of the coefficients a(i) and b(i) → regulates the likelihood of the appearance of the generated faces
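
To make the linear-combination idea concrete for myself, here is a minimal NumPy sketch. The function name `morph`, the toy data, and the choice to normalise the coefficients so they sum to 1 inside the function are my own assumptions for illustration, not code from the paper.

```python
import numpy as np

def morph(shapes, textures, a, b):
    """Blend m example faces into one new face.

    shapes, textures: arrays of shape (m, 3n), one flattened
    (X1, Y1, Z1, ...) or (R1, G1, B1, ...) vector per example face.
    a, b: length-m coefficient vectors, normalised here to sum to 1.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / a.sum()
    b = b / b.sum()
    # S(a) = sum_i a_i * S_i and T(b) = sum_i b_i * T_i
    return a @ shapes, b @ textures

# Toy data: 3 example "faces" with n = 2 vertices each (hypothetical values).
rng = np.random.default_rng(0)
shapes = rng.normal(size=(3, 6))
textures = rng.uniform(0, 255, size=(3, 6))

# Equal weights give the average face; a one-hot weight returns one example.
S_avg, T_avg = morph(shapes, textures, [1, 1, 1], [1, 1, 1])
S_first, T_second = morph(shapes, textures, [1, 0, 0], [0, 1, 0])
```

Shape and texture get separate coefficient vectors, so a face can take its geometry mostly from one example and its colors from another.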

- **Modeling**:
  - A multivariate normal distribution is fitted to our data set of 200 faces
    - Mean: averages of shape S and texture T
    - Covariance: covariance matrices CS and CT computed over the shape and texture differences
  - Principal Component Analysis
    - Basis transformation to an orthogonal coordinate system formed by the eigenvectors of the covariance matrices
    - Descending order of eigenvalues → the variances sigma(i)^2 are the eigenvalues of CS (and likewise for CT)
    - With m example faces, the spanned space has at most m-1 dimensions → shape (and texture) degrees of freedom
  - Segmentation: The expressiveness of the model can be increased by dividing faces into independent subregions that are morphed independently
    - e.g. eyes, nose, mouth and a surrounding region
    - Since all faces are assumed to be in correspondence, it is sufficient to define these regions on a reference face.
    - Equivalent to subdividing the vector space of faces into independent subspaces
    - A complete 3D face is generated by computing linear combinations for each segment separately and blending them at the borders according to an algorithm proposed for images
  - Data reduction applied to shape and texture data will reduce the redundancy of our representation, saving additional computation time
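
The PCA step above can be sketched roughly as follows. This is my own illustration, not the authors' code: the helper names (`fit_pca`, `synthesize`, `log_prior`) are hypothetical, I assume a 1/m normalisation of the covariance, and I use an SVD of the centered data instead of building the covariance matrix explicitly.

```python
import numpy as np

def fit_pca(data):
    """PCA of m example vectors (rows of `data`, shape (m, d)).

    Returns the mean, the principal directions (rows) and the standard
    deviations sigma along them, sorted in descending order.
    """
    m = data.shape[0]
    mean = data.mean(axis=0)
    centered = data - mean
    # SVD of the centered data avoids forming the d x d covariance matrix.
    _, singular, components = np.linalg.svd(centered, full_matrices=False)
    sigma = singular / np.sqrt(m)   # sigma**2 = eigenvalues of the covariance
    keep = m - 1                    # centering removes one degree of freedom
    return mean, components[:keep], sigma[:keep]

def synthesize(mean, components, alpha):
    """S = mean + sum_i alpha_i * s_i."""
    return mean + alpha @ components

def log_prior(alpha, sigma):
    """Gaussian log-prior up to a constant: -1/2 * sum_i (alpha_i / sigma_i)^2."""
    return -0.5 * np.sum((alpha / sigma) ** 2)

# Toy example: m = 5 "faces" with d = 12 coordinates each (hypothetical data).
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 12))
mean, components, sigma = fit_pca(data)

# Project the first example onto the basis and rebuild it from its coefficients.
alpha0 = (data[0] - mean) @ components.T
rebuilt = synthesize(mean, components, alpha0)
```

Faces with coefficients far from zero, measured in units of sigma, get a low prior, which is how the model penalises unlikely faces.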

**Facial Attributes**

- Shape and texture coefficients alpha(i) and beta(i) in the morphable face model do not correspond to the facial attributes used in human language.
  - Some facial attributes can easily be related to biophysical measurements → the width of the mouth
  - Others can hardly be described by numbers → facial femininity or being more or less bony
- Method for mapping facial attributes → Defined by a hand-labeled set of example faces
  - At each position in face space (that is, for any possible face), we look for shape and texture vectors that, when added to or subtracted from a face, will manipulate a specific attribute while keeping all other attributes as constant as possible.
  - In a performance-based technique: facial expressions = recording two scans of the same individual with different expressions → adding the differences to a different individual in a neutral expression.
  - Unlike facial expressions, attributes that are invariant for each individual are more difficult to isolate. The following method allows us to model facial attributes such as gender, fullness of faces, darkness of eyebrows, double chins, and hooked versus concave noses.
  - Based on a set of faces (Si, Ti) with manually assigned labels μi describing the markedness of the attribute, we compute weighted sums of the faces’ deviations from the average, which give the vectors (ΔS, ΔT) → Multiples of (ΔS, ΔT) can now be added to or subtracted from any individual face. (proved in paper)
  - For binary attributes, such as gender, we assign constant values μA for all mA faces in class A, and μB ≠ μA for all mB faces in B. Since this affects only the scaling of ΔS and ΔT, the choice of μA, μB is arbitrary.
- A different kind of facial attribute is “distinctiveness”, which is commonly manipulated in caricatures
  - Individual faces are caricatured by increasing their distance from the average face
  - In this representation, shape and texture coefficients alpha(i) and beta(i) are simply multiplied by a constant factor
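
Here is how I picture the attribute vectors and the caricature operation, as a small sketch. The weighted sum follows the paper's form ΔS = Σ μi (Si − S̄); the function names and the toy numbers are hypothetical and only there to make the arithmetic visible.

```python
import numpy as np

def attribute_direction(faces, mu):
    """Weighted sum of deviations from the average: sum_i mu_i * (face_i - mean)."""
    faces = np.asarray(faces, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return mu @ (faces - faces.mean(axis=0))

def caricature(face, mean_face, factor):
    """Scale the deviation from the average face; factor > 1 exaggerates it."""
    return mean_face + factor * (face - mean_face)

# Toy example: 4 faces with 6 coordinates, the first two labeled class A (+1),
# the last two class B (-1). All values are made up for illustration.
rng = np.random.default_rng(0)
faces = rng.normal(size=(4, 6))
delta_S = attribute_direction(faces, [1, 1, -1, -1])

# Push face 2 towards class A, then exaggerate face 0 by 50%.
more_A = faces[2] + 0.5 * delta_S
exaggerated = caricature(faces[0], faces.mean(axis=0), 1.5)
```

Because the deviations from the average sum to zero, giving every face the same label yields a zero vector, so only the differences between labels matter.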
**Matching Model to Image**

- In an **Analysis-by-Synthesis** loop, the algorithm creates a texture-mapped 3D face from the current model parameters → renders an image → updates the parameters according to the residual difference
- It starts with the average head and with rendering parameters roughly estimated by the user.

**Model Parameters**

- Coefficients of the 3D model are optimized along with a set of rendering parameters such that they produce an image as close as possible to the input image.
- Facial shape and texture: defined by coefficients alpha and beta
- Rendering parameters (rho): contain camera position (azimuth and elevation), object scale, image plane rotation and translation, intensity i of ambient light (R, G and B), and intensity of directed light (R, G and B). In order to handle photographs taken under a wide variety of conditions, rho also includes color contrast as well as offset and gain in the red, green, and blue channels.
- Other parameters, such as camera distance, light direction, and surface shininess, remain fixed to the values estimated by the user.

**Procedure**

- Rendering the color image (I_model) using perspective projection and the Phong illumination model
- The reconstructed image is supposed to be closest to the input image in terms of Euclidean distance (E_I)
- Matching a 3D surface to a given image
  - Ill-posed problem → Along with the desired solution, many non-face-like surfaces lead to the same image.
  - Essential to impose constraints on the set of solutions.
  - In the morphable model, shape and texture vectors are restricted to the vector space spanned by the database.
  - Within the vector space of faces, solutions can be further restricted by a tradeoff between matching quality and prior probabilities, using P(alpha), P(beta) and an ad-hoc estimate of P(rho).
- In terms of Bayes decision theory, the problem is to find the set of parameters (alpha, beta, rho) with maximum posterior probability, given an image I_input. While alpha, beta, and rendering parameters rho completely determine the predicted image I_model, the observed image I_input may vary due to noise.
  - For Gaussian noise with a standard deviation sigma_N, the likelihood of observing I_input is calculated.
  - Maximum posterior probability is then achieved by minimizing the cost function E
- Speed up the matching algorithm by implementing a simplified Newton method for minimizing the cost function
  - Instead of the time-consuming computation of derivatives for each iteration step, a global mapping of the matching error into parameter space can be used
- Optimization algorithm → uses an estimate of E based on a random selection of surface points.
  - Predicted color values I_model are easiest to evaluate in the centers of triangles.
  - In the center of triangle k, texture and 3D location are averages of the values at the corners.
  - Perspective projection maps these points to 2D image locations (p).
  - Surface normals n(k) of each triangle k are determined by the 3D locations of the corners.
  - According to Phong illumination, the color components of I_model (R, G, and B) are calculated from the direction of illumination (l), the normalized difference of camera position and the position of the triangle’s center (v(k)), the direction of the reflected ray (r(k)), surface shininess (s), and the angular distribution of the specular reflection, parametrized by gamma.
  - Phong illumination reduces to a simpler form if a shadow is cast on the center of the triangle (only the ambient term remains)
  - For high-resolution 3D meshes, variations in I_model across each triangle k are small, so E_I may be approximated by a sum over triangles weighted by ak (ak is the image area covered by triangle k; if the triangle is occluded, ak = 0)
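
To check my understanding of the per-triangle rendering and the cost function, I wrote the following sketch. It is an assumption-laden illustration rather than the authors' implementation: the function names are mine, the specular power is just called `exponent`, and the cost follows the form E = E_I / sigma_N^2 plus squared, sigma-normalised penalties on alpha, beta and rho as I read it in the paper.

```python
import numpy as np

def phong_color(albedo, normal, light_dir, view_dir, ambient, directed,
                shininess, exponent, in_shadow=False):
    """RGB color of one triangle center under ambient plus one directed light."""
    albedo, ambient, directed = (np.asarray(x, dtype=float)
                                 for x in (albedo, ambient, directed))
    if in_shadow:
        return ambient * albedo              # only ambient light reaches the point
    n, l, v = (np.asarray(x, dtype=float) / np.linalg.norm(x)
               for x in (normal, light_dir, view_dir))
    n_dot_l = max(float(n @ l), 0.0)
    r = 2.0 * n_dot_l * n - l                # light direction mirrored about the normal
    r_dot_v = max(float(r @ v), 0.0)
    specular = r_dot_v ** exponent if n_dot_l > 0 else 0.0
    return (ambient + directed * n_dot_l) * albedo + directed * shininess * specular

def image_error(input_colors, model_colors, areas):
    """E_I approximated per triangle: sum_k a_k * ||I_input(p_k) - I_model,k||^2."""
    diff = np.asarray(input_colors, dtype=float) - np.asarray(model_colors, dtype=float)
    return float(np.sum(np.asarray(areas, dtype=float) * np.sum(diff ** 2, axis=1)))

def map_cost(e_image, alpha, beta, rho, sigma_n, sigma_s, sigma_t, rho_mean, sigma_rho):
    """E = E_I / sigma_N^2 + prior penalties on alpha, beta and rho."""
    return float(e_image / sigma_n ** 2
                 + np.sum((alpha / sigma_s) ** 2)
                 + np.sum((beta / sigma_t) ** 2)
                 + np.sum(((rho - rho_mean) / sigma_rho) ** 2))

# Light, normal and camera all aligned: full diffuse term plus full highlight.
lit = phong_color([0.5, 0.5, 0.5], [0, 0, 1], [0, 0, 1], [0, 0, 1],
                  [0.2, 0.2, 0.2], [1, 1, 1], shininess=0.1, exponent=5)
shadowed = phong_color([0.5, 0.5, 0.5], [0, 0, 1], [0, 0, 1], [0, 0, 1],
                       [0.2, 0.2, 0.2], [1, 1, 1], shininess=0.1, exponent=5,
                       in_shadow=True)
e_i = image_error([[1, 0, 0], [0, 1, 0]], [[0, 0, 0], [0, 1, 0]], areas=[2, 3])
```

A large `sigma_n` shrinks the image term relative to the priors, so the optimum stays close to the average face.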

- Gradient descent
  - Contributions from different triangles of the mesh would be redundant
  - In each iteration, we therefore select a random subset K of 40 triangles k and replace E_I with its estimate over K → The probability of selecting k is proportional to ak.
  - This method of stochastic gradient descent
    - Is more efficient computationally
    - Helps to avoid local minima by adding noise to the gradient estimate.
- Before the first iteration, and once every 1000 steps, the algorithm
  - Computes the full 3D shape of the current model
  - Computes the 2D positions (px, py) of all vertices
  - Determines ak
  - Detects hidden surfaces and cast shadows in a two-pass z-buffer technique.
  - Assumption: occlusions and cast shadows are constant during each subset of iterations.
- Parameters are updated depending on analytical derivatives of the cost function E
  - Derivatives of texture and shape yield derivatives of 2D locations (p(k)), surface normals (n(k)), the triangle’s center (v(k)), the reflection ray (r(k)), and the Phong illumination I_model(k), using the chain rule.
- Coarse-to-fine strategy → To avoid local minima
  - The first set of iterations is performed on a down-sampled version of the input image with a low-resolution morphable model.
  - Start by optimizing only the first coefficients alpha(j) and beta(j) controlling the first principal components, along with all parameters rho(j) → In subsequent iterations, more and more principal components are added.
  - Starting with a relatively large sigma(N), which puts a strong weight on prior probability in the cost function E and ties the optimum towards the prior expectation value, we later reduce sigma(N) to obtain maximum matching quality.
  - In the last iterations, the face model is broken down into segments. With parameters rho(j) fixed, coefficients alpha(j) and beta(j) are optimized independently for each segment. This increased number of degrees of freedom significantly improves facial details.
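
The random-subset idea is easier to see on a toy problem. In the sketch below I replace the renderer with a made-up linear model (one color value per triangle, `basis @ theta`) and use a plain fixed-step gradient update, both of which are my simplifications and not the paper's update rule. The part taken from the notes above is the sampling: 40 triangles per step, chosen with probability proportional to their image area ak.

```python
import numpy as np

def stochastic_fit(basis, target, areas, steps=500, subset=40, lr=0.05, seed=0):
    """Fit theta so that basis @ theta matches target, sampling triangles by area.

    basis: (K, p) toy linear model of one color value per triangle center.
    target: (K,) observed values; areas: (K,) image area a_k (0 if occluded).
    """
    rng = np.random.default_rng(seed)
    prob = areas / areas.sum()               # P(select k) proportional to a_k
    theta = np.zeros(basis.shape[1])
    for _ in range(steps):
        k = rng.choice(len(areas), size=subset, p=prob)
        residual = basis[k] @ theta - target[k]
        # Gradient of the mean squared residual over the sampled subset only.
        grad = 2.0 * basis[k].T @ residual / subset
        theta -= lr * grad
    return theta

# Toy problem: 500 triangles, 3 parameters, the first 50 triangles occluded.
rng = np.random.default_rng(1)
basis = rng.normal(size=(500, 3))
theta_true = np.array([0.5, -1.0, 2.0])
target = basis @ theta_true
areas = rng.uniform(0.1, 1.0, size=500)
areas[:50] = 0.0

theta = stochastic_fit(basis, target, areas)
```

Occluded triangles have zero area and are therefore never sampled, which matches the convention ak = 0 for hidden surfaces.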
**Multiple Images**: when several images of a person are available

- While shape and texture are still described by a common set of alpha(j) and beta(j), there is now a separate set of rho(j) for each input image.
- E_I is replaced by a sum of image distances for each pair of input and model images, and all parameters are optimized simultaneously.

**Illumination-Corrected Texture Extraction**:

- Specific features of individual faces that are not captured by the morphable model, such as blemishes, are extracted from the image in a subsequent texture adaptation process.
- Extracting texture from images is a technique widely used in constructing 3D models from images
- To change pose and illumination, it is important to separate pure albedo at any given point from the influence of shading and cast shadows in the image
- In this approach, this can be achieved because our matching procedure provides an estimate of 3D shape, pose, and illumination conditions.
- Subsequent to matching, we compare the prediction for each vertex with the input image and compute the change in texture that accounts for the difference.
- In areas occluded in the image, we rely on the prediction made by the model.
- Data from multiple images can be blended using methods similar to this paper.
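
One way to read the texture extraction step is as inverting the illumination model per vertex. The sketch below assumes the predicted color has the form shading × albedo + specular, with the shading and specular terms taken from the fitted model. The paper only says that the change in texture accounting for the difference is computed, so this inversion, the function name, and the toy values are my own guesses.

```python
import numpy as np

def extract_albedo(observed, model_albedo, shading, specular, visible, eps=1e-6):
    """Recover per-vertex albedo by inverting I = shading * albedo + specular.

    observed, model_albedo, shading, specular: (n, 3) RGB arrays.
    visible: (n,) boolean mask; occluded vertices keep the model's albedo.
    """
    recovered = (observed - specular) / np.maximum(shading, eps)
    return np.where(visible[:, None], recovered, model_albedo)

# Toy example with two vertices: the first is visible, the second is occluded.
true_albedo = np.array([[0.4, 0.5, 0.6], [0.1, 0.2, 0.3]])
shading = np.array([[1.2, 1.2, 1.2], [0.5, 0.5, 0.5]])
specular = np.array([[0.1, 0.1, 0.1], [0.0, 0.0, 0.0]])
observed = shading * true_albedo + specular
model_albedo = np.full((2, 3), 0.3)
visible = np.array([True, False])

albedo = extract_albedo(observed, model_albedo, shading, specular, visible)
```

The visible vertex gets its true albedo back, while the occluded one falls back to the model's prediction.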

Please refer to this page for the implementation code and other material regarding this model: http://faces.cs.unibas.ch/bfm/main.php?nav=1-2&id=downloads

*To be continued…*
