The Beautiful Future

Face2Face

Small Octopus — Fri, 3 Mar 2023 15:35:05 +0900

Uses a multi-linear PCA model (based on the models below).
[3] BASEL
[1] The Digital Emily Project: photoreal facial modeling and animation.
[9] Facewarehouse

Eq. (1) is the geometric shape.
Eq. (2) is the skin reflectance.
a_id is of size 3xn, a_alb is of size 3xn, E_id is 3nx80, E_exp is 3nx76, and E_alb is 3nx80.
The mesh consists of 53,000 vertices and 106,000 faces.
rigid transformation $ \Phi $, full perspective transformation $ \Pi $, illumination $ \gamma $.
P = { alpha, beta, delta, R, t, k }
Illumination is approximated by the first three bands of the Spherical Harmonics (SH) basis functions.
Assumes a Lambertian surface and smooth distant illumination, neglecting self-shadowing.
[23] A signal-processing framework for inverse rendering 2001.

photo-consistency Ecol, facial feature alignment Elan, statistical regularizer Ereg.
w_col = 1, w_lan = 10, w_reg = 2.5e-5

photo-consistency: instead of least squares, the l2,1-norm [12], which is robust to outliers, is used.
[12] R1-pca: rotational invariant l1-norm principal component analysis for robust subspace factorization.
color distance per pixel: l2; summation over all pixels: l1 (enforces sparsity)
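
The l2,1 idea can be illustrated with a minimal sketch in plain Python. The residual values are made up for illustration and the function names are hypothetical, not taken from the paper: the l2 norm is taken per pixel over the color channels, and the per-pixel lengths are then summed (l1).

```python
import math

def l21_norm(residuals):
    """l1 sum over pixels of the l2 length of each per-pixel color residual."""
    return sum(math.sqrt(sum(c * c for c in r)) for r in residuals)

def squared_l2(residuals):
    """Ordinary least-squares energy: sum of all squared channel differences."""
    return sum(c * c for r in residuals for c in r)

# Four well-matched pixels and one badly mismatched pixel (hypothetical values).
inliers = [(0.1, 0.0, 0.0)] * 4
with_outlier = inliers + [(3.0, 4.0, 0.0)]   # residual of length 5

print(l21_norm(with_outlier))    # the outlier contributes its length, 5
print(squared_l2(with_outlier))  # the outlier contributes its squared length, 25
```

Because the outlier enters linearly rather than quadratically, a single mismatched pixel dominates the l2,1 energy far less than it dominates the least-squares energy.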

SOTA facial landmark tracking algorithm
[24] Deformable model fitting by regularized landmark mean-shift.

The regularizer keeps the parameters statistically close to the mean.
It prevents degenerations of the facial geometry and reflectance, and guides the optimization strategy out of local minima.

Data-parallel Optimization Strategy
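data-parallel GPU based Iteratively Reweighted Least Squares (IRLS) solver.
The key idea of IRLS is to transform the problem: in every iteration,
it becomes a non-linear least-squares problem by splitting the norm into two components.

r is a generic residual (the energy function defined above?), and Pold is the result computed in the previous step.
Gauss-Newton [29] Real-time expression transfer for facial reenactment TOG 2015.
A GN step is applied in every IRLS iteration, solving the equation below.
$ \textbf{J}^T\textbf{J} \delta^* = -\textbf{J}^T\textbf{F} $
The optimal linear parameter update delta* is found with PCG.
The Jacobian J and $ -\textbf{J}^T\textbf{F} $ are precomputed and stored, as in [29].
[29] Real-time expression transfer for facial reenactment.
As proposed in [33] and [29], the product of the old descent direction d with J^TJ inside the PCG solver is computed as successive matrix-vector products.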
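
A minimal IRLS sketch, assuming NumPy and a deliberately simple toy problem rather than the face-fitting energy. It uses the usual reweighting, in which each residual norm $ \|r(P)\|_2 $ is replaced by $ \|r(P_{old})\|_2^{-1} \|r(P)\|_2^2 $. The residual here is linear (r_i = p - b_i), so each reweighted least-squares problem has a closed-form solution and no Gauss-Newton step is needed. The function name and data are hypothetical.

```python
import numpy as np

def irls_l21(targets, iters=50, eps=1e-8):
    """Minimize sum_i ||p - b_i||_2 (an l2,1 energy) by IRLS."""
    p = targets.mean(axis=0)                      # plain least-squares start
    for _ in range(iters):
        # Weights come from the residuals of the previous iterate P_old.
        r_old = np.linalg.norm(p - targets, axis=1)
        w = 1.0 / np.maximum(r_old, eps)
        # Weighted least squares: argmin_p sum_i w_i ||p - b_i||^2.
        p = (w[:, None] * targets).sum(axis=0) / w.sum()
    return p

targets = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [10.0, 10.0]])
robust = irls_l21(targets)
least_squares = targets.mean(axis=0)
print(robust, least_squares)
```

The least-squares estimate is pulled toward the outlier at (10, 10), while the IRLS estimate stays inside the cluster near (1, 1).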

Supplemental Material
preconditioned conjugate gradient (PCG) method
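A parallel prefix scan is used to collect the visible pixels of the synthesized image.
The Jacobian of the residual vector F and the gradient of the energy, J^TF, are computed in parallel on the GPU.
Parallelization is possible because all partial derivatives and gradient entries are computed independently.
The Jacobian values are computed and stored in global memory.
A two-stage reduction is used to sum all local per-pixel gradients.
To update the parameters (delta) with the PCG method,
the system $ \textbf{J}^T\textbf{J} \delta^* = -\textbf{J}^T\textbf{F} $ is solved using the Jacobian and the gradient.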
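
A CPU-side sketch of the PCG solve for the normal equations, assuming NumPy, a small random Jacobian, and a Jacobi (diagonal) preconditioner. The preconditioner choice and all names are assumptions for illustration, not the paper's GPU implementation. J^TJ is never formed: it is applied as two successive matrix-vector products.

```python
import numpy as np

def pcg_normal_equations(J, F, iters=100, tol=1e-10):
    """Solve J^T J delta = -J^T F with Jacobi-preconditioned conjugate gradient."""
    apply_JtJ = lambda d: J.T @ (J @ d)       # two matrix-vector products
    M_inv = 1.0 / (J ** 2).sum(axis=0)        # inverse of diag(J^T J)
    b = -J.T @ F
    x = np.zeros_like(b)
    r = b.copy()                              # residual b - A x with x = 0
    z = M_inv * r
    d = z.copy()
    rz = r @ z
    for _ in range(iters):
        Ad = apply_JtJ(d)
        alpha = rz / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d             # new conjugate descent direction
        rz = rz_new
    return x

rng = np.random.default_rng(0)
J = rng.normal(size=(50, 6))                  # hypothetical Jacobian
F = rng.normal(size=50)                       # hypothetical residual vector
delta = pcg_normal_equations(J, F)
```

The result can be checked against a direct least-squares solve of J delta = -F, which has the same normal equations.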

Denoising Diffusion Probabilistic Models(DDPM)

Small Octopus — Tue, 28 Feb 2023 23:55:25 +0900

https://www.youtube.com/watch?v=_JQSMhqXw-4

Kim Jeong-seop (김정섭), Department of Industrial and Management Engineering, Korea University

text to image generation
EBMs Flow-based models GANs VAEs
DALL-E ( VAE-based, OpenAI January 2021 )
GLIDE ( diffusion, OpenAI December 2021 )
DALL-E 2 ( diffusion, OpenAI April 2022 )
Imagen ( diffusion, Google Brain May 2022 )

What is diffusion?
Statistical physics, thermodynamics
Deep Unsupervised Learning using Nonequilibrium Thermodynamics ICML 2015. (the origin)
Diffusion process

Markov Chain
Markov property: the probability of the state at t+1 depends only on the state at t.

Normalizing Flow
MLP-based probabilistic generative model, latent-variable-based probabilistic generative model; uses the change-of-variables formula to obtain z
$ p_x(x) = p_z(z) \vert \frac{dz}{dx} \vert $
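
A small numeric check of the change-of-variables formula in plain Python, assuming the simplest invertible map x = a z + b with z standard normal. The map and its constants are chosen for illustration only. In that case x must be normal with mean b and standard deviation a.

```python
import math

def normal_pdf(v, mean=0.0, std=1.0):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

a, b = 2.0, 1.0          # hypothetical flow: x = a * z + b

def p_x(x):
    z = (x - b) / a      # inverse transform
    dz_dx = 1.0 / a      # derivative of the inverse transform
    return normal_pdf(z) * abs(dz_dx)

print(p_x(2.5), normal_pdf(2.5, mean=b, std=a))  # the two values agree
```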

Overview of generative models
GAN: Adversarial training
VAE: maximize variational lower bound
Flow-based models: Invertible transform of distributions
Diffusion models: Gradually add Gaussian noise and then reverse
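Similar to flow-based models in that it uses iterative transformations.
Similar to VAEs in that training proceeds through variational inference on the distribution.
Recently, adversarial training has also been used to train diffusion models: Diffusion-GAN 2022.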
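
The "gradually add Gaussian noise" half can be sketched as follows, assuming NumPy and the DDPM-style Gaussian transition $ q(x_t \vert x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) $. The linear beta schedule, the step count, and the toy data are assumptions for illustration and are not stated in the notes above.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply the forward Markov chain: each step shrinks x and adds Gaussian noise."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.full(10000, 5.0)                 # toy "data": every sample equals 5
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
xT = forward_diffusion(x0, betas, rng)
print(xT.mean(), xT.std())               # close to 0 and 1: the data has become noise
```

The reverse half, learning to undo each step, is the part that is trained and is not shown here.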

Latent variable model
simple distribution (tractable Gaussian) → complex distribution (visual/audio pattern)
In the end, what we want from a generative model is to transform (mapping, transformation, sampling) a simple distribution z into a distribution with a specific pattern.

VAE

taobao Text/Speech-Driven Full-Body Animation

Small Octopus — Mon, 20 Feb 2023 11:58:39 +0900

Facial animation is divided into two major parts: lip movements in the lower face and diverse expressions in the upper face.
A multi-pathway framework generates the movements of the two facial parts respectively.

A facial motion capture device is used to collect 3 hours of human talking data with diverse expressions.
Both video data and 3D face parameter sequences, following the ARKit definition with 52 blendshapes, are recorded.

Lip movement generation
cross-modal transformer encoder to utilize both speech and textual information.
for the modeling of textual information, we extract phoneme alignment annotation according to
the speech and textual scripts by a time alignment analyzer such as the Montreal Forced Aligner toolkit.
Ph = {ph_t}, t=1,...,T.
MFCC and MFB features are concatenated and denoted as Au = {au_t}, t=1,...,T.
the transformer encoder takes a sequence of concatenated phoneme embedding and audio features as input
with a window size of 25 frames, whose duration is 1 second (25 fps).
the transformer encoder can effectively model the temporal context information
with a multi-head self-attention mechanism across different modalities.

Objective for Lip consists of two terms, including a shape term and motion term.

bt is 3D facial parameters.
articulation correction by the given phoneme label: the mouth should be closed during the pronunciation of b/p/m.

Expression generation
the facial expression in the upper face mainly lies in the movements of the eyes and eyebrows,
which are related to the speech rhythm and the intention of the speaker, with longer-time dependencies.
- rhythmic expression movement by learning-based framework.
an audio encoder to model the current speech signals as well as
a motion encoder to model the history expressions.
a transformer decoder is adopted to predict the final expr movements according to audio and history motion info.
since synthesizing expression is a one-to-many mapping, use SSIM loss to explore the structural similarity between the predicted expression and ground truth.

mu and sigma are the mean and standard deviation of the generated 3D facial parameter sequence.
cov is the covariance.
- intention-driven facial expression based on the semantic tags
semantic tags are extracted from textual scripts via sentiment analysis.
semantic tags include happiness, sadness, emphasis, fear, etc.
Actors are asked to perform more than 50 intention-based expressions according to the semantic tags.
fusing the generated rhythmic expression with the proper intention-driven expression triggered by the semantic tags.
integrate the lip movements to form the final expressive and diverse facial animation.

Body Animation
a graph based on an existing motion database.
motion segments according to the features of the given text/speech.
Motion graph construction
semantic motions: 24 kinds of actions( such as numbers, orientations, and special semantics)
non-semantic motions: declarative actions( upper body movements of standing, body center shifting movements, foot stepping movements)
node denotes a motion segment, edge denotes the cost of transition between two nodes.
obtain graph nodes
dividing each long sequence in database to obtain many small motion segments
dividing points: local minima of the motion strength,
build graph edges
connection relationship between motion segments
transition cost based on the distances between salient joint positions and movement speeds.
a graph edge can be created if the transition cost between adjacent nodes is below a threshold sigma.
semantic motions in the motion graph need to be obtained manually.
graph-based retrieval and optimization
rules: special semantic text and phonetic rhythm
given a section of text/speech P, analyze the input and divide it into phrases (Pi, i=1, ..., n)
according to the text structure, and find the special semantic text in the section.
meaningful motion segments are chosen by the semantic text and by the similarity of rhythm between motion segment and speech phrase.
motion rhythm is obtained from the motion strength, phonetic rhythm from audio analysis:
motion strength: Dancing to music. Advances in Neural Information Processing Systems, 32, 2019.
phonetic rhythm: librosa (Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, volume 8, pages 18–25. Citeseer, 2015.)

to assign a motion node to each text/speech phrase so that the cost is minimized

Ct( i, i + 1) is the transition cost between adjacent nodes.
Cp(i) accounts for the loss of Cs and Cr.
Cs(i) special semantic text.
Cr(i) phonetic rhythm.
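
Assigning one motion node per phrase under a per-phrase cost Cp and a pairwise transition cost Ct is a shortest-path problem over the phrase sequence. The sketch below solves it with dynamic programming (Viterbi style) in plain Python. It assumes the total cost is the sum of all Cp(i) and all Ct(i, i + 1); the cost tables are made-up numbers, and the paper's actual optimizer is not described in these notes.

```python
def assign_motion_nodes(phrase_cost, transition_cost):
    """phrase_cost[i][k]: Cp of using node k for phrase i.
    transition_cost[j][k]: Ct of moving from node j to node k."""
    n, m = len(phrase_cost), len(phrase_cost[0])
    best = list(phrase_cost[0])          # best[k]: cheapest cost ending in node k
    back = []                            # back-pointers for path recovery
    for i in range(1, n):
        prev, best, choice = best, [], []
        for k in range(m):
            j = min(range(m), key=lambda j: prev[j] + transition_cost[j][k])
            best.append(prev[j] + transition_cost[j][k] + phrase_cost[i][k])
            choice.append(j)
        back.append(choice)
    k = min(range(m), key=lambda k: best[k])
    total, path = best[k], [k]
    for choice in reversed(back):        # walk the back-pointers to the first phrase
        k = choice[k]
        path.append(k)
    return path[::-1], total

# Two candidate nodes, three phrases (hypothetical costs).
phrase_cost = [[0, 5], [5, 0], [0, 5]]
transition_cost = [[0, 1], [1, 0]]
path, total = assign_motion_nodes(phrase_cost, transition_cost)
print(path, total)   # [0, 1, 0] 2
```

A missing graph edge (transition cost above the threshold) can be modeled by setting the corresponding entry to `float("inf")`.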

Probability Basics

Small Octopus — Mon, 30 Jan 2023 10:29:12 +0900

Reference: 패턴인식 (Pattern Recognition), 오일석 (Oh Il-seok)

terms
probability, random variable, probability density function, conditional probability, joint probability
marginal probability, prior probability, likelihood, posterior probability, bayes rule, confidence.

A probability depends on how the set of events is defined.
Each event has a probability. Every probability is at least 0, and the probabilities of all cases sum to 1.
1. Coin toss -> heads, tails -> 0.5, 0.5
2. Die roll -> 1, 2, 3, 4, 5, 6 -> 1/6, ..., 1/6
3. Weather -> sunny, rain, snow
4. Lotto -> 6 numbers out of 1~45, no repetition, order ignored

A random variable represents the events with a single variable.
Coin toss: $ X \in \{\text{heads}, \text{tails}\} $
$ P(X=\text{heads}) = 0.5, P(X=\text{tails}) = 0.5 $

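The case of two random variables
The figure below shows one bag and two baskets.
The experiment consists of an event that selects a basket by drawing from the bag, followed by an event that draws a ball from the selected basket.
In other words, this system has an order.
There are two random variables, X and Y, as follows.
$ X \in \{A,B\}, Y \in \{\text{white}, \text{blue}\} $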

prior probability
Of the two successive events, selecting a basket from the bag happens before the second event,
so its probability is the prior probability.
Probability that A is drawn from the bag: $ P(X=A) = P(A) = 7 \div 10$

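conditional probability
The probability that a white ball is drawn from basket A, that is, the probability given that A has already been selected from the bag.
The condition does not change the actual probability value; it only states the condition in the notation.
$ P(\text{white}|A) = 2 \div 10 $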

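joint probability
The probability that A is drawn from the bag and a white ball is drawn from basket A.
We need the probability that the two events happen in succession, so we multiply.
$ P(X=A, Y=\text{white}) = P(A, \text{white}) = P(A)P(\text{white}|A) = 7 \div 10 \times 2 \div 10 = 7\div50 $
The joint probability is symmetric, $ P(X, Y) = P(Y, X) $, and each side factorizes by the product rule.
$ P(X=A, Y=\text{white}) = P(A)P(\text{white}|A) = P(Y=\text{white}, X=A) = P(\text{white})P(A|\text{white})$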

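marginal probability
The probability that a white ball is drawn in the end.
A white ball can be drawn in the end through the following two joint events.
1. A is drawn from the bag and a white ball is drawn from basket A
2. B is drawn from the bag and a white ball is drawn from basket B
These two cases do not happen in succession; they are alternative ways of ending up with a white ball.
So the two joint probabilities are added.
$ P(Y=\text{white}) = P(A, \text{white}) + P(B, \text{white}) $
$ P(B, \text{white}) = P(B)P(\text{white}|B) = 3 \div 10 \times 9 \div 15 = 9\div50 $
$ P(Y=\text{white}) = 7\div50 + 9\div50 = 8\div25$
The probability that a blue ball is drawn in the end.
$ P(Y=\text{blue}) = P(A, \text{blue}) + P(B, \text{blue}) = 1 - P(Y=\text{white}) = 17\div25 $
$ P(A)P(\text{blue}|A) + P(B)P(\text{blue}|B) = 7 \div 10 \times 8 \div 10 + 3 \div 10 \times 6 \div 15 = 17\div25 $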

independent
The case where the random variables cannot influence each other.
$ P(X,Y) = P(X)P(Y) $ must hold.
In the example above, to state the conclusion first, the basket selection ($X$) affects the ball color ($Y$), so they are not independent.
$ P(X,Y) $ is the joint probability.
$ P(X) $ is the basket-selection probability and $ P(Y) $ is computed as the marginal probability.
$ P(X=A, Y=\text{white}) = 7\div50 $
$ P(X=A)P(Y=\text{white}) = 7 \div 10 \times 8 \div 25 = 28 \div 125$
$ P(X=A, Y=\text{white}) \neq P(X=A)P(Y=\text{white}) $, so they are not independent.
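
The joint, marginal, and independence calculations above can be reproduced with exact fractions. This is a small sketch in plain Python using the numbers from the example; the dictionary layout is just one convenient representation, and the non-white color is labelled `blue`.

```python
from fractions import Fraction as Fr

prior = {"A": Fr(7, 10), "B": Fr(3, 10)}                  # P(X)
likelihood = {                                            # P(Y | X)
    "A": {"white": Fr(2, 10), "blue": Fr(8, 10)},
    "B": {"white": Fr(9, 15), "blue": Fr(6, 15)},
}
colors = ("white", "blue")

# Product rule: P(X, Y) = P(X) P(Y | X)
joint = {(x, y): prior[x] * likelihood[x][y] for x in prior for y in colors}
# Sum rule: P(Y) = sum over X of P(X, Y)
marginal = {y: sum(joint[x, y] for x in prior) for y in colors}
# Independence would require P(X, Y) = P(X) P(Y) for every pair.
independent = all(joint[x, y] == prior[x] * marginal[y] for (x, y) in joint)

print(joint["A", "white"], marginal["white"], marginal["blue"], independent)
# 7/50 8/25 17/25 False
```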

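likelihood
The likelihood takes the viewpoint of asking about the basket rather than about the ball color.
That is: a white ball came out, so which basket did it come from?
Ignoring the bag, we can ask which basket is more likely to produce a white ball.
The likelihood has the same value as the conditional probability, but because it does not sum to 1 over the baskets it is called a likelihood function.
$ P(\text{white}|A) = 2 \div 10, P(\text{white}|B) = 9 \div 15, P(\text{white}|A) +P(\text{white}|B) = 12 \div 15 $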

posterior probability
"A white ball came out, so which basket did it come from?"
Considering only the bag, $ P(A) = 7 \div 10 > P(B) = 3 \div 10 $, so one might answer basket A.
However, both the bag and the baskets have to be considered.
Seen from the other side, the condition "white ball" is given and the probability of each basket is asked for.
Because the probability is computed after the observation, it is called the posterior probability.
$ P(A|\text{white}), P(B|\text{white}), P(X|Y)$
We choose the basket with the larger value. How are these two values computed?
Bayes' rule, derived from the product rule of the joint probability, solves this.
$ P(X, Y) = P(Y, X) $
$ P(X)P(Y|X) = P(Y)P(X|Y) $
$ P(X)P(Y|X) \div P(Y) = P(X|Y) $
Here $ P(Y) $ is the marginal probability computed above (the probability that a white ball comes out).
$ P(X) $ is the prior probability (the probability of selecting each basket from the bag).
Finally, $ P(Y|X) $ is the likelihood (the probability of the ball color in each basket).
$ P(X|Y) = \frac{ likelihood \times prior \ probability}{marginal \ probability} $
$ P(A|\text{white}) = \frac{ P(\text{white}|A)P(A) }{P(\text{white})} = \frac{( 2 \div 10 )( 7 \div 10 )}{ 8 \div 25} = 0.4375 $
$ P(B|\text{white}) = \frac{ P(\text{white}|B)P(B) }{P(\text{white})} = \frac{( 9 \div 15 )( 3 \div 10 )}{ 8 \div 25} = 0.5625 $
With a confidence of 0.5625, we can say the white ball came from basket B.
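
Bayes' rule for this example can be written as a short function in plain Python, using exact fractions and the numbers given above. The function name `posterior` is chosen here for illustration.

```python
from fractions import Fraction as Fr

def posterior(prior, likelihood):
    """P(h | observation) = P(observation | h) P(h) / sum over h' of P(observation | h') P(h')."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)   # marginal probability
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

prior = {"A": Fr(7, 10), "B": Fr(3, 10)}                  # basket selection
white_given = {"A": Fr(2, 10), "B": Fr(9, 15)}            # P(white | basket)
post = posterior(prior, white_given)
print(float(post["A"]), float(post["B"]))   # 0.4375 0.5625
```

Although basket A has the larger prior, the much larger likelihood of white in basket B makes B the more probable source.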

Facial Expression Retargeting from Human to Avatar Made Easy

Small Octopus — Thu, 26 Jan 2023 12:45:30 +0900

IEEE Transactions on Visualization and Computer Graphics, 2020

DT: [7] Deformation transfer for triangle meshes, TOG 2004.
BS: [2] Facial retargeting with automatic range of motion alignment, TOG 2017.

Facial expression retargeting is a cross-domain problem.
blendshapes are not orthogonal: [6] Practice and theory of blendshape facial models, EuroG 2014.

two drawbacks of blendshape-based representation.
1. difficulty in representing expressions outside of the linear span.(exaggerated or unseen expressions)
2. creating the blendshapes for avatars still requires a tedious and time-consuming modeling process.
[17] Facewarehouse: A 3d facial expression database for visual computing, VCG 2013.
[32] Direct manipulation blendshapes, CGA 2010.

data-driven shape analysis methods
[43] Variational autoencoders for deforming 3d mesh models, CVPR 2018.
[44] Automatic unpaired shape deformation transfer, TOG 2018.

Characters used
Mery, Chubby, Conan : Face rigs: meryprojet.com, Tri Nguyen, www.highend3d.com

Network architecture

Attention Mesh: High-fidelity Face Mesh Prediction in Real-time

Small Octopus — Wed, 28 Dec 2022 11:11:06 +0900

Grishchenko, I., Ablavatski, A., Kartynnik, Y., Raveendran, K., Grundmann, M.: Attention Mesh: High-fidelity Face Mesh Prediction in Real-time. In: CVPR Workshops (2020)

Google Research

50 FPS on a Pixel2 phone.

puppeteering (puppetry)

Contribution: a 30% speedup while matching the quality of the multi-stage cascade approach

Coordinates are predicted directly, without using a 3D face model.

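Builds on [5], a two-stage structure that detects the face and then regresses landmarks.
[5] Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs, 2019.
However, regressing the whole face at once degrades quality in the perceptually important face parts.
So each face part is cropped and inferred by its own specialized network.
However, training and running several models, each on its own image input, is inefficient; sharing at the feature level saves time.
Using several models also requires CPU-GPU synchronization to pass data between the models, which is an additional cost.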

The region-specific heads use spatial transformers [4] to transform the feature maps.
[4] Spatial transformer networks. NIPS 2015.
Compared with the cascaded approach, this gives the same quality with a single model and a speedup.
This architecture is named attention mesh.
An additional advantage over separate models is that it is internally consistent and therefore easier to train.

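A model structure similar to [7] is used.
[7] A deep regression architecture with two-stage re-initialization for high
performance facial landmark detection. CVPR 2017.
[7] uses spatial transformers to build a network that is reliable under the initializations provided by various face detectors.
Although the goal of this paper differs from [7], a single model improves quality on the salient face regions.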

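Input: a 256x256 face image, detected or tracked. Output: a 64x64x32 face feature map.
The feature map is split into four branches for inference: left eye, right eye, mouth, and whole face.
The whole-face branch takes the full 64x64x32 feature map; the others infer from feature maps cropped
to 24x24x32.
The whole-face module predicts 478 3D landmarks and defines the ROI of each submodule.
The eye and mouth modules predict from the 24x24x32 feature maps obtained by the attention mechanism.
The eye module predicts the pupil separately after the feature map has been reduced to 6x6. This reuses the eye
features and lets the pupil landmarks move more dynamically than the more static eye landmarks.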

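Each submodule lets network capacity be dedicated to its face region, which raises quality.
For further quality gains, the eyes and mouth are aligned horizontally and normalized to a fixed size.
The attention mesh network is trained in two phases.
1. Each submodule is trained independently using the GT landmark coordinates.
2. The submodules are retrained using the landmark coordinates predicted by the model itself.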

Attention mechanism
[2] Draw: A recurrent neural network for image generation, 2015.
[4] Spatial transformer networks. NIPS 2015.
The attention mechanisms above sample a grid of 2D points in feature space
and extract the features under those points in a differentiable way, through a 2D Gaussian kernel or through an affine
transformation with interpolation. This makes the network trainable end to end and enriches the features used
by the attention mechanism. We use the transformer module of [4].
It is defined by an affine transform, so the sampled grid of points can be zoomed, rotated, translated, and skewed.
This affine transform can be constructed with supervision or computed from the output of the face mesh submodel.
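
A NumPy sketch of affine grid sampling with bilinear interpolation, in the spirit of the spatial transformer of [4]. The normalized [-1, 1] coordinate convention, the border clamping, the function name, and the example transform are all assumptions for illustration; the paper's exact layer is not reproduced here.

```python
import numpy as np

def affine_crop(features, theta, out_size):
    """Sample an out_size x out_size grid from features (H, W, C).
    theta is a 2x3 affine matrix mapping normalized output coordinates
    in [-1, 1] to normalized input coordinates in [-1, 1]."""
    H, W, _ = features.shape
    lin = np.linspace(-1.0, 1.0, out_size)
    gx, gy = np.meshgrid(lin, lin)                           # output grid
    grid = np.stack([gx, gy, np.ones_like(gx)], axis=-1)     # homogeneous coords
    src = grid @ theta.T                                     # (out, out, 2): x, y
    x = (src[..., 0] + 1) * (W - 1) / 2                      # to pixel coordinates
    y = (src[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = np.clip(x - x0, 0.0, 1.0)[..., None]                # clamped at the border
    wy = np.clip(y - y0, 0.0, 1.0)[..., None]
    top = (1 - wx) * features[y0, x0] + wx * features[y0, x0 + 1]
    bottom = (1 - wx) * features[y0 + 1, x0] + wx * features[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom                      # bilinear blend

rng = np.random.default_rng(0)
features = rng.random((64, 64, 32))                          # stand-in feature map
theta = np.array([[0.375, 0.0, 0.2],                         # zoom 24/64, shift right
                  [0.0, 0.375, -0.1]])                       # zoom 24/64, shift up
crop = affine_crop(features, theta, 24)
print(crop.shape)   # (24, 24, 32)
```

Because every output value is a weighted sum of four input values with weights that depend smoothly on theta, gradients can flow to both the features and the transform parameters.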

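2D landmarks were manually annotated on 30,000 images, and the z values were approximated using a synthetic model.

For evaluation, the model is compared with models trained per face part, run in sequence as a cascade: base mesh, eyes, mouth.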

Performance (speed comparison)

The attention mesh model is 25% faster than the cascade of separate face and region models.
[6] On-device neural net inference with mobile gpus. 2019.
was used to measure TFLite GPU inference speed.
Region models: 8.82+4.18+4.7 = 17.7 ms, to which the 4.7 ms spent on CPU-GPU sync must be added.

Mesh quality
Errors are normalized by the 3D interocular distance.
The attention mesh model outperforms the cascade of models on the eye region and is comparable on the mouth region.

AR Makeup
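Wrong landmark positions easily fall into the uncanny valley.
Compares the base mesh with the attention mesh with submodels.
An A/B test was run on 10 images with 80 participants: 46% of the AR samples were classified as real images
with lipstick applied, and 38% of the real images were classified as AR.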

Puppeteering
Can be used for puppeteering or as a trigger.
[3] Dual laplacian morphing for triangular meshes. Computer Animation and Virtual Worlds 2007.
Laplacian mesh editing to morph a canonical mesh into the predicted mesh.

Conclusion
Real-time, unified, accurate face mesh prediction.
differentiable attention mechanism.
Instead of running independent region-specific models, computation is dedicated to each salient face region within one model.

StackOverflow

Q.
How to convert Mediapipe Face Mesh to Blendshape weight (https://stackoverflow.com/questions/68169684/how-to-convert-mediapipe-face-mesh-to-blendshape-weight)

A1.
(There are two directions.)
Blendshape generation can be divided into two methods:

(Computing the weights directly from the landmarks)
Direct math from mesh landmarks:

kalidokit, https://github.com/yeemachine/kalidokit
Mefamo, https://github.com/JimWest/MeFaMo

(Using a network)
AI model:

mocap4face, https://github.com/facemoji/mocap4face
AvatarWebKit, https://github.com/Hallway-Inc/AvatarWebKit

(Build paired data.)
With the rapid development of supervised learning, collecting face and 52-bs paired datasets seems the best way to solve this problem.

==== update 2022.11.21 =====

NVIDIA has released maxine-ar-sdk to compute face blendshapes. The predicted blendshapes are slightly different from ARKit 52. I have successfully compiled it and run it well on Windows with RTX-20 or RTX-30 cards.

If anyone really needs one mediapipe-based solution, just comments. I can contribute to label CC face datasets for fine-tuning your own models with NMAXINE-AR-SDK.

A2.
My approach: First sample many pairs of random blendshapes -> face mesh (detecting face mesh on 3D model), and then learning an inverse model from that. (A simple neuronet would do)

Therefore you end up with a model that can give blendshapes given a face mesh.

The catch, which is also mentioned in the above blurb, is that you wanna handle different face mesh inputs. In the above blurb it seems that they sample the 3D model but transform the sampled mesh into the canonical face mesh, and hence end up with a canonical inverse model. At inference you transform a given mesh into the canonical face mesh as well.

Another solution might be to directly transform your different people's face meshes into the 3D model's mesh.

COMA

Small Octopus — Fri, 23 Sep 2022 15:58:51 +0900

Generating 3D faces using Convolutional Mesh Autoencoders ECCV 2018.

Abstract

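Previous methods use linear subspaces or higher-order tensor generalizations. Because of this linearity they cannot capture extreme deformations and non-linear expressions.
The paper therefore proposes a model that can represent faces non-linearly, made possible by applying spectral convolutions on the mesh surface.
Mesh sampling operations that enable a hierarchical mesh representation capture non-linear variations of shape and expression at multiple scales.
In a variational setting, the model can sample diverse realistic 3D faces from a multivariate Gaussian distribution.
The training set contains 20,466 meshes of 12 subjects; despite the limited data, the model shows 50% lower reconstruction error while using 75% fewer parameters than existing face models.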

3. Mesh Operators

F = (V, A), V ∈ R^{n×3}, A ∈ {0,1}^{n×n}
A_ij = 1: vertices i and j are connected by an edge; A_ij = 0: not connected.
non-normalized Laplacian: L = D - A, D_ii = sum_j A_ij
[15] Spectral graph theory. No. 92, American Mathematical Soc. (1997) <-- graph Fourier transform
diagonalized Laplacian: L = U Λ U^T
Fourier basis: U ∈ R^{n×n}, U = [u_0, u_1, ..., u_{n-1}]
eigenvectors of L: u_i
Λ = diag([λ_0, λ_1, ..., λ_{n-1}]) ∈ R^{n×n}
mesh vertices: x ∈ R^{n×3}
graph Fourier transform: x_w = U^T x
inverse graph Fourier transform: x = U x_w
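
A tiny NumPy sketch of these operators on a 4-vertex cycle graph. The graph and vertex positions are toy values chosen for illustration. The forward transform multiplies by U^T and the inverse transform multiplies by U, which recovers the original vertices because U is orthogonal.

```python
import numpy as np

# Adjacency of a 4-cycle: 0-1-2-3-0 (toy mesh connectivity).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))            # D_ii = sum_j A_ij
L = D - A                             # non-normalized Laplacian

lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T, U orthonormal

x = np.array([[0, 0, 0],              # vertex positions, n x 3
              [1, 0, 0],
              [1, 1, 0],
              [0, 1, 0]], dtype=float)
x_w = U.T @ x                         # graph Fourier transform
x_back = U @ x_w                      # inverse graph Fourier transform
print(np.allclose(x_back, x))         # True
```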

3.1 Fast spectral convolutions
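Convolution is defined as a Hadamard product in Fourier space: x ∗ y = U((U^T x) ⊙ (U^T y))
Because the matrix U is not sparse, this is computationally expensive. Using recursive Chebyshev polynomials [17, 23],
the mesh filtering kernel g_θ can be defined instead.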

Scaled Laplacian: L̃ = 2L/λ_max − I_n.
θ ∈ R^K is a vector of Chebyshev coefficients.
T_k ∈ R^{n×n} is the Chebyshev polynomial of order k, which can be computed recursively as T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x)
with T_0 = 1 and T_1 = x. The spectral convolution can then be defined as in [17].

y_j is the j-th feature of y ∈ R^{n×F_out}.
The input x ∈ R^{n×F_in} has F_in features.
For the input face mesh F_in = 3, corresponding to the 3D vertex positions.
Each convolutional layer has F_in × F_out vectors of Chebyshev
coefficients, θ_{i,j} ∈ R^K, as trainable parameters.
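
A NumPy sketch of the Chebyshev convolution, assuming the standard form from [17] in which y_j is the sum over input features i and orders k = 0..K-1 of θ_{i,j,k} T_k(L̃) x_i. The coefficients are stored here as an array of shape (K, F_in, F_out), and the toy graph and random weights are illustrative only. The recursion is applied directly to the signal, so no eigenvectors of L are needed at filtering time (only λ_max for the scaling).

```python
import numpy as np

def chebyshev_conv(L, x, theta):
    """x: (n, F_in) input features; theta: (K, F_in, F_out) Chebyshev coefficients."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(n)          # scaled Laplacian
    K = theta.shape[0]
    T_prev, T_curr = x, L_tilde @ x                  # T_0(L~) x = x, T_1(L~) x = L~ x
    y = T_prev @ theta[0]
    if K > 1:
        y = y + T_curr @ theta[1]
    for k in range(2, K):
        # T_k(L~) x = 2 L~ T_{k-1}(L~) x - T_{k-2}(L~) x
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
        y = y + T_curr @ theta[k]
    return y

A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                        # toy 4-cycle Laplacian
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                           # F_in = 3 (vertex positions)
theta = rng.normal(size=(6, 3, 16))                   # K = 6, F_out = 16
y = chebyshev_conv(L, x, theta)
print(y.shape)   # (4, 16)
```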

3.2 Mesh Sampling
A hierarchical multi-scale representation is used to capture local and global context.
Local context is captured in the shallow layers and global context in the deep layers.
A mesh can be viewed as an n×3 tensor, but applying conv changes the dimensions.
Applying the mesh sampling operation preserves the context of neighboring vertices.
quadric matrices [20] Surface simplification using quadric error metrics. Computer graphics and interactive
techniques 1997.

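Suppose a mesh with m vertices is down-sampled:
down-sampling transform matrix Q_d ∈ {0,1}^{n×m}, up-sampling transform matrix Q_u ∈ R^{m×n}, m > n.
Down-sampling is obtained by iteratively contracting vertex pairs, using quadric matrices [20] so that the contraction maintains the surface error approximation.
In figure (a) below, the red vertices are contracted. The remaining blue vertices are a subset of the original mesh, V_d ⊂ V.
q: original vertices, m of them
p: down-sampled vertices, n of them
Q_d(p, q) ∈ {0, 1} indicates whether vertex q is kept or discarded during down-sampling.
Because lossless down- and up-sampling is not feasible for general surfaces, the up-sampling matrix is built during down-sampling.
Convolution is applied on V_d (b -> c). The vertices remaining in (c) are kept during up-sampling (c -> d).
The red vertices that were discarded are mapped onto the faces of the down-sampled mesh using barycentric coordinates.
A red vertex v discarded in (b) is projected onto the closest triangle (i, j, k) and expressed with barycentric weights as w_i v_i + w_j v_j + w_k v_k.
These weights are written into Q_u: Q_u(q, i) = w_i, Q_u(q, j) = w_j, Q_u(q, k) = w_k, and Q_u(q, l) = 0 otherwise.
V_u = Q_u V_d.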
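
The barycentric weights that fill one row of Q_u can be computed as in the following NumPy sketch. The triangle and vertex are toy values and the function name is hypothetical. Finding the closest triangle, and handling a projection that falls outside it, are left out.

```python
import numpy as np

def barycentric_weights(v, a, b, c):
    """Project v onto the plane of triangle (a, b, c) and return (w_a, w_b, w_c), projection."""
    normal = np.cross(b - a, c - a)
    normal = normal / np.linalg.norm(normal)
    p = v - np.dot(v - a, normal) * normal          # projection onto the plane
    # Solve p - a = s (b - a) + t (c - a) for s, t.
    E = np.stack([b - a, c - a], axis=1)            # 3 x 2 edge matrix
    s, t = np.linalg.lstsq(E, p - a, rcond=None)[0]
    return np.array([1.0 - s - t, s, t]), p

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
v = np.array([0.25, 0.25, 0.5])                     # discarded vertex above the triangle
w, p = barycentric_weights(v, a, b, c)
print(w)   # weights for vertices a, b, c; they sum to 1
```

With these weights, the up-sampled position is w[0] * a + w[1] * b + w[2] * c, which equals the projected point p.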

Chebyshev convolutional filters with K = 6 Chebyshev polynomials.
Biased ReLU [21]: Deep sparse rectifier neural networks, Artificial Intelligence and Statistics (2011)

3D Shape Regression for Real-time Facial Animation TOG2013

Small Octopus — Sat, 27 Aug 2022 21:26:06 +0900

The algorithm is designed to be trained and used without a face detector, under the assumption that the face appears within a certain size range in the given camera.

User-specific face model
-- 15 rigid motion
yaw: -90, -60, -30, 0, 30, 60, 90
pitch: -30, -15, 15, 30
roll: -30, -15, 15, 30
-- 45 non-rigid motion
yaw: -30, 0, 30
mouth stretch, smile, brow raise, disgust, anger,
squeeze left/right eye, jaw left/right, grin,
chin raise, lip pucker, lip funnel, cheek blowing, eye closed.

User-specific Blendshape Generation
FaceWarehouse contains 150 individuals with 46 FACS blendshapes.
11K mesh vertices x 50 identity knobs x 47 expression knobs.

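The camera intrinsics are assumed to be known. Given the user-specific face images,
the distance between the projected 3D model vertices and the 2D image landmarks is minimized with a coordinate-descent method.

1. for each input image, find Mi, Wid,i, Wexp,i.
2. refine Wid, which should be the same for all images (fixing Mi and Wexp,i).
Optimization for finding the single Wid of one person over all images

These two steps are repeated until convergence (about 3 iterations). The algorithm of Yang et al. 2011 is used to update the vertex indices accordingly. Once Wid is found, the expression blendshapes can be built. 47 of the expression modes in FaceWarehouse are used. di is a one-hot vector whose i-th entry is 1.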

Training Set Construction
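Training the 3D shape regressor requires a training set of 3D landmarks.
The training set is built through an optimization process. The problem now becomes finding the blendshape alpha values (expression coefficients).

As a regularization term: for the predefined expressions, an approximate ground truth is available.
The coefficients should be similar to the a* values used in Li et al. 2010.

Combining the two terms above, the following equation is solved.

This equation is optimized iteratively with a coordinate-descent method, alternately fixing one of the two parameters.
a is initialized to a*.
M is computed with the POSIT algorithm.
a is computed with the gradient projection algorithm based on the BFGS solver (constrained to 0~1).
Wreg is fixed to 10.
The vertex indices are updated in every optimization iteration.
The 3D mesh in camera coordinates can be computed with the equation below, and the 3D landmarks {S_i^o} are extracted from it.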

Data Augmentation
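The 3D shape is translated along x, y, z in camera coordinates, so that m-1 additional shapes are obtained per image,
m per image including the original (Sij), 1<= j <= m, with S_i^o = Si0.
Rather than transforming the image itself for training, the change is stored in the transform matrix M so that the original can be recovered.
That is, the translated 3D shape is stored together with its corresponding M, while the corresponding image stays unchanged.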

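Temporal Initial Shape Dataset
At run time the initial 3D shape can be taken from the previous frame, so the training set also pairs each sample
with 3D shapes that look as if they had been computed in the previous frame (S_ij^c).
Out of the 60 original 3D shapes, the G most similar shapes are selected (Sig, 1 <= g <= G),
and H shapes are chosen at random from those obtained in the Data Augmentation step (Sigjh, 1 <= h <= H).
This yields GH initial 3D shapes in total for each Sij. Each training sample is expressed as in the equation below.

The values actually used are n = 60, m = 9, G = 5, H = 4.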

Camera Calibration
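Instead of a conventional calibration, a method that calibrates from the user-setup images.
Assume the simplest pinhole model: fx=fy=f, cx=c_imgx, cy=c_imgy, shear=0; then only f has to be found.
Vary f, run the User-specific Blendshape Generation fitting for each value, and use the f that gives the smallest fitting error.
The paper reports satisfactory results with this.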

Face Tracking
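How to extract the transform M and the expression coefficients from the 3D shape regression result.

In this step the vertices do not need to be updated, because the 3D shape regression result can be regarded as actual 3D
coordinates rather than coordinates visible on the image. Hence the intrinsics (Q) are not multiplied in, and the indices k of vk are fixed.
An animation prior (GMM) is used for temporal coherence in tracking, in the same way as Weise et al 2011.

Wprior is set to 1.
1. Using the a computed in the previous frame as the initial value, compute M between the regressed S and the blendshape S.
This is a 3D registration problem, solved by SVD on the cross-covariance matrix [Besl and McKay 1992].
2. Then fix M and optimize the expression coefficients with an iterative gradient solver.
The gradient of Eprior is derived in advance. Solved with the gradient projection algorithm based on the BFGS solver, constrained to 0~1.

The two steps above are repeated until convergence; 2 iterations were enough.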
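
Step 1 is a rigid registration between corresponding 3D point sets. A minimal NumPy sketch of the closed-form SVD solution on the cross-covariance matrix follows. It uses synthetic points, a hypothetical function name, and assumes M is a rotation plus translation without scale.

```python
import numpy as np

def rigid_align(P, Q):
    """Find R, t minimizing sum_i ||R p_i + t - q_i||^2 for corresponding rows of P and Q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                    # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(20, 3))                     # stand-in for blendshape landmarks
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.3])
Q = P @ R_true.T + t_true                        # stand-in for regressed landmarks
R, t = rigid_align(P, Q)
print(np.allclose(R, R_true), np.allclose(t, t_true))   # True True
```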

Evaluation and Comparison
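Results are compared with 3D values obtained from a Kinect at manually annotated 2D positions, using the Kinect depth and projection matrix.
The error was below 1 cm.
For the 2D regression results, a qualitative comparison is made with two methods: Face alignment by explicit shape regression, and the optical-flow-based Face transfer with multilinear models.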

Facial Retargeting with Automatic Range of Motion Alignment

Small Octopus — Sat, 20 Aug 2022 00:09:21 +0900

TOG 2017

INTRODUCTION
facial animation retargeting addresses the general problem of animation transfer between virtual characters, with the transfer of performance capture to virtual characters being the main application.
Recent developments in vision- and depth-sensor-based facial motion capture
---Cao et al. 2014;
Displaced Dynamic Expression Regression for Real-time Facial Tracking and Animation TOG 2014.
---Ichim et al. 2015;
Dynamic 3D Avatar Creation from Hand-held Video Input TOG 2015.
---Li et al. 2013;
Realtime Facial Animation with On-the-fly Correctives, TOG 2013.
---Thies et al. 2016;
Face2Face: Real-Time Face Capture and Reenactment of RGB Video, CVPR 2016.
---Weise et al. 2011;
Realtime Performance-based Facial Animation TOG 2011.
made accurate captures of an actor, traditionally limited to big film or game studios, affordable to a much broader audience.
current real-time capture systems typically adapt a realistic generic blendshape model to the actor.
since the modified and the original character have semantically equivalent blendshapes, the captured actor performance is then transferred between the characters by directly mapping the blendshape weights.
The special case of equivalent blendshapes between two characters is often named parallel parameterization in the retargeting context.
In practice, it is uncommon to encounter facial rigs with a complete set of semantically equivalent blendshapes.
creating facial rigs for animation is time consuming and requires highly skilled artists.
therefore, a rig is carefully designed to fit the animation needs, only modeling the necessary expressions.
in addition, expressive digital characters are often stylized and exaggerate the facial proportions of humans.
An effective retargeting method must either transfer animation from facial motion capture markers to a blendshape rig or between faces with different blendshape sets.
several retargeting approaches generate their own parallel parameterization, by transferring the blendshapes of the character face rig to align with the actor's proportions.
However, especially for stylized characters the step often fails, due to differences in range of motion or the shortcomings of current methods.
The subsequent blendshape estimation becomes erroneous, which has been addressed so far by incorporating additional priors.
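We propose a method that generates actor-specific blendshapes, semantically corresponding to the character's face rig, from a training sequence of the actor's facial motion.
In an unsupervised manner, we show that the training sequence can sufficiently represent the actor's range of motion.
We show that parallel parameterization is possible even when the facial proportions of the face rig and the actor differ greatly.
The key observation is that facial motion is similar according to FACS even across different levels of stylization.
FACS describes facial expressions in terms of the underlying facial muscles.
This system is commonly used as a reference when creating blendshapes for stylized or realistic characters.
Based on a new manifold alignment approach and a new energy-based similarity measure, we successfully align the range of motion between the actor and the character rig, which in turn leads to accurate retargeting.
Our second contribution is a prior energy based on physically inspired deformations, which can be computed efficiently even in real-time settings.
Our geometric prior resolves several artifacts that remain even with an accurate parallel parameterization.
Results are on par with or better than the current SOTA offline method.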
--- Seol et al. 2012
Spacetime Expression Cloning for Blendshapes. TOG 2012.

RELATED WORK
As a key element of human-centered applications, research on virtual faces and face animation has been an active field of research for decades, resulting in a wide range of publications on this topic. For a general overview we recommend the
--- Parke and Waters 2008, BOOK.
Computer Facial Animation. AK Peters Ltd.
--- Orvalho et al. 2012, surveys focusing on rigging
A Facial Rigging Survey. In Eurographics State of the Art Reports.
--- Lewis et al. 2014. , Blendshape animation
Practice and Theory of Blendshape Facial Models, In Eurographics State of the Art Reports.

Cross-Mapping
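Directly learns a mapping between semantically corresponding captured facial expressions and the target rig,
and synthesizes new poses in an example-based way.
--- Buck et al. 2000, piece-wise linear mapping
--- Wang et al. 2004, locally linear embedding
--- Deng et al. 2006. RBFs.
--- Song et al. 2011. kCCA.
--- Kholgade et al. 2011. simplicial basis.
--- Bouaziz and Pauly 2014. Gaussian Process Latent Variable Models.
The advantage of cross-mapping is that it applies to any character, even one with a different number of eyes.
Its drawback is that the result depends on the quality and number of the given training examples.
Often 15-20 corresponding examples are required for adequate results.
For 40 blendshapes, 600-800 parameters must be defined manually.
Even when the training examples are consistent, the resulting expression remains a complex interpolation of the training data (?).
This often gives inaccurate results that differ greatly from the training examples.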

Parallel Parameterization
By creating semantically equivalent facial rigs, animation can be transferred to another character in a simple way.
This is labor-intensive work. It requires excellent modeling skills and anatomical knowledge of the face, and it takes a lot of time.

To automate this process, several approaches proposed transferring from a generic face model to a neutral face target.
--- Noh and Neumann 2001, dense correspondences, transfer per-vertex displacements for each expression.
--- Sumner and Popović 2004, deformation gradients.
--- Orvalho et al. 2008; Seol et al. 2012, 2011, Radial Basis Function.
--- Li et al. 2010. ranging from incorporating examples.
--- Saito 2013. contact constraints.
--- Xu et al. 2014, interactive editing
--- Bouaziz et al. 2013. Ichim et al. 2015. Seol et al. 2016. iterative refinement schemes for real humans.
These fail when the source and target shapes differ strongly: the transferred expression is exaggerated or too weak.
--- Seol et al. [2012] tried to solve the proportional mismatch problem using velocity.
We improve on this by automatically adapting to the actor's range of motion.
Given the actor's motion sequence and sparse correspondences, our method automatically transfers the blendshapes of the face rig into the actor's space.

Manifold-based Techniques.

Realtime Performance-Based Facial Animation

Small Octopus — Sun, 14 Aug 2022 10:39:44 +0900

ACM transactions on graphics (TOG) 2011

Abstract
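A technique that lets a user apply performance-based character animation to a character in real time using a Kinect.
Kinect data is noisy. To turn low-resolution images and noisy 3D data into realistic expressions efficiently,
geometry and texture registration is combined with pre-recorded animation priors in a single optimization problem.
A maximum a posteriori estimation formulated in a reduced parameter space is solved.
The paper shows that compelling 3D facial dynamics can be reconstructed without markers, intrusive lighting, or scanning hardware.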

Overview
track rigid and non-rigid motion of user's face
map the extracted tracking parameters to suitable animation controls
solve the parameters of user specific expression model given the observed 2D and 3D data.
a suitable probabilistic prior from prerecorded animation sequences that define the space of realistic facial expressions.

Blendshape Representation
The underlying hypothesis is that a person's blendshape weights provide enough abstraction to be transferred between different characters.

Acquisition Hardware
Kinect 사용, low resolution and high noise levels of input data is the primary challenge.

Realtime Tracking
Rigid motion and non-rigid motion are separated.

Rigid Tracking
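The mesh of the previous frame is aligned to the depth map with ICP using point-plane constraints.
To stabilize the alignment, a pre-segmented template is used that keeps only the region above the cheeks and excludes the parts of the face that deform strongly.
A filter against high-frequency flickering is applied to the translation and rotation.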

Non-rigid Tracking
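The goal is to produce blendshape weights that match the user's expression as closely as possible while staying inside the space of realistic human expressions.
Blendshape parameters by themselves do not distinguish realistic expressions and can easily produce meaningless shapes.
Fitting with geometry and texture constraints alone rarely gives satisfactory results because of the noise.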

Statistical Model
The blendshape weights are regularized to prevent unrealistic face poses.
The dynamic expression prior is computed from existing blendshape animations $$ \textbf{A}=\{A_1, ..., A_l\} $$.
$$ A_j = \{a^1_j, ..., a^k_j\}, a^i_j \in \mathbb{R}^m $$
m-dimensional blendshape space. Temporal coherence is exploited by considering consecutive frames within a window of size n.
This is effective for both the geometric structure and the motion of the face.

MAP Estimation
$$ input \ data: D_i = (G_i, I_i), depth \ map: G_i, color \ image: I_i $$
$$ blendshape \ weights: \textbf{x}_i \in \mathbb{R}^m $$
$$ previous \ blendshape \ weights: X_n^i = \{ x_{i-1}, ..., x_{i-n} \} $$

Using Bayes' rule

Assuming that D is conditionally independent of Xn given x