DETAILED ACTION
This office action is in response to the applicant’s amendments and response filed on 10/30/2025.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, filed 10/30/2025, with respect to the rejection(s) of claim(s) 1-20 under 35 USC § 103 have been fully considered and are persuasive. The amendments to independent claims 1, 11, and 17 have overcome the existing rejection; therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Oz et al. (US 20230247180 A1, hereinafter "Oz"), which teaches the additional limitations in the amended independent claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-3, 8, and 17-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares et al. (US 11830182 B1, hereinafter "Soares") in view of Oz et al. (US 20230247180 A1, hereinafter "Oz") and Gafni et al. (Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction, hereinafter "Gafni").
Regarding claim 1, Soares discloses: a method, comprising:
obtaining, by a computing system comprising one or more processor devices, video data that depicts a face of a particular user (para. 18 (col. 4 lines 37-44) “According to one or more embodiments, training module 122 may train a model, such as a neural network, based on image data from a single subject or multiple subjects. In one or more embodiment, network device may capture image data of a person or people presenting one or more facial expressions. In one or more embodiments, the image data may be in the form of still images, or video images, such as a series of frames.”);
processing, by the computing system, the video data with a plurality of machine-learned models (fig. 2 shows a process including training 2 distinct models (elements 220 and 240) as part of generating the final dynamic texture model, para. 22 (col. 5 lines 30-33) “Referring to FIG. 2, a flow diagram is illustrated in which mesh and texture autoencoders are trained from a given sequence”, para. 23-28 (cols. 5-6) explains the process, fig. 6 shows the use of the machine-learned models to render an avatar, para. 43-44 (cols. 9-10) explains the process) of a user-specific model ensemble (models are specific to a user, para. 25 (col. 6 lines 9-11) “In one or more embodiments, the specific user's expression model may be stored for use during runtime”) for photorealistic facial representation (para. 9 (col. 1 lines 56-57) states that the goal is “to generate a photorealistic avatar”) to obtain a corresponding plurality of model outputs (fig. 6 shows the use of the machine-learned models to render an avatar, para. 43-44 (cols. 9-10) explains the process, para. 45 (col. 10 lines 19-37) states that each rendered avatar is unique to a particular user), wherein the plurality of machine-learned models comprises one or more of:
a machine-learned mesh representation model trained to generate a Three-Dimensional (3D) polygonal mesh representation of the face of the particular user (fig. 2, expression model 220 based on mesh representation 210, para. 23 (col. 5 lines 38-41) “the mesh and texture autoencoders may be trained from a series of images of one or more users in which the users are providing a particular expression”);
a machine-learned texture representation model trained to generate a plurality of textures representative of the face of the particular user (fig. 2, texture model 240, para. 23 (col. 5 lines 38-41) “the mesh and texture autoencoders may be trained from a series of images of one or more users in which the users are providing a particular expression”); or
one or more subsurface anatomical representation models trained to generate one or more respective sub-surface model outputs, each comprising a representation of a different sub-surface anatomy of the face of the particular user (para. 33 (col. 7 lines 44-58) “the training module 122 trains a neural network to map facial expression to the blood flow texture, such as a texture autoencoder. As described above, in one or more embodiments, the facial expression may be represented by latent variables descriptive of the 3D geometry of the face presenting the facial expression. In one or more embodiments, at 335, training module 122 generates a texture map from the calculated offset for the one or more facial expressions. The blood flow texture may be a 2D blood flow map that indicates a coloration offset from the albedo texture for the subject. The trained texture autoencoder and corresponding texture decoder may receive as input image data and/or an indication of the expression (e.g., the latent vector described above), and output the 2D blood flow texture map”); and
optimizing, by the computing system, at least one machine-learned model of the plurality of machine-learned models (Soares para. 12 (col. 2 lines 25-29) “The aim of an autoencoder is to learn a representation for a set of data in an optimized form. A trained autoencoder will have an encoder portion, a decoder portion, and latent variables, which represent the optimized representation of the data”, para. 23 (col. 5 lines 38-41) “the mesh and texture autoencoders may be trained from a series of images of one or more users in which the users are providing a particular expression”).
Soares does not explicitly teach: identifying, by the computing system, a difference between the video data and prior video data depicting the face of the particular user, wherein the difference comprises a change of the face of the particular user; and performing the previously listed steps responsive to identifying the change.
Oz teaches: identifying, by the computing system, a difference between the video data and prior video data depicting the face of the particular user, wherein the difference comprises a change of the face of the particular user ([0192] “During the conference call, at the Encoder, an additional mechanism is added. This mechanism, a Change Detector (CD), analyzes the expressions made by the participant user and finds when the participant makes a new expression—one that was not taken into account when the models were created. This can easily be performed as expressions—as mentioned above—can be modeled using a 100-component vector. Therefore, the CD can identify situations when the participant's expressions are composed of vectors that were not used for training. The CD may decide that a new vector detected may require re-modelling if the newly detected vector is distant from all previously used vectors by some metric.”); and updating a facial model responsive to identifying the change ([0192] “The CDs would store these unmodelled conditions and at some time after the conference call ends, would either create new models locally and then update a central server with the new models, or would send the relevant pictures as captured by a camera to the central server which would be able to improve the existing models faster”).
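For illustration only, the change-detection behavior quoted from Oz [0192] can be sketched as below. This is a minimal sketch and not Oz's implementation: Oz states only that a new vector may require re-modelling if it is distant from all previously used vectors "by some metric," so the Euclidean metric, the threshold value, and the names `is_new_expression` and `threshold` are assumptions introduced here.

```python
import numpy as np

def is_new_expression(candidate, training_vectors, threshold=1.0):
    """Return True if `candidate` is farther than `threshold` from every
    expression vector that was used for training.

    Assumptions (not from Oz): Euclidean distance as the metric and a fixed
    scalar threshold.
    """
    candidate = np.asarray(candidate, dtype=float)
    training_vectors = np.asarray(training_vectors, dtype=float)
    if training_vectors.size == 0:
        # No expression has been modelled yet, so any expression is new.
        return True
    # Distance from the candidate to each previously used vector.
    distances = np.linalg.norm(training_vectors - candidate, axis=1)
    return bool(distances.min() > threshold)
```

Under these assumptions, a 100-component expression vector identical to one used for training is not flagged, while one lying far from every training vector is flagged for later re-modelling.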
Soares and Oz are both analogous to the claimed invention because they are in the same field of 3D facial model generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Soares with the teachings of Oz to only perform the machine-learning model update steps when necessitated by a change in a user’s face. The motivation would have been to increase efficiency by eliminating unnecessary computation.
The combination of Soares in view of Oz does not explicitly teach optimizing at least one machine-learned model based on a loss function that evaluates at least one model output of the plurality of model outputs.
Gafni teaches optimizing at least one machine-learned model based on a loss function that evaluates at least one model output of the plurality of model outputs (pg. 5 col. 2 equations (4) and (5) provides a loss function for optimizing model output as part of a process of generating a 3D animated avatar from video input (fig. 1)).
Gafni and the combination of Soares in view of Oz are both analogous to the claimed invention because they are in the same field of 3D facial model generation using machine learning. Furthermore, the concept of optimizing a machine learning model using a loss function is well known and commonplace in the art. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the avatar generation method of Soares in view of Oz with the mathematical optimization techniques of Gafni in order to provide what is considered the standard quality of optimization for a machine learning model.
Regarding claim 2, the combination of Soares in view of Oz and Gafni teaches: the method of claim 1, wherein the method further comprises:
generating, by the computing system, at least one optimized model output with the at least one machine-learned model of the user-specific model ensemble (Soares para. 12 (col. 2 lines 25-29) “The aim of an autoencoder is to learn a representation for a set of data in an optimized form. A trained autoencoder will have an encoder portion, a decoder portion, and latent variables, which represent the optimized representation of the data.”, para. 43-44 (cols. 9-10) describes generating mesh and texture outputs using the optimized latent variables).
Regarding claim 3, the combination of Soares in view of Oz and Gafni teaches: The method of claim 2, wherein the method further comprises:
updating, by the computing system, a user-specific model output repository (Oz [0192] “The CDs would store these unmodelled conditions and at some time after the conference call ends, would either create new models locally and then update a central server with the new models, or would send the relevant pictures as captured by a camera to the central server which would be able to improve the existing models faster”; [0190] suggests that users’ models are stored for significant periods of time across multiple video conference sessions) for photorealistic facial representation (Soares para. 9 (col. 1 lines 56-57) states that the goal is “to generate a photorealistic avatar”) based on the at least one optimized model output (Soares para. 12 (col. 2 lines 25-29) “The aim of an autoencoder is to learn a representation for a set of data in an optimized form. A trained autoencoder will have an encoder portion, a decoder portion, and latent variables, which represent the optimized representation of the data.”, para. 43-44 (cols. 9-10) describes generating mesh and texture outputs using the optimized latent variables), wherein the user-specific model output repository stores an optimized instance of each of the plurality of model outputs (Soares col. 4 lines 65-67 “The result of the training may be a model that provides the blood flow texture maps. The model or models may be stored in model store 145.”).
Soares, Oz, and Gafni are analogous to the claimed invention because they are in the same field of 3D facial model generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Soares in view of Oz and Gafni with the additional teachings of Oz to update a model repository with newer model versions so that the models do not need to be re-updated or generated from scratch, saving time and computation.
Regarding claim 8, the combination of Soares in view of Oz and Gafni teaches: the method of claim 1, wherein processing the video data with the plurality of machine-learned models of the user-specific model ensemble for photorealistic facial representation to obtain the corresponding plurality of model outputs comprises:
processing, by the computing system, the video data with a blood-flow mapping model of the one or more subsurface anatomical representation models to obtain a sub-surface model output indicative of a mapping of a blood flow anatomy of the face of the particular user (para. 33 (col. 7 lines 44-58) “the training module 122 trains a neural network to map facial expression to the blood flow texture, such as a texture autoencoder. As described above, in one or more embodiments, the facial expression may be represented by latent variables descriptive of the 3D geometry of the face presenting the facial expression. In one or more embodiments, at 335, training module 122 generates a texture map from the calculated offset for the one or more facial expressions. The blood flow texture may be a 2D blood flow map that indicates a coloration offset from the albedo texture for the subject. The trained texture autoencoder and corresponding texture decoder may receive as input image data and/or an indication of the expression (e.g., the latent vector described above), and output the 2D blood flow texture map”).
Regarding claim 17, it is rejected with the same references, rationale, and motivation to combine as claim 1 because its limitations substantially correspond to the limitations of claim 1, with the additional limitation of: A non-transitory computer-readable storage medium that includes executable instructions (col. 13 lines 18-30 “Storage 865 may include one more non-transitory computer-readable storage mediums including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM). Memory 860 and storage 865 may be used to tangibly retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 805 such computer program code may implement one or more of the methods described herein.”).
Regarding claims 18 and 19, they are rejected with the same references, rationale, and motivation to combine as claims 2 and 3 respectively because their limitations substantially correspond to the limitations of claims 2 and 3 respectively.
Claim(s) 4-7 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares (US 11830182 B1) in view of Oz (US 20230247180 A1) and Gafni (Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction) as applied to claims 1 and 17 above, and further in view of Ogrinz et al. (US 11361062 B1, hereinafter "Ogrinz").
Regarding claim 4, the combination of Soares in view of Oz and Gafni teaches: The method of claim 1, wherein the method further comprises:
receiving, by the computing system from a computing device associated with the particular user, information descriptive of second video data (Soares distinguishes between two sets of video/image data, one for training (second half of col. 4) and one for avatar generation (col. 8 last paragraph, col. 9 last paragraph)), wherein the second video data depicts the face of the particular user performing an expression (Soares para. 43 (col. 9 lines 53-54) “…the expression presented on the subject's face in the image data 600”), and wherein the second video data is captured for display to a teleconference session that includes the computing device and one or more second computing devices (Oz [0034] “For example, referring to a 3D video conference that involves multiple participants. A first participant is imaged, and a second participant wishes to view a first avatar (or any other 3D visual representation) of the first participant within a virtual 3D video conference environment.”); and
using, by the computing system, the plurality of model outputs to render a photorealistic (Soares, see claim 1) animation (Soares para. 46 (col. 10 lines 40-43) “Electronic device 700 may be used to acquire user images (e.g., a temporal sequence of image frames) and generate and animate an avatar in accordance with this disclosure.”) of the face of the particular user performing the expression depicted by the second video data (Soares fig. 6, para. 43 (col. 9 lines 46-54) “FIG. 6 depicts and example flow diagram of generating an avatar utilizing a blood flow map from a trained neural network, such as a texture. The flow diagram begins at 600 where an image of a subject is captured. In one or more embodiments, the image may be captured, for example, by camera 176 of client device 175. The image data 600 may be input into an expression model 605 to determine a geometry of the expression presented on the subject's face in the image data 600”).
Soares and Oz are both analogous to the claimed invention because they are in the same field of 3D facial model generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Soares with the teachings of Oz to apply it to the field of teleconferencing to replace a participant’s video with an avatar. The motivation would have been to reduce the amount of transmitted data when bandwidth is low (Oz [0124]).
The combination of Soares in view of Oz and Gafni does not explicitly teach that the expression is a microexpression that is unique to the particular user.
Ogrinz teaches the use of machine learning to analyze video data and recognize a microexpression that is unique to the particular user (description para. 12 (col. 9 lines 47-54) “the AI engine 114 captures microexpressions 104 expressed by a user 102 reacting to one or more training media items 152. This process is described further below. The microexpressions 104 may represent brief and involuntary emotional responses of the user 102 to media stimuli, e.g., training media items 152. The microexpressions 104 may correspond to a unique microexpressions fingerprint for the user 102”).
Ogrinz and the combination of Soares in view of Oz and Gafni are all analogous to the claimed invention because they each pertain to the same issue of capturing and processing a user’s expression based on video data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine-learned 3D avatar model generation of Soares in view of Oz and Gafni with the invention of Ogrinz to add the level of precision and detail required to capture users’ microexpressions and emulate them via the generated avatar. Expressing these small-scale facial movements would add to the believability of the avatar’s movement, furthering the goal of photorealism.
Regarding claim 5, the combination of Soares in view of Oz and Gafni and further in view of Ogrinz teaches: the method of claim 4, wherein the method further comprises:
transmitting, by the computing system, the photorealistic animation of the face of the particular user performing the microexpression to the one or more second computing devices of the teleconference session (Soares para. 45 (col. 10 lines 19-37) teaches a method of indirectly transmitting avatar data to a recipient device in the form of a blood flow map, texture map, and an expression latent vector, in order to recreate a photorealistic animation of the face of the particular user performing an expression).
Regarding claim 6, the combination of Soares in view of Oz and Gafni and further in view of Ogrinz teaches: the method of claim 4, wherein receiving the information descriptive of the second video data from the computing device comprises:
receiving, by the computing system from the computing device associated with the particular user, the information descriptive of the second video data, wherein the information descriptive of the second video data comprises a plurality of key frames (Soares para. 23 (col. 5 lines 44-50) “… the training module 122 captures or otherwise obtains expression images. In one or more embodiments, the expression images may be captured as a series of frames, such as a video, or may be captured from still images or the like. The expression images may be acquired from numerous individuals, or a single individual”) from the second video data (Oz [0117] “During the communication session, i.e., a 3D video conference call between several users, a 2D or 3D camera (or several cameras) grabs videos of the users. From these videos a 3D model (for example—the best fitting 3D model) of the user may be created at a high frequency, e.g., at a frame rate of 15 to 120 fps.”; [0192] “These unmodelled expressions and their corresponding parameters, would typically last for a few frames as captured by a camera.”).
Soares, Oz, Gafni, and Ogrinz are all analogous to the claimed invention because they each pertain to the same issue of capturing and processing a user’s expression based on video data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine-learned 3D avatar model generation of Soares in view of Oz and Gafni with the invention of Ogrinz to add the level of precision and detail required to capture users’ microexpressions and emulate them via the generated avatar. Expressing these small-scale facial movements would add to the believability of the avatar’s movement, furthering the goal of photorealism. Additionally, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the expression-capture systems of Soares and Oz to be able to train a machine-learning model using video data from a separate training session. This would have given the option for a user’s avatar to be pre-generated and available for use at any time.
Regarding claim 7, the combination of Soares in view of Oz and Gafni and further in view of Ogrinz teaches: The method of claim 4, wherein receiving the information descriptive of the second video data from the computing device comprises:
receiving, by the computing system from the computing device associated with the particular user (Oz [0038] “The generation of the first avatar and/or the inclusion of the first avatar may be responsive to information gained by the device of the first user or to a camera or sensor associated with the device of the first user.”; [0124] “The model's parameters may be transmitted to all other user devices directly or to a central server.”), the information descriptive of the second video data from the computing device, wherein the information descriptive of the second video data comprises a motion capture information derived from the second video data (Oz [0120] to [0124] “Since this may be a parametric model, it may be represented by a small number of parameters. Typically, less than 300 parameters may be used to create a high-quality model of the face including each person's shape, expression and pose.
These parameters may be further compressed using quantization and entropy coding such as a Huffman or arithmetic coder.
The parameters may be ordered according to their importance and the number of parameters that may be transmitted and the number of bits per parameter may vary according to the available bandwidth.
In addition, instead of coding the parameters' values, the differences of these values between consecutive video frames may be coded.”).
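For illustration only, the quantization and inter-frame difference coding described in the quoted passage of Oz can be sketched as below. This is a minimal sketch under stated assumptions, not Oz's codec: the uniform quantization step, the function names `encode_frames` and `decode_frames`, and the array layout (one row of parameters per video frame) are introduced here, and the entropy-coding stage (Huffman or arithmetic coding) is omitted.

```python
import numpy as np

def encode_frames(frames, step=0.01):
    """Quantize per-frame model parameters and code frame-to-frame differences.

    `frames` has shape (num_frames, num_params). The first output row holds
    the quantized first frame; each later row holds the difference from the
    previous quantized frame. The uniform `step` is an assumption.
    """
    quantized = np.round(np.asarray(frames, dtype=float) / step).astype(np.int64)
    zeros = np.zeros((1, quantized.shape[1]), dtype=np.int64)
    return np.diff(quantized, axis=0, prepend=zeros)

def decode_frames(deltas, step=0.01):
    """Invert `encode_frames`: accumulate the differences, then rescale."""
    return np.cumsum(deltas, axis=0) * step
```

Because the differences are taken after quantization, the decoder recovers each frame to within half a quantization step and the error does not accumulate across frames.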
Regarding claim 20, it is rejected with the same references, rationale, and motivation to combine as claim 4 because its limitations substantially correspond to the limitations of claim 4.
Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares (US 11830182 B1) in view of Oz (US 20230247180 A1) and Gafni (Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction) as applied to claim 1 above, and further in view of Raman et al. (Mesh-Tension Driven Expression-Based Wrinkles for Synthetic Faces, hereinafter "Raman").
Regarding claim 9, the combination of Soares in view of Oz and Gafni teaches: the method of claim 1, but does not explicitly teach: wherein processing the video data with the plurality of machine-learned models of the user-specific model ensemble for photorealistic facial representation to obtain the corresponding plurality of model outputs comprises:
processing, by the computing system, the video data with a skin tension mapping model of the one or more subsurface anatomical representation models to obtain a sub-surface model output indicative of a mapping of a skin tension anatomy of the face of the particular user.
Raman teaches: wherein processing the video data with the plurality of machine-learned models of the user-specific model ensemble for photorealistic facial representation to obtain the corresponding plurality of model outputs comprises:
processing, by the computing system, the video data with a skin tension mapping model of the one or more subsurface anatomical representation models to obtain a sub-surface model output indicative of a mapping of a skin tension anatomy of the face of the particular user (pg. 2 col. 1 “Our central idea is to capture complex wrinkling effects for an identity from high-resolution scans of their posed expressions. We store all these possible wrinkles into albedo and displacement textures we refer to as wrinkle maps. At synthesis, for any arbitrary expression beyond those represented in the source scans, we blend between the neutral and wrinkle textures using a notion of the tension in the face mesh to obtain dynamic wrinkling effects.”).
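For illustration only, the tension-driven blending described in the quoted passage of Raman can be sketched as below. This is a minimal sketch and does not reproduce Raman's formulation: measuring tension as the relative compression of mesh edges, clipping it to the range 0 to 1, and linearly interpolating between the two textures are assumptions introduced here, as are the names `edge_compression` and `blend_textures`.

```python
import numpy as np

def edge_compression(rest_vertices, posed_vertices, edges):
    """Relative shortening of each mesh edge between the neutral and posed mesh.

    Returns values in [0, 1]: 0 for unchanged or stretched edges, approaching
    1 as an edge collapses. Assumption: this stands in for the "notion of
    tension" mentioned in the quoted passage.
    """
    rest = np.asarray(rest_vertices, dtype=float)
    posed = np.asarray(posed_vertices, dtype=float)
    edges = np.asarray(edges, dtype=int)
    rest_len = np.linalg.norm(rest[edges[:, 0]] - rest[edges[:, 1]], axis=1)
    posed_len = np.linalg.norm(posed[edges[:, 0]] - posed[edges[:, 1]], axis=1)
    return np.clip((rest_len - posed_len) / rest_len, 0.0, 1.0)

def blend_textures(neutral, wrinkle, weight):
    """Linearly interpolate between a neutral texture and a wrinkle texture."""
    neutral = np.asarray(neutral, dtype=float)
    wrinkle = np.asarray(wrinkle, dtype=float)
    return (1.0 - weight) * neutral + weight * wrinkle
```

Under these assumptions, an edge compressed to half its rest length yields a weight of 0.5, so the blended texture lies midway between the neutral and wrinkle textures in that region.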
Raman and the combination of Soares in view of Oz and Gafni are analogous to the claimed invention because they are in the same field of 3D facial model generation using machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the avatar generation method of Soares in view of Oz and Gafni with the skin tension mapping of Raman. The motivation would have been to further improve the photorealism of the generated avatars by modeling detailed skin wrinkles.
Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares (US 11830182 B1) in view of Oz (US 20230247180 A1) and Gafni (Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction) as applied to claim 1 above, and further in view of Chen et al. (US 20230222721 A1, hereinafter "Chen").
Regarding claim 10, the combination of Soares in view of Oz and Gafni teaches the method of claim 1, but does not explicitly teach wherein the method further comprises:
generating, by the computing system, model update information descriptive of optimizations made to the at least one machine-learned model; and transmitting, by the computing system, the model update information to a computing device associated with the particular user.
Chen teaches wherein the method further comprises:
generating, by the computing system, model update information descriptive of optimizations made to the at least one machine-learned model; and transmitting, by the computing system, the model update information to a computing device associated with the particular user (fig. 4, [0051] “In step 420, an electronic version or copy the trained machine learning network may be distributed to multiple client devices. For example, the trained machine learning network may be transmitted to and locally stored on client devices. The machine learning network may be updated and further trained from time to time and the machine learning network may be distributed to a client device 150, 151, and stored locally.”).
Chen and the combination of Soares in view of Oz and Gafni are both analogous to the claimed invention because they are in the same field of 3D avatar generation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine-learned 3D avatar model generation of Soares in view of Oz and Gafni with the invention of Chen to add the ability to update machine learning models stored on a user’s personal device. The motivation would have been to reduce the invention’s need for centralized computing hardware, saving on costs and upkeep.
Claim(s) 11-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares (US 11830182 B1) in view of Chen (US 20230222721 A1), Oz (US 20230247180 A1) and Ogrinz (US 11361062 B1).
Regarding claim 11, Soares discloses a computing system, comprising:
a memory; and one or more processor devices coupled to the memory (fig. 7, para. 47 (col. 11 lines 48-55) “Storage 755 may store media (e.g., audio, image and video files), computer program instructions or software, preference information, device profile information, pre-generated models, frameworks, and any other suitable data. When executed by processor module 730 and/or graphics hardware 735 such computer program code may implement one or more of the methods described herein (e.g., see FIGS. 1-6)”) configured to:
use a plurality of optimized model outputs (para. 12 (col. 2 lines 25-29) “The aim of an autoencoder is to learn a representation for a set of data in an optimized form. A trained autoencoder will have an encoder portion, a decoder portion, and latent variables, which represent the optimized representation of the data”, para. 23 (col. 5 lines 38-41) “the mesh and texture autoencoders may be trained from a series of images of one or more users in which the users are providing a particular expression”, para. 24 (col. 5 lines 59-61) “As part of the training process of the expression mesh autoencoder, mesh latents may be obtained”, para. 27 (col. 6 lines 30-32) “In one or more embodiments, texture latents may be obtained based on the training of the texture autoencoder”) to generate a Three-Dimensional (3D) photorealistic representation (para. 9 (col. 1 lines 56-57) states that the goal is “to generate a photorealistic avatar”) of the face of the particular user (para. 20 (col. 5 lines 1-3) “avatar module 186 renders an avatar, for example, depicting a user of client device 175 or a user of a device communicating with client device 175”), wherein the plurality of optimized model outputs are obtained from a corresponding plurality of machine-learned models (para. 12 (col. 2 lines 25-29) “The aim of an autoencoder is to learn a representation for a set of data in an optimized form. A trained autoencoder will have an encoder portion, a decoder portion, and latent variables, which represent the optimized representation of the data”, para. 23 (col. 5 lines 38-41) “the mesh and texture autoencoders may be trained from a series of images of one or more users in which the users are providing a particular expression”) of a user-specific model ensemble (models are unique to a user, para. 25 (col. 6 lines 9-11) “In one or more embodiments, the specific user's expression model may be stored for use during runtime”) for photorealistic facial representation, and wherein the plurality of machine-learned models comprises one or more of:
a machine-learned mesh representation model trained to generate a 3D polygonal mesh representation of the face of the particular user (fig. 2, expression model 220 based on mesh representation 210, para. 22 (col. 5 lines 31-32) “mesh and texture autoencoders are trained from a given sequence”);
a machine-learned texture representation model trained to generate a plurality of textures representative of the face of the particular user (fig. 2, texture model 240, para. 22 (col. 5 lines 31-32) “mesh and texture autoencoders are trained from a given sequence”); or
one or more subsurface anatomical representation models trained to generate one or more respective sub-surface model outputs, each comprising a representation of a different sub-surface anatomy of the face of the particular user (para. 33 (col. 7 lines 44-58) “the training module 122 trains a neural network to map facial expression to the blood flow texture, such as a texture autoencoder. As described above, in one or more embodiments, the facial expression may be represented by latent variables descriptive of the 3D geometry of the face presenting the facial expression. In one or more embodiments, at 335, training module 122 generates a texture map from the calculated offset for the one or more facial expressions. The blood flow texture may be a 2D blood flow map that indicates a coloration offset from the albedo texture for the subject. The trained texture autoencoder and corresponding texture decoder may receive as input image data and/or an indication of the expression (e.g., the latent vector described above), and output the 2D blood flow texture map”);
generate a rendering of the 3D photorealistic representation of the face of the particular user (para. 35 (col. 8 lines 13-14) “Referring to FIG. 5, a flow chart is depicted in which an avatar is rendered utilizing a blood texture map.”) performing an expression (para. 36 (col. 8 lines 21-25) “The flowchart begins at 505, in which an expression to be represented by an avatar is determined. At 510, the avatar module 186 optionally determines a head pose, latent vector, spherical harmonics, and view vector in determining an expression to be represented by the avatar.”).
Soares does not explicitly disclose that the computing system is configured to: obtain, from a computing device associated with a particular user, motion capture information indicative of a face of the particular user performing a microexpression unique to the particular user; and
transmit the rendering of the 3D photorealistic representation of the face of the particular user to one or more second computing devices of a teleconference session that includes the computing device and the one or more second computing devices.
Chen teaches a computing system configured to obtain, from a computing device associated with a particular user ([0053] “In step 440, each frame from the video (or the identified group of pixels) is input into the local version of the machine learning network stored on the client device.”), motion capture information indicative of a face of the particular user performing an expression ([0054] “At step 450, the machine learning network determines facial expression values such as one or more action unit values with an associated action intensity value. In some embodiments, only an action unit value is determined. For example, an image of a user may depict that the user's eyes are closed, and the user's head is slightly turned to the left. The trained machine learning network may output two pairs of action unit values and corresponding intensity values of 43, 1 and 51, 0.5. Action unit value 43 would indicate that the eyes are closed, and the intensity values 1 would maximum action (i.e., eyes closed all the way). Action unit value 51 would indicate head turned to the left, and the intensity value 0.5 would indicate pronounced action (i.e., head turned half-way to the left).”), and
transmit the rendering of the face of the particular user to one or more second computing devices of a teleconference session that includes the computing device and the one or more second computing devices ([0063] “The modified video stream depicting the video conference participant in an avatar form may be transmitted to other video conference participants for display on their local device”).
Soares and Chen are both analogous to the claimed invention because they are in the same field of 3D avatar generation using machine learning. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine-learned 3D avatar model generation of Soares with two aspects of the invention of Chen: the quantitative motion-capture system for expressions (which would have reduced the amount of data necessary to represent an expression animation, benefiting users with poor network connections), and the ability to directly transmit the modified video feed to other teleconference participants’ devices (which would have provided the option to render the modified video feed containing the avatar on a centralized computing platform, benefiting users with devices that have poor processing power).
The combination of Soares in view of Chen does not explicitly teach: identify a difference between the motion capture information and prior motion capture information indicative of the face of the particular user, wherein the difference comprises a change in the face of the particular user; and execute the previously mentioned steps responsive to the change.
Oz teaches: identify a difference between the motion capture information and prior motion capture information indicative of the face of the particular user, wherein the difference comprises a change in the face of the particular user ([0192] “During the conference call, at the Encoder, an additional mechanism is added. This mechanism, a Change Detector (CD), analyzes the expressions made by the participant user and finds when the participant makes a new expression—one that was not taken into account when the models were created. This can easily be performed as expressions—as mentioned above—can be modeled using a 100-component vector. Therefore, the CD can identify situations when the participant's expressions are composed of vectors that were not used for training. The CD may decide that a new vector detected may require re-modelling if the newly detected vector is distant from all previously used vectors by some metric.”); and update a facial model responsive to the change ([0192] “The CDs would store these unmodelled conditions and at some time after the conference call ends, would either create new models locally and then update a central server with the new models, or would send the relevant pictures as captured by a camera to the central server which would be able to improve the existing models faster”).
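For illustration only, the change-detection mechanism quoted from Oz [0192] can be sketched as follows. The sketch flags an expression vector for re-modelling when it is distant from every previously used vector. The Euclidean metric, the threshold value, and the names `min_distance` and `needs_remodelling` are assumptions made for this example; Oz specifies only "some metric" and does not disclose this code.

```python
import math
from typing import Sequence


def min_distance(vector: Sequence[float],
                 known_vectors: Sequence[Sequence[float]]) -> float:
    """Smallest Euclidean distance from `vector` to any previously used vector."""
    return min(math.dist(vector, known) for known in known_vectors)


def needs_remodelling(vector: Sequence[float],
                      known_vectors: Sequence[Sequence[float]],
                      threshold: float) -> bool:
    """Return True when `vector` is farther than `threshold` from all known vectors.

    The threshold is a hypothetical tuning parameter, not a value from Oz.
    """
    if not known_vectors:
        # Nothing was used for training, so any expression counts as new.
        return True
    return min_distance(vector, known_vectors) > threshold


# Toy 2-component vectors; Oz describes a 100-component expression vector.
known = [[0.0, 0.0], [1.0, 0.0]]
print(needs_remodelling([0.1, 0.0], known, threshold=0.5))  # close to a known vector
print(needs_remodelling([3.0, 4.0], known, threshold=0.5))  # far from all known vectors
```

Under these assumptions, only the second vector would be stored as an unmodelled condition for a later model update.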
Oz and the combination of Soares in view of Chen are analogous to the claimed invention because they are in the same field of 3D avatar generation. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Soares in view of Chen with the teachings of Oz to perform the machine-learning model update steps only when necessitated by a change in a user’s face. The motivation would have been to increase efficiency by eliminating unnecessary computation.
The combination of Soares in view of Chen and Oz does not explicitly teach that the expression performed by the particular user is a microexpression unique to the particular user.
Ogrinz teaches the use of machine learning to analyze video data and recognize a microexpression that is unique to the particular user (col. 9 lines 47-54 “the AI engine 114 captures microexpressions 104 expressed by a user 102 reacting to one or more training media items 152. This process is described further below. The microexpressions 104 may represent brief and involuntary emotional responses of the user 102 to media stimuli, e.g., training media items 152. The microexpressions 104 may correspond to a unique microexpressions fingerprint for the user 102”).
Soares, Chen, Oz, and Ogrinz are all analogous to the claimed invention because they each pertain to the same issue of capturing and processing a user’s expression based on video data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the machine-learned 3D avatar model generation of Soares in view of Chen and Oz with the invention of Ogrinz to add the level of precision and detail required to capture users’ microexpressions and emulate them via the generated avatar. Expressing these small-scale facial movements would add to the believability of the avatar’s movement, furthering the goal of photorealism.
Regarding claim 12, the combination of Soares in view of Chen, Oz, and Ogrinz teaches the computing system of claim 11, wherein using the plurality of optimized model outputs to generate the 3D photorealistic representation of the face of the particular user comprises:
obtaining a model output comprising the 3D polygonal mesh representation of the face of the particular user (Soares fig. 6, para. 44 (col. 10 lines 16-19) “…a 3D mesh 625, which may be output from the expression neural network model 605”); and
applying a model output comprising the plurality of textures representative of the face of the particular user to the 3D polygonal mesh representation of the face of the particular user (Soares fig. 6, para. 43 (col. 9 lines 46-48) “FIG. 6 depicts and example flow diagram of generating an avatar utilizing a blood flow map from a trained neural network, such as a texture”, para. 44 (col. 10 lines 10-19) “As described above, the blood flow neural network model may map the expression (e.g., the latent vector representing the expression) to the 2D blood flow map 620. The flow diagram continues at 630, where the avatar module generates the avatar. According to one or more embodiments, the avatar module renders the avatar by applying the 2D blood flow map 620 to a 3D mesh 625, which may be output from the expression neural network model 605”).
Regarding claim 13, the combination of Soares in view of Chen, Oz, and Ogrinz teaches the computing system of claim 12, wherein using the plurality of optimized model outputs to generate the 3D photorealistic representation of the face of the particular user further comprises:
applying one or more sub-surface model outputs to the 3D polygonal mesh representation of the face of the particular user, wherein each of the one or more sub-surface model outputs represents a different sub-surface anatomy of the face of the particular user (Soares fig. 6, para. 43 (col. 9 lines 46-48) “FIG. 6 depicts and example flow diagram of generating an avatar utilizing a blood flow map from a trained neural network, such as a texture”, para. 44 (col. 10 lines 10-19) “As described above, the blood flow neural network model may map the expression (e.g., the latent vector representing the expression) to the 2D blood flow map 620. The flow diagram continues at 630, where the avatar module generates the avatar. According to one or more embodiments, the avatar module renders the avatar by applying the 2D blood flow map 620 to a 3D mesh 625, which may be output from the expression neural network model 605”).
Regarding claim 14, the combination of Soares in view of Chen, Oz, and Ogrinz teaches the computing system of claim 13, wherein generating the rendering of the 3D photorealistic representation of the face of the particular user comprises:
obtaining a microexpression animation for the microexpression unique to the particular user (Chen [0055] “At step 460, the system 100 applies the determined action unit value and corresponding intensity value pairs to an avatar model. Blendshapes of the avatar model are then selected based on the determined action unit values.”, Ogrinz teaches the use of microexpressions unique to the particular user as described for claim 11); and
animating the 3D polygonal mesh representation of the face of the particular user based on the microexpression animation (Chen [0055] “A 3D animation of the avatar model is then rendered using the selected blendshapes. The selected blend shapes morph or adjust the mesh geometry of the avatar model.”, Ogrinz teaches the use of microexpressions unique to the particular user as described for claim 11).
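For illustration only, the blendshape step quoted from Chen [0055] can be sketched as below: each (action unit value, intensity value) pair selects a blendshape, and the selected blendshape offsets are scaled by intensity and added to the base mesh geometry. The linear blending formula, the data layout, and the name `apply_blendshapes` are assumptions made for this example and are not taken from Chen.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def apply_blendshapes(base: List[Vertex],
                      deltas_by_au: Dict[int, List[Vertex]],
                      au_pairs: List[Tuple[int, float]]) -> List[Vertex]:
    """Morph a base mesh by intensity-weighted per-vertex blendshape offsets.

    `deltas_by_au` maps an action unit value to one offset per vertex
    (a hypothetical layout). Action units without a blendshape are skipped.
    """
    result = [list(v) for v in base]
    for au, intensity in au_pairs:
        deltas = deltas_by_au.get(au)
        if deltas is None:
            continue
        for i, delta in enumerate(deltas):
            for axis in range(3):
                result[i][axis] += intensity * delta[axis]
    return [(v[0], v[1], v[2]) for v in result]


# Two-vertex toy mesh; action unit 43 (eyes closed) lowers the first vertex.
base_mesh = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
blendshapes = {43: [(0.0, -1.0, 0.0), (0.0, 0.0, 0.0)]}
print(apply_blendshapes(base_mesh, blendshapes, [(43, 0.5), (51, 1.0)]))
```

In this toy case action unit 51 has no blendshape entry, so only action unit 43 at half intensity moves the mesh.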
Soares, Chen, Oz, and Ogrinz are all analogous to the claimed invention because they each pertain to the same issue of capturing and processing a user’s expression based on video data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the 3D facial expression animation of Soares in view of Chen and Oz with the invention of Ogrinz to add the level of precision and detail required to capture users’ microexpressions and emulate them via the generated avatar. Expressing these small-scale facial movements would add to the believability of the avatar’s movement, furthering the goal of photorealism.
Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares (US 11830182 B1) in view of Chen (US 20230222721 A1), Oz (US 20230247180 A1) and Ogrinz (US 11361062 B1) as applied to claim 14 above, and further in view of Orvalho et al. (US 20210328954 A1, hereinafter "Orvalho").
Regarding claim 15, the combination of Soares in view of Chen, Oz, and Ogrinz teaches the computing system of claim 14, as well as the microexpression unique to the particular user (Ogrinz, see claim 11), but does not explicitly teach wherein obtaining the microexpression animation for the microexpression unique to the particular user comprises:
processing the motion capture information with an animator model of the plurality of machine-learned models of the user-specific model ensemble for photorealistic facial representation to obtain a model output comprising the microexpression animation.
Orvalho teaches wherein obtaining the microexpression animation for the microexpression unique to the particular user comprises:
processing the motion capture information (fig. 8, transform matrix 802, [0089] “In this example, the transform matrix 802 generates a matrix based on decomposition of the captured data stream (710 in FIG. 7) into symbols and one or more matrices”) with an animator model of the plurality of machine-learned models of the user-specific model ensemble for photorealistic facial representation to obtain a model output comprising the microexpression animation ([0090] “In various embodiments, based on the matrix from the transform matrix 802, a selection is made of one or more applicable base expressions 804 from a set of base expressions and a determination is made of a combination of two or more of those base expressions that most closely mimics the expression of the first user 734 per the data stream 710. Machine learning may be used to fine tune the determination. In various embodiments, expression symbols and weights are determined.”, [0101] “Referring to FIG. 8, the time 814 aspect is identified and may be represented by a plurality of bytes (eight bytes in the example) in order to provide expression stream and corresponding time information for the expression stream for customizing the 3D animatable model.”).
Soares, Chen, Oz, Ogrinz, and Orvalho are all analogous to the claimed invention because they each pertain to the same issue of capturing and processing a user’s expression based on video data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the invention of Soares in view of Chen, Oz, and Ogrinz with the invention of Orvalho to add a machine learning-based method of translating a user’s motion capture data into a facial animation. The motivation would have been to expand upon and improve Chen’s method of animating an avatar based on motion capture data.
Claim(s) 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Soares (US 11830182 B1) in view of Chen (US 20230222721 A1), Oz (US 20230247180 A1) and Ogrinz (US 11361062 B1) as applied to claim 14 above, and further in view of Beijing Deepscience Technology Co., Ltd. (CN 110992455 B, hereinafter "Beijing Deepscience").
Regarding claim 16, the combination of Soares in view of Chen, Oz, and Ogrinz teaches the computing system of claim 14, wherein obtaining the microexpression animation for the microexpression unique to the particular user comprises:
retrieving the microexpression from a user-specific model output repository that stores an optimized instance of each of the plurality of model outputs.
Soares teaches storing an optimized instance of each of the plurality of model outputs (para. 12 (col. 2 lines 25-29) “The aim of an autoencoder is to learn a representation for a set of data in an optimized form. A trained autoencoder will have an encoder portion, a decoder portion, and latent variables, which represent the optimized representation of the data.”, para. 43-44 (col. 9-10) describes generating mesh and texture outputs using the optimized latent variables) in a model output repository (para. 19 (col. 4 lines 65-67) “The result of the training may be a model that provides the blood flow texture maps. The model or models may be stored in model store 145”).
Chen teaches a user-specific avatar repository (fig. 1A, elements 130 “Avatar Model Repository” and 134 “Avatar Model Customization Repository”, [0022] “The avatar model repository may store and/or maintain avatar models for selection and use with the video communication platform 140… The avatar model customization repository 134 may include customizations, style, coloring, clothing, facial feature sizing and other customizations made be a user to a particular avatar”, [0028] “The changes made to the particular avatar are stored or saved in the avatar model customization repository 134”), and retrieving an avatar from the avatar repository ([0033] describes rendering a video using an avatar from the repository, which necessitates retrieval).
Ogrinz teaches a user-specific microexpression repository (description para. 20 (col. 11 lines 58-65) “The user profile database 134 stores user profiles 150 and test profiles 172. The user profiles 150 includes the first user profile 150a. In the first user profile 150a, baseline features 170 associated with the user 102 are stored. The baseline features 170 may be represented by a feature vector that comprises numerical values that represent microexpressions 104 expressed by the user 102 captured in a training process which is described further below.”), and retrieving microexpression data from the repository (description para. 35 (col. 14 lines 54-64) and 40 (col. 16 lines 1-34) describe using the stored baseline features to perform user authentication, which necessitates retrieval from the repository).
The combination of Soares in view of Chen, Oz, and Ogrinz does not explicitly teach a repository for animation.
Beijing Deepscience teaches a facial animation repository ([n0023] “The facial animation data-driven method provided by the present invention uses real actor performances to create a large facial animation database, which is built based on FACS (Facial Action Coding System), reducing the workload required to create various facial expressions while maintaining the naturalness of human facial expressions and allowing real-time manipulation of data before visualization.”).
Soares, Chen, Oz, Ogrinz, and Beijing Deepscience are all analogous to the claimed invention because they each pertain to the same issue of capturing a user’s facial movements based on video data. Furthermore, the FACS system used by the facial animation repository of Beijing Deepscience is the same system used by Chen to quantify users’ expressions from motion capture data. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the machine learning model output repository of Soares, the user-specific avatar repository and retrieval system of Chen, the microexpression database of Ogrinz, and the facial animation database of Beijing Deepscience to create a repository for machine-learned, user-specific facial animations capable of the precision and detail required to capture microexpressions and generate a corresponding avatar animation. The motivation would have been to create a system that reuses a user’s captured and recorded animations whenever possible, reducing the overall amount of computational processing required by the invention.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN STATZ whose telephone number is (571)272-6654. The examiner can normally be reached Mon-Fri 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at (571)272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BENJAMIN TOM STATZ/ Examiner, Art Unit 2611
/TAMMY PAIGE GODDARD/ Supervisory Patent Examiner, Art Unit 2611