Prosecution Insights
Last updated: April 19, 2026
Application No. 18/628,476

FACE RELIGHTING OF AVATARS WITH HIGH-QUALITY SCAN AND MOBILE CAPTURE

Final Rejection — §102, §103
Filed: Apr 05, 2024
Examiner: LI, RAYMOND CHUN LAM
Art Unit: 2614
Tech Center: 2600 — Communications
Assignee: Meta Platforms Technologies, LLC
OA Round: 2 (Final)
Grant Probability: Favorable
Expected OA Rounds: 3-4
Time to Grant: 2y 9m

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 0 resolved; -62.0% vs TC avg) — grants only 0% of cases
Interview Lift: +0.0% (minimal lift, based on resolved cases with interview)
Avg Prosecution: 2y 9m (typical timeline)
Career History: 10 total applications across all art units; 10 currently pending

Statute-Specific Performance

§103: 55.6% (+15.6% vs TC avg)
§102: 17.8% (-22.2% vs TC avg)
§112: 26.7% (-13.3% vs TC avg)
Tech Center averages are estimates • Based on career data from 0 resolved cases

Office Action

§102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The Amendment filed February 5th, 2026 has been entered. Claims 1-20 remain pending in the application. The Examiner agrees that Claim 1 does not invoke 35 U.S.C. 112(f), and going forward Claim 1 will no longer be viewed as invoking 35 U.S.C. 112(f).

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 16-17 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Sevastopolskiy (US 20220157014 A1).

Regarding Claim 16, Sevastopolskiy teaches a method comprising: retrieving a plurality of images of a subject from a plurality of view directions (Paragraph [0019]: “For generating a relighted 3D portrait of a person the disclosed system may utilize a sequence of images featuring a person. Such sequence of images may be captured by conventional cameras of currently available handheld devices (e.g. by a smartphone camera)”; Paragraph [0020]: “The sequence of images may be captured by a camera with blinking flash while moving the camera at least partly around the person (e.g., partly around an upper body of the person)”); forming a plurality of synthetic views of the subject for each view (Paragraph [0023]: “A 3D point cloud 10 may be generated based on the sequence of captured images. The 3D point cloud 10 may be generated using, for example, and without limitation, a Structure-from-Motion (SfM) technique or any other known techniques allowing to reconstruct 3D structure of a scene or object based on the sequence of 2D images of that scene or object. Each point in the 3D point cloud 10 may be augmented with latent descriptor, for example, a multi-dimensional latent vector characterizing properties of the point. Latent descriptors may be sampled from a predefined probability distribution, e.g. from a unit elementwise Gaussian, and may later be fitted at the training stage. Each latent descriptor may serve as a memory vector for the deep neural network (also referred to as the rendering network and described in greater detail below) and may be used by the network to infer geometric and photometric properties of each point. The 3D point cloud is generated and camera viewpoints are estimated based on the set of flash images or the set of no-flash images”; Paragraph [0024]: “Latent descriptors of the generated 3D point cloud may be rasterized at different resolutions according to a requested new camera viewpoint to obtain rasterized images 15. The rasterization may be made using, for example, and without limitation, a Z-buffering technique or any other known techniques allowing to represent images of objects located in 3D space from a particular camera viewpoint”; Paragraph [0036]: “An image and a camera viewpoint corresponding to the image are randomly sampled at S100 from the captured sequence of images. A predicted image (e.g., predicted albedo, normals, environmental shadow maps, and segmentation mask) for the camera viewpoint is obtained at S105 by the deep neural network”. Notes: each image of the plurality of images has an associated viewpoint from which a synthetic image is reconstructed via rasterization utilizing 3D point cloud data augmented by latent descriptors); and training a model with the plurality of images of the subject and the plurality of synthetic views of the subject to determine at least a reflectance based on the plurality of images (Paragraph [0036]: “An image and a camera viewpoint corresponding to the image are randomly sampled at S100 from the captured sequence of images. A predicted image (e.g., predicted albedo, normals, environmental shadow maps, and segmentation mask) for the camera viewpoint is obtained at S105 by the deep neural network”; Paragraph [0026]: “Neural rendering may be performed. Neural rendering may be performed by processing the rasterized images 15 with, for example, a deep neural network trained to predict albedo, normals, environmental shadow maps, and segmentation mask for the received camera viewpoint… The training stage of the disclosed system and the architecture of used deep neural network are described in greater detail below with reference to FIG. 4. Predicted albedo, normals, environmental shadow maps, and segmentation mask are collectively indicated in FIG. 1 with the numeral 20. Predicted albedo describes the spatially varying reflectance properties (albedo ρ(x)) of the head surface (skin, hair, or other parts)”).

Regarding Claim 17, the method of Claim 16 is rejected over Sevastopolskiy. Sevastopolskiy teaches the method wherein the plurality of images of the subject is captured under a plurality of illumination configurations (Paragraph [0020]: “The sequence of images may be captured by a camera with blinking flash while moving the camera at least partly around the person (e.g., partly around an upper body of the person). Thus, the sequence of images comprises a set of flash images and a set of no-flash images”. Notes: the camera has a plurality of illumination configurations consisting of no flash and flash).

Regarding Claim 19, the method of Claim 16 is rejected over Sevastopolskiy. Sevastopolskiy teaches using a mobile device to capture at least some of the plurality of images of the subject from the plurality of view directions using a single point light source (Paragraph [0019]: “For generating a relighted 3D portrait of a person the disclosed system may utilize a sequence of images featuring a person. Such sequence of images may be captured by conventional cameras of currently available handheld devices (e.g. by a smartphone camera)”; Paragraph [0020]: “The sequence of images may be captured by a camera with blinking flash while moving the camera at least partly around the person (e.g., partly around an upper body of the person)”).
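As background on the step the rejection maps to “forming a plurality of synthetic views”, here is a minimal sketch of z-buffer rasterization of a latent-descriptor point cloud from a requested camera viewpoint. It illustrates the quoted technique only; the pinhole projection model and all names are assumptions, not Sevastopolskiy's implementation.

```python
import numpy as np

def rasterize_descriptors(points, descriptors, K, R, t, hw):
    """Project a 3D point cloud into a camera and keep, per pixel, the
    latent descriptor of the nearest point (simple z-buffering).

    points:      (N, 3) world-space point cloud
    descriptors: (N, D) per-point latent vectors
    K:           (3, 3) camera intrinsics; R, t: world-to-camera pose
    hw:          (height, width) of the output raster
    """
    h, w = hw
    cam = points @ R.T + t                  # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                        # keep points in front of the camera
    proj = (cam[valid] / z[valid, None]) @ K.T
    u = np.round(proj[:, 0]).astype(int)
    v = np.round(proj[:, 1]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    D = descriptors.shape[1]
    raster = np.zeros((h, w, D), dtype=np.float32)
    zbuf = np.full((h, w), np.inf, dtype=np.float32)
    idx = np.nonzero(valid)[0][inside]
    for i, x, y in zip(idx, u[inside], v[inside]):
        if z[i] < zbuf[y, x]:               # nearest point wins the pixel
            zbuf[y, x] = z[i]
            raster[y, x] = descriptors[i]
    return raster
```

In the reference, a raster like this (computed at several resolutions) is what the rendering network decodes into albedo, normals, environmental shadow maps, and a segmentation mask.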
Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) in view of Debevec (Acquiring the Reflectance Field of a Human Face, 2000).

Regarding Claim 11, Chen teaches a method, comprising: retrieving a plurality of stage images including a plurality of views of a subject (Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras… and instruct our subjects to make diverse expressions with head movements”); retrieving a plurality of self-images of the subject by using a mobile device while the subject is being moved with respect to a point light source (Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone… We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”. A point light source in its broadest reasonable interpretation is any source of light originating from a point, which is inherent in a description of various lighting conditions, as demonstrated in Figure 1); and generating a three-dimensional (3D) mesh of a head of the subject based on the stage images (Section 3.1: “the 3D facial geometry and appearance that are captured from a multi-view capture light-stage...The decoder, D, can generate instances of a person’s face by taking, as input, a latent code z, which encodes the expression”. Latent code z is derived from the input image, as demonstrated in Figure 2).

Chen does not teach determining a reflectance using the plurality of self-images of the subject. However, Debevec teaches determining a reflectance using an image (Figure 3 picture and description: “Reflectance Functions for a Face: This mosaic is formed from the reflectance functions of a 15x44 sampling of pixels from the original 480x720 image data. Each 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere. The inset shows the same view of the face under a combination of three lighting directions. The functions have been brightened by a factor of four from the original data”. Equations 4 and 5 demonstrate how the images are processed, and the reflectance is used to obtain a rendered pixel and subsequent rendered image in Figure 6. Notes: reflectance can be determined from an image through a sampling scheme of pixels. It is also worth noting that the source or method of obtaining the image(s) is irrelevant to the calculation of the reflectance).

Chen and Debevec are considered to be analogous to the claimed invention because both are in the same field of rendering relit images of a subject’s head. A common motivation in the art is to use light values such as reflectance and lighting to improve the depiction and rendering of a subject’s head. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the plurality of self-images of a subject of Chen with the determining of a reflectance using an image of Debevec; doing so would yield the predictable result of being able to generate a more representative and detailed 3D mesh of a head of the subject.
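For context on the cited reflectance functions: Debevec's premise is that relighting is linear in illumination, so a pixel under novel lighting is a weighted sum of its appearance under the sampled lighting directions. The following reconstructs the idea behind the quoted Equations 4-5 in paraphrased notation; treat the symbols as an illustration, not the paper's exact formulation.

```latex
% Reflectance in its broadest sense: the fraction of incident light that a
% surface point reflects, 0 <= rho <= 1.
% Image-based relighting: each pixel (x, y) stores a reflectance function
% R_{x,y}(theta, phi) sampled over ~2000 lighting directions, and a pixel
% rendered under a novel illumination map L is the weighted sum
\hat{P}(x, y) \;=\; \sum_{\theta, \phi} R_{x,y}(\theta, \phi)\, L(\theta, \phi)
```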
Regarding Claim 12, the method of Claim 11 is rejected over Chen as modified. Chen as modified teaches generating a texture map for the subject based on the stage images and the self-images (Chen, Figure 2 demonstrates a pipeline in which a texture is derived using information from input images; Chen, Equation 1 supplements this by describing the components of the input images used to derive the texture. Figure 8 further illustrates how a captured image is used as input to generate a texture map. Additionally, in Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”. Therefore, a plurality of images of a subject resulting from a multi-view scan are used in a training pipeline as demonstrated in Figure 2. Lastly, in Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone… We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”. This demonstrates that videos captured via mobile device can utilize the pipeline of Figure 2 to generate a texture map based on the self-images).

Regarding Claim 13, the method of Claim 12 is rejected over Chen as modified. Chen as modified teaches generating a texture map wherein the texture map comprises a view-dependent (Section 3.1: “the view-dependent texture”, as referenced in Chen, Equation 1) and illumination-dependent texture map (Chen, Section 3.2: “where illumination variations are represented using gain and bias maps, g and b”, as referenced in Chen, Equation 2 and Chen, Figure 2, wherein the gain and bias maps are used to produce a relit texture).
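Claims 12-13 lean on Chen's gain-and-bias lighting model: illumination variation is represented by per-texel gain and bias maps applied to the texture. A plausible reading of the cited Equation 2 follows in paraphrased notation; the exact functional dependence is an assumption drawn from the quoted Section 3.2.

```latex
% Relit texture from gain map g and bias map b, predicted from lighting l,
% head pose h, viewpoint v, and expression code z (\odot is elementwise):
T_{\text{relit}} \;=\; g(l, h, v, z) \odot T \;+\; b(l, h, v, z)
```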
Claims 1-6 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) as modified by Debevec (Acquiring the Reflectance Field of a Human Face, 2000), in view of Lombardi (Deep Appearance Models for Face Rendering, 2018).

Regarding Claim 1, Chen as modified teaches a system comprising: a mobile device operable to generate a mobile capture of a subject (Chen, Figure 1: “Input image captured by an iPhone”; Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”); a plurality of cameras configured to provide a multi-view scan of the subject under a fully lit condition (Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”; Chen, Section 3.2: “we extend the capture system in [27] to include 460 controllable lights that are synchronized with the multi-view camera system. The captured sequence was extended to include a portion where nonoverlapping groups of approximately 10 lights were turned on, interleaved with fully lit frames that were used for tracking. This data was used to build a relightable face model using the scheme illustrated in Figure 2”. Notes: as defined in applicant specifications in Paragraph 0035: “The multi-view scan 212 generates multiple images that are used by a processor to form a 3D mesh 216 of the subject’s head under a (fixed) uniform lighting configuration. The multi-view scan 212 is performed simply by simultaneously taking a single picture by each of multiple cameras under a uniform lighting condition provided by several light sources”; therefore, multi-view scan is taken to mean, in its broadest reasonable interpretation, a process for taking a plurality of images under uniform lighting, wherein the images are taken simultaneously); and a pipeline configured to perform a plurality of processes using the mobile capture and the multi-view scan to generate a relightable avatar (Chen, Figure 1 demonstrates processing to get from a captured image taken on an iPhone to a relit avatar. Chen, Figure 2 visualizes the training pipeline: “Training the lighting model on the light-stage data. We update the lighting model G and per-frame expression code z while fixing the other parameters”. Chen, Figures 3 and 4 also demonstrate the visual steps of the processing that occurs during the pipeline. Notes: light-stage data is defined by Chen as being the images captured from the multi-view light-stage defined in Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”. Therefore, a plurality of images of a subject resulting from a multi-view scan are used in a training pipeline as demonstrated in Figure 2), wherein the mobile capture includes a video captured while the subject is moved relative to a light source (Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”), and wherein the pipeline corresponds to a processor (a processor is inherent to be able to perform the pipeline as demonstrated in Chen, Figure 2) configured for: a first processing stage configured for determining a reflectance using the mobile capture (Debevec, Figure 3 picture and description: “Reflectance Functions for a Face: This mosaic is formed from the reflectance functions of a 15x44 sampling of pixels from the original 480x720 image data. Each 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere. The inset shows the same view of the face under a combination of three lighting directions. The functions have been brightened by a factor of four from the original data”. Debevec, Equations 4 and 5 demonstrate how the images are processed, and the reflectance is used to obtain a rendered pixel and subsequent rendered image in Debevec, Figure 6. Notes: reflectance can be determined from an image through a sampling scheme of pixels. It is also worth noting that the source or method of obtaining the image(s) is irrelevant to the calculation of the reflectance); and a second processing stage for determining a relightable model of a head of the subject using the multi-view scan (Chen, Figure 1 demonstrates processing to get from a captured image taken on an iPhone to a relit avatar. Chen, Figure 2 visualizes the training pipeline: “Training the lighting model on the light-stage data. We update the lighting model G and per-frame expression code z while fixing the other parameters”. Chen, Figures 3 and 4 also demonstrate the visual steps of the processing that occurs during the pipeline. Notes: light-stage data is defined by Chen as being the images captured from the multi-view light-stage defined in Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”. Therefore, a plurality of images of a subject resulting from a multi-view scan are used in a training pipeline as demonstrated in Chen, Figure 2).

Chen as modified does not explicitly teach uniform illumination, although it does teach a “fully lit” lighting condition. However, Lombardi teaches a plurality of cameras configured to provide a multi-view scan under a uniform illumination (Section 3: “The device contains 40 machine vision cameras capable of synchronously capturing 5120×3840 images at 30 frames per second … [and] evenly place 200 directional LED point lights directed at the face to promote uniform illumination.”). Chen as modified and Lombardi are considered analogous in the art, since both references teach generating relightable avatars using multi-view scans taken through a multi-camera apparatus. It should be noted that Chen directly references Lombardi, in that Chen expands on Lombardi’s camera system (Chen, Section 3.2: “we extend the capture system in [27]”, wherein reference 27 is Lombardi). Consequently, since Chen extends Lombardi, uniform illumination is present in both Chen and Lombardi, and the “fully lit pattern” described by Chen in Section 4.1 in the context of Lombardi is taken to mean uniform illumination. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the system of Chen with the uniform illumination of Lombardi to yield the predictable result of generating relit avatars.

Regarding Claim 2, the system of Claim 1 is rejected over Chen as modified. Chen as modified teaches a system, wherein the plurality of cameras are fixed around the subject (Lombardi, Section 3: “The device contains 40 machine vision cameras capable of synchronously capturing 5120×3840 images at 30 frames per second. All cameras lie on the frontal hemisphere of the face and are placed at a distance of about one meter from it”), and the uniform illumination is provided by a plurality of light sources (Lombardi, Section 3: “200 directional LED point lights directed at the face to promote uniform illumination”).

Regarding Claim 3, the system of Claim 1 is rejected over Chen as modified. Chen as modified teaches a system, wherein the plurality of cameras are configured to simultaneously take images of the multi-view scan (Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images at 1334×2048 / 90 fps”).

Regarding Claim 4, the system of Claim 3 is rejected over Chen as modified. Chen as modified teaches a system, wherein the images of the multi-view scan comprise a coarse geometry of a face including at least eyes, a nose and a mouth of the subject, and hairs of the subject (Chen, Figure 2 shows that the mesh and texture map both show at least eyes, a nose, a mouth, and hairs of the subject).

Regarding Claim 5, the system of Claim 1 is rejected over Chen as modified. Chen as modified teaches a first processing stage further configured to determine at least a pose and lighting parameters based on the mobile capture and the multi-view scan (Chen, Section 3.2: “we extend the capture system in [27] to include 460 controllable lights that are synchronized with the multi-view camera system. The captured sequence was extended to include a portion where nonoverlapping groups of approximately 10 lights were turned on, interleaved with fully lit frames that were used for tracking. This data was used to build a relightable face model using the scheme illustrated in Figure 2”. Chen, Figure 2 demonstrates the pipeline, wherein pose and lighting are represented as h and l, and defined as such in Chen, Section 3.2: “the lighting, head pose … These inputs, represented by l, h … are processed by MLPs”. Chen further elaborates in Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”, where the phone-captured images are demonstrated as having been fed into the pipeline in Chen, Figure 6).
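The Claim 5 mapping rests on Chen's statement that the lighting l and head pose h “are processed by MLPs.” Below is a minimal sketch of that conditioning pattern; the layer sizes, the 6-D pose (rigid rotation r ∈ R3 plus viewpoint vector v ∈ R3, per Chen's footnote), and the PyTorch framing are illustrative assumptions, not Chen's architecture.

```python
import torch
import torch.nn as nn

class LightingModel(nn.Module):
    """Toy stand-in for a lighting model G: embed lighting and head pose
    with small MLPs, then decode per-texel gain and bias maps."""

    def __init__(self, light_dim=460, pose_dim=6, tex=64):
        super().__init__()
        self.light_mlp = nn.Sequential(nn.Linear(light_dim, 128), nn.ReLU(),
                                       nn.Linear(128, 128))
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                      nn.Linear(128, 128))
        # fuse the two codes and decode 3-channel gain and bias maps
        self.decoder = nn.Linear(256, tex * tex * 6)
        self.tex = tex

    def forward(self, lighting, pose):
        code = torch.cat([self.light_mlp(lighting), self.pose_mlp(pose)], dim=-1)
        maps = self.decoder(code).view(-1, 6, self.tex, self.tex)
        gain, bias = maps[:, :3], maps[:, 3:]
        return gain, bias  # applied as gain * texture + bias

model = LightingModel()
g, b = model(torch.rand(1, 460), torch.rand(1, 6))  # one light pattern, one pose
```

The 460-dimensional lighting input mirrors the 460 controllable lights Chen describes; everything else is a placeholder.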
Regarding Claim 6, the system of Claim 5 is rejected over Chen as modified. Chen as modified teaches a second processing stage further configured to generate a relightable model of a head of the subject (Chen, Figure 4) based on the reflectance (Debevec, Figure 3 picture and description: “Reflectance Functions for a Face: This mosaic is formed from the reflectance functions of a 15x44 sampling of pixels from the original 480x720 image data. Each 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere. The inset shows the same view of the face under a combination of three lighting directions. The functions have been brightened by a factor of four from the original data”. Debevec, Equations 4 and 5 demonstrate how the images are processed, and the reflectance is used to obtain a rendered pixel and subsequent rendered image in Debevec, Figure 6), pose and the lighting parameters (Chen, Figure 2).

Regarding Claim 10, the system of Claim 1 is rejected over Chen as modified. Chen as modified teaches a light source that comprises a point light source (Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone… We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”. Notes: a point light source in its broadest reasonable interpretation is any source of light originating from a point, which is inherent in a description of various lighting conditions, as demonstrated in Chen, Figure 1).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) as modified by Debevec (Acquiring the Reflectance Field of a Human Face, 2000) and Lombardi (Deep Appearance Models for Face Rendering, 2018), in view of Geng (Single-view facial reflectance inference with a differentiable renderer, 2021).

Regarding Claim 7, the system of Claim 5 is rejected over Chen as modified. Chen as modified teaches a pipeline configuration, comprising: a differentiable renderer configured to combine the pose and the lighting parameters with images of the multi-view scan to provide a rendered image (Chen, Figure 2). Chen does not teach a differentiable renderer configured to take as input the reflectance. However, Geng teaches a differentiable renderer configured to combine the reflectance from an input image to provide a rendered image (Figure 2 demonstrates the pipeline including the use of reflectance as a component in the differentiable renderer: “The overall algorithm pipeline. The whole process can be divided into two phases. In the initialization phase, we extract a textured 3D facial mesh from the input image. Multiple encoders then convert the unwrapped color texture to a latent representation. In the optimization phase, separate reflectance components are decoded from the latent representation and fed to a differentiable renderer to produce a reconstructed image, based on which a reconstruction loss can be computed and guide the iterative updating of latent coefficients for skin reflectance and lighting via backprop”). Chen and Geng are considered to be analogous to the claimed invention because they are in the same field of rendering relit images of a subject’s head. A common motivation in the art is to use light values such as reflectance and lighting to improve the depiction and rendering of a subject’s head. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the differentiable renderer of Chen, which generates an image of the subject’s head using pose and lighting, with the differentiable renderer of Geng, which generates a rendered image of a subject’s head using reflectance; doing so would yield the predictable result of a more accurately rendered image of the subject’s head according to the light of a given environment.
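The Geng passage quoted above describes an analysis-by-synthesis loop: decode reflectance components from a latent representation, render them differentiably, and backpropagate a reconstruction loss to update the latent coefficients for skin reflectance and lighting. A schematic sketch of that optimization phase follows; the decode and render callables are placeholders, not Geng's code.

```python
import torch

def fit_reflectance(image, decode, render, latent, lighting, steps=200, lr=1e-2):
    """Optimization phase of a differentiable-rendering pipeline: iteratively
    update latent reflectance and lighting codes so the rendered image
    matches the input photograph.

    decode(latent) -> reflectance components (e.g. albedo, specular maps)
    render(components, lighting) -> differentiable rendered image
    """
    latent = latent.clone().requires_grad_(True)
    lighting = lighting.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent, lighting], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        components = decode(latent)              # reflectance from latents
        rendered = render(components, lighting)  # differentiable renderer
        loss = torch.nn.functional.l1_loss(rendered, image)
        loss.backward()                          # gradients flow through the renderer
        opt.step()
    return decode(latent), lighting.detach()
```

The key property the rejection relies on is that the renderer is differentiable, so the reconstruction loss can drive the reflectance estimate directly.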
Claims 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) as modified by Debevec (Acquiring the Reflectance Field of a Human Face, 2000), Lombardi (Deep Appearance Models for Face Rendering, 2018), and Geng (Single-view facial reflectance inference with a differentiable renderer, 2021), in view of Khakhulin (Pub. No. US 2023/0154111 A1).

Regarding Claim 8, the system of Claim 5 is rejected over Chen as modified. Chen teaches a pose, wherein the pose comprises a head pose (Chen, Section 3.2: “The gain and bias maps depend on the lighting, head pose, viewpoint, and expression”, further described in Chen, footnote 1 on page 13061: “Here, the rigid head pose consists of two parts: rigid rotation r ∈ R3 and camera viewpoint vector v, v ∈ R3”). Chen does not teach a camera pose along with a head pose. However, Khakhulin teaches estimating a head pose and camera pose from a captured image (Paragraph [0060]: “the head pose, the facial expression, and the camera pose estimated from the target image xt”). Chen and Khakhulin are considered analogous to the claimed invention because both are in the same field of rendering images of a subject’s head using pose estimation. A common motivation in the art is to utilize pose as a baseline for reconstructing an image of a subject’s head; pose information pertaining to the camera is useful to consider in relation to the pose of the head, as the view of the head originates from the camera’s position and is influenced by the camera’s orientation. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the distinction of a head and camera pose of Khakhulin with the pose calculation involving a viewpoint vector of Chen; doing so would predictably result in a more definitive pose parameter, allowing for a more effectively rendered head of a subject.

Regarding Claim 9, the system of Claim 8 is rejected over Chen as modified. Chen teaches a pose comprised of the head pose, wherein the camera pose comprises a first distance between the mobile device and a fixed point, and wherein the head pose comprises a second distance between the mobile device and the fixed point (Chen, footnote 1, pg. 13061: “Here, the rigid head pose consists of two parts: rigid rotation r ∈ R3 and camera viewpoint vector v, v ∈ R3. Similar to [27], we assume that the viewpoint vector is relative to the rigid head orientation that is estimated from the tracking algorithm”. Notes: camera viewpoint vector v provides the distance between the camera, being the mobile device, and a fixed point, being the head. The camera is noted as being an iPhone in Chen, Figure 1; Khakhulin, Paragraph [0060]: “the head pose, the facial expression, and the camera pose estimated from the target image xt”. Chen demonstrates the use of the head as a fixed point, along with a camera viewpoint vector v that provides a distance between the fixed point and the camera. The same viewpoint vector v inherently defines the pose of the camera as well, considering it provides the distance between the head and the mobile device providing the camera viewpoint.).

Claims 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) as modified by Debevec (Acquiring the Reflectance Field of a Human Face, 2000), Lombardi (Deep Appearance Models for Face Rendering, 2018), Geng (Single-view facial reflectance inference with a differentiable renderer, 2021), and Khakhulin (Pub. No. US 2023/0154111 A1), in view of Saragih (Pub. No. US 2022/0237843 A1).

Regarding Claim 14, the method of Claim 13 is rejected over Chen as modified. Chen as modified teaches generating, based on the texture map and the 3D mesh, a view of the subject illuminated by a synthetic light source (Chen, Figures 1 and 2 demonstrate generating a view of a subject illuminated by a synthetic light source using a texture map and 3D mesh derived from stage and self-images). Chen as modified does not teach a view of the subject illuminated by a synthetic light source associated with an environment in an immersive reality (IR) application. However, Saragih teaches generating an expression-dependent texture map and view-dependent texture for the subject derived from images, where lighting configurations can be selected to get a view of the subject illuminated by a synthetic light source, wherein the synthetic light source is associated with an environment in an immersive reality (IR) application (Figure 12 demonstrates a pipeline, in which a step is to “Generate, based on the expression-dependent texture map and the view-dependent texture map, a view of the subject illuminated by a light source selected from an environment in an immersive reality application”). Chen and Saragih are considered to be analogous to the claimed invention because both are in the same field of rendering synthetically relit images of a subject using textures and 3D meshes. A common motivation within the art is to model how light visually affects the appearance of a subject through the use of textures and 3D meshes; a commonly known application for 3D modeling is for use in virtual applications, such as IR. Hence, one ordinarily skilled in the art would be motivated to apply synthetic lighting to a model within the context of a source of synthetic lighting in an IR application. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of generating a 3D mesh and texture map of a subject from stage images and mobile capture for use in rendering a view of a subject illuminated by a synthetic light source of Chen with the motivation and teachings related to immersive reality applications of Saragih, to yield the predictable result of rendering images of subjects by using a 3D mesh and texture map derived from captured images and synthetically illuminating them with regard to an environment in an IR application.

Regarding Claim 15, the method of Claim 14 is rejected over Chen as modified. Chen as modified teaches providing a view of the subject (Chen, Figures 1 and 2) to the IR application running on a headset (Saragih, Figure 12).
Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Sevastopolskiy (US 20220157014 A1).

Regarding Claim 18, the method of Claim 17 is rejected over Sevastopolskiy. Sevastopolskiy teaches forming the plurality of synthetic views of the subject under a plurality of illumination configurations (Paragraph [0020]: “The sequence of images may be captured by a camera with blinking flash while moving the camera at least partly around the person (e.g., partly around an upper body of the person). Thus, the sequence of images comprises a set of flash images and a set of no-flash images”). Sevastopolskiy does not explicitly teach forming the plurality of synthetic views of the subject for each illumination configuration of the plurality of illumination configurations. However, Sevastopolskiy’s illumination configuration is clearly capable of forming the plurality of synthetic views of the subject for each illumination configuration of the plurality of illumination configurations; doing so would simply require the sequence to be taken with the flash turned on and with the flash turned off separately. A common motivation in the art is to obtain sufficient image data for a learning model: more image data gives the learning model more data to train on, resulting in better results. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the method of retrieving images of Sevastopolskiy such that the plurality of synthetic views of the subject are formed for each illumination configuration of the plurality of illumination configurations; doing so would yield the predictable result of a larger amount of flash and no-flash image data for the model to train on, leading to better results.
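The Claim 18 rationale turns on forming synthetic views for each illumination configuration separately. Under Sevastopolskiy's alternating-flash capture, the first step would be partitioning the sequence by flash state before reconstructing views per set; a trivial sketch follows (the frame dictionary and its flash flag are hypothetical, for illustration only).

```python
def split_by_flash(frames):
    """Partition an alternating flash/no-flash capture into its two
    illumination configurations, so synthetic views can then be formed
    per configuration. Each frame is assumed to carry a `flash` flag."""
    flash = [f for f in frames if f["flash"]]
    no_flash = [f for f in frames if not f["flash"]]
    return {"flash": flash, "no_flash": no_flash}
```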
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Sevastopolskiy (US 20220157014 A1) in view of Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021).

Regarding Claim 20, the method of Claim 16 is rejected over Sevastopolskiy. Sevastopolskiy does not teach using a plurality of cameras and a plurality of light sources to provide a uniform illumination to capture at least some of the plurality of images of the subject. However, Chen teaches using a plurality of cameras and a plurality of light sources to provide a uniform illumination to capture at least some of the plurality of images of the subject (Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images at 1334×2048 / 90 fps and a total of 460 white LED lights … There are 50 different lighting patterns and one fully-lit pattern. We record a total of 13 minutes video sequence of one subject”). Sevastopolskiy and Chen are considered analogous in the art with respect to forming 3D representations of a subject via collected 2D images. A motivation in the art for learning models that generate 3D representations of subjects through images is to diversify data collection for testing purposes, to allow the model to generalize more efficiently. Furthermore, Chen utilizes both mobile device captures and a dedicated camera setup that is able to capture subjects in uniform illumination to obtain images (Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images at 1334×2048 / 90 fps and a total of 460 white LED lights … There are 50 different lighting patterns and one fully-lit pattern. We record a total of 13 minutes video sequence of one subject”; Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”). Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the method of collecting images via a mobile device of Sevastopolskiy with the more detailed method of collecting images via a mobile device and a plurality of cameras and light sources under uniform illumination of Chen; doing so would yield the predictable result of a comprehensive image data collection method that would improve the results of a learning model that is used to generate a 3D representation of a subject.

Response to Arguments

The Examiner agrees that Claim 1 does not invoke 35 U.S.C. 112(f), and going forward Claim 1 will no longer be viewed as invoking 35 U.S.C. 112(f).

Applicant’s arguments, see pages 7-8, filed 02/05/2026, with respect to the rejection of Claim 11 under 35 U.S.C. 102 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn.
However, upon further consideration, a new ground(s) of rejection is made over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) in view of Debevec (Acquiring the Reflectance Field of a Human Face, 2000). Applicant amended independent Claim 11 to recite, inter alia, “determining a reflectance using the plurality of self-images of the subject”. While the Examiner agrees that Chen fails to disclose “determine at least a reflectance” with regard to original Claim 5, the amended limitation constitutes a new ground for rejection.

Chen teaches retrieving a plurality of stage images including a plurality of views of a subject (Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras… and instruct our subjects to make diverse expressions with head movements”); retrieving a plurality of self-images of the subject by using a mobile device while the subject is being moved with respect to a point light source (Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone… We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”. A point light source in its broadest reasonable interpretation is any source of light originating from a point, which is inherent in a description of various lighting conditions, as demonstrated in Figure 1); and generating a three-dimensional (3D) mesh of a head of the subject based on the stage images (Section 3.1: “the 3D facial geometry and appearance that are captured from a multi-view capture light-stage...The decoder, D, can generate instances of a person’s face by taking, as input, a latent code z, which encodes the expression”. Latent code z is derived from the input image, as demonstrated in Figure 2).

As Applicant argues, while Chen doesn’t teach “determining a reflectance using the plurality of self-images of the subject”, Debevec teaches determining a reflectance using the plurality of self-images of the subject (Figure 3 picture and description: “Reflectance Functions for a Face: This mosaic is formed from the reflectance functions of a 15x44 sampling of pixels from the original 480x720 image data. Each 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere. The inset shows the same view of the face under a combination of three lighting directions. The functions have been brightened by a factor of four from the original data”. Equations 4 and 5 demonstrate how the images are processed, and the reflectance is used to obtain a rendered pixel and subsequent rendered image in Figure 6. Notes: reflectance can be determined from an image through a sampling scheme of pixels. It is also worth noting that the source or method of obtaining the image(s) is irrelevant to the calculation of the reflectance). Chen and Debevec are considered to be analogous to the claimed invention because both are in the same field of rendering relit images of a subject’s head. A common motivation in the art is to use light values such as reflectance and lighting to improve the depiction and rendering of a subject’s head. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the plurality of self-images of a subject of Chen with the determining of a reflectance using an image of Debevec; doing so would yield the predictable result of being able to generate a more representative and detailed 3D mesh of a head of the subject. Thus, amended Claim 11 is rejected under 35 U.S.C. 103 over Chen in view of Debevec.

With respect to the dependent Claims of Claim 11, Claims 12-13 are rejected over Chen in view of Debevec; Claims 14-15 are rejected over Chen as modified by Debevec, Lombardi (Deep Appearance Models for Face Rendering, 2018), Geng (Single-view facial reflectance inference with a differentiable renderer, 2021), and Khakhulin (Pub. No. US 2023/0154111 A1), in view of Saragih (Pub. No. US 2022/0237843 A1). The addition of new references in the rejection of Claims 12-15 is necessitated by the new ground introduced in Claim 11.

Applicant’s arguments, see pages 8-9, filed 02/05/2026, with respect to the rejection of Claim 1 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made over Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021) and Debevec (Acquiring the Reflectance Field of a Human Face, 2000), in view of Lombardi (Deep Appearance Models for Face Rendering, 2018). Claim 1 is amended to recite, inter alia, “a first processing stage for determining a reflectance using the mobile capture,” and “a second processing stage for determining a relightable model of a head of the subject using the multi-view scan”. Applicant argues that Chen as modified by Lombardi fails to disclose or suggest at least “determining a reflectance” as recited in currently amended independent Claim 1. The Examiner agrees with the Applicant’s argument. However, the new limitations introduced in amended Claim 1 constitute new grounds for rejection.

Chen as modified by Debevec teaches a system comprising: a mobile device operable to generate a mobile capture of a subject (Chen, Figure 1: “Input image captured by an iPhone”; Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”); a plurality of cameras configured to provide a multi-view scan of the subject under a fully lit condition (Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”; Chen, Section 3.2: “we extend the capture system in [27] to include 460 controllable lights that are synchronized with the multi-view camera system. The captured sequence was extended to include a portion where nonoverlapping groups of approximately 10 lights were turned on, interleaved with fully lit frames that were used for tracking. This data was used to build a relightable face model using the scheme illustrated in Figure 2”. Notes: as defined in applicant specifications in Paragraph 0035: “The multi-view scan 212 generates multiple images that are used by a processor to form a 3D mesh 216 of the subject’s head under a (fixed) uniform lighting configuration. The multi-view scan 212 is performed simply by simultaneously taking a single picture by each of multiple cameras under a uniform lighting condition provided by several light sources”; therefore, multi-view scan is taken to mean, in its broadest reasonable interpretation, a process for taking a plurality of images under uniform lighting, wherein the images are taken simultaneously); and a pipeline configured to perform a plurality of processes using the mobile capture and the multi-view scan to generate a relightable avatar (Chen, Figure 1 demonstrates processing to get from a captured image taken on an iPhone to a relit avatar. Chen, Figure 2 visualizes the training pipeline: “Training the lighting model on the light-stage data. We update the lighting model G and per-frame expression code z while fixing the other parameters”. Chen, Figures 3 and 4 also demonstrate the visual steps of the processing that occurs during the pipeline. Notes: light-stage data is defined by Chen as being the images captured from the multi-view light-stage defined in Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”. Therefore, a plurality of images of a subject resulting from a multi-view scan are used in a training pipeline as demonstrated in Figure 2), wherein the mobile capture includes a video captured while the subject is moved relative to a light source (Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”), and wherein the pipeline corresponds to a processor (a processor is inherent to be able to perform the pipeline as demonstrated in Chen, Figure 2) configured for: a first processing stage configured for determining a reflectance using the mobile capture (Debevec, Figure 3 picture and description: “Reflectance Functions for a Face: This mosaic is formed from the reflectance functions of a 15x44 sampling of pixels from the original 480x720 image data. Each 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere. The inset shows the same view of the face under a combination of three lighting directions. The functions have been brightened by a factor of four from the original data”. Debevec, Equations 4 and 5 demonstrate how the images are processed, and the reflectance is used to obtain a rendered pixel and subsequent rendered image in Debevec, Figure 6. Notes: reflectance can be determined from an image through a sampling scheme of pixels. It is also worth noting that the source or method of obtaining the image(s) is irrelevant to the calculation of the reflectance); and a second processing stage for determining a relightable model of a head of the subject using the multi-view scan (Chen, Figure 1 demonstrates processing to get from a captured image taken on an iPhone to a relit avatar. Chen, Figure 2 visualizes the training pipeline: “Training the lighting model on the light-stage data. We update the lighting model G and per-frame expression code z while fixing the other parameters”. Chen, Figures 3 and 4 also demonstrate the visual steps of the processing that occurs during the pipeline. Notes: light-stage data is defined by Chen as being the images captured from the multi-view light-stage defined in Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”. Therefore, a plurality of images of a subject resulting from a multi-view scan are used in a training pipeline as demonstrated in Chen, Figure 2).

As noted previously, Chen and Debevec are considered to be analogous to the claimed invention because both are in the same field of rendering relit images of a subject’s head. A common motivation in the art is to use light values such as reflectance and lighting to improve the depiction and rendering of a subject’s head.
Chen as modified does not explicitly teach uniform illumination, although it does teach a “fully lit” lighting condition. However, Lombardi teaches a plurality of cameras configured to provide a multi-view scan under a uniform illumination (Section 3: “The device contains 40 machine vision cameras capable of synchronously capturing 5120×3840 images at 30 frames per second … [and] evenly place 200 directional LED point lights directed at the face to promote uniform illumination.”). Chen as modified and Lombardi are considered analogous in the art, since both references teach generating relightable avatars using multi-view scans taken through a multi-camera apparatus. It should be noted that Chen directly references Lombardi, in that Chen expands on Lombardi’s camera system (Chen, Section 3.2: “we extend the capture system in [27]”, wherein reference 27 is Lombardi). Consequently, since Chen extends Lombardi, uniform illumination is present in both Chen and Lombardi, and the “fully lit pattern” described by Chen in Section 4.1 in the context of Lombardi is taken to mean uniform illumination. Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to combine the system of Chen with the uniform illumination of Lombardi to yield the predictable result of generating relit avatars.

Applicant argues that even if, arguendo, Debevec suggests “a reflectance” recited in Claim 1, Debevec states “[e]ach 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere.” In its broadest reasonable interpretation, reflectance is an attribute value or values quantifying the reflection of light off a material. Therefore, Debevec’s reflectance function determines a reflectance.

Applicant argues that Chen in view of Debevec fails to disclose “determining a reflectance using the mobile capture”. As previously stated, Debevec discloses a method for determining a reflectance (Debevec, Figure 3: “This mosaic is formed from the reflectance functions of a 15x44 sampling of pixels from the original 480x720 image data. Each 64x32 reflectance function consists of the corresponding pixel location’s appearance under two thousand lighting directions distributed throughout the sphere. The inset shows the same view of the face under a combination of three lighting directions”). While Debevec does not disclose determining a reflectance using a mobile capture, Debevec provides a method for determining a reflectance for an image, and a person having ordinary skill in the art would appreciate that the source of an image is irrelevant to the calculation of the reflectance of the image. Hence, the method of Debevec can be used for determining the reflectance of a mobile capture. Furthermore, Chen teaches “a mobile capture of a subject” (Chen, Section 4.1: “The in-the-wild video test were gathered using the frontal camera of an iPhone. We captured videos for 10 subjects. We collected around 5 video clips for each subject, performing different facial expressions and head movements, under various lighting conditions and environments”). Therefore, one ordinarily skilled in the art would be able to combine the mobile capture method of Chen with the method for determining a reflectance of Debevec, with the predictable result of obtaining reflectance of a mobile capture for use in generating relightable avatars.

Applicant further argues that Chen in view of Debevec fails to disclose “determining a relightable model of a head of the subject using the multi-view scan”. As previously stated, Chen teaches determining a relightable model of a head of the subject using the multi-view scan (Chen, Figure 2 visualizes the training pipeline: “Training the lighting model on the light-stage data. We update the lighting model G and per-frame expression code z while fixing the other parameters”. Chen, Figures 3 and 4 also demonstrate the visual steps of the processing that occurs during the pipeline. Notes: light-stage data is defined by Chen as being the images captured from the multi-view light-stage defined in Chen, Section 4.1: “We recorded our light-stage data in a calibrated multi-view light-stage consisting of 40 machine vision cameras capable of synchronously capturing HDR images … and a total of 460 white LED lights … We record a total of 13 minutes video sequence of one subject”. Therefore, a plurality of images of a subject resulting from a multi-view scan are used in a training pipeline as demonstrated in Chen, Figure 2). While Chen does not explicitly state that a “multi-view scan” is obtained, Chen records videos (sequences of images) of the subject from different angles that are defined by the setup in Chen, Section 4.1. As noted by the Applicant in the Specification: “The multi-view scan 212 generates multiple images that are used by a processor to form a 3D mesh 216 of the subject’s head under a (fixed) uniform lighting configuration. The multi-view scan 212 is performed simply by simultaneously taking a single picture by each of multiple cameras under a uniform lighting condition provided by several light sources”. While it is only implicit that Chen records the videos under uniform lighting, Chen as modified by Lombardi teaches performing the capture of the multiple images under uniform lighting (Lombardi, Section 3: “The device contains 40 machine vision cameras capable of synchronously capturing 5120×3840 images at 30 frames per second … [and] evenly place 200 directional LED point lights directed at the face to promote uniform illumination”). It is worth noting that Chen references the system of Lombardi for its own use. Thus, Chen as modified by Debevec and Lombardi teaches “determining a relightable model of a head of the subject using the multi-view scan”.

Regarding the dependent claims of amended Claim 1, Claims 2-4 and 10, and amended Claims 5-6, are rejected over Chen as modified by Debevec, in view of Lombardi; Claim 7 is rejected over Chen as modified by Debevec and Lombardi, in view of Geng (Single-view facial reflectance inference with a differentiable renderer, 2021); Claims 8-9 are rejected over Chen as modified by Debevec, Lombardi, and Geng, in view of Khakhulin (Pub. No. US 2023/0154111 A1). With respect to amended Claim 16 and its dependents, amended Claim 16 and Claims 17-19 are rejected over Sevastopolskiy as necessitated by a new ground for rejection of a new limitation; Claim 20 is rejected over Sevastopolskiy in view of Chen (High-fidelity Face Tracking for AR/VR via Deep Lighting Adaptation, 2021).

Conclusion

Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.
In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAYMOND CHUN LAM LI whose telephone number is (571) 272-5124. The examiner can normally be reached M-F 8:30-5. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at 571-272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RAYMOND CHUN LAM LI/
Examiner, Art Unit 2614

/KENT W CHANG/
Supervisory Patent Examiner, Art Unit 2614

Prosecution Timeline

Apr 05, 2024
Application Filed
Oct 28, 2025
Non-Final Rejection — §102, §103
Feb 05, 2026
Response Filed
Mar 16, 2026
Final Rejection — §102, §103 (current)


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: Favorable
Median Time to Grant: 2y 9m
PTA Risk: Moderate
Based on 0 resolved cases by this examiner. Grant probability derived from career allow rate.
