Prosecution Insights
Last updated: April 19, 2026
Application No. 18/357,400

METHOD AND DEVICE WITH MODEL FOR 3D SCENE GENERATION

Status: Non-Final OA (§103)
Filed: Jul 24, 2023
Examiner: PROVIDENCE, VINCENT ALEXANDER
Art Unit: 2617
Tech Center: 2600 — Communications
Assignee: The Board Of Trustees Of The Leland Stanford Junior University
OA Round: 3 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 5m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (above average; 15 granted / 18 resolved; +21.3% vs TC avg)
Interview Lift: +25.0% across resolved cases with interview (strong)
Avg Prosecution: 2y 5m typical timeline; 38 applications currently pending
Total Applications: 56 across all art units (career history)

Statute-Specific Performance

§101: 0.9% (-39.1% vs TC avg)
§103: 82.4% (+42.4% vs TC avg)
§102: 14.8% (-25.2% vs TC avg)
§112: 0.9% (-39.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 18 resolved cases.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The Amendment filed September 4, 2025 has been entered. Claims 1-6, 8-15, and 17-20 are pending in the application. Claims 7 and 16 are cancelled. Applicant’s amendments to Claims 1, 15, and 20 have overcome the § 103 rejections set forth in the previous Final Office Action. A further search has been performed to address the material amended in Claims 1, 15, and 20. Newly found references Watson (NPL: NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS) and Qin (NPL: Learning-by-Novel-View-Synthesis for Full-Face Appearance-Based 3D Gaze Estimation), alongside previously cited reference Chan B (NPL: Efficient Geometry-aware 3D Generative Adversarial Networks), were used for the amended claim limitations.

Response to Arguments

The Examiner appreciates Applicant’s thorough response to the previous Final Office Action. Applicant’s arguments with respect to claims 1, 15 and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 3, 4, 5, 6, 8, 9, 10, 14, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Schwarz et al.: (NPL: GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis) in view of Chan et al.: (NPL: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis, hereinafter Chan A), DeVries et al.: (NPL: Unconstrained Scene Generation with Locally Conditioned Radiance Fields), Watson (NPL: NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS), and Chan et al.: (NPL: Efficient Geometry-aware 3D Generative Adversarial Networks, hereinafter Chan B).

Regarding claim 1: Schwarz teaches: A method of training a neural network model (Schwarz: The radiance field is represented by a deep fully-connected neural network, Pg. 3, Conditional Radiance Field) to generate a three-dimensional (3D) model (see Note 1A) of a scene (Schwarz: Our model allows for generating 3D consistent images at high spatial resolution, Pg. 2, Figure 1), the method comprising: generating the 3D model based on a latent code (Schwarz: The generator Gθ takes […] shape/appearance codes zs ∈ Rm/za ∈ Rn as input, Pg. 4, Section 3.2: Generative Radiance Fields, par. 2; see Note 1B); based on the 3D model, sampling a camera view (Schwarz: We sample the camera pose ξ = [R|t] from a pose distribution, Pg. 4, Section 3.2.1 Generator, par. 1) comprising a camera position and a camera angle (Schwarz: In our experiments, we use a uniform distribution on the upper hemisphere for the camera location with the camera facing towards the origin of the coordinate system, Pg. 4, Section 3.2.1: Generator; par. 1; see Note 1C) corresponding to the 3D model of the scene; generating a two-dimensional (2D) image based on the 3D model (Schwarz: we instead predict a fixed patch of size K × K pixels which is randomly scaled and rotated to provide gradients for the entire radiance field, Pg. 4, Section 3.2 Generative Radiance Fields, par. 2) and the sampled camera view (Schwarz: The generator Gθ takes camera matrix K, camera pose ξ, […] as input and predicts an image patch, Pg. 4, Section 3.2 Generative Radiance Fields, par. 2); and training the neural network model (Schwarz: we aim at learning a model for synthesizing novel scenes by training on unposed images. More specifically, we utilize an adversarial framework to train a generative model for radiance fields (GRAF, Pg. 4, Section 3.2 Generative Radiance Fields, par. 1) to, using the 3D model, generate a scene (Schwarz: we propose a generative model for neural radiance fields (bottom) which represent the scene as a continuous function gθ; Pg. 2, Figure 1) corresponding to the sampled camera view (Schwarz: represent the scene as a continuous function gθ that maps a […] viewing direction d to a color value c and a volume density σ, Pg. 2, Figure 1; see Note 1D) based on the generated 2D image and a real 2D image (Schwarz: The discriminator Dφ compares the synthesized patch P’ to a real patch P extracted from a real image I, Pg. 5, Figure 2; see Note 1E). Note 1A: The radiance field is analogous to a 3D model, as Chan showcases in Figure 9 that a “3D structure can be extracted and visualized using the marching cubes algorithm [36] on the density output of the conditioned radiance field to produce a surface mesh.” (Pg. 7, “Interpreting the 3D representation”). Chan further teaches a “3D structure that represents a proxy shape of the scene”, indicating that 3D structure is part of the scene described by the radiance field. Note 1B: The shape/appearance codes are latent codes, as Schwarz describes on Pg. 5, “Conditional Radiance Field”: “gθ is conditioned on two additional latent codes: a shape code zs ∈ RMs which determines the shape of the object and an appearance code za ∈ RMa which determines its appearance”. Note 1C: The camera pose taught by Schwarz in Pg. 4, Section 3.2.1: Generator; par. 1 comprises a position (“we use a uniform distribution on the upper hemisphere for the camera location”) and angle (“the camera facing towards the origin of the coordinate system”). Note 1D: A viewing direction is analogous to a sampled camera view. Note 1E: The discriminator is utilized during training to assist with inference (generation) of the radiance field, which in turn may be used to generate a 3D model (see Note 1A): “We introduce a patch-based discriminator that samples the image at multiple scales and which is key to learn high-resolution generative radiance fields efficiently.” (Pg. 2, par. 2 (ii)). Note that a “patch” is analogous to a 2D image, as Schwarz teaches that a patch may be “of size K × K pixels” (Pg. 4, Section 3.2 Generative Radiance Fields, par. 2).
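For concreteness, the pose sampling the Examiner maps to this limitation (a camera location drawn uniformly from the upper hemisphere, with the camera facing the origin, per the Schwarz quotes above) can be sketched as follows. This is an illustrative reconstruction, not code from any cited reference; the radius parameter and the look-at axis conventions are assumptions.

```python
import numpy as np

def sample_upper_hemisphere_pose(radius=2.0, rng=None):
    """Sample a camera pose xi = [R|t]: location uniform on the upper
    hemisphere, oriented toward the origin (per the Schwarz quote)."""
    rng = np.random.default_rng() if rng is None else rng
    # Uniform direction on the sphere, folded into the upper half (z >= 0).
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    v[2] = abs(v[2])
    t = radius * v                                 # camera position

    # Look-at rotation: the camera's forward axis points at the origin.
    forward = -t / np.linalg.norm(t)
    up = np.array([0.0, 0.0, 1.0])
    right = np.cross(forward, up)
    if np.linalg.norm(right) < 1e-6:               # camera straight overhead
        up = np.array([0.0, 1.0, 0.0])
        right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward], axis=1)  # world-from-camera
    return R, t
```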
Schwarz fails to explicitly teach: generating the 3D model based on a noise vector; based on the 3D model, sampling a camera view comprising a camera position and a camera angle corresponding to the 3D model of the scene; wherein the sampling of the camera view comprises: initially sampling, for a predetermined number of times, a fixed camera view that is based on a fixed camera pose corresponding to the 3D model: and for each training iteration after a lapse of the predetermined number of times, sampling the camera view using both the fixed camera view and a random camera view that is based on a camera pose randomly determined based on a specific camera view distribution corresponding to the 3D model. Chan A teaches: generating a radiance field based on a noise vector (Chan A: Instead of directly generating a 2D image from the input noise, z, our generator GθG (z, ξ) produces an implicit radiance field conditioned on z, Pg. 3, Section 3: Methods, par. 1; see Note 1H). Note 1H: Chan A teaches that their GAN may take a variable z as input, described as input noise: “Traditional 2D GANs, such as StyleGAN [26], take in a latent vector z ∼ pz and directly produce a 2D image. Instead of directly generating a 2D image from the input noise, z, our generator GθG (z, ξ) produces an implicit radiance field conditioned on z,” (Pg. 3, Section 3: Methods, par. 1). The “noise” taught by Chan A is a noise vector, because Chan A teaches: “we leverage a StyleGAN-inspired mapping network, which conditions the entire MLP on a single input noise vector,” (Pg. 3, par. 2). Schwarz teaches a similar derivation of the z variable on Pg. 4, Section 3.2.1: Generator, par. 2: “The shape and appearance variables zs and za are drawn from shape and appearance distributions zs ∼ ps and za ∼ pa, respectively. In our experiments we use a standard Gaussian distribution for both ps and pa.” Chan A also draws a direct comparison from themselves to Schwarz: “The work most similar to ours is GRAF” (Pg. 3, par. 2) and elaborates that “… GRAF conditioned its MLP generator on both a shape noise code and an appearance noise code by concatenation,” Pg. 3, par. 2). Therefore, one of ordinary skill in the art would understand that the shape/appearance codes zs and za, taught by Schwarz to represent a latent code, may also be a noise vector. It was previously shown in Note 1A that a radiance field is analogous to a 3D model. Therefore, when the teachings of Chan A are combined with Schwarz, it would be obvious to one of ordinary skill in the art to generate a 3D model based on a noise vector. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Chan A with Schwarz. Generating a 3D mesh from the radiance field, as in Chan A, would benefit the Schwarz teachings by enabling easier visualization of the surface represented by the radiance field. 
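The noise-vector conditioning discussed above (shape and appearance codes drawn from standard Gaussians, concatenated into the generator MLP input, as the Chan A quote says GRAF does) might look like the following minimal sketch. The code dimensions and positional-encoding details are assumptions, not taken from any cited reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape/appearance codes drawn from standard Gaussians, as Schwarz
# describes (z_s ~ p_s, z_a ~ p_a, both standard normal). Dimensions
# here are placeholders.
M_s, M_a = 128, 128
z_s = rng.standard_normal(M_s)
z_a = rng.standard_normal(M_a)

def positional_encoding(x, n_freqs=10):
    """Standard sin/cos frequency encoding of a 3D point (illustrative)."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    return np.concatenate([f(freqs * xi) for xi in x for f in (np.sin, np.cos)])

# GRAF-style conditioning by concatenation (per the Chan A quote): the
# radiance-field MLP input joins the encoded 3D point with the shape
# code; the appearance code is typically concatenated before the color head.
point = np.array([0.1, -0.2, 0.5])
mlp_input = np.concatenate([positional_encoding(point), z_s])
```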
Schwarz in view of Chan A fails to explicitly teach: based on the 3D model, sampling a camera view comprising a camera position and a camera angle corresponding to the 3D model of the scene; wherein the sampling of the camera view comprises: initially sampling, for a predetermined number of times, a fixed camera view that is based on a fixed camera pose corresponding to the 3D model: and for each training iteration after a lapse of the predetermined number of times, sampling the camera view using both the fixed camera view and a random camera view that is based on a camera pose randomly determined based on a specific camera view distribution corresponding to the 3D model. DeVries teaches: based on the 3D model (DeVries: To overcome the issue of sampling invalid locations we perform stochastic weighted sampling over a an empirical pose distribution pT […], where each pose is weighted by the occupancy (i.e., the σ value predicted by the model) at that location, Pg. 5, col. 1, par. 1), sampling a camera view (DeVries: camera poses T = [R|t] ∈ SE(3) need to be sampled from pose distribution pT, Pg. 4, Section 3.3. Sampling Camera Poses, par. 1) comprising a camera position and a camera angle (DeVries: the camera is constrained to move on a viewing sphere around the object and oriented towards the origin, Pg. 4, Section 3.3: Sampling Camera Poses, par. 1; see Note 1G) corresponding to the 3D model of the scene (see Note 1F); Note 1F: DeVries teaches that the camera pose is based on the object, as cited in Pg. 4, Section 3.3 above. Therefore, it is reasonable to conclude that the camera position and camera angle that make up the camera pose correspond to the 3D model of the object in the scene. Note 1G: The camera pose taught by DeVries in Pg. 4, 3.3. Sampling Camera Poses; par. 1 comprises a position (“the camera is constrained to move on a viewing sphere around the object”) and angle (“oriented towards the origin”). Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of DeVries with Schwarz in view of Chan A. Based on the 3D model, sampling a camera view comprising a camera position and a camera angle corresponding to the 3D model of the scene, as in DeVries, would benefit the Schwarz in view of Chan A teachings by enabling the neural network to properly capture objects not placed at the origin of a scene: “Therefore, camera poses […] need to be sampled from pose distribution pT in addition to the latent code z ∼ pz, […]. GRAF (Schwarz et al, as cited above) [49] and π-GAN [3] (Chan et al, as cited above) avoid this issue by training on datasets containing objects placed at the origin, where the camera is constrained to move on a viewing sphere around the object and oriented towards the origin.” (DeVries, Pg. 4, Section 3.3, Sampling Camera Poses, par. 1) Schwarz in view of Chan A and DeVries fails to explicitly teach: wherein the sampling of the camera view comprises: initially sampling, for a predetermined number of times, a fixed camera view that is based on a fixed camera pose corresponding to the 3D model: and for each training iteration after a lapse of the predetermined number of times, sampling the camera view using both the fixed camera view and a random camera view that is based on a camera pose randomly determined based on a specific camera view distribution corresponding to the 3D model. 
Watson teaches: wherein the sampling of the camera view comprises: initially sampling, for a predetermined number of times, a fixed camera view (Watson: We start with a set of conditioning views X ={x1,...,xk} of a static scene, where typically k = 1 or is very small, Pg. 4, Section 2.2: 3D Consistency via Stochastic Conditioning) that is based on a fixed camera pose corresponding to the 3D model (see Note 1H) and for each training iteration after a lapse of the predetermined number of times, sampling the camera view using a random camera view (Watson: We then generate a new frame by running a modified version of the standard denoising diffusion reverse process […] where, crucially, i ∼ Uniform({1,...,k}) is re-sampled at each denoising step. In other words, each individual denoising step is conditioned on a different random view from X (the set that contains the input view(s) and the previously generated samples) (Pg. 4, Section 2.2: 3D Consistency via Stochastic Conditioning) based on a specific camera view distribution corresponding to the 3D model (Watson: data distribution q(x1,x2) of pairs of views from a common scene at poses p1,p2 ∈ SE(3), Pg. 3, Section 2.1: Image-To-Image Diffusion Models with Pose Conditioning; see Note 1I). Note 1H: The static scene contains 3D models, as Watson teaches: “Given a complete description of a 3D scene S, for any pose p, the view x(p) at pose p is fully determined from S, i.e., views are conditionally independent given S.” (Pg. 3, Section 2, Pose-Conditional Diffusion Models). The fixed camera view is based on a fixed camera pose, as Figure 3 showcases under “conditioning set” that the conditioning views consist of a view rendered from a camera viewing a 3D model. It follows that said views are based on the camera pose corresponding to the 3D model. Note 1I: As best understood by the Examiner, the “data distribution q(x1,x2) of pairs of views from a common scene” is analogous to a camera view distribution. Furthermore, the views are “from a common scene” indicating they correspond to the 3D model. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Watson with Schwarz in view of DeVries and Chan A. Generating first renderings of the 3D model according to the camera views and subsequently generating second renderings of the 3D model according to randomized camera views, as in Watson, would benefit the Schwarz in view of DeVries and Chan A teachings by enhancing the 3D consistency of the rendered model: “traditional NeRFs (Mildenhall et al., 2020) can be 3D inconsistent as the model allows for view-dependent radiance.” (Watson, Pg. 21, Section 7.4: 3D Consistency Scoring) Schwarz in view of Chan A, DeVries, and Watson still fails to teach: for each training iteration after a lapse of the predetermined number of times, sampling the camera view using both the fixed camera view and a random camera view that is based on a camera pose randomly determined based on a specific camera view distribution corresponding to the 3D model. Chan B teaches: sampling the camera view using both the fixed camera view and a random camera view that is based on a camera pose randomly determined (Chan B: As described in Section 4.4 of the main paper, we regularize generator pose conditioning by randomly swapping the conditioning pose of the generator with another random pose with 50% probability, Pg. 13, Section 1.3: Regularizing generator pose conditioning, par. 
1) based on a specific camera view distribution corresponding to the 3D model (Chan B: input camera poses matrices are corrupted with 1, 2, 3, and 4 standard deviations of Gaussian noise, Pg. 13, col. 2, par. 1; see also Pg. 13, Fig. 4; see Note 1J). Note 1J: Chan B teaches that input camera poses may be randomly “corrupted” with Gaussian noise at various standard deviations. The Examiner understands such modifications to require a Gaussian distribution that is used to modify the specific camera view corresponding to the 3D model. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Chan B with Schwarz in view of Chan A, DeVries, and Watson. Generating sample images by alternating between sampling the 3D model from random camera views and sampling the 3D model from patterned camera views; and generating the image of the physical space using the 3D model based on the target camera view, the sample images, and the coordinate system, as in Chan B would benefit the Schwarz in view of Chan A, DeVries, and Watson teachings by preventing “degenerate solutions where the GAN produces 2D billboards angled towards the camera,” (Chan B, Pg. 6, par. 2). Regarding claim 2: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), wherein the sampling of the camera view comprises: sampling the camera view using a camera pose or a camera direction randomly determined based on the specific camera view distribution corresponding to the 3D model (DeVries: we perform stochastic weighted sampling over a an empirical pose distribution pT composed by a set of candidate poses, where each pose is weighted by the occupancy (i.e., the σ value predicted by the model) at that location, Pg. 5, par. 1; see Note 2A). Note 2A: DeVries teaches that the camera pose is determined via stochastic weighted sampling, which is analogous to random determination (with an additional weight component). The weight component is determined by the model, as cited above: “each pose is weighted by the occupancy (i.e., the σ value predicted by the model) at that location”, i.e., the camera view (pose) is sampled randomly based on a specific camera view distribution corresponding to the 3D model. Regarding claim 3: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 2 (as shown above), wherein the sampling of the camera view using the camera pose or the camera direction comprises: determining the camera pose by the specific camera view distribution at a center of the 3D model (Schwarz: We sample the camera pose ξ = [R|t] from a pose distribution pξ. […] we use a uniform distribution on the upper hemisphere for the camera location with the camera facing towards the origin of the coordinate system, Pg. 4, Section 3.2.1 Generator; par. 1; DeVries: GRAF [49] (referring to Schwarz) and π-GAN [3] avoid this issue by training on datasets containing objects placed at the origin, where the camera is constrained to move on a viewing sphere around the object and oriented towards the origin, Pg. 4, Section 3.3 Sampling Camera Poses, par. 1); or determining the camera direction by a random azimuth angle and by an altitude angle determined according to the specific camera view distribution with respect to a horizontal plane. 
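A minimal sketch of the two sampling styles at issue in claims 2 and 3 follows: DeVries-style stochastic weighted sampling over candidate poses (weighted by the model's predicted occupancy σ, per the quote in Note 2A), and a camera direction built from a random azimuth angle plus an altitude angle drawn from a chosen distribution. The weight normalization and the altitude distribution are assumptions, not details from the cited references.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pose_weighted(candidate_poses, occupancy_sigma):
    """DeVries-style stochastic weighted sampling: each candidate pose is
    weighted by the model's predicted occupancy (sigma) at its location.
    Normalizing sigma into a probability vector is an assumption here."""
    w = np.asarray(occupancy_sigma, dtype=float)
    idx = rng.choice(len(candidate_poses), p=w / w.sum())
    return candidate_poses[idx]

def sample_direction(altitude_dist=lambda: np.deg2rad(30)):
    """Claim-3-style direction: random azimuth, altitude drawn from a
    chosen distribution measured against the horizontal plane."""
    azimuth = rng.uniform(0.0, 2.0 * np.pi)   # random azimuth angle
    altitude = altitude_dist()                # e.g., Gaussian or fixed
    return np.array([np.cos(altitude) * np.cos(azimuth),
                     np.cos(altitude) * np.sin(azimuth),
                     np.sin(altitude)])       # unit view direction
```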
Regarding claim 4: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 2 (as shown above), wherein the sampling of the camera view using the camera pose or the camera direction comprises: determining the camera pose by the specific camera view distribution based on a position separated a predetermined distance from a center of a specific object included in the 3D model (Chan: At training time, we randomly sample camera poses ξ from a distribution pξ. […] In our experiments, we constrained camera positions to the surface of a unit sphere and directed the camera to point towards the origin, Pg. 4, Section 3.4 Training Details, par. 1; see Note 4A); or determining the camera direction by the specific camera view distribution in a direction toward the center of the specific object (Chan: At training time, we randomly sample camera poses ξ from a distribution pξ. […] we constrained camera positions to the surface of a unit sphere and directed the camera to point towards the origin, Pg. 4, Section 3.4 Training Details, par. 1, see Note 4B). Note 4A: Chan teaches that the distribution of camera positions is constrained to a “unit sphere”, thereby requiring the camera remains a predetermined distance from the center of an object included in the 3D model. Note 4B: DeVries teaches that the method of Chan orients the camera towards the object in the scene: “π-GAN [3] avoid this issue by training on datasets containing objects placed at the origin, where the camera is […] oriented towards the origin,” (Pg. 4, Section 3.3 Sampling Camera Poses, par. 1). Chan supports this claim with the quote: “we constrained camera positions to the surface of a unit sphere and directed the camera to point towards the origin” as cited above. Regarding claim 5: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 2 (as shown above), wherein the sampling of the camera view comprises: selecting the camera view based on determining whether the sampled camera view is inside an object included in the 3D model (DeVries: we perform stochastic weighted sampling over a an empirical pose distribution pT composed by a set of candidate poses, where each pose is weighted by the occupancy (i.e., the σ value predicted by the model) at that location, Pg. 5, par. 1; see Note 5A). Note 5A: DeVries teaches that the motivation for performing this sampling method is to avoid “the possibility of sampling invalid locations, such as inside walls or other solid objects that sporadically populate the scene,” (Pg. 5, par. 1) Regarding claim 6: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 2 (as shown above), wherein the specific camera view distribution comprises either a Gaussian distribution or a uniform distribution (Chan: At training time, we randomly sample camera poses ξ from a distribution pξ. The pose distributions for each dataset are known a priori and approximated as either Gaussian, for CelebA and Cats, or uniform, for CARLA, Pg. 4, Section 3.4 Training Details, par. 1). Regarding claim 8: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), wherein the sampling of the camera view using both the fixed camera view and the random camera view comprises either: alternately sampling the fixed camera view and the random camera view (Chan B: we randomly swap the conditioning pose in P with another random pose with 50% probability during training, Pg. 6, col. 2, par. 
1; see Note 8A); or sampling the camera view while gradually expanding a range of the camera view from the fixed camera view to the random camera view. Note 8A: Claim 8 lists limitations in the alternative, and so the Examiner has opted to map the limitations already taught by the prior art of record. Regarding claim 9: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), wherein the generating of the 2D image comprises: generating first patches including a portion of the 2D image corresponding to the 3D model according to the camera view (Schwarz: The generator Gθ takes camera matrix K, camera pose ξ, 2D sampling pattern ν and shape/appearance codes zs ∈ Rm/za ∈ Rn as input and predicts an image patch P’, Pg. 4, Section 3.2, Generative Radiance Fields, par. 2; see Note 9A), and wherein the training of the neural network model comprises: training a discriminator (see Note 9C) of the neural network model to discriminate between the generated 2D image and the real 2D image based on a degree of discrimination between the first patches and second patches comprising respective portions of the real 2D image (Schwarz: The discriminator Dφ compares the synthesized patch P’ to a patch P extracted from a real image, Pg. 4, Section 3.2: Generative Radiance Fields, par. 2; see Note 9B) Note 9A: Schwarz teaches that the generator generates a portion of a 2D image (a “image patch”) according to the camera view (“camera pose ξ”). The 2D image corresponds to a 3D model, as Schwarz teaches the generator also utilizes shape/appearance codes, which correspond to the object in the 2D image: “For Cars and Chairs the appearance code controls the color of the object while for Faces it encodes skin and hair color,” (Pg. 9, “Are Generative Radiance Fields able to disentangle shape from appearance?”) Note 9B: Patch P’ corresponds to the patch generated by the generator, as described in Note 9A. Because the discriminator samples the image patches: “We introduce a patch-based discriminator that samples the image,” (Pg. 2, par. 2), it is reasonable to conclude that the discriminator compares the first patch P and the second patch P’ to obtain a degree of discrimination. Furthermore, Schwarz teaches that Patch P comprises a portion of the real 2D image: “patch P extracted from a real image,” (Pg. 4, Section 3.2: Generative Radiance Fields, par. 2) Note 9C: Schwarz teaches that the discriminator is part of a trained neural network model: “we aim at learning a model for synthesizing novel scenes by training on unposed images. More specifically, we utilize an adversarial framework to train a generative model for radiance fields (GRAF). Fig. 2 shows an overview over our model.” Figure 2 showcases the discriminator on the right half of the figure. Regarding claim 10: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), further comprising: receiving a camera view corresponding to the real 2D image (Chan B: Our method expects a dataset in which each image is labeled with an approximate camera pose, in order to enable sampling camera poses from the dataset distribution, Pg. 13, Section 1.4: Robustness to imprecise camera poses, par. 
1), wherein the sampling of the camera view comprises: sampling the camera view using a perturbed camera view by randomly perturbing at least one of a camera position or a camera direction according to the camera view corresponding to the real 2D image (Chan B: We train four models with “imprecise” camera poses: (1 σ, 2 σ, 3 σ, 4 σ) where the input camera poses matrices are corrupted with 1, 2, 3, and 4 standard deviations of Gaussian noise, respectively, Pg. 13, col. 2, par. 1; see also Pg. 13, Figure 4; see also Note 1J). Regarding claim 14: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), wherein the scene of the 3D model includes at least one of a still image or a moving image (Chan: our inverse rendering results only reconstruct static images, the method could be extended to generate fake photos or videos of real people (DeepFakes), Pg. 8, Ethical considerations) Regarding claim 19: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: A non-transitory computer-readable storage medium (DeVries: In this setup, it is often useful to equip models with a memory mechanism to aggregate the set of incoming observations, Pg. 3, col. 1, par. 2; see Note 19A) storing instructions that, when executed by a processor (see Note 19B), cause the processor to perform the method of claim 1 (as shown above). Note 19A: A memory mechanism is analogous to a non-transitory computer readable medium. Note 19B: DeVries teaches that their method “would be a practical tool for tackling a wide range of problems in machine learning and computer vision”, i.e., the method may be performed with a computer. A computer inherently comprises a processor, and processors are known in the art to execute instructions from a memory. Claims 11, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Schwarz et al.: (NPL: GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis) in view of Chan et al.: (NPL: Periodic Implicit Generative Adversarial Networks for 3D-Aware Image Synthesis, hereinafter Chan A), DeVries et al.: (NPL: Unconstrained Scene Generation with Locally Conditioned Radiance Fields), Watson (NPL: NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS), and Chan et al.: (NPL: Efficient Geometry-aware 3D Generative Adversarial Networks, hereinafter Chan B), and Shechtman (US 20190251401 A1). Regarding claim 11: Schwarz in view of Chan A, DeVries, Watson, Chan B, and Shechtman teaches: The method of claim 10 (as shown above), wherein the sampling of the camera view comprises: Schwarz in view of Chan A, DeVries, Watson, and Shechtman fails to teach: initially sampling, the predetermined number of times, the fixed camera view that is based on the fixed camera pose corresponding to the 3D model; and for each training iteration after the lapse of the predetermined number of times, sampling the camera view using both the fixed camera view and the perturbed camera view corresponding to the real 2D image. Chan B teaches: initially sampling, the predetermined number of times (Chan B: We use the ‘cats’ split, which contains approximately 5000 images, for our experiments. As with FFHQ, we assume fixed camera intrinsics across the dataset, Pg. 23, AFHQv2; see Note 7A), the fixed camera view that is based on the fixed camera pose corresponding to the 3D model (Chan B: To prevent the scene from shifting with camera pose during inference, we condition the generator on a fixed camera pose when rendering from a moving camera trajectory, Pg. 6, col. 1, par. 
1); and for each training iteration after the lapse of the predetermined number of times (Chan B: the swapping probability is linearly decayed to 50% over the first 1M images, Pg. 13, Section 1.3. Regularizing generator pose conditioning, par. 1; See Note 7B), sampling the camera view using both the fixed camera view and the perturbed camera view (see Note 11A) that is based on a camera pose randomly determined based on a specific camera view distribution corresponding to the real 2D image (Chan B: For the remainder of training, we maintain 50% swapping probability, Pg. 13, Section 1.3. Regularizing generator pose conditioning, par. 1; see Note 7B). Note 11A: When combined with the teachings of Chan B, it would be obvious to one of ordinary skill in the art to use the perturbed camera view of Shechtman as the random camera view taught by Chan B. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Chan B with Schwarz in view of Chan, DeVries, and Shechtman. Initially sampling, a predetermined number of times, a fixed camera view that is based on a fixed camera pose corresponding to the 3D model; and for each training iteration after a lapse of a predetermined number of times, sampling the camera view using both the fixed camera view and the perturbed camera view corresponding to the real 2D image, as in Chan B, would benefit the Schwarz in view of Chan, DeVries, and Shechtman teachings by preventing “degenerate solutions where the GAN produces 2D billboards angled towards the camera” (Chan B, Pg. 6, col. 1, par. 2). Regarding claim 12: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), wherein the training of the neural network model comprises: Schwarz in view of Chan A, DeVries, Watson, and Chan B fails to explicitly teach: calculating a first loss based on a degree of discrimination between the generated 2D image and the real 2D image; calculating a second loss based on a degree of similarity between the camera view corresponding to the real 2D image and the perturbed camera view; and training the neural network model to generate a scene of the 3D model corresponding to the perturbed camera view based on the first loss and/or the second loss. 
Shechtman teaches: calculating a first loss based on a degree of discrimination between the generated 2D image and the real 2D image (Shechtman: In some embodiments, adversarial learning includes employing a minimax function (e.g., a minimax objective function) that […] maximizes a second type of loss […] the image composite system employs adversarial learning to […] maximize discrimination of an adversarial discrimination neural network against non-realistic images generated by the geometric prediction neural network [0043]; see Note 12A); calculating a second loss based on a degree of similarity between the camera view corresponding to the real 2D image and the perturbed camera view (Shechtman: by comparing the generated warp parameters for the perturbed foreground object to the ground truth parameters for the same object given the same background scene, the image classification loss model 408 can identify an amount of error loss for the generated warp parameters [0100]; see Note 12B); and training the neural network model to generate a scene of the 3D model corresponding to the perturbed camera view based on the first loss and/or the second loss (Shechtman: The process of providing error loss via back propagation to the supervised geometric prediction neural network 400 can continue until the supervised geometric prediction neural network 400 converges [0100]). Note 12A: Shechtman teaches that the discriminator is trained to classify images based on whether they are “real” or “fake” (generated): “the adversarial discrimination neural network learns, based on real images, to determine whether an input image is a real image or a fake image” [0028]. Note 12B: The real 2D image is considered to be analogous to the ground truth parameters. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Shechtman with Schwarz in view of Chan and DeVries. Calculating first and second loss based on the similarity between the camera view and images, as in Shechtman, would benefit the Schwarz in view of Chan and DeVries teachings by ensuring that minor differences in viewpoint are learned, increasing consistency within the radiance field. Regarding claim 13: Schwarz in view of Chan A, DeVries, Watson, and Chan B teaches: The method of claim 1 (as shown above), wherein the training of the neural network model comprises: Schwarz in view of Chan A, DeVries, Watson, and Chan B fails to teach: training a generator of the neural network model to generate a scene corresponding to the sampled camera view, using a third loss that is based on a degree of similarity between the generated 2D image and the real 2D image; and training a discriminator of the neural network model to discriminate between the generated 2D image and the real 2D image, using a first loss that is based on a degree of discrimination between the generated 2D image and the real 2D image. 
Shechtman teaches: training a generator of the neural network model to generate a scene corresponding to the sampled camera view, using a third loss that is based on a degree of similarity between the generated 2D image and the real 2D image (Shechtman: by comparing the generated warp parameters for the perturbed foreground object to the ground truth parameters for the same object given the same background scene, the image classification loss model 408 can identify an amount of error loss for the generated warp parameters [0100]; see Note 12B); and training a discriminator of the neural network model to discriminate between the generated 2D image and the real 2D image, using a first loss that is based on a degree of discrimination between the generated 2D image and the real 2D image (Shechtman: In some embodiments, adversarial learning includes employing a minimax function (e.g., a minimax objective function) that […] maximizes a second type of loss […] the image composite system employs adversarial learning to […] maximize discrimination of an adversarial discrimination neural network against non-realistic images generated by the geometric prediction neural network [0043]; see Note 12A). Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Shechtman with Schwarz in view of Chan and DeVries. Training a generator of the neural network model to generate a scene corresponding to the sampled camera view, using a third loss that is based on a degree of similarity between the generated 2D image and the real 2D image; and training a discriminator of the neural network model to discriminate between the generated 2D image and the real 2D image, using a first loss that is based on a degree of discrimination between the generated 2D image and the real 2D image, as in Shechtman, would benefit the Schwarz in view of Chan and DeVries teachings by ensuring that the discriminator properly separates real and fake images while also ensuring the generator produces high quality data. In some embodiments, adversarial learning includes employing a minimax function (e.g., a minimax objective function) that both minimizes a first type of loss and maximizes a second type of loss. For example, the image composite system employs adversarial learning to minimize loss for generating warp parameters by a geometric prediction neural network and maximize discrimination of an adversarial discrimination neural network against non-realistic images generated by the geometric prediction neural network. Claims 15, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over DeVries et al.: (NPL: Unconstrained Scene Generation with Locally Conditioned Radiance Fields) in view of Chan (NPL: Efficient Geometry-aware 3D Generative Adversarial Networks; hereinafter Chan B) and Qin (NPL: Learning-by-Novel-View-Synthesis for Full-Face Appearance-Based 3D Gaze Estimation). Regarding claim 15: DeVries teaches: A method of generating an image of a three-dimensional (3D) model (DeVries, Pg. 3, Section 3: Method, par. 1), the method comprising: receiving images of a 3D scene respectively corresponding to camera views in a physical space (DeVries: We evaluate the generative performance of our model on three datasets: […] the Active Vision Dataset (AVD) [1] consisting of 20k images with noisy depth measurements from 9 real world scenes, Pg. 5, Section 4.1: Generation Performance, par. 
1; see Note 15A); generating a 3D model of the physical space based on the images of the 3D scene (DeVries: Generative performance of state-of-the-art approaches for generative modelling of radiance fields on 3 scene-level datasets, Pg. 6, Table 2; see Note 15B); obtaining a coordinate system for the 3D model (DeVries: To decompose the scene into a grid of independent radiance fields, we enable our model to perform spatial sharing of local latent codes, Pg. 4, Section 3.2 Locally Conditioned Radiance Field; see Note 15C); receiving a target camera view of an image to be generated of the physical space (DeVries: We […] render observations using the camera poses of the target views T, Pg. 8, Section 4.3: View Synthesis, par. 1); and generating the image (DeVries: render observations, Pg. 8, Section 4.3: View Synthesis, par. 1) of the physical space using the 3D model (see Note 15B) based on the target camera view (DeVries: render observations using the camera poses of the target views T, Pg. 8, Section 4.3: View Synthesis, par. 1) and the coordinate system (DeVries: use the resulting latent to locally condition the radiance field, Pg. 8, Section 4.3: View Synthesis, par. 1; see Note 15C). Note 15A: DeVries teaches that the model is generated based on datasets containing images of real world scenes (as cited above). Real world scenes are analogous to physical spaces. Furthermore, DeVries teaches: “an interactive agent explores a scene collecting RGB and depth observations as well as camera poses,” (Pg. 5, Section 4.1: Generation Performance, par. 1), indicating that camera poses corresponding to the RGB and depth observations may be collected, thus indicating that camera views corresponding to the real world scene (“3D scene”) may be received. Note 15B: DeVries teaches that radiance fields will be generated based on the three datasets cited above in Pg. 5, Section 4.1: Generation Performance, par. 1, namely the VizDoom, Replica, and AVD datasets. The radiance field taught by DeVries is analogous to a 3D model, as DeVries teaches: “we model scenes with a locally conditioned radiance field” (Pg. 4, Section 3.2: Locally Conditioned Radiance Field), (i.e., the radiance field is a model) and “Radiance fields are usually defined over R3 considering a global coordinate system [34, 49, 3] (i.e. a coordinate system that spans the whole scene/object),” (Pg. 4, col. 2, par. 1) (i.e., the radiance field is parameterized in 3D space). Note 15C: DeVries teaches: “it is also necessary to decompose the global coordinate system into multiple local coordinate systems (one for each local latent wij),” (Pg. 4, col. 2, par. 1) and “each code is used to locally condition a radiance field network,” (Pg. 4, col. 1, par. 2), i.e., the generator maps latent codes for a global coordinate system to latent codes for a local coordinate system (see also Figure 2) and uses these to generate radiance fields. As shown in Note 15B, the radiance fields are analogous to a 3D model.
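The "latent floorplan" arrangement quoted in Note 15C (a 2D grid of local latent codes, each conditioning a radiance field in its own local coordinate system) can be sketched roughly as below. The grid size, extent, and lookup scheme are assumptions rather than details from DeVries.

```python
import numpy as np

class LatentFloorplan:
    """Sketch of a DeVries-style 2D grid of local latent codes w_ij, each
    with its own local coordinate system (grid/extent are assumptions)."""
    def __init__(self, grid=(8, 8), dim=64, extent=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.codes = rng.standard_normal((*grid, dim))
        self.cell = extent / np.array(grid)   # cell size in world units
        self.origin = -extent / 2.0

    def condition(self, xyz):
        """Map a global 3D point to (local coordinates, local latent)."""
        i, j = ((xyz[:2] - self.origin) // self.cell).astype(int)
        i = np.clip(i, 0, self.codes.shape[0] - 1)
        j = np.clip(j, 0, self.codes.shape[1] - 1)
        cell_min = self.origin + np.array([i, j]) * self.cell
        local_xy = (xyz[:2] - cell_min) / self.cell   # in [0, 1)^2
        return np.array([*local_xy, xyz[2]]), self.codes[i, j]
```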
DeVries fails to teach: generating sample images by alternating between sampling the 3D model from random camera views and sampling the 3D model from patterned camera views; generating the image of the physical space using the 3D model based on the target camera view, the sample images, and the coordinate system, wherein the obtaining the coordinate system comprises: generating a 3D model corresponding to an initial camera view in the physical space; and correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view. Chan B teaches: generating sample images (Chan B: 32-channel feature images, Pg. 4, Section 4: 3D GAN framework, par. 2) by alternating between sampling the 3D model from random camera views and sampling the 3D model (Chan B; 32-channel tri-planes, Pg. 4, Section 4: 3D GAN framework, par. 2; see Note 15D) from patterned camera views (Chan B: As described in Section 4.4 of the main paper, we regularize generator pose conditioning by randomly swapping the conditioning pose of the generator with another random pose with 50% probability, Pg. 13, Section 1.3: Regularizing generator pose conditioning, par. 1; see also Note 15E); generating the image of the physical space (Chan B: refine the 32-channel feature image IF into the final RGB image I+RGB, Pg. 5, Section 4.2, Super resolution, par. 2) using the 3D model (Chan B: 32-channel tri-planes, Pg. 4, Section 4: 3D GAN framework, par. 2) based on the target camera view (Chan B: the GAN setting our neural renderer aggregates features from each of the 32-channel tri-planes and predicts 32-channel feature images from a given camera pose, Pg. 4, Section 4: 3D GAN framework, par. 2), the sample images (Chan B: we render 32-channel feature images, Pg. 5, par. 3), and the coordinate system (Chan B: This hybrid representation can be queried for continuous coordinates and outputs a scalar density σ as well as a 32-channel feature, both of which are then processed by a neural volume renderer to project the 3D feature volume into a 2D feature image, Pg. 5, par. 2; see Note 15F). Note 15D: The Examiner interprets Chan B’s tri-plane representation to be analogous to a 3D model, because Chan B teaches: “Training a high-resolution GAN requires a 3D representation that is both efficient and expressive. In this section, we introduce a new hybrid explicit–implicit tri-plane representation that offers both of these advantages,” (Pg. 3, Section 3: Tri-plane hybrid 3D representation). That is, the tri-plane taught by Chan B is designed to represent three-dimensional data, similar to a 3D model. Note 15E: The specification of the present application does not explicitly mention “patterned camera views” and only appears to describe “fixed-random alternation” (e.g., paragraph [0108] of the present specification). Accordingly, the Examiner reads this limitation with respect to any camera view that relates to any sort of pattern. Given that the other claims teach alternating between random and fixed camera views, the Examiner interprets fixed camera views to be patterned camera views. Note 15F: Chan B teaches that their tri-plane representation (the hybrid representation referred to in Pg. 5, par. 2 recited above) may be queried via “continuous coordinates”. The use of coordinates inherently requires a coordinate system. Therefore, when the image is generated using the coordinates, the output image must be generated “based on” a coordinate system. 
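For illustration, querying a Chan B-style tri-plane at a continuous coordinate (project the point onto three axis-aligned feature planes, sample and aggregate the features, then decode a density σ and a 32-channel feature, as described in Notes 15D and 15F) might look like the sketch below. The nearest-neighbor lookup and the toy decoder stand in for the bilinear sampling and learned decoder a real implementation would use.

```python
import numpy as np

rng = np.random.default_rng(0)
RES, C = 256, 32                                # plane resolution, channels
planes = rng.standard_normal((3, RES, RES, C))  # XY, XZ, YZ feature planes

def query_triplane(xyz):
    """Query the hybrid tri-plane at a continuous coordinate in [-1, 1]^3
    and return (sigma, 32-channel feature). Simplified for illustration."""
    pairs = [(0, 1), (0, 2), (1, 2)]            # coordinate pair per plane
    feat = np.zeros(C)
    for plane, (a, b) in zip(planes, pairs):
        # Map world coords to pixel indices (nearest neighbor for brevity).
        u = int((xyz[a] * 0.5 + 0.5) * (RES - 1))
        v = int((xyz[b] * 0.5 + 0.5) * (RES - 1))
        feat += plane[v, u]                     # aggregate by summation
    sigma = np.logaddexp(0.0, feat.mean())      # softplus density, a stub
    return sigma, feat

sigma, feature = query_triplane(np.array([0.1, -0.3, 0.2]))
```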
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Chan B with DeVries. Generating sample images by alternating between sampling the 3D model from random camera views and sampling the 3D model from patterned camera views; and generating the image of the physical space using the 3D model based on the target camera view, the sample images, and the coordinate system, as in Chan B would benefit the DeVries teachings by preventing “degenerate solutions where the GAN produces 2D billboards angled towards the camera,” (Chan B, Pg. 6, par. 2). DeVries in view of Chan B still fails to teach: wherein the obtaining the coordinate system comprises: generating a 3D model corresponding to an initial camera view in the physical space; and correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view. Qin teaches: A method of generating an image of a three-dimensional (3D) model (Qin: We propose a learning-by-synthesis appearance-based gaze estimation approach based on single-image 3D face reconstruction, (Pg. 1, Figure 1); see Note 15G), the method comprising: receiving images of a 3D scene respectively corresponding to camera views in a physical space (Qin: the input image, Pg. 3, par. 1; see Note 15H); generating a 3D model of the face based on the images of the 3D scene (Qin: this work utilizes 3D face reconstruction methods that sample texture directly from the input image, Pg. 3, par. 1); obtaining a coordinate system for the 3D model (Qin: we obtain the 3D face mesh Vc in the original camera coordinate system, Pg. 4, Section 3.3: Training Data Synthesis); receiving a target camera view of an image to be generated of the physical space (Qin: new camera coordinate system which is defined with […] gaze target position g, Pg. 4, Section 3.3: Training Data Synthesis); and generating the image of the physical space using the 3D model based on the target camera view, and the coordinate system (Qin: render a face image with a target head pose* Rt,tt in a new camera coordinate system given the source head pose Rs,ts, Pg. 4, col. 2, par. 1) wherein the obtaining the coordinate system comprises: generating a 3D model corresponding to an initial camera view in the physical space (Qin: we obtain the 3D face mesh Vc in the original camera coordinate system, Pg. 4, Section 3.3: Training Data Synthesis; see Note 15I); and correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view (Qin: we convert the mesh via the proposed projective matching to align with the ground-truth gaze position in the input camera coordinate system, Pg. 3, Fig. 1; see Note 15J). Note 15G: Qin teaches multiple images may be generated from a 3D mesh, as shown in Figure 1. Note 15H: Qin teaches that there may be multiple input images: “We assume that the source gaze dataset consists of 1) face images” (Pg. 3, Section 3.1: Overview). Note 15I: Qin generates a 3D face mesh via 3D reconstruction: “Our approach utilizes 3D face reconstruction to synthesize training datasets with novel head poses”, (Pg. 8, Section 6: Conclusion). Note 15J: Qin teaches a projective matching process that aligns the coordinate system based on the input coordinate system, as showcased in Figure 1 on Pg. 3. 
The diagram showcases that the projective matching is based on a 3D model corresponding to the initial view, with said 3D model being based off of the images of the face. Accordingly, the Examiner understands Qin to teach correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Qin with DeVries in view of Chan B. Correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view, as in Qin, would benefit the DeVries in view of Chan B teachings by improving the accuracy of the perspective projection: “since many prior works rely on orthogonal or weak perspective projection models, we discuss how to precisely align the reconstruction results with the source camera coordinate system.” (Qin, Pg. 3, par. 1) Regarding claim 17: DeVries in view of Chan B and Qin teaches: The method of claim 15 (as shown above), wherein the correcting of the coordinate system comprises: correcting the coordinate system based on at least one criterion among: a coordinate system input by a user (Qin: we convert the mesh via the proposed projective matching to align with the ground-truth gaze position in the input camera coordinate system, Pg. 3, Fig. 2), a specific image among the images of the 3D scene (Qin showcases in Figure 2 that an “Original Image” is utilized as input. See also Note 15F), a floorplan portion of the physical space (DeVries: the 2D grid of local latent codes can be interpreted as a latent floorplan representation of a scene, where each code is used to locally condition a radiance field network, Pg. 4, col. 1, par. 2), a specific object included in the physical space (Qin showcases in Figure 2 that Projective Matching is performed on a face mesh obtained from a photo taken in a physical space), and bilateral symmetry of the physical space (DeVries: In Fig. 17-18 we manipulate the local latent codes by mirroring them along the horizontal axis to produce unique scenes, Pg. 16, par. 2, see Note 17D). Regarding claim 18: DeVries in view of Chan B and Qin teaches: The method of claim 17 (as shown above), wherein the correcting of the coordinate system further comprises: defining a camera transform matrix (Qin: Transform Matrix T, Pg. 3, Fig. 2; see also Pg. 3, Section 3.2: Projective Matching; Equation 2) that sets the coordinate system according to the at least one criterion (see Note 18A). Note 18A: Qin teaches a transform matrix T defined based on variables such as cx, cy, sx, sy, w, and h. The variables are defined in Section 3.1: Overview, which teaches: “3D face reconstruction methods usually take a cropped face patch as input and output a 3D facial mesh […] we assume that the face reconstruction method takes a face bounding box defined with center (cx,cy), width wb, and height hb in pixels and then resized to a fixed input size by factor (sx, sy)”. As best understood by the Examiner, the transform matrix is defined based on dimensions of the specific face image, and therefore the Examiner interprets Qin to teach setting/correcting the coordinate system based on a specific object included in the physical space (the face), a specific image among the images of the 3D scene (the cropped face patch). Claim 20 is rejected under 35 U.S.C. 
103 as being unpatentable over DeVries et al.: (NPL: Unconstrained Scene Generation with Locally Conditioned Radiance Fields) in view of Watson (NPL: NOVEL VIEW SYNTHESIS WITH DIFFUSION MODELS) and Qin (NPL: Learning-by-Novel-View-Synthesis for Full-Face Appearance-Based 3D Gaze Estimation). DeVries teaches: A device (see Note 19B) comprising: memory storing images (see Note 19B) of a three-dimensional (3D) scene respectively corresponding to camera views in a physical space (DeVries: We evaluate the generative performance of our model on three datasets: […] iii) the Active Vision Dataset (AVD) [1] consisting of 20k images with noisy depth measurements from 9 real world scenes, Pg. 5, Section 4.1: Generation Performance, par. 1; see Note 15A) and storing a target camera view for an image of the physical space to be generated (see Note 20A); and a processor (see Note 19B) configured to generate a 3D model of the physical space based on the images of the 3D scene (DeVries: Generative performance of state-of-the-art approaches for generative modelling of radiance fields on 3 scene-level datasets, Pg. 6, Table 2; see Note 15B), perform a process of obtaining a coordinate system for the 3D model (DeVries: To decompose the scene into a grid of independent radiance fields, we enable our model to perform spatial sharing of local latent codes, Pg. 4, Section 3.2 Locally Conditioned Radiance Field; see Note 15C), and generate an image (DeVries: render observations, Pg. 8, Section 4.3: View Synthesis, par. 1) corresponding to the target camera view using the 3D model (see Note 15B) based on the target camera view (DeVries: render observations using the camera poses of the target views T, Pg. 8, Section 4.3: View Synthesis, par. 1) and the coordinate system (DeVries: use the resulting latent to locally condition the radiance field, Pg. 8, Section 4.3: View Synthesis, par. 1; see Note 15C above). Note 20A: DeVries teaches that images of the physical space may be generated from a target view: “render observations using the camera poses of the target views T,” (Pg. 8, Section 4.3: View Synthesis, par. 1). As the method taught by DeVries is implemented by computer (see Note 19B), the target views must be stored on the computer. DeVries fails to teach: generate first renderings of the 3D model according to the camera views, generate second renderings of the 3D model according to randomized camera views, and wherein the processor is further configured to generate a 3D model corresponding to an initial camera view in the physical space, and to correct the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view. Watson teaches: generate first renderings of the 3D model according to the camera views (Watson: We start with a set of conditioning views X ={x1,...,xk} of a static scene, where typically k = 1 or is very small, Pg. 4, Section 2.2: 3D Consistency via Stochastic Conditioning, par. 2), generate second renderings of the 3D model according to randomized camera views (Watson: each individual denoising step is conditioned on a different random view from X (the set that contains the input view(s) and the previously generated samples, Pg. 4, Section 2.2: 3D Consistency via Stochastic Conditioning, par. 2), and Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Watson with DeVries. 
Generating first renderings of the 3D model according to the camera views and generating second renderings of the 3D model according to randomized camera views, as in Watson, would benefit the DeVries teachings by enhancing the 3D consistency of the rendered model: “traditional NeRFs (Mildenhall et al., 2020) can be 3D inconsistent as the model allows for view-dependent radiance.” (Watson, Pg. 21, Section 7.4: 3D Consistency Scoring) Qin teaches: generate a 3D model of the face based on the images of the 3D scene (Qin: this work utilizes 3D face reconstruction methods that sample texture directly from the input image, Pg. 3, par. 1); perform a process of obtaining a coordinate system for the 3D model (Qin: we obtain the 3D face mesh Vc in the original camera coordinate system, Pg. 4, Section 3.3: Training Data Synthesis); generate the image of the physical space using the 3D model based on the target camera view, and the coordinate system (Qin: render a face image with a target head pose* Rt,tt in a new camera coordinate system given the source head pose Rs,ts, Pg. 4, col. 2, par. 1) wherein the obtaining the coordinate system comprises: generating a 3D model corresponding to an initial camera view in the physical space (Qin: we obtain the 3D face mesh Vc in the original camera coordinate system, Pg. 4, Section 3.3: Training Data Synthesis; see Note 15I); and correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view (Qin: we convert the mesh via the proposed projective matching to align with the ground-truth gaze position in the input camera coordinate system, Pg. 3, Fig. 1; see Note 15J). Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Qin with DeVries in view of Watson. Correcting the coordinate system using the initial camera view, the images of the 3D scene, and the 3D model corresponding to the initial camera view, as in Qin, would benefit the DeVries in view of Watson teachings by improving the accuracy of the perspective projection: “since many prior works rely on orthogonal or weak perspective projection models, we discuss how to precisely align the reconstruction results with the source camera coordinate system.” (Qin, Pg. 3, par. 1)

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT ALEXANDER PROVIDENCE whose telephone number is (571)270-5765. The examiner can normally be reached Monday-Thursday 8:30-5:00.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon, can be reached at (571)270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/VINCENT ALEXANDER PROVIDENCE/
Examiner, Art Unit 2617

/KING Y POON/
Supervisory Patent Examiner, Art Unit 2617

Prosecution Timeline

Jul 24, 2023
Application Filed
May 29, 2025
Non-Final Rejection — §103
Sep 04, 2025
Response Filed
Oct 20, 2025
Final Rejection — §103
Nov 25, 2025
Applicant Interview (Telephonic)
Nov 25, 2025
Examiner Interview Summary
Dec 15, 2025
Response after Non-Final Action
Jan 13, 2026
Request for Continued Examination
Jan 27, 2026
Response after Non-Final Action
Mar 20, 2026
Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586303: GEOMETRY-AWARE THREE-DIMENSIONAL SYNTHESIS IN ALL ANGLES (2y 5m to grant; granted Mar 24, 2026)
Patent 12530847: IMAGE GENERATION FROM TEXT AND 3D OBJECT (2y 5m to grant; granted Jan 20, 2026)
Patent 12530808: Predictive Encoding/Decoding Method and Apparatus for Azimuth Information of Point Cloud (2y 5m to grant; granted Jan 20, 2026)
Patent 12524946: METHOD FOR GENERATING FIREWORK VISUAL EFFECT, ELECTRONIC DEVICE, AND STORAGE MEDIUM (2y 5m to grant; granted Jan 13, 2026)
Patent 12380621: COMPUTER-IMPLEMENTED SYSTEMS AND METHODS FOR GENERATING ENHANCED MOTION DATA AND RENDERING OBJECTS (2y 5m to grant; granted Aug 05, 2025)
Based on this examiner's 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+25.0%)
Median Time to Grant: 2y 5m
PTA Risk: High
Based on 18 resolved cases by this examiner. Grant probability derived from career allow rate.
