DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
This is in response to applicant’s amendment/response filed on 01/30/2026, which has been entered and made of record. Claims 1, 3, 5, 7-9, and 19-20 have been amended. Claims 1-20 are pending in the application.
Response to Arguments
Applicant's arguments filed on 01/30/2026 have been fully considered but they are rendered moot in view of the new grounds of rejection presented below (as necessitated by the amendment to claims 1 and 19-20).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-3, 7-9, 12, and 17-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2020/0005476 to Nishiyama in view of U.S. PGPubs 2024/0371081 to Mathews et al., further in view of U.S. PGPubs 2023/0202030 to Masumura et al.
Regarding claim 1, Nishiyama teaches an image processing apparatus comprising: one or more memories storing instructions; and one or more processors executing the instructions to perform (Fig 2, par 0021-0024):
obtaining a plurality of captured images obtained by a plurality of imaging devices (Fig 1, par 0030-0031, “the image acquisition unit 301 acquires a silhouette image group of an object corresponding to a plurality of different image capturing positions”);
obtaining three-dimensional shape data representing a three-dimensional shape of each object of a plurality of objects based on the obtained plurality of captured images (par 0020, “an object normally refers to an object (moving object) that is moving (whose absolute position may change) in a case where image capturing is performed from the same direction in a time series, for example, such as the player 102 or a ball (not shown schematically) in a game in which a ball is used. However, in the present embodiment, it is possible to adopt an arbitrary object specification method and it is also possible to handle a still object, such as a background, as an object”, par 0034-0038, “the position acquisition unit 303 derives three-dimensional coordinates of a point or a voxel representative of the object as information indicating the approximate position of the object. As a point representative of the object, it is possible to use the position of the center of gravity of the object or a part of vertexes of a bounding box including the object. As a specific method of deriving the approximate position of the object, mention is made of, for example, the shape-from-silhouette using voxels whose resolution is low. Further, it is also possible to perform distance estimation in which object recognition is performed and the stereo matching method is used for a part of the recognized object …. the shape generation unit 306 generates shape data by the same method as the shape-from-silhouette based on the condition determined at S406 by using the silhouette image group”, par 0046-0047, “in a case where the projection of the voxel V is included inside the silhouette in all the silhouette images S1 to S4, the voxel V is left as a voxel configuring the object OB. By performing this series of processing for all the voxels within the bounding box, a visual hull (abbreviated to VH), which is a set of linked convex voxels, is generated”).
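By way of technical illustration only, the shape-from-silhouette (visual hull) processing quoted above from Nishiyama can be sketched as follows; this is a minimal, hypothetical sketch in which the voxel grid, silhouette masks, and projection matrices are assumed inputs and does not reproduce Nishiyama's actual implementation:

```python
import numpy as np

def carve_visual_hull(voxel_centers, silhouettes, projections):
    """Keep only voxels whose projection falls inside the silhouette in every view.

    voxel_centers: (N, 3) array of candidate voxel centers inside a bounding box.
    silhouettes:   list of (H, W) binary masks, one per camera.
    projections:   list of (3, 4) camera projection matrices, one per camera.
    Returns a boolean mask over the N voxels (the visual hull).
    """
    keep = np.ones(len(voxel_centers), dtype=bool)
    homogeneous = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    for mask, P in zip(silhouettes, projections):
        uvw = homogeneous @ P.T                      # project voxel centers into this view
        uv = uvw[:, :2] / uvw[:, 2:3]                # perspective divide -> pixel coordinates
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        inside_image = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
        inside_silhouette = np.zeros(len(voxel_centers), dtype=bool)
        inside_silhouette[inside_image] = mask[v[inside_image], u[inside_image]] > 0
        keep &= inside_silhouette                    # a voxel survives only if inside all silhouettes
    return keep
```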
However, Nishiyama is silent regarding setting a learning area for each object of the plurality of objects based on the obtained three-dimensional shape data; and learning a three-dimensional field in accordance with the captured image for obtaining virtual viewpoint data or a three-dimensional shape of an object corresponding to a virtual viewpoint from the plurality of captured images by taking the learning area set for each object of the plurality of objects as a target.
In related endeavor, Mattews et al. teach setting a learning area for each object of the plurality of object based on the shape data (par 0051, “ the systems and methods can include processing the image with a segmentation model to generate one or more segmentation outputs. The foreground may be the object of interest for the image segmentation model. The segmentation output can include one or more segmentation masks. In some implementations, the segmentation output can be descriptive of the foreground object being rendered”, par 0132, “FIG. 7 further depicts a segmentation mask 904 & 908 for each input image. The segmentation masks 904 & 908 can be generated by an image segmentation model that processes the input images. The segmentation masks 904 & 908 can be associated with a foreground object in the input images. In some implementations, the segmentation masks 904 & 908 can isolate the object from the rest of the input image in order to evaluate the object rendering of a generated view rendering”); and learning a three-dimensional field in accordance with the captured image for obtaining virtual viewpoint data or a three-dimensional shape of an object corresponding to a virtual viewpoint from the plurality of captured images by taking the learning area set for each object of the plurality of objects as a target (par 0047, “the reconstruction output can include a three-dimensional reconstruction based on a learned volumetric representation. The reconstruction output can include a volume rendering generated based at least in part on the respective set of one or more camera parameters for the image. Alternatively and/or additionally, the reconstruction output can include a view rendering. The generative neural radiance field model can include a foreground model (e.g., a foreground neural radiance field model) and a background model (e.g., a background neural radiance field model)”, par 0112-0123, “the machine-learned model 200 is trained to receive a set of input data 202 descriptive of one or more training images and, as a result of receipt of the input data 202, provide output data 216 that can be descriptive of predicted density values and predicted color values. Thus, in some implementations, the machine-learned model 200 can include a generative neural radiance field model, which can include a foreground model 210 and a background model 212 that are operable to generate predicted color values and predicted density values based at least in part on a latent table 204 …the volume rendering 216 and/or the view rendering may be generated based at least in part on one or more camera parameters determined using the landmark estimator model 206 and the fitting model 208 …the generative neural radiance field model 200 can include a foreground model 210 (e.g., a foreground neural radiance field model) and a background model 212 (e.g., a background neural radiance field model). In some implementations, the training data 202 can be processed by a landmark estimator model 206 to determine one or more landmark points. In particular, the training data 202 can include one or more images including an object. The one or more landmark points can be descriptive of characterizing features for the object. The one or more landmark points can be processed by a camera fitting block 208 to determine the camera parameters of the one or more images of the training data 202”)
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama to include setting a learning area for each object based on the generated rough shape data; and learning a three-dimensional field in accordance with the captured image for obtaining virtual viewpoint data or a three-dimensional shape of an object corresponding to a virtual viewpoint from the plurality of captured images by taking the learning area set for each object as a target, as taught by Mattews et al., in order to accurately produce the corresponding three-dimensional structure such that it can be rendered from different views based on features learned from image datasets depicting different faces using a neural radiance field model.
However, Nishiyama as modified by Mattews et al. is silent regarding setting a learning area for each object of the plurality of objects based on the obtained three-dimensional shape data.
In related endeavor, Masumura et al. teach setting a learning area for each object of the plurality of object based on the obtained three-dimensional shape data and learning a three-dimensional field in accordance with the captured image for obtaining virtual viewpoint data or a three-dimensional shape of an object corresponding to a virtual viewpoint from the plurality of captured images by taking the learning area set for each object of the plurality of objects as a target (Fig 4, par 0036-0039, “(b) the object image is optionally subjected to predetermined correction processing, for example, resolution, brightness, and contrast processing, and input to the Mask R-CNN model M. As a result, as illustrated in part (c), an existence region E of each of the plurality of objects and a label L are obtained…..The Mask R-CNN model M has been trained in advance so that when the object image, which is an image obtained by photographing objects stacked in bulk from the work direction, is received, the Mask R-CNN model M recognizes each object, and in the image, shows the pixels occupied by the recognized object, that is, the existence region E, as a segment and at the same time, output a label L indicating a coverage state of the recognized object by other objects, that is, whether or not the recognized object is covered by another object”, Fig 5, par 0046-0049, “FIG. 5 is a diagram for illustrating processing executed by the machine learning device 1 when learning data is automatically generated and learned. First, as illustrated in part (a) of FIG. 5, the virtual object arranger 101 arranges a plurality of virtual objects in a virtual three-dimensional space….. object information on a plurality of objects is obtained. This object information is information including the position, orientation, and shape of each object arranged in the virtual three-dimensional space …. the virtual object image generator 102 generates a virtual object image, which is an image of the plurality of virtual objects viewed from an imaging direction D′, as illustrated in part (b). The imaging direction D′ illustrated in part (a) of FIG. 5 is a direction defined in the three-dimensional space so as to correspond to the imaging direction of the object imaging unit 202 in the real work system 2, which is indicated by the work direction D of FIG. 3. In this way, the virtual object image generator 102 generates, based on the object information, the virtual object image as if real objects were photographed”, par 0082-0086, “following part (b) or in parallel therewith, as illustrated in part (c), the partial mask image generator 103a generates a partial mask image viewed from the imaging direction D′ based on designated region information on the plurality of virtual objects in the virtual three-dimensional space. That is, as exemplified in FIG. 11, a designated region 402 is set in advance in a virtual object 400. The designated region 402 is set in a portion of the surface of the virtual object 400, and the size, position, and shape of the designated region 402 may be freely set. ….the machine learning device la uses the sets of the virtual object image, the partial mask images, and the classes L obtained in this way as teacher data in the learning unit 105a to train a Mask R-CNN model Ma. The Mask R-CNN model Ma has the same architecture as that of the Mask R-CNN model M, but the teacher data used for learning is different, and thus the model is particularly referred to here as “Mask R-CNN model Ma.””).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. to include setting a learning area for each object of the plurality of objects based on the obtained three-dimensional shape data, as taught by Masumura et al., in order to learn a 3D field using a Mask R-CNN model and thereby improve computation speed during image processing.
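Purely as an illustrative sketch of the per-object learning area as mapped above (the padding margin and data layout are assumptions, not details taken from the cited references), a learning area can be derived from rough per-object voxel shape data as an axis-aligned bounding box:

```python
import numpy as np

def set_learning_areas(object_voxels, margin=0.05):
    """Derive one axis-aligned learning area (bounding box) per object.

    object_voxels: dict mapping object id -> (N_i, 3) array of occupied voxel centers
                   (e.g., the rough visual hull computed for that object).
    margin:        padding added around the rough shape, in scene units.
    Returns a dict mapping object id -> (min_corner, max_corner).
    """
    areas = {}
    for obj_id, voxels in object_voxels.items():
        min_corner = voxels.min(axis=0) - margin
        max_corner = voxels.max(axis=0) + margin
        areas[obj_id] = (min_corner, max_corner)
    return areas

def in_learning_area(points, area):
    """Boolean mask of 3D sample points inside a given learning area, i.e., the points
    taken as targets when learning that object's three-dimensional field."""
    min_corner, max_corner = area
    return np.all((points >= min_corner) & (points <= max_corner), axis=1)
```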
Regarding claim 2, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 1, and further teaches wherein in the learning: the three-dimensional field is stored in a storage unit; an image corresponding to each image capturing viewpoint having the same viewing angle as that of each captured image is drawn based on camera parameters corresponding to each of the plurality of captured images and the three-dimensional field stored in the storage unit; and the three-dimensional field is updated based on a plurality of rendered images obtained by the rendering and the plurality of captured images (Nishiyama: Fig 1, par 0019-0020, “An image capturing system 100 has a plurality of cameras 101 and an image processing apparatus 200. As shown in FIG. 1, by using the plurality of the cameras 101 arranged so as to surround an object, image capturing of the object is performed. Each of the plurality of the cameras 101 obtains an image group by capturing the object from image capturing positions different from one another”, par 0032-0033, “the camera parameter acquisition unit 302 acquires camera parameters of each of the plurality of the cameras 101. The camera parameters include internal parameters, external parameters, and distortion parameters. The internal parameters may include at least one of the coordinate values of the image center and the focal length of the camera lens. The external parameters are parameters indicating the position and orientation of the camera “, par 0037, “At the time of determining the silhouette inside/outside determination condition, it may also be possible to acquire a threshold value determined in advance from a storage medium, such as the storage unit 204, or to acquire from the outside of the image processing apparatus 200 “, Mattews et al.: par 0045, “ the camera parameters can be determined using a fitting model. For example, the plurality of two-dimensional landmarks can then be processed with a fitting model to determine the one or more camera parameters. The one or more camera parameters can be associated with the respective image and stored for iterative training”, par 0079, “the user computing device 102 can store or include one or more generative neural radiance field models 120. For example, the generative neural radiance field models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.”, par 0089, par 0123, “The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations”).
Regarding claim 3, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 2, and Mattews et al. further teach wherein a color difference is found between a captured image and a rendered image in a correspondence relationship with each other by taking a learning area of interest among the learning areas set for each object of the plurality of objects as a target and the three-dimensional field is updated so that the color difference becomes small (par 0011, par 0050, “evaluate a loss function (e.g., a red-green-blue loss or a perceptual loss) that evaluates a difference between the image and the reconstruction output and adjusts one or more parameters of the generative neural radiance field model based at least in part on the loss function “, par 0117, “the color values of the volume rendering 216 and/or the view rendering can be compared against the color values of an input training image 202 in order to evaluate a red-green-blue loss 224 (e.g., the loss can evaluate the accuracy of the color prediction with respect to a ground truth color from the training image). The density values of the volume rendering 216 can be utilized to evaluate a hard surface loss 222 (e.g., the hard surface loss can penalize density values that are not associated with completely opaque or completely transparent opacity values). Additionally and/or alternatively, the volume rendering 216 may be compared against segmented data (e.g., one or more objects segmented from training images 202 using an image segmentation model 218) from one or more training images 202 in order to evaluate a segmentation mask loss 220 (e.g., a loss that evaluates the rendering of an object in a particular object class with respect to other objects in the object class)”, par 0122, “The predicted color values and predicted density values for the foreground and the background can be concatenated and then utilized for training the machine-learned model(s) or learning the latent table 204. For example, the predicted color values and the predicted density values can be processed by a composite block 216 to generate a reconstruction output, which can be compared against one or more images from the training data 202 in order to evaluate a red-green-blue loss 224 (e.g., a perceptual loss). Additionally and/or alternatively, one or more images from the training data 202 can be processed with an image segmentation model 218 to segment the object. The segmentation data and the predicted color values and predicted density values can be compared to evaluate a segmentation mask loss 220”).
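As a minimal, hypothetical sketch of the color-difference criterion discussed for claim 3 (the render_fn and field_params interfaces, and the use of a mean-squared-error loss as a stand-in for the cited red-green-blue loss, are assumptions rather than Mattews et al.'s actual model):

```python
import torch

def color_difference_step(render_fn, field_params, optimizer, captured_image, camera_params):
    """One illustrative update that reduces the per-pixel color difference between the
    image rendered from the learned 3D field and the corresponding captured image."""
    optimizer.zero_grad()
    rendered = render_fn(field_params, camera_params)              # draw the same viewpoint from the field
    loss = torch.nn.functional.mse_loss(rendered, captured_image)  # mean per-pixel color difference
    loss.backward()                                                # backpropagate the difference
    optimizer.step()                                               # update the field so the difference shrinks
    return loss.item()
```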
Regarding claim 7, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 1, and further teaches wherein a solid body circumscribing the three-dimensional shape of each object of the plurality of object , which is represented by the three-dimensional shape data, is set as the learning area (Nishiyama: Fig 6, par 0034, par 0046-0047, “ the position acquisition unit 303 derives three-dimensional coordinates of a point or a voxel representative of the object as information indicating the approximate position of the object. As a point representative of the object, it is possible to use the position of the center of gravity of the object or a part of vertexes of a bounding box including the object “, Mattews et al.: par 0132, “FIG. 7 further depicts a segmentation mask 904 & 908 for each input image. The segmentation masks 904 & 908 can be generated by an image segmentation model that processes the input images. The segmentation masks 904 & 908 can be associated with a foreground object in the input images. In some implementations, the segmentation masks 904 & 908 can isolate the object from the rest of the input image in order to evaluate the object rendering of a generated view rendering”, Masumura et al.: Fig 5, par 0046-0049, “FIG. 5 is a diagram for illustrating processing executed by the machine learning device 1 when learning data is automatically generated and learned. First, as illustrated in part (a) of FIG. 5, the virtual object arranger 101 arranges a plurality of virtual objects in a virtual three-dimensional space….. object information on a plurality of objects is obtained. This object information is information including the position, orientation, and shape of each object arranged in the virtual three-dimensional space …. the virtual object image generator 102 generates a virtual object image, which is an image of the plurality of virtual objects viewed from an imaging direction D′, as illustrated in part (b). The imaging direction D′ illustrated in part (a) of FIG. 5 is a direction defined in the three-dimensional space so as to correspond to the imaging direction of the object imaging unit 202 in the real work system 2, which is indicated by the work direction D of FIG. 3. In this way, the virtual object image generator 102 generates, based on the object information, the virtual object image as if real objects were photographed”).
Regarding claim 8, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 1, and further teaches wherein the three-dimensional shape data is data identifying the three-dimensional shape of the each object of the plurality of objects by a set of a plurality of elements and a three-dimensional area represented by a set of elements whose size is larger than that of the element is set as the learning area (Nishiyama: Fig 6, par 0034, par 0046-0047, “ the position acquisition unit 303 derives three-dimensional coordinates of a point or a voxel representative of the object as information indicating the approximate position of the object. As a point representative of the object, it is possible to use the position of the center of gravity of the object or a part of vertexes of a bounding box including the object “, Mattews et al.: par 0132, “FIG. 7 further depicts a segmentation mask 904 & 908 for each input image. The segmentation masks 904 & 908 can be generated by an image segmentation model that processes the input images. The segmentation masks 904 & 908 can be associated with a foreground object in the input images. In some implementations, the segmentation masks 904 & 908 can isolate the object from the rest of the input image in order to evaluate the object rendering of a generated view rendering”, Masumura et al.: Fig 5, par 0046-0049, “FIG. 5 is a diagram for illustrating processing executed by the machine learning device 1 when learning data is automatically generated and learned. First, as illustrated in part (a) of FIG. 5, the virtual object arranger 101 arranges a plurality of virtual objects in a virtual three-dimensional space….. object information on a plurality of objects is obtained. This object information is information including the position, orientation, and shape of each object arranged in the virtual three-dimensional space …. the virtual object image generator 102 generates a virtual object image, which is an image of the plurality of virtual objects viewed from an imaging direction D′, as illustrated in part (b). The imaging direction D′ illustrated in part (a) of FIG. 5 is a direction defined in the three-dimensional space so as to correspond to the imaging direction of the object imaging unit 202 in the real work system 2, which is indicated by the work direction D of FIG. 3. In this way, the virtual object image generator 102 generates, based on the object information, the virtual object image as if real objects were photographed”).
Regarding claim 9, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 1, and further teaches wherein the three-dimensional shape data representing the three-dimensional shape of the each object of the plurality of objects is generated by the visual hull method using the plurality of captured images (Nishiyama: Fig 6, par 0034, “ the position acquisition unit 303 derives three-dimensional coordinates of a point or a voxel representative of the object as information indicating the approximate position of the object. As a point representative of the object, it is possible to use the position of the center of gravity of the object or a part of vertexes of a bounding box including the object “, par 0047, “in a case where the projection of the voxel V is included inside the silhouette in all the silhouette images S1 to S4, the voxel V is left as a voxel configuring the object OB. By performing this series of processing for all the voxels within the bounding box, a visual hull (abbreviated to VH), which is a set of linked convex voxels, is generated. The above is the principle of shape restoration by the SCM”, claim 6, “the generation unit generates a visual hull of the object as the three-dimensional shape data by repeating processing to determine whether a voxel of interest is a voxel belonging to the object within a range surrounding the object and leaving only voxels belonging to the object.”, Mattews et al.: par 0132, “FIG. 7 further depicts a segmentation mask 904 & 908 for each input image. The segmentation masks 904 & 908 can be generated by an image segmentation model that processes the input images. The segmentation masks 904 & 908 can be associated with a foreground object in the input images. In some implementations, the segmentation masks 904 & 908 can isolate the object from the rest of the input image in order to evaluate the object rendering of a generated view rendering”).
Regarding claim 12, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 1, and further teach wherein the three-dimensional field is Occupancy Field representing a volume density and the virtual viewpoint data is a map representing occupancy in a case of being viewed from a virtual viewpoint (Mattews et al.: par 0028, “The systems and methods disclosed herein can leverage the plurality of single-view image datasets of the object class or scene class in order to learn a volumetric three-dimensional representation. The volumetric three-dimensional modeling representation can then be utilized to generate one or more view renderings “, par 0119, par 0141, neural radiance field model with color value and density value, Masumura et al.: Fig 4, par 0036-0038, “The Mask R-CNN model M has been trained in advance so that when the object image, which is an image obtained by photographing objects stacked in bulk from the work direction, is received, the Mask R-CNN model M recognizes each object, and in the image, shows the pixels occupied by the recognized object, that is, the existence region E, as a segment and at the same time, output a label L indicating a coverage state of the recognized object by other objects, that is, whether or not the recognized object is covered by another object”).
Regarding claim 17, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 1, and Mattews et al. further teach wherein the one or more processors further execute the instructions to perform: outputting the virtual viewpoint data by performing estimation by using a three-dimensional field learned by the image processing apparatus according to claim 1 (par 0006, “The method can include processing the plurality of images with a landmark estimator model to determine a respective set of one or more camera parameters for each image of the plurality of images. In some implementations, determining the respective set of one or more camera parameters can include determining a plurality of two-dimensional landmarks in each image. The method can include for each image of the plurality of images: processing a latent code associated with a respective object depicted in the image with a generative neural radiance field model to generate a reconstruction output, evaluating a loss function that evaluates a difference between the image and the reconstruction output, and adjusting one or more parameters of the generative neural radiance field model based at least in part on the loss function. In some implementations, the reconstruction output can include a volume rendering generated based at least in part on the respective set of one or more camera parameters for the image “, par 0127, “fixing the camera parameters 316 (e.g., the position in the environment and the view direction) but varying the latent code 318 can allow for the generative neural radiance field model to display the performance of view renderings for different objects in the object class 322.”).
Regarding claim 18, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitation of claim 17, and further teaches wherein in the estimating: a learned three-dimensional field is stored in a storage unit and the virtual viewpoint data is generated based on the three-dimensional field stored in the storage unit in accordance with camera parameters of a virtual viewpoint (Nishiyama: Fig 1, par 0019-0020, “An image capturing system 100 has a plurality of cameras 101 and an image processing apparatus 200. As shown in FIG. 1, by using the plurality of the cameras 101 arranged so as to surround an object, image capturing of the object is performed. Each of the plurality of the cameras 101 obtains an image group by capturing the object from image capturing positions different from one another”, par 0032-0033, “the camera parameter acquisition unit 302 acquires camera parameters of each of the plurality of the cameras 101. The camera parameters include internal parameters, external parameters, and distortion parameters. The internal parameters may include at least one of the coordinate values of the image center and the focal length of the camera lens. The external parameters are parameters indicating the position and orientation of the camera “, par 0037, “At the time of determining the silhouette inside/outside determination condition, it may also be possible to acquire a threshold value determined in advance from a storage medium, such as the storage unit 204, or to acquire from the outside of the image processing apparatus 200 “, Mattews et al.: par 0045, “ the camera parameters can be determined using a fitting model. For example, the plurality of two-dimensional landmarks can then be processed with a fitting model to determine the one or more camera parameters. The one or more camera parameters can be associated with the respective image and stored for iterative training”, par 0079, “the user computing device 102 can store or include one or more generative neural radiance field models 120. For example, the generative neural radiance field models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.”, par 0089, par 0123, “The training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors. For example, a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function). Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions. Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations”).
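For claim 18, the generation of virtual viewpoint data from a stored three-dimensional field in accordance with the camera parameters of a virtual viewpoint can be illustrated with a standard volume-rendering sketch (the field interface, ray generation, and uniform sampling are assumptions and do not reproduce the cited implementations):

```python
import numpy as np

def render_virtual_view(field_fn, ray_origins, ray_directions, near=0.1, far=5.0, n_samples=64):
    """Composite colors along each ray of a virtual camera from a stored/learned field.

    field_fn: callable (points (M, 3), view_dirs (M, 3)) -> (colors (M, 3), densities (M,)).
    ray_origins, ray_directions: (R, 3) arrays derived from the virtual camera parameters.
    Returns (R, 3) pixel colors for the virtual viewpoint.
    """
    t = np.linspace(near, far, n_samples)                            # sample depths along each ray
    points = ray_origins[:, None, :] + t[None, :, None] * ray_directions[:, None, :]
    dirs = np.repeat(ray_directions[:, None, :], n_samples, axis=1)
    colors, density = field_fn(points.reshape(-1, 3), dirs.reshape(-1, 3))
    colors = colors.reshape(-1, n_samples, 3)
    density = density.reshape(-1, n_samples)
    delta = (far - near) / n_samples
    alpha = 1.0 - np.exp(-density * delta)                           # per-sample opacity
    transmittance = np.cumprod(1.0 - alpha + 1e-10, axis=1)
    transmittance = np.concatenate([np.ones((alpha.shape[0], 1)), transmittance[:, :-1]], axis=1)
    weights = alpha * transmittance                                  # standard volume-rendering weights
    return (weights[..., None] * colors).sum(axis=1)
```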
Regarding claim 19, the method of claim 19 is similar in scope to claim 1 and is rejected under the same rationale.
Regarding claim 20, Nishiyama teaches a non-transitory computer readable storage medium storing a program for causing a computer to perform an image processing method comprising the steps of (par 0024, par 0080). The remaining limitations of the claim are similar in scope to claim 1 and are rejected under the same rationale.
Claim(s) 4-6 is/are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2020/0005476 to Nishiyama in view of U.S. PGPubs 2024/0371081 to Mathews et al., further in view of U.S. PGPubs 2023/0202030 to Masumura et al., further in view of U.S. PGPubs 2011/0038529 to Utsugi et al.
Regarding claim 4, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitations of claim 3, but does not explicitly teach wherein the updating is performed by finding the color difference between both images for each pixel in a correspondence relationship.
In related endeavor, Utsugi et al. teach wherein the updating is performed by finding the color difference between both images for each pixel in a correspondence relationship (par 0008, “determined whether a pixel corresponding to a pixel on the seam in the main image exists in the sub image as well; the color difference and position difference between the main image and the sub image is used and added as a correction factor of energy used in a seam search in the main image to determination processing in the optimal seam search”, par 0027, “This technique searches for a pixel relationship of a small color change between connection components, that is, a pixel of little energy, from top to bottom and from left to right”, par 0075, par 0084-0085, “a path of connected pixels in the sub image that minimizes energy obtained by summing color differences and position differences between pixels of the main image and pixels of the sub image is selected in the proximity of the initial search point; recursive search processing for optimal connected pixels is performed using energy generated by the difference between the main image and the sub image as a correction factor of the pixel gradient energy of the main image”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. and Masumura et al. to include wherein the updating is performed by finding the color difference between both images for each pixel in a correspondence relationship, as taught by Utsugi et al., in order to minimize the energy between pixels of the first image data and pixels of the second image data through color information differences and position differences between these pixels, thereby enabling image retargeting that reduces the breakdown of the relationship between stereo images.
Regarding claim 5, Nishiyama as modified by Mattews et al., Masumura et al., and Utsugi et al. teaches all the limitation of claim 4, and further teaches wherein the updating is performed by finding the color difference from a pixel of the captured image having visibility for an element configuring the three-dimensional shape data and a pixel of the rendered image corresponding thereto (Nishiyama: par 0055, “it may also be possible to use a silhouette image in an auxiliary manner as in the case with the first embodiment, but basically, color information on a captured image is used. In the following, a case is explained where as an evaluation value of matching using color information, the normalized cross-correlation (hereinafter, NCC) is adopted”, Mattews et al.: par 0117, “the color values of the volume rendering 216 and/or the view rendering can be compared against the color values of an input training image 202 in order to evaluate a red-green-blue loss 224 (e.g., the loss can evaluate the accuracy of the color prediction with respect to a ground truth color from the training image). The density values of the volume rendering 216 can be utilized to evaluate a hard surface loss 222 (e.g., the hard surface loss can penalize density values that are not associated with completely opaque or completely transparent opacity values). Additionally and/or alternatively, the volume rendering 216 may be compared against segmented data (e.g., one or more objects segmented from training images 202 using an image segmentation model 218) from one or more training images 202 in order to evaluate a segmentation mask loss 220 (e.g., a loss that evaluates the rendering of an object in a particular object class with respect to other objects in the object class)”, Utsugi et al.: par 0008, “determined whether a pixel corresponding to a pixel on the seam in the main image exists in the sub image as well; the color difference and position difference between the main image and the sub image is used and added as a correction factor of energy used in a seam search in the main image to determination processing in the optimal seam search”, par 0027, “This technique searches for a pixel relationship of a small color change between connection components, that is, a pixel of little energy, from top to bottom and from left to right”, par 0075, par 0084-0085, “a path of connected pixels in the sub image that minimizes energy obtained by summing color differences and position differences between pixels of the main image and pixels of the sub image is selected in the proximity of the initial search point; recursive search processing for optimal connected pixels is performed using energy generated by the difference between the main image and the sub image as a correction factor of the pixel gradient energy of the main image”).
Regarding claim 6, Nishiyama as modified by Mattews et al., Masumura et al., and Utsugi et al. teaches all the limitation of claim 5, and further teaches wherein the updating is performed by determining an initial value of the three-dimensional field based on a pixel value of a captured image of a viewpoint having visibility for the element (Mattews et al.: par 0117, “the color values of the volume rendering 216 and/or the view rendering can be compared against the color values of an input training image 202 in order to evaluate a red-green-blue loss 224 (e.g., the loss can evaluate the accuracy of the color prediction with respect to a ground truth color from the training image). The density values of the volume rendering 216 can be utilized to evaluate a hard surface loss 222 (e.g., the hard surface loss can penalize density values that are not associated with completely opaque or completely transparent opacity values). Additionally and/or alternatively, the volume rendering 216 may be compared against segmented data (e.g., one or more objects segmented from training images 202 using an image segmentation model 218) from one or more training images 202 in order to evaluate a segmentation mask loss 220 (e.g., a loss that evaluates the rendering of an object in a particular object class with respect to other objects in the object class)”, Utsugi et al.: par 0031, “ an original image P0 of X horizontal pixels by Y vertical pixels is prepared, and initialization is performed. For each pixel (coordinate position (x, y)), color information about the original image is copied to the color information area 411 of the unit structure positioned at (x, y), a default value "-1" is written into the seam number (412), "0" is written into the calculation data area D (413), the values (x, y) are written into the image position information F (414), and "0" is written into the energy E “, par 0059, “This is referred to as a corresponding initial pixel. Both x_p and y_p are decimals and coordinates indicating a position in the image with sub-pixel accuracy. Further, depth information obtained from these parallax values is stored in the memory area 917 of each pixel structure in the main image “, par 0075, par 0084-0085, “a path of connected pixels in the sub image that minimizes energy obtained by summing color differences and position differences between pixels of the main image and pixels of the sub image is selected in the proximity of the initial search point; recursive search processing for optimal connected pixels is performed using energy generated by the difference between the main image and the sub image as a correction factor of the pixel gradient energy of the main image”).
Claim(s) 10-11 is/are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2020/0005476 to Nishiyama in view of U.S. PGPubs 2024/0371081 to Mathews et al., further in view of U.S. PGPubs 2023/0202030 to Masumura et al., further in view of U.S. PGPubs 2022/0381558 to Round.
Regarding claim 10, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitations of claim 1, but does not explicitly teach wherein the three-dimensional field is an emitted radiance field associating a volume density and an anisotropic color with each other for each coordinate in an image capturing space of the plurality of imaging devices and the virtual viewpoint data is a virtual viewpoint image showing an appearance from a virtual viewpoint.
In related endeavor, Round teaches wherein the three-dimensional field is an emitted radiance field associating a volume density and an anisotropic color with each other for each coordinate in an image capturing space of the plurality of imaging devices and the virtual viewpoint data is a virtual viewpoint image showing an appearance from a virtual viewpoint (par 0004, “More specifically, in order to generate a 3D model of a real-world object, imaging is performed from a variety of angles with one or more cameras and one or more light sources. This dataset is then used to construct a 3D model of the object and rendered for a given virtual position relative to a viewpoint and a light source”, par 0099, “Because this 3D model is based on predictions of the underlying material properties of surfaces, rather than being based purely on image data as in conventional methods of representing a real-world object, the 3D model reacts realistically when illuminated and/or viewed from the limited number of angles which were not sampled from the actual real-world object. This means that the 3D model of the object can appear highly realistic even from a relatively small dataset of real image data about the object.” par 0092, “measure some channels of the material property, such as colour, by conventional means. However, measuring other channels of the material property, such as specular reflection intensity, roughness, sub-surface scattering and diffuse light intensity, is more difficult. More specifically, for a surface of any complexity, sensed light may come from different points on the object surface, and the spatial distribution of the light is dependent upon many channels of the material property at different points on the object surface. Furthermore, if transmission characteristics of surfaces of the object, such as transparency and index of refraction, are considered as channels of the material property, then determining the underlying material property that produces a distribution of sensed light becomes even more complex. Yet further, channels of the material property may be anisotropic, increasing complexity again”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. and Masumura et al. to include wherein the three-dimensional field is an emitted radiance field associating a volume density and an anisotropic color with each other for each coordinate in an image capturing space of the plurality of imaging devices and the virtual viewpoint data is a virtual viewpoint image showing an appearance from a virtual viewpoint, as taught by Round, in order to provide a way of generating a three-dimensional model of an object with high realism from a relatively small dataset.
Regarding claim 11, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitations of claim 1, but does not explicitly teach wherein the three-dimensional field is an emitted radiance field associating a volume density and an isotropic color with each other for each coordinate in an image capturing space of the plurality of imaging devices and the virtual viewpoint data is a virtual viewpoint image showing an appearance from a virtual viewpoint.
In related endeavor, Round teaches wherein the three-dimensional field is an emitted radiance field associating a volume density and an isotropic color with each other for each coordinate in an image capturing space of the plurality of imaging devices and the virtual viewpoint data is a virtual viewpoint image showing an appearance from a virtual viewpoint (par 0004, “More specifically, in order to generate a 3D model of a real-world object, imaging is performed from a variety of angles with one or more cameras and one or more light sources. This dataset is then used to construct a 3D model of the object and rendered for a given virtual position relative to a viewpoint and a light source”, par 0099, “Because this 3D model is based on predictions of the underlying material properties of surfaces, rather than being based purely on image data as in conventional methods of representing a real-world object, the 3D model reacts realistically when illuminated and/or viewed from the limited number of angles which were not sampled from the actual real-world object. This means that the 3D model of the object can appear highly realistic even from a relatively small dataset of real image data about the object.”, par 0106, “each channel may be represented by a relative value ranging from 0 to 1 or 0% to 100%. Additionally, each channel may be associated with multiple numerical values, for example values for each of three spatial dimensions in the case that the channel is anisotropic, or, in the case of colour, red, green and blue values”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. and Masumura et al. to include wherein the three-dimensional field is an emitted radiance field associating a volume density and an isotropic color with each other for each coordinate in an image capturing space of the plurality of imaging devices and the virtual viewpoint data is a virtual viewpoint image showing an appearance from a virtual viewpoint, as taught by Round, in order to provide a way of generating a three-dimensional model of an object with high realism from a relatively small dataset.
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2020/0005476 to Nishiyama in view of U.S. PGPubs 2024/0371081 to Mathews et al., further in view of U.S. PGPubs 2023/0202030 to Masumura et al., further in view of U.S. PGPubs 2022/0343522 to Bi et al.
Regarding claim 13, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitations of claim 1, but does not explicitly teach wherein the three-dimensional field is a field of a bidirectional reflectance distribution function and the virtual viewpoint data is a map representing a bidirectional reflectance distribution function in a case of being viewed from a virtual viewpoint.
In related endeavor, Bi et al. teach wherein the three-dimensional field is a field of a bidirectional reflectance distribution function and the virtual viewpoint data is a map representing a bidirectional reflectance distribution function in a case of being viewed from a virtual viewpoint (par 0003, “generating both a geometry model and an optical-reflectance model (e.g., an object reconstruction model) for a physical object, based on a sparse set of images of the object under a sparse set of viewpoints and lighting conditions. The geometry model may be a mesh model that includes a set of vertices representing discretized regions on the object's surface. Thus, the geometry model encodes a representation of a geometry of the object's surface. The reflectance model may be a spatially-varying bidirectional reflectance distribution function (SVBRDF) that is parameterized via multiple channels (e.g., diffuse albedo, surface-roughness, specular albedo, and surface-normals). For each vertex of the geometry model, the reflectance model may include a value (e.g., a scalar, vector, or any other tensor value) for each of the multiple channels. The object reconstruction model may be employed to render graphical representations of a virtualized version of the physical object (e.g., a virtual object based on a physical object) within a computation-based (e.g., a virtual or immersive) environment”, par 0017, “he enhanced geometry and reflectance models enable reconstructing graphical representations of the physical object from viewpoints and lighting conditions that are insufficiently represented in the sparse set of images that the models are based upon. A VO (corresponding to the physical object) may be fully embedded in a computing environment, such that the VO may be viewed from the arbitrary viewpoints and lighting conditions. In various embodiments, the geometry model of an object reconstruction model may be a mesh model that includes a set of vertices representing discretized regions on the object's bounding surface (e.g., a 2D manifold). The reflectance model for the object reconstruction model may be a bidirectional reflectance distribution function (BRDF) that includes multiple reflectance parameters for each vertex in the mesh model. Because the BRDF parameters may vary across object's 2D manifold, the BRDF model may be a spatially-varying BRDF (SVBRDF) model”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. and Masumura et al. to include wherein the three-dimensional field is a field of a bidirectional reflectance distribution function and the virtual viewpoint data is a map representing a bidirectional reflectance distribution function in a case of being viewed from a virtual viewpoint, as taught by Bi et al., in order to generate both a geometry model and an optical-reflectance model (an object reconstruction model) for a physical object from a sparse set of images captured under a sparse set of viewpoints and lighting conditions, so that a graphical representation of the physical object can be rendered from arbitrary viewpoints and under arbitrary lighting conditions (e.g., multiple non-point light sources positioned at multiple viewpoints with multiple frequency spectrums) within a computation-based (e.g., virtual or immersive) environment.
Claim(s) 14 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2020/0005476 to Nishiyama in view of U.S. PGPubs 2024/0371081 to Mathews et al., further in view of U.S. PGPubs 2023/0202030 to Masumura et al., further in view of U.S. PGPubs 2019/0362539 to Kurz et al.
Regarding claim 14, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitations of claim 1, but does not explicitly teach wherein the three-dimensional field is a field of Light Visibility of ambient light and the virtual viewpoint data is a map representing a degree of appearance in a case of being viewed from a virtual viewpoint.
In related endeavor, Kurz et al. teach wherein the three-dimensional field is a field of Light Visibility of ambient light (par 0052, “Reflection Probes may be used as a source of reflected and ambient light for objects inside their area of influence, which may be defined by its proxy geometry. A probe of this type captures and stores its surroundings. In implementations, one or more reflection probes are used to dynamically generate cube maps for use as reflection textures. Reflection probes may be placed at the center of a CGR object, e.g., viewpoint 156 of the CGR object 155 shown in FIG. 2, or the specific mesh that will use the environment map, and the probe internally generates a ghost camera at the specific position”, par 0082, “an application may generate multiple environment maps in a CGR scene at different positions. In order to ensure consistency in the synthesized/completed parts of the environment maps obtained from the network, the unknown pixels of the multiple environment maps may be set to the same random values across all the maps. As a result, the encoder portion of the network will generate similar outputs. An application may also use known metadata about the scene, such as scene category information (e.g., indoor, outdoor, beach, living room, etc.), information from the camera (ISP, light sensor, or images, such as exposure time, exposure offset, ISO value, ambient light intensity, ambient light temperature, etc.), or derived quantities (e.g., estimated scene luminance)”), and the virtual viewpoint data is a map representing a degree of appearance in a case of being viewed from a virtual viewpoint (par 0043-0044, par 0050-0051, “ As illustrated in exemplary FIG. 2, the CGR object 155, e.g., a cube, is rendered on the surface 120, including a viewpoint 156 of the CGR object 155. The viewpoint 156 may also be referred to as the pose of CGR object 155 and it may be parametrized by a six-degree-of-freedom rigid body transform. For illustrative purposes, the viewpoint 156 includes reference Cartesian axes as a proxy geometry. As understood by one of ordinary skill in the art, implementations may employ any suitable viewpoint orientation, orientation references, and/or proxy geometry”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. and Masumura et al. to include wherein the three-dimensional field is a field of Light Visibility of ambient light and the virtual viewpoint data is a map representing a degree of appearance in a case of being viewed from a virtual viewpoint, as taught by Kurz et al., in order to accurately render a reflective surface of a computer-generated reality (“CGR”) object based on the complete environment map of the CGR environment.
Regarding claim 16, Nishiyama as modified by Mattews et al. and Masumura et al. teaches all the limitations of claim 1, but does not explicitly teach wherein the three-dimensional field is a field in a direction of a normal to the object surface (Normal Field) and the virtual viewpoint data is a normal map in a case of being viewed from a virtual viewpoint.
In related endeavor, Kurz et al. teach wherein the three-dimensional field is a field in a direction of a normal to the object surface (Normal Field) (par 0066, “the geometry of the physical surfaces in the CGR environment can be approximated by a set of finite, planar surfaces (e.g. quadrangles or triangles in 3D space), and the center of projection of an environment map may be selected above the center of a such planar surface, along the surface normal at a distance that is half the extent of the planar surface. The extent of the region of impact of the reflectance probe may use a box as a proxy geometry with its center at the center of projection and aligned with the planar surface so that one side of the box coincides with the planar surface”, par 0083, “in order to produce a more realistic reflection, each of many face normals, e.g., direction a given point on a polygon is facing, of a virtual object may be used in tandem with an environment map. Thus, the angle of reflection at a given point on a CGR object can take the normal map into consideration, allowing an otherwise flat surface to appear textured, e.g., corrugated metal or brushed aluminum”), and the virtual viewpoint data is a normal map in a case of being viewed from a virtual viewpoint (par 0043-0044, par 0050-0051, “As illustrated in exemplary FIG. 2, the CGR object 155, e.g., a cube, is rendered on the surface 120, including a viewpoint 156 of the CGR object 155. The viewpoint 156 may also be referred to as the pose of CGR object 155 and it may be parametrized by a six-degree-of-freedom rigid body transform. For illustrative purposes, the viewpoint 156 includes reference Cartesian axes as a proxy geometry. As understood by one of ordinary skill in the art, implementations may employ any suitable viewpoint orientation, orientation references, and/or proxy geometry”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify Nishiyama as modified by Mattews et al. and Masumura et al. to include wherein the three-dimensional field is a field in a direction of a normal to the object surface (Normal Field) and the virtual viewpoint data is a normal map in a case of being viewed from a virtual viewpoint, as taught by Kurz et al., in order to accurately render a reflective surface of a computer-generated reality (“CGR”) object based on the complete environment map of the CGR environment.
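As a hypothetical illustration of a normal field of the kind discussed for claim 16 (the signed-distance grid input and gradient-based normal estimation are assumptions, not Kurz et al.'s method):

```python
import numpy as np

def normal_field(sdf_grid, voxel_size=1.0):
    """Normals as the normalized gradient of a scalar 3D field (e.g., a signed distance grid)."""
    gx, gy, gz = np.gradient(sdf_grid, voxel_size)
    normals = np.stack([gx, gy, gz], axis=-1)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.clip(length, 1e-8, None)   # unit-length normal per voxel
```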
Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over U.S. PGPubs 2020/0005476 to Nishiyama in view of U.S. PGPubs 2024/0371081 to Mathews et al., further in view of U.S. PGPubs 2023/0202030 to Masumura et al., further in view of U.S. PGPubs 2023/0316646 to Chen et al., further in view of U.S. PGPubs 2022/0343522 to Bi et al.
Regarding claim 15, Nishiyama as modified by Mathews et al. and Masumura et al. teaches all the limitations of claim 1, but does not explicitly teach wherein the three-dimensional field is a floating point field (Signed Distance Field) in which the inside of an object is represented as negative and the outside is represented as positive or a binary field (Surface Field) in which the inside of an object is represented as 0 and the outside is represented as 1 and the virtual viewpoint data is a depth map in a case of being viewed from a virtual viewpoint.
In a related endeavor, Chen et al. teach wherein the three-dimensional field is a floating point field (Signed Distance Field) in which the inside of an object is represented as negative and the outside is represented as positive or a binary field (Surface Field) in which the inside of an object is represented as 0 and the outside is represented as 1 (par 0033, “the 3D function is approximated with a neural network that infers per-point labels: {in, out, null}. The label semantics can be represented using discrete numbers without loss of generality. In some embodiments, the classifier 210 learns a mapping function o: R.sup.3.fwdarw.{0, 1, NaN}, where the labels {0, 1, NaN} represent inside, outside, and null respectively”, par 0046, “with finer decomposition of 3D space, cells containing geometry distribute around the surface of interest while the null cells occupy the majority of the space. This differs from a conventional signed distance field, where the entirety of the space is filled with distances of either positive or negative sign (e.g., as illustrated in FIG. 4B)”, par 0074-0083, “generating a three-pole signed distance field (e.g., the signed distance field 108) from the input observation using the trained classifier; (iv) generating an output mesh of the 3D object (e.g., the object mesh 112) from the three-pole signed distance field; and (v) generating a display of the 3D object from the output mesh. [0075] (A2) In some embodiments of A1, the method further includes obtaining a sampling point template (e.g., the sampling point template 102), where the three-pole signed distance field is generated using the sampling point template. In some embodiments, the sampling point template includes a regular set (e.g., grid) of sampling points for the input observation. [0076] (A3) In some embodiments of A1 or A2, the three-pole signed distance field includes a three-pole signed distance value (e.g., 1, −1, or NaN) for each sampling point in the sampling point template”).
It would have been obvious to a person of ordinary skill in the art at the time before the effective filing date of the claimed invention to modify Nishiyama as modified by Mathews et al. and Masumura et al. to include wherein the three-dimensional field is a floating point field (Signed Distance Field) in which the inside of an object is represented as negative and the outside is represented as positive or a binary field (Surface Field) in which the inside of an object is represented as 0 and the outside is represented as 1, as taught by Chen et al., in order to provide methods for object surface generation, thereby increasing the effectiveness, efficiency, and user satisfaction with such systems and devices.
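The following minimal sketch, offered for illustration only and not drawn from Chen et al., shows the two field conventions recited in claim 15 for a simple sphere: a floating-point Signed Distance Field that is negative inside the object and positive outside, and a binary Surface Field that is 0 inside and 1 outside; the names signed_distance_field and surface_field are assumptions made for this example.

```python
import numpy as np

def signed_distance_field(points, radius=1.0):
    # Floating point field: signed distance to a sphere surface,
    # negative inside the object and positive outside.
    return np.linalg.norm(points, axis=-1) - radius

def surface_field(points, radius=1.0):
    # Binary field: 0 inside the object, 1 outside.
    return (np.linalg.norm(points, axis=-1) >= radius).astype(np.uint8)

pts = np.array([[0.0, 0.0, 0.0],    # inside the sphere
                [0.0, 0.0, 2.0]])   # outside the sphere
print(signed_distance_field(pts))   # [-1.  1.]
print(surface_field(pts))           # [0 1]
```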
In a related endeavor, Bi et al. teach wherein the virtual viewpoint data is a depth map in a case of being viewed from a virtual viewpoint (par 0004, par 0021-0022, “in the first stage, a multi-view reflectance neural network employs the set of input images (and the 2D depth maps) to regress (again by aggregating information across the multiple reference viewpoints) estimations for the SVBRDF parameters for each reference viewpoint. The multi-view reflectance network also encodes latent features for each of the input images (which are features in the “learned” reflectance space of the network). Note that for this first stage, the regressed estimations for the surface-depths and reflectance parameters are per reference viewpoint (e.g., per-view estimates)”).
It would have been obvious to a person of ordinary skill in the art at the time before the effective filing date of the claimed invention to modify Nishiyama as modified by Mathews et al., Masumura et al., and Chen et al. to include wherein the virtual viewpoint data is a depth map in a case of being viewed from a virtual viewpoint, as taught by Bi et al., in order to generate both a geometry model and an optical-reflectance model (an object reconstruction model) for a physical object based on a sparse set of images of the object under a sparse set of viewpoints within a computation-based (e.g., a virtual or immersive) environment, which is robust in the sense that various applications require rendering a graphical representation of the physical object from arbitrary viewpoints, as well as under arbitrary lighting conditions (e.g., multiple non-point light sources positioned at multiple viewpoints with multiple frequency spectrums).
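As an illustrative sketch only, not code from Bi et al. or any other cited reference, a depth map viewed from a virtual viewpoint can be produced by marching each pixel ray of a virtual camera against a signed distance field and recording the distance of the first surface hit; the names depth_map and sdf_sphere, the single-sphere scene, and the sphere-tracing loop below are assumptions made for this example.

```python
import numpy as np

def sdf_sphere(p, radius=1.0):
    # Signed distance field of a unit sphere at the origin (the "scene").
    return np.linalg.norm(p) - radius

def depth_map(cam_pos, res=4, half_fov=0.5, max_steps=64, eps=1e-3, t_max=10.0):
    # Depth map from a virtual viewpoint: sphere-trace the field along each pixel ray
    # (camera looking down -z) and record the distance of the first surface hit
    # (np.inf where the ray misses the object).
    depth = np.full((res, res), np.inf)
    coords = np.linspace(-half_fov, half_fov, res)
    for i, y in enumerate(coords):
        for j, x in enumerate(coords):
            ray = np.array([x, y, -1.0])
            ray /= np.linalg.norm(ray)
            t = 0.0
            for _ in range(max_steps):
                d = sdf_sphere(cam_pos + t * ray)
                if d < eps:
                    depth[i, j] = t
                    break
                t += d
                if t > t_max:
                    break
    return depth

print(depth_map(cam_pos=np.array([0.0, 0.0, 3.0])))
```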
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jin Ge whose telephone number is (571)272-5556. The examiner can normally be reached 8:00 to 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason Chan can be reached at (571)272-3022. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
JIN GE
Examiner
Art Unit 2619
/JIN GE/Primary Examiner, Art Unit 2619