DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
The two claims numbered 12 are objected to under 37 CFR 1.75 as being substantial duplicates of each other. When two claims in an application are duplicates or else are so close in content that they both cover the same thing, despite a slight difference in wording, it is proper after allowing one claim to object to the other as being a substantial duplicate of the allowed claim. See MPEP § 608.01(m). In this case, applicant filed two nearly identical claims both numbered 12: one depends on independent Claim 1 and the other depends on dependent Claim 11. Renumbering of the claims is advised.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 13 and 14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. In this case, dependent Claims 13 and 14 each depend on Claim 12. However, because applicant filed two claims numbered 12, it is indefinite which Claim 12 Claims 13 and 14 depend on. Proper renumbering of the claims is advised.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-9 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 12322068 B1, hereinafter Kim) in view of Praveen (WO 2022108652 A1, hereinafter Praveen).
Regarding Claim 1, Kim teaches a computer-implemented method, comprising receiving training data including training images of a scene (Kim, Column 5, Lines 36-40, "a generative neural network model to be used to synthesize 3D environments can be trained using training data consisting of (N*T) number of images and camera information <read on training images of a scene>, where N is a number of cameras and T is a number of time frames") and associated camera extrinsics corresponding to three-dimensional (3D) camera locations and camera directions from which the training images are captured (Kim, Column 8, Lines 30-32, "camera parameters can include at least a location and orientation of this camera, with respect to this scene or this vehicle, such that view information can be determined"; Column 4, Lines 10-15, "a voxel-based 3D representation 108 can be generated using a generative model (as may include an encoder 106) that takes in this sequence of images, relevant camera parameters (e.g., depth of focus or field of view), and camera orientation information <read on camera extrinsics corresponding to three-dimensional camera locations and camera directions>"), training, using the training data, a neural network to represent a latent model of the scene in a latent space (Kim, Column 3, Lines 36-42, "a neural network encoder 106 can analyze images 102, which may correspond to a sequence of images as may have been captured using at least one camera of a moving vehicle, extract representative features of those images, and encode those features into a latent representation 108 <read on latent model> of a scene represented in images 102"; "latent representation 108 may take form of a latent space or latent vector <read on latent space>, which may represent values in various locations of a voxel space"), wherein the neural network is configured to synthesize scene images corresponding to novel views of the scene from queried 3D viewpoints and viewing angles (Kim, Column 4, Lines 30-34, "changing camera parameters can cause this decoder to generate new images of this scene from one or more new or unique points of view <read on novel views of the scene from queried 3D viewpoints and viewing angles> that were not represented in this input image sequence 102 or set") (Kim, Column 7, Lines 61-63, "a generative model as trained herein can generate or synthesize novel views of this scene, such as from different angles or for different fields of view than were included in this input sequence").
Kim does not explicitly disclose receiving view spotlight information and prioritizing the training of the neural network based upon the view spotlight information.
However, Praveen teaches receiving view spotlight information (Praveen, Paragraph [0006], "receiving gaze information about an observer of a video stream; determining a video compression spatial map... based on the received gaze information..."; [0022], "gaze information <read on view spotlight information> about an observer of a video stream is received. The gaze information can be received from a head-mounted or a display-mounted gaze tracker. The gaze information includes information about instantaneous eye position"; Paragraph [0005], "The technology described herein uses gaze information to determine the regions of interest (ROIs) <read on view spotlight information>. Such ROIs are regions the observer/user is watching"; [0035], "Region of interest can be treated as a mask which can algorithmically define the required quality based on the distance from the ROI Center. The further the image area from the ROI Center, the less needed quality. Quality of image area is inversely proportional to the distance from the ROI Center"), prioritizing the content processing based upon the view spotlight information (Praveen, Paragraph [0020], "the server (or remote rendering servers) can use the target user gaze <read on view spotlight information>, e.g., the area/location the observer will be watching, and not only optimize the video but also generate video content <read on training of the neural network> based on the target gaze <read on view spotlight information>"; [0028], "The technology described herein can parameterize the gaze information and leverage such information for optimization of video transmission and real time video content generation <read on prioritizing the content processing based upon the view spotlight information>"; [0044], "The further an image area away from the center of the ROI, the larger the compression rate for that image area. In other words, image areas further away from the center of ROI are compressed more significantly, and thus have less fidelity").
Praveen and Kim are analogous since both of them are dealing with neural-network-based generation or processing of scene imagery in which image content is generated or optimized based on view-related information. Kim provided a way of training a neural network encoder using sequences of images and associated camera parameters to build a compact latent representation of a three-dimensional scene, from which novel views of the scene from arbitrary queried viewpoints and directions can be synthesized. Praveen provided a way of receiving real-time gaze information from an observer to identify the observer's region of interest in a video scene, and using that region-of-interest information to guide and prioritize video content generation so that image processing resources are concentrated on the scene areas currently being observed by the viewer. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the gaze-based region-of-interest prioritization technique (i.e., spotlight focus) taught by Praveen into the neural network training system of Kim, such that training and/or optimization of the neural network would prioritize scene regions corresponding to viewer interest so that generation or reconstruction quality is improved in the regions most relevant to viewing behavior. The motivation is to improve efficiency and fidelity allocation by emphasizing regions being observed while reducing resources for less important regions, as discussed by Praveen in Paragraphs [0005]-[0014], which explain that gaze information is used to determine regions of interest and apply higher-fidelity processing to those regions.
Regarding Claim 2, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the prioritizing the training includes biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information (Praveen, Paragraph [0042], "identifying a center of a region of interest (ROI) corresponding to a predicted eye position; selecting a first shape for the ROI and selecting a video compression profile that includes higher compression outside the first shape <read on biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information>"; [0044], "The further an image area away from the center of the ROI, the larger the compression rate for that image area. In other words, image areas further away from the center of ROI are compressed more significantly, and thus have less fidelity <read on biasing selection of the queried 3D viewpoints and viewing angles based upon the view spotlight information, whereby regions within the spotlight are selected for higher-fidelity training emphasis while regions outside the spotlight are deprioritized>"; [0028], "The technology described herein can parameterize the gaze information and leverage such information for optimization of video transmission and real time video content generation <read on biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information>").
As explained in the rejection of Claim 1, the rationale for combining the spotlight focus teaching of Praveen with Kim is provided above.
Regarding Claim 3, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the view spotlight information is based at least in part on a view direction of a viewer of the scene images (Praveen, Paragraph [0005], "The technology described herein uses gaze information to determine the regions of interest (ROIs) <read on view spotlight information>. Such ROIs are regions the observer/user is watching <read on view direction of a viewer of the scene images>"; [0022], "The gaze information includes information about instantaneous eye position <read on view direction of a viewer of the scene images>"; [0032], "the server can translate the eye gaze trait onto the image by converting the eye look at direction <read on view direction of a viewer> onto a 2D two-dimensional point on the image the eye is looking at").
As explained in the rejection of Claim 1, the rationale for combining the spotlight focus teaching of Praveen with Kim is provided above.
Regarding Claim 4, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the view direction is based at least in part on a last known or predicted view direction of the viewer (Praveen, Paragraph [0032], "the position A (a, b) can be extrapolated based on eye movement direction, eye movement speed, and network latency. The extrapolated position is for predicting where the user would be looking at by the time the image is delivered <read on predicted view direction of the viewer>"; [0046], "The center of the region of interest corresponds to the instantaneous eye position <read on last known view direction of the viewer> plus an offset proportional to the instantaneous eye velocity times the network latency <read on predicted view direction of the viewer>").
Praveen and Kim are analogous since both of them are dealing with neural-network-based generation or processing of scene imagery in which image content is generated or optimized based on view-related information. Kim provided a way of training a neural network encoder using sequences of images and associated camera parameters to build a compact latent representation of a three-dimensional scene, from which novel views of the scene from arbitrary queried viewpoints and directions can be synthesized. Praveen provided a way of receiving real-time gaze information from an observer to identify the observer's region of interest in a video scene, and using that prediction of region-of-interest information to guide and prioritize video content generation so that image processing resources are concentrated on the scene areas currently being observed by the viewer. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to base the view spotlight information on the last known or predicted view direction of the viewer as taught by Praveen, to compensate for the inherent latency between gaze measurement and neural network training or rendering response. The motivation is to ensure that the spotlight proactively reflects where the viewer will be looking by the time the generated content is delivered, thereby avoiding misallocation of training resources to scene regions the viewer has already shifted gaze away from, consistent with Praveen's teaching that the extrapolated eye position is specifically intended to predict where the user will be looking by the time the image is delivered.
Regarding Claim 5, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the view spotlight information corresponds to a region of 3D space within the scene (Kim, Column 3, Lines 61-63, "a neural network may generate a 3D representation of a scene that may be presented in virtual reality VR or augmented reality AR settings <read on region of 3D space within the scene>").
Praveen further teaches wherein the view spotlight information corresponds to a region of 3D space within the scene (Praveen, Paragraph [0002], "Video streaming consumes a large amount of bandwidth, especially in three dimensional (3D) environments. For example, in virtual reality (VR), augmented reality (AR), and/or mixed reality (MR) systems, a display device (e.g., head-mounted device) will receive video stream data from a server and display the received video stream data to a user in a spatial three-dimensional (3D) environment <read on region of 3D space within the scene>"; [0023], "the video display device can be a head-mounted device in AR/VR/MR systems that includes one or more sensors. The one or more sensors can provide real time tracking of the gaze information, such as human eye movement data <read on view spotlight information corresponding to a region of 3D space within the scene, wherein the gaze information tracked by the AR/VR/MR viewing device identifies the three-dimensional spatial region the observer is watching within the scene>").
As explained in the rejection of Claim 1, the rationale for combining the spotlight focus teaching of Praveen with Kim is provided above.
Regarding Claim 6, the combination of Kim and Praveen teaches the invention of Claim 5.
The combination further teaches wherein the region of 3D space is determined by a camera frustum of a virtual camera through which the scene is viewed (Praveen, Paragraph [0035], "Region of interest can be treated as a mask... based on the distance from the ROI Center... Quality of image area is inversely proportional to the distance from the ROI Center," the ROI defining a view volume corresponding to a virtual camera field of view <read on camera frustum>).
As explained in the rejection of Claim 1, the rationale for combining the gaze-based ROI selection of Praveen with Kim, which naturally defines spatial viewing regions corresponding to virtual camera views, is provided above.
Regarding Claim 7, the combination of Kim and Praveen teaches the invention of Claim 5.
The combination further teaches wherein the scene images include a view of a person (Kim, Column 3, Lines 36-57, "a neural network encoder 106 can analyze images 102 ... and encode those features into a latent representation 108 of a scene represented in images 102"; "a scene can include one or more objects in one or more locations, such as may include a number of vehicles, people, animals, buildings, roadways, and other such objects in a location or environment such as an urban city block").
Praveen further teaches wherein the scene images include a view of a person (Praveen, Paragraph [0005], "The technology described herein provides an eye tracking based video compression transmission method."; [0022], "gaze information about an observer of a video stream is received. The gaze information can be received from a head-mounted or a display-mounted gaze tracker. The gaze information includes information about instantaneous eye position").
Praveen and Kim are analogous since Kim provided a way of building a neural scene representation from images of physical scenes, which explicitly include people, and Praveen provided a way of managing video transmission for eye-tracking-based systems, which inherently involve capturing images of a person (the user). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply the neural scene representation of Kim to the video communication scenes taught by Praveen, such that the system enables photorealistic novel view synthesis for telepresence applications, where the person in the scene is the primary subject of interest.
Regarding Claim 8, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the spotlight information is received from a viewing device configured to synthesize the scene images using the neural network (Kim, Column 10, Lines 35-38, "a scene generator component 612 on client device 602 <read on viewing device> may generate 3D representations, such as by using a generative network with network weights received from content server 620 <read on viewing device configured to synthesize the scene images using the neural network>"; Column 10, Lines 23-27, "results of any of these components (e.g., generated meshes or determined network weights) can be transmitted to client device 602 <read on viewing device> using an appropriate transmission manager 622 to send by download, streaming, or another such transmission channel").
Praveen further teaches wherein the spotlight information is received from a viewing device configured to synthesize the scene images (Praveen, Paragraph [0022], "The gaze information can be received from a head-mounted or a display-mounted gaze tracker <read on spotlight information received from a viewing device>"; [0053], "the consumer device 402 can be a video display device, such as a head-mounted device or any other devices that can track the eye movement/gaze information of an observer/user. The consumer device 402 can collect the various user end traits 406 including the eye gaze traits... The consumer device 402 can transmit the collected various user end traits 406 to the video service server 450").
Praveen and Kim are analogous since Kim provided a way of building a neural scene representation from images of physical scenes and Praveen provided a way of collecting gaze information at the viewing device that displays the synthesized scene images. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to configure Kim's client viewing device to also transmit gaze spotlight information back to the training system, as taught by Praveen, such that the neural network receives spotlight data directly from the device that synthesizes the scene images. The motivation is that the viewing device is uniquely co-located with the observer during scene synthesis and can therefore provide the most accurate and lowest-latency spotlight data reflecting the observer's real-time viewing behavior, enabling the training system to immediately adapt training prioritization based on actual usage, since gaze information from the viewing device is leveraged for optimization of real-time video content generation.
Regarding Claim 9, the combination of Kim and Praveen teaches the invention of Claim 3.
The combination further teaches wherein the view spotlight information includes eye tracking information associated with the viewer (Praveen, Paragraph [0022], "gaze information <read on view spotlight information> about an observer of a video stream is received. The gaze information can be received from a head-mounted or a display-mounted gaze tracker <read on eye tracking information associated with the viewer>. The gaze information includes information about instantaneous eye position").
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the view spotlight information in Kim's training prioritization system using eye tracking information as taught by Praveen, such that the spotlight is grounded in a precise, objective, and continuously updated measurement of the viewer's actual eye fixation point. The motivation is that eye tracking data provides the most direct and physiologically accurate measure of the observer's visual focus within the scene, enabling the neural network training to concentrate resources with maximum precision on the scene regions the viewer is genuinely observing, consistent with Praveen's system described above, which shows that eye gaze data collected from a head-mounted tracker is parameterized and directly leveraged for content generation optimization.
Regarding Claim 15, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the neural network is a latent model encoder (Kim, Column 3, Lines 36-42, "a neural network encoder 106 <read on latent model encoder> can analyze images 102, which may correspond to a sequence of images as may have been captured using at least one camera of a moving vehicle, extract representative features of those images, and encode those features into a latent representation 108 of a scene represented in images 102").
Regarding Claim 16, the combination of Kim and Praveen teaches the invention of Claim 15.
The combination further teaches transmitting the latent model to a viewing device including a latent model decoder, wherein the latent model decoder is configured to decode the latent model to generate imagery corresponding to novel views of the scene (Kim, Column 10, Lines 23-38, "results of any of these components (e.g., generated meshes or determined network weights) can be transmitted to client device 602 <read on viewing device including a latent model decoder> using an appropriate transmission manager 622 to send by download, streaming, or another such transmission channel"; "a scene generator component 612 on client device 602 may generate 3D representations, such as by using a generative network with network weights <read on latent model> received from content server 620"; Column 4, Line 65 through Column 5, Line 3, "given this latent representation 158 and view information to be used, a neural network decoder 160 <read on latent model decoder> can then generate one or more images 162 (or 2D, 3D, or 4D image or video data) showing this random scene from one or more determined points of view <read on decode latent model to generate imagery corresponding to novel views of the scene>").
Regarding Claim 17, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the training further includes encoding the training data using the neural network to produce an initial latent space model of the scene (Kim, Column 3, Lines 36-42, "a neural network encoder 106 can analyze images 102, which may correspond to a sequence of images as may have been captured using at least one camera of a moving vehicle, extract representative features of those images, and encode those features into a latent representation 108 <read on initial latent space model of the scene> of a scene represented in images 102"), decoding the initial latent space model of the scene using a pre-trained latent model decoder to produce initial generated imagery corresponding to the scene (Kim, Column 4, Lines 27-33, "a decoder 110 can use this camera information along with this latent 3D representation 108 <read on initial latent space model of the scene> to generate a series of images 112 <read on initial generated imagery corresponding to the scene> that are accurate reconstructions of these input images"; Column 7, Lines 24-27, "these models can be used to synthesize output latent representations 410, 412, 414 that can be input to one or more pre-trained decoders <read on pre-trained latent model decoder> to construct a 3D representation and render from any appropriate viewpoint"), comparing the initial generated imagery to the training data to evaluate an encoding loss based upon differences between the initial generated imagery and the training data (Kim, Column 9, Lines 1-3, "this recreated image can be compared against this original scene image to determine a reconstruction loss <read on encoding loss based upon differences between the initial generated imagery and the training data>"; Column 6, Lines 24-25, "a network can be trained using a loss function with multiple loss terms, including a reconstruction loss term <read on encoding loss> for reproduced input images"), and updating weights of the neural network using a parameter of the encoding loss (Kim, Column 9, Lines 3-5, "one or more network parameters can then be adjusted 510 to attempt to minimize this reconstruction loss <read on updating weights of the neural network using a parameter of the encoding loss> as part of a training process").
Claims 10-14 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. (US 12322068 B1, hereinafter Kim) in view of Praveen (WO 2022108652 A1, hereinafter Praveen) as applied to Claims 3, 1, and 12 above, respectively, and further in view of Chen et al. ("Novel View Acoustic Synthesis," CVPR 2023, hereinafter Chen).
Regarding Claim 10, the combination of Kim and Praveen teaches the invention of Claim 1 above.
The combination does not explicitly disclose but Chen teaches utilizing acoustic geolocation to identify sources of sound within the scene (Chen, Figure 1 caption / task statement, "Given audio-visual observations from one viewpoint and the relative target viewpoint pose, render the sound received at the target viewpoint. Note that the target is expressed as the desired pose of the microphones"; "this task is to synthesize the sound in a scene from a new acoustic viewpoint, given only the visual and acoustic input from another source viewpoint in the same scene"; Page 6412, Section 5.2, "The goal of active speaker localization is to predict the bounding box of the active speaker in each frame of the video"), the view spotlight information being based at least in part upon locations of the sources of the sound (Chen, Page 6410, Section 1 (Introduction), "the environment acoustics also affect the sound one hears as a function of the scene geometry, materials, and emitter/receiver locations"; "The same source sounds very differently if it is located in the center of a room, at the corner, or in a corridor"; "the network ... synthesizes the audio at a target location"; Page 6410, Section 3, "the network first takes as input the image observed at the source viewpoint in order to infer global acoustic and geometric properties of the environment along with the bounding box of the active speaker <read on view spotlight information being based at least in part upon locations of the sources of the sound>"; Section 3, Eq. 1, "Assuming there are N sound emitters in the scene (emitter i emits sound Ci from location Li), ... the goal is to synthesize the audio AT at the target viewpoint T, as it would sound from the target location <read on locations of the sources of the sound guiding synthesis toward the target viewpoint>").
Chen is analogous to Kim and Praveen since all three are concerned with neural network-based generation or processing of scene imagery or audio-visual content in which synthesis resources are prioritized or guided based on areas of the scene of particular interest. Kim provided a way of training a neural network encoder using sequences of images and camera parameters to build a compact latent representation of a three-dimensional scene, from which novel views can be synthesized. Chen provided a way of acoustically geolocating active sound sources within a scene and using the spatial location of those sound sources to guide and prioritize neural synthesis toward viewpoints and scene regions that contain those sound emitters. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the acoustic geolocation technique taught by Chen into the neural network training system of Kim such that the view spotlight information used to guide training prioritization is based not only on viewer gaze but also on the acoustically determined locations of active sound sources within the scene. The motivation is to improve scene prioritization and perceptual realism by incorporating multimodal cues (visual and audio) when identifying regions of importance within a synthesized scene.
Regarding Claim 11, the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the view spotlight information includes at least one of: (i) a last known or predicted view direction of a viewer of the scene (Praveen, Paragraph [0032], "the position A (a, b) can be extrapolated based on eye movement direction, eye movement speed, and network latency. The extrapolated position is for predicting where the user would be looking at by the time the image is delivered"; Paragraph [0046], "The center of the region of interest corresponds to the instantaneous eye position plus an offset proportional to the instantaneous eye velocity times the network latency"), eye tracking information associated with the viewer (Praveen, Paragraph [0022], "gaze information <read on view spotlight information> about an observer of a video stream is received. The gaze information can be received from a head-mounted or a display-mounted gaze tracker. The gaze information includes information about instantaneous eye position <read on eye tracking information associated with the viewer>"; [0027], "The gaze tracker of the video display device can collect the gaze information, e.g., eye movement data, including the visual field, the eye movement speed, the focus speed, and others <read on eye tracking information associated with the viewer>"), and a region of 3D space within the scene (Praveen, Paragraph [0002], "Video streaming consumes a large amount of bandwidth, especially in three-dimensional (3D) environments. For example, in virtual reality (VR), augmented reality (AR), and/or mixed reality (MR) systems, a display device (e.g., head-mounted device) will receive video stream data from a server and display the received video stream data to a user in a spatial three-dimensional (3D) environment <read on view spotlight information corresponding to a region of 3D space within the scene>"; [0023], "the video display device can be a head-mounted device in AR/VR/MR systems that includes one or more sensors. The one or more sensors can provide real time tracking of the gaze information, such as human eye movement data").
The combination does not explicitly disclose but Chen teaches identifying other high-interest areas of the scene (Chen, Page 6412, Section 5.2, "Knowing where the emitters of different primary sounds are located in the environment can help to solve the NVAS task. In this paper, we focus on localizing the active speaker, although there can be other important primary sound events like instruments playing, speakers interacting with objects, etc. <read on identifying other high-interest areas of the scene, beyond the primary spotlight region, including areas associated with active sound emitters and interacting subjects>"; Section 5, "The high-level idea is to separate the observed sound into primary and ambient, extract useful visual information (active speaker and acoustic features), and use this information to guide acoustic synthesis for the primary sound"), wherein the prioritizing the training is further based at least in part upon the other high-interest areas (Chen, Page 6410, Section 3, "the network first takes as input the image observed at the source viewpoint in order to infer global acoustic and geometric properties of the environment along with the bounding box of the active speaker"; Page 6412, Section 5.2, "The goal of active speaker localization is to predict the bounding box of the active speaker in each frame of the video").
Chen and Kim are analogous since both of them are concerned with neural network-based generation or processing of scene imagery or audio-visual content in which synthesis or processing resources are prioritized or guided based on areas of the scene of particular interest to a viewer or determined to be perceptually significant. Kim provided a way of training a neural network encoder using sequences of images and camera parameters to build a compact latent representation of a three-dimensional scene, from which novel views can be synthesized. Chen provided a way of identifying other high-interest areas of the scene beyond the viewer's gaze spotlight by acoustically and visually localizing active sound emitters within the scene, and using those identified regions to guide and prioritize neural network synthesis resources. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to augment the training prioritization with the identification of other high-interest areas of the scene as taught by Chen in the modified invention of Kim, such that the neural network training concentrates resources on both the region the viewer is actively watching and additional regions identified as perceptually salient through acoustic and visual analysis.
Regarding Claim 12 (the instance depending on Claim 11), the combination of Kim, Praveen, and Chen teaches the invention of Claim 11.
The combination further teaches wherein the prioritizing the training includes preferentially biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information and the other high-interest areas (Praveen, Paragraph [0042], "identifying a center of a region of interest (ROI) corresponding to a predicted eye position; selecting a first shape for the ROI and selecting a video compression profile that includes higher compression outside the first shape"; [0044], "The further an image area away from the center of the ROI, the larger the compression rate for that image area. In other words, image areas further away from the center of ROI are compressed more significantly, and thus have less fidelity").
As explained in the rejection of Claim 1, the rationale for combining the spotlight focus teaching of Praveen with Kim is provided above.
Chen further teaches wherein the prioritizing the training includes preferentially biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information and the other high-interest areas (Chen, Page 6412, Section 5.2, "The goal of active speaker localization is to predict the bounding box of the active speaker in each frame of the video <read on identifying other high-interest areas>"; Page 6410, Section 3, "the network first takes as input the image observed at the source viewpoint in order to infer global acoustic and geometric properties of the environment along with the bounding box of the active speaker <read on preferentially biasing synthesis based upon both the spotlight information and the other high-interest area corresponding to the speaker's location>").
Chen is analogous to Kim and Praveen since all three are concerned with neural network-based generation or processing of scene imagery or audio-visual content in which synthesis resources are prioritized or guided based on areas of the scene of particular interest. Kim provided a way of training a neural network encoder using sequences of images and camera parameters to build a compact latent representation of a three-dimensional scene, from which novel views can be synthesized. Chen provided a way of guiding the viewpoints and viewing angles utilized during synthesis based upon the view spotlight information and the other high-interest areas. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the preferential biasing of queried viewpoint and viewing angle selection taught by Chen into the modified invention of Kim such that the system further improves neural network training efficiency and rendering fidelity in the scene regions of greatest perceptual relevance.
Regarding Claim 12 (the instance depending on Claim 1), the combination of Kim and Praveen teaches the invention of Claim 1.
The combination further teaches wherein the prioritizing the training includes preferentially biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information and other high-interest areas of the scene (Praveen, Paragraph [0042], "identifying a center of a region of interest (ROI) corresponding to a predicted eye position; selecting a first shape for the ROI and selecting a video compression profile that includes higher compression outside the first shape <read on preferentially biasing selection of the queried 3D viewpoints and viewing angles based upon the view spotlight information and the other high-interest areas>"; [0044], "The further an image area away from the center of the ROI, the larger the compression rate for that image area. In other words, image areas further away from the center of ROI are compressed more significantly, and thus have less fidelity <read on preferential biasing of the queried viewpoints and viewing angles such that those falling within both the spotlight region and other high-interest areas receive prioritized training while those outside are deprioritized>").
As explained in the rejection of Claim 1, the rationale for combining the spotlight focus teaching of Praveen with Kim is provided above.
Chen further teaches wherein the prioritizing the training includes preferentially biasing selection of the queried 3D viewpoints and viewing angles utilized during the training based upon the view spotlight information and the other high-interest areas (Chen, Page 6412, Section 5.2, "The goal of active speaker localization is to predict the bounding box of the active speaker in each frame of the video <read on identifying other high-interest areas>"; Page 6410, Section 3, "the network first takes as input the image observed at the source viewpoint in order to infer global acoustic and geometric properties of the environment along with the bounding box of the active speaker <read on preferentially biasing synthesis based upon both the spotlight information and the other high-interest area corresponding to the speaker's location>").
Chen is analogous to Kim and Praveen since all three are concerned with neural network-based generation or processing of scene imagery or audio-visual content in which synthesis resources are prioritized or guided based on areas of the scene of particular interest. Kim provided a way of training a neural network encoder using sequences of images and camera parameters to build a compact latent representation of a three-dimensional scene, from which novel views can be synthesized. Chen provided a way of guiding the viewpoints and viewing angles utilized during synthesis based upon the view spotlight information and the other high-interest areas. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the preferential biasing of queried viewpoint and viewing angle selection taught by Chen into the modified invention of Kim such that the system further improves neural network training efficiency and rendering fidelity in the scene regions of greatest perceptual relevance.
Regarding Claim 13, the combination of Kim, Praveen, and Chen teaches the invention of Claim 12.
The combination further teaches wherein the other high-interest areas include one or more areas in the scene in which a face or motion is present (Chen, Page 6413, Section 5.2, "the model has to rely on other cues to identify the speaker (such as body motion, gender or identity) <read on areas in the scene in which motion is present>"; "The goal of active speaker localization is to predict the bounding box of the active speaker in each frame of the video <read on areas in the scene in which a face is present, the bounding box enclosing the active speaker's face and body region>"; Page 6410, Section 1 (Introduction), "the network first takes as input the image observed at the source viewpoint in order to infer global acoustic and geometric properties of the environment along with the bounding box of the active speaker").
As explained in the rejection of Claim 12, the rationale for combining Chen's preferential biasing of training based upon the view spotlight information and the other high-interest areas with the modified invention of Kim is provided above.
Regarding Claim 14, the combination of Kim, Praveen, and Chen teaches the invention of Claim 12.
The combination further teaches wherein the other high-interest areas include one or more areas in the scene that are in focus through a virtual camera view of the scene (Kim, Column 4, Lines 10-15, "a voxel-based 3D representation 108 can be generated using a generative model (as may include an encoder 106) that takes in this sequence of images, relevant camera parameters (e.g., depth of focus or field of view), and camera orientation information <read on areas in the scene that are in focus through a virtual camera view of the scene, the depth of focus parameter defining which regions of the 3D scene are rendered in focus through the virtual camera>").
Praveen further teaches preferentially directing processing resources toward the region of interest (Praveen, Paragraph [0035], "Region of interest can be treated as a mask which can algorithmically define the required quality based on the distance from the ROI Center. The further the image area from the ROI Center, the less needed quality <read on preferential allocation of rendering quality to the in-focus region of the virtual camera view, which constitutes the region of highest visual salience and rendering importance>").
As explained in the rejection of Claim 1, the rationale for combining the spotlight focus teaching of Praveen with Kim is provided above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 10460511 B2 Method and system for creating a virtual 3D model
US 11176637 B2 Foveated rendering using eye motion
US 12444126 B1 Neural network-based view synthesis
Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," Communications of the ACM, January 2022.
Deng et al., "FoV-NeRF: Foveated Neural Radiance Fields for Virtual Reality," arXiv, July 22, 2022.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YUJANG TSWEI whose telephone number is (571)272-6669. The examiner can normally be reached 8:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang can be reached on (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/YuJang Tswei/Primary Examiner, Art Unit 2614