Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
The drawings were received on 8/16/2024. These drawings are accepted.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1-14, 16, 18-19, 21-22 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Xu et al (US Publication No.: 20240267695).
Claim 1, Xu et al discloses
At least one processor (paragraph 105) and a memory storing instructions that are operable, when executed by the processor (paragraph 105), to cause the apparatus to:
Receive audio data and image data associated with an audio environment (Paragraph 50 discloses audio recorded and video captured. These are considered the audio data and image data of the audio environment.);
Generate, based at least in part on the audio data and the image data, an image set (Paragraph 53 discloses that the training data is composed of video clips, with the video clips described in paragraphs 50-51. The training data is considered the image set.), the image set comprising a plurality of images each associated with audio samples representing acoustic properties of the audio environment (Paragraph 53 discloses that the training data includes audio-visual scenes, e.g., video clips, where the images are associated with audio samples representing properties of the audio environment. Fig. 5 shows collected video clips.); and
Generate, based at least in part on the image set and the audio samples, an audio rendering model for the audio environment (Fig. 3 shows the generated neural network with AV-NeRF as the audio rendering model for the audio environment. The hypernetwork generates weights or parameters of A-NeRF. Paragraph 40 discloses A-NeRF generates binaural audio for the audio environment of the video or image, such as the 3D structure shown in Fig. 2, label acoustic-aware audio generation.), wherein the audio rendering model comprises a neural rendering volumetric representation of the audio environment augmented with audio encodings (Paragraph 42 discloses “The output volume density may be composed into an environment voxel grid, which represents the 3D structure of the scene.” Paragraph 39 discloses “A-NeRF is to learn a neural acoustic representation that can map 5D coordinates … to corresponding acoustic masks …”, wherein the acoustic masks indicate audio encodings.).
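For clarity regarding the mapping relied upon above, the following is a minimal, illustrative sketch (in Python/PyTorch) of an acoustic-field network whose weights are produced by a hypernetwork, consistent with the cited description of mapping 5D coordinates to acoustic masks (paragraphs 39-40). This code is not taken from Xu et al; the module names, layer sizes, and mask dimensions are assumptions made solely for illustration.

# Illustrative sketch only (not from Xu et al): an acoustic MLP maps a 5D coordinate
# (3D position + 2D direction) to an acoustic mask (magnitude and phase), and a
# hypernetwork produces that MLP's weights from a scene embedding. Sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 64        # assumed hidden width of the acoustic MLP
MASK_BINS = 257    # assumed number of frequency bins in each acoustic mask

class HyperNet(nn.Module):
    """Produces the weight and bias tensors of the acoustic MLP from a scene embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        # Parameter counts for two linear layers: (5 -> HIDDEN) and (HIDDEN -> 2*MASK_BINS).
        self.n_w1, self.n_b1 = 5 * HIDDEN, HIDDEN
        self.n_w2, self.n_b2 = HIDDEN * 2 * MASK_BINS, 2 * MASK_BINS
        total = self.n_w1 + self.n_b1 + self.n_w2 + self.n_b2
        self.net = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, total))

    def forward(self, scene_embedding):
        flat = self.net(scene_embedding)
        w1, b1, w2, b2 = torch.split(flat, [self.n_w1, self.n_b1, self.n_w2, self.n_b2])
        return w1.view(HIDDEN, 5), b1, w2.view(2 * MASK_BINS, HIDDEN), b2

def acoustic_field(coords_5d, params):
    """Map 5D coordinates (x, y, z, azimuth, elevation) to magnitude and phase masks."""
    w1, b1, w2, b2 = params
    h = F.relu(F.linear(coords_5d, w1, b1))
    out = F.linear(h, w2, b2)
    return out[..., :MASK_BINS], out[..., MASK_BINS:]   # magnitude mask, phase mask

if __name__ == "__main__":
    params = HyperNet()(torch.randn(128))                   # scene embedding -> acoustic MLP weights
    mag, phase = acoustic_field(torch.randn(4, 5), params)  # four listener poses
    print(mag.shape, phase.shape)                           # torch.Size([4, 257]) torch.Size([4, 257])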
Claim 2, Xu et al discloses the audio rendering model (Fig. 3) comprises an augmented neural radiance field (NeRF) model that is augmented with the audio encodings (Fig. 3 shows the NeRF model augmented with audio encodings (paragraphs 39, 42)).
Claim 3, Xu et al discloses the audio data and the image data (Paragraph 50) are respectively captured via at least one microphone and at least one camera of a capture device that scans the audio environment (Paragraph 50 discloses video capture via camera and audio capture via microphones.).
Claim 4, Xu et al discloses train weights (paragraphs 50-53) of the audio rendering model (Fig. 3) on volumetric probability densities which represent respective locations within the audio environment (Fig. 3, labels video, direction, position. Paragraphs 34 and 54 disclose the direction and position of the cameras. Paragraph 37 discloses that NeRF learns a mapping from camera poses to colors and densities; NeRF maps a 3D coordinate to a density.), wherein the weights are configured based at least in part on latent encoding of physical information and acoustic information for respective locations of the audio environment (Paragraph 37 discloses that the NeRF learns a mapping from camera poses to colors and densities. Paragraph 54 discloses the NeRF is trained with camera poses that include the direction and position of the camera. The direction and position of the camera are considered the latent encoding of physical information and acoustic information for respective locations of the audio environment.).
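As a further illustration of the NeRF mapping cited for claim 4 (paragraph 37), the following minimal sketch maps a 3D position and a 2D viewing direction to a color and a volume density; the trained weights of such a network latently encode information about each location of the scene. This is illustrative only and not code from Xu et al; the network widths and the omission of positional encoding are simplifying assumptions.

# Illustrative sketch only (not from Xu et al): a tiny NeRF-style MLP mapping
# (position, direction) to (color, volume density). Widths are assumed.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(5, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)   # sigma: volume density at the 3D point
        self.color_head = nn.Linear(hidden, 3)     # RGB color seen from the given direction

    def forward(self, xyz, direction):
        feats = self.backbone(torch.cat([xyz, direction], dim=-1))
        sigma = torch.relu(self.density_head(feats))   # densities are non-negative
        rgb = torch.sigmoid(self.color_head(feats))    # colors constrained to [0, 1]
        return rgb, sigma

model = TinyNeRF()
rgb, sigma = model(torch.randn(8, 3), torch.randn(8, 2))   # 8 sampled points
print(rgb.shape, sigma.shape)   # torch.Size([8, 3]) torch.Size([8, 1])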
Claim 5, Xu et al discloses
Determine, based at least in part on the audio samples, a camera properties set comprising relative audio sample locations and camera orientations associated with the audio samples (Paragraph 54 discloses positions around a sound source, camera poses such as 706, 708, 710, etc., with the sound source and L, R associated with the camera shots at different directions and positions.);
Generate the audio rendering model based at least in part on the image set, the audio samples and the camera properties set (Fig. 3 shows the audio rendering model, generated based on training data per paragraphs 53-54.).
Claim 6, Xu et al discloses the camera properties set comprises a respective location of the image set with respect to the audio environment (Fig. 7 shows an example of the camera properties set comprising the location of the image set (direction, location) with respect to the audio environment.).
Claim 7, Xu et al discloses the camera properties set comprises a respective orientation of the image set with respect to the audio environment (Fig. 7 shows the camera properties set comprising the respective orientation (location, direction) of the image set.).
Claim 8, Xu et al discloses input, to the audio rendering model, a training data vector that comprises impulse responses for the image set augmented with the camera properties set and the audio encodings (Paragraph 39 discloses the A-NeRF is trained or learned with training data including a mapping of 5D coordinates (camera poses per paragraph 37) to corresponding acoustic masks (audio encodings). Fig. 7, labels baseline method and AV-NeRF, shows the impulse responses for the image set (rendered images) augmented with camera properties, such as the viewpoints shown at labels 706, 708, 710, 712, and 714, and the audio encodings, such as the acoustic masks the A-NeRF is trained with.).
Claim 9, Xu et al discloses input, to the audio rendering model, a training data vector that comprises material acoustic properties for the image set augmented with the camera properties set and the audio encodings.
Claim 10, Xu et al discloses determine, based at least in part on the audio rendering model, one or more impulse responses associated with one or more audio sources in the audio environment (Fig. 7, label baseline method, AV-NeRF shows the determined one or more impulse responses associated with audio sources.).
Claim 11, Xu et al discloses infer, based at least in part on the audio rendering model, locations of audio sources within the audio environment (Paragraph 44 discloses rotation angle relative to the sound source is determined and the angle is calculated relative to direction coordinates. This indicates locations of the audio source.).
Claim 12, Xu et al discloses wherein the instructions are further operable to cause the apparatus to: infer, based at least in part on the audio rendering model, acoustic attributes of sound emitted within or from the audio environment (Paragraph 44 discloses angle of the sound source which indicates acoustic attributes of the sound emitted within or from the audio environment.).
Claim 13, Xu et al discloses
output one or more candidate audio component locations associated with the audio environment (Fig. 7, label 702, shows the sound source in the audio environment. Paragraph 39 discloses position with magnitude and phase of the sound (one or more candidate audio component locations).),
wherein the one or more candidate audio component locations are generated based at least in part on the audio rendering model (Paragraph 39 discloses A-NeRF learns a neural acoustic representation that can map 5D coordinates to acoustic masks. The acoustic masks include the magnitude and phase of the sound with regard to the position of the sound.).
Claim 14, Xu et al discloses generate, based at least in part on the audio rendering model, a digital twin of the audio environment (Fig. 1b shows the rendered images are a digital twin of the audio environment such as shown in Fig. 1a. Fig. 3 shows the audio rendering model generating the rendered images.).
Claim 16, Xu et al discloses wherein the instructions are further operable to cause the apparatus to: generate, based at least in part on the audio rendering model, an audio simulation for the audio environment (Fig. 7 shows the rendered images with audio.).
Claim 18, Xu et al discloses wherein the instructions are further operable to cause the apparatus to: infer, based at least in part on the audio rendering model, material attributes of objects or surfaces within the audio environment (Fig. 7, labels 706, 708, etc., shows the rendered images from the respective cameras shown at 701. These figures show material attributes of objects or surfaces such as walls, windows, etc.).
Claim 19, Xu et al discloses wherein the instructions are further operable to cause the apparatus to: generate, based at least in part on the audio rendering model, a set of drawings or images of the audio environment along with optimal locations or audio settings of audio equipment within the audio environment (Fig. 7, label 701, with sound source 702 and locations 706, 708, 710, 712, and 714, shows the optimal locations of the audio equipment or cameras within the audio environment. Paragraph 54 discloses Fig. 7 as rendering results, which indicates such rendering is based at least in part on the audio rendering model of Fig. 3.).
Claim 21, Xu et al discloses
Receiving audio data and image data associated with an audio environment (Paragraph 50 discloses audio recorded and video captured. These are considered the audio data and image data of the audio environment.);
Generating, based at least in part on the audio data and the image data, an image set (Paragraph 53 discloses that the training data is composed of video clips, with the video clips described in paragraphs 50-51. The training data is considered the image set.), the image set comprising a plurality of images each associated with audio samples representing acoustic properties of the audio environment (Paragraph 53 discloses that the training data includes audio-visual scenes, e.g., video clips, where the images are associated with audio samples representing properties of the audio environment. Fig. 5 shows collected video clips.); and
Generating, based at least in part on the image set and the audio samples, an audio rendering model for the audio environment (Fig. 3 shows the generated neural network with AV-NeRF as the audio rendering model for the audio environment. The hypernetwork generates weights or parameters of A-NeRF. Paragraph 40 discloses A-NeRF generates binaural audio for the audio environment of the video or image, such as the 3D structure shown in Fig. 2, label acoustic-aware audio generation.), wherein the audio rendering model comprises a neural rendering volumetric representation of the audio environment augmented with audio encodings (Paragraph 42 discloses “The output volume density may be composed into an environment voxel grid, which represents the 3D structure of the scene.” Paragraph 39 discloses “A-NeRF is to learn a neural acoustic representation that can map 5D coordinates … to corresponding acoustic masks …”, wherein the acoustic masks indicate audio encodings.).
Claim 22, Xu et al discloses
Receive audio data and image data associated with an audio environment (Paragraph 50 discloses audio recorded and video captured. These are considered the audio data and image data of the audio environment.);
Generate, based at least in part on the audio data and the image data, an image set (Paragraph 53 discloses that the training data is composed of video clips, with the video clips described in paragraphs 50-51. The training data is considered the image set.), the image set comprising a plurality of images each associated with audio samples representing acoustic properties of the audio environment (Paragraph 53 discloses that the training data includes audio-visual scenes, e.g., video clips, where the images are associated with audio samples representing properties of the audio environment. Fig. 5 shows collected video clips.); and
Generate, based at least in part on the image set and the audio samples, an audio rendering model for the audio environment (Fig. 3 shows the generated neural network with AV-NeRF as the audio rendering model for the audio environment. The hypernetwork generates weights or parameters of A-NeRF. Paragraph 40 discloses A-NeRF generates binaural audio for the audio environment of the video or image, such as the 3D structure shown in Fig. 2, label acoustic-aware audio generation.), wherein the audio rendering model comprises a neural rendering volumetric representation of the audio environment augmented with audio encodings (Paragraph 42 discloses “The output volume density may be composed into an environment voxel grid, which represents the 3D structure of the scene.” Paragraph 39 discloses “A-NeRF is to learn a neural acoustic representation that can map 5D coordinates … to corresponding acoustic masks …”, wherein the acoustic masks indicate audio encodings.).
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al (US Publication No.: 20240267695) in view of Veselinovic et al (US Publication No.: 20230224636).
Claim 15, Xu et al discloses generate, based at least in part on the audio rendering model, an output of the audio environment (Fig. 1b), but fails to disclose the output is an audio heat map.
Veselinovic et al discloses rendering a heat map of location data obtained for a given environment by an audio system (paragraph 82). It would have been obvious to one skilled in the art before the effective filing date of the application to substitute one well-known element, Xu et al’s rendered output, with another well-known element, the rendering of a heat map of the audio environment as disclosed by Veselinovic et al, so as to obtain the predictable result of a rendered output of the audio environment.
Claim(s) 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al (US Publication No.: 20240267695) in view of Soon-Shiong et al (US Publication No.: 20230388439).
Claim 17, Xu et al discloses rendering engine or audio-visual scene synthesizer with audio rendering model (Fig. 3), but fails to disclose control audio equipment in the audio environment based at least in part on the audio rendering model.
Soon-Shiong et al discloses control audio equipment in the audio environment based at least in part on the audio rendering model (Paragraph 74 discloses “The content display interface module adjusts the captured content, with the help of a rendering engine, placing and performing required steps using the coordinates and orientation of the captured content (e.g. microphones, camera, actors, props, lights,…,etc.) …”. Such indicates controlling audio equipment such as microphones, cameras, etc. based at least in part on an audio rendering engine or model.).
It would have been obvious to one skilled in the art before the effective filing date of the application to modify Xu et al’s rendering engine or audio-visual scene synthesizer by using such a model or engine for controlling or adjusting the audio equipment as disclosed by Soon-Shiong et al, so as to improve the capturing of audio in the environment.
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al (US Publication No.: 20240267695) in view of Marsh (US Publication No.: 20200066042).
Claim 20, Xu et al discloses an audio-visual rendering model (Fig. 3), but fails to disclose all the recited limitations.
Marsh discloses
receive a digital exploration request associated with the audio environment (Fig. 6, label s601. Paragraph 1 discloses “receiving an indication of a desired item from a device of a user. The indication requests a rendering of a restricted virtual object in a space of the user and comprises an item identifier for the restricted virtual object, an identifier for the user and position data for the requested location that the restricted virtual object is to be rendered.”), the digital exploration request comprising an audio environment identifier associated with the audio environment (Paragraph 1 discloses the requested location, which indicates an environment identifier associated with the environment. Paragraph 61 discloses “identifying content data associated with a virtual object, such as audio and video content …”.);
identify the audio rendering model based at least in part on the audio environment identifier (Paragraph 1 discloses “determining and retrieve partially rendered model of the restricted virtual object based on the received indication and the item identifier. Based on the position data, a rendering location in the space of the user is determined, and the partially rendered model is linked to the user and the rendering location before sending the partially rendered model and the rendering location to the device.”); and
generate one or more audio inferences based at least in part on the audio rendering model (Paragraph 61 discloses “In one embodiment, data associated with an item may include rendering data for rendering the item in a space of a user. The rendering data may include data for one or more rendering models, including partially and fully rendered models. … identifying content data associated with a virtual object, such as audio and video content …”. This indicates audio inferences based at least in part on the audio rendering model.).
It would have been obvious to one skilled in the art before the effective filing date of the application to modify Xu et al’s rendering of an environment by incorporating the rendering as disclosed by Marsh, so as to allow the user autonomy over the space rendered and to improve the user’s experience with audio-visual scene rendering.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LINDA WONG/Primary Examiner, Art Unit 2655