DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-7 are rejected under 35 U.S.C. 101 because the claims are drawn to functional descriptive material not claimed as residing on a computer-readable medium. Claims 1-7, while reciting a system, do not include any structural limitations: the recited system comprises components that can be embodied entirely in software, with no recitation of any hardware component (see Applicant's specification at paragraph 50 and claim 15). Therefore, claims 1-7 are non-statutory. See MPEP 2106.03(I) ("Products that do not have a physical or tangible form, such as information (often referred to as 'data per se') or a computer program per se (often referred to as 'software per se') when claimed as a product without any structural recitations, are not directed to any of the statutory categories.").
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Garg et al., "Geometry-aware multi-task learning for binaural audio generation from video," arXiv:2111.10882v1 (hereafter "Garg").
Referring to claims 1, 8 and 15, Garg discloses a localization system, comprising:
an image input that receives images from a video source (page 4, The network takes the visual frames and monaural audio as input);
an audio input that receives, from the video source, audio synchronized with the images (page 4, The network takes the visual frames and monaural audio as input); and
an audio feature disentanglement network that correlates distinct audio elements from the audio input with corresponding visual features from the image input (page 5, In this way, the visual features are forced to reason about the relative positions of the sound sources and learn to find the cues in the visual frames which dictate the direction of sound heard).
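As an aid to understanding the cited mapping, a minimal sketch of this kind of audio-visual correlation follows, written in PyTorch-style Python. The function name, tensor shapes, and choice of cosine similarity are illustrative assumptions by the examiner; they are not drawn from Garg's code or from the claims.

import torch
import torch.nn.functional as F

def correlate_audio_visual(audio_feat, visual_feat):
    # audio_feat: (B, C) embedding of the input audio clip (hypothetical shape)
    # visual_feat: (B, C, H, W) spatial feature map from a visual backbone
    a = F.normalize(audio_feat, dim=1)
    v = F.normalize(visual_feat, dim=1)
    # Cosine similarity at every spatial location; the response is high where
    # the visual content agrees with the sound, which is one way to correlate
    # distinct audio elements with corresponding visual features.
    corr = torch.einsum('bc,bchw->bhw', a, v)
    return corr  # (B, H, W) audio-visual correlation map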
Referring to claims 2, 9 and 16, Garg discloses wherein the images received from the video source comprise first-person videos (pages 1-2, Videos or other media with binaural audio imitate that rich audio experience for a user, making the media feel more real and immersive. This immersion is important for virtual reality and augmented reality applications, where the user should feel transported to another place and perceive it as such).
Referring to claims 3, 10 and 17, Garg discloses a geometry-based feature aggregation module that estimates a geometric transformation between two or more images from the video source and aggregates visual features based on that geometric transformation (page 6, Since the videos are continuous samples over time rather than individual frames, our fourth and final loss regularizes the visual features by requiring them to have spatio-temporal geometric consistency).
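For illustration only, one way to aggregate visual features under an estimated geometric transformation is sketched below. The use of an affine transform is an assumption standing in for whatever transformation the reference estimates; the function and variable names are hypothetical.

import torch
import torch.nn.functional as F

def aggregate_features(feat_t, feat_t1, theta):
    # feat_t, feat_t1: (B, C, H, W) visual features from two nearby frames
    # theta: (B, 2, 3) estimated affine transform mapping frame t+1 into
    #        frame t's coordinates (e.g., from a pose or homography estimator)
    grid = F.affine_grid(theta, list(feat_t1.shape), align_corners=False)
    warped_t1 = F.grid_sample(feat_t1, grid, align_corners=False)
    # Spatio-temporal geometric consistency: features for the same scene
    # point should agree, so aggregate by averaging the aligned maps.
    return 0.5 * (feat_t + warped_t1)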
Referring to claims 4, 11 and 18, Garg discloses a sounding object estimation engine that correlates the distinct audio elements with object locations of the visual features from the image input (page 10, Figure 6 shows the qualitative visualization of the activation maps for the visual network that provides the object/region producing the sound and its location).
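The activation-map visualization cited above is consistent with a class-activation-map (CAM) style readout. A hypothetical sketch, building on the correlation map from the earlier sketch, follows; it is not asserted to be Garg's actual visualization code.

import torch
import torch.nn.functional as F

def sounding_object_map(corr, image_hw):
    # corr: (B, H, W) audio-visual correlation map (see earlier sketch)
    # image_hw: (height, width) of the original video frame
    cam = F.relu(corr).unsqueeze(1)  # (B, 1, H, W)
    cam = F.interpolate(cam, size=image_hw, mode='bilinear',
                        align_corners=False)
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)  # scale to [0, 1]
    return cam.squeeze(1)  # per-pixel map of where the sounding object appears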
Referring to claims 5, 12 and 19, Garg discloses wherein the visual features are determined based on the geometric transformation (page 5, In particular, we incorporate a classifier to identify whether the visual input is aligned with the audio. The classifier G combines the binaural audio A_LR = [A_L^t, A_R^t] and the visual features v_f^t to classify if the audio and visuals agree. In this way, the visual features are forced to reason about the relative positions of the sound sources and learn to find the cues in the visual frames which dictate the direction of sound heard).
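The cited passage describes a classifier G that fuses binaural-audio and visual features. A minimal sketch of such an alignment classifier is given below; the embedding dimensions and layer sizes are the examiner's illustrative assumptions, not the reference's architecture.

import torch
import torch.nn as nn

class AlignmentClassifier(nn.Module):
    # Fuses a binaural-audio embedding with a visual embedding and predicts
    # whether the two streams are spatially aligned (e.g., 1 = aligned,
    # 0 = left/right channels swapped).
    def __init__(self, audio_dim=512, visual_dim=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # single logit: "audio matches visuals"
        )

    def forward(self, audio_emb, visual_emb):
        return self.fc(torch.cat([audio_emb, visual_emb], dim=1))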
Referring to claims 6, 13 and 20, Garg discloses wherein the audio feature disentanglement network comprises at least one convolution layer (page 17, The classifier combines the audio and visual features and uses a fully connected layer for prediction).
Referring to claims 7 and 14, Garg discloses an augmented reality module that plays the distinct audio elements from the audio input in conjunction with displaying the corresponding visual features in an augmented reality environment (page 5, Using the publicly available SoundSpaces2 audio simulations together with the Habitat simulator, we create realistic videos with binaural sounds for publicly available 3D environments in Matterport3D. To construct the dataset, we insert diverse 3D models from poly.google.com of various instruments like guitar, violin, flute etc. and other sound-making objects like phones and clocks into the scene. To generate realistic binaural sound in the environment as if it is coming from the source location and heard at the camera position, we convolve the appropriate SoundSpaces room impulse response with an anechoic audio waveform (e.g., a guitar playing for an inserted guitar 3D object)).
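The rendering step quoted above, convolving a room impulse response with an anechoic waveform, can be sketched as follows. This is a minimal illustration of impulse-response convolution in general; the function name and array layout are hypothetical and are not drawn from SoundSpaces or Garg.

import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono, rir_left, rir_right):
    # mono: 1-D anechoic source waveform (e.g., a guitar recording)
    # rir_left, rir_right: per-ear room impulse responses for the given
    # source location and listener (camera) position
    left = fftconvolve(mono, rir_left)[:len(mono)]
    right = fftconvolve(mono, rir_right)[:len(mono)]
    return np.stack([left, right])  # (2, num_samples) binaural waveform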
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lindahl, US Patent 11,736,862.
Senocak et al., "Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications," arXiv:1911.09649v1.
Owens et al., "Audio-Visual Scene Analysis with Self-Supervised Multisensory Features," arXiv:1804.03641v2.
Hu et al., "Class-Aware Sounding Objects Localization via Audiovisual Correspondence," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 12, December 2022.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PETER K HUNTSINGER whose telephone number is (571)272-7435. The examiner can normally be reached Monday - Friday 8:30 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Benny Q Tieu, can be reached at 571-272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PETER K HUNTSINGER/ Primary Examiner, Art Unit 2682