Last updated: May 29, 2026

Application No. 18/090,418

MULTIMODAL MACHINE LEARNING FOR GENERATING THREE-DIMENSIONAL AUDIO

Non-Final OA §103

Filed: Dec 28, 2022

Priority: Oct 28, 2022 (EU 22383044.9)

Examiner: MULLINAX, CLINT LEE

Art Unit: 2123

Tech Center: 2100 (Computer Architecture & Software)

Assignee: International Business Machines Corporation

OA Round: 3 (Non-Final)

Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 48% grant rate with a +38.7% interview lift. Because an interview has already been tried, the recommendation is a written response with narrowed claims, informed by precedent claim-evolution patterns.

Based on 126 resolved cases, 2023–2026

Examiner Intelligence

MULLINAX, CLINT LEE

Grants 48% of resolved cases

Career Allowance Rate

60 granted / 126 resolved

-7.4% vs TC avg

Strong +39% interview lift

Interview Lift: +38.7% (chart comparing resolved cases without and with an interview)

Typical timeline

Avg Prosecution: 4y 7m

12 currently pending

Career history

Total Applications: 151 (across all art units)

Statute-Specific Performance

§101: 6.3% (-33.7% vs TC avg)

§103: 85.8% (+45.8% vs TC avg)

§102: 4.8% (-35.2% vs TC avg)

§112: 1.9% (-38.1% vs TC avg)

Black line = Tech Center average estimate • Based on career data from 126 resolved cases
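
As a rough cross-check of the statute figures above, the Tech Center average implied by each entry can be recovered by subtracting the delta from the examiner's rate. This is only a sketch: it assumes each "vs TC avg" value is a simple difference in percentage points, which the page does not state, and the names `stats` and `implied_tc_average` are hypothetical.

```python
# Assumption: delta = examiner rate - Tech Center average, in percentage points.
# The page does not define the deltas, so treat this as a consistency check only.
stats = {
    "§101": (6.3, -33.7),
    "§103": (85.8, 45.8),
    "§102": (4.8, -35.2),
    "§112": (1.9, -38.1),
}


def implied_tc_average(rate, delta):
    """Return the Tech Center average implied by a rate and its delta."""
    return round(rate - delta, 1)


implied = {statute: implied_tc_average(rate, delta)
           for statute, (rate, delta) in stats.items()}

for statute, average in implied.items():
    print(f"{statute}: implied TC average {average}%")
```

Under that assumption all four entries imply the same Tech Center average, which is consistent with the legend's description of the black line as an estimate rather than a per-statute measurement.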

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 03/20/2026 has been entered.

Status of Claims
This action is in reply to the amendments and remarks filed on 03/20/2026.
Claims 1-20 are pending.
Claims 1, 11, and 19 have been amended.  

Response to Arguments
Applicant’s arguments, with respect to the rejection(s) of claim(s) 1, 11, and 19 under 35 U.S.C. 103, have been considered but they are not persuasive. Applicant argues that no reference teaches the amended limitations, since “the audio feature maps of Morgado do not individually identify a sound event in a multimodal content, time period corresponding to the audio feature map, and a physical location of the audio feature map in the multimodal content”; and the “25ms segment of Morgado is not a property identified by the audio feature map”. The examiner respectfully disagrees.
Due to the broadness of the claim language, Morgado has been found to teach the argued limitations, since section 3.2 and Figs. 1 and 4 teach “we apply a (two-dimensional) CNN encoder to the audio spectrogram” to output “feature maps” (audio object) of “high-level features” (audio element) in “25ms segments” (time period) and reduced dimensionality of the audio (spatial position), and further visualizing the audio “as a color overlay over the frame” based on the location of the audio source. Here, the “feature maps” are merged “at each time step t” from the derived “25ms segments of the inputs” and maintained as reading on the claimed language.
Further section 3.2 and Figs. 1 and 4 teach RGB CNN image encoder outputting “feature maps” (image objects) of “high-level features” (image element) at a time (time period) of frames in the video sequence (spatial position), and further visualizing the audio “as a color overlay over the frame” based on the location of the audio source.
See 35 U.S.C. 103 section for full mapping of claim limitations necessitated by applicant amendments.
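
For readers unfamiliar with the mechanics the examiner relies on, the following is a minimal sketch of splitting a timeline into fixed 25 ms segments and merging per-segment audio features with lower-rate video features by nearest-neighbor up-sampling. It is not Morgado's implementation: the function names, the toy feature vectors, and the use of plain lists instead of CNN feature maps are all assumptions made for illustration.

```python
def segment_index(t_ms, segment_ms=25):
    """Index of the fixed-length segment that contains time t_ms (milliseconds)."""
    return t_ms // segment_ms


def nearest_neighbor_upsample(features, target_len):
    """Repeat entries of a shorter feature sequence so it has target_len entries."""
    n = len(features)
    return [features[(i * n) // target_len] for i in range(target_len)]


def merge_per_step(audio_feats, rgb_feats, flow_feats):
    """Concatenate audio, RGB and flow features at each audio time step."""
    steps = len(audio_feats)
    rgb = nearest_neighbor_upsample(rgb_feats, steps)
    flow = nearest_neighbor_upsample(flow_feats, steps)
    # Each operand is a list, so "+" concatenates the three feature vectors.
    return [a + r + f for a, r, f in zip(audio_feats, rgb, flow)]


# Toy data: four audio steps, two RGB steps, one flow step.
audio = [[0.1], [0.2], [0.3], [0.4]]
rgb = [[1.0], [2.0]]
flow = [[9.0]]
merged = merge_per_step(audio, rgb, flow)
print(segment_index(60), merged)
```

In this sketch the merged vector at each step is tied to a time period only through its position in the sequence; nothing in it records where a sound source sits in the scene, which is the distinction the applicant's arguments turn on.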

Applicant’s arguments, with respect to the rejection(s) of claim(s) 1, 11, and 19 under 35 U.S.C. 103, have been considered but they are not persuasive. Applicant argues that no reference teaches the amended limitations, since “Morgado does not teach or suggest that the reduced dimensionality of the audio is a physical location of audio object (a sound source) in the real-world scene of the content”. The examiner respectfully disagrees.
Due to the broadness of the claim language, Morgado has been found to teach the argued limitations, since section 3.2 and Figs. 1 and 4 teach “we apply a (two-dimensional) CNN encoder to the audio spectrogram” to output “feature maps” (audio object) of “high-level features” (audio element) in “25ms segments” (time period) and reduced dimensionality of the audio (spatial position), and further visualizing the audio “as a color overlay over the frame” based on the location of the audio source. Further section 3.2 and Figs. 1 and 4 teach RGB CNN image encoder outputting “feature maps” (image objects) of “high-level features” (image element) at a time (time period) of frames in the video sequence (spatial position), and further visualizing the audio “as a color overlay over the frame” based on the location of the audio source.
See 35 U.S.C. 103 section for full mapping of claim limitations necessitated by applicant amendments.

Applicant’s arguments, with respect to the rejection(s) of claim(s) 1, 11, and 19 under 35 U.S.C. 103, have been considered but they are not persuasive. Applicant argues that no reference teaches the amended limitations, since “Morgado_2020 discloses that the angles (θk; ϕk) are properties of the audio/video clips, not of those output image/audio representations…as an attribute identifying the physical location of a specific sound or image element within the scene”. The examiner respectfully disagrees.
Due to the broadness of the claim language, Morgado_2020 has been found to teach the argued limitations, since sections 3.2-3.3 and Figs. 4 and 6-7 teach a transformer network processing features from audio and video clips to output image/audio representations (objects) of a specific clip time, image/audio feature (element), image feature viewing angle (spatial positioning) and audio feature listening angle (spatial positioning); and further “while an audio clip aki sampled at position (θk; ϕk) contains audio from all sound sources present in a scene, only those physically located around (θk; ϕk) can be seen on the video clip vki. This implies that, to enable accurate feature translation, networks gv2a and ga2v should combine features from all sampled location” while mapping audio sources in the visual frames with “encoder outputs (vki and aki)” with k as the corresponding viewing angle; thus, the outputs carry identifiers of the location as argued.
See 35 U.S.C. 103 section for full mapping of claim limitations necessitated by applicant amendments.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1 and 3-20 are rejected under 35 U.S.C. 103 as being unpatentable over Morgado et al ("Self-Supervised Generation of Spatial Audio for 360° Video", 2018) hereinafter Morgado, in view of Morgado et al (“Learning Representations from Audio-Visual Spatial Alignment”, 2020) hereinafter Morgado_2020.
Regarding claims 1, 11, and 19, Morgado teaches a method; apparatus comprising: a memory; and at least one processor, coupled to the memory, and operative to perform operations comprising; and computer readable storage medium comprising computer executable instructions which when executed by a computer cause the computer to perform the method of (Examiner note: Applicant’s specification paragraph 0060 states “computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se”, thus the CRM is interpreted by the examiner as non-transitory.
Morgado, section 4 teaches using a GPU for performing the embodiments of the disclosure, well known to be included on a computer system and communicatively connected to one or more memories): 
accessing, by a computing device, a multimodal content item (section 4 teaches using a GPU for performing the embodiments of the disclosure, and sections 3-3.2 teach obtaining “video input” and “audio” input); and 
automatically generating, by the computing device, three-dimensional sound using one or more machine learning models for the multimodal content item (section 4 teaches using a GPU for performing the embodiments of the disclosure, and sections 3-3.2 teach creating “ambisonic” spatial audio from the “video input” and “mono audio” input), wherein the automatic generating of the three-dimensional sound comprises:
processing, using a first neural network of the one or more machine learning models, an audio of the multimodal content item to generate one or more audio objects, wherein each audio object, of the one or more audio objects, identifies an audio element corresponding to a sound event in the multimodal content item, a time period corresponding to the audio object, and a spatial position representing a physical location of the audio object in the multimodal content item (section 3.2 and Figs. 1 and 4 teach “we apply a (two-dimensional) CNN encoder to the audio spectrogram” to output “feature maps” (audio object) of “high-level features” (audio element) in “25ms segments” (time period) and reduced dimensionality of the audio (spatial position), and further visualizing the audio “as a color overlay over the frame” based on the location of the audio source); 
processing, using a second neural network of the one or more machine learning models, one or more images of the multimodal content item to generate a plurality of image objects, wherein each image object, of the plurality of image objects, identifies an image element within one or more image frames of the multimodal content item, a time period corresponding to the respective image object, and a spatial position representing a physical location of the image element of the image object in the multimodal content item (section 3.2 and Figs. 1 and 4 teach RGB CNN image encoder outputting “feature maps” (image objects) of “high-level features” (image element) at a time (time period) of frames in the video sequence (spatial position), and further visualizing the audio “as a color overlay over the frame” based on the location of the audio source).

Morgado at least implies wherein each audio object, of the one or more audio objects, identifies an audio element corresponding to a sound event in the multimodal content item, a time period corresponding to the audio object, and a spatial position representing a physical location of the audio object in the multimodal content item, and wherein each image object, of the plurality of image objects, identifies an image element within one or more image frames of the multimodal content item, a time period corresponding to the respective image object, and a spatial position representing a physical location of the image element of the image object in the multimodal content item (see mappings above); however, Morgado_2020 teaches wherein each audio object, of the one or more audio objects, identifies an audio element corresponding to a sound event in the multimodal content item, a time period corresponding to the audio object, and a spatial position representing a physical location of the audio object in the multimodal content item, and wherein each image object, of the plurality of image objects, identifies an image element within one or more image frames of the multimodal content item, a time period corresponding to the respective image object, and a spatial position representing a physical location of the image element of the image object in the multimodal content item (sections 3.2-3.3 and Figs. 4 and 6-7 teach a transformer network processing features from audio and video clips to output image/audio representations (objects) of a specific clip time, image/audio feature (element), image feature viewing angle (spatial positioning) and audio feature listening angle (spatial positioning); and further “while an audio clip aki sampled at position (θk; ϕk) contains audio from all sound sources present in a scene, only those physically located around (θk; ϕk) can be seen on the video clip vki. This implies that, to enable accurate feature translation, networks gv2a and ga2v should combine features from all sampled location” while mapping audio sources in the frames with “encoder outputs (vki and aki)” with k as the corresponding viewing angle).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to implement Morgado_2020’s teachings of transformer network processing audio and image features with viewing/listening angles into Morgado’s teaching of combined neural networks processing 360° video data for producing ambisonics audio for the video in order to “yield better representations” for producing more accurate audio for a 360° video scene (Morgado_2020, section 6).

Regarding claims 3, 12, and 20, Morgado teaches all the claim limitations of claims 1, 11, and 19 above; and further teaches wherein the generating of the three-dimensional sound comprises: 
tracking, using a third neural network of the one or more machine learning models, an evolution of each audio object to generate an audio element track (Morgado, section 3.2 and Fig. 1 teach audio separation decoder network processing the “multi-modal features” to produce a “separated audio track” over each corresponding time);
tracking, using a fourth neural network of the one or more machine learning models, an evolution of each image object to generate an image element track (Morgado, section 3.2 and Fig. 1 teach motion flow CNN image encoder for outputting features at a time (time period) across image frames (tracking…an evolution of each image object to generate an image element track)); 
linking, using a fifth neural network of the one or more machine learning models, at least one of the audio element tracks with at least one selected from the group consisting of: another of the audio element tracks and at least one of the image element tracks to generate a summary stream (Morgado, section 3.2 teaches “joint audio-visual representation (summary stream) is then obtained by merging (linking) the three feature maps (audio, RGB and flow) produced at each time t” via synchronizing audio and video with nearest neighbor up-sampling); and 
processing, using a sixth neural network of the one or more machine learning models, the summary stream to generate an audio output, the audio output comprising the three-dimensional sound (Morgado, section 3.2 and Fig. 1 teach audio localization fully-connected layer network that processes the “feature vectors” and “generates, at each time t (tracking), the localization weights…associated with each of the k sources”; and further generating ambisonic spatial audio by utilizing the weights to compute “first-order ambisonic channels”).
Morgado at least implies wherein each audio object identifies an audio element, a time period corresponding to the audio object, and a spatial position of the audio object, and wherein each image object identifies an image element, a time period corresponding to the image object, and a spatial position of the image object (see mappings above); however, Morgado_2020 teaches wherein each audio object identifies an audio element, a time period corresponding to the audio object, and a spatial position of the audio object, and wherein each image object identifies an image element, a time period corresponding to the image object, and a spatial position of the image object (sections 3.2-3.3 teach a transformer network processing features from audio and video clips to output image/audio representations (objects) of a specific clip time, image/audio feature (element), image feature viewing angle (spatial positioning) and audio feature listening angle (spatial positioning)).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to implement Morgado_2020’s teachings of transformer network processing audio and image features with viewing/listening angles into Morgado’s teaching of combined neural networks processing 360° video data for producing ambisonics audio for the video in order to “yield better representations” for producing more accurate audio for a 360° video scene (Morgado_2020, section 6).
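
The step mapped to the sixth neural network, in which per-source weights are used to compute first-order ambisonic channels, can be illustrated with a minimal sketch. It is not Morgado's method: in the cited reference the localization weights are predicted by a network, whereas here they are computed from known source directions, and the ACN channel order (W, Y, Z, X) with SN3D normalization is an assumed convention. The names `direction_weights` and `mix_channels` are hypothetical.

```python
import math

CHANNELS = ("W", "Y", "Z", "X")  # assumed ACN ordering for first order


def direction_weights(azimuth, elevation):
    """First-order weights (assumed SN3D) for a source direction in radians."""
    return {
        "W": 1.0,
        "Y": math.sin(azimuth) * math.cos(elevation),
        "Z": math.sin(elevation),
        "X": math.cos(azimuth) * math.cos(elevation),
    }


def mix_channels(sources, weights):
    """Each channel is the weighted sum of the k separated source signals."""
    n = len(sources[0])
    return {
        ch: [sum(w[ch] * s[t] for s, w in zip(sources, weights)) for t in range(n)]
        for ch in CHANNELS
    }


# Two toy sources: one straight ahead, one directly to the left.
sources = [[1.0, 0.0], [0.0, 2.0]]
weights = [direction_weights(0.0, 0.0), direction_weights(math.pi / 2, 0.0)]
ambisonics = mix_channels(sources, weights)
print(ambisonics)
```

The point of the sketch is that each weight vector encodes a direction, so a per-source direction is recoverable from the weights themselves rather than from the encoder feature maps.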

Regarding claims 4 and 13, the combination of Morgado and Morgado_2020 teach all the claim limitations of claims 3 and 12 above; and further teach wherein at least one or more of the audio objects comprises information for reconstructing the audio object (Morgado, section 3.2 and Fig. 1 teach generating “Audio features” of extracted information used for creating ambisonic audio data outputs).

Regarding claims 5 and 14, the combination of Morgado and Morgado_2020 teach all the claim limitations of claims 3 and 12 above; and further teach wherein at least one of the plurality of image objects comprises information for reconstructing the image object (Morgado, sections 3.2 and 4 and Fig. 1 teach generating “Video features” of extracted information used for creating matched video frames for ambisonic audio data outputs).

Regarding claims 6 and 15, the combination of Morgado and Morgado_2020 teach all the claim limitations of claims 1 and 11 above; and further teach further comprising training the first neural network using soundtracks of existing multimodal content items and their corresponding audio labels and training the second neural network using image sequences of the existing multimodal content items and their corresponding image labels (Morgado, sections 1 and 3.4 teach “we introduce two 360° video datasets (existing multimodal content items) with spatial audio (soundtracks), one recorded by ourselves in a constrained domain, and a large-scale dataset collected in-the-wild from YouTube. During training, the captured spatial audio serves as ground truth (labels)” and “We further validate each component of the proposed architecture” in this training method; and further the supervised video training data including sources in the image (labels). Sections 3-3.2, 3.4 and Fig. 1 teach model components that are trained including “a (two-dimensional) CNN encoder” and an RGB CNN image encoder trained and “fine-tuned on our task”.).

Regarding claims 7 and 16, the combination of Morgado and Morgado_2020 teach all the claim limitations of claims 3 and 12 above; and further teach further comprising training the third neural network using training audio objects and their corresponding audio labels and training the fourth neural network using training image objects and their corresponding image labels (Morgado, sections 1, 3.4, and 4 teach “we introduce two 360° video datasets (existing multimodal content items) with spatial audio, one recorded by ourselves in a constrained domain, and a large-scale dataset collected in-the-wild from YouTube. During training, the captured spatial audio serves as ground truth (labels)” and “We further validate each component of the proposed architecture” in this training method; and further the supervised video training data including sources in the image (labels). These are taught to be represented as matrices for training. Sections 3-3.2, 3.4 and Fig. 1 teach model components that are trained including audio separation decoder network and a motion flow CNN image encoder trained and “fine-tuned on our task”.).

Regarding claims 8 and 17, the combination of Morgado and Morgado_2020 teach all the claim limitations of claims 3 and 12 above; and further teach further comprising training the fifth neural network using training audio element tracks and training image element tracks (Morgado, sections 1, 3.4, and 4 teach “we introduce two 360° video datasets with spatial audio, one recorded by ourselves in a constrained domain, and a large-scale dataset collected in-the-wild from YouTube. During training, the captured spatial audio serves as ground truth (labels)” and “We further validate each component of the proposed architecture” in this training method; and further the supervised video training data including sources in the image (labels). Sections 3-3.2, 3.4 and Fig. 1 teach model components that are trained including synchronizing audio and video with nearest neighbor up-sampling algorithm.).

Regarding claims 9 and 18, the combination of Morgado and Morgado_2020 teach all the claim limitations of claims 3 and 12 above; and further teach further comprising training the sixth neural network using a training summary stream generated from training data (Morgado, sections 1, 3.4, and 4 teach “we introduce two 360° video datasets with spatial audio, one recorded by ourselves in a constrained domain, and a large-scale dataset collected in-the-wild from YouTube. During training, the captured spatial audio serves as ground truth (labels)” and “We further validate each component of the proposed architecture” in this training method; and further the supervised video training data including sources in the image (labels). These are taught to be represented as matrices. Sections 3-3.2, 3.4 and Fig. 1 teach model components that are trained including an audio localization fully-connected layer network.).

Regarding claim 10, the combination of Morgado and Morgado_2020 teach all the claim limitations of claim 3 above; and further teach further comprising integrating the three-dimensional sound with a media content item (Morgado, sections 3.2 and 4 and Fig. 1 teach generating ambisonic audio data and matching video frames, since “To depict spatial audio, we overlay the directional energy map…of the predicted ambisonics (Eq. 5) over the video frame at time t.”).

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Morgado et al ("Self-Supervised Generation of Spatial Audio for 360° Video", 2018) hereinafter Morgado, in view of Morgado et al (“Learning Representations from Audio-Visual Spatial Alignment”, 2020) hereinafter Morgado_2020, in view of Son et al (US Pub 20220386055) hereinafter Son.
Regarding claim 2, the combination of Morgado and Morgado_2020 teach all the claim limitations of claim 1 above; and further teach wherein the multimodal content item comprises at least one selected from the group consisting of a video (Morgado, sections 3-3.2 teach of obtaining “video input” and “audio” input).
However, the combination does not explicitly teach a film, and a video game.
Son teaches a film, and a video game (paragraph 0495 teaches using a second NN for audio signal processing for “games or movies”; paragraph 0186 teaches the audio signal being for 3D audio).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify Morgado’s teaching of combined neural networks processing 360° video data for producing ambisonics audio for the video, as modified by Morgado_2020’s teachings of transformer network processing audio and image features with viewing/listening angles, to include NN audio processing for movies and games as taught by Son in order to more accurately route the processed audio to the appropriate channels for 3D audio (Son, paragraph 0495-0500).

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Sheaffer et al (US Patent 11997463) teach training neural networks for generating 3D spatial audio in XR environments.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CLINT MULLINAX whose telephone number is 571-272-3241.  The examiner can normally be reached on Mon - Fri 8:00-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached on 571-270-3428.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/C.M./ Examiner, Art Unit 2123
/ALEXEY SHMATOV/ Supervisory Patent Examiner, Art Unit 2123

Prosecution Timeline

Jan 21, 2026: Final Rejection mailed (§103)

Feb 22, 2026: Interview Requested

Mar 03, 2026: Examiner Interview Summary

Mar 03, 2026: Applicant Interview (Telephonic)

Mar 20, 2026: Response after Non-Final Action

Apr 17, 2026: Request for Continued Examination

Apr 25, 2026: Response after Non-Final Action

May 07, 2026: Non-Final Rejection mailed (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

16/005,750 (Patent 12619424): ROBOTIC SCRIPT GENERATION BASED ON PROCESS VARIATION DETECTION. 7y 10m to grant; granted May 05, 2026.

18/380,620 (Patent 12613706): HARDWARE ACCELERATED MACHINE LEARNING. 2y 6m to grant; granted Apr 28, 2026.

17/089,974 (Patent 12608639): SYSTEM AND METHOD FOR PREDICTIVE VOLUMETRIC AND STRUCTURAL EVALUATION OF STORAGE TANKS. 5y 5m to grant; granted Apr 21, 2026.

18/375,973 (Patent 12561620): Machine Learning-Based URL Categorization System With Noise Elimination. 2y 4m to grant; granted Feb 24, 2026.

16/726,709 (Patent 12554962): CONFIGURABLE PROCESSOR ELEMENT ARRAYS FOR IMPLEMENTING CONVOLUTIONAL NEURAL NETWORKS. 6y 1m to grant; granted Feb 17, 2026.

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4

Grant Probability: 48%

With Interview (+38.7%): 86%

Median Time to Grant: 4y 7m (~1y 2m remaining)

PTA Risk: High

Based on 126 resolved cases by this examiner. Grant probability derived from career allowance rate.
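
The projection figures can be reproduced approximately from numbers stated elsewhere on this page. The sketch below is a reconstruction under assumptions, not the site's actual method: it treats the interview lift as additive percentage points, starts the time-to-grant clock at the Dec 28, 2022 filing date, and measures remaining time from the May 29, 2026 "last updated" date. The helper `add_months` is hypothetical and only handles days that exist in every month.

```python
from datetime import date

# Career allowance rate: 60 granted out of 126 resolved cases.
granted, resolved = 60, 126
allowance = 100 * granted / resolved       # about 47.6
with_interview = allowance + 38.7          # assumed additive lift, about 86.3


def add_months(start, months):
    """Shift a date by whole months (assumes start.day <= 28)."""
    total = start.year * 12 + (start.month - 1) + months
    return date(total // 12, total % 12 + 1, start.day)


filed = date(2022, 12, 28)
as_of = date(2026, 5, 29)
projected_grant = add_months(filed, 4 * 12 + 7)    # 4y 7m after filing
remaining_months = round((projected_grant - as_of).days / 30.44)

print(f"allowance {allowance:.1f}%, with interview {with_interview:.1f}%")
print(f"projected grant {projected_grant}, about {remaining_months} months remaining")
```

Under these assumptions the results round to 48%, 86%, and roughly 14 months (1y 2m), matching the displayed values. That suggests, but does not confirm, that the page derives them this way.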