DETAILED ACTION
This communication is in response to the Application filed on 05/08/2024. Claims 1-15 are pending and have been examined. Claims 1, 8, 10, 11, 14 and 15 are independent. This Application was published as U.S. Pub. No. 2025/0006208.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/25/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Priority
This application is a 371 national stage of PCT/US2022/04876, filed on 11/03/2022. Applicant's claims for the benefit of provisional applications 63/277,217 and 63/374,702, filed on 11/09/2021 and 09/06/2022 respectively, are acknowledged.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-15 are rejected under 35 U.S.C. 103 as being unpatentable over Arteaga et al. (US Pub No. 2024/0022868, hereinafter, Arteaga) in view of Lu et al. ("SpecTNT: A time-frequency transformer for music audio," arXiv preprint arXiv:2110.09127 (2021), hereinafter, Lu).
Regarding Claim 1,
Arteaga discloses a method, comprising:
receiving, by a control system, first audio data of a first audio data format including one or more first audio signals and associated first spatial data, wherein the first spatial data indicates intended perceived spatial positions for the one or more first audio signals (Arteaga, Fig.1, par [032], "…In step 101, a first input multichannel audio signal is received...the first input multichannel audio signal may comprise a 2.0, 3.1, 5.1 or 7.1 multichannel audio signal..."; Fig.1, par [032], "…the multichannel audio signal may correspond to a predefined speaker layout, such as a 2.0, 3.1, 5.1, or 7.1 speaker layout...");
determining, by the control system, at least a first feature type from the first audio data, wherein the first feature type corresponds to a frequency domain representation of the audio data (Arteaga, Fig.4, par [041], Reference Render (Spectrogram) 402-405, i.e., spectrogram is the time-frequency domain representation);
receiving, by the control system, second audio data of a second audio data format including one or more second audio signals and associated second spatial data, the second audio data format being different from the first audio data format (Arteaga, Fig.1, par [032], "…In step 101, a first input multichannel audio signal is received...the first input multichannel audio signal may comprise a 2.0, 3.1, 5.1 or 7.1 multichannel audio signal..."; par [060], "…"first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects..."), wherein the second spatial data indicates intended perceived spatial positions for the one or more second audio signals (Fig.1, par [032], "…the multichannel audio signal may correspond to a predefined speaker layout, such as a 2.0, 3.1, 5.1, or 7.1 speaker layout...");
determining, by the control system, at least the first feature type from the second audio data (Arteaga, Fig.4, par [041], Reference Render (Spectrogram) 402-405, i.e., spectrogram is the time-frequency domain representation);
training a neural network implemented by the control system to transform input audio data from one audio data format to a different audio data format, the training being based, at least in part, on the first encoded audio data and the second encoded audio data (Arteaga, Fig.3, Steps 301, 302, 303, and 307, par [0], "…In step 301, a reference intermediate audio signal is received. In step 302, the received reference intermediate audio signal is rendered into a first input multichannel audio signal…In step 303, a machine learning algorithm is used to generate an intermediate audio signal based on the first input multichannel audio signal…In step 307, the intermediate audio signal is rendered into a second output multichannel audio signal...").
Arteaga discloses the object extraction network (Fig.4; i.e., encoder network), which extracts mono objects 415 and position metadata 416 (Arteaga, Fig.4, par [015], "…The intermediate audio signal may include one or more audio objects, wherein each of the audio objects includes an audio track and position metadata. The position metadata may include a position (location) of the respective audio object, such as in the form of a coordinate vector..."), but does not explicitly disclose "applying, by the control system, a positional encoding process to the first audio data, to produce first encoded audio data, the first encoded audio data including representations of at least the first spatial data and the first feature type in first embedding vectors of an embedding dimension."
However, Lu, in an analogous field of endeavor, discloses applying, by the control system, a positional encoding process to the first (second) audio data, to produce first (second) encoded audio data, the first (second) encoded audio data including representations of at least the first (second) spatial data and the first (second) feature type in first (second) embedding vectors of an embedding dimension (Lu, Figs.1-3, 3.3 positional encoding, SpecTNT architecture, "…we adopt a learnable positional embedding to encode the sequence order of frequency bins...").
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified the audio object and position metadata methods and apparatus of Arteaga with the positional encoding and attention mechanism of the Transformer-based music information retrieval system of Lu, with a reasonable expectation of success, to learn the representation of the audio and to adopt a learnable positional embedding to encode the sequence order of frequency bins (Lu, Introduction, Method).
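For context only, the learnable positional embedding that Lu describes (encoding the sequence order of frequency bins) can be illustrated with a minimal sketch. All shapes and values below are hypothetical and are not code from either reference:

```python
# Illustrative sketch of a learnable positional embedding over
# frequency bins, as described in Lu. Dimensions are assumed.
import numpy as np

rng = np.random.default_rng(0)

n_freq_bins = 128   # frequency bins per spectrogram frame (assumed)
embed_dim = 64      # embedding dimension (assumed)

# One spectrogram frame, already projected to per-bin embeddings.
frame_embeddings = rng.standard_normal((n_freq_bins, embed_dim))

# Learnable positional embedding: one vector per frequency bin,
# initialized randomly and updated during training.
pos_embedding = rng.standard_normal((n_freq_bins, embed_dim)) * 0.02

# Encoding the sequence order of the bins is an elementwise addition.
encoded = frame_embeddings + pos_embedding
assert encoded.shape == (n_freq_bins, embed_dim)
```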
Regarding Claim 2,
Arteaga in view of Lu discloses the method of claim 1, wherein the method comprises:
Lu further discloses receiving 1st through Nth audio data of 1st through Nth input audio data formats including 1st through Nth audio signals and associated 1st through Nth spatial data, N being an integer greater than 2 (Lu, Fig.1, 3. Method, "…The input time-frequency representation is first processed with a stack of convolutional layers for local feature aggregation...");
Arteaga further discloses determining, by the control system, at least the first feature type from the 1st through Nth input audio data formats (Arteaga, Fig.4, par [041], Reference Render (Spectrogram) 402-405, i.e., spectrogram is the time-frequency domain representation);
Lu further discloses applying, by the control system, the positional encoding process to the 1st through Nth audio data, to produce 1st through Nth encoded audio data (Lu, Figs.1-3, 3.3 positional encoding, SpecTNT architecture, "…we adopt a learnable positional embedding to encode the sequence order of frequency bins..."); and
Arteaga further discloses training the neural network based, at least in part, on the 1st through Nth encoded audio data (Arteaga, Fig.3, par [038], "…FIG. 3 illustrates a flow diagram of a further example of a method for training a machine learning algorithm...").
Regarding Claim 3,
Arteaga in view of Lu discloses the method of claim 1, wherein the neural network is, or includes, an attention-based neural network (Lu, Abstract, "…we introduce a novel variant of the Transformer-in-Transformer (TNT) architecture. In each SpecTNT block, a spectral Transformer extracts frequency-related features..."; 3.4.3 Transformer Encoder, "…A Transformer encoder is composed of three components: multi-head self-attention (MHSA), feed-forward network (FFN), and layer normalization (LN)...").
Regarding Claim 4,
Arteaga in view of Lu discloses the method of claim 1, wherein the neural network includes a multi-head attention module (Lu, 3.4.3 Transformer Encoder, "…A Transformer encoder is composed of three components: multi-head self-attention (MHSA), feed-forward network (FFN), and layer normalization (LN)...").
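For context only, the three components Lu enumerates (MHSA, FFN, LN) can be sketched as a minimal pre-LN Transformer encoder block. Dimensions, head count, and initialization are assumed for illustration; this is not code from either reference:

```python
# Minimal Transformer encoder block: multi-head self-attention (MHSA),
# feed-forward network (FFN), and layer normalization (LN). Hypothetical.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(x, wq, wk, wv, wo, n_heads):
    seq, d = x.shape
    dh = d // n_heads
    # Project and split into heads: (n_heads, seq, dh)
    q = (x @ wq).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (att @ v).transpose(1, 0, 2).reshape(seq, d)
    return out @ wo

def encoder_block(x, params, n_heads=4):
    # Pre-LN ordering: LN -> MHSA -> residual, then LN -> FFN -> residual.
    wq, wk, wv, wo, w1, w2 = params
    x = x + mhsa(layer_norm(x), wq, wk, wv, wo, n_heads)
    h = layer_norm(x) @ w1
    x = x + np.maximum(h, 0.0) @ w2   # ReLU feed-forward network
    return x

rng = np.random.default_rng(1)
d, seq = 32, 10
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
params += [rng.standard_normal((d, 4 * d)) * 0.1,
           rng.standard_normal((4 * d, d)) * 0.1]
y = encoder_block(rng.standard_normal((seq, d)), params)
assert y.shape == (seq, d)
```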
Regarding Claim 5,
Arteaga in view of Lu discloses the method of claim 1.
Arteaga further discloses wherein training the neural network involves training the neural network to transform the first audio data to a first region of a latent space and to transform the second audio data to a second region of the latent space, the second region being at least partially separate from the first region (Arteaga, Fig.3, Steps 301, 302, 303, and 307, par [0], "…In step 301, a reference intermediate audio signal is received. In step 302, the received reference intermediate audio signal is rendered into a first input multichannel audio signal…In step 303, a machine learning algorithm is used to generate an intermediate audio signal based on the first input multichannel audio signal…In step 307, the intermediate audio signal is rendered into a second output multichannel audio signal...").
Regarding Claim 6,
Arteaga in view of Lu discloses the method of claim 1.
Arteaga further discloses wherein the intended perceived spatial position corresponds to at least one of a channel of a channel-based audio format or positional metadata (Arteaga, Fig.4, par [034], "…The intermediate audio signal may comprise one or more audio objects, wherein each of the audio objects comprises an audio track and position metadata..."; par [041], "…The mono objects 415, the position metadata 416, and the bed channels 417 may be seen as an example representation of the claimed audio objects...").
Regarding Claim 7,
Arteaga in view of Lu discloses the method of claim 1.
Lu further discloses further comprising determining, by the control system, at least a second feature type from the first audio data and the second audio data, wherein the positional encoding process involves representing the second feature type in the embedding dimension (Lu, Figs.1-3, 3.3 positional encoding, SpecTNT architecture, "…we adopt a learnable positional embedding to encode the sequence order of frequency bins...").
Regarding Claim 8,
Arteaga discloses a neural network trained according to the method of claim 1 (Arteaga, Fig.4, par [040], "…All items inside the dashed box 400 may be running internally a deep neural network (DNN)...").
Regarding Claim 9,
Arteaga discloses one or more non-transitory media having software stored thereon, the software including instructions for implementing the neural network of claim 8 (Arteaga, par [056], "…the present disclosure may take the form of carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium...").
Claim 10 is a method claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Rationale for combination is similar to that provided for Claim 1.
Claim 11 is a method claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Lu further discloses training a neural network implemented by the control system to identify an input audio data type of input audio data, wherein identifying the input audio data type involves identifying a content type of the input audio data (Lu, 1. Introduction, "…SpecTNT, a novel modification of TNT to better fit the music data for MIR tasks; (3) we conduct extensive experiments to demonstrate the capability of SpecTNT in various MIR tasks – vocal melody extraction, music auto-tagging, and chord recognition...")
…
Rationale for combination is similar to that provided for Claim 1.
Regarding Claim 12,
Lu further discloses wherein identifying the input audio data type involves determining whether the input audio data corresponds to a podcast, movie or television program dialogue, or music (Lu, 1. Introduction, "…SpecTNT, a novel modification of TNT to better fit the music data for MIR tasks; (3) we conduct extensive experiments to demonstrate the capability of SpecTNT in various MIR tasks – vocal melody extraction, music auto-tagging, and chord recognition...").
Regarding Claim 13,
Arteaga in view of Lu discloses the method of claim 11.
Lu further discloses further comprising training the neural network to generate new content of a selected content type (Lu, 3. Method, "…the output module projects the final embedding into the desired dimension for different tasks..."; 3.5 Output Module, "…The output TE of the 3rd SpecTNT block, E3, can be used towards the final output...").
Claim 14 is an apparatus claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Rationale for combination is similar to that provided for Claim 1.
Claim 15 is a non-transitory media claim with limitations similar to the limitations of Claim 1 and is rejected under similar rationale. Additionally,
Arteaga discloses one or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices (Arteaga, par [056], "…the present disclosure may take the form of carrier medium (e.g., a computer program product on a computer-readable storage medium) carrying computer-readable program code embodied in the medium...")
...
Rationale for combination is similar to that provided for Claim 1.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Chang et al. ("End-to-end multi-channel transformer for speech recognition," ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021) discloses neural transformer architectures for multi-channel speech recognition systems, in which the spectral and spatial information collected from different microphones is integrated using attention layers. The multi-channel transformer network consists mainly of three parts: channel-wise self-attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA).
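For context only, the cross-channel attention (CCA) layers described in Chang, where one microphone channel's features attend to another channel's features, can be illustrated with a minimal sketch. Shapes and weights are hypothetical; this is not code from the reference:

```python
# Illustrative sketch of cross-channel attention: queries come from one
# channel, keys/values from another, fusing spectral and spatial cues.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_channel_attention(ch_q, ch_kv, wq, wk, wv):
    # ch_q:  (frames, dim) features of the "query" channel
    # ch_kv: (frames, dim) features of the other channel
    q, k, v = ch_q @ wq, ch_kv @ wk, ch_kv @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(2)
frames, dim = 8, 16
w = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
fused = cross_channel_attention(rng.standard_normal((frames, dim)),
                                rng.standard_normal((frames, dim)), *w)
assert fused.shape == (frames, dim)
```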
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JANGWOEN LEE whose telephone number is (703)756-5597. The examiner can normally be reached Monday-Friday 8:00 am - 5:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BHAVESH MEHTA can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JANGWOEN LEE/Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656