Prosecution Insights
Last updated: April 19, 2026
Application No. 18/667,475

SYSTEM AND METHOD FOR MULTI-CONDITIONED AUDIO GENERATION

Status: Non-Final OA (§103)
Filed: May 17, 2024
Examiner: RILEY, MARCUS T
Art Unit: 2654
Tech Center: 2600 — Communications
Assignee: Robert Bosch GmbH
OA Round: 1 (Non-Final)
Grant Probability: 76% (Favorable)
OA Rounds: 1-2
To Grant: 2y 10m
With Interview: 92%

Examiner Intelligence

Career Allow Rate: 76%, above average (514 granted / 675 resolved; +14.1% vs TC avg)
Interview Lift: strong at +15.7% for resolved cases with an interview
Typical Timeline: 2y 10m average prosecution; 14 applications currently pending
Career History: 689 total applications across all art units

Statute-Specific Performance

§101: 14.7% (-25.3% vs TC avg)
§103: 60.2% (+20.2% vs TC avg)
§102: 17.1% (-22.9% vs TC avg)
§112: 6.6% (-33.4% vs TC avg)
Tech Center averages are estimates. Based on career data from 675 resolved cases.

Office Action (§103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103

1. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

2. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

3. Claims 1-3, 5-10, 12-17, 19 & 20 are rejected under 35 U.S.C. 103 as being unpatentable over Semenov et al. (US 20200402497 A1, hereinafter Semenov '497) in view of Motiian et al. (US 20240153259 A1, hereinafter Motiian '259).

Regarding claim 8: Semenov '497 discloses a system (Fig. 12) for multi-conditional audio generation (i.e., Fig. 12 illustrates an example of a speech generation system; systems and methods for generating audio data; see Abstract & Paragraph 0030) comprising one or more hardware computing devices (Fig. 12, User Computers 1280 and Servers 1210, 1240, and 1270) configured to:

- define an audio input condition for an obtained input using an encoder (i.e., passing an audio sample through such an encoder and then conditioning inference on that token would then enable one-shot learning of a voice; Paragraphs 0033 & 0045-0046), the obtained input being indicative of one or more audio characteristics (i.e., the method includes steps for generating a plurality of style tokens from a set of audio inputs, generating an input feature vector based on the plurality of style tokens and a set of text features, and generating audio data (e.g., a spectrogram, audio waveforms, etc.) based on the input feature vector; Paragraph 0004);

- define an audio style condition of a selected audio style profile employing an audio feature extraction neural network (i.e., processes can use CNNs to convert text features and style inputs into mel spectrograms that include vocal and/or speaker characteristics; CNNs in accordance with certain embodiments of the invention can take as input text features (e.g., raw text, audio data, parts of speech, phonemes, etc.) and style tokens (e.g., speaker and/or prosody tokens) to produce a spectrogram; text features can be generated using machine learning models, such as (but not limited to) convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks; Paragraph 0043); and

- output a generated audio data indicative of a desired generated audio using a multi-conditioned latent diffusion model that employs the audio input condition and the audio style condition as adapters to the multi-conditioned latent diffusion model (i.e., audio encoders can encode audio data of previously spoken speech; audio encoders can be used to generate a feature vector that maps the input text features to features in a latent space; audio encoders can be used to encode audio data of a first duration of audio (e.g., a number of frames) that can be used in conjunction with key and value vectors from a text encoder to determine attention for a next subsequent portion of the audio data that is to be generated; attention evaluates how strongly each portion of the input text features correlates with a set of one or more frames of the output; the relationship between text and the generated audio data can be directed based on an attention mechanism; Paragraphs 0047-0049).

Examiner reasonably believes that Semenov '497 discloses the multi-conditioned latent diffusion model as expressed above. However, Examiner cites Motiian '259 to cure any deficiencies of Semenov '497. Motiian '259 discloses a multi-conditioned latent diffusion model (Fig. 2, Diffusion Model 255; i.e., systems and methods for image processing are provided; one aspect of the systems and methods includes identifying a style image including a target style; a style encoder network generates a style vector representing the target style based on the style image; the style encoder can be trained based on a style loss that encourages the network to match a desired style; a diffusion model generates a synthetic image that includes the target style based on the style vector; the diffusion model is trained independently of the style encoder network; see Abstract).

Semenov '497 and Motiian '259 are combinable because they are from the same field of endeavor of speech systems (Motiian '259 at "Background"). Before the effective filing date, it would have been obvious to a person of ordinary skill in the art to modify the speech system taught by Semenov '497 by adding a multi-conditioned latent diffusion model as taught by Motiian '259. The motivation for doing so would have been to generate high-fidelity, controllable, and context-aware outputs by guiding the denoising process with multiple simultaneous inputs (e.g., text, pose, depth, style), improving on standard models by enabling precise, disentangled, and identity-preserving manipulation. Therefore, it would have been obvious to combine Semenov '497 with Motiian '259 to obtain the invention as specified.

Regarding claim 9: Semenov '497 discloses wherein the one or more hardware computing devices is further configured to: define, using the multi-conditioned latent diffusion model, an audio conditioned latent space indicative of the generated audio data (i.e., some models try to take a multi-speaker corpus and cluster different characteristics together, creating a latent space with N distinct clusters, depending on the parameters (number of knobs) that are exposed; indeed, at some level, as the number of knobs on which the model is conditioned increases, the latent space of the voices would start to resemble a vine of grapes; Paragraphs 0035 & 0039); and transform the audio conditioned latent space to a frequency-spectrogram representing the generated audio data using a decoder (i.e., audio decoders can be used to synthesize all of the frames for a given duration in a single pass; once a spectrogram has been generated, processes in accordance with a number of embodiments of the invention can use traditional methods to transform spectrograms into audio waveforms; Paragraph 0050).
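To make the claim-8 mapping concrete, the following is a minimal, hypothetical PyTorch sketch of the claimed arrangement: an encoder defines the audio input condition, a small feature-extraction network defines the audio style condition, both enter the denoising loop of a latent diffusion model as conditioning inputs, and a decoder maps the final latent to a spectrogram frame. Module names, dimensions, and the toy one-step denoising update are illustrative assumptions, not the application's or the cited references' actual models.

    import torch
    import torch.nn as nn

    LATENT, COND = 64, 32

    class MultiConditionedLDM(nn.Module):
        def __init__(self):
            super().__init__()
            self.input_encoder = nn.Linear(128, COND)   # defines the audio input condition
            self.style_extractor = nn.Sequential(       # defines the audio style condition
                nn.Conv1d(80, COND, kernel_size=3, padding=1),
                nn.AdaptiveAvgPool1d(1), nn.Flatten())
            # denoiser sees latent plus both conditions at every step ("adapters")
            self.denoiser = nn.Linear(LATENT + 2 * COND, LATENT)
            self.decoder = nn.Linear(LATENT, 80)        # latent -> mel spectrogram bins

        def forward(self, input_feats, style_mel, steps=10):
            c_in = self.input_encoder(input_feats)       # (B, COND) input condition
            c_style = self.style_extractor(style_mel)    # (B, COND) style condition
            z = torch.randn(input_feats.size(0), LATENT)  # start from noise
            for _ in range(steps):                        # toy denoising loop
                z = z - 0.1 * self.denoiser(torch.cat([z, c_in, c_style], dim=-1))
            return self.decoder(z)                        # (B, 80) spectrogram frame

    model = MultiConditionedLDM()
    spec = model(torch.randn(2, 128), torch.randn(2, 80, 50))
    print(spec.shape)  # torch.Size([2, 80])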
Regarding claim 10: Semenov '497 discloses wherein the audio input condition includes a text condition defined based on a text input string provided as part of the obtained input and an audio condition, wherein the audio condition is associated with the text input string or based on an audio sample (i.e., processes can use CNNs to convert text features and style inputs into mel spectrograms that include vocal and/or speaker characteristics; CNNs in accordance with certain embodiments of the invention can take as input text features (e.g., raw text, audio data, parts of speech, phonemes, etc.) and style tokens (e.g., speaker and/or prosody tokens) to produce a spectrogram; Paragraph 0043).

Regarding claim 12: Semenov '497 discloses wherein the multi-conditioned latent diffusion model is at least partly defined as a text-to-audio generation model conditioned using a plurality of conditions including a text embedding, an audio embedding, and a style control condition (i.e., training data can include (but is not limited to) ground truth spectrograms, spoken text, audio waveforms, encodings, and/or tokens; losses can include (but are not limited to) attention loss, cyclic embedding loss, triplet loss, and/or a spectrogram loss; cyclic embedding losses can be used to train parts of a speech generation framework and can be computed based on a difference between a computed style token and a predicted style token; Paragraphs 0063 & 0073).

Regarding claim 13: Semenov '497 discloses wherein the audio feature extraction neural network is defined using a shallow convolutional neural network (i.e., the present invention generally relates to voice generation and, more specifically, to a system that uses a convolutional neural network to generate speech and/or audio data; Paragraph 0002).

Regarding claim 14: Semenov '497 discloses wherein the one or more hardware computing devices is further configured to: transform a selected original audio data to a latent space provided as an audio sample latent space using the encoder (i.e., text encoders can be used to analyze input text features to generate linguistic features, can take text features as input to generate an encoding of the text, and can be used to generate feature vectors that map input text features to features in a latent space; Paragraph 0046); and change, using the multi-conditioned latent diffusion model, the audio sample latent space based on the audio style condition, wherein the generated audio data is indicative of the selected original audio data and the selected audio style profile (i.e., the set of audio inputs includes a set of samples with a desired characteristic, wherein the generated audio data reflects the desired characteristic; Paragraph 0007).

Regarding claims 1, 2, 3, 5, and 6: These claims contain substantially the same subject matter as claims 8, 9, 10, 12, and 13, respectively. Therefore, they are rejected on the same grounds as those claims.
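For scale, claim 13's "shallow convolutional neural network" could plausibly be as small as the sketch below: two Conv1d layers pooled over time into a fixed-length style embedding. The layer count and sizes are assumptions for illustration only; neither the claim language nor Semenov '497, as quoted above, fixes them.

    import torch
    import torch.nn as nn

    # A two-layer ("shallow") CNN that pools a mel spectrogram into a style vector.
    shallow_style_net = nn.Sequential(
        nn.Conv1d(80, 64, kernel_size=5, padding=2),  # 80 mel bins -> 64 channels
        nn.ReLU(),
        nn.Conv1d(64, 32, kernel_size=3, padding=1),  # second and final conv layer
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),                      # average over all time frames
        nn.Flatten(),                                 # (batch, 32) style embedding
    )

    style_vec = shallow_style_net(torch.randn(4, 80, 120))  # 4 clips, 120 frames each
    print(style_vec.shape)  # torch.Size([4, 32])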
Regarding claim 7: Claim 7 contains substantially the same subject matter as claim 14. Therefore, claim 7 is rejected on the same grounds as claim 14.

Regarding claim 15: Claim 15 contains substantially the same subject matter as claim 8. Therefore, claim 15 is rejected on the same grounds as claim 8. Claim 15 further recites a non-transitory computer-readable medium comprising instructions for a multi-conditional audio generation system that, when executed by one or more hardware computing devices, cause the one or more hardware computing devices to perform operations. Semenov '497 discloses at Paragraph 0016 a non-transitory machine-readable medium containing processor instructions for generating audio data, where execution of the instructions by a processor causes the processor to perform a process that comprises generating several style tokens from a set of audio inputs, generating an input feature vector based on the several style tokens and a set of text features, and generating audio data based on the input feature vector.

Regarding claims 16, 17, 19, and 20: These claims contain substantially the same subject matter as claims 9, 10, 12, and 14, respectively. Therefore, they are rejected on the same grounds as those claims.

Allowable Subject Matter

1. Claims 4, 11 & 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

2. Claims 4 & 18 contain substantially the same subject matter as claim 11. Therefore, claims 4 & 18 are objected to on the same grounds as claim 11.

Examiner's Statement of Reasons for Allowance

The cited reference Semenov '497 teaches systems and methods for generating audio data. One embodiment includes a method for generating audio data, with steps for generating a plurality of style tokens from a set of audio inputs, generating an input feature vector based on the plurality of style tokens and a set of text features, and generating audio data (e.g., a spectrogram, audio waveforms, etc.) based on the input feature vector.

The cited reference Motiian '259 teaches systems and methods for image processing. One aspect includes identifying a style image including a target style: a style encoder network generates a style vector representing the target style based on the style image; the style encoder can be trained based on a style loss that encourages the network to match a desired style; a diffusion model generates a synthetic image that includes the target style based on the style vector; and the diffusion model is trained independently of the style encoder network.

The cited references fail to disclose wherein the one or more hardware computing devices is further configured to train the multi-conditioned latent diffusion model using the audio style condition as a local control condition and the audio input condition as a global control condition concatenating with one or more text tokens associated with the text condition. As a result, and for these reasons, Examiner indicates Claims 4, 11 & 18 as allowable subject matter.
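To visualize the limitation found allowable, here is a hedged tensor-level sketch of one way the distinction could be realized: the audio input condition acting as a global control concatenated with the text tokens, and the audio style condition applied as a local, per-frame control. The shapes and the specific concatenate/add scheme are illustrative assumptions, not the application's disclosed training method.

    import torch

    B, T_TXT, T_FRM, D = 2, 16, 100, 32          # batch, text tokens, audio frames, width

    text_tokens = torch.randn(B, T_TXT, D)        # tokens of the text condition
    audio_input_cond = torch.randn(B, D)          # audio input condition (global: one per clip)
    audio_style_cond = torch.randn(B, T_FRM, D)   # audio style condition (local: one per frame)

    # Global control: concatenate the input condition with the text tokens so
    # every denoising step sees it alongside the text.
    global_ctx = torch.cat([audio_input_cond.unsqueeze(1), text_tokens], dim=1)  # (B, 17, D)

    # Local control: inject the style condition frame-wise into the diffusion latents.
    latents = torch.randn(B, T_FRM, D) + audio_style_cond                        # (B, 100, D)

    print(global_ctx.shape, latents.shape)

Treating the style condition as per-frame (local) while the input condition rides with the text sequence (global) is, per the statement above, precisely what the cited art does not teach.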
Relevant Prior Art References Not Relied Upon

1. Shayani et al. (US 20230326158 A1): One embodiment of the present invention sets forth a technique for training a machine learning model to perform style transfer. The technique includes applying one or more augmentations to a first input three-dimensional (3D) shape to generate a second input 3D shape. The technique also includes generating, via a first set of neural network layers, a style code based on a first latent representation of the first input 3D shape and a second latent representation of the second input 3D shape. The technique further includes generating, via a second set of neural network layers, a first output 3D shape based on the style code and the second latent representation, and performing one or more operations on the first and second sets of neural network layers based on a first loss associated with the first output 3D shape to generate a trained machine learning model.

2. Ahmed et al. (US 20230282202 A1): There are disclosed techniques for generating an audio signal and training an audio generator. An audio generator may generate an audio signal from an input signal and target data representing the audio signal. The target data is derived from text. The audio generator includes: a first processing block, receiving first data derived from the input signal and outputting first output data; and a second processing block, receiving, as second data, the first output data or data derived from the first output data. The first processing block includes a conditioning set of learnable layers configured to process the target data to obtain conditioning feature parameters, and a styling element configured to apply the conditioning feature parameters to the first data or normalized first data.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARCUS T. RILEY, ESQ., whose telephone number is (571) 270-1581. The examiner can normally be reached 9-5 M-F. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/MARCUS T RILEY/
Primary Examiner, Art Unit 2654

Prosecution Timeline

May 17, 2024: Application Filed
Feb 04, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603093: ELECTRONIC APPARATUS AND CONTROLLING METHOD THEREOF. Granted Apr 14, 2026 (2y 5m to grant)
Patent 12585871: NARRATIVE GENERATION FOR SITUATION EVENT GRAPHS. Granted Mar 24, 2026 (2y 5m to grant)
Patent 12585885: DIALOGUE MODEL TRAINING METHOD. Granted Mar 24, 2026 (2y 5m to grant)
Patent 12573404: ELECTRONIC DEVICE AND METHOD OF OPERATING THE SAME. Granted Mar 10, 2026 (2y 5m to grant)
Patent 12567418: SYNCHRONOUS AUDIO AND TEXT GENERATION. Granted Mar 03, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview (+15.7%): 92%
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 675 resolved cases by this examiner. Grant probability is derived from the career allow rate.
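As a sanity check, the displayed figures reconcile: 514 grants over 675 resolved cases gives the 76% career allow rate, and adding the +15.7% interview lift yields the 92% with-interview figure (assuming, as the display suggests, the lift is simply additive):

    granted, resolved = 514, 675
    base = granted / resolved      # career allow rate
    lift = 0.157                   # reported interview lift

    print(f"career allow rate: {base:.1%}")         # 76.1%, shown as 76%
    print(f"with interview:    {base + lift:.1%}")  # 91.8%, shown as 92%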
