DETAILED ACTION
1. This action is responsive to Application No. 18/811,550, filed 8/21/2024. All claims have been examined and are currently pending.
Notice of Pre-AIA or AIA Status
2. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
3. The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 102
4. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
5. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
6. Claims 1, 4, 7-9, 12, 15-16, and 19-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Tongya (US 2022/0284884).
Regarding claim 1, Tongya teaches a computer-implemented method to moderate audio streams ([0014]: computer system for an online multi-user chat service 10; the online multi-user chat service 10 implements a deep learning model for detecting offensive usage of language in different contexts; [0024]: toxicity deep learning model may include a plurality of trained machine learning models; [0057]: training process for the transformer machine learning models), the method comprising:
receiving a user-provided audio stream associated with a user (fig 2 user chat data; para [0028] The first user chat data 54, which may be received from a first user operating one of the plurality of client devices 14, includes the statement “You're so *Expletive* bad at the game”.; 0059: inputting the user chat data via speech);
dividing the user-provided audio stream into a plurality of portions, wherein each portion corresponds to a particular time window of the audio stream (28; 30 specific words or phrases within the first user chat data; 60: target portion of user voice data);
providing the plurality of portions of the user-provided audio stream as input to an audio machine-learning model (fig 2,3; para: 14: computer system for an online multi-user chat service 10. The online multi-user chat service 10 implements a deep learning model for detecting offensive usage of language in different contexts; 0031: the user chat data 22 received from the plurality of client devices 14 in the chat session 52 may be processed by the toxicity deep learning model 44. The toxicity deep learning model 44 may include one or more of a transformer machine learning model 68, a convolutional neural network (CNN) machine learning model 70, and a heuristic offensive language checking module 72.);
outputting, by the audio machine-learning model and based on the portions of the user-provided audio stream, a determination of abuse in a particular portion of the plurality of portions (fig 2,3,4; para: 31: The filter decision service 48 receives respective outputs from the transformer machine learning model 68, the CNN machine learning model 70, and the heuristic offensive language checking module 72.;
[0032] In one example, the transformer machine learning model 68 may be configured to determine a predicted label for a type of offensive language 74 associated with a target portion of the user chat data 22. The types of offensive language may, for example, include a profanity type, a bullying type, an adult language type, and a hate speech type that may include a sexism and/or racial hate speech type.); and
performing a remedial action responsive to the determination of abuse in the particular portion (0030: The filter decision service 48 may then determine a filter action 64 to be performed for the first user chat data 54. In the illustrated example, the filter action 64 includes filtering out the first user chat data 54, and sending the filtered user chat data 66 that does not include the first user chat data 54 to the plurality of client devices 14 in the chat session 52. It should be appreciated that other filter actions 64 may be performed optionally or in addition to the illustrated filter action. For example, the filter action 64 may include a chat restriction and/or a ban for the user profile 34 associated with the user that send the first user chat data 54. As another example, the filter action 64 may include filtering out specific words or phrases within the first user chat data 54 rather than the entire statement of the first user chat data 54.).
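For illustration only and not as part of the record, the claimed step of dividing an audio stream into time-window portions can be sketched as follows; the sample rate, window length, and function names are hypothetical choices, not taken from Tongya or the claims:

```python
# Illustrative sketch: dividing an audio stream into fixed-length time
# windows, as recited in claim 1. All parameters are hypothetical.

def divide_into_windows(samples, sample_rate=16000, window_seconds=1.0):
    """Split a sequence of audio samples into consecutive time windows."""
    window_size = int(sample_rate * window_seconds)
    return [samples[i:i + window_size] for i in range(0, len(samples), window_size)]

# A 2.5-second stream at 16 kHz yields three portions (the last one partial).
stream = [0.0] * 40000
portions = divide_into_windows(stream)
print(len(portions))     # 3
print(len(portions[0]))  # 16000
```

A moderation pipeline of the kind claimed would pass each such portion to the audio machine-learning model in turn.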
Regarding claim 4, Tongya teaches the method of claim 1, wherein the audio machine-learning model is trained using training data and the method further comprises generating the training data by:
receiving training audio streams of one or more people speaking (59: the users of the client device 14 may be inputting the user chat data 22 via speech that is captured using a microphone);
for each training audio stream:
dividing the training audio stream into two or more audio segments (59: target portions of user voice data);
transcribing the two or more audio segments into two or more textual segments (59: speech to text); and
generating, with a first classifier, a first segment label for each of the two or more textual segments, wherein the first segment label indicates whether a textual segment is toxic or non-toxic (figure 5;
57: training process for the transformer machine learning models 68. The transformer machine learning models 68 are trained using a corpus of labeled user chat data 104 that includes units of user chat data (e.g. A sentence) that are paired with a human curated label 108 (e.g. Ground truth).;
58: The output from the transformer encoder 90 is passage to a fully connected deep learning layers 92 that perform classification and outputs a predicted label for the type of offensive language 74. The predicted label is used with the human curated label 108 (e.g. ground truth) associated with the unit of user chat data being processed to perform cross-entropy loss feedback training on the transformer machine learning model 68.
[0060] Each transformer machine learning model 68 may be trained to predict labels for the types of offensive language 74 for a target portion of user voice data 110 based on both the text for the target portion of user voice data generated by the speech to text model 112 and the acoustic features of the target portion of user voice data 110 generated by the acoustic model 114.); and
adding the training audio stream, the two or more textual segments, and corresponding first segment labels from the training audio streams to a training data set
(figure 5; 57-61; [0061] The transformer encoder 90 and the fully connected deep learning layers 92 are trained to predict the label for the type of offensive language 74 based on both the output encodings 96 and the set of acoustic features 116. In this manner, the transformer machine learning model 68 may learn associations between acoustic features and certain types of offensive language. For example, a bullying type of offensive language such as “You're so *Expletive* bad at the game” may potentially be associated with loud or otherwise angry types of acoustic features 116. A similar architecture may be used at run-time to perform run-time processing of user voice data 110.).
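As a purely illustrative sketch of the claim-4 training-data generation loop (segment, transcribe, label with a first classifier, collect), with a stub transcriber and a stub keyword classifier standing in for the models Tongya describes:

```python
# Illustrative sketch of claim 4's training-data generation. The
# transcriber and classifier below are hypothetical stand-ins.

def transcribe(segment):
    # Stand-in for a speech-to-text model (cf. Tongya [0059]).
    return segment["text"]

def first_classifier(text):
    # Stand-in keyword check; a real system would use a trained model.
    toxic_terms = {"expletive"}
    return "toxic" if any(t in text.lower() for t in toxic_terms) else "non-toxic"

def build_training_data(training_streams):
    dataset = []
    for stream in training_streams:
        segments = stream["segments"]                  # already divided
        texts = [transcribe(s) for s in segments]      # transcription step
        labels = [first_classifier(t) for t in texts]  # first segment labels
        dataset.append({"audio": stream["audio"], "texts": texts, "labels": labels})
    return dataset

streams = [{"audio": "stream-1",
            "segments": [{"text": "nice shot"}, {"text": "you are Expletive bad"}]}]
data = build_training_data(streams)
print(data[0]["labels"])  # ['non-toxic', 'toxic']
```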
Regarding claim 7, Tongya teaches the method of claim 1, wherein the remedial action includes providing a warning to the user
(0030: The filter decision service 48 may then determine a filter action 64 to be performed for the first user chat data 54. In the illustrated example, the filter action 64 includes filtering out the first user chat data 54, and sending the filtered user chat data 66 that does not include the first user chat data 54 to the plurality of client devices 14 in the chat session 52. It should be appreciated that other filter actions 64 may be performed optionally or in addition to the illustrated filter action. For example, the filter action 64 may include a chat restriction and/or a ban for the user profile 34 associated with the user that send the first user chat data 54. As another example, the filter action 64 may include filtering out specific words or phrases within the first user chat data 54 rather than the entire statement of the first user chat data 54.).
Regarding claim 8, Tongya teaches the method of claim 1, wherein the remedial action includes at least one of:
causing a microphone on a user device associated with the user to be muted or suppressing the user-provided audio stream from being delivered to one or more other users (0030: The filter decision service 48 may then determine a filter action 64 to be performed for the first user chat data 54. In the illustrated example, the filter action 64 includes filtering out the first user chat data 54, and sending the filtered user chat data 66 that does not include the first user chat data 54 to the plurality of client devices 14 in the chat session 52. It should be appreciated that other filter actions 64 may be performed optionally or in addition to the illustrated filter action. For example, the filter action 64 may include a chat restriction and/or a ban for the user profile 34 associated with the user that send the first user chat data 54. As another example, the filter action 64 may include filtering out specific words or phrases within the first user chat data 54 rather than the entire statement of the first user chat data 54.).
Regarding claim 9, Tongya teaches the method of claim 1, wherein the determination of abuse includes an identification of a type of abuse, the type of abuse selected from a group of one or more of profanity, bullying, harassment, sexism, and combinations thereof ([0032]: The types of offensive language may, for example, include a profanity type, a bullying type, an adult language type, and a hate speech type that may include a sexism and/or racial hate speech type).
Regarding claim 12, Tongya teaches a system to train an audio machine-learning model comprising:
one or more processors; and
a memory coupled to the one or more processors, with instructions stored thereon that, when executed by the processor, cause the processor to perform operations comprising:
receiving a user-provided audio stream associated with a user;
dividing the user-provided audio stream into a plurality of portions, wherein each portion corresponds to a particular time window of the audio stream;
providing the plurality of portions of the user-provided audio stream as input to an audio machine-learning model;
outputting, by the audio machine-learning model and based on the portions of the user-provided audio stream, a determination of abuse in a particular portion of the plurality of portions; and
performing a remedial action responsive to the determination of abuse in the particular portion.
Claim 12 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claim 15 recites limitations similar to claim 4 and is rejected for similar rationale and reasoning.
Regarding claim 16, Tongya teaches a non-transitory computer-readable medium with instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations, the operations comprising:
receiving a user-provided audio stream associated with a user;
dividing the user-provided audio stream into a plurality of portions, wherein each portion corresponds to a particular time window of the audio stream;
providing the plurality of portions of the user-provided audio stream as input to an audio machine-learning model;
outputting, by the audio machine-learning model and based on the portions of the user-provided audio stream, a determination of abuse in a particular portion of the plurality of portions; and
performing a remedial action responsive to the determination of abuse in the particular portion.
Claim 16 recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.
Claim 19 recites limitations similar to claim 4 and is rejected for similar rationale and reasoning.
Claim 20 recites limitations similar to claim 7 and is rejected for similar rationale and reasoning.
Claim Rejections - 35 USC § 103
7. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
8. Claims 2-3, 13-14, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Tongya in view of Russell et al. (US 2024/0386048).
Regarding claim 2, Tongya teaches the method of claim 1, wherein the audio machine-learning model is trained by:
providing audio input to an audio encoder (fig 5; 0059-0060; 59: inputting user chat data via speech; 61: encoder);
outputting, by the audio encoder and based on the audio input, audio embeddings corresponding to the audio input and voice toxicity classification that identifies one or more toxic labels to associate with the audio input (fig 5; 57: labeled data; 59: acoustic features; 60-61: 61: output encodings; trained to predict the label for the type of offensive language);
providing text input to a text encoder, wherein the text input is a transcription of the audio input (59: speech to text model that generates target portions of the user chat data based on target portions of user voice data; 60: encoder);
outputting, by the text encoder and based on the text input, text embeddings (61 embeddings);
[Image: media_image1.png (greyscale, 557 x 932)]
but does not specifically teach the following limitations, which Russell et al. (US 2024/0386048) teaches:
determining a value of a text injection loss function based on comparison of the audio embeddings and the text embeddings (abstract: generating a fused visual-text embedding based on a visual embedding and a text embedding corresponding to the input. The disclosed systems and methods further comprise comparing audio embeddings for music audio sequences of a music audio sequences database with the fused visual-text embedding.; figure 4; para: 55; 56: loss function can calculate losses between different, or additional, modalities. losses can be calculated between…audio and text embeddings); and
adjusting one or more parameters of the audio encoder to reduce the value of the text injection loss function (56; [0057] The calculated loss can then be backpropagated to train the transformers (e.g., visual transformer 122, text transformer 126, embedding fusion module 130, and audio transformer 308), as shown at numeral 15.).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Russell for an improved system that better trains and tunes the machine learning model with the multi-modal embeddings.
Tongya already teaches training the model with audio and text, and one of ordinary skill could look to Russell to further optimize the training for improved toxicity determination, ultimately providing a safer environment for online users.
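The text injection loss of claim 2 can be illustrated, under the assumption (not Russell's actual implementation) that the loss is a mean squared distance between audio and text embeddings, with a toy gradient step standing in for adjusting the audio encoder's parameters:

```python
# Illustrative sketch: a "text injection" loss computed by comparing an
# audio embedding with the corresponding text embedding, plus one
# hypothetical gradient-descent step that reduces that loss.

def mse_loss(audio_emb, text_emb):
    """Mean squared distance between two equal-length embeddings."""
    return sum((a - t) ** 2 for a, t in zip(audio_emb, text_emb)) / len(audio_emb)

def adjust_toward_text(audio_emb, text_emb, lr=0.5):
    # One gradient-descent step on the squared-error loss.
    n = len(audio_emb)
    return [a - lr * 2 * (a - t) / n for a, t in zip(audio_emb, text_emb)]

audio = [0.2, 0.9, -0.4]   # hypothetical audio-encoder output
text = [0.1, 0.7, -0.2]    # hypothetical text-encoder output
before = mse_loss(audio, text)
after = mse_loss(adjust_toward_text(audio, text), text)
print(after < before)  # True
```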
Regarding claim 3, Tongya teaches the method of claim 2, wherein the audio input includes real-world audio associated with abuse reports from one or more users and the method further comprises:
comparing the voice toxicity classification to labels associated with the real-world audio to determine a value of a classifier loss function, wherein the labels associated with the real-world audio are ground truth provided by human reviewers (figure 5 human label, predicated label, cross-entropy loss; 57;
58: The predicted label is used with the human curated label 108 (e.g. ground truth) associated with the unit of user chat data being processed to perform cross-entropy loss feedback training on the transformer machine learning model 68. ); and
adjusting parameters of the audio encoder to reduce the value of the classifier loss function.
([0057] FIG. 5 illustrates an example training process for the transformer machine learning models 68. The transformer machine learning models 68 are trained using a corpus of labeled user chat data 104 that includes units of user chat data (e.g. A sentence) that are paired with a human curated label 108 (e.g. Ground truth). The labeled user chat data 104 may be collected via any suitable technique. For example, the server system 12 may be configured to provide a reporting system that receive reports from client devices 14. These reports may indicate a unit of user chat data and a user selected type of offensive language, such as profanity, adult language, bullying, hate speech, sexism, etc. These user reports may be reviewed by moderators and collected into the corpus of labeled user chat data 104; 58;
0062: The gaming language dictionary encoder 102 may be pre-trained using negative sampling loss on a corpus of user chat data 126.)
Claim 3 is rejected for similar rationale and reasoning as claim 2.
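The classifier loss of claim 3, cross-entropy between a predicted label distribution and a human-curated ground-truth label (cf. Tongya [0058]), can be sketched as follows; the label set and probabilities are hypothetical:

```python
# Illustrative sketch of cross-entropy loss against a human-curated label.

import math

LABELS = ["non-toxic", "profanity", "bullying", "hate-speech"]

def cross_entropy(predicted_probs, ground_truth_label):
    """Negative log-probability assigned to the human-curated label."""
    idx = LABELS.index(ground_truth_label)
    return -math.log(predicted_probs[idx])

good = cross_entropy([0.1, 0.7, 0.1, 0.1], "profanity")  # confident and correct
bad = cross_entropy([0.7, 0.1, 0.1, 0.1], "profanity")   # confident and wrong
print(good < bad)  # True
```

Training would then adjust the encoder parameters to drive this loss down, as in the cited cross-entropy loss feedback training.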
Claim 13 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning
Claim 14 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning
Claim 17 recites limitations similar to claim 2 and is rejected for similar rationale and reasoning
Claim 18 recites limitations similar to claim 3 and is rejected for similar rationale and reasoning
9. Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Tongya in view of Li (US 2023/0162722).
Regarding claim 5, Tongya teaches the method of claim 4, wherein generating the training data further includes:
identifying, from the training audio streams, a subset of the training audio streams where one or more of the first segment labels indicate that one or more of the textual segments is toxic (figure 5; [0057]-[0061]; [0060]: trained to predict labels for the types of offensive language); adding the {second segment} labels to the training set ([0061]),
but does not specifically teach
generating, with a second classifier, second segment labels for the subset of the training audio streams, wherein the second classifier is more accurate at identifying instances of abuse than the first classifier; and
adding the second segment labels to the training set.
Li teaches classifying sample data (speech data) using a plurality of classifiers, which may include a to-be-trained model to obtain a first predicted classification and additional classifiers to obtain a second predicted classification, thereby improving the accuracy of the reference classification and reference confidence ([0039]-[0043]).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Li for an improved system that better classifies and predicts the labels for improved training and, ultimately, better toxicity determination.
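A purely illustrative sketch of the two-classifier arrangement of claim 5: a fast first classifier flags a subset, and a second classifier, assumed here to be more accurate, relabels only that subset. Both classifiers below are keyword stubs, not the models of Tongya or Li:

```python
# Illustrative cascade: run a cheap first classifier over everything,
# then a (hypothetically) more accurate second classifier over the
# flagged subset only.

def first_classifier(text):
    return "toxic" if "bad" in text else "non-toxic"

def second_classifier(text):
    # Stand-in for a heavier, more accurate model run only on flagged items.
    return "toxic" if "expletive bad" in text else "non-toxic"

def relabel_flagged(segments):
    flagged = [s for s in segments if first_classifier(s) == "toxic"]
    return {s: second_classifier(s) for s in flagged}

segments = ["good game", "bad luck", "you are expletive bad"]
print(relabel_flagged(segments))
# {'bad luck': 'non-toxic', 'you are expletive bad': 'toxic'}
```

The second-pass labels would then be added to the training set alongside the first-pass labels.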
10. Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Tongya in view of Khani et al. (US 2024/0370662), in further view of Finkelstein et al. (US 2023/0018384).
Regarding claim 6, Tongya teaches the method of claim 1, wherein the audio machine-learning model is trained using synthetic training data and the method further comprises generating the synthetic training data by:
providing voice chat audio to an automatic speech recognition (ASR) system ([0059]: ASR);
outputting, by the ASR system, transcribed audio based on the voice chat audio ([0059]: speech to text);
providing the transcribed audio and a prompt specifying new text characteristics ([0061]-[0062]),
but does not specifically teach
{ providing the transcribed audio and a prompt specifying new text characteristics to a large language model (LLM), the LLM configured to generate new text based on the prompt and the transcribed audio;
providing the voice chat audio to a voice cloner that outputs audio tokens that preserve speaker characteristics in the voice chat audio;
providing the new text and the audio tokens as input to a text to speech system; and
outputting, by the text to speech system, the synthetic training data.}
Khani et al. (US 2024/0370662) teaches the LLM configured to generate new text based on the prompt:
[0048] After a cluster has been identified, method 500 proceeds to generate a prompt for an LLM, at 512. The LLM may include or be a GPT model that generates text based on an input. The prompt may be generated based on data in the dataset that relates to the identified cluster. For example, the prompt may be datapoints from the identified cluster. Once the prompt is received by the LLM, the LLM generates synthetic training data similar to the prompt.
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Khani to allow for efficient text generation, for an improved system that assists in obtaining synthetic training data to adapt to new words.
Finkelstein et al. (US 2023/0018384) teaches:
providing the voice chat audio to a voice cloner that outputs audio tokens that preserve speaker characteristics in the voice chat audio (69);
providing the new text and the audio tokens as input to a text to speech system (69); and
outputting, by the text to speech system, the synthetic training data.
([0069] FIG. 6 is a flowchart of an exemplary arrangement of operations for a method 600 of synthesizing an input text utterance into expressive speech having an intended accent/dialect and cloning a voice of a target speaker 432. The data processing hardware 122 (FIG. 1) may execute the operations for the method 600 by executing instructions stored on the memory hardware 124. At operation 602, the method 600 includes obtaining training data 10 including a plurality of training audio signals 102 and corresponding transcripts 106. Each training audio signal 102 corresponds to a reference utterance spoken by a target speaker in a first accent/dialect. Each transcript 106 includes a textual representation of the corresponding reference utterance. For each training audio signal 102 of the training audio signals 102, the method 600 performs operations 604 and 606. At operation 604, the method 600 includes generating, by a trained voice cloning system 200 configured to receive the training audio signal 102 corresponding to the reference utterance spoken by the target speaker in the first accent/dialect as input, a training synthetic speech representation 202 of the corresponding reference utterance spoken by the target speaker.)
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Finkelstein for an improved system that better trains and tunes the machine learning model to adapt to new words, since gamer users may often create new words and the model must learn different words (Tongya [0062]).
The incorporation of Khani and Finkelstein with Tongya would thus allow for the teaching of:
providing the transcribed audio and a prompt specifying new text characteristics to a large language model (LLM), the LLM configured to generate new text based on the prompt and the transcribed audio;
providing the voice chat audio to a voice cloner that outputs audio tokens that preserve speaker characteristics in the voice chat audio;
providing the new text and the audio tokens as input to a text to speech system; and
outputting, by the text to speech system, the synthetic training data.
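The claim-6 synthetic-data pipeline (ASR, then LLM rewrite, then voice cloning, then text to speech) can be sketched end to end with hypothetical stubs; none of these functions reflect the actual systems of Khani or Finkelstein:

```python
# Illustrative end-to-end sketch of the claim-6 pipeline. Every function
# here is a hypothetical stub standing in for the cited systems.

def asr(voice_chat_audio):
    return voice_chat_audio["spoken_text"]      # speech-to-text stand-in

def llm_rewrite(transcript, prompt):
    return f"{transcript} [{prompt}]"           # new text per the prompt

def voice_cloner(voice_chat_audio):
    return voice_chat_audio["speaker_id"]       # speaker-preserving tokens

def tts(new_text, audio_tokens):
    return {"text": new_text, "voice": audio_tokens}

def generate_synthetic_sample(voice_chat_audio, prompt):
    transcript = asr(voice_chat_audio)
    new_text = llm_rewrite(transcript, prompt)
    tokens = voice_cloner(voice_chat_audio)
    return tts(new_text, tokens)

sample = generate_synthetic_sample(
    {"spoken_text": "nice play", "speaker_id": "spk-7"},
    prompt="rephrase with new slang",
)
print(sample["voice"])  # spk-7
```

The point of the combination is that the output preserves the speaker's voice characteristics while carrying new text, yielding fresh labeled audio for training.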
11. Claims 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Tongya in view of Thyssen (US 6,983,242).
Regarding claim 10, Tongya does not specifically teach, but Thyssen teaches, the method of claim 1, wherein prior to receiving the user-provided audio stream, the method further comprises filtering, by a voice activity detection (VAD) model, the user-provided audio stream to remove parts of the audio stream that do not include human speech (col. 4, ll. 16-27: VAD).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate VAD and noise removal for an improved system to ensure only the necessary speech is extracted for proper audio classification and toxicity determination.
Regarding claim 11, Tongya does not specifically teach, but Thyssen teaches, the method of claim 1, further comprising filtering the user-provided audio stream to remove background noise (col. 5, ll. 14-15: background noise attenuation).
Claim 11 is rejected for similar rationale and reasoning as claim 10.
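Voice activity detection as in claims 10-11 can be illustrated with a simple energy threshold that drops frames unlikely to contain speech; the threshold and frame representation are hypothetical simplifications, not Thyssen's actual VAD:

```python
# Illustrative energy-based VAD: frames whose average energy falls below
# a (hypothetical) threshold are treated as non-speech and removed.

def frame_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def vad_filter(frames, threshold=0.01):
    """Keep only frames whose average energy suggests human speech."""
    return [f for f in frames if frame_energy(f) >= threshold]

silence = [0.001] * 160                  # near-silent frame
speech = [0.3, -0.4, 0.5, -0.2] * 40     # energetic frame
kept = vad_filter([silence, speech, silence])
print(len(kept))  # 1
```

Only the speech-bearing portions would then be passed on for audio classification and toxicity determination.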
Conclusion
12. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: see PTO-892.
Thomas et al. (US 2021/0370188)
Abstract: In various examples, game session audio data—e.g., representing speech of users participating in the game—may be monitored and/or analyzed to determine whether inappropriate language is being used. Where inappropriate language is identified, the portions of the audio corresponding to the inappropriate language may be edited or modified such that other users do not hear the inappropriate language. As a result, toxic behavior or language within instances of gameplay may be censored—thereby enhancing the user experience and making online gaming environments safer for more vulnerable populations. In some embodiments, the inappropriate language may be reported—e.g., automatically—to the game developer or game application host in order to suspend, ban, or otherwise manage users of the system that have a proclivity for toxic behavior.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541. The examiner can normally be reached Monday-Friday 9-5 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SHAUN ROBERTS/Primary Examiner, Art Unit 2655