Prosecution Insights
Last updated: April 19, 2026
Application No. 17/967,165

AUTOMATIC CONTENT CLASSIFICATION AND AUDITING

Status: Non-Final Office Action (§103)
Filed: Oct 17, 2022
Examiner: MORALES, PEDRO JESUS
Art Unit: 2124
Tech Center: 2100 — Computer Architecture & Software
Assignee: International Business Machines Corporation
OA Round: 3 (Non-Final)
Grant Probability: 67% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 11m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 67% (6 granted / 9 resolved), +11.7% vs TC avg — above average
Interview Lift: +50.0% allowance lift on resolved cases with an interview versus without
Typical Timeline: 3y 11m average prosecution; 20 applications currently pending
Career History: 29 total applications across all art units

Statute-Specific Performance

§101: 26.9% (-13.1% vs TC avg)
§103: 40.4% (+0.4% vs TC avg)
§102: 13.5% (-26.5% vs TC avg)
§112: 17.1% (-22.9% vs TC avg)
Deltas are measured against the Tech Center average estimate; based on career data from 9 resolved cases.
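
The deltas above are differences between this examiner's statute-specific allowance rates and the Tech Center baseline, so the baselines can be back-calculated from what is shown. A minimal sketch of that arithmetic (variable names, rounding, and the back-calculation itself are illustrative assumptions, not the analytics provider's actual code):

```python
# Illustrative arithmetic behind the examiner statistics shown above.
# All input figures are taken from the dashboard; the Tech Center (TC)
# baselines are back-calculated from the displayed deltas.

granted, resolved = 6, 9
career_allow_rate = granted / resolved                 # ~0.667, shown as 67%
tc_avg_allow_rate = career_allow_rate - 0.117          # "+11.7% vs TC avg"
print(f"Career allow rate: {career_allow_rate:.1%} (implied TC avg {tc_avg_allow_rate:.1%})")

# Statute-specific allowance rates and their displayed deltas vs the TC average.
statute_rates = {"101": 0.269, "103": 0.404, "102": 0.135, "112": 0.171}
statute_deltas = {"101": -0.131, "103": +0.004, "102": -0.265, "112": -0.229}
for statute, rate in statute_rates.items():
    implied_tc = rate - statute_deltas[statute]
    print(f"§{statute}: examiner {rate:.1%}, implied TC avg {implied_tc:.1%}")
```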

Office Action

§103
DETAILED ACTION

This action is responsive to Applicant’s reply filed 05 March 2026. This action is made non-final.

Status of the Claims

Claims 1, 8, and 17 are currently amended. Claims 1-20 are pending and under examination; claims 1, 8, and 17 are independent.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on September 29, 2025 has been entered.

Response to Amendment

Applicant’s amendments to the claims have overcome the §101 rejections previously set forth in the Final Office Action mailed January 6, 2026.

Regarding the rejection of claims 1-2, 8, 11, and 17-18 under 35 U.S.C. 103 as being unpatentable over Autès in view of Goyal, Applicant argues that the Examiner has failed to establish a proper prima facie case of obviousness because there is a lack of suggestion to combine the references. Arguments directed to a lack of suggestion to combine references are addressed in MPEP § 2145(X).

On pages 23-24 of Applicant’s reply, Applicant argues that Goyal teaches away from using coarse-grained data. Per MPEP § 2145(X)(D), “the prior art’s mere disclosure of more than one alternative does not constitute a teaching away from any of these alternatives because such disclosure does not criticize, discredit, or otherwise discourage the solution claimed….” Applicant points to Goyal’s disclosure that “One prerequisite for learning more fine-grained information about the world is that labels describe video content that is restricted to a short time interval. Only this way can there be a tight synchronization between video content and the corresponding labels, …” (P. 3, Sec. 3, ¶2) as teaching against the use of coarse-grained data. Goyal does not explicitly criticize, discredit, or discourage using coarse-grained data. Goyal merely teaches that using fine-grained information is the only way to achieve tight synchronization, and by no means clearly discredits or discourages using coarse-grained data; therefore, Applicant’s argument is not persuasive.

On page 24 of Applicant’s reply, Applicant argues that the Examiner has failed to establish a proper prima facie case of obviousness because Goyal fails to disclose “the filled label template describes a data type of a content item.” Claim 1’s amendments are newly presented and have been addressed in the rejection below. Applicant’s arguments regarding the art rejections are moot in view of the new grounds of rejection necessitated by Applicant’s amendment.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claims 1-2, 8, 11, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Autès et al. (US 20220207864 A1), hereinafter Autès, in view of Goyal et al. (“The “something something” video database for learning and evaluating visual common sense”), hereinafter Goyal, further in view of Agrawal et al. (US 20230059007 A1), hereinafter Agrawal. With respect to claim 1, Autès teaches: a computer-implemented method comprising (Autès discloses “the present invention discloses a new computer-implemented method based on deep artificial neural networks to assign e.g. age suitability classes or ratings to media content” [0007].): training, using labelled training content …, a content classification model (Autès discloses a classifier unit (‘a content classification model’) that predicts age suitability classes for content “a classifier block or unit 21 for generating a probability score or vector 25 for a concatenated audio, image and text vector 23 to obtain an estimated age suitability class” [0021]. Autès further discloses “the classifier unit 21 is trained by using labelled data, i.e. labelled videos 62, to classify video scenes into age suitability classes” [0030].), determining, using the trained content classification model, a label describing a first content (The Examiner interprets “label” according to its broadest reasonable interpretation as encompassing an age suitability class. Autès discloses “the MLP 51 determines age suitability class probabilities by using the single concatenated audio, image and text feature vector 23. In other words, a vector of probabilities is computed or determined so that the number of entries in the vector equals the number of possible age suitability categories. In this manner, one probability value is allocated to each age suitability category. If there are for instance five different age suitability categories, then the probability vector could be for example [0.0, 0.1, 0.0, 0.8, 0.1]. In step 125, an age suitability class is assigned to the data stream under consideration. This step may be carried out by the post-processing unit 27. In other words, in this step, a sequence or stream classification is carried out. In practice, this step may be implemented so that that the highest probability value is selected from the probability vector and the assigned age suitability class is the class corresponding to that probability value” [0039-0040]. See also Figure 1 depicting how a multilayer perceptron (MLP) network is part of a classifier unit (‘a content classification model’).); classifying, into a category in a set of categories using the label, the first content (Autès discloses that age suitability classes are used to classify a data stream (‘first content’) into “compatible” or “not compatible” categories, “in step 127, the viewer age suitability class or classes is/are selected or the selection is received by the post-processing unit 27. 
It is to be noted that this step may be carried out at any moment prior to carrying out step 129. In step 129, it is determined whether or not the assigned age suitability class is compatible with the selection received in step 127. More specifically, it is determined whether or not the assigned age suitability class is the same as the selected age suitability class or is within the range of the allowed age suitability classes for this user or viewer. In the affirmative, in step 131, the stream is displayed or played to the viewer and the process then continues in step 103. If the assigned class is not compatible with the selection, then in step 133, it is decided not to show the stream in question to the viewer” [0040]. See Figure 3, steps 125-133 depicting how data streams are classified.); and removing, from a storage location responsive to the first content being classified into a category of inappropriate content, the first content (Autès discloses “in step 129, it is determined whether or not the assigned age suitability class is compatible with the selection received in step 127… if the assigned class is not compatible with the selection, then in step 133, it is decided not to show the stream in question to the viewer” [0040]. The Examiner interprets “removing, from a storage location” according to its broadest reasonable interpretation as encompassing modifying a data stream. Autès discloses “it is possible to filter incompatible scenes from a film and show the rest of the film to the user as a continuous sequence of scenes for example or so that the incompatible scenes have been modified to so that they comply with the class selection. In this case, a slightly modified film may be displayed to the viewer. For instance, it is possible to show all the image frames of a film to the viewer but so that some of the audio clips and/or text portions having unsuitable content are not played or vice versa. The incompatible audio content could simply be replaced with a muted content for example. The system may also determine which one of the assessed streams contributes most to the incompatible age suitability class. In other words, the system may rank the assessed streams according to their contribution to the estimated age suitability class. Then, for example, the stream having the greatest contribution may be modified or its playback prevented for a particular scene” [0041]. Autès discloses “the proposed solution could identify which scene of a film results in an R rating and allow creation of a cut or modified version suitable for younger audience” [0011].). However, Autès does not teach labelled training content comprising content items associated with a label encoding generated from a filled label template, which is taught by Goyal: the labelled training content comprising content items associated with a label encoding generated from a filled label template (Goyal discloses “we introduce the “something-something”- dataset. It currently contains 108, 499 videos across 174 labels, with duration ranging from 2 to 6 seconds. Labels are textual descriptions based on templates, such as “Dropping [something] into [something]” containing slots (“[something]”) that serve as placeholders for objects. Crowdworkers provide videos where they act out the templates. They choose the objects to perform the actions on and enter the noun-phrase describing the objects when uploading the videos. The dataset is split into train, validation and test-sets in the ratio of 8:1:1” (P. 4, Sec. 4, ¶1-2). 
See Figure 4 on P. 6 depicting example videos and their corresponding labels. Goyal discloses “Our goal in this work is to capture basic physical concepts expressible in simple phrases, which we hope will provide a stepping stone towards more complex relationships and facts. Consequently, we use natural language instead of a fixed data structure to represent labels” (P. 3, Sec. 2, Last Paragraph). Goyal discloses “natural language labels allow us to represent a spectrum of complexity, from simple objects and actions encoded as one-hot labels, to full-fledged captions. The use of natural language encodings for classes furthermore allows us to dynamically adjust the label structure in response to how learning progresses. In other words, the complexity of videos and natural language descriptions can be increased as a function of the validation-accuracy achievable by networks trained on the data so far” (P. 6, Sec. 4.2, Last Paragraph). Goyal discloses “We compared these networks mainly on two subsets of the dataset with classes hand-picked to simplify the task and benchmark the complexity of the dataset (we refer to the supplementary materials for more details on selection of classes): 10 selected classes: We first pre-select 41 “easy” classes. We then generate 10 classes to train the networks (shown in Table 3)” (P. 9, Sec. 5.3, ¶ 1). See Table 3 on P. 8 depicting the 10 selected classes used to generate label templates for training networks.), wherein the filled label template describes a data type of a content item (The Examiner interprets “data type” according to its broadest reasonable interpretation (BRI) in view of the applicant’s specification. The examiner notes that the applicant provides no meaningful definition of this term in their specification. Accordingly, the Examiner interprets this term as encompassing a textual data type as disclosed by Goyal below. Goyal discloses “Crowd workers are asked to record videos and to complete caption templates, by providing appropriate input-text for placeholders. In this example, the text provided for placeholder “something” is a shoe” (P. 1, Figure 1 Caption). The label templates define the structure for labels and the input-text is textual data, therefore a filled label template is text that describes a video (and therefore a filled label template is a text data type of the video).); Goyal teaches training networks with label templates represented as natural language encodings is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the method of Autès with the label templates of Goyal to train a machine learning model with label templates. By training a model with label templates, labels can be represented in a simple and consistent way while retaining semantic knowledge, thereby improving model accuracy. Furthermore, the combination of Autès in view of Goyal does not teach converting a fine-grained labeled data set to a coarse-grained labeled data set, which is taught by Agrawal: converting a fine-grained labeled data set to coarse-grained labeled data set (Agrawal discloses “the layer generation system 102 analyzes the entire digital image 202 together with the layer labels 210 to determine that the digital image 202 depicts a room that includes a lamp, a chair, and a couch. 
Based on identifying the room including these three objects, the layer generation system 102 determines that “living room” is an appropriate project name 212” [0057]. Agrawal discloses “the layer generation system 102 utilizes an image name generation machine learning model to generate the project name 212 from the layer labels 210. For instance, the image name generation machine learning model can generate predictions or probabilities of the layer labels 210 corresponding to various of project names, and the layer generation system 102 can select the project name with the highest probability (or that satisfies a threshold probability). For example, the layer generation system 102 combines the layer labels 210 into a single text string and utilizes the image name generation machine learning model to determine which of a set of candidate project names corresponds to the text string” [0058]. Layer labels (‘fine-grained labeled data set’) are combined into a text string to determine candidate project names (‘coarse-grained labeled data set’) that describe an entire digital image. By combining the layer labels into a single text string to determine a project name (coarse-grained label), the layer labels are combined into a single label that describes the entire digital image (therefore converting a fine-grained labeled data set to a coarse-grained labeled data set).), wherein a coarse-grained label of the coarse-grained labeled data set describes an entire content unit (Agrawal discloses “the layer generation system 102 generates the project name 212 to describe the layer labels 210 (e.g., using words not included in the layer labels 210) such as by generating the name “living room” corresponding to the digital image 202” [0057]. Agrawal discloses “the layer generation system 102 combines the layer labels 210 into a single text string and utilizes the image name generation machine learning model to determine which of a set of candidate project names corresponds to the text string” [0058]. Layer labels are combined to determine candidate project names that describe a digital image (‘content unit’), therefore candidate project names are a coarse-grained labeled data set that describe an entire digital image.), and wherein a fine-grained label of the fine-grained labeled data set describes a portion of a content unit (Agrawal discloses “the layer generation system 102 generates the layer labels 210 from the object classifications generated via the object classification machine learning model 206. As shown, the layer labels 210 include “lamp,” “couch,” and “chair” for the corresponding sets of pixels within the digital image 202” [0055]. Layer labels describe objects (lamp, couch, chair) that each correspond to a set of pixels (‘a portion’) of a digital image (‘content unit’), therefore layer labels are a fine-grained labeled data set.); training, using … training content of the coarse-grained labeled data set, a content classification model (Agrawal discloses “the layer generation system 102 utilizes an image name generation machine learning model to generate the project name 212 from the layer labels 210. For instance, the image name generation machine learning model can generate predictions or probabilities of the layer labels 210 corresponding to various of project names, …utilizes the image name generation machine learning model to determine which of a set of candidate project names corresponds to the text string” [0058]. 
An image name generation machine learning model (‘content classification model’) uses a text string comprised of layer labels to generate candidate project names (coarse-grained labels), therefore it is implied that the machine learning model must have used project names to perform training.) Agrawal teaches combining multiple labels that each describe a portion of a digital image into a single label that describes the entire digital image is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the method of Autès with the technique disclosed by Agrawal to combine multiple labels that each describe a portion of a content into a single label that describes the entire content. By combining multiple labels that each describe a portion of a content into a single label that describes the entire content, content can be summarized by a single high-level label, thereby allowing for complex content to be categorized into a broader, meaningful category. With respect to claim 2, the combination of Autès in view of Goyal, further in view of Agrawal teaches: the computer-implemented method of claim 1, wherein the labelled training content comprises a text label describing the content (The Examiner interprets “a text label describing the content” according to its broadest reasonable interpretation as encompassing a sequence of words that are related to and describe content. Autès discloses such a sequence of words, “the system takes as its input a digital media file, which in this example is a video or motion picture file 3, which is fed into a data pre-processing unit 5. The data pre-processing unit 5 is configured to pre-process the input video file as will be explained later in more detail and output a sequence or stream of image frames 7, audio clips 9 and text portions or words 11, which may be subtitles, a video summary, a script, reviews etc related to the audio frames and/or audio clips” [0021]. Autès discloses “the classifier unit 21 is trained by using labelled data, i.e. labelled videos 62, to classify video scenes into age suitability classes. From the labelled videos, the following streams are extracted: an image stream 63, an audio stream 65 and a text stream 67. These streams are fed into the system 61. An age label rating 69 of the extracted streams is also extracted from the labelled video and fed to a loss function computation unit 71” [0030]). With respect to claim 8, the rejection of claim 1 is incorporated. The difference in scope being a computer program product comprising one or more computer readable storage medium, and program instructions collectively stored on the one or more computer readable storage medium, the program instructions executable by a processor to cause the processor to perform operations comprising (Autès discloses “the actual computing system may comprise a central or computing processing unit (CPU), a graphical processing unit (GPU), a memory unit and a storage device to store digital files. When processing a multimedia stream, the parameters and the operating software of the neural network system are first loaded from the storage to the memory” [0044].). With respect to claim 11, the claim recites similar limitations corresponding to claim 2, therefore the same rationale of rejection is applicable. With respect to claim 17, the rejection of claim 1 is incorporated. 
The difference in scope being a computer system comprising a processor and one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions executable by the processor to cause the processor to perform operations comprising (Autès discloses “the actual computing system may comprise a central or computing processing unit (CPU), a graphical processing unit (GPU), a memory unit and a storage device to store digital files. When processing a multimedia stream, the parameters and the operating software of the neural network system are first loaded from the storage to the memory” [0044].). With respect to claim 18, the claim recites similar limitations corresponding to claim 2, therefore the same rationale of rejection is applicable. Claims 3-7, 12-16, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Autès in view of Goyal, further in view of Agrawal and Parida et al. (“Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos”), hereinafter Parida. With respect to claim 3, the combination of Autès in view of Goyal, further in view of Agrawal teaches: the computer-implemented method of claim 1, wherein training, using the labelled training content, the content classification model comprises: encoding, using a text encoding model, a text label describing the first content, the encoding resulting in a label encoding comprising a … point in a vector space (Autès discloses a sequence of words (‘text label’) is converted (‘encoded’) to a single vector “the system takes as its input a digital media file, which in this example is a video … the data pre-processing unit 5 is configured to pre-process the input video file as will be explained later in more detail and output a sequence or stream of image frames 7, audio clips 9 and text portions or words 11, which may be subtitles, a video summary, a script, reviews etc related to the audio frames and/or audio clips … The system further comprises … a text processing block or unit 17 for converting or transforming a respective sequence of words into a respective single text feature vector” [0021]. Autès further discloses “the text processing unit 17 comprises a set of text processing elements 43, which in this example are word embedding matrices 43 trained for text or word processing. Word embedding … where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it typically involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension … Each one of the word embedding matrices 43 is arranged to process the received word and output a text feature vector 45 for the received word … The text feature vectors 45 are then arranged to be fed into a fourth artificial neural network 47, which in this example is a second convolution through time (CTT) network … The second CTT network 47 is configured to process the incoming text feature vectors to output the single text feature vector 19” [0024].); encoding, using an image encoding model, a video component of the first content, the encoding resulting in a video encoding … (Autès discloses “the audio and image processing unit 13 comprises a set of first artificial neural networks, which in this example are convolutional neural networks (CNNs) 29 trained for image processing and referred to as a set of image CNNs. 
The image CNNs receive at their inputs the sequence of image frames 7 (four image frames in the example shown in FIG. 1), such that one image CNN 29 is arranged to receive and process one image frame. The image input is a sequence of t consecutive frames separated by a time duration s, which may be e.g. between 0.1 seconds and 10 seconds or more specifically between 0.5 seconds and 3 seconds. Each one of the image CNNs 29 is arranged to process the received image frame and output an image feature vector 31” [0022].); encoding, using an audio encoding model, an audio component of the first content, the encoding resulting in an audio encoding … (Autès discloses “the audio and image processing unit 13 also comprises a set of second artificial neural networks, which in this example are CNNs 33 trained for audio processing and/or recognition and referred to as a set of audio CNNs. The audio CNNs receive at their inputs the sequence of audio clips 9 of a given length (four audio clips in the example shown in FIG. 1), such that one audio CNN 33 is arranged to receive one audio clip. In this example, the audio input is a sequence of t audio clips with a duration, which equals s in this particular example. Each one of the audio CNNs 33 is arranged to process the received audio clip and output an audio feature vector 35” [0022].);

However, Autès does not teach label, video and audio encodings comprising a multidimensional point in a vector space, and adjusting encoding models to minimize a distance between encodings, which is taught by Parida: encoding, using a text encoding model, a text label describing the first content, the encoding resulting in a label encoding comprising a multidimensional point in a vector space (Parida discloses an encoding model consisting of a neural network f_x to obtain a vector representation of an input and a corresponding multilayer perceptron (MLP) neural network g_x to project a vector into an embedding space (‘vector space’), “we represent each of the three types of inputs, ie. audio, video, and text, using the corresponding state-of-the-art neural networks outputs which we denote as f_a(·), f_v(·), f_t(·). We project each representation with corresponding neural networks which are small MLPs, denoted as g_a(·), g_v(·), g_t(·) with parameters … Finally, the representations are obtained by passing the input audio/video/text through the corresponding networks sequentially, ie. x = g_x ∘ f_x(X) where x ∈ {a, v, t} and X is the corresponding raw audio/video/text input” (P. 3253, Sec. 3, Last Paragraph). Parida further discloses “finally the text network, denoted as f_t(·) is the well known word2vec network pretrained on Wikipedia [42] with output dimension of 300D” (P. 3255, Sec. 5, First Paragraph). Parida discloses text inputs are class labels, “we study the problem of ZSL for videos with general classes like, ‘dog’, ‘sewing machine’, ‘ambulance’, ‘camera’, ‘rain’, and propose to use audio modality in addition to the visual modality … Our focus here is on leveraging both audio and video modalities to learn a joint projection space for audio, video and text (class labels). In such an embedding space, ZSL tasks can be formulated as nearest neighbor searches” (P. 3251, Sec. 1, Last Paragraph).); encoding, using an image encoding model, a video component of the first content, the encoding resulting in a video encoding comprising a multidimensional point in the vector space (Parida discloses “the video network, denoted as f_v(·) is an inflated 3D CNN network which is pretrained on the Kinetics dataset [40] and a large video dataset of action recognition. We also obtain the video features form the layer before the classification layer and average them to get a vector of 1024D” (P. 3255, Sec. 5, First Paragraph). Parida discloses projection network g_v(·) is used to project a video vector representation into an embedding space (P. 3253, Sec. 3, Last Paragraph).); encoding, using an audio encoding model, an audio component of the first content, the encoding resulting in an audio encoding comprising a multidimensional point in the vector space (Parida discloses “the audio network f_a(·) is based on that of [41], and is trained on the spectrogram of the audio clips in the train set of our dataset. We obtain the audio features after seven conv layers of the network, and average them to obtain 1024D vector” (P. 3255, Sec. 5, First Paragraph). Parida discloses projection network g_a(·) is used to project an audio vector representation into an embedding space (P. 3253, Sec. 3, Last Paragraph).); and adjusting, to minimize a distance between the label encoding, the video encoding, and the audio encoding in the vector space, the text encoding model, the image encoding model, and the audio encoding model (Parida discloses Figure 1 on P. 3251. Figure 1 depicts a multidimensional embedding space (‘vector space’) where video, audio, and text label embedding vectors are embedded. A nearest neighbor search is performed to classify the embedding vectors based on their proximity to a class embedding (text label embedding). [Parida Figure 1 is reproduced in the original Office Action.] Parida discloses “our focus here is on leveraging both audio and video modalities to learn a joint projection space for audio, video and text (class labels). In such an embedding space, ZSL tasks can be formulated as nearest neighbor searches (Fig. 1 illustrates the point). When doing classification, a new test video is embedded into the space and the nearest class embedding is predicted to be its class” (P. 3251-3252, Sec. 1). Parida further discloses “we use nearest neighbor in the embedding space for making predictions. In the case of classification, the audio and video are embedded in the space and the class embedding with the minimum average distance with them is taken as the prediction” (P. 3254, Sec. 3, First Paragraph). Parida discloses “we propose cross-modal extensions of the embedding based ZSL approach based on triplet loss for learning such a joint embedding space. We optimize an objective based on (i) two cross-modal triplet losses, one each for ensuring compatibility between the text (class labels) and the video, and the text and the audio, and (ii) another loss based on crossmodal compatibility of the audio and visual embeddings. While the triplet losses encourage the audio and video embeddings to come closer to respective class embeddings in the common space, the audio-visual crossmodal loss encourages the audio and video embeddings from the same sample to be similar. These losses together ensure that the three embeddings of the same class are closer to each other relative to their distance from those of different classes” (P. 3252, Sec. 1, First Paragraph). See Equations 1-4 on P. 3253, Sec. 3 (defining the loss functions). Parida discloses Equation 5 on P. 3253, Sec. 3. Equation 5 is an optimization function that optimizes (‘adjusts’) the parameters of the audio, video, and text label projection networks. Projection networks take input vector representations and project them into a joint embedding space. The Loss Function (Equation 4) is the weighted average of the triplet losses that ensure embeddings of the same class (text label) are closer together. [Parida Equation 5 is reproduced in the original Office Action.]).

Parida teaches optimizing text, video, and audio neural network models (‘encoding models’) to minimize distances between multidimensional embeddings (‘encodings’) in a joint embedding space is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the method of Autès with the technique disclosed by Parida to capture complex relationships of input data. By representing video, text, and audio inputs as multidimensional embeddings, the embeddings are able to represent complex data. In a joint embedding space, the positions of embeddings can be used to calculate and compare distances between other embeddings in many directions, thereby revealing intricate relationships. By revealing complex relationships between embeddings, a model is able to achieve a deeper understanding of the input data and its underlying patterns, thereby improving a model’s ability to make accurate predictions.

With respect to claim 4, the combination of Autès in view of Goyal, further in view of Agrawal and Parida teaches: the computer-implemented method of claim 3, wherein adjusting the text encoding model, the image encoding model, and the audio encoding model comprises: first adjusting, to minimize a first plurality of distances, the text encoding model and the audio encoding model, each distance in the first plurality of distances comprising a distance between a label encoding and a corresponding audio encoding in the vector space (Parida discloses Equation 1 on P. 3253, Sec. 3. Equation 1 is a bimodal loss function between text and audio embeddings (‘encodings’) that forces audio embeddings to be closer to text (‘label’) embeddings in a joint embedding space (‘vector space’). The loss function is part of an optimization function (Equation 5, discussed above) that optimizes (‘adjusts’) the parameters of the audio, video, and text label projection networks responsible for transforming (‘encoding’) vector representations into embeddings in a joint embedding space. [Parida Equation 1 is reproduced in the original Office Action.] Parida discloses “the class constraints are enforced using bimodal triplet losses between audio and text, and video and text embeddings. Denoting ai , vi , ti as the audio, video and text embedding … These losses force the audio and video embeddings to be closer to the correct class embedding by a margin δ > 0” (P. 3253, Sec. 3, First Paragraph).); second adjusting, to minimize a second plurality of distances, the text encoding model and the image encoding model, each distance in the second plurality of distances comprising a distance between a label encoding and a corresponding video encoding in the vector space, the second adjusting performed subsequent to the first adjusting (Parida discloses Equation 2 on P. 3253, Sec. 3. Equation 2 is a bimodal loss function between text and video embeddings that forces video embeddings to be closer to text (‘label’) embeddings in a joint embedding space. The loss function is part of an optimization function (Equation 5, discussed above) that optimizes (‘adjusts’) the parameters of the audio, video, and text label projection networks responsible for transforming (‘encoding’) vector representations into embeddings in a joint embedding space. [Parida Equation 2 is reproduced in the original Office Action.]); and third adjusting, to minimize a third plurality of distances, the text encoding model and the audio encoding model, each distance in the third plurality of distances comprising a distance between a label encoding and a corresponding audio encoding in the vector space, the third adjusting performed subsequent to the second adjusting (Parida discloses Equation 1 (discussed above) on P. 3253, Sec. 3. Equation 1 is a bimodal loss function between text and audio embeddings that forces audio embeddings to be closer to text (‘label’) embeddings in a joint embedding space. The loss function is part of an optimization function (Equation 5, discussed above) that optimizes the parameters of the audio, video, and text label projection networks responsible for transforming vector representations into embeddings in a joint embedding space. See Figure 1 (discussed above) depicting a plurality of audio embeddings that are closest to (have a minimal distance between) text embeddings. Parida discloses “we keep the initial network parameters fixed to be that of the pretrained networks and optimize over the parameters of the projection networks … We train for the parameters using standard backpropagation for neural networks” (P. 3253, Sec. 3, Last Paragraph).). The claimed invention in the instant application is directed to a method of predicting a text label for audio and video encodings by adjusting text, audio, and video encoding models to minimize a plurality of distances between encodings. At the time of the invention, using loss functions to optimize encoder neural networks to bring video, text label, and audio embeddings closer together was known in the art as disclosed by Parida. Using video, text label, and audio embeddings in a joint embedding space to classify videos based on distance was known since audio can reveal objects not visible in a video frame (Parida, P. 3251, Sec. 1, Paragraphs 2-3). The Examiner finds that the three adjusting steps were known in the art before the effective filing date. While Parida does not disclose the adjusting steps in the order specified by the claimed invention, the Examiner finds that it would have been obvious before the effective filing date of the claimed invention to perform the adjusting steps in the claimed order with a reasonable expectation of success since there are a finite number of permutations of orders.
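
To make the joint-embedding arrangement mapped onto claims 3 and 4 concrete, the following is a minimal training sketch in the style of the Parida equations cited above: small projection heads g_a, g_v, g_t map pretrained audio, video, and text features into one space, triplet-style losses pull same-class embeddings together, and an audio-visual term keeps embeddings of the same sample close. The layer sizes, margin, and loss weights are illustrative assumptions, not values taken from Parida or from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Projection head (g_a, g_v, or g_t) into a shared embedding space."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim))

    def forward(self, x):
        # L2-normalize so distances in the joint space are comparable.
        return F.normalize(self.net(x), dim=-1)

# Feature dimensions echo the 1024-D audio/video and 300-D word2vec features
# quoted from Parida above; the 256-D joint space is an assumption.
g_a, g_v, g_t = Projector(1024), Projector(1024), Projector(300)

def triplet(anchor, positive, negative, margin=0.2):
    # Hinge on squared distances: pull the anchor toward the correct class
    # embedding and away from an incorrect one by at least `margin`.
    d_pos = (anchor - positive).pow(2).sum(-1)
    d_neg = (anchor - negative).pow(2).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

def joint_loss(feat_a, feat_v, label_pos, label_neg, weights=(1.0, 1.0, 0.5)):
    a, v = g_a(feat_a), g_v(feat_v)
    t_pos, t_neg = g_t(label_pos), g_t(label_neg)
    loss_at = triplet(a, t_pos, t_neg)            # audio <-> class-label term
    loss_vt = triplet(v, t_pos, t_neg)            # video <-> class-label term
    loss_av = (a - v).pow(2).sum(-1).mean()       # audio <-> video compatibility term
    return weights[0] * loss_at + weights[1] * loss_vt + weights[2] * loss_av

# One illustrative optimization step over random stand-in features.
params = [*g_a.parameters(), *g_v.parameters(), *g_t.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)
loss = joint_loss(torch.randn(8, 1024), torch.randn(8, 1024),
                  torch.randn(8, 300), torch.randn(8, 300))
loss.backward()
optimizer.step()
```

This sketch optimizes only the projection heads, consistent with the quoted statement that the pretrained network parameters are kept fixed.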
The courts have found rearrangement of parts to be obvious if such a rearrangement does not modify the operation of a device or is an obvious matter of design choice, “In re Japikse, 181 F.2d 1019, 86 USPQ 70 (CCPA 1950) (Claims to a hydraulic power press which read on the prior art except with regard to the position of the starting switch were held unpatentable because shifting the position of the starting switch would not have modified the operation of the device.); In re Kuhle, 526 F.2d 553, 188 USPQ 7 (CCPA 1975) (the particular placement of a contact in a conductivity measuring device was held to be an obvious matter of design choice)”, see MPEP § 2144.04(vi)(c). One of ordinary skill in the art having problems with adjusting encoder models would have looked to perform the adjustments steps of Parida in the claimed order since the claimed order would not modify classification results and is merely a matter of design choice. With respect to claim 5, the combination of Autès in view of Goyal, further in view of Agrawal and Parida teaches: the computer-implemented method of claim 3, wherein determining, using the trained content classification model, the label describing the first content comprises: encoding, using the image encoding model, a video component of the first content, the encoding resulting in a first video encoding (Parida discloses an image encoding model consists of neural networks f v and g v . Parida discloses “the video network, denoted as f v ∙ is an inflated 3D CNN network which is pretrained on the Kinetics dataset [40] and a large video dataset of action recognition. We also obtain the video features form the layer before the classification layer and average them to get a vector of 1024D” (P. 3255, Sec. 5, First Paragraph). Parida discloses projection network   g v ∙ is used to project a video vector representation into an embedding space (P. 3253, Sec. 3, Last Paragraph).); and determining a label encoding in a set of label encodings that is closest to the first video encoding, the label encoding comprising an encoding of the label describing the first content (Parida discloses “we study the problem of ZSL for videos with general classes like, ‘dog’, ‘sewing machine’, ‘ambulance’, ‘camera’, ‘rain’, and propose to use audio modality in addition to the visual modality … Our focus here is on leveraging both audio and video modalities to learn a joint projection space for audio, video and text (class labels). In such an embedding space, ZSL tasks can be formulated as nearest neighbor searches (Fig. 1 illustrates the point). When doing classification, a new test video is embedded into the space and the nearest class embedding is predicted to be its class” (P. 3251-3252, Sec. 1). See Figure 1 (reproduced above) depicting classifying video encodings based on their proximity to a class embedding (text label embedding).). With respect to claim 6, the combination of Autès in view of Goyal, further in view of Agrawal and Parida teaches: the computer-implemented method of claim 3, wherein determining, using the trained content classification model, the label describing the first content comprises: encoding, using the audio encoding model, an audio component of the first content, the encoding resulting in a first audio encoding (Parida discloses an audio encoding model consists of neural networks f a and g a . Parida discloses “the audio network f a ∙ is based on that of [41], and is trained on the spectrogram of the audio clips in the train set of our dataset. 
We obtain the audio features after seven conv layers of the network, and average them to obtain 1024D vector” (P. 3255, Sec. 5, First Paragraph). Parida discloses projection network   g a ∙ is used to project an audio vector representation into an embedding space (P. 3253, Sec. 3, Last Paragraph).); and determining a label encoding in a set of label encodings that is closest to the first audio encoding, the label encoding comprising an encoding of the label describing the first content (Parida discloses “we use nearest neighbor in the embedding space for making predictions. In the case of classification, the audio and video are embedded in the space and the class embedding with the minimum average distance with them is taken as the prediction” (P. 3254, Sec. 3, First Paragraph). Parida discloses “we propose cross-modal extensions of the embedding based ZSL approach based on triplet loss for learning such a joint embedding space. We optimize an objective based on (i) two cross-modal triplet losses, one each for ensuring compatibility between the text (class labels) and the video, and the text and the audio, and (ii) another loss based on crossmodal compatibility of the audio and visual embeddings” (P. 3252, Sec. 1, First Paragraph). See Figure 1 (reproduced above) depicting classifying audio encodings based on their proximity to a class embedding (text label embedding).). With respect to claim 7, the combination of Autès in view of Goyal, further in view of Agrawal and Parida teaches: the computer-implemented method of claim 3, further comprising: adding, to the set of categories a new content category (The Examiner interprets “new content category” according to its broadest reasonable interpretation as encompassing unseen classes as disclosed by Parida throughout the cited pages. Parida discloses “to create the seen, unseen splits for ZSL tasks, we selected a total of 10 classes spanning all the groups as the zero-shot classes (marked with ‘*’ in Fig. 3). We ensure that the unseen classes have minimal overlap with the Kinetics dataset [40] training classes as we use CNNs pre-trained on that. We do so by not choosing any class whose class embedding similarity is greater than 0.8 with any of the Kinetics train class embeddings in the word2vec space. We finally split, both the seen and unseen classes, as 60 − 20 − 20 into train, validation and test sets. We set the protocol to be as follows. Train on the train classes and then test on seen class examples and unseen class examples, both being classified into one of all the classes” (P. 3255, Sec. 4, Last Two Paragraphs).); generating, for the new content category, a plurality of labels, each label in the plurality of labels comprising a text description of content in the new content category (Parida discloses “we report the mean class accuracy (% mAcc) for the classification task and the mean average precision (% mAP) for the retrieval task. The performance for the seen (S) and unseen (U) classes are obtained after classification (retrieval) over all the classes (S and U). The harmonic mean HM of S and U indicates how well the system performs on both seen and unseen categories on average. For classification, we classify each test example, and for retrieval, we perform leave-one-out testing, ie. each test example is considered as a query” (P. 3255, Sec. 5, Third Paragraph). Parida discloses “our focus here is on leveraging both audio and video modalities to learn a joint projection space for audio, video and text (class labels). 
In such an embedding space, ZSL tasks can be formulated as nearest neighbor searches (Fig. 1 illustrates the point). When doing classification, a new test video is embedded into the space and the nearest class embedding is predicted to be its class” (P. 3251-3252, Sec. 1). Parida discloses Figure 5 on P. 3258 depicting classification results after classifying unseen class inputs. The text inputs (unseen classes) depicted in Figure 5 are projected onto a joint projection space and act as class labels (class embeddings) describing the nearby audio and video embeddings.); encoding, using the text encoding model, each of the plurality of labels, the encoding resulting in plurality of label encodings, each label encoding in the plurality of label encodings comprising a multidimensional point in a vector space (Parida discloses a text encoding model consists of neural networks f t and g t . Parida discloses “finally the text network, denoted as f t ∙ is the well known word2vec network pretrained on Wikipedia [42] with output dimension of 300D” (P. 3255, Sec. 5, First Paragraph). Parida discloses projection network   g t ∙ is used to project a text vector representation into an embedding space (P. 3253, Sec. 3, Last Paragraph). See Figure 1 (reproduced above) depicting text label embeddings in a multidimensional joint embedding space.); determining, using the trained content classification model, a label describing a first content (Autès discloses an age suitability class (‘a label’), “the MLP 51 determines age suitability class probabilities by using the single concatenated audio, image and text feature vector 23. In other words, a vector of probabilities is computed or determined so that the number of entries in the vector equals the number of possible age suitability categories. … In practice, this step may be implemented so that that the highest probability value is selected from the probability vector and the assigned age suitability class is the class corresponding to that probability value” [0039-0040]. See also Figure 1 depicting how a multilayer perceptron (MLP) network is part of a classifier unit (‘a content classification model’).); determining, using the trained content classification model, a second label describing a second content (Autès discloses “a method of classifying a media includes receiving a media file and extracting therefrom first and second data streams including first and second media content, respectively, the media content being associated with the media item. First and second feature vectors describing the first and second media content, respectively, are generated. At least a first single feature vector representing the first sequence of first feature vectors and the second sequence of second feature vectors is generated” [Abstract]. Autès further discloses “in step 125, an age suitability class is assigned to the data stream under consideration. This step may be carried out by the post-processing unit 27. In other words, in this step, a sequence or stream classification is carried out. In practice, this step may be implemented so that that the highest probability value is selected from the probability vector and the assigned age suitability class is the class corresponding to that probability value … the process continues in step 103 where the following streams are extracted. The process may be then repeated as many times as desired, and it can be stopped at any moment. 
More specifically, once the first sequence has been processed consisting of t audio and image frames and a first number of words, then the process continues to a second or next image frame and/or audio clip and includes these items and a given number of subsequent audio frames, audio clips and words into the next sequence” [0040].); and classifying, into the new content category using the second label, the second content (Autès discloses an age suitability class (‘second label’) is used to classify content as compatible or incompatible, “in step 125, an age suitability class is assigned to the data stream under consideration … In step 129, it is determined whether or not the assigned age suitability class is compatible with the selection received in step 127. More specifically, it is determined whether or not the assigned age suitability class is the same as the selected age suitability class or is within the range of the allowed age suitability classes for this user or viewer. In the affirmative, in step 131, the stream is displayed or played to the viewer and the process then continues in step 103. If the assigned class is not compatible with the selection, then in step 133, it is decided not to show the stream in question to the viewer. After this step, the process continues in step 103 where the following streams are extracted. The process may be then repeated as many times as desired” [0040].). Parida teaches generating labels for unseen classes (‘new content categories’) is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the method of Autès with the technique disclosed by Parida to improve classification accuracy. By introducing unseen classes, a model can improve its ability to generalize and classify a broader range of data since it can learn new underlying relationships within the data. A model can then use the underlying relationships to make better decisions which leads to higher classification accuracy and model performance. With respect to claims 12 and 19, the claims recite similar limitations corresponding to claim 3, therefore the same rationale of rejection is applicable. With respect to claims 13 and 20, the claims recite similar limitations corresponding to claim 4, therefore the same rationale of rejection is applicable. With respect to claim 14, the claim recites similar limitations corresponding to claim 5, therefore the same rationale of rejection is applicable. With respect to claim 15, the claim recites similar limitations corresponding to claim 6, therefore the same rationale of rejection is applicable. With respect to claim 16, the claim recites similar limitations corresponding to claim 7, therefore the same rationale of rejection is applicable. Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Autès in view of Goyal, further in view of Agrawal and Gale (US 20210256515 A1). With respect to claim 9, the combination of Autès in view of Goyal, further in view of Agrawal teaches: the computer program product of claim 8, wherein the stored program instructions are stored in a computer readable storage device in a data processing system (Autès discloses “the actual computing system may comprise a central or computing processing unit (CPU), a graphical processing unit (GPU), a memory unit and a storage device to store digital files. 
When processing a multimedia stream, the parameters and the operating software of the neural network system are first loaded from the storage to the memory” [0044].). However, Autès does not teach stored program instructions transferred over a network from a remote data processing system, which is taught by Gale: and wherein the stored program instructions are transferred over a network from a remote data processing system (Gale discloses “the program instructions are stored in a computer-readable storage medium in a data processing system and are transferred over a network from a remote data processing system” [0020].). Gale teaches transferring program instructions over a network from a remote data processing system is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the method of Autès with the technique disclosed by Gale to provide remote access to program instructions. By providing remote access to program instructions, users can download the most recent version of a program and access a program from various local devices, thus, users can avoid performance problems due to outdated software. With respect to claim 10, the combination of Autès in view of Goyal, further in view of Agrawal and Gale teaches: the computer program product of claim 8, wherein the stored program instructions are stored in a computer readable storage device in a server data processing system (Gale discloses “the program instructions are stored in a computer-readable storage medium in a server data processing system, and downloaded over a network to a remote data processing system for use in a computer-readable storage medium associated with the remote data processing system” [0021].), and wherein the stored program instructions are downloaded in response to a request over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system, further comprising (Gale discloses “the program instructions are stored in a computer-readable storage medium in a server data processing system, and downloaded over a network to a remote data processing system for use in a computer-readable storage medium associated with the remote data processing system” [0021].): program instructions to meter use of the program instructions associated with the request (Gale discloses “further comprise program instructions to meter usage of computer usable code in response to a request for the usage, and generate one or more invoices based on the metered usage” [0021].); and program instructions to generate an invoice based on the metered use (Gale discloses “further comprise program instructions to meter usage of computer usable code in response to a request for the usage, and generate one or more invoices based on the metered usage” [0021].). Gale teaches requesting remote program instructions and generating an invoice based on metered computer code usage is a known method in the art. Before the effective filing date of the claimed invention, it would have been obvious to combine the method of Autès with the technique disclosed by Gale to keep track of individual user resource consumption. By keeping track of each individual’s resource consumption, users can be billed only for the resources they use, thereby allowing users to save money and make cost-effective decisions to adapt their resource usage to suit their needs. 
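
The claim 5-7 mappings earlier in this action reduce, on the Examiner's reading of Parida, to a nearest-neighbor search over label encodings, with a new category added simply by encoding its label text into the same space. A minimal sketch under that reading follows; the category names and vectors are hypothetical placeholders, not data from the application or the cited art.

```python
import numpy as np

# Hypothetical label encodings standing in for the output of a text encoder
# (e.g., the g_t projection head sketched earlier); the values are placeholders.
label_encodings = {
    "violence": np.array([0.9, 0.1, 0.0]),
    "profanity": np.array([0.1, 0.8, 0.2]),
    "family-friendly": np.array([0.0, 0.2, 0.9]),
}

def nearest_label(content_encoding: np.ndarray) -> str:
    # Claim 5/6 style: the label whose encoding is closest to the video or
    # audio encoding of the content is taken as the label describing it.
    return min(label_encodings,
               key=lambda name: np.linalg.norm(label_encodings[name] - content_encoding))

# Claim 7 style: adding a new content category only requires encoding its
# label text and inserting it into the set of label encodings.
label_encodings["graphic-injury"] = np.array([0.8, 0.5, 0.1])   # hypothetical new category

video_encoding = np.array([0.85, 0.15, 0.05])
print(nearest_label(video_encoding))    # -> "violence" for this toy vector
```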
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PEDRO J MORALES whose telephone number is (571)272-6106. The examiner can normally be reached 8:30 AM - 6:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MIRANDA M HUANG, can be reached at (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PEDRO J MORALES/
Examiner, Art Unit 2124

/VINCENT GONZALES/
Primary Examiner, Art Unit 2124
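
Taken together, the claim 1 mapping in this action describes a pipeline: filled label templates (Goyal) supply textual training labels, fine-grained labels are collapsed into a coarse-grained label for a whole content unit (Agrawal), a classifier is trained and applied, and content classified as inappropriate is removed from its storage location (Autès). The end-to-end sketch below is illustrative only; the template wording, the category lookup, and the `classify` stand-in are assumptions rather than anything disclosed in the application or the cited references.

```python
from pathlib import Path

def fill_template(template: str, objects: list[str]) -> str:
    # Goyal-style label template: "[something]" slots are filled with object
    # names supplied at labeling time to produce a textual training label.
    for obj in objects:
        template = template.replace("[something]", obj, 1)
    return template

def to_coarse_label(fine_labels: list[str]) -> str:
    # Agrawal-style coarse-graining: several fine-grained labels, each describing
    # a portion of a content unit, collapse into one label for the whole unit.
    lookup = {frozenset({"lamp", "couch", "chair"}): "living room"}   # illustrative mapping
    return lookup.get(frozenset(fine_labels), " / ".join(sorted(fine_labels)))

def classify(label: str, inappropriate_labels: set[str]) -> str:
    # Stand-in for the trained content classification model's category decision.
    return "inappropriate" if label in inappropriate_labels else "acceptable"

def audit(content_path: Path, label: str, inappropriate_labels: set[str]) -> None:
    # Claim 1's final step as mapped: remove content classified as inappropriate
    # from its storage location.
    if classify(label, inappropriate_labels) == "inappropriate" and content_path.exists():
        content_path.unlink()

print(fill_template("Dropping [something] into [something]", ["a key", "a cup"]))
# -> "Dropping a key into a cup"
print(to_coarse_label(["lamp", "chair", "couch"]))
# -> "living room"
```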

Prosecution Timeline

Oct 17, 2022: Application Filed
Aug 01, 2025: Non-Final Rejection — §103
Oct 17, 2025: Examiner Interview Summary
Oct 17, 2025: Applicant Interview (Telephonic)
Nov 03, 2025: Response Filed
Dec 30, 2025: Final Rejection — §103
Mar 05, 2026: Request for Continued Examination
Mar 13, 2026: Response after Non-Final Action
Mar 26, 2026: Non-Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12591803: SYSTEMS AND METHODS FOR APPLYING MACHINE LEARNING BASED ANOMALY DETECTION IN A CONSTRAINED NETWORK (granted Mar 31, 2026; 2y 5m to grant)
Patent 12530412: SEARCH-QUERY SUGGESTIONS USING REINFORCEMENT LEARNING (granted Jan 20, 2026; 2y 5m to grant)
Patent 12524673: MULTITASK DISTRIBUTED LEARNING SYSTEM AND METHOD BASED ON LOTTERY TICKET NEURAL NETWORK (granted Jan 13, 2026; 2y 5m to grant)
Study what changed to get past this examiner. Based on 3 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 67%
Grant Probability With Interview: 99% (+50.0%)
Median Time to Grant: 3y 11m
PTA Risk: High
Based on 9 resolved cases by this examiner. Grant probability derived from career allow rate.
