Prosecution Insights
Last updated: April 19, 2026
Application No. 18/729,140

METHOD FOR GENERATING A FEATURE ENCODING MODEL, METHOD FOR AUDIO DETERMINATION, AND A RELATED APPARATUS

Non-Final OA: §102, §103, §112
Filed: Jul 15, 2024
Examiner: MANOHARAN, SHASHIDHAR SHANKAR
Art Unit: 2655
Tech Center: 2600 (Communications)
Assignee: BEIJING YOUZHUJU NETWORK TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 100% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 100% (2 granted / 2 resolved), +38.0% vs. TC average (above average)
Interview Lift: +0.0% (minimal lift, based on resolved cases with interview)
Avg Prosecution: 2y 9m (typical timeline)
Total Applications: 16 across all art units (14 currently pending)

Statute-Specific Performance

§101: 24.1% (-15.9% vs. TC avg)
§102: 8.6% (-31.4% vs. TC avg)
§103: 55.2% (+15.2% vs. TC avg)
§112: 12.1% (-27.9% vs. TC avg)
Based on career data from 2 resolved cases; TC averages are estimates.

Office Action

Rejections under §102, §103, and §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation

Claim 18 recites the claim language “at least [one] of: a spectrum feature, a Mel-spectrum feature, a spectrogram feature, and a constant-Q transform (CQT) feature.” The Superguide Corp. v. DirecTV Enterprises, Inc., 69 USPQ2d 1865 (Fed. Cir. 2004) decision regarding the claim interpretation of “at least one of x, y, and z” at pages 15-16 set forth the rationale for determining that the term “and” is conjunctive (i.e., at least one of x, at least one of y, and at least one of z). Therefore, the plain meaning of the current claim language “a spectrum feature, a Mel-spectrum feature, a spectrogram feature, and a constant-Q transform (CQT) feature” in light of the specification is interpreted to be “at least one of the spectrum feature, at least one of the Mel-spectrum feature, at least one of the spectrogram feature, and at least one of the constant-Q transform (CQT) feature.” See also MPEP 2111.01 on the subject of plain meaning given to claim terms.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 18 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claim 18, the phrase “comprises at least of” is vague and does not specify how many features in the list of features the audio features must include. For examination purposes, the claim was interpreted as “comprises at least one of,” as written in P[0056] of the specification.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claim(s) 12-17, 19-24 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Du et al. (hereinafter Du) (BYTECOVER: COVER SONG IDENTIFICATION VIA MULTI-LOSS TRAINING).

Regarding claim 12, Du teaches a method for generating a feature encoding model, comprising: acquiring a plurality of sample audios marked with category labels (Du, Page 1: “The first category of methods, e.g., [8–10], treats CSI as a multi-class classification problem, where each version group is considered as an unique class.”; Du, Pages 2-3: “Our ByteCover model was trained on the training subset of SHS100K”, Du, Page 3: “which is collected from Second Hand Songs website by, consisting of 8858 songs with various covers and 108523 recordings. For this data, we follow the settings of.
The dataset is split into the training, validation, and test sets with a ratio of 8:1:1.”; Du, Page 3: “In terms of the classification loss Lcls, the CE refers the cross entropy function and y is the ground truth label.”, reads on acquiring a plurality of sample audios marked with category labels);

extracting audio features of the plurality of sample audios (Du, Page 2: “As shown in the figure, ByteCover takes as input a constant-Q transform (CQT) spectrogram and employs a CNN-based model, i.e., ResNet-IBN, for feature learning.”; Du, Page 2: “audio is resampled to 22050 Hz before the feature extraction. Afterwards, the CQT is downsampled with an averaging factor of 100 along the temporal dimension.”, reads on extracting audio features of the plurality of sample audios);

encoding the audio features of the plurality of sample audios by the feature encoding model to obtain a plurality of encoding vectors of the plurality of sample audios, and performing classification processing on the plurality of sample audios based on the plurality of encoding vectors to obtain category prediction values of the plurality of sample audios (Du, Page 2: “The learned feature map is then compressed by GeMPool to form a fixed-length vector, which is then used to calculate the triplet loss and Softmax loss jointly. The joint training of these two losses are achieved by using the BNNeck.”; Du, Page 3: “Thus the BNNeck module inserts a BN layer before the classifier, which is a no-biased FC layer with weights W, as Figure 1 depicts. The feature yield by the GeM module is denoted as ft here. We let ft pass through a BN layer to generate the feature fc.”; Du, Page 3: “In terms of the classification loss Lcls, the CE refers the cross entropy function and y is the ground truth label. Finally, we train the model with Adam optimizer.”, reads on encoding the audio features to obtain a plurality of encoding vectors and performing classification processing to obtain category prediction values);

and determining a target loss value of a target loss function based on the plurality of encoding vectors, the category prediction values of the plurality of sample audios and the category labels of the plurality of sample audios, and updating a parameter of the feature encoding model based on the target loss value to reduce a difference between the encoding vectors of the sample audios of a same category, to increase a difference between the encoding vectors of the sample audios of different categories, and to reduce a difference between the category prediction values and the category labels of the plurality of sample audios, so as to obtain the trained feature encoding model (Du, Page 3: “Overall, the objective function used for training our model can be derived by L = Lcls(fc) + Ltri(ft) = CE(Softmax(Wfc), y) + [dp - dn + α]+, where dp and dn are feature distances of positive pair and negative pair.”; Du, Page 1: “we employ the BNNeck method to allow a multi-loss training and encourage our method to jointly optimize a classification loss and a triplet loss, and by this means, the inter-class discrimination and intra-class compactness of cover songs, can be ensured at the same time.”, reads on determining a target loss value of a target loss function based on vectors and labels and updating a parameter to reduce a difference between encoding vectors of a same category, increase a difference between encoding vectors of different categories, and reduce a difference between prediction values and category labels).

Regarding claim 13, Du teaches the method according to claim 12, wherein determining the target loss value of the target loss function based on the plurality of encoding vectors, the category prediction values of the plurality of sample audios and the category labels of the plurality of sample audios comprises:

determining a predetermined sample set based on the plurality of sample audios (Du, Page 3: “To evaluate the performance of our ByteCover model, we conducted several experiments on four publicly available benchmark datasets… The dataset is split into the training, validation, and test sets”, reads on determining a predetermined sample set because the training set is a fixed, predefined group of sample audios), and constructing a plurality of training sample groups based on the predetermined sample set, each training sample group comprising an anchor sample, a positive sample and a negative sample, wherein the anchor sample is any sample audio in the predetermined sample set, the positive sample is the sample audio in the predetermined sample set, which is of the same category as the anchor sample, and the negative sample is the sample audio in the predetermined sample set, which is not of the same category as the anchor sample (Du, Page 3: “dp and dn are feature distances of positive pair and negative pair”, reads on anchor, positive, and negative samples because the distances are calculated relative to the same reference sample);

determining a first loss value of a first loss function based on the encoding vectors corresponding to samples comprised in each of the training sample groups (Du, Page 2: “Our aim of using ByteCover is to derive from each input music track a single global feature”; Du, Page 3: “Compress the feature map X to a fixed-length vector f”, reads on encoding vectors corresponding to samples comprised in each training sample group as the fixed-length vector f is the encoding vector), the first loss function being used to reflect a difference between the encoding vector of the anchor sample and the encoding vector of the positive sample, and a difference between the encoding vector of the anchor sample and the encoding vector of the negative sample (Du, Page 3: “[dp − dn + α]+ , (3) where dp and dn are feature distances of positive pair and negative pair.”, reads on reflecting differences between anchor-positive and anchor-negative encoding vectors because the loss is computed directly from both distances);

determining a second loss value of a second loss function based on differences between the category prediction values of the plurality of sample audios and the category labels of the plurality of sample audios (Du, Page 3: “CE(Softmax(Wfc), y)”, “y is the ground truth label”, reads on category prediction values and category labels because Softmax outputs predictions that are compared to ground-truth labels using cross-entropy); and determining the target loss value of the target loss function based on the first loss value of the first loss function and the second loss value of the second loss function (Du, Page 3: “L = Lcls(fc) + Ltri(ft)”, reads on determining the target loss value because the overall loss is explicitly defined as a combination of the first and second loss values).
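The objective quoted above, L = Lcls(fc) + Ltri(ft) = CE(Softmax(Wfc), y) + [dp − dn + α]+, can be sketched in a few lines of NumPy. This is a toy illustration of the multi-loss formula as quoted from Du, not Du's actual implementation; the function names, array shapes, and margin value here are assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(f_c, f_anchor, f_pos, f_neg, W, y, alpha=0.3):
    """Toy version of L = CE(Softmax(W f_c), y) + [d_p - d_n + alpha]_+."""
    # classification term: cross-entropy over the BN-processed vectors f_c
    probs = softmax(f_c @ W.T)
    ce = -np.log(probs[np.arange(len(y)), y] + 1e-12).mean()
    # triplet term: hinge on anchor-positive vs. anchor-negative distances
    d_p = np.linalg.norm(f_anchor - f_pos, axis=-1)
    d_n = np.linalg.norm(f_anchor - f_neg, axis=-1)
    tri = np.maximum(d_p - d_n + alpha, 0.0).mean()
    return ce + tri
```

When the positive sample sits at the anchor and the negative is far away, the hinge term vanishes and only the cross-entropy term remains, which is exactly the "intra-class compactness / inter-class discrimination" trade-off the quoted passage describes.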
Regarding claim 14, Du teaches the method according to claim 12, wherein the feature encoding model comprises an encoding network, and encoding the audio features of the plurality of sample audios by the feature encoding model to obtain the plurality of encoding vectors of the plurality of sample audios comprises: encoding the audio features of the plurality of sample audios according to the encoding network to obtain the plurality of encoding vectors of the plurality of sample audios, the encoding network comprising a residual network or a convolutional network, wherein an encoding vector output by the encoding network of the trained feature encoding model is usable as a feature vector of an audio output by the feature encoding model (Du, Page 1: “ByteCover is built based on the classical ResNet model, and two major improvements are designed to further enhance the capability of the model for CSI.”; Du, Page 2: “To transform a ResNet to a model equipped with IBN module for learning an invariant embedding, the residual block which is the basic elements of the model, are replaced with IBN block.”; Du, Page 3: “The compression results f can be given by f = [f1 f2 . . . fK]. The behavior of the GeM pooling can be controlled by adjusting parameter p: the GeM pooling is equivalent to the average pooling when p = 1 and max pooling for p → ∞.”, reads on the encoding network comprising a residual or convolutional network where the output vector is usable as a feature vector).

Regarding claim 15, Du teaches the method according to claim 14, wherein the residual network comprises at least one of an instance normalization (IN) layer and a batch normalization (BN) layer (Du, Page 1: “In the first improvement, we introduce the integration of instance normalization (IN) and batch normalization (BN) to build IBN blocks, which are major components of our ResNet-IBN model.”, reads on the residual network comprising at least one of an IN layer and a BN layer).
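The GeM pooling behavior quoted above (equivalent to average pooling at p = 1 and approaching max pooling as p → ∞) is easy to verify numerically. A minimal NumPy sketch with a hypothetical function name and toy shapes; in ByteCover p is trainable, but it is fixed here for illustration:

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial axes of a C x H x W map."""
    x = np.clip(x, eps, None)  # GeM assumes non-negative activations (e.g., post-ReLU)
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)
```

At p = 1 this reduces exactly to average pooling; as p grows, the generalized mean is dominated by the largest activation in each channel, approaching max pooling.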
Regarding claim 16, Du teaches the method according to claim 14, wherein the encoding network further comprises a GeM pooling layer; and encoding the audio features of the plurality of sample audios according to the encoding network to obtain the plurality of encoding vectors of the plurality of sample audios comprises: encoding the audio features of the plurality of sample audios according to the residual network or the convolutional network to obtain a plurality of initial encoding vectors of the plurality of sample audios; and processing the plurality of initial encoding vectors according to the GeM pooling layer to obtain the plurality of encoding vectors of the plurality of sample audios (Du, Page 3: “The compression results f can be given by f = [f1 f2 . . . fK]. The behavior of the GeM pooling can be controlled by adjusting parameter p: the GeM pooling is equivalent to the average pooling when p = 1 and max pooling for p → ∞. The p is set to a trainable parameter to fully exploit the advantage of deep learning.”, reads on the encoding network comprising a GeM pooling layer to process initial vectors into final encoding vectors).

Regarding claim 17, Du teaches the method according to claim 12, wherein the feature encoding model comprises a BN layer and a classification layer, and the method further comprises: processing the plurality of encoding vectors according to the BN layer to obtain a plurality of regularized encoding vectors (Du, Page 3: “Thus the BNNeck module inserts a BN layer before the classifier, which is a no-biased FC layer with weights W, as Figure 1 depicts. The feature yield by the GeM module is denoted as ft here. We let ft pass through a BN layer to generate the feature fc.”, reads on processing encoding vectors through a BN layer to obtain regularized encoding vectors because Du explicitly passes the extracted features through a BN layer to produce fc); and performing classification processing on the plurality of sample audios based on the plurality of encoding vectors to obtain the category prediction values of the plurality of sample audios comprises: performing classification processing on the plurality of regularized encoding vectors according to the classification layer to obtain the category prediction values of the plurality of sample audios (Du, Page 3: “In the training stage, ft and fc are used to compute triplet and classification losses, respectively.”, reads on performing classification using the BN-processed vectors because Du explicitly uses fc, the output of the BN layer, as input to the classification layer to produce category predictions), wherein an encoding vector output from the BN layer of the trained feature encoding model is usable as a feature vector of an audio output by the feature encoding model (Du, Page 3: “Because the normalized features are more suitable for the calculation of cosine metric, we use the fc to do the computation of the similarity among the performances during the inference and retrieval phases.”, reads on the BN layer output being usable as a feature vector because Du explicitly uses the BN-processed feature fc for similarity computation and retrieval, i.e., as the feature representation of the audio).

Regarding claim 19, claim 19 recites the method presented in claim 12 and is rejected under the same grounds stated above.
Du further teaches: acquiring an audio to be queried, and extracting an audio feature of the audio to be queried (Du, Page 2: “ByteCover takes as input a constant-Q transform (CQT) spectrogram and employs a CNN-based model, i.e., ResNet-IBN, for feature learning.”; Du, Page 2: “Besides this, the audio is resampled to 22050 Hz before the feature extraction. Afterwards, the CQT is downsampled with an averaging factor of 100 along the temporal dimension. This size reduction of input feature improves the efficiency of the model and reduces the latency of our CSI system. As a result, the input audio is processed to a compressed CQT spectrogram S ∈ R 84×T, where the T depends on the duration of the input music.”, reads on acquiring an audio to be queried and extracting its features);

processing, according to a trained feature encoding model, the audio feature of the audio to be queried to obtain a first feature vector of the audio to be queried (Du, Page 3: “The feature yield by the GeM module is denoted as ft here. We let ft pass through a BN layer to generate the feature fc. In the training stage, ft and fc are used to compute triplet and classification losses, respectively.”; Du, Page 3: “Because the normalized features are more suitable for the calculation of cosine metric, we use the fc to do the computation of the similarity among the performances during the inference and retrieval phases.”, reads on processing a query audio using the trained model to obtain a first feature vector [fc]);

and determining, based on a similarity between the first feature vector and second feature vectors of a plurality of candidate audios in a reference feature library, a target candidate audio, being a same audio as the audio to be queried, from the reference feature library, the second feature vectors of the plurality of candidate audios being predetermined by the trained feature encoding model (Du, Page 4: “During the retrieval phase, the cosine distance metric was used to estimate the similarity between two musical performances. Following the evaluation protocol of the Mirex Audio Cover Song Identification Contest 1, the mean average precision (mAP), precision at 10 (P@10), and the mean rank of the first correctly identified cover (MR1) are reported as evaluation metrics.”; Du, Page 1: “The features can be efficiently indexed and retrieved using existing nearest-neighbor search libraries, even for a large dataset.”, reads on determining a target candidate audio by comparing the query’s first feature vector to a library of predetermined candidate feature vectors using similarity metrics).

Regarding claim 20, claim 20 recites the audio determination method corresponding to claim 13 and is rejected under the same grounds stated above.

Regarding claim 21, claim 21 recites the audio determination method corresponding to claim 14 and is rejected under the same grounds stated above.

Regarding claim 22, claim 22 recites the audio determination method corresponding to claim 15 and is rejected under the same grounds stated above.
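The retrieval step mapped to claim 19, comparing the query's feature vector fc against a library of precomputed candidate vectors under a cosine metric, reduces to a normalized dot product followed by a sort. A minimal NumPy sketch with hypothetical names and toy data, standing in for the nearest-neighbor search libraries Du mentions:

```python
import numpy as np

def retrieve(query_vec, library, top_k=1):
    """Rank library rows by cosine similarity to the query; return indices and scores."""
    q = query_vec / np.linalg.norm(query_vec)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q             # cosine similarity per candidate
    order = np.argsort(-sims)  # best match first
    return order[:top_k], sims[order[:top_k]]
```

In a real system the library rows would be the "second feature vectors" produced offline by the trained encoder, and an approximate-nearest-neighbor index would replace the brute-force matrix product for large collections.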
Regarding claim 23, claim 23 recites the audio determination method corresponding to claim 16 and is rejected under the same grounds stated above.

Regarding claim 24, claim 24 recites the audio determination method corresponding to claim 17 and is rejected under the same grounds stated above.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claim(s) 18, 25 is/are rejected under 35 U.S.C. 103 as being unpatentable over Du et al. (hereinafter Du) (BYTECOVER: COVER SONG IDENTIFICATION VIA MULTI-LOSS TRAINING) in view of Phan et al. (hereinafter Phan) (MULTI-VIEW AUDIO AND MUSIC CLASSIFICATION).

Regarding claim 18, Phan teaches the method according to claim 12, wherein the audio features of the plurality of sample audios comprise at least one of: a spectrum feature (Phan, Page 3: “The 2D inputs (i.e. Mel-scale, Gammatone, and CQT spectrogram) have a general size of T × F where T is the number of time frames and F frequency bands.”; Phan, Page 2: “The architecture features six convolutional layers, each associated with Rectified Linear Unit (ReLU) activation [19], batch normalization [20], and a max pooling layer.
The max pooling layers have a common kernel size of 2 × 1 and stride 1 × 1 to reduce size of spectral dimension by half”, reads on audio features comprising a spectrum feature);

a Mel-spectrum feature (Phan, Page 3: “To extract the 2D low-level features, a raw audio signal was transformed into a log Mel-scale spectrogram using F = 64 Mel-scale filters in the frequency range up to Nyquist rate.”; Phan, Page 1: “For audio and music classification, in particular, such an embedding can be learned from a variety of low-level features which have been developed alongside the development of the research field, such as Mel-scaled spectrogram”, reads on audio features comprising a Mel-spectrum feature);

a spectrogram feature (Phan, Page 3: “The 2D inputs (i.e. Mel-scale, Gammatone, and CQT spectrogram) have a general size of T × F where T is the number of time frames and F frequency bands.”; Phan, Page 2: “the CRNNs corresponding to the 2D inputs share a similar network architecture whose configuration is shown in Table 1.”, reads on audio features comprising a spectrogram feature);

and a constant-Q transform (CQT) feature (Phan, Page 3: “Log CQT spectrogram [24] was extracted using Librosa [25] with F = 64 frequency bins, 12 bins per octave, and a hop length of 512 (for 22,050 Hz sampling rate) or 1024 (for 44.1 kHz sampling rate).”, reads on audio features comprising a constant-Q transform (CQT) feature).

Du further teaches a constant-Q transform (CQT) feature (Du, Page 2: “As shown in the figure, ByteCover takes as input a constant-Q transform (CQT) spectrogram and employs a CNN-based model, i.e., ResNet-IBN, for feature learning.”; Du, Page 2: “For the simplification of the overall pipeline, we use the CQT spectrogram rather than the sophisticated features widely used in the CSI.”, reads on audio features comprising a constant-Q transform (CQT) feature).
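As a minimal illustration of the lowest-level feature in the claimed list, a "spectrum feature" can be computed by framing the waveform and taking the magnitude of an FFT per frame. This NumPy sketch is a generic magnitude spectrogram under assumed frame parameters, not Phan's librosa pipeline (which, as quoted, uses 64 Mel or CQT bins):

```python
import numpy as np

def spectrum_frames(x, n_fft=256, hop=128):
    """Magnitude spectrogram: frame the waveform, window, and take |rFFT| per frame."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)          # taper each frame to reduce leakage
    return np.abs(np.fft.rfft(frames, axis=1))   # shape (n_frames, n_fft // 2 + 1)
```

Mel-spectrum and CQT features are further transformations of this same time-frequency representation (a Mel filterbank, or log-spaced constant-Q kernels, applied per frame).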
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to extend the feature encoding model with the CQT feature of Du to include the spectrum, Mel-spectrum, and spectrogram features taught by Phan. While Du focuses on the CQT feature for simplification (Du, Page 2: “For the simplification of the overall pipeline, we use the CQT spectrogram rather than the sophisticated features widely used in the CSI”), Phan teaches that spectrum, Mel-spectrum, and spectrogram features are complementary to the CQT feature, are widely used, and enhance classification performance when fused (Phan, Page 1: “We adopt four low-level features, including Mel-scale spectrogram, Gammatone spectrogram, CQT spectrogram, and raw waveform, which are most widely used for audio and music analysis under deep learning paradigms”). Therefore, extending Du’s model to include the full feature set of Phan is merely the predictable application of a known multi-dimensional feature fusion strategy to improve the discriminative capability of the model.

Regarding claim 25, claim 25 recites the audio determination method corresponding to claim 18 and is rejected under the same grounds stated above.

Claim(s) 26-31 are rejected under 35 U.S.C. 103 as being unpatentable over Du et al. (hereinafter Du) (BYTECOVER: COVER SONG IDENTIFICATION VIA MULTI-LOSS TRAINING) in view of Ren et al. (hereinafter Ren) (CN 113205820 A).
Regarding claim 26, claim 26 recites the electronic device corresponding to the method presented in claim 12. Ren further teaches an electronic device, comprising: a storage device storing at least one computer program thereon; and at least one processing device being used to execute the at least one computer program in the storage device to implement acts comprising (Ren, Page 4: “In a third aspect, the present invention provides a computer device, comprising a memory and a processor, the memory is stored with a computer program, the processor executes the computer program”).

It would have been prima facie obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Du in view of Ren. Doing so would have provided the audio feature extraction and identification system of Du (Du, Abstract) with the specific electronic device architecture and unlabeled pre-training framework for sound event detection of Ren (Ren, Abstract; P[0004]). This combination improves the efficiency of the training process by reducing dependence on strong label samples (Ren, Abstract) and enhances the robustness of the sound encoder through distortion processing (Ren, P[0004]), leading to a more optimized and cost-effective synthesis pipeline that functions effectively in complex, real-world environments like city security and monitoring (Ren, P[0012]).

Regarding claim 27, claim 27 recites the electronic device corresponding to the method presented in claim 13 and is rejected under the same grounds stated above.

Regarding claim 28, claim 28 recites the electronic device corresponding to the method presented in claim 14 and is rejected under the same grounds stated above.

Regarding claim 29, claim 29 recites the electronic device corresponding to the method presented in claim 15 and is rejected under the same grounds stated above.
Regarding claim 30, claim 30 recites the electronic device corresponding to the method presented in claim 16 and is rejected under the same grounds stated above.

Regarding claim 31, claim 31 recites the electronic device corresponding to the method presented in claim 17 and is rejected under the same grounds stated above.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHASHIDHAR S MANOHARAN whose telephone number is (571) 272-6772. The examiner can normally be reached M-F 8:00-4:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SHASHIDHAR SHANKAR MANOHARAN/
Examiner, Art Unit 2655

/ANDREW C FLANDERS/
Supervisory Patent Examiner, Art Unit 2655

Prosecution Timeline

Jul 15, 2024: Application Filed
Jan 23, 2026: Non-Final Rejection under §102, §103, §112 (current)


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 100%
With Interview: 99% (+0.0%)
Median Time to Grant: 2y 9m
PTA Risk: Low
Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
