Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
This action is in response to amendments filed 24 November 2025. Claims 1, 8, 15, and 19 are amended. Claims 1-20 are pending and have been examined.
Response to Arguments
Applicant's arguments, see page 12, filed 24 November 2025, with respect to the objection to Claim 19 have been fully considered and are persuasive. The objection to Claim 19 has been withdrawn.
APPLICANT'S ARGUMENT: Applicant argues (page 12, paragraph 1) that "The Applicant has amended Claim 19 to resolve the informality."
EXAMINER'S RESPONSE: Examiner agrees. The objection to Claim 19 has been withdrawn.
Applicant's arguments, see pages 12-15, filed 24 November 2025, with respect to the rejection of Claims 1-20 under 35 U.S.C. 101 have been fully considered and are persuasive. The rejection has been withdrawn.
Applicant's arguments, see pages 15-17, filed 24 November 2025, with respect to the rejection of Claims 1-20 under 35 U.S.C. 103 have been fully considered but they are not persuasive.
APPLICANT'S ARGUMENT: Applicant argues (page 16, paragraph 1) that "the cited portions of Zhao merely disclose probing a BERT model and discovering certain characteristics regarding the model. Zhao does not disclose or suggest extracting a set of contextual vectors from a plurality of targeted hidden layers; building ... hidden layers, one or more probing layers, and one or more output activation layers on top of the set of contextual vectors; and training the semantic probing model using the set of contextual vectors to predict an amount of semantic information in each contextual vector."
EXAMINER'S RESPONSE: Examiner notes that Applicant's arguments pertain to newly claimed features of amended Claim 1. Amended Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Zhao in view of Chen. The combination of Zhao and Chen is shown to teach the newly recited features of amended Claim 1.
Claim Objections
The objection to Claim 19 is withdrawn in view of arguments and/or amendments.
Claim Rejections - 35 USC § 101
The rejections of Claims 1-20 under 35 U.S.C. 101 are withdrawn in light of arguments and/or amendments.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 4, 8, 11, 15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al., "Quantifying the Contextualization of Word Representations with Semantic Class Probing" (hereinafter "Zhao"), in view of Chen et al., "Shallowing deep networks: Layer-wise pruning based on feature representations" (hereinafter "Chen").
Regarding Claim 1, Zhao teaches:
a method comprising: performing ... semantic probing on a pre-trained machine learning model using one or more textual utterances (Zhao, p. 1, 1 Introduction: "We adopt the methodology of probing ...: diagnostic classifiers are applied to pretrained language model embeddings to determine whether they encode desired syntactic or semantic features" and "We use BERT (Devlin et al., 2019) as our pretrained language model") ... using at least one processor of an electronic device (Zhao, p. 14, A.1 Computing infrastructure: "All experiments are conducted on GeForce GTX 1080 Ti and GeForce GTX 1080") and at least one semantic probing model (Zhao, p. 3, 3.2.2 Probing contextualized embeddings: "We probe BERT with the same setup: a binary classifier is trained for each of the 34 s-classes; each BERT layer is probed individually") ... , wherein performing the semantic probing comprises:
processing the one or more textual utterances (Zhao, p. 5, Figure 4: "S-class probing results for contextualized embedding models. Results are micro F1 on Wiki-PSE test set," where Zhao's test set corresponds to the instant utterances) to determine a performance score for each of a plurality of targeted hidden layers of the pre-trained machine learning model (Zhao, p. 1, 1 Introduction: "We investigate how accurately BERT interprets words in context. ... We quantify how much each additional layer in BERT contributes to contextualization" and 3.2.2 Probing contextualized embeddings: "each BERT layer is probed individually" and p. 5, Figure 4: "S-class probing results for contextualized embedding models. Results are micro F1 on Wiki-PSE test set," which depicts the performance of a contextualized embedding model for each layer), wherein processing the one or more textual utterances comprises:
extracting a set of contextual vectors from the plurality of targeted hidden layers (Zhao, p. 3: "Figure 2: Setups for probing uncontextualized and contextualized embeddings. For BERT, we input a context sentence to extract the contextualized embedding of a word," where Zhao's contextualized embedding corresponds to the instant contextual vector, as in the contextual embedding label in Fig. 2, "one vector per context," and p. 3, 3.2.2 Probing contextualized embeddings: "Table 2, column contexts shows the sizes of train/dev/test when probing BERT," where Zhao's training dataset corresponds to the instant set); ... ;
training the semantic probing model using the set of contextual vectors (Zhao, p. 3, 3.2.2 Probing contextualized embeddings: "Table 2, column contexts shows the sizes of train/dev/test when probing BERT. Figure 2 compares our two probing setups," where Zhao's training dataset corresponds to the instant training) to predict an amount of semantic information in each contextual vector (Zhao, p. 1, Abstract: "We investigate the contextualization of words in BERT. We quantify the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its contextualized embedding"); and
selecting a subset of the targeted hidden layers based on a comparison of the performance scores (Zhao, p. 4, 4.2.2 S-class inference results: "Figure 4 shows contextualized embedding probing results. Comparing BERT layers, a clear trend can be identified: s-class inference performance increases monotonically with higher layers. ... ¶ The very limited contextualization improvement brought by the top two layers may explain why representations from the top layers of BERT can deliver suboptimal performance on NLP tasks ...: the top layers are optimized for the pretraining objective," where Zhao's top two layers corresponds to the instant selected subset).
Zhao teaches a method of performing semantic probing on a BERT pre-trained language model using textual utterances, including processing the utterances to determine a performance score for each targeted hidden layer and selecting a subset of the targeted hidden layers based on a comparison of the performance scores.
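For illustration only and not as part of the record, Zhao's per-layer probing setup may be sketched as follows. This is a minimal sketch assuming a Hugging Face bert-base-uncased model, a [CLS]-position readout as the contextual vector, and a scikit-learn logistic regression as the probing classifier; those specific choices are assumptions of the sketch, not teachings of Zhao.

```python
# Minimal sketch (illustrative only): probe each BERT layer with a linear
# classifier and score it on held-out data, in the style of Zhao.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_vectors(utterances):
    """Extract one contextual vector per utterance from every hidden layer."""
    enc = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.hidden_states holds one [batch, seq, dim] tensor per layer;
    # the [CLS] position is used here as a stand-in contextual vector.
    return [h[:, 0, :].numpy() for h in out.hidden_states[1:]]

def probe_layers(train_utts, train_y, test_utts, test_y):
    """Train one probing classifier per targeted layer; return per-layer
    performance scores (micro F1 on the held-out utterances)."""
    scores = []
    for tr, te in zip(layer_vectors(train_utts), layer_vectors(test_utts)):
        clf = LogisticRegression(max_iter=1000).fit(tr, train_y)
        scores.append(f1_score(test_y, clf.predict(te), average="micro"))
    return scores
```

Comparing the returned scores then exposes layers with little marginal contribution, such as the top two layers Zhao identifies.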
Zhao does not explicitly teach building, within the at least one semantic probing model, one or more hidden layers, one or more probing layers, and one or more output activation layers on top of the set of contextual vectors and selecting ... based on a comparison of ... performance scores to a predetermined threshold value ... a subset of the targeted hidden layers ..., wherein selecting the subset of the targeted hidden layers comprises disregarding at least one hidden layer of the plurality of targeted hidden layers based on a determination that the at least one performance score for each of the at least one disregarded hidden layer is below the predetermined threshold value; and reconstructing ... the pre-trained machine learning model ... to generate a reconstructed machine learning model, wherein the reconstructed machine learning model includes the selected subset of the targeted hidden layers but not the at least one disregarded hidden layer based on the determination that the performance score for each of the at least one disregarded hidden layer is below the predetermined threshold value.
However, Chen teaches:
building, within the at least one semantic probing model (Chen, p. 3052, Fig. 2: "Procedure of proposed layer-wise pruning," where the instant semantic probing model corresponds to the depicted Original Model, Feature Diagnosis, and Pruned Model), one or more hidden layers (Chen, p. 3049, 1 Introduction: "We propose a layer-wise pruning method that identify and remove redundant convolutional layers within deep neural networks," where Chen's deep network layers correspond to the instant hidden layers, as depicted in p. 3052, Fig. 2: "Procedure of proposed layer-wise pruning," where the hidden layers of the Pruned Model correspond to the instant built layers), one or more probing layers (Chen, p. 3050, 3.1 Layer-Wise Pruning via Feature Diagnosis: "in this work we use a single fully-connected layer as the linear classifier to evaluate the effectiveness of a layer thus finding the ones to be pruned," where Chen's fully-connected layer corresponds to the instant probing layer, per [0049] of the instant specification, and where the probing layers are depicted as Linear Classifier in p. 3052, Fig. 2: "Procedure of proposed layer-wise pruning"), and one or more output activation layers (Chen, p. 3052, 4.1 Implementation, Architecture and Objective Function: "In all of our experiments, we utilize the model architectures proposed in the original papers.... For ResNet-101 on multi-label classification ... softmax activation is applied on the output for each category independently," where the activation layers are depicted as Original Model in p. 3052, Fig. 2: "Procedure of proposed layer-wise pruning") on top of the set of contextual vectors (Chen, p. 3052, Fig. 2: "Procedure of proposed layer-wise pruning," where the instant contextual vectors correspond to the feature representations of the Original Model, as in p. 3052, 4.2 Result on Single-Label Classification: "These observations demonstrate the effectiveness of the proposed layerwise pruning method, i.e., by removing the redundant layers based on feature representations and retraining the pruned model with proper settings");
selecting ... based on a comparison of ... performance scores to a predetermined threshold value (Chen, p. 3050, 3.1 Layer-Wise Pruning via Feature Diagnosis: "By comparing the performance of classifiers trained on features computed at adjacent layers, layers that have minor improvement on the feature representations are identified based on a predefined threshold. ... ¶ With the predefined threshold being set as 1.5 percent of the performance of original model ... nearly half of the convolutional layers in ResNet-56 have limited contributions on improving the performance of feature representations") ... a subset of the targeted hidden layers (Chen, p. 3049, 1 Introduction: "We propose a layer-wise pruning method that identify and remove redundant convolutional layers within deep neural networks," where Chen's convolutional layers within DNNs correspond to the instant hidden layers) ..., wherein selecting the subset of the targeted hidden layers (Chen, p. 3050, 3.1 Layer-Wise Pruning via Feature Diagnosis: "in this work we ... evaluate the effectiveness of a layer thus finding the ones to be pruned," where Chen's finding and pruning corresponds to the instant selecting) comprises disregarding at least one hidden layer of the plurality of targeted hidden layers based on a determination that the at least one performance score for each of the at least one disregarded hidden layer is below the predetermined threshold value (Chen, p. 3050, 3.1 Layer-Wise Pruning via Feature Diagnosis: "With the predefined threshold being set as 1.5 percent of the performance of original model ... nearly half of the convolutional layers in ResNet-56 have limited contributions on improving the performance of feature representations"); and
reconstructing ... the pre-trained machine learning model ... to generate a reconstructed machine learning model (Chen, p. 3051, 3.1 Layer-Wise Pruning via Feature Diagnosis: "we conduct the pruning process by directly removing the layers that provide minor influences on improving the feature representations" and where Chen's pruning results in a reconstructed model, as in p. 3051, 3.2 Knowledge Transfer via Distillation: "So far, we have obtained networks with a more compact architecture by reconstructing deep networks") ... based on the semantic probing (Chen, p. 3050, 3.1 Layer-Wise Pruning via Feature Diagnosis: "In [14], the authors propose a method called Linear Classifier Probe to gain understandings on the behaviors of a DNN. ... ¶ Inspired by the observations in [14], in this work we use a single fully-connected layer as the linear classifier to evaluate the effectiveness of a layer thus finding the ones to be pruned") ... , wherein the reconstructed machine learning model includes the selected subset of the targeted hidden layers but not the at least one disregarded hidden layer based on the determination that the performance score for each of the at least one disregarded hidden layer is below the predetermined threshold value (Chen, p. 3051, 3.1 Layer-Wise Pruning via Feature Diagnosis: "we conduct the pruning process by directly removing the layers that provide minor influences on improving the feature representations" and p. 3051, 3.2 Knowledge Transfer via Distillation: "we have obtained networks with a more compact architecture by reconstructing deep networks. While retraining the networks with pretrained weights until convergence is able to achieve reasonable results, the pruned model may not perform as well as the original network").
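For illustration only and not as part of the record, the claimed threshold-based selection and reconstruction may be sketched as follows. Chen's own criterion compares classifiers trained at adjacent layers against a predefined threshold; the sketch instead applies the claim's direct comparison of each performance score to a predetermined threshold value, and the layer list and scores are hypothetical inputs.

```python
# Minimal sketch (illustrative only): disregard layers whose performance
# score falls below a predetermined threshold, then rebuild the model
# from the selected subset only.
import torch.nn as nn

def select_layers(scores, threshold):
    """Return (kept, disregarded) layer indices from per-layer scores."""
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    disregarded = [i for i, s in enumerate(scores) if s < threshold]
    return kept, disregarded

def reconstruct(layers, kept):
    """Generate a reconstructed model containing the selected subset of
    targeted hidden layers and omitting every disregarded layer."""
    return nn.Sequential(*(layers[i] for i in kept))
```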
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhao regarding performing semantic probing on a BERT pre-trained language model using textual utterances, including processing the utterances to determine a performance score for each targeted hidden layer and selecting a subset of the targeted hidden layers based on a comparison of the performance scores, with those of Chen regarding building, within the at least one semantic probing model, one or more hidden layers, one or more probing layers, and one or more output activation layers on top of the set of contextual vectors; selecting, based on a comparison of performance scores to a predetermined threshold value, a subset of the targeted hidden layers, wherein selecting the subset of the targeted hidden layers comprises disregarding at least one hidden layer of the plurality of targeted hidden layers based on a determination that the at least one performance score for each of the at least one disregarded hidden layer is below the predetermined threshold value; and reconstructing the pre-trained machine learning model to generate a reconstructed machine learning model, wherein the reconstructed machine learning model includes the selected subset of the targeted hidden layers but not the at least one disregarded hidden layer based on the determination that the performance score for each of the at least one disregarded hidden layer is below the predetermined threshold value.
The motivation to do so would be to reduce the computational cost of CNN usage without reducing performance (Chen, p. 3048, 1 Introduction: "To save computational cost of CNNs, several works propose to reduce network size via connection-wise pruning.... [W]e propose a feature representation based parameter pruning method that reduces CNNs by removing layers with small improvement on feature representations" and p. 3051, 3.1 Layer-Wise Pruning via Feature Diagnosis: "we argue that despite the destruction of interactions between adjacent layers, with proper retraining, the pruned residual network is still able to reconstruct the interactions between its layers and compensate the loss of the performance").
Regarding Claim 8, Zhao teaches:
an electronic device comprising: at least one memory configured to store instructions; and at least one processing device configured, when executing the instructions (Zhao, p. 14, A.1 Computing infrastructure: "All experiments are conducted on GeForce GTX 1080 Ti and GeForce GTX 1080," where a memory and instructions are inherent in operating a GPU), to perform the steps recited in the rejection of Claim 1. Claim 8 is rejected under the same rationale as Claim 1.
Regarding Claim 15, Zhao teaches a non-transitory machine-readable medium containing instructions that, when executed, cause at least one processor of an electronic device (Zhao, p. 14, A.1 Computing infrastructure: "All experiments are conducted on GeForce GTX 1080 Ti and GeForce GTX 1080," where a non-transitory machine-readable medium and instructions are inherent in operating a GPU) to perform the steps recited in the rejection of Claim 1. Claim 15 is rejected under the same rationale as Claim 1.
Regarding Claim 4, the rejection of Claim 1 is incorporated. The Zhao/Chen combination teaches:
wherein the pre-trained machine learning model comprises a contextualized representation machine learning model (Zhao, p. 1, 1 Introduction: "We use BERT (Devlin et al., 2019) as our pretrained language model and quantify contextualization by investigating how well BERT infers semantic classes (s-classes) of a word in context").
Claims 11 and 18 incorporate substantively all limitations of Claim 4 in device and non-transitory machine-readable medium forms, respectively, and are rejected under the same rationale.
Claims 2, 9, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al., "Quantifying the Contextualization of Word Representations with Semantic Class Probing" (hereinafter "Zhao"), in view of Chen et al., "Shallowing deep networks: Layer-wise pruning based on feature representations" (hereinafter "Chen"), and further in view of Goodsitt et al. (US 2021/0097343 A1, hereinafter "Goodsitt").
Regarding Claim 2, the rejection of Claim 1 is incorporated.
The Zhao/Chen combination teaches performing semantic probing on a pre-trained machine learning model. The Zhao/Chen combination does not explicitly teach imposing a time limit on the semantic probing.
However, Goodsitt teaches in the context of model tuning and reuse:
imposing a time limit (Goodsitt, [0054]: "Process 300 can then proceed to step 309. In step 309, computing resources 101 can generate a trained data model using the data model received from model optimizer 107 and the synthetic dataset received from dataset generator 103. For example, computing resources 101 can be configured to train the data model received from model optimizer 107 until some training criterion is satisfied. The training criterion can be, for example, a performance criterion (e.g., a Mean Absolute Error, Root Mean Squared Error, percent good classification, and the like), a convergence criterion (e.g., a minimum required improvement of a performance criterion over iterations or over time, a minimum required change in model parameters over iterations or over time), elapsed time or number of iterations, or the like. In some embodiments, the performance criterion can be a threshold value for a similarity metric or prediction accuracy metric as described herein").
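For illustration only and not as part of the record, the elapsed-time criterion quoted above may be imposed on the probing as a simple wall-clock limit; the step function, limit, and step cap below are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): impose a time limit on training,
# one of the elapsed-time criteria recited in Goodsitt's paragraph [0054].
import time

def train_until(step_fn, time_limit_s, max_steps=100_000):
    """Run training steps until the time limit (or a step cap) is reached."""
    start = time.monotonic()
    steps = 0
    while steps < max_steps and time.monotonic() - start < time_limit_s:
        step_fn()  # one optimization step of the semantic probing model
        steps += 1
    return steps
```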
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Zhao/Chen combination regarding reconstructing the pre-trained model based on the semantic probing with the further teachings of Goodsitt regarding imposing a time limit.
The motivation to do so would be to provide a means to ensure that computing resources for training are allocated efficiently (Goodsitt, [0146]: "Termination or assignment may be based on performance of the development instance or the performance of another development instance. In this way, the serverless architecture may more efficiently allocate resources during hyperparameter tuning [than] traditional, server-based architectures").
Claims 9 and 16 incorporate substantively all limitations of Claim 2 in device and non-transitory machine-readable medium forms, respectively, and are rejected under the same rationale.
Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al., "Quantifying the Contextualization of Word Representations with Semantic Class Probing" (hereinafter "Zhao"), in view of Chen et al., "Shallowing deep networks: Layer-wise pruning based on feature representations" (hereinafter "Chen"), and further in view of Merchant et al., "What Happens To BERT Embeddings During Fine-tuning?" (hereinafter "Merchant").
Regarding Claim 3, the rejection of Claim 1 is incorporated.
The Zhao/Chen combination teaches selecting a subset of hidden layers of a pre-trained machine learning model based on a performance score determined from semantic probing of the model, then reconstructing the model using the selected hidden layers.
The Zhao/Chen combination does not explicitly teach the selected subset of targeted hidden layers comprises one or more original targeted hidden layers and one or more updated hidden layers; and the reconstructed model is generated using the one or more original targeted hidden layers and the one or more updated hidden layers.
However, Merchant teaches:
the selected subset of targeted hidden layers comprises one or more original targeted hidden layers and one or more updated hidden layers; and the reconstructed model is generated using the one or more original targeted hidden layers and the one or more updated hidden layers (Merchant, p. 7, 5.2 Layer Ablations: "Partial Freezing can be thought of as a test for how many layers need to change for a downstream task. We freeze the bottom k layers (and the embeddings) – treating them as features – but allow the rest to adapt").
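For illustration only and not as part of the record, Merchant's partial freezing may be sketched as follows, assuming a Hugging Face BertModel whose encoder exposes an ordered layer list; the cutoff k is a hypothetical parameter.

```python
# Minimal sketch (illustrative only): freeze the bottom k layers so they
# remain original (frozen) layers while the rest are updated during
# fine-tuning, in the style of Merchant's partial freezing.
def freeze_bottom_k(bert, k):
    for p in bert.embeddings.parameters():
        p.requires_grad = False  # treat the embeddings as fixed features
    for layer in bert.encoder.layer[:k]:  # original targeted hidden layers
        for p in layer.parameters():
            p.requires_grad = False
    # bert.encoder.layer[k:] remain trainable; after fine-tuning they are
    # the updated hidden layers of the reconstructed model.
```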
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Zhao/Chen combination regarding reconstructing the pre-trained model based on the semantic probing with those of Merchant regarding the selected subset of targeted hidden layers comprising one or more original targeted hidden layers and one or more updated hidden layers.
The motivation to do so would be to take advantage of the task-specific performance improvements of fine-tuning with decreased training effort (Merchant, p. 7, 5.2 Layer Ablations, Figure 3: "Effects of freezing an increasing number of layers during fine-tuning on performance.... The graph shows that only a few unfrozen layers are needed to improve task performance, supporting the shallow processing conclusion").
Claims 10 and 17 incorporate substantively all limitations of Claim 3 in device and non-transitory machine-readable medium forms, respectively, and are rejected under the same rationale.
Claims 5-7, 12-14, 19, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al., "Quantifying the Contextualization of Word Representations with Semantic Class Probing" (hereinafter "Zhao"), in view of Chen et al., "Shallowing deep networks: Layer-wise pruning based on feature representations" (hereinafter "Chen"), and further in view of Zhao et al., "Masking as an Efficient Alternative to Finetuning for Pretrained Language Models" (hereinafter "Zhao2").
Regarding Claim 5, the rejection of Claim 1 is incorporated.
The Zhao/Chen combination teaches reconstructing a pre-trained machine learning model using hidden layers selected according to a performance score.
The Zhao/Chen combination does not explicitly teach generating, using the at least one processor, a binary mask based on the reconstructed machine learning model, the binary mask generated for a specific task with semantic and text embedding.
However, Zhao2 teaches:
generating, using the at least one processor, a binary mask based on the reconstructed machine learning model (Zhao2, p. 1, Abstract: "We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning"), the binary mask generated for a specific task (Zhao2, p. 1, 1 Introduction: "Instead of directly updating the pretrained parameters, we propose to select weights important to downstream NLP tasks while discarding irrelevant ones. The selection mechanism consists of a set of binary masks, one learned per downstream task through end-to-end training") with semantic and text embedding (Zhao2, p. 1, 1 Introduction: "We show that masking, when being applied to pretrained language models like BERT, RoBERTa, and DistilBERT (Sanh et al., 2019), achieves performance comparable to finetuning in tasks like part-of-speech tagging, named-entity recognition, sequence classification, and reading comprehension").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the Zhao/Chen combination regarding reconstructing a pre-trained model based on the semantic probing with those of Zhao2 regarding generating a binary mask based on the reconstructed model for a specific task with semantic and text embedding.
The motivation to do so would be improved space efficiency for the trained model (Zhao2, p. 1, 1 Introduction: "Masking is parameter-efficient: only a set of 1- bit binary masks needs to be saved per task after training, instead of all 32-bit float parameters in finetuning. This small memory footprint enables deploying pretrained language models for solving multiple tasks on edge devices. The compactness of masking also naturally allows parameter-efficient ensembles of pretrained language models").
Regarding Claim 6, the Zhao/Chen/Zhao2 combination teaches the method of Claim 5, and thus the rejection of Claim 5 is incorporated.
The Zhao/Chen/Zhao2 combination has not yet been shown to teach generating an initial binary mask based on a threshold of real number mask weights for the specific task; applying the initial binary mask on multiple parameters of the reconstructed machine learning model to generate masked parameters; and evaluating the masked parameters to determine whether a goal of the specific task is met.
Zhao2 further teaches:
wherein generating the binary mask comprises: generating an initial binary mask based on a threshold of real number mask weights (Zhao2, p. 3, 3.2 Learning the mask: "We associate each linear layer W^l ... of the l-th transformer block with a real-valued matrix M^l that is randomly initialized from a uniform distribution and has the same size as W^l. We then pass M^l through an element-wise thresholding function ... to obtain a binary mask M^l_bin for W^l ... where ... τ is a global thresholding hyperparameter") for the specific task (Zhao2, p. 3, 3.2 Learning the mask: "We learn a set of binary masks for an NLP task.... We then update each M^l through Eq. 2 with the task objective during training");
applying the initial binary mask on multiple parameters of the reconstructed machine learning model to generate masked parameters (Zhao2, p. 3, 3.2 Learning the mask: "In each forward pass of training, the binary mask M^l_bin (derived from M^l via Eq. 1) selects weights in a pretrained linear layer W^l by Hadamard product: Ŵ^l := W^l ⊙ M^l_bin"); and
evaluating the masked parameters to determine whether a goal of the specific task is met (Zhao2, p. 3, 3.2 Learning the mask: "In the corresponding backward pass of training, with the associated loss function L, we cannot backpropagate through the binarizer.... [W]e use ∂L(Ŵ^l)/∂M^l_bin as a noisy estimator of ∂L(Ŵ^l)/∂M^l to update M^l, i.e.: M^l ← M^l − η · ∂L(Ŵ^l)/∂M^l_bin (2), where η refers to the step size. Hence, the whole structure can be trained end-to-end," where minimizing the loss function L of a specific task during training corresponds to the instant goal of the specific task).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the prior teachings of the Zhao/Chen/Zhao2 combination regarding generating, for a specific task with semantic and text embedding, a binary mask based on the reconstructed model with the further teachings of Zhao2 regarding generating an initial binary mask for the specific task, applying the initial binary mask on model parameters to generate masked parameters, and evaluating the masked parameters.
The motivation to do so would be to take advantage of pretrained models for multiple tasks in a space-efficient manner (Zhao2, p. 1, 1 Introduction: "This small memory footprint enables deploying pretrained language models for solving multiple tasks on edge devices. The compactness of masking also naturally allows parameter-efficient ensembles of pretrained language models").
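For illustration only and not as part of the record, the mask learning of Eqs. 1 and 2 quoted in the Claim 6 discussion above may be sketched as follows. The straight-through identity in the forward pass stands in for Zhao2's noisy gradient estimator, and the threshold value tau = 0.5 is a hypothetical setting; only the real-valued mask weights are trained while the pretrained weights stay frozen.

```python
# Minimal sketch (illustrative only): learn a binary mask over frozen
# pretrained weights via thresholding (Eq. 1) and a straight-through
# gradient update of the real-valued mask (Eq. 2), in the style of Zhao2.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, pretrained: nn.Linear, tau: float = 0.5):
        super().__init__()
        self.weight = pretrained.weight.detach()  # W^l stays frozen
        self.bias = None if pretrained.bias is None else pretrained.bias.detach()
        # M^l: real-valued mask weights, randomly initialized (uniform)
        self.scores = nn.Parameter(torch.rand_like(self.weight))
        self.tau = tau

    def forward(self, x):
        # Eq. 1: element-wise thresholding yields the binary mask M^l_bin;
        # the (scores - scores.detach()) term is zero-valued but lets the
        # loss gradient reach self.scores, mirroring Eq. 2's estimator.
        hard = (self.scores >= self.tau).float()
        mask = hard + self.scores - self.scores.detach()
        # Hadamard product W^l ⊙ M^l_bin produces the masked parameters
        return nn.functional.linear(x, self.weight * mask, self.bias)
```

An optimizer over the module's parameters then updates only the mask scores against the task loss, so the binary mask continues to be refined for as long as the goal of the specific task is unmet.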
Regarding Claim 7, the rejection of Claim 6 is incorporated. The Zhao/Chen/Zhao2 combination has already been shown to teach:
updating the binary mask in response to determining that the goal of the specific task is not met (Zhao2, p. 3, 3.2 Learning the mask: Eq. 2, which represents a mask update performed while the loss-minimization goal of training has not yet been met).
Claims 12-14 incorporate substantively all limitations of Claims 5-7, respectively, in device form and are rejected under the same rationale.
Claims 19 and 20 incorporate substantively all limitations of Claims 5 and 6, respectively, in non-transitory machine-readable medium form and are rejected under the same rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Huang et al., "WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach," teach a method of combining multiple layers of a pre-trained neural network model to improve the quality of unsupervised sentence embeddings, which further employs averaging of all tokens and whitening-based vector normalization.
Sanh et al., "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," teach a method of performing knowledge distillation of a pre-trained model that removes selected layers for the purpose of reducing model size and improving efficiency.
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT N DAY whose telephone number is (703)756-1519. The examiner can normally be reached M-F 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/R.N.D./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122