Prosecution Insights
Last updated: April 19, 2026
Application No. 18/199,129

Machine Learning Model Adaptation to Account for Data Shifts in Visual Document Understanding Applications

Final Rejection: §102 / §103
Filed: May 18, 2023
Examiner: ANSARI, TAHMINA N
Art Unit: 2674
Tech Center: 2600 — Communications
Assignee: Google LLC
OA Round: 2 (Final)

Grant Probability: 86% (Favorable)
Estimated OA Rounds: 3-4
Estimated Time to Grant: 2y 8m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 86%, above average (743 granted / 868 resolved; +23.6% vs TC avg)
Interview Lift: +17.9% across resolved cases with interview (strong)
Typical Timeline: 2y 8m avg prosecution; 33 currently pending
Career History: 901 total applications across all art units

Statute-Specific Performance

§101: 12.2% (-27.8% vs TC avg)
§103: 40.4% (+0.4% vs TC avg)
§102: 22.6% (-17.4% vs TC avg)
§112: 10.5% (-29.5% vs TC avg)

Based on career data from 868 resolved cases.

Office Action

Rejections under §102 and §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This is in response to the applicant's reply filed September 11, 2025. In the applicant's reply, no claims were amended, cancelled, or newly added. Claims 1-20 are pending in this application.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Examiner's Responses to Applicant's Remarks

Applicant's amendments filed on September 11, 2025 have been fully considered. The amendments overcome the following rejection set forth in the office action mailed on June 12, 2025: the amendments overcome the objection to the title of the specification, and the objection is hereby withdrawn.

Applicant's arguments filed on September 11, 2025 have been fully considered but they are not persuasive. The Examiner has thoroughly reviewed Applicant's arguments but firmly believes that the cited reference reasonably and properly meets the claimed limitations.

Applicant argues that the Examiner used different models to address the limitations of the independent claims. Specifically, Applicant argues that paragraphs [0034] through [0037] of Li were cited for the "applying/apply" limitations, whereas paragraphs [0041] through [0043] of Li were cited for the "generating/generate" limitations, and presented a table regarding the cited sections from Li.

[Applicant's table images: media_image1.png, media_image2.png]

Examiner respectfully disagrees.
Examiner has cited particular columns and line numbers or figures in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claims, other passages and figures may apply as well. It is respectfully requested that the applicant, in preparing responses, fully consider each reference as a whole as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner.

The claims recite the following features:

applying a masked visual language modeling ("MVLM") to target domain data determined as associated with the distribution shift to produce model predictions;
generating pseudo-labels using the model predictions;

As a whole, Li is directed to "visual-and-language (V+L) systems and methods for learning vision and language representation" and teaches several embodiments, including a computing device for implementing a VLP method (Figure 1), a process flow for training a VLP system, a simplified logic flow illustrating the method that implements the submodules in Figure 1 (Figure 3), and a model architecture to use a VLP system in downstream tasks (Figures 4A-B). These models are used together as discussed in [0045] and Figure 3.
For (a), the claims recite the following feature, and Examiner cited sections [0035]-[0037] of Li as they are also directed towards a multimodal encoder that generates a "masked-language-modeling (MLM)" loss "to learn multimodal interactions between the image input 210 and the text input 220." Examiner did not cite these sections for length, but rather to show a pertinent section that directly relates to the application of MLM, which "utilizes both the image and the contextual text from the encoded image-text samples to predict the masked words in the encoded image-text samples." The sections are presented below for clarification.

applying a masked visual language modeling ("MVLM") to target domain data determined as associated with the distribution shift to produce model predictions;

Li: [0034] The multimodal encoder 240 is also configured to generate a masked-language-modeling (MLM) loss 244 to learn multimodal interactions between the image input 210 and the text input 220. The MLM loss 244 can be defined as a loss function between a predicted possibility of one or more masked tokens in the encoded image-text samples and a ground truth identity of the one or more masked tokens of the encoded image-text samples.

[0035] Masked language modeling (MLM) utilizes both the image and the contextual text from the encoded image-text samples to predict the masked words in the encoded image-text samples. The input tokens can be randomly masked out with a predetermined probability such as 15% and replaced with the special token [MASK]. For example, the replacements are 10% random tokens, 10% unchanged, and 80% [MASK].
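As context for the masking scheme quoted from Li's [0035], the 15% selection with 80/10/10 replacement is a standard BERT-style corruption procedure. A minimal sketch follows; the integer token ids, `MASK_TOKEN` sentinel, and function name are illustrative assumptions, not drawn from Li or the claims:

```python
import random

MASK_TOKEN = -1  # stand-in id for the special [MASK] token (hypothetical)

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=0):
    """Randomly select ~15% of tokens; of those, replace 80% with [MASK],
    10% with a random token, and leave 10% unchanged, per Li [0035]."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [None] * len(token_ids)  # prediction targets at masked positions only
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # not selected for masking
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)
        # else: keep the original token (10% unchanged)
    return corrupted, labels
```

The model then predicts the original token at every position where a label was recorded, which is the loss described in the next quoted paragraph.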
[0036] The MLM learning loss 244 can be the cross-entropy H between the predicted probability for a masked token in the encoded image-text samples and the ground-truth one-hot vocabulary distribution, such as: [Equation image: media_image3.png]

The claims recite the following feature:

generating pseudo-labels using the model predictions;

Applicant is reminded that the Examiner is entitled to give the broadest reasonable interpretation to the language of the claims. For (b), Examiner cited sections [0041]-[0043] of Li as they are also directed towards generating pseudo-targets using "momentum distillation (MoD)," which is a "continuously evolving teacher model." This model is specifically referred to in paragraph [0042], which clearly states: "[0042] During training, the visual-and-language base model can be trained so that its predictions match the predictions from the momentum model." Again, Examiner did not cite these sections for length, but rather to show a pertinent section that directly relates to updating the MLM predictions using "pseudo-targets" generated using "momentum distillation (MoD) as an alternative of original noisy data for training the model." The sections are presented below for clarification.

Li: [0041] In one embodiment, in order to improve learning, such as in the presence of noisy input data for training the model, pseudo-targets are generated using momentum distillation (MoD) as an alternative of original noisy data for training the model. For all of the encoders (e.g., the image encoder 212, the text encoder 222, and the multimodal encoder 240), pseudo-targets are generated by a momentum model 260. The momentum model is a continuously-evolving teacher model which includes exponential-moving average versions of all of the encoders, including the unimodal and multimodal encoders.
[0042] During training, the visual-and-language base model can be trained so that its predictions match the predictions from the momentum model. [Equation image: media_image4.png]

Furthermore, the Examiner is not limited to Applicant's definition, which is not specifically set forth in the claims. In re Tanaka et al., 193 USPQ 139 (CCPA 1977). Applicant did not clarify why the models of Li, which are both used in the overall VLP system, cannot read on the claimed limitations as presented. In fact, the claims appear to be fully anticipated, as exemplified in Figure 3 and described further in paragraphs [0045]-[0061], which specifically discuss the use of both models, how the overall VLP system is improved in [0061], and how MoD augments the overall dataset in the MLM model.

[0045] FIG. 3 is a simplified logic flow diagram illustrating a method 300 for vision and language representation learning that implements the submodules 131-134 in FIG. 1, according to some embodiments. One or more of the processes 310-360 of method 300 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 310-360. In some embodiments, method 300 may correspond to the method used by the module 130.

[0061] Both ITC and MLM generate views by taking partial information from an image-text pair. Momentum distillation can improve upon the ITC and MLM and generate different views from the entire proposed distribution. For ITC, alternative views of an image-text pair can be generated by finding semantically similar images and texts in the training dataset. For MLM, alternative views for the masked word can be generated from the entire vocabulary set. Therefore, MoD can be considered as performing data augmentation to the original views.
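As context for the "continuously-evolving teacher model" quoted from Li's [0041]-[0042], the exponential-moving-average (EMA) teacher update reduces to a one-line rule: each teacher parameter is interpolated toward the corresponding student parameter. A minimal sketch under that assumption; the flat parameter lists and momentum value are illustrative, not taken from Li:

```python
def ema_update(teacher_params, student_params, momentum=0.995):
    """One momentum-model step: every teacher parameter becomes an
    exponential moving average of the matching student parameter,
    so the teacher evolves continuously as the student trains."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

During training, the teacher's (soft) predictions serve as the pseudo-targets that the base model is trained to match, in place of the original noisy labels.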
MoD generates a diverse set of views that are absent in the original image-text pairs, which can improve the model's generalization performance.

Applicant argues that Li does not anticipate "applying a masked visual language modeling ("MVLM") to target domain data determined as associated with the distribution shift to produce model predictions" as required by the independent claims. Examiner respectfully disagrees. Applicant is reminded that the Examiner is entitled to give the broadest reasonable interpretation to the language of the claims. Li teaches that the MLM is a "masked language modeling" which "utilizes both the image and the contextual text from the encoded image-text samples to predict the masked words in the encoded image-text samples." So the Examiner considers "masked-language-modeling (MLM)" "to learn multimodal interactions between the image input 210 and the text input 220," which "utilizes both the image and the contextual text from the encoded image-text samples to predict the masked words in the encoded image-text samples," followed by updating the MLM predictions using "pseudo-targets" generated using "momentum distillation (MoD) as an alternative of original noisy data for training the model," to be Applicant's claimed features within the broad meaning of the terms. The Examiner is not limited to Applicant's definition, which is not specifically set forth in the claims. In re Tanaka et al., 193 USPQ 139 (CCPA 1977).

Applicant argues that Shrivastava was used in combination with Li and does not obviate the claimed features of Claim 4. Applicant is reminded that the Examiner is entitled to give the broadest reasonable interpretation to the language of the claims.
Shrivastava clearly teaches, in paragraph [0304] ("Improving ASR Models Through Semi-Supervised Learning") and paragraph [0305]:

[0305] In particular embodiments, the assistant system 140 may use a new semi-supervised learning framework for training ASR (automatic speech recognition) models, which takes three steps to ensure we select the most informative and clean transcriptions, and balances uncertainty and diversity during the data selection process. The first step may be "combine." As a starting point, we may take a baseline model to generate a second view of the pseudo labels. Then we may use different ways to ensemble the outputs from the helper model and the base model. The second step may be "balance." To further mine data that is more helpful (or informative) for model training, we may experiment with different confidence and combine them with various diversity metrics to balance uncertainty and diversity. The third step may be "lightweight supervision." To address the issue about incorrect predictions from the confidence approach, we may design and implement a lightweight supervision approach. In particular, since the method designs an extremely lightweight supervision, it may take a low amount of effort from human annotators but may effectively resolve or mitigate the drawbacks of the purely automatic approach using only pseudo labels. Although this disclosure describes training particular models by particular systems in a particular manner, this disclosure contemplates training any suitable model by any suitable system in any suitable manner.

So the Examiner considers Shrivastava's new semi-supervised learning framework, whose three steps of "combine," "balance," and "lightweight supervision" experiment with different confidence measures and combine them with various diversity metrics to balance uncertainty and diversity, to be Applicant's "applying thresholding to the pseudo-labels to reduce the pseudo-labels by a given amount" within the broad meaning of the term.
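As background for the claim 4 feature at issue, "applying thresholding to the pseudo-labels to reduce the pseudo-labels by a given amount" is, in conventional self-training practice, a confidence filter over model predictions. A minimal sketch under that reading; the function name, data shape, and threshold value are illustrative assumptions, not drawn from the claims, Li, or Shrivastava:

```python
def threshold_pseudo_labels(predictions, confidence_threshold=0.9):
    """Keep only pseudo-labels whose model confidence clears the threshold,
    reducing the pseudo-label set by a data-dependent amount.
    `predictions` maps a sample id to a (label, confidence) pair."""
    return {sid: label
            for sid, (label, conf) in predictions.items()
            if conf >= confidence_threshold}
```

Raising the threshold trades coverage for label quality: fewer pseudo-labels survive, but the survivors are the ones the model is most confident about.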
The Examiner is not limited to Applicant's definition, which is not specifically set forth in the claims. In re Tanaka et al., 193 USPQ 139 (CCPA 1977).

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 6-7, 9-11, 14-15 and 17-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Li et al. (US PGPub 2022/0391755 A1, filed July 8, 2021), hereby referred to as "Li".

Consider Claims 1, 9 and 18. Li teaches:

1. A method, comprising: / 9. A method for processing one or more electronic documents, comprising: receiving the one or more electronic documents as an input data stream; / 18. A non-transitory computer readable medium having stored thereon instructions that when executed by one or more computing devices cause the one or computing devices to:

(Li: abstract, Figure 1, VLP Systems and Methods [0015] FIG. 1 is a simplified diagram of a computing device for implementing a VLP system for training a vision-and-learning (V+L) model, according to some embodiments. As shown in FIG. 1, computing device 100 includes a processor 110 coupled to memory 120. Operation of computing device 100 is controlled by processor 110.
Although computing device 100 is shown with only one processor 110, it is understood that processor 110 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 100. Computing device 100 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.

[0016] Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

[0017] Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities. [0045]-[0061], Figure 3)

1. training, via a source domain, a machine learning model to use with one or more visual document understanding ("VDU") tasks; / 9. applying a machine learning model to the input data stream;

(Li: [0021]-[0044], Figure 2, [0021] FIG. 2 is a simplified diagram of a process flow for training a V+L model using one or more loss objectives, according to some embodiments. As shown in FIG. 2, an image input 210 is passed to a feed forward image encoder 212 to generate embeddings 214. An input image I is encoded into a sequence of embeddings 214 such as {vcls, v1, . . . vN}, where vcls is the embedding of the [CLS] token. A text input 220 is passed to a feed forward text encoder 222 to generate embeddings 224. For example, the text encoder transforms an input text T into a sequence of embeddings 224 such as {wcls, w1, . . . wN}.

[0041] In one embodiment, in order to improve learning, such as in the presence of noisy input data for training the model, pseudo-targets are generated using momentum distillation (MoD) as an alternative of original noisy data for training the model. For all of the encoders (e.g., the image encoder 212, the text encoder 222, and the multimodal encoder 240), pseudo-targets are generated by a momentum model 260. The momentum model is a continuously-evolving teacher model which includes exponential-moving average versions of all of the encoders, including the unimodal and multimodal encoders.

[0042] During training, the visual-and-language base model can be trained so that its predictions match the predictions from the momentum model. Specifically, for modifying the ITC, an image-text similarity can be adjusted with the pseudo-targets generated by the momentum model, such as s′(I, T)=g′v(v′cls)Tg′w(w′cls); similarly, a text-image similarity can be adjusted with the pseudo-targets generated by the momentum model,)

1. determining a distribution shift when the machine learning model is applied in a target domain; / 9. determining that there is a domain shift associated with the input data stream; / 18.
determine a distribution shift when the machine learning model is applied in a target domain;

(Li: [0036] The MLM learning loss 244 can be the cross-entropy H between the predicted probability for a masked token in the encoded image-text samples and the ground-truth one-hot vocabulary distribution, such as: [Equation image: media_image3.png]

[0038] The subset of the encoded image and text samples can be selected based at least in part on negative mining before being encoded into encoded image-text samples by a multimodal encoder. Hard negatives can be sampled for the ITM task with zero computation overhead. A negative image-text pair is hard if they share similar semantics and differ in fine-grained details. The contrastive similarity from eqn (1) can be used to find hard negatives. For each image in a mini-batch, one negative text can be sampled from the same batch following the contrastive similarity distribution, where texts that are more similar to the image have a higher chance to be sampled. Likewise, one hard negative image can be sampled for each text.)

1. applying a masked visual language modeling ("MVLM") to target domain data determined as associated with the distribution shift to produce model predictions; / 9. applying masked visual language modeling ("MVLM") to target domain data determined as associated with the domain shift to produce model predictions; / 18. apply a masked visual language modeling ("MVLM") to target domain data detected as associated with the distribution shift to produce model predictions;

(Li: [0034] The multimodal encoder 240 is also configured to generate a masked-language-modeling (MLM) loss 244 to learn multimodal interactions between the image input 210 and the text input 220. The MLM loss 244 can be defined as a loss function between a predicted possibility of one or more masked tokens in the encoded image-text samples and a ground truth identity of the one or more masked tokens of the encoded image-text samples.

[0035] Masked language modeling (MLM) utilizes both the image and the contextual text from the encoded image-text samples to predict the masked words in the encoded image-text samples. The input tokens can be randomly masked out with a predetermined probability such as 15% and replaced with the special token [MASK]. For example, the replacements are 10% random tokens, 10% unchanged, and 80% [MASK].

[0036] The MLM learning loss 244 can be the cross-entropy H between the predicted probability for a masked token in the encoded image-text samples and the ground-truth one-hot vocabulary distribution, such as: [Equation image: media_image3.png])

1. generating pseudo-labels using the model predictions; and adapting the machine learning model to include the pseudo-labels to produce an adapted model. / 9. generating pseudo-labels using the model predictions; adapting the machine learning model to include the pseudo-labels to produce an adapted model; and processing the input data stream using the adapted model. / 18. generate pseudo-labels using the model predictions; and adapt the machine learning model to include the pseudo-labels to produce an adapted model.

(Li: [0041] In one embodiment, in order to improve learning, such as in the presence of noisy input data for training the model, pseudo-targets are generated using momentum distillation (MoD) as an alternative of original noisy data for training the model. For all of the encoders (e.g., the image encoder 212, the text encoder 222, and the multimodal encoder 240), pseudo-targets are generated by a momentum model 260.
The momentum model is a continuously-evolving teacher model which includes exponential-moving average versions of all of the encoders, including the unimodal and multimodal encoders.

[0042] During training, the visual-and-language base model can be trained so that its predictions match the predictions from the momentum model. [Equation image: media_image4.png] [0045]-[0061], Figure 3)

Consider Claims 2, 11 and 19. Li teaches:

2. The method of claim 1, comprising applying self-training to the machine learning model using the pseudo-labels. / 11. The method of claim 9, comprising applying self-training to the machine learning model using the pseudo-labels. / 19. The non-transitory computer readable medium of claim 18, wherein the instructions cause the one or computing devices to apply self-training to the machine learning model using the pseudo-labels.

(Li: [0041] In one embodiment, in order to improve learning, such as in the presence of noisy input data for training the model, pseudo-targets are generated using momentum distillation (MoD) as an alternative of original noisy data for training the model. For all of the encoders (e.g., the image encoder 212, the text encoder 222, and the multimodal encoder 240), pseudo-targets are generated by a momentum model 260. The momentum model is a continuously-evolving teacher model which includes exponential-moving average versions of all of the encoders, including the unimodal and multimodal encoders.

[0042] During training, the visual-and-language base model can be trained so that its predictions match the predictions from the momentum model. [Equation image: media_image4.png])

Consider Claims 3 and 20. Li teaches:

3. The method of claim 1, comprising processing the target domain data detected as associated with the distribution shift using the adapted model. / 20. The non-transitory computer readable medium of claim 18, wherein the instructions cause the one or computing devices to process the target domain data detected as associated with the distribution shift using the adapted model.

(Li: [0036] The MLM learning loss 244 can be the cross-entropy H between the predicted probability for a masked token in the encoded image-text samples and the ground-truth one-hot vocabulary distribution, such as: [Equation image: media_image3.png]

[0038] The subset of the encoded image and text samples can be selected based at least in part on negative mining before being encoded into encoded image-text samples by a multimodal encoder. Hard negatives can be sampled for the ITM task with zero computation overhead. A negative image-text pair is hard if they share similar semantics and differ in fine-grained details. The contrastive similarity from eqn (1) can be used to find hard negatives. For each image in a mini-batch, one negative text can be sampled from the same batch following the contrastive similarity distribution, where texts that are more similar to the image have a higher chance to be sampled. Likewise, one hard negative image can be sampled for each text. [0039], [0050]-[0051] Figure 3)

Consider Claims 6 and 14. Li teaches:

6. The method of claim 1, comprising generating the pseudo-labels on a per-batch basis. / 14. The method of claim 9, comprising generating the pseudo-labels on a per-batch basis.

(Li: [0038] The subset of the encoded image and text samples can be selected based at least in part on negative mining before being encoded into encoded image-text samples by a multimodal encoder. Hard negatives can be sampled for the ITM task with zero computation overhead. A negative image-text pair is hard if they share similar semantics and differ in fine-grained details. The contrastive similarity from eqn (1) can be used to find hard negatives.
For each image in a mini-batch, one negative text can be sampled from the same batch following the contrastive similarity distribution, where texts that are more similar to the image have a higher chance to be sampled. Likewise, one hard negative image can be sampled for each text.)

Consider Claims 7 and 15. Li teaches:

7. The method of claim 1, comprising processing the target domain data using a visual encoder. / 15. The method of claim 9, comprising processing the target domain data using a visual encoder.

(Li: [0019] In some embodiments, the VLP module 130 includes an image encoder module 131 and a text encoder module 132. Specifically, the image encoder module is configured to form an encoding of the image input 142. The text encoder module is configured to form an encoding of the text input 144. In some embodiments, the VLP module 130 includes a multimodal encoder 133. The multimodal encoder is configured to receive the encoding of the image input and the encoding of the text input. The multimodal encoder is configured to fuse the encoding of the image input with the encoding of the text input. In some embodiments, the VLP module 130 includes a momentum module 134. During training, the momentum module is configured to receive output from the multimodal encoder and to perform momentum distillation (MoD) that generates pseudo targets of the outputs, such as exponential-moving average versions of the outputs.

[0042] During training, the visual-and-language base model can be trained so that its predictions match the predictions from the momentum model. Specifically, for modifying the ITC, an image-text similarity can be adjusted with the pseudo-targets generated by the momentum model, such as s′(I, T)=g′v(v′cls)Tg′w(w′cls); similarly, a text-image similarity can be adjusted with the pseudo-targets generated by the momentum model, such as s′(T, I)=g′w(w′cls)Tg′v(v′cls). Soft pseudo-targets qi2t and qt2i can be generated by replacing s with s′ in eqn (1).
The ITC can be modified by the MoD pseudo-targets to generate the ITC-MoD loss, such as being defined as: [Equation image: media_image5.png])

Consider Claims 10 and 17. Li teaches:

10. The method of claim 9, wherein the machine learning model is trained on source domain data that does not account for the target domain data. / 17. The method of claim 9, wherein the target domain data comprises test-time data.

(Li: [0047] At process 320, an image encoder may encode the plurality of image samples into a plurality of encoded image samples. At process 320, a text encoder may encode the plurality of text samples into a plurality of encoded text samples. The encoding of the image encoder or the text encoder may occur at the same time or at different times. For example, the encoding of the image encoder may occur before the encoding of the text encoder. For another example, the encoding of the image encoder may occur after the encoding of the text encoder. In some embodiments, the image encoder is a transformer. In further embodiments, the text encoder is a transformer.

[0048] At process 330, a first loss objective may be computed based on the plurality of encoded image samples and the plurality of encoded text samples. The first loss objective may comprise an image-text contrastive loss (ITC) loss objective that refers to a loss function between a predicted similarity between an encoded image sample and an encoded text sample and a corresponding ground-truth similarity.)

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1, 4-5, 8-9, 12-13, 16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US PGPub 2022/0391755 A1), hereby referred to as "Li", in view of Shrivastava et al. (US PGPub 2023/0245654 A1, filed on January 20, 2023, with provisional priority dating back to January 31, 2022), hereby referred to as "Shrivastava".

Consider Claims 1/4, 9/12 and 18. Li teaches: 1. The method of Claim 1 / 9. The method of Claim 9 / 18. The non-transitory computer readable medium of Claim 18.

Li does not teach the dependent features from claims 4 and 12: applying thresholding to the pseudo-labels to reduce the pseudo-labels by a given amount.

Shrivastava teaches:

1. A method, comprising: / 9. A method for processing one or more electronic documents, comprising: receiving the one or more electronic documents as an input data stream; / 18.
A non-transitory computer readable medium having stored thereon instructions that when executed by one or more computing devices cause the one or computing devices to:

(Shrivastava: abstract, In one embodiment, a system includes an automatic speech recognition (ASR) module, a natural-language understanding (NLU) module, a dialog manager, one or more agents, an arbitrator, a delivery system, one or more processors, and a non-transitory memory coupled to the processors comprising instructions executable by the processors, the processors operable when executing the instructions to receive a user input, process the user input using the ASR module, the NLU module, the dialog manager, one or more of the agents, the arbitrator, and the delivery system, and provide a response to the user input.

Figure 1, [0035] FIG. 1 illustrates an example network environment 100 associated with an assistant system. Network environment 100 includes a client system 130, an assistant system 140, a social-networking system 160, and a third-party system 170 connected to each other by a network 110. Although FIG. 1 illustrates a particular arrangement of a client system 130, an assistant system 140, a social-networking system 160, a third-party system 170, and a network 110, this disclosure contemplates any suitable arrangement of a client system 130, an assistant system 140, a social-networking system 160, a third-party system 170, and a network 110. As an example and not by way of limitation, two or more of a client system 130, a social-networking system 160, an assistant system 140, and a third-party system 170 may be connected to each other directly, bypassing a network 110. As another example, two or more of a client system 130, an assistant system 140, a social-networking system 160, and a third-party system 170 may be physically or logically co-located with each other in whole or in part. Moreover, although FIG. 1 illustrates a particular number of client systems 130, assistant systems 140, social-networking systems 160, third-party systems 170, and networks 110, this disclosure contemplates any suitable number of client systems 130, assistant systems 140, social-networking systems 160, third-party systems 170, and networks 110. As an example and not by way of limitation, network environment 100 may include multiple client systems 130, assistant systems 140, social-networking systems 160, third-party systems 170, and networks 110.)

1. training, via a source domain, a machine learning model to use with one or more visual document understanding ("VDU") tasks; / 9. applying a machine learning model to the input data stream;

(Shrivastava: [0071] In particular embodiments, due to a limited computing power of the client system 130, the on-device dialog manager 216a may conduct on-device learning based on learning algorithms particularly tailored for the client system 130. As an example and not by way of limitation, federated learning techniques may be implemented by the on-device dialog manager 216a. Federated learning is a specific category of distributed machine learning techniques which may train machine-learning models using decentralized data stored on end devices (e.g., mobile phones). In particular embodiments, the on-device dialog manager 216a may use a federated user representation learning model to extend existing neural-network personalization techniques to implementation of federated learning by the on-device dialog manager 216a. Federated user representation learning may personalize federated learning models by learning task-specific user representations (i.e., embeddings) and/or by personalizing model weights. Federated user representation learning is simple, scalable, privacy-preserving, and resource-efficient. Federated user representation learning may divide model parameters into federated and private parameters.
Private parameters, such as private user embeddings, may be trained locally on a client system 130 instead of being transferred to or averaged by a remote server (e.g., the server associated with assistant system 140). Federated parameters, by contrast, may be trained remotely on the server. In particular embodiments, the on-device dialog manager 216 a may use an active federated learning model, which may transmit a global model trained on the remote server to client systems 130 and calculate gradients locally on the client systems 130. Active federated learning may enable the on-device dialog manager 216 a to minimize the transmission costs associated with downloading models and uploading gradients. For active federated learning, in each round, client systems 130 may be selected in a semi-random manner based at least in part on a probability conditioned on the current model and the data on the client systems 130 in order to optimize efficiency for training the federated learning model.) 1. determining a distribution shift when the machine learning model is applied in a target domain; / 9. determining that there is a domain shift associated with the input data stream; / 18. determine a distribution shift when the machine learning model is applied in a target domain; (Shrivastava: [0289] In particular embodiments, the assistant system 140 may curate a vocabulary for language models by incorporating differential privacy with multi-round training with removal of highest-frequency words in each round to allow effective training of next highest-frequency words. Based on determining why distribution of words/noise was bad in datasets with exponential frequency drop offs and understanding the problem of high-frequency words polluting low-frequency words because of all the noise generated by high-frequency words, the assistant system 140 may use a particular federated analytics mechanism/algorithm where one can get recall@10K=90%. 
Although this disclosure describes curating particular vocabularies by particular systems in a particular manner, this disclosure contemplates curating any suitable vocabulary by any suitable system in any suitable manner. [0290] In particular embodiments, the assistant system 140 may use federated analytics to collect the most popular/frequent words to curate a vocabulary (word domain) for a language model for a smart keyboard (e.g., on a VR headset). However, the noise created by the differential privacy model on the client systems 130 may create much noise with respect to the highest-frequency words that low-frequency words cannot be differentiated from the noise. [0302]-[0303]) 1. applying a masked visual language modeling ("MVLM") to target domain data determined as associated with the distribution shift to produce model predictions; / 9. applying masked visual language modeling ("MVLM") to target domain data determined as associated with the domain shift to produce model predictions; / 18. apply a masked visual language modeling ("MVLM") to target domain data detected as associated with the distribution shift to produce model predictions; (Shrivastava: [0162] Model-Based Negatives. While in-batch negatives may greatly expand the number of negatives, these negatives may not be particularly challenging as they are randomly sampled from the training dataset. To increase the quality of our utterance-scenario metric, we explore augmenting in-batch negatives with model-based negatives. Specifically, given a retrieval module from training round t with metric simt(u, s)=EU t (u)T ES t (s), we find the top-k neighbors for each positive pair (ui, si +) by computing argmax(1 . . . k) {simt (ui, s k )|1≤k≤n∧i≠k}. Using this algorithm, we cache each training example's hard negatives, then in training round (t+1), we fine-tune the retrieval module using both in-batch and model-based negatives. 
Each non-seed training round t>1 may therefore feature at most (B(k+1)−1) negatives per positive pair. This procedure may be similar to iterative training used in (Oguz et al., 2021). [0163] Identity Masking. The precise number of negatives may be empirically slightly less, as sometimes we may see conflicts between each training examples' negative pairs. Let (ui, si) and (uj, sj) be two training examples within the same batch. During model-based negatives sampling in training round t, (uj, sj) may include si as a top-k negative pair. This may complicate metric learning as si becomes a positive and negative for ui simultaneously. We therefore implement an identity mask on each training example's negatives which may ensure no conflicts when mixing in-batch and model-based negatives. [0302] In the second scenario, instead of using the proxy data frequency distribution, the embodiments disclosed herein use a more linear data distribution to see how that impacts recall@10K. FIG. 20 illustrates example impact to recall@10K. [0303] In the third scenario, the embodiments disclosed herein discarded the frequencies of the first 100 most frequent words and fitting a straight line. FIG. 21 illustrates an example graph regarding recall@10K for a linear data distribution. As can be seen from the graph, the light blue line which shows the recall@10K for the linear data distribution is about 20% higher. However, because the embodiments disclosed herein reduced the frequency of the top words, the recall@1K to recall@6K may have suffered quite a bit. But this may confirm that high frequency words contribute to a lot of noise for lower frequency words.) 1. generating pseudo-labels using the model predictions; and adapting the machine learning model to include the pseudo-labels to produce an adapted model. / 9. 
generating pseudo-labels using the model predictions; adapting the machine learning model to include the pseudo-labels to produce an adapted model; and processing the input data stream using the adapted model. / 18. generate pseudo-labels using the model predictions; and adapt the machine learning model to include the pseudo-labels to produce an adapted model. (Shrivastava: [0304] Improving ASR Models Through Semi-Supervise Learning [0305] In particular embodiments, the assistant system 140 may use a new semi-supervised learning framework for training ASR (automatic speech recognition) models, which takes three steps to ensure we select the most informative and clean transcriptions, and balances uncertainty and diversity during the data selection process. The first step may be “combine.” As a starting point, we may take a baseline model to generate a second view of the pseudo labels. Then we may use different ways to ensemble the outputs from the helper model and the base model. The second step may be “balance.” To further mine data that is more helpful (or informative) for model training, we may experiment with different confidence and combine them with various diversity metrics to balance uncertainty and diversity. The third step may be “lightweight supervision.” To address the issue about incorrect predictions from the confidence approach, we may design and implement a lightweight supervision approach. In particular, since the method designs an extremely lightweight supervision, it may take a low amount of effort from human annotators but may effectively resolve or mitigate the drawbacks of the purely automatic approach using only pseudo labels. Although this disclosure describes training particular models by particular systems in a particular manner, this disclosure contemplates training any suitable model by any suitable system in any suitable manner. 
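The three-step "combine / balance / lightweight supervision" framework quoted from Shrivastava's paragraph [0305] above can be sketched in code. Everything here is an illustrative assumption: the class names, the higher-confidence ensembling rule, the threshold values, and the duplicate cap are stand-ins, not the reference's actual implementation.

```python
# Hypothetical sketch of the three-step pseudo-label selection framework
# described in Shrivastava [0305]. All names and scoring rules below are
# illustrative assumptions, not the reference's implementation.
from dataclasses import dataclass

@dataclass
class PseudoLabel:
    text: str          # candidate transcription
    confidence: float  # model confidence in [0, 1]

def combine(base: PseudoLabel, helper: PseudoLabel) -> PseudoLabel:
    """Step 1 ("combine"): ensemble the base and helper model outputs,
    here by simply keeping the higher-confidence hypothesis."""
    return base if base.confidence >= helper.confidence else helper

def balance(candidates, threshold=0.5, max_per_text=1):
    """Step 2 ("balance"): keep confident labels while capping duplicates,
    a crude stand-in for trading off uncertainty against diversity."""
    seen, kept = {}, []
    for c in sorted(candidates, key=lambda c: -c.confidence):
        if c.confidence >= threshold and seen.get(c.text, 0) < max_per_text:
            seen[c.text] = seen.get(c.text, 0) + 1
            kept.append(c)
    return kept

def lightweight_supervision(kept, review):
    """Step 3 ("lightweight supervision"): keep high-confidence labels
    automatically and send only the borderline ones to a human reviewer."""
    return [c for c in kept if c.confidence > 0.9 or review(c)]
```

The point of the sketch is the pipeline shape, not the scoring: confident labels flow through automatically, and only the residue needs human effort, which is the "low amount of effort from human annotators" the reference emphasizes.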
[0306] Conventionally, ASR models may be trained by speech data from the voice assistant domains and videos publicly shared by users, without any live traffic utterances. In order to boost the model accuracy, we may collect live traffic audios and generate pseudo labels with existing ASR models to augment the training datasets.) 4. The method of claim 1, comprising applying thresholding to the pseudo-labels to reduce the pseudo-labels by a given amount./ 12. The method of claim 9, comprising applying threshold to the pseudo-labels to reduce the pseudo-labels by a given amount. (Shrivastava: [0272] In particular embodiments, dynamic end-pointing may make use of a domain classifier to infer the domain the utterance falls under. During decoding, the end-pointer may have access to the partial transcript of the audio received so far. Every time the partial transcript is changed, the domain classifier may be queried to determine the domain. Based on the predicted domain, the assistant system 140 may dynamically update the end-pointing thresholds for the rest of the utterance. This may comprise static end-pointer thresholds, as well as neural end-pointer thresholds. The domain specific thresholds may be pre-tuned and configured as part of the model package. [0273] Table 15 lists example adjustment of end-pointing thresholds based on predicted CQA domain. Table 16 lists example adjustment of end-pointing thresholds based on predicted capture domain. Note “ood” stands for “out of domain” and “cqa” stands for “community question answering”. [0304] Improving ASR Models Through Semi-Supervise Learning [0305] In particular embodiments, the assistant system 140 may use a new semi-supervised learning framework for training ASR (automatic speech recognition) models, which takes three steps to ensure we select the most informative and clean transcriptions, and balances uncertainty and diversity during the data selection process. 
The first step may be “combine.” As a starting point, we may take a baseline model to generate a second view of the pseudo labels. Then we may use different ways to ensemble the outputs from the helper model and the base model. The second step may be “balance.” To further mine data that is more helpful (or informative) for model training, we may experiment with different confidence and combine them with various diversity metrics to balance uncertainty and diversity. The third step may be “lightweight supervision.” To address the issue about incorrect predictions from the confidence approach, we may design and implement a lightweight supervision approach. In particular, since the method designs an extremely lightweight supervision, it may take a low amount of effort from human annotators but may effectively resolve or mitigate the drawbacks of the purely automatic approach using only pseudo labels. Although this disclosure describes training particular models by particular systems in a particular manner, this disclosure contemplates training any suitable model by any suitable system in any suitable manner.) It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to improve Li’s visual and language machine learning system with the teachings of Shrivastava, as they are both directed towards natural language processing using machine learning algorithms. The determination of obviousness is predicated upon the following findings: One skilled in the art would have been motivated to modify Li in order to leverage known parameters that would improve the overall learning capability and ensure a more accurate and interactive end-user process. 
Furthermore, the prior art collectively includes each element claimed (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface and/or programming techniques, without changing a “fundamental” operating principle of Li, while the teaching of Shrivastava continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of ensuring a more accurate and improved overall learning algorithm. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claim in question. Consider Claims 5 and 13. Claims 5 and 13 are rejected over the combination of Li and Shrivastava as presented for claims 4 and 12 above. Li teaches: The method of Claim 7 and The method of Claim 9. Li does not teach the use of an: optical character recognition parser Shrivastava teaches: 5. The method of claim 4, wherein the applying a threshold comprises applying an entropy-based uncertainty-aware pseudo-labeling selection mechanism to determine which of the pseudo-labels are reliable. / 13. The method of claim 12 wherein the applying the threshold comprises applying an entropy-based uncertainty-aware pseudo-labeling selection mechanism to determine which of the pseudo-labels are reliable. ([0071] Federated parameters, by contrast, may be trained remotely on the server. In particular embodiments, the on-device dialog manager 216 a may use an active federated learning model, which may transmit a global model trained on the remote server to client systems 130 and calculate gradients locally on the client systems 130. Active federated learning may enable the on-device dialog manager 216 a to minimize the transmission costs associated with downloading models and uploading gradients. 
For active federated learning, in each round, client systems 130 may be selected in a semi-random manner based at least in part on a probability conditioned on the current model and the data on the client systems 130 in order to optimize efficiency for training the federated learning model. [0148] Task Definition. Finally, we may precisely define our task of scenario-based semantic parsing. For the typical task of semantic parsing, we define U as a random variable over utterances, F as a random variable over frames, and model P(F|U): find the most likely frame given the utterance. However, because we introduce scenarios as a coarse, intermediate representation of frames, we additionally define S as a random variable over scenarios and are given all supported scenarios a priori, and model P(F|U, S)P(S|U): a coarse-to-fine objective where we (a) find the most likely scenario given the utterance and (b) find the most likely frame given the utterance and scenario.) It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to improve Li’s visual and language machine learning system with the teachings of Shrivastava, as they are both directed towards natural language processing using machine learning algorithms. The determination of obviousness is predicated upon the following findings: One skilled in the art would have been motivated to modify Li in order to leverage known parameters that would improve the overall learning capability and ensure a more accurate and interactive end-user process. 
Furthermore, the prior art collectively includes each element claimed (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface and/or programming techniques, without changing a “fundamental” operating principle of Li, while the teaching of Shrivastava continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of ensuring a more accurate and improved overall learning algorithm. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claim in question.
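The "entropy-based uncertainty-aware pseudo-labeling selection mechanism" recited in claims 5 and 13 is a standard technique: compute the Shannon entropy of each prediction's class distribution and keep only pseudo-labels where the model is sufficiently certain. The sketch below is a minimal illustration, assuming a softmax-style probability vector per prediction and an arbitrary entropy cutoff; it is not drawn from either cited reference.

```python
# Minimal sketch of entropy-based uncertainty-aware pseudo-label selection
# (claims 5 and 13). The cutoff value is an arbitrary assumption.
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_reliable(predictions, max_entropy=0.5):
    """Keep the argmax label only for predictions whose entropy is low,
    i.e. where the model is sufficiently certain to trust the pseudo-label."""
    reliable = []
    for probs in predictions:
        if entropy(probs) <= max_entropy:
            reliable.append(max(range(len(probs)), key=probs.__getitem__))
    return reliable
```

A peaked distribution like [0.95, 0.03, 0.02] has low entropy and survives the filter, while a flat one like [0.4, 0.35, 0.25] is discarded, which is the thresholding behavior claims 4 and 12 recite ("reduce the pseudo-labels by a given amount").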

Prosecution Timeline

May 18, 2023
Application Filed
Jun 10, 2025
Non-Final Rejection — §102, §103
Sep 11, 2025
Response Filed
Dec 03, 2025
Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586249
PROCESSING APPARATUS, PROCESSING METHOD, AND STORAGE MEDIUM FOR CALIBRATING AN IMAGE CAPTURE APPARATUS
2y 5m to grant; granted Mar 24, 2026
Patent 12586354
TRAINING METHOD, APPARATUS AND NON-TRANSITORY COMPUTER READABLE MEDIUM FOR A MACHINE LEARNING MODEL
2y 5m to grant; granted Mar 24, 2026
Patent 12573083
COMPUTER-READABLE RECORDING MEDIUM STORING OBJECT DETECTION PROGRAM, DEVICE, AND MACHINE LEARNING MODEL GENERATION METHOD OF TRAINING OBJECT DETECTION MODEL TO DETECT CATEGORY AND POSITION OF OBJECT
2y 5m to grant; granted Mar 10, 2026
Patent 12548297
IMAGE PROCESSING METHOD AND APPARATUS, COMPUTER DEVICE, STORAGE MEDIUM, AND PROGRAM PRODUCT BASED ON FEATURE AND DISTRIBUTION CORRELATION
2y 5m to grant; granted Feb 10, 2026
Patent 12524504
METHOD AND DATA PROCESSING SYSTEM FOR PROVIDING EXPLANATORY RADIOMICS-RELATED INFORMATION
2y 5m to grant; granted Jan 13, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
86%
Grant Probability
99%
With Interview (+17.9%)
2y 8m
Median Time to Grant
Moderate
PTA Risk
Based on 868 resolved cases by this examiner. Grant probability derived from career allow rate.
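A quick arithmetic check of the headline numbers, assuming the grant probability is the simple career ratio shown above (743 allowances out of 868 resolved cases); the page does not specify how the interview lift is folded into the 99% figure, so that adjustment is not reproduced here.

```python
# Derive the headline "86% grant probability" from the examiner's
# career data shown above: 743 granted out of 868 resolved cases.
granted, resolved = 743, 868
allow_rate = granted / resolved
print(f"{allow_rate:.1%}")  # prints 85.6%, displayed on the page as 86%
```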
