DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2. This Office Action is in response to the Amendment filed on 01/27/2026.
Response to Arguments
3. Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
4. Claims 1-17 are rejected under 35 U.S.C. 103 as being unpatentable over Krishna et al. (US 20250103642 A1) in view of Marri et al. (US 20230326178 A1).
Krishna et al. (“Krishna”) is directed to a visual search interface in an operating system.
Marri et al. (“Marri”) is directed to concept disambiguation using multimodal embeddings.
As per claim 1, Krishna discloses a system comprising: at least one processor; and memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations ([0033] Generally, the Krishna disclosure is directed to systems and methods for visual search at the operating system level. Also see the system of Fig. 1), comprising:
capturing a current screenshot of a window on a computer display ([0004] In some instances, a user may capture a screenshot and utilize the screenshot as a query image);
processing image information associated with the current screenshot into an image embedding ([0074] the one or more embedding models 240 can be utilized to embed portions of and/or all of the display data. The embeddings can then be utilized for searching for similar objects and/or text, classification, grouping, and/or compression);
receiving a plurality of text embeddings representing a plurality of predefined screenshot activities ([0121] At 802, a computing system can obtain display data. The display data can be descriptive of content currently presented for display in a first application on a user computing device. Obtaining the display data can include generating a screenshot. A screenshot can be descriptive of a plurality of pixels provided for display);
At [0074] Krishna describes that the one or more embedding models 240 can be utilized to embed portions of and/or all of the display data. The embeddings can then be utilized for searching for similar objects and/or text, classification, grouping, and/or compression.
Krishna further describes that the semantic analysis model 242 can be utilized to process the display data to generate a semantic output descriptive of an understanding of the display data with regards to topic understanding, scene understanding, a focal point, pattern recognition, application understanding, and/or one or more other semantic outputs.
Krishna, however, does not appear to clearly describe comparing the image embedding to the plurality of text embeddings.
Marri, on the other hand, teaches comparing the image embedding to the plurality of text embeddings. Marri discloses at [0019] an image processing apparatus, wherein the image processing apparatus computes a similarity score between the image embedding and each of the concept embeddings and determines a matching concept based on comparing the similarity scores. In some examples, the multi-modal encoder is trained, via contrastive learning techniques, to support a single unified text-and-digital image embedding space that treats text and digital images as the same entity (e.g., common embedding space). At [0022], by encoding images and text in a common shared embedding space, embodiments of the present disclosure calculate a similarity score between the query image and each of the concept descriptions found in the knowledge graph. In some examples, a multi-modal encoder encodes the query image to obtain an image embedding. The multi-modal encoder encodes each of the concept descriptions to obtain a concept embedding (i.e., text embedding) for each retrieved concept in the knowledge graph. The image processing apparatus compares the image and text embedding in the same embedding space and locates a nearest neighbor by comparing similarity metrics. Also see [0024, 0072, 0079].
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the above teaching of Marri with Krishna so that the image processing apparatus of Krishna computes a similarity score between the image embedding and each of the concept embeddings and determines a matching concept based on comparing the similarity scores.
Therefore, it would have been obvious to combine Marri with Krishna to obtain the invention as specified in claim 1.
Krishna in view of Marri further discloses determining at least one text embedding having a closest similarity to the image embedding, wherein the at least one text embedding represents a screenshot activity of the plurality of screenshot activities (Marri, [0004] the image processing apparatus computes a similarity score between the image embedding and each of the concept embeddings and determines a matching concept based on comparing the similarity scores. [0024] The image processing apparatus extracts multi-modal visual embedding from the query image and multimodal text embedding for each of the candidate concept descriptions. The image processing apparatus determines the closest concept description based on computing a similarity score between the visual embedding and each of the candidate concept embeddings). An illustrative sketch of this type of embedding comparison is provided after the claim 1 discussion below.
Krishna in view of Marri further discloses determining a user is performing a screen activity corresponding to the screenshot activity represented by the at least one text embedding (Krishna, [0104] FIG. 5C depicts an illustration of an example visual search of an order confirmation page 538. The order confirmation page 538 may be processed to determine an action suggestion and one or more query suggestions 540. A particular query suggestion (e.g., “What furniture matches this?”) of the one or more query suggestions 540 may be selected. The selected query suggestion and a screenshot of the order confirmation page 538 can be utilized as a multimodal query 542. Also see [0191]); and
Krishna in view of Marri further discloses based on the screen activity, determining at least one prompt for assisting the user in performing the screen activity (Krishna, [0104] FIG. 5C depicts an illustration of an example visual search of an order confirmation page 538. The order confirmation page 538 may be processed to determine an action suggestion and one or more query suggestions 540. A particular query suggestion (e.g., “What furniture matches this?”) of the one or more query suggestions 540 may be selected. The selected query suggestion and a screenshot of the order confirmation page 538 can be utilized as a multimodal query 542).
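For context only, the following minimal sketch, which is not taken from Krishna or Marri, illustrates the type of comparison recited in claim 1 and described at Marri [0019] and [0022]: an image embedding and a set of text embeddings are assumed to lie in a shared multimodal embedding space, a cosine similarity score is computed for each text embedding, and the closest (nearest-neighbor) activity is selected. The function name, embedding dimensions, and activity labels are hypothetical.

import numpy as np

def closest_activity(image_embedding, text_embeddings, activity_labels):
    # Normalize so the dot product equals cosine similarity.
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = txt @ img                 # one similarity score per predefined activity
    best = int(np.argmax(scores))      # nearest neighbor in the shared space
    return activity_labels[best], float(scores[best])

# Hypothetical usage: the embeddings would come from a multimodal encoder
# of the kind described at Krishna [0074] and Marri [0019].
activity, score = closest_activity(
    np.random.rand(512),               # image embedding of the current screenshot
    np.random.rand(3, 512),            # text embeddings of the predefined activities
    ["composing email", "online shopping", "reading documentation"],
)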
As per claim 2, Krishna in view of Marri further discloses the system of claim 1, the set of operations further comprising: processing the image information associated with the current screenshot to detect textual information; and based on the textual information, determining one or more topics associated with the screen activity (Krishna, [0080] In some implementations, the display data can be processed with an object detection model to generate a plurality of bounding boxes associated with a plurality of detected objects. The display data and/or the plurality of bounding boxes can then be processed with a segmentation model to generate a plurality of segmentation masks associated with the silhouettes for the plurality of detected objects. The plurality of segmentation masks can be utilized to generate user interface indicators that indicate what objects are detected along with outlines for the detected objects. In some implementations, the display data can be processed with one or more classification models to generate one or more object classifications for objects depicted in the displayed content. Additionally and/or alternatively, the display data can be processed with an optical character recognition model to generate text data descriptive of text in the displayed content. The display data and/or the text data may be processed to determine one or more entities associated with the text and/or the objects in the displayed content. One or more user interface elements can be generated and provided to provide an indication of the determined entities to the user. Also see [0058, 0063, 0069, 0072, 0091-0092] and Fig. 4B).
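For context only, a minimal sketch, not taken from the cited references, of the kind of text-to-topic determination recited in claim 2: text detected in the screenshot (for example, by an optical character recognition model of the type described at Krishna [0080]) is mapped to one or more topics. The keyword table, topics, and example text are hypothetical.

def determine_topics(screenshot_text):
    # Hypothetical keyword table; a deployed system could use entity
    # detection or a classifier instead of simple keyword matching.
    topic_keywords = {
        "travel": {"flight", "hotel", "itinerary"},
        "shopping": {"cart", "checkout", "order"},
        "email": {"inbox", "compose", "subject"},
    }
    words = {w.lower().strip(".,:;!?") for w in screenshot_text.split()}
    return [topic for topic, keywords in topic_keywords.items() if words & keywords]

# Example with hypothetical OCR output from an order-confirmation screenshot:
print(determine_topics("Thank you! Your order has shipped. View cart."))
# -> ['shopping']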
As per claim 3, Krishna in view of Marri further discloses the system of claim 2, the set of operations further comprising: based on the one or more topics, determining the at least one prompt for assisting the user in performing the screen activity (Krishna, [0042] Additionally and/or alternatively, the visual search interface in the operating system can be configured to obtain prompt inputs from the user to aggregate data from a plurality of different applications on the computing device and/or the web. The prompt can be processed to determine one or more applications on the computing device are associated with a topic, task, and/or content type associated with the request of the prompt. [0050] additionally and/or alternatively, the visual search interface 14 can be utilized to process web information displayed in a viewing window of the second application 18 to generate image annotations, provide suggested searches, provide additional information, and/or suggest actions. Also see [0049, 0091, 0092, 0122, 0137]).
As per claim 4, Krishna in view of Marri further discloses that the system of claim 3, wherein determining the at least one prompt further comprises: receiving a plurality of predefined prompts for the screen activity, wherein each predefined prompt includes a placeholder; and generating the at least one prompt by replacing the placeholder with a topic of the one or more topics (Krishna, [0110] FIG. 6B depicts an illustration of structured output augmentation. The user may be budget conscious, and the structured output 610 may indicate that a current model-generated wish list is above budget. A user may select a particular item on the wish list to replace to (a) meet a budget and/or (b) replace the item with a different product based on one or more user preferences. The system can process a selection of a “suggest sofas” option to determine products of the particular product type that match a style, aesthetic, price range, and/or other preferences for the user. A plurality of product alternatives can then be provided for display in a carousel interface 614 for a user to view and select to augment the structured output. The plurality of product alternatives may be obtained from one or more applications and/or from the web. A particular alternative may be selected by the user and may be processed to generate an augmented structured output 616 that updates at least a portion of the structured output 610).
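For context only, a minimal sketch, not taken from the cited references, of the placeholder substitution recited in claim 4: each predefined prompt contains a placeholder that is replaced with a topic determined from the screenshot. The template strings and topic are hypothetical.

# Hypothetical predefined prompts, each containing a {topic} placeholder.
PREDEFINED_PROMPTS = [
    "Summarize what is on screen about {topic}.",
    "What should I do next on this {topic} page?",
]

def build_prompts(topics):
    # Replace the placeholder in every predefined prompt with every detected topic.
    return [template.format(topic=topic)
            for template in PREDEFINED_PROMPTS
            for topic in topics]

print(build_prompts(["shopping"]))
# -> ['Summarize what is on screen about shopping.',
#     'What should I do next on this shopping page?']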
As per claim 5, Krishna in view of Marri further discloses that the system of claim 3, wherein determining the at least one prompt further comprises: generating the at least one prompt by querying a large language model (LLM) based on the screen activity and the one or more topics (Krishna, [0109] A plurality of content items can be obtained from the one or more particular applications based on the prompt 604. The plurality of content items can be processed with a machine-learned model to generate a structured output 610 that provides information from the plurality of content items in an organized format. [0126] At 806, the computing system can determine a particular second application on the computing device is associated with the visual search data. For example, the computing system can process the visual search data with a machine-learned suggestion model to determine a second application is associated with the one or more visual search results).
As per claim 6, Krishna in view of Marri further discloses that the system of claim 5, wherein determining the at least one prompt further comprises: receiving partial user input; and generating the at least one prompt by querying the LLM in a loop to complete the partial user input (Krishna, [0131] In some implementations, generating the model-generated content item based on the selection of the option can include processing the visual search data and data associated with the particular second application to determine a suggested prompt, receiving input selecting the suggested prompt, and processing the suggested prompt and the visual search data with the generative model to generate the model-generated content item. The model-generated content item can then be transmitted to the second application).
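For context only, a minimal sketch, not taken from the cited references, of the loop recited in claim 6: partial user input is repeatedly submitted to an LLM until the completion stops changing or a round limit is reached. The complete_fn callable is a hypothetical stand-in for any LLM completion call and does not correspond to any API of the cited references.

from typing import Callable

def complete_partial_input(partial_input: str,
                           screen_activity: str,
                           complete_fn: Callable[[str], str],
                           max_rounds: int = 3) -> str:
    # Query the LLM in a loop; stop when the completion stabilizes or the
    # round limit is reached.
    completion = partial_input
    for _ in range(max_rounds):
        query = (f"The user is {screen_activity}. "
                 f"Complete this request so it can be used as a prompt: {completion}")
        new_completion = complete_fn(query)
        if new_completion == completion:
            break
        completion = new_completion
    return completion

# Dummy stand-in for an LLM call, for illustration only.
print(complete_partial_input(
    "find cheaper",
    "online shopping",
    lambda q: "find cheaper alternatives to the items in my cart"))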
As per claim 7, Krishna in view of Marri further discloses the system of claim 1, the set of operations further comprising: causing display of the at least one prompt to the user for selectively initiating an interaction with an AI agent (Krishna, [0106] A prompt may be generated based on the selected subset 548 and/or based on one or more user inputs. The prompt can be processed to generate a model-generated message 554, which can then be added to a draft email message 556 in the email application 552. The draft email message 556 can be sent with a data packet 558 descriptive of the selected subset 548. The data packet 558 may include a model-generated content item that includes details associated with the selected subset 548. [0107] FIG. 6A depicts an illustration of prompt generation and processing. Also see [0111-0112], [0131], [0138], [0202], [0203]).
As per claim 8, Krishna in view of Marri further discloses that the system of claim 1, wherein determining the at least one prompt includes selecting the at least one prompt from a plurality of predefined prompts for the screen activity (Krishna, [0102] The overlay interface 524 can depict a model-generated prompt (and/or a user input prompt) that can be processed to generate a model-generated message 526. The model-generated prompt may be generated based on the visual search data and/or the selection of the particular application suggestion. The model-generated message 526 can be sent with a data packet 528 with the web page 518. [0106] A prompt may be generated based on the selected subset 548 and/or based on one or more user inputs. The prompt can be processed to generate a model-generated message 554, which can then be added to a draft email message 556 in the email application 552. The draft email message 556 can be sent with a data packet 558 descriptive of the selected subset 548. The data packet 558 may include a model-generated content item that includes details associated with the selected subset 548. Also see [0112]).
As per claim 9, Krishna in view of Marri further discloses that the system of claim 1, wherein the image information of the current screenshot is processed using one or more machine learning (ML) models (Krishna, [0009] The one or more on-device machine-learned models may have been trained to process image data to generate one or more machine-learned outputs based on detected features in the display data. Also see [0042, 0044, 0055, and 0074]).
As per claim 10, Krishna in view of Marri further discloses that the system of claim 1, wherein determining the at least one text embedding having the closest similarity to the image embedding is performed using a multimodal ML model (Krishna, [0074] the one or more embedding models 240 can be utilized to embed portions of and/or all of the display data. The embeddings can then be utilized for searching for similar objects and/or text, classification, grouping, and/or compression. The semantic analysis model 242 can be utilized to process the display data to generate a semantic output descriptive of an understanding of the display data with regards to topic understanding, scene understanding, a focal point, pattern recognition, application understanding, and/or one or more other semantic outputs. [0123] In some implementations, the one or more visual search results can include reverse image search results. The one or more visual search results can be determined based on detected features. The one or more visual search results can include similar images to the one or more images of the display data, can include similar objects to detected objects in the display data, can include similar interfaces to detected user interface features in the display data, determined caption data, determined classifications, and/or other search result data. The visual search data may include an output of the one or more classification models, one or more augmentation models, and/or one or more generative vision language models. For example, the display data may be processed with a machine-learned vision language model to generate a predicted caption for the display data. Also see [0083]).
As per method claims 11-17, Krishna in view of Marri discloses a method (Krishna, see, for example, the flowcharts of Figs. 3, 8, and 11). The method steps include limitations that correspond to those of system claims 1, 4, 5, 6, 7, 10, and 2, respectively. Thus, the method claims are also rejected based on citations similar to those given for the corresponding system claims.
5. Claims 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over XIE et al. (US 20230394855 A1) in view of Birru et al. (US 20250165744 A1).
XIE is directed to an image paragraph generator.
Birru is directed to a method and system for integrated multimodal input processing for virtual agents.
As per claim 18, XIE discloses a method (flowcharts of Figs. 4A-4B and 5) of automatically determining at least one prompt for assisting a user in performing a screen activity, comprising:
capturing a current screenshot of a window on a computer display ([0032] In response to receiving an image I (image 102), the examples of the disclosure generate long and coherent descriptive text based on image inputs that leverage pre-trained models);
receiving a visual-language model (VLM) instruction to evaluate a screenshot and generate at least one prompt ([0054] Operation 416 includes, based on at least evaluation of image 102 and plurality of image story caption candidates 144 by vision language model 150, selecting selected story caption 154, which may have previously been story caption 141, from among plurality of image story caption candidates 144);
receiving an example screenshot and one or more example prompts generated based on the example screenshot ([0015] Example solutions for image paragraph captioning use a first vision language model to generate visual information comprising text for an image. [0031] Returning again to FIG. 1, further detail, to complement the description above, is provided. In response to receiving an image I (image 102), the examples of the disclosure generate long and coherent descriptive text based on image inputs that leverage pre-trained models. In some examples, this is accomplished in three stages: (1) Represent I with visual clues S (visual clues 130), which contain rich visual information, using at least visual language model 112 and captioner 114);
evaluating the current screenshot based on the received VLM instruction, the example screenshot and the one or more example prompts, wherein the evaluating includes at least one of processing text or images associated with the current screenshot ([0029] Measuring caption accuracy on scene graphs extracted from generated text and extracted image tags, which may include object tags about specific objects in the image, may match better with human judgment for what constitutes a “good description.” For example, given an image with content “A man sitting in front of a blue snowboard”, a good evaluation for a generated caption should determine whether each of the semantic propositions is correct, namely, (1) a man is sitting; (2) a man is in front of a snowboard; and (3) the snowboard is blue, rather than matching the exact words of another description. Thus, graphs 304 and 314 are compared. Also see [0054]);
based on the evaluating the current screenshot, determining at least one prompt for assisting the user ([0060] Operation 504 includes generating, by a generative language model, from the visual information, a plurality of image story caption candidates. Operation 506 includes, based on at least evaluation of the image and the plurality of image story caption candidates by a second vision language model, selecting a selected caption from among the plurality of image story caption candidates); and
XIE, however, falls short of teaching causing display of the at least one prompt to the user for selectively initiating an interaction with an AI agent.
Birru, on the other hand, is directed to a method and system for integrated multimodal input processing for virtual agents. At [0140], Birru describes that the method 500 illustrated by the flow diagram of FIG. 5 for multimodal input processing starts at 502. The method 500 may include, at step 504, obtaining multimodal input by the virtual agent from a user. The multimodal input includes data from modalities comprising sensors, ensembled data, speech, text, and vision. The virtual agent may employ an Artificial Intelligence (AI) model. In some embodiments, the AI model may be a Generative AI (GenAI) model. Examples of the GenAI may include, but are not limited to, LLM, VLM, and the like. In some embodiments, the Gen AI may be an ensembled model.
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to combine the teaching of Birru with XIE so that a user of XIE would be able to obtain assistance from, or interact with, an AI agent.
Therefore, it would have been obvious to combine XIE with Birru to obtain the invention as specified in claim 18.
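For context only, a minimal sketch, not taken from XIE or Birru, of the kind of few-shot request recited in claim 18: a VLM instruction, an example screenshot with its example prompts, and the current screenshot are assembled into a single multimodal request for evaluation. The message structure is hypothetical and does not correspond to any particular VLM API.

from dataclasses import dataclass

@dataclass
class VlmMessage:
    role: str               # "system" or "user"
    text: str = ""
    image_path: str = ""    # path to a screenshot image, if any

def build_fewshot_request(instruction, example_screenshot, example_prompts,
                          current_screenshot):
    # One instruction message, one few-shot example (screenshot plus the
    # prompts previously generated for it), and the current screenshot.
    return [
        VlmMessage(role="system", text=instruction),
        VlmMessage(role="user",
                   text="Example prompts for this screenshot: " + "; ".join(example_prompts),
                   image_path=example_screenshot),
        VlmMessage(role="user",
                   text="Generate prompts for this screenshot.",
                   image_path=current_screenshot),
    ]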
As per claim 19, XIE in view of Birru further discloses the method of claim 18, further comprising: evaluating the current screenshot to determine the user is performing a screen activity (XIE, [0056] In one option, operation 454 detects plurality of objects 118 in image 302 and operation 456 determines image tags from the detected objects in image 302. In an alternative version, a human authors “ground truth” baseline description 306 in operation 474, and operation 476 determines image tags from baseline description 306. Using the image tags, operation 458 generates baseline graph 304).
As per claim 20, XIE in view of Birru further discloses the method of claim 19, wherein the at least one prompt is for assisting the user in performing the screen activity (Birru, [0045] The user input may represent a query, request, or command from the user 102, indicating their intention or information they seek. The user input serves as the starting point for the virtual agent 104 to understand the user's needs and provide appropriate assistance or information).
Conclusion
6. Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
7. Any inquiry concerning this communication or earlier communications from the examiner should be directed to TADESSE HAILU, whose telephone number is (571) 272-4051 and whose email address is Tadesse.hailu@USPTO.GOV. The examiner can normally be reached Monday - Friday, 9:30-5:30 (Eastern time).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William L. Bashore, can be reached at (571) 272-4088. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TADESSE HAILU/Primary Examiner, Art Unit 2174