Prosecution Insights
Last updated: April 19, 2026
Application No. 18/439,322

TILE-BASED IMAGE UNDERSTANDING IN VISION AND LANGUAGE MODELS

Non-Final OA: §101, §103
Filed: Feb 12, 2024
Examiner: WELLS, HEATH E
Art Unit: 2664
Tech Center: 2600 (Communications)
Assignee: Google LLC
OA Round: 1 (Non-Final)
Grant Probability: 75% (Favorable)
OA Rounds: 1-2
To Grant: 3y 5m
Grant Probability With Interview: 93%

Examiner Intelligence

Career Allow Rate: 75% (above average; 58 granted / 77 resolved; +13.3% vs TC avg)
Interview Lift: +18.1% for resolved cases with interview (strong)
Typical Timeline: 3y 5m avg prosecution; 46 currently pending
Career History: 123 total applications across all art units
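The headline figures above reduce to simple arithmetic on the raw counts shown. A minimal sketch, assuming the dashboard's counts (58 granted of 77 resolved) and its reported +13.3-point delta; the variable names are ours, not part of any analytics API:

```python
# Career allow rate from the raw counts displayed on the dashboard.
granted, resolved = 58, 77
allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")  # 75.3%, displayed as 75%

# The "+13.3% vs TC avg" delta implies the Tech Center average below.
tc_delta = 0.133
print(f"Implied TC average: {allow_rate - tc_delta:.1%}")  # ~62.0%
```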

Statute-Specific Performance

§101: 17.8% (-22.2% vs TC avg)
§103: 62.8% (+22.8% vs TC avg)
§102: 2.4% (-37.6% vs TC avg)
§112: 13.8% (-26.2% vs TC avg)
Tech Center averages are estimates • Based on career data from 77 resolved cases
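As a consistency check, each delta above is the examiner's rate minus the estimated Tech Center average, so the baseline can be backed out. A quick sketch with the displayed rates and deltas hard-coded (the dict layout is illustrative, not a real data schema):

```python
# Examiner rate and reported delta vs the Tech Center average, in percent.
examiner = {"101": 17.8, "103": 62.8, "102": 2.4, "112": 13.8}
delta    = {"101": -22.2, "103": 22.8, "102": -37.6, "112": -26.2}

# Implied TC average per statute: examiner rate minus reported delta.
implied_tc = {s: round(examiner[s] - delta[s], 1) for s in examiner}
print(implied_tc)  # every statute backs out to the same 40.0% baseline
```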

Office Action

Rejections: §101, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged that this application is a National Stage application of PCT/US25/15393. Priority to PCT/US25/15393 with a priority date of 11 February 2025 is acknowledged under 35 USC 119(e) and 37 CFR 1.78.

Information Disclosure Statement

The IDSs dated 12 February 2024 and 21 July 2025 have been considered and placed in the application file.

Specification - Abstract

Applicant is reminded of the proper content of an abstract of the disclosure. A patent abstract is a concise statement of the technical disclosure of the patent and should include that which is new in the art to which the invention pertains. The abstract should not refer to purported merits or speculative applications of the invention and should not compare the invention with the prior art. If the patent is of a basic nature, the entire technical disclosure may be new in the art, and the abstract should be directed to the entire disclosure. If the patent is in the nature of an improvement in an old apparatus, process, product, or composition, the abstract should include the technical disclosure of the improvement. The abstract should also mention by way of example any preferred modifications or alternatives. Where applicable, the abstract should include the following: (1) if a machine or apparatus, its organization and operation; (2) if an article, its method of making; (3) if a chemical compound, its identity and use; (4) if a mixture, its ingredients; (5) if a process, the steps. Extensive mechanical and design details of an apparatus should not be included in the abstract. The abstract should not contain legal language such as "comprising". The abstract should be in narrative form and generally limited to a single paragraph within the range of 50 to 150 words in length.
The sheet or sheets presenting the abstract may not include other parts of the application or other material. See MPEP § 608.01(b) for guidelines for the preparation of patent abstracts.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-22 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. All of the claims (1-22) are method claims and thus fall within a statutory category (Step 1), but under Step 2A all of these claims recite abstract ideas, and specifically mental processes: concepts performed in the human mind including observation, evaluation, judgment and opinion, which are generally described as a human visually observing a label to judge the locations and dimensions of empty regions in order to insert content into these empty regions. These mental processes are more particularly:

Recited in claim 1 as: generating, from the input image and the input query, a plurality of image tiles… receiving, from the one or more image analysis models, a plurality of image facts… generating, using a vision and language model, a response to the input query…

Recited in claim 19 as: generating, from the input image and the input query, a plurality of image tiles… generating, using a vision and language model, a response to the input query… generating a natural language response to the input query… associating an element of the natural language response with a respective one or more portions of the input image…

It is noted that the above analysis is according to the 2019 Revised Patent Subject Matter Eligibility Guidance published in the Federal Register (84 FR 50) on January 7, 2019 and MPEP 2106.04(a)(2)(III).
Consider also that "If a claim recites a limitation that can practically be performed in the human mind, with or without the use of a physical aid such as pen and paper, the limitation falls within the mental processes grouping, and the claim recites an abstract idea" as per MPEP 2106.04(a)(2)(III)(B). See also footnotes 14 and 15 of the Federal Register Notice. As detailed above, the steps of generating, modifying, converting, etc. may be practically performed in the human mind with the use of a physical aid such as a pen and paper (marking the label on the package with a pen).

Under Step 2B, this judicial exception is not integrated into a practical application because claims 1-22 do not recite additional elements that integrate the exception into a practical application. The only additional elements (bounding boxes, altering resolution, and causing an indication of the respective one or more portions of the input image to be rendered at the client device) are recited at a high level of generality and merely equate to "apply it," or otherwise merely use a generic computer as a tool to perform an abstract idea, which is not indicative of integration into a practical application as per MPEP 2106.05(f). See also MPEP 2106.04(a)(2)(III) with respect to Mental Processes: "Nor do the courts distinguish between claims that recite mental processes performed by humans and claims that recite mental processes performed on a computer". See also MPEP 2106.04(a)(2)(III)(C)(3), Using a computer as a tool to perform a mental process, and MPEP 2106.04(a)(2)(III)(D), as well as the case law cited therein. In other words, the additional elements are recited at a high level of generality that does not amount to significantly more, and/or they could practically be performed in the human mind. For all of the above reasons, taken alone or in combination, claims 1-22 recite a non-statutory mental process.
1st Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.
Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-17 and 19-22 are rejected under 35 U.S.C. 103 as obvious over US Patent Publication 2025/0371863 A1 (Hall et al.).

Claim 1

[Figure: Hall et al. Fig. 2, showing a query sectioned into sub-images in a text and image feature space.]

Regarding Claim 1, Hall et al. teach a method implemented by one or more processors ("These models are capable of interpreting and reasoning over both visual and textual inputs, enabling them to perform classification tasks of images," paragraph [0006]), the method comprising: receiving an input query associated with a client device, the input query comprising an input image and an input text query ("receiving information that defines the scope, criteria, goal, and/or parameters of a domain-specific image diagnostic task.
In some embodiments, this information may include a set of possible diagnostic labels, information on suitable or usable imaging modalities or modality specifications ( e.g., resolution, noise ratios, image orientation or perspective, magnification, coloring method, anatomical region, etc.), and task-specific constraints or goals (e.g., diagnostic accuracy thresholds, interpretability requirements)," paragraph [0023]); generating, from the input image and the input query, a plurality of image tiles, wherein each image tile is a sub-image of the input image ("the system may crop or re-size images to ensure compatible image dimensions, adjust color gamuts and other visual settings for consistency, segment or threshold pixels or voxels according to any of their properties ( e.g., brightness, color channels, similar groupings, etc.), apply noise reduction or contrast enhancement, or extract image tiles or patches," paragraph [0028]); providing, to one or more image analysis models, the plurality of image tiles ("The Initial Prompt Set of images can be used to serve as an unstructured component of a refined or tuned instruction set to a vision-language model, to serve as a verified set of examples that have been confirmed by a desired human-in-the-loop," paragraph [0036]); receiving, from the one or more image analysis models, a plurality of image facts, each image fact corresponding to a respective one or more of the image tiles ("generating or validating a System Prompt that provides context and general diagnostic instructions to the vision-language model. 
The System Prompt may serve as a static or semi-static textual component that defines the diagnostic context, expected output format, and interpretive role of the model during inference," paragraph [0039] where generating includes receiving from a model); generating, using a vision and language model, a response to the input query based on the input query, the plurality of image tiles, and the plurality of image facts ("instructions for exactly what information the vision-language model should output as well as instructions for how it should handle unclear, ambiguous or marginal cases (e.g., inconsistent or low confidence levels)," paragraph [0044]); and causing the response to the input query to be rendered at the client device ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt (see block 114)," paragraph [0061]).

It is recognized that the citations and evidence provided above are derived from potentially different embodiments of a single reference. Nevertheless, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains to employ combinations and sub-combinations of these complementary embodiments, because Hall et al. explicitly motivate doing so at least in paragraphs [0010], [0019] and [0135] including "In the foregoing specification, implementations of the disclosure have been described with reference to specific example implementations thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of implementations of the disclosure" and otherwise motivating experimentation and optimization.
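For orientation, the claim 1 method mapped above can be sketched as a pipeline: tile the input image, fan the tiles out to one or more image analysis models, collect per-tile "image facts", then condition a vision and language model on the query, the tiles, and the facts. Every name below is a hypothetical placeholder, not an API from the application or from Hall:

```python
# Hypothetical sketch of the claimed tile-based pipeline; all names are ours.
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class Tile:
    box: tuple   # (x, y, w, h) region of the input image
    pixels: Any  # cropped sub-image data

def answer_query(image: Any, text_query: str,
                 make_tiles: Callable, analysis_models: List[Callable],
                 vlm: Callable) -> str:
    tiles = make_tiles(image, text_query)                   # plurality of image tiles
    facts = [m(t) for t in tiles for m in analysis_models]  # per-tile image facts
    # The response is generated from the query, the tiles, and the facts together.
    return vlm(query=text_query, tiles=tiles, facts=facts)
```

With stub callables this wires the recited steps end to end; a real system would substitute, e.g., an object detector for `make_tiles` and VLM inference for `vlm`.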
The rejection of method claim 1 above applies mutatis mutandis to the corresponding limitations of method claim 19 while noting that the rejection above cites to both device and method disclosures. Claim 19 is mapped below for clarity of the record and to specify any new limitations not included in claim 1. Claim 2 Regarding claim 2, Hall et al. teach the method of claim 1, further comprising: generating, using the vision and language model or a further vision and language model, one or more search requests based on the input query, the plurality of image tiles, and the plurality of image facts ("The images of the Initial Prompt Set, combined with the structured information ( e.g., labeling and related approaches) and unstructured information, are then transformed into Prompt Set entries to be utilized in an improved review/diagnosis procedure as described below," paragraph [0037]); providing, to a search engine, the one or more search requests ("The host platform may include, but is not limited to, commercial diagnostic review suites such as Philips IntelliSite Pathology, Leica Aperio eSlide Manager, or Sectra PACS; web-based portals for remote image review; custom-built software applications; or image search engines that support case retrieval and triage," paragraph [0085]); and receiving, from the search engine, one or more search responses to the one or more search requests ("instructions for exactly what information the vision-language model should output as well as instructions for how it should handle unclear, ambiguous or marginal cases (e.g., inconsistent or low confidence levels)," paragraph [0044]), wherein generating, using the vision and language model, the response to the input query is further based on the one or more search responses ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt 
(see block 114)," paragraph [0061]). Claim 3 Regarding claim 3, Hall et al. teach the method of claim 1, wherein generating, from the input image and the input query, the plurality of image tiles comprises: inputting, into the vision and language model, the input image and the input query ("The Initial Prompt Set of images can be used to serve as an unstructured component of a refined or tuned instruction set to a vision-language model, to serve as a verified set of examples that have been confirmed by a desired human-in-the-loop," paragraph [0036]); processing the input image and the input query using the vision and language model to generate the plurality of image tiles ("the system may crop or re-size images to ensure compatible image dimensions, adjust color gamuts and other visual settings for consistency, segment or threshold pixels or voxels according to any of their properties ( e.g., brightness, color channels, similar groupings, etc.), apply noise reduction or contrast enhancement, or extract image tiles or patches," paragraph [0028]); and outputting from the vision and language model, the plurality of image tiles ("the system may crop or re-size images to ensure compatible image dimensions, adjust color gamuts and other visual settings for consistency, segment or threshold pixels or voxels according to any of their properties ( e.g., brightness, color channels, similar groupings, etc.), apply noise reduction or contrast enhancement, or extract image tiles or patches," paragraph [0028]). Claim 4 Regarding claim 4, Hall et al. teach the method of claim 1, wherein generating, from the input image and the input query, the plurality of image tiles comprises generating, from the input image and the input query, the plurality of image tiles using an object detection model ("The cloud resource may use computer vision techniques like object detection, edge detection, etc. 
to crop images of individual apples from the iPhone images, and process them via specific instructions to the VLM, then return predicted diagnoses with corresponding cropped individual apple images for the user to confirm," paragraph [0121]). Claim 5 Regarding claim 5, Hall et al. teach the method of claim 1, wherein providing, to the one or more image analysis models, the plurality of image tiles comprises: determining, using the vision and language model, a classification for one or more of the plurality of image tiles ("These models are capable of interpreting and reasoning over both visual and textual inputs, enabling them to perform classification tasks of images," paragraph [0006]); and providing each of the one or more image tiles to a respective one or more image analysis models in a plurality of image analysis models based at least in part on the respective image classification of the tile ("The Initial Prompt Set of images can be used to serve as an unstructured component of a refined or tuned instruction set to a vision-language model, to serve as a verified set of examples that have been confirmed by a desired human-in-the-loop," paragraph [0036]). Claim 6 Regarding claim 6, Hall et al. teach the method of claim 1, wherein providing, to the one or more image analysis models, the plurality of image tiles comprises providing the plurality of image tiles to the one or more image analysis models in parallel ("The input monitoring module may optionally operate in parallel with main diagnostic tasks performed by device 202, to evaluate whether the input images, while nominally matching a known diagnostic task," paragraph [0090]). Claim 7 Regarding claim 7, Hall et al. 
teach the method of claim 1, wherein the plurality of image tiles comprises a plurality of bounding boxes, each bounding box corresponding to a respective one or more objects in the input image ("The VLM may then generate structured outputs according to the System Prompt, including bounding boxes, classification labels, and explanatory captions," paragraph [0125]). Claim 8 Regarding claim 8, Hall et al. teach the method of claim 7, wherein one or more of the image facts relate to one or more objects in a bounding box of the bounding boxes ("The VLM may then generate structured outputs according to the System Prompt, including bounding boxes, classification labels, and explanatory captions," paragraph [0125]). Claim 9 Regarding claim 9, Hall et al. teach the method of claim 7, wherein one or more of the bounding boxes are rotated bounding boxes ("For example, images may be rotated, mirrored, cropped, brightened/darkened, or otherwise modified in ways that are known not to affect diagnosis/categorization based on the domain of the diagnostic task," paragraph [0069] where rotating an image teaches rotating the bounding box). Claim 10 Regarding claim 10, Hall et al. teach the method of claim 1, wherein generating, using the vision and language model, the response to the input query comprises sequentially inputting the plurality of image tiles and respective image facts into the vision and language model ("The module of block 114 may also (e.g., transparently to the user) experiment with random re-ordering/re-sequencing/adjustment of the Prompt Set component to attempt multiple permutations and assess whether doing so meaningfully affects model output," paragraph [0056]). Claim 11 Regarding claim 11, Hall et al.
teach the method of claim 1, wherein generating, using the vision and language model, the response to the input query comprises inputting the plurality of image tiles and respective image facts into the vision and language model in parallel ("The input monitoring module may optionally operate in parallel with main diagnostic tasks performed by device 202, to evaluate whether the input images, while nominally matching a known diagnostic task," paragraph [0090]). Claim 12 Regarding claim 12, Hall et al. teach the method of claim 1, wherein the method further comprises: generating, based on one or more of the image facts, one or more sub-tiles of an image tile ("extract image tiles or patches from larger images according to specified positions or content criteria (e.g., excluding blank areas, defining patches by information density, etc.). These preprocessing steps may be performed automatically or based on parameters defined in the task definition object," paragraph [0028]); providing, to the one or more image analysis models, the one or more sub-tiles ("The Initial Prompt Set of images can be used to serve as an unstructured component of a refined or tuned instruction set to a vision-language model, to serve as a verified set of examples that have been confirmed by a desired human-in-the-loop," paragraph [0036]); and receiving, from the one or more image analysis models, one or more further image facts, each further image fact corresponding to a respective one or more of the sub-tiles ("generating or validating a System Prompt that provides context and general diagnostic instructions to the vision-language model. 
The System Prompt may serve as a static or semi-static textual component that defines the diagnostic context, expected output format, and interpretive role of the model during inference," paragraph [0039] where generating includes receiving from a model), wherein generating, using the vision and language model, the response to the input query is further based on the one or more further image facts ("instructions for exactly what information the vision-language model should output as well as instructions for how it should handle unclear, ambiguous or marginal cases (e.g., inconsistent or low confidence levels)," paragraph [0044]). Claim 13 Regarding claim 13, Hall et al. teach the method of claim 1: wherein generating, using a vision and language model, a response to the input query comprises: generating a natural language response to the input query ("instructions for exactly what information the vision-language model should output as well as instructions for how it should handle unclear, ambiguous or marginal cases (e.g., inconsistent or low confidence levels)," paragraph [0044]); and associating an element of the natural language response with a respective one or more portions of the input image, the respective one or more portions of the input image corresponding to portions of the input image relevant to the element of the natural language response ("block 112 may include associating each Prompt Set entry with contextual metadata from the image or task definition object, such as imaging modality, acquisition parameters, or anatomical region," paragraph [0053]); wherein causing the response to the input query to be rendered at the client device comprises causing the natural language response to the input query to be rendered at the client device ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt (see
block 114)," paragraph [0061]); and wherein the method further comprises: receiving an indication that the element of the natural language response has been selected at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034]); and causing an indication of the respective one or more portions of the input image to be rendered at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034] where prompting a user to select teaches an indication of the element in the response). Claim 14 Regarding claim 14, Hall et al. teach the method of claim 13, wherein the respective one or more portions of the input image comprise one or more image tiles ("the system may crop or re-size images to ensure compatible image dimensions, adjust color gamuts and other visual settings for consistency, segment or threshold pixels or voxels according to any of their properties (e.g., brightness, color channels, similar groupings, etc.), apply noise reduction or contrast enhancement, or extract image tiles or patches," paragraph [0028]). Claim 15 Regarding claim 15, Hall et al.
teach the method of claim 14, wherein the indication of the respective one or more portions of the input image comprises one or more bounding boxes, each bounding box corresponding to a respective image tile ("The VLM may then generate structured outputs according to the System Prompt, including bounding boxes, classification labels, and explanatory captions," paragraph [0125]). Claim 16 Regarding claim 16, Hall et al. teach the method of claim 13, wherein the method further comprises: receiving an indication that one or more of the respective one or more portions of the input image has been selected at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034]); and causing an indication of the element of the natural language response to be rendered at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034] where prompting a user to select teaches an indication of the element in the response). Claim 17 Regarding claim 17, Hall et al.
teach the method of claim 1: wherein generating, using a vision and language model, a response to the input query comprises: generating a natural language response to the input query ("instructions for exactly what information the vision-language model should output as well as instructions for how it should handle unclear, ambiguous or marginal cases (e.g., inconsistent or low confidence levels)," paragraph [0044]); and associating an element of the natural language response with a respective one or more portions of the input image, the respective one or more portions of the input image corresponding to portions of the input image relevant to the element of the natural language response ("block 112 may include associating each Prompt Set entry with contextual metadata from the image or task definition object, such as imaging modality, acquisition parameters, or anatomical region," paragraph [0053]); wherein causing the response to the input query to be rendered at the client device comprises causing the natural language response to the input query to be rendered at the client device ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt (see block 114)," paragraph [0061]); and wherein the method further comprises: receiving an indication that one or more of the respective one or more portions of the input image has been selected at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set.
For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034]); and causing an indication of the element of the natural language response to be rendered at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034] where prompting a user to select teaches an indication of the element in the response). Claim 19 Regarding claim 19, Hall et al. teach a method implemented by one or more processors ("These models are capable of interpreting and reasoning over both visual and textual inputs, enabling them to perform classification tasks of images," paragraph [0006]), the method comprising: receiving an input query associated with a client device, the input query comprising an input image and an input text query ("receiving information that defines the scope, criteria, goal, and/or parameters of a domain-specific image diagnostic task.
In some embodiments, this information may include a set of possible diagnostic labels, information on suitable or usable imaging modalities or modality specifications ( e.g., resolution, noise ratios, image orientation or perspective, magnification, coloring method, anatomical region, etc.), and task-specific constraints or goals (e.g., diagnostic accuracy thresholds, interpretability requirements)," paragraph [0023]); generating, from the input image and the input query, a plurality of image tiles, wherein each image tile is a sub-image of the input image ("the system may crop or re-size images to ensure compatible image dimensions, adjust color gamuts and other visual settings for consistency, segment or threshold pixels or voxels according to any of their properties ( e.g., brightness, color channels, similar groupings, etc.), apply noise reduction or contrast enhancement, or extract image tiles or patches," paragraph [0028]); generating, using a vision and language model, a response to the input query based on the input query and the plurality of image tiles ("instructions for exactly what information the vision-language model should output as well as instructions for how it should handle unclear, ambiguous or marginal cases (e.g., inconsistent or low confidence levels)," paragraph [0044]), comprising: generating a natural language response to the input query based on the input query and the plurality of image tiles ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt (see block 114)," paragraph [0061]); and associating an element of the natural language response with a respective one or more portions of the input image, the respective one or more portions of the input image corresponding to portions of the input image relevant to the element of the natural language response ("block 112 may include 
associating each Prompt Set entry with contextual metadata from the image or task definition object, such as imaging modality, acquisition parameters, or anatomical region," paragraph [0053]); causing the natural language response to the input query to be rendered at the client device ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt (see block 114)," paragraph [0061]); receiving an indication that the element of the natural language response has been selected at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034]); and causing an indication of the respective one or more portions of the input image to be rendered at the client device ("The outputs presented to the user may include the candidate image, the predicted diagnostic label(s), and the unstructured textual explanation generated by the model in response to the Prompt Set and System Prompt (see block 114)," paragraph [0061]).

Claim 20

Regarding claim 20, Hall et al. teach the method of claim 19, wherein the respective one or more portions of the input image comprise one or more image tiles ("The cloud resource may use computer vision techniques like object detection, edge detection, etc. to crop images of individual apples from the iPhone images, and process them via specific instructions to the VLM, then return predicted diagnoses with corresponding cropped individual apple images for the user to confirm," paragraph [0121]).

Claim 21

Regarding claim 21, Hall et al.
teach the method of claim 20, wherein the indication of the respective one or more portions of the input image comprises one or more bounding boxes, each bounding box corresponding to a respective image tile ("The VLM may then generate structured outputs according to the System Prompt, including bounding boxes, classification labels, and explanatory captions," paragraph [0125]).

Claim 22

Regarding claim 22, Hall et al. teach the method of claim 19, wherein the method further comprises: receiving an indication that one or more of the respective one or more portions of the input image has been selected at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034]); and causing an indication of the element of the natural language response to be rendered at the client device ("In some embodiments, a user may be prompted to participate or aid in the selection of the Active Set. For example, a system operating with process 100 may present a gallery or list of candidate images and allow the user to select images for inclusion based on visual inspection, metadata filters, or textual descriptions," paragraph [0034] where prompting a user to select teaches an indication of the element in the response).

Claim Rejections - 35 USC § 103

Claim 18 is rejected under 35 U.S.C. 103 as obvious over US Patent Publication 2025/0371863 A1 (Hall et al.) in view of US Patent Publication 2024/0354336 A1 (Gopalkrishna et al.).

Claim 18

Regarding claim 18, Hall et al. teach the method of claim 1, as noted above. Hall et al. are not relied on to teach all of a second resolution lower than the first resolution.

[Figure: Gopalkrishna et al. Fig. 3, showing mixed text and image queries that change resolutions.]

However, Gopalkrishna et al. teach wherein receiving the input query associated with a client device comprises: receiving an initial input image that is at a first resolution ("Block 204 describes the reception of the input query. This is the juncture where the system interfaces with users, accepting image submissions for semantic analysis," paragraph [0042]); and generating the input image from the initial input image, wherein the input image is at a second resolution that is lower than the first resolution ("The preprocessing of these images involves normalization, resolution adjustment, and possibly feature enhancement to optimize them for encoding," paragraph [0042] where resolution adjustment includes lowering resolution). Therefore, taking the teachings of Hall et al. and Gopalkrishna et al. as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify “Active Prompt Tuning of Vision-Language Models for Human Confirmable Diagnostics from Images” as taught by Hall et al. to use “Multimodal Semantic Analysis and Image Retrieval” as taught by Gopalkrishna et al. The suggestion/motivation for doing so would have been that, “Conventional systems and methods also require extensive, accurately labeled datasets to train models for effective search and retrieval. This necessity presents significant challenges, including the need for substantial manual labor to create and maintain these datasets and the difficulty in covering the breadth of semantic nuances across different domains,” as noted by the Gopalkrishna et al.
disclosure in paragraph [0004], which also motivates the combination because it would predictably yield higher productivity, as there is a reasonable expectation that including image data in queries and responses would better respond to queries; and/or because doing so merely combines prior art elements according to known methods to yield predictable results.

References Cited

The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. US Patent Publication 2024/0160853 A1 to Li et al. discloses a multimodal vision-language model that contains a Generalist Multimodal Transformer capable of completing multiple tasks using the same set of parameters learned from pre-training. The Generalist Multimodal Transformer allows alignment between frozen, unimodal encoders, such as image encoders and large language models. The Generalist Multimodal Transformer eliminates the need for fine-tuning the image encoders and large language models. Non-patent publication “A tale of two interfaces: vitrivr at the lifelog search challenge” to Heller et al. discloses two systems for lifelog retrieval: vitrivr and vitrivr-VR, which share a common retrieval model and backend for multi-modal multimedia retrieval.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HEATH E WELLS whose telephone number is (703)756-4696. The examiner can normally be reached Monday-Friday 8:00-4:00. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ms. Jennifer Mehmood, can be reached on 571-272-7882.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /Heath E. Wells/Examiner, Art Unit 2664 Date: 2 February 2026
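The tile-and-downsample pipeline at issue in the rejected claims can be pictured concretely. The sketch below is a minimal, hypothetical Python illustration, not code from the application or from either cited reference: it reduces an input image to a lower second resolution, splits it into sub-image tiles, and records each tile's bounding box so that elements of a model's response could be mapped back to regions of the image. All function and variable names are illustrative.

```python
# Hypothetical sketch (not from the application or the cited references)
# of the claimed tile-based pipeline: reduce an input image to a lower
# second resolution, split it into sub-image tiles, and record each
# tile's bounding box so elements of a model's response can be mapped
# back to regions of the image. All names are illustrative.

def downsample(image, factor):
    """Naive resolution reduction: keep every `factor`-th pixel."""
    return [row[::factor] for row in image[::factor]]

def make_tiles(image, tile_h, tile_w):
    """Split an image (a list of pixel rows) into tiles.

    Returns (bbox, tile) pairs, where bbox = (top, left, bottom, right)
    in the coordinates of the tiled image.
    """
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            bottom = min(top + tile_h, height)
            right = min(left + tile_w, width)
            tile = [row[left:right] for row in image[top:bottom]]
            tiles.append(((top, left, bottom, right), tile))
    return tiles

# Example: an 8x8 "image" of (row, col) pixels, downsampled 2x,
# then cut into four 2x2 tiles with known bounding boxes.
image = [[(r, c) for c in range(8)] for r in range(8)]
small = downsample(image, 2)
tiles = make_tiles(small, 2, 2)
```

A bounding box recorded per tile this way is the kind of structure the claim 21 rejection maps onto Hall's "bounding boxes, classification labels, and explanatory captions": each element of the response can cite the (top, left, bottom, right) region it relied on.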

Prosecution Timeline

Feb 12, 2024
Application Filed
Feb 03, 2026
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602755
DEEP LEARNING-BASED HIGH RESOLUTION IMAGE INPAINTING
2y 5m to grant Granted Apr 14, 2026
Patent 12597226
METHOD AND SYSTEM FOR AUTOMATED PLANT IMAGE LABELING
2y 5m to grant Granted Apr 07, 2026
Patent 12591979
IMAGE GENERATION METHOD AND DEVICE
2y 5m to grant Granted Mar 31, 2026
Patent 12588876
TARGET AREA DETERMINATION METHOD AND MEDICAL IMAGING SYSTEM
2y 5m to grant Granted Mar 31, 2026
Patent 12586363
GENERATION OF PLURAL IMAGES HAVING M-BIT DEPTH PER PIXEL BY CLIPPING M-BIT SEGMENTS FROM MUTUALLY DIFFERENT POSITIONS IN IMAGE HAVING N-BIT DEPTH PER PIXEL
2y 5m to grant Granted Mar 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
75%
Grant Probability
93%
With Interview (+18.1%)
3y 5m
Median Time to Grant
Low
PTA Risk
Based on 77 resolved cases by this examiner. Grant probability derived from career allow rate.
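The headline figures above are consistent with simple frequency arithmetic: 58 grants out of 77 resolved cases is about 75%, and adding the 18.1% interview lift gives about 93%. A hypothetical sketch of that arithmetic follows; the dashboard's actual methodology is not disclosed.

```python
# Hypothetical reconstruction of the dashboard's headline projections
# from the career counts shown above. Assumes the grant probability is
# the raw career allow rate and that the interview figure simply adds
# the observed lift; the page's actual methodology is not disclosed.

def grant_probability(granted, resolved):
    """Career allow rate as a percentage."""
    return 100.0 * granted / resolved

def with_interview(base_pct, lift_pct):
    """Base grant probability plus the interview lift, capped at 100%."""
    return min(base_pct + lift_pct, 100.0)

base = grant_probability(58, 77)      # about 75%
boosted = with_interview(base, 18.1)  # about 93%
```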
