Last updated: May 29, 2026

Application No. 18/757,404

REGION OF INTEREST PROMPT PROCESSING FOR LARGE MULTIMODAL MODELS

Non-Final OA: §102, §103, §112

Filed: Jun 27, 2024
Examiner: CAUDLE, PENNY LOUISE
Art Unit: 2657
Tech Center: 2600 (Communications)
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)

Interview Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 68% grant rate with a +14.9% interview lift. Because an interview has already been tried, the recommendation is a written response with narrowed claims based on precedent claim evolution patterns.

Based on 73 resolved cases, 2023–2026

Examiner Intelligence

CAUDLE, PENNY LOUISE

Grants 68% — above average

Career Allowance Rate

50 granted / 73 resolved

+6.5% vs TC avg

Interview Lift: +14.9% (moderate), comparing resolved cases with and without an interview

Typical timeline

2y 11m

Avg Prosecution

13 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§101: 11.8% (-28.2% vs TC avg)
§103: 78.6% (+38.6% vs TC avg)
§102: 3.7% (-36.3% vs TC avg)
§112: 5.9% (-34.1% vs TC avg)

Tech Center average is an estimate. Based on career data from 73 resolved cases.

Office Action

§102 §103 §112

DETAILED ACTION
This examination is in response to the communication filed on 06/27/2024. Claims 1-20 are currently pending, where claims 1, 8 and 15 are independent.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 11/18/2025 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claims 9-14 are rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which they depend, or for failing to include all the limitations of the claim upon which they depend.
Claim 9, from which claims 10-14 depend, recites the “method of claim 1”; however, claim 1 is directed to a system. Thus, claims 9-14 improperly recite two statutory classes: system and method. For purposes of examination, claims 9-14 are interpreted as further limiting the “method” of claim 8.
 Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.
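The dependency problem behind this rejection can be checked mechanically. The sketch below is illustrative only: the `Claim` record and the `mismatched_dependents` helper are hypothetical names, and the claim table encodes just the relationships stated in the action (claim 1 is a system, claim 8 is a method, claim 9 recites the method of claim 1, and claim 10 depends from claim 9).

```python
from typing import Dict, List, NamedTuple, Optional


class Claim(NamedTuple):
    category: str           # statutory class named in the claim preamble
    parent: Optional[int]   # claim referred to, or None if independent


def mismatched_dependents(claims: Dict[int, Claim]) -> List[int]:
    """Return dependent claims whose class differs from the claim they refer to."""
    flagged = []
    for number, claim in sorted(claims.items()):
        if claim.parent is None:
            continue
        if claims[claim.parent].category != claim.category:
            flagged.append(number)
    return flagged


# Only the relationships described in the office action are encoded here.
as_filed = {
    1: Claim("system", None),
    8: Claim("method", None),
    9: Claim("method", 1),    # "method of claim 1"
    10: Claim("method", 9),
}

# The examiner's working interpretation: claim 9 depends from claim 8.
as_interpreted = dict(as_filed)
as_interpreted[9] = as_filed[9]._replace(parent=8)
```

Under these assumptions, only claim 9 is flagged directly; claims 10-14 inherit the defect through their dependency on claim 9, and redirecting claim 9 to claim 8 clears the mismatch.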
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 2, 7-9 and 14-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kharbanda et al. (US 11,978,271 B1; herein “Kharbanda”).
Regarding claims 1, 8 and 15, Kharbanda teaches a system (Fig. 9A), comprising: a processor (Fig. 9A, processors 112 and 132); and a memory including instructions executable by the processor (FIG. 9A memory 114 and 134 and instructions 118 and 134), a method, and a computer-readable medium (Col. 2, lines 42-46 teaches “The system can includes…one or more non-transitory computer-readable media…”) storing instructions that are operative upon execution by a processor to: 
receive a multimodal prompt including a media file and information related to a region of interest (ROI) of the media file (Under a broadest reasonable interpretation, “information related to a region of interest” is interpreted as being any information utilized to determine/select objects within the media file; Fig. 2, image data 212 and text data 232; Fig. 4A, elements 402 and 404; Fig. 5, elements 502 and 504; Col. 9, lines 57-64 teaches “…the generative model leveraged search system 200 can obtain input data, which can include image data 212 descriptive of one or more input image …and text data 232 descriptive of a request for particular information…” The one or more input images can be descriptive of an environment and one or more input images” and col. 11, lines 21-26 teaches “In some implementations, the particular object selected for processing may be based on the text data 232 (e.g., ‘what are recipes for this item?’ causes food items to be processed, while ‘what is that on the right?’ causes objects on the right of the input images to be processed).” Thus, the input text data 232 is utilized to determine which objects in the input image are of interest); 
determine the ROI of the media file based on the information related to the ROI of the media file (Fig. 2, object recognition 214; col. 11, lines 15-26 teaches “…Alternatively and/or additionally, the object recognition block 214 can determine a focal object and/or an object of interest based on object location, object size, image semantics, image focus, occurrence in a sequence of input images, and/or other contextual attributes. In some implementations, the particular object selected for processing may be based on the text data 232 (e.g., ‘what are recipes for this item?’ causes food items to be processed, while ‘what is that on the right?’ causes objects on the right of the input images to be processed).” The focal object and/or object of interest is interpreted as a ROI), wherein the ROI of the media file is smaller than a global version of the media file (col. 11, lines 21-26 teaches “In some implementations, the particular object selected for processing may be based on the text data 232 (e.g., ‘what are recipes for this item?’ causes food items to be processed, while ‘what is that on the right?’ causes objects on the right of the input images to be processed).” Because the object of interest or focal object is selected from a portion of the input image, it is inherently smaller than a global version of the media file. In addition, col. 10, lines 42-48 teaches “The object recognition model [of the object recognition block 214] may be trained and/or configured to process an image, detect an object, segment a portion of the image that includes the object…” Thus, the detected object is a sub-part of the image, making it smaller than the global image); 
generate a plurality of media tiles of interest (MTIs) associated with the ROI of the media file (Under a broadest reasonable interpretation MTI is interpreted as segments or portions of the image corresponding to the object(s) of interest; Col. 10, lines 48-53 teaches “The object recognition model may include a detection model that processes the input images to generate bounding boxes indicating a position of the detected objects. A segmentation model of the object recognition model may then segment the detected objects based on the bounding boxes to generate image segments for the detected objects” ); 
encode the MTIs together with a natural-language input received with the multimodal prompt to generate a modified prompt (Fig. 2, Augmented language output 222; col. 1, lines 4-7 teaches “The object recognition output 216, the language output 220, and/or the text data 232 can then be processed with an augmentation model 224 to generate the augmented language output 222”); 
send the modified prompt to a large multimodal model (LMM) to process the modified prompt (Fig. 2, input to generative model 228; col. 12, lines 33-40 teaches “…the augmented language output 222 and/or the text data 232 can be processed with a generative model 228 (e.g., a large language model…) to generate one or more model-generated responses 230. The augmented language output 222… may be formatted as a prompt…”); and 
receive a response to the modified prompt from the LMM (Fig. 2, model-generated response 230 ).  
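To make the mapped claim elements concrete, the following is a minimal sketch of the claimed steps as recited above. It is neither the applicant's implementation nor Kharbanda's system. It assumes the ROI information arrives as a pixel bounding box; the names `determine_roi`, `generate_tiles`, `build_modified_prompt`, and the tile size are hypothetical. The call to the LMM is left out because no model API is named in the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class ModifiedPrompt:
    text: str          # natural-language input received with the prompt
    tiles: List[Box]   # media tiles of interest (MTIs) covering the ROI


def determine_roi(image_size: Tuple[int, int], roi_info: Box) -> Box:
    """Clamp the requested box to the image bounds to obtain the ROI."""
    width, height = image_size
    left, top, right, bottom = roi_info
    left, top = max(0, left), max(0, top)
    right, bottom = min(width, right), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("ROI does not overlap the image")
    return (left, top, right, bottom)


def generate_tiles(roi: Box, tile: int) -> List[Box]:
    """Split the ROI into tiles of at most tile x tile pixels."""
    left, top, right, bottom = roi
    tiles = []
    for y in range(top, bottom, tile):
        for x in range(left, right, tile):
            tiles.append((x, y, min(x + tile, right), min(y + tile, bottom)))
    return tiles


def build_modified_prompt(text: str, image_size: Tuple[int, int],
                          roi_info: Box, tile: int = 256) -> ModifiedPrompt:
    """Pair the natural-language input with the MTIs of the ROI."""
    roi = determine_roi(image_size, roi_info)
    return ModifiedPrompt(text=text, tiles=generate_tiles(roi, tile))


prompt = build_modified_prompt(
    "What is that on the right?", (1000, 800), (600, 100, 1100, 500), tile=200
)
```

In this sketch, the ROI is clamped to a region strictly inside the 1000 x 800 image, which corresponds to the "smaller than a global version" limitation. The dispute in the rejection is whether Kharbanda's segmented object images amount to such tiles.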
Regarding claims 2, 9 and 16, Kharbanda teaches all of the elements of claims 1, 8, and 15 (see detailed element mapping above). In addition, Kharbanda further teaches the information related to the ROI comprises one of: defined ROI parameters (the “comprises one of” language renders this element optional); and instructions for automatically determining the ROI of the media file using one or more ROI policies (Under a broadest reasonable interpretation, “instructions for automatically determining” is interpreted as “instructions associated with the media file”; this is supported by ¶[0019] of the specification; Kharbanda Fig. 4A, element 402 “How Much Water Does This Need?”, Fig. 4B element 432 “Write A Haiku About This Place”, and Fig. 5 element 502 “Create a listing to sell this”).
Regarding claims 7 and 14, Kharbanda teaches all of the elements of claims 1 and 9 (see detailed element mapping above). In addition, Kharbanda further teaches the media file is an image file (Fig. 4A, element 404, Fig. 4B element 434, and Fig. 5 element 504); and 
the memory further comprises instructions executable by the processor to present the response via a user interface, the response being a natural-language description of the image depicted in the image file (Fig. 9A, user interface 124 and 144; Fig. 5, model-generated response 506; and col. 17, lines 14-22 teaches “generative model response system 500 can obtain an input image 504 and input text 502 to generate a model-generated response 506. For example, the input text 502 can include "create a listing to sell this," the input image 504 can depict a white chair in a room, and the model-generated response 506 can include a model-generated listing.”).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or non-obviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 3, 4, 10, 11, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kharbanda as applied to claims 1, 8 and 15 above, and further in view of M. Cai et al., "ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts," 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2024, pp. 12914-12923. (herein “Cai”).
Regarding claims 3, 10 and 17, Kharbanda teaches all of the elements of claims 2, 9, and 16 (see detailed element mapping above). In addition, Kharbanda further teaches the defined ROI parameters include one of: mask information defining the ROI of the media file (the “one of” language renders this element optional); and coordinate information defining the ROI of the media file (col. 11, lines 17-21 teaches “The object recognition block 214 can determine a focal object and/or an object of interest based on object location…”; object location is interpreted as coordinate information).
However, Kharbanda fails to disclose that the defined ROI parameters are included in the information provided in the received multimodal prompt as defined in claims 2, 9, and 16, from which claims 3, 10, and 17 depend, respectively.
Cai teaches providing spatial references in multimodal prompts, such as using textual representations of coordinates, learned positional embeddings, or ROI features. (See page 12914, second column).
Kharbanda differs from the claimed invention, as defined in claims 3, 10 and 17, in that Kharbanda fails to disclose providing ROI parameters as part of the multimodal prompt. Multimodal prompts which include spatial references such as ROI features are known in the art as evidenced by Cai. Therefore, it would have been obvious to one having ordinary skill in the art, before the effective filing date of the invention, to have modified the system taught by Kharbanda to include the ROI parameters as part of the multimodal prompt as taught by Cai in order to allow the LLM to process region-specific information in complex scenes. (Cai, page 12914, section 1, second paragraph).
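One of the spatial-reference options attributed to Cai, a textual representation of coordinates, can be illustrated as follows. This is a hedged sketch: the bracketed, normalized format and the names `box_to_text` and `prompt_with_region` are assumptions made for illustration, not the format used by Cai or by the application.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def box_to_text(box: Box, image_size: Tuple[int, int]) -> str:
    """Render a pixel box as normalized coordinates in plain text."""
    left, top, right, bottom = box
    width, height = image_size
    normalized = (left / width, top / height, right / width, bottom / height)
    return "[" + ", ".join(f"{value:.3f}" for value in normalized) + "]"


def prompt_with_region(question: str, box: Box,
                       image_size: Tuple[int, int]) -> str:
    """Append the textual coordinates to the natural-language question."""
    return f"{question} Region of interest: {box_to_text(box, image_size)}"


example = prompt_with_region(
    "What is in this region?", (600, 100, 1000, 500), (1000, 800)
)
```

Here the coordinates travel inside the prompt itself, which is the feature the examiner relies on Cai to supply.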
Regarding claims 4, 11 and 18, Kharbanda teaches all of the elements of claims 2, 9, and 16 (see detailed element mapping above). In addition, Kharbanda further teaches instructions executable by the processor to: 
apply the defined ROI parameters to a global tile associated with the media file to determine the ROI of the media file (col. 11, lines 17-21 teaches “The object recognition block 214 can determine a focal object and/or an object of interest based on object location, object size, image semantics…” Col. 10, lines 48-53 teaches “The object recognition model may include a detection model that processes the input images to generate bounding boxes indicating a position of the detected objects. A segmentation model of the object recognition model may then segment the detected objects based on the bounding boxes to generate image segments for the detected objects”); and 
generate the plurality of MTIs based on the determined ROI (Under a broadest reasonable interpretation MTI is interpreted as segments or portions of the image corresponding to the object(s) of interest; Col. 10, lines 48-53 teaches “The object recognition model may include a detection model that processes the input images to generate bounding boxes indicating a position of the detected objects. A segmentation model of the object recognition model may then segment the detected objects based on the bounding boxes to generate image segments for the detected objects”).
However, Kharbanda fails to disclose that the defined ROI parameters are included in the information provided in the received multimodal prompt as defined in claims 2, 9, and 16, from which claims 4, 11, and 18 depend, respectively.
Cai teaches providing spatial references in multimodal prompts, such as using textual representations of coordinates, learned positional embeddings, or ROI features. (See page 12914, second column).
Kharbanda differs from the claimed invention, as defined in claims 4, 11 and 18, in that Kharbanda fails to disclose providing ROI parameters as part of the multimodal prompt. Multimodal prompts which include spatial references such as ROI features are known in the art as evidenced by Cai. Therefore, it would have been obvious to one having ordinary skill in the art, before the effective filing date of the invention, to have modified the system taught by Kharbanda to include the ROI parameters as part of the multimodal prompt as taught by Cai in order to allow the LLM to process region-specific information in complex scenes. (Cai, page 12914, section 1, second paragraph).
Claims 5, 6, 12, 13, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kharbanda as applied to claims 1, 8 and 15 above, and further in view of TAL et al. (US 2010/0322521 A1; herein “TAL”).
Regarding claims 5, 12 and 19, Kharbanda teaches all of the elements of claims 2, 9, and 16 (see detailed element mapping above). In addition, Kharbanda further teaches instructions executable by the processor to:
apply the plurality of rules to a global tile associated with the media file to determine the ROI of the media file (Under a broadest reasonable interpretation, “apply the plurality of rules” is interpreted as applying algorithms to determine objects of interest; Kharbanda, col. 9, lines 3-7 teaches “…the object recognition block 214 can determine a focal object and/or an object of interest based on object location, object size, image semantics, image focus, occurrence in a sequence of input images, and/or other contextual attributes” Thus, Kharbanda teaches applying a plurality of algorithms/rules to determine an object of interest); and 
generate the plurality of MTIs based on the determined ROI (Under a broadest reasonable interpretation MTI is interpreted as segments or portions of the image corresponding to the object(s) of interest; Col. 10, lines 48-53 teaches “The object recognition model may include a detection model that processes the input images to generate bounding boxes indicating a position of the detected objects. A segmentation model of the object recognition model may then segment the detected objects based on the bounding boxes to generate image segments for the detected objects”)
Although Kharbanda teaches the instructions are executable by the processor to: utilize a plurality of algorithms to detect objects, Kharbanda fails to explicitly disclose instructions executable by the processor to: access a view composer policy storing a plurality of rules wherein at least some of the plurality of rules instruct the view composer to exclude low-value regions of the media file in the ROI.
TAL teaches a region of interest (ROI) extraction method which incorporates segmentation methods with context-aware saliency and follows four basic principles: (1) Local low-level considerations, including factors such as contrast and color, (2) Global considerations, which suppress frequently occurring features while maintaining features that deviate from the norm, (3) Visual organization rules, and (4) High-level factors (See ¶¶[0129]-[0135]). In addition, TAL teaches in ¶[0138] that “homogeneous or blurred areas should obtain low saliency values” and ¶[0146] teaches “we consider multiple scales, so that the saliency of background pixels is further decreased, improving the contrast between salient and non-salient regions” and ¶[0104] teaches “the non-salient background is eliminated in the images…” Accordingly, TAL teaches excluding low-value, i.e., non-salient, regions such as the background. Thus, TAL teaches a plurality of rules wherein at least some of the plurality of rules instruct the view composer to exclude low-value regions of the media file in the ROI.
Kharbanda differs from the invention, as defined in claims 5, 12 and 19, in that Kharbanda fails to explicitly disclose that the plurality of algorithms/rules for detecting objects within an image includes rules/policies which exclude low-value regions. Policies excluding low-value, e.g., non-salient, regions, such as backgrounds, are known in the art as evidenced by TAL. Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the invention to have modified the ROI extraction/segmenting method of Kharbanda to include rules/policies which exclude low-value regions as taught by TAL in order to keep some of the background in the region of interest when the background is required for conveying the context (TAL, ¶[0097]).
Regarding claims 6, 13 and 20, the combination of Kharbanda and TAL teaches all of the elements of claims 5, 12 and 19 (see detailed element mapping above). In addition, Kharbanda further teaches the media file is an image file (Fig. 4A, element 404, Fig. 4B element 434, and Fig. 5 element 504).
In addition, TAL further teaches that at least some of the low-value regions are defined as a region of the global media tile containing little-to-no contrast in color or texture (¶[0090] teaches “Many approaches have been proposed for detecting regions with maximum local saliency of low-level factors. These factors usually consist of intensity, color, orientation, texture, size, and shape” and ¶[0124] teaches “…Local low-level considerations including factors such as contrast, color, orientation, etc. …”).
Kharbanda differs from the invention, as defined in claims 6, 13 and 20, in that Kharbanda fails to explicitly disclose that at least some of the low-value regions are defined as regions containing little to no contrast in color or texture. Excluding or defining background regions based on a lack of contrast in color or texture is known in the art as evidenced by TAL. Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the invention to have modified the ROI extraction/segmenting method of Kharbanda to include rules/policies which exclude low-value regions as taught by TAL in order to keep some of the background in the region of interest when the background is required for conveying the context (TAL, ¶[0097]).
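The "little-to-no contrast" limitation can be pictured with a simple tile filter. The sketch below is a stand-in, not TAL's context-aware saliency method and not the applicant's view composer policy: it assumes a grayscale image stored as a list of rows, uses per-tile standard deviation as a rough proxy for contrast, and the name `keep_high_contrast_tiles` and the threshold are hypothetical.

```python
from statistics import pstdev
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), exclusive ends


def keep_high_contrast_tiles(gray: List[List[int]], tile: int,
                             min_std: float) -> List[Box]:
    """Return tiles whose pixel standard deviation is at least min_std."""
    height, width = len(gray), len(gray[0])
    kept = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            pixels = [gray[y][x]
                      for y in range(top, bottom)
                      for x in range(left, right)]
            # Flat tiles (for example, a blank wall) have a deviation near zero.
            if pstdev(pixels) >= min_std:
                kept.append((left, top, right, bottom))
    return kept


# A 4 x 4 image that is flat except for a textured top-right corner.
image = [
    [100, 100, 0, 255],
    [100, 100, 255, 0],
    [100, 100, 100, 100],
    [100, 100, 100, 100],
]
kept_tiles = keep_high_contrast_tiles(image, tile=2, min_std=10.0)
```

With these inputs only the textured corner tile survives; the three flat tiles are excluded as low-value regions.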
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PENNY L CAUDLE whose telephone number is (703)756-1432. The examiner can normally be reached M-Th 8:00 am to 5:00 pm eastern.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PENNY L CAUDLE/Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657

Prosecution Timeline

Jun 27, 2024

Application Filed

Mar 31, 2026

Non-Final Rejection mailed — §102, §103, §112

May 14, 2026

Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

17/852,548

Patent 12626688

METHOD AND SYSTEM FOR SPEECH DETECTION AND SPEECH ENHANCEMENT

3y 10m to grant Granted May 12, 2026

18/461,333

Patent 12620392

INITIATING AN ACTION BASED ON A VOICE MESSAGE TEXT GENERATED FROM A VOICE MESSAGE

2y 8m to grant Granted May 05, 2026

18/178,376

Patent 12609123

AUDIO PROCESSING METHOD AND APPARATUS

3y 1m to grant Granted Apr 21, 2026

18/302,683

Patent 12592243

METHOD AND ELECTRONIC DEVICE FOR PERSONALIZED AUDIO ENHANCEMENT

2y 11m to grant Granted Mar 31, 2026

18/038,631

Patent 12573371

VOCABULARY SELECTION FOR TEXT PROCESSING TASKS USING POWER INDICES

2y 9m to grant Granted Mar 10, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 68%
With Interview (+14.9%): 83%
Median Time to Grant: 2y 11m (~1y 0m remaining)
PTA Risk: Low

Based on 73 resolved cases by this examiner. Grant probability derived from career allowance rate.
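The projection figures can be reproduced from numbers stated on this page. The sketch below assumes the interview lift is added as percentage points to the career allowance rate, and that the remaining time is counted in whole months from the "Last updated" date; the page does not describe its method, and the function names are hypothetical.

```python
from typing import Tuple


def allowance_rate(granted: int, resolved: int) -> float:
    """Career allowance rate as a percentage."""
    return 100.0 * granted / resolved


def months_remaining(filed: Tuple[int, int], median_months: int,
                     today: Tuple[int, int]) -> int:
    """Whole months from today until filing date plus the median pendency."""
    filed_year, filed_month = filed
    today_year, today_month = today
    target = filed_year * 12 + filed_month + median_months
    return target - (today_year * 12 + today_month)


rate = allowance_rate(50, 73)             # 50 granted / 73 resolved
with_interview = rate + 14.9              # assumed additive lift
remaining = months_remaining((2024, 6), 2 * 12 + 11, (2026, 5))
```

Rounded to whole percentages, these values match the 68% and 83% shown above, and the remaining time comes to 12 months, in line with the "~1y 0m remaining" estimate.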