Prosecution Insights
Last updated: April 19, 2026
Application No. 18/907,609

Automatic Generation of In-Store Product Information and Navigation Guidance, Using Augmented Reality (AR) and a Vision-and-Language Model (VLM) and Multi-Modal Artificial Intelligence (AI)

Non-Final OA §101, §103
Filed
Oct 07, 2024
Examiner
REFAI, SAM M
Art Unit
3621
Tech Center
3600 — Transportation & Electronic Commerce
Assignee
We R Augmented Reality Cloud Ltd.
OA Round
1 (Non-Final)
34%
Grant Probability
At Risk
1-2
OA Rounds
3y 2m
To Grant
42%
With Interview

Examiner Intelligence

Grants only 34% of cases
34%
Career Allow Rate
146 granted / 427 resolved
-17.8% vs TC avg
Moderate +7% lift
+7.4%
Interview Lift
resolved cases with interview
Typical timeline
3y 2m
Avg Prosecution
34 currently pending
Career history
461
Total Applications
across all art units

Statute-Specific Performance

§101: 38.3% (-1.7% vs TC avg)
§103: 25.8% (-14.2% vs TC avg)
§102: 9.9% (-30.1% vs TC avg)
§112: 19.2% (-20.8% vs TC avg)
Deltas shown vs Tech Center average estimate • Based on career data from 427 resolved cases
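The deltas read most naturally as percentage points against a single Tech Center baseline. The short sketch below recovers the implied baseline from each displayed pair; the formula used is an assumption about the methodology, which is not stated on this page.

```python
# Sketch: recover the implied Tech Center average from each displayed
# examiner rate and its delta (read as percentage points). The formula
# TC avg = examiner rate - delta is an assumption, not a documented method.
statute_rates = {
    "§101": (38.3, -1.7),
    "§103": (25.8, -14.2),
    "§102": (9.9, -30.1),
    "§112": (19.2, -20.8),
}

for statute, (examiner_rate, delta) in statute_rates.items():
    implied_tc_avg = examiner_rate - delta
    print(f"{statute}: implied TC average ≈ {implied_tc_avg:.1f}%")

# Every statute implies a baseline of about 40.0%, consistent with a single
# Tech Center-wide estimate rather than per-statute baselines.
```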

Office Action

§101 §103
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Election/Restrictions This Office Action is in response to the Reply to the Restriction Requirement filed 08/11/2025. Applicant elects Species A1 (claims 2 and 22-23), B2 (claim 9), and C2 (claims 14, 17-19, 21, and 24). Therefore, claims 6-8, 10-13, 16, and 20 are withdrawn from consideration as being directed towards the non-elected species. Claims 1-5, 9, 14-15, 17-19, and 21-27 are currently pending and examined below. Priority The Examiner notes that at least the independent claims contain subject matter that is not reasonably conveyed in the parent applications. Therefore, the instant application is given priority to the filing date of the instant application (i.e., 10/07/2024). Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1-5, 9, 14-15, 17-19, and 21-27 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more. Claims 1-5, 9, 14-15, 17-19, and 21-27 is/are directed towards a statutory category (i.e., a process, machine, manufacture, or composition of matter) (Step 1, Yes). Claim 1 recites (additional elements underlined): A method comprising: (a) providing to a Vision and Language Model (VLM) one or more images that are captured within a retailer venue by a camera of an electronic device selected from the group consisting of: (i) a smartphone, (ii) an Augmented Reality (AR) device, (iii) smart glasses or smart sunglasses that include at least a camera and a memory unit and a processor; (b) automatically feeding the one or more images to said VLM, and automatically commanding said VLM to generate an output that depends at least on analysis of content of said one or more images; (c) receiving the output generated by said VLM; and based on said output, providing to said user, via said electronic device, information about one or more products that are depicted in said one or more images. Under the broadest reasonable interpretation, the limitations outlined above that describe or set forth the abstract idea, cover performance of the limitations in the mind but for the recitation of generic computer(s) and/or generic computer component(s). That is, other than reciting the additional elements identified below, nothing in the claim precludes the limitations from practically being performed in the mind. These limitations are considered a mental process because the limitations include an observation, evaluation, judgement, and/or opinion. These limitations are also similar to “collecting information, analyzing it, and displaying certain results of the collection and analysis” and/or “collecting and comparing known information” which were determined to be mental processes in MPEP 2106.04(a)(2)(III)(A). The Examiner notes that “[c]laims can recite a mental process even if they are claimed as being performed on a computer” (see MPEP 2106.04(a)(2)(III)(C)). The mere nominal recitation of the additional elements identified below does not take the claims out of the mental process grouping. Therefore, the claim recites a mental process (Step 2A Prong One, Yes). 
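To make the recited workflow concrete before the eligibility analysis continues, here is a minimal sketch of steps (a)-(c) of claim 1, with the VLM modeled abstractly as a callable. All code names (ProductInfo, identify_products, the stub) are hypothetical and are not drawn from the application or the cited art.

```python
# Minimal sketch of the claim 1 workflow, steps (a)-(c): capture in-store images,
# feed them to a Vision and Language Model (VLM) with an instruction, and surface
# the resulting product information to the user. The VLM is modeled abstractly as
# a callable; ProductInfo and identify_products are hypothetical names.
from dataclasses import dataclass
from typing import Callable, List

# Stand-in for any hosted or on-device vision-and-language model:
# takes image bytes plus a text instruction, returns free-form text.
VLM = Callable[[List[bytes], str], str]

@dataclass
class ProductInfo:
    summary: str  # user-facing description of the depicted products

def identify_products(images: List[bytes], vlm: VLM) -> ProductInfo:
    # (b) automatically feed the captured images and command the VLM to analyze them
    prompt = ("Identify the retail products visible in these in-store images "
              "and describe each one briefly for a shopper.")
    output = vlm(images, prompt)
    # (c) receive the VLM output and package it as product information for display
    return ProductInfo(summary=output.strip())

if __name__ == "__main__":
    # Stub VLM used only to make the sketch runnable end to end.
    stub_vlm: VLM = lambda imgs, p: "Diet Coke 1.5L: carbonated soft drink on the beverage shelf."
    print(identify_products([b"<jpeg bytes from the device camera>"], stub_vlm).summary)
```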
The limitations outlined above also describe or set forth an advertising/marketing activity. Advertising/marketing falls within the certain methods of organizing human activity enumerated grouping of abstract ideas. The limitations outlined above also describe or set forth a fundamental economic principle or practice because advertising/marketing is related to commerce and economy, a commercial interaction (e.g., advertising, marketing or sales activities or behaviors, business relations), and managing personal behavior or relationships or interactions between people. Therefore, the claim recites a certain method of organizing human activity (Step 2A Prong One, Yes). In Step 2A Prong Two, these additional element(s) are recited at a high level of generality, and under the broadest reasonable interpretation, are generic computer(s) and/or generic computer component(s) that perform generic computer functions. The additional element(s) are merely used as tools, in their ordinary capacity, to perform the abstract idea. The additional element(s) amount to adding the words “apply it” with the judicial exception. Merely implementing an abstract idea on generic computer(s) and/or generic computer component(s) does not integrate the judicial exception into a practical application, similar to how the recitation of the computer in the claim in Alice amounted to mere instructions to apply the abstract idea of intermediated settlement on a generic computer. “[T]he use of generic computer elements like a microprocessor or user interface do not alone transform an otherwise abstract idea into patent eligible subject matter" (see pp 10-11 of FairWarning IP, LLC. v. Iatric Systems, Inc. (Fed. Cir. 2016)). The additional elements also amount to generally linking the use of the abstract idea to a particular technological environment or field of use. The type of information being manipulated does not impose meaningful limitations or render the idea less abstract. Further, the courts have found that simply limiting the use of the abstract idea to a particular environment does not integrate the judicial exception into a practical application. Viewing the limitations as an ordered combination does not add anything further than looking at the limitations individually. The additional elements amount to no more than mere instructions to apply the abstract idea using generic computer(s) and/or generic computer component(s). Their collective functions merely provide generic computer implementation. There is no indication that the combination of elements improves the functioning of a computer, improves any other technology or technical field, applies or uses the judicial exception to effect a particular treatment or prophylaxis for disease or medical condition, applies the judicial exception with, or by use of a particular machine, effects a transformation or reduction of a particular article to a different state or thing, or applies or uses the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claims as a whole are more than a drafting effort designed to monopolize the exception. (Step 2A Prong Two, No). In Step 2B, the additional elements also do not amount to significantly more for the same reasons set forth with respect to Step 2A Prong Two. The Examiner notes that revised Step 2A overlaps with Step 2B, and thus, many of the considerations need not be reevaluated in Step 2B because the answer will be the same. 
Viewing the limitations as an ordered combination does not add anything further than looking at the limitations individually. The additional elements amount no more than a mere instruction to apply the abstract idea using generic computer(s) and/or generic computer component(s) (Step 2B, No). Claims 2-5, 9, 14-15, 17-19, and 21-26 recite further limitations that also fall within the same abstract ideas identified above with respect to claim 1 (i.e., certain methods of organizing human activities and/or mental processes). Claim 2 recites the additional elements of “via said electronic device”, “automatically”, “feeding”, “into said VLM”, “to said VLM”, “from said VLM”, and “VLM generated”. Claim 3 recites the additional elements of “at said VLM”, “wherein said VLM is configured or automatically commanded to”, and “VLM-based”. Claim 4 recites the additional elements of “by said VLM” and “that said VLM”. Claim 5 recites the additional elements of “a Machine Learning (ML)”, “feeding”, “by the electronic device”, “ML-detected”, “into the VLM”, “automatically”, “the VLM to”, and “VLM-based”. Claim 9 recites the additional elements of “into the VLM”, “by said electronic device”, “feeding”, “commanding the VLM to”, and “that the VLM”. Claim 14 recites the additional elements of “via said electronic device”, “into the VLM”, “by said electronic device”, “feeding”, “VLM”, “fed into the VLM”, “by the VLM”, “by said VLM”, “of the electronic device”, and “that the VLM”. Claim 15 recites the additional elements of “into the VLM”, “by said electronic device”, “feeding”, “by said VLM”, ‘of the electronic device”, and “VLM”. Claim 17 recites the additional elements of “fed into said VLM”, “feeding”, ““the VLM”, “of the electronic device”, “into the VLM”, “by said VLM”, and “of the electronic device”. Claim 18 recites the additional elements of “wherein the VLM is configured to autonomously” and “wherein said VLM has”. Claim 19 recites the additional elements of “VLM-based”, “by the electronic device”, “automatically”, and “on a screen of said electronic device an Augmented Reality (AR) element depicted on-screen visual emphasis of”. Claim 21 recites the additional elements of “into the VLM”, “VLM”, “feeding”, ““by said VLM”, and “of the electronic device”. Claim 22 recites the additional elements of “commanding said VLM to operate as a real-time virtual personalized in-store shopping assistant”, “feeding”, “to said VLM”, “by automatically commanding the VLM to autonomously”, “VLM-generated”, “real-time” and “of the electronic device”. Claim 23 recites the additional elements of “continuously feeding into said VLM a real-time video stream that is captured by said electronic device”, “continuously”, “of said electronic device”, “to the VLM”, “in real time or near real time”, “by the VLM”, “VLM”, “feeding”, “continuously fed into the VLM”, “in said real-time video stream that is continuously captured by said electronic device and that is continuously fed into the VLM”, “via said electronic device”, “VLM-generated”, “by the electronic device”, “on-screen”, “on a screen of the electronic device”, “an Augmented Reality (AR) layer or a Mixed Reality layer that is presented to said user via said electronic device”. Claim 24 recites the additional elements of “VLM-generated”, “into the VLM”, “feeding”, “the VLM”, “by the VLM”, and “VLM-based”. 
Claim 25 recites the additional elements of “commanding said VLM to operate as a real-time virtual personalized in-store shopping assistant”, “feeding”, “to said VLM”, “automatically commanding the VLM to autonomously”, “VLM-generated”, and “of the electronic device”. Claim 26 recites the additional elements of “spatially moving and spatially re-orienting said electronic device with Six Degrees of Freedom (6DoF) within said retailer venue; and capturing by said electronic device images or video during 6DoF spatial movement and re-orientation”, “feeding into the VLM”, “during said 6DoF spatial movement and re-orientation of the electronic device”, “VLM-based”, and “during said 6DoF spatial movement and re-orientation of the electronic device”, “of said electronic device VLM-generated”, “that the VLM”, “during said 6DoF spatial movement and re-orientation of the electronic device”, “the VLM generated”, “VLM-based”, “VLM-generated”, and “VLM”. However, these additional elements also do not integrate the judicial exception into a practical application or amount to significantly more because they amount to adding the words “apply it” with the judicial exception, mere instructions to implement the idea on a computer, merely using a computer as a tool to perform an abstract idea, and generally linking the use of the judicial exception to a particular technological environment or field of use. Claim 27 recites substantially similar limitations as claim 1. Therefore, for the same reasons explained above with respect to claim 21, claim 27 also recites an abstract idea in Step 2A Prong One (i.e., certain method of organizing human activities, and mental processes). Claim 27 recites the additional elements of “A system comprising: one or more hardware processors, that are configured to execute code; wherein the one or more hardware processors are operably associated with one or more memory units that are configured to store code”, “wherein the one or more hardware processors are configured to”, “to a Vision and Language Model (VLM)”, “by a camera of an electronic device selected from the group consisting of: (i) a smartphone, (ii) an Augmented Reality (AR) device, (iii) smart glasses or smart sunglasses that include at least a camera and a memory unit and a processor”, “automatically feeding”, “to said VLM”, “automatically commanding said VLM to”, “by said VLM”, and “via said electronic device”. However, for the same reasons explained above with respect to claim 1, these additional elements also do not integrate the judicial exception into a practical application or amount to significantly more. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 1-5, 9, 14-15, 17-19, and 21-27 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kharbanda et al. (US 2025/0140006 A1, hereinafter “Kharbanda”) in view of Chachek et al. (US 2020/0302510 A1, hereinafter “Chachek”). As per Claim 1, Kharbanda discloses A method comprising (¶ 28 “Generally, the present disclosure is directed to systems and methods for detailed instance-level scene recognition. In particular, the systems and methods disclosed herein can leverage an object recognition system and a vision language model to generate detailed captions, queries, and/or prompts associated with input images.” ¶ 10 “Another example aspect of the present disclosure is directed to a computing system for image captioning. The system can include one or more processors and one or more non-transitory computer-readable media that collectively store instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations can include obtaining image data. The image data can include an input image.” Also see citations below.): (a) providing to a Vision and Language Model (VLM) one or more images that are captured […] by a camera of an electronic device selected from the group consisting of: (i) a smartphone, (ii) an Augmented Reality (AR) device, (iii) smart glasses or smart sunglasses that include at least a camera and a memory unit and a processor (¶ 6 “One example aspect of the present disclosure is directed to a computer-implemented method. The method can include obtaining, by a computing system including one or more processors, image data. The image data can include an input image. The method can include processing, by the computing system, the input image with an object recognition model to generate a fine-grained object recognition output. The fine-grained object recognition output can be descriptive of identification details for an object depicted in the input image. The method can include processing, by the computing system, the input image with a vision language model to generate a language output. The language output can include a set of predicted words predicted to be descriptive of the input image. In some implementations, the set of predicted words can include a coarse-grained term descriptive of predicted identification of the object depicted in the input image. The method can include processing, by the computing system, the fine-grained object recognition output and the language output to generate an augmented language output. The augmented language output can include the set of predicted words with the coarse-grained term replaced with the fine-grained object recognition output.” ¶ 41 “In particular, the detailed image captioning system 10 can obtain input data, which can include image data 12 descriptive of one or more input images. 
The one or more input images can be descriptive of an environment and one or more objects. The environment can include a room, a landscape, a city, a town, a sky, and/or other environments. In some implementations, the environment is descriptive of a user environment generated with one or more image sensors of a user computing device. The one or more objects can include products, people, plants, animals, art pieces, structures, landmarks, and/or other objects. “¶ 98 “At 802, a computing system can obtain image data. The image data can include an input image. The input image can be descriptive of a particular environment and/or one or more particular objects. The input image may be obtained and/or generated with a visual search application. In some implementations, the image data can be obtained and/or generated with a smart wearable (e.g., a smart watch, smart glasses, a smart helmet, etc.). The image data may be obtained with other input data, which may include text data and/or audio data associated with a question.” ¶ 115 “The user computing system 102 may include and/or receive data from one or more sensors 126. The one or more sensors 126 may be housed in a housing component that houses the one or more processors 112, the memory 114, and/or one or more hardware components, which may store, and/or cause to perform, one or more software packets. The one or more sensors 126 can include one or more image sensors (e.g., a camera), one or more lidar sensors, one or more audio sensors (e.g., a microphone), one or more inertial sensors (e.g., inertial measurement unit), one or more biological sensors (e.g., a heart rate sensor, a pulse sensor, a retinal sensor, and/or a fingerprint sensor), one or more infrared sensors, one or more location sensors (e.g., GPS), one or more touch sensors (e.g., a conductive touch sensor and/or a mechanical touch sensor), and/or one or more other sensors. The one or more sensors can be utilized to obtain data associated with a user's environment (e.g., an image of a user's environment, a recording of the environment, and/or the location of the user).” ¶ 116 “The user computing system 102 may include, and/or pe part of, a user computing device 104. The user computing device 104 may include a mobile computing device (e.g., a smartphone or tablet), a desktop computer, a laptop computer, a smart wearable, and/or a smart appliance. Additionally and/or alternatively, the user computing system may obtain from, and/or generate data with, the one or more one or more user computing devices 104. For example, a camera of a smartphone may be utilized to capture image data descriptive of the environment, and/or an overlay application of the user computing device 104 can be utilized to track and/or process the data being provided to the user. Similarly, one or more sensors associated with a smart wearable may be utilized to obtain data about a user and/or about a user's environment (e.g., image data can be obtained with a camera housed in a user's smart glasses). Additionally and/or alternatively, the data may be obtained and uploaded from other user devices that may be specialized for data obtainment or generation.” Also see citations below.); (b) automatically feeding the one or more images to said VLM, and automatically commanding said VLM to generate an output that depends at least on analysis of content of said one or more images (¶ 6 “One example aspect of the present disclosure is directed to a computer-implemented method. 
The method can include obtaining, by a computing system including one or more processors, image data. The image data can include an input image. The method can include processing, by the computing system, the input image with an object recognition model to generate a fine-grained object recognition output. The fine-grained object recognition output can be descriptive of identification details for an object depicted in the input image. The method can include processing, by the computing system, the input image with a vision language model to generate a language output. The language output can include a set of predicted words predicted to be descriptive of the input image. In some implementations, the set of predicted words can include a coarse-grained term descriptive of predicted identification of the object depicted in the input image. The method can include processing, by the computing system, the fine-grained object recognition output and the language output to generate an augmented language output. The augmented language output can include the set of predicted words with the coarse-grained term replaced with the fine-grained object recognition output.” ¶ 28 “Generally, the present disclosure is directed to systems and methods for detailed instance-level scene recognition. In particular, the systems and methods disclosed herein can leverage an object recognition system and a vision language model to generate detailed captions, queries, and/or prompts associated with input images. For example, an object recognition system (e.g., a system with one or more object recognition models) can process an input image to generate an object recognition output descriptive of a recognition of a particular object of a particular object class. The object recognition output can be descriptive of a detailed identification of the specific object instance. Additionally, a vision language model can process the input image to generate a language output descriptive of a scene recognition for the scene depicted in the input image. The language output can include details descriptive of the environment and one or more objects in the environment. The language output may not include the granularity and/or specificity of the object recognition output. The object recognition model and the vision language model may process the input image in parallel to reduce latency. The object recognition output and the language output can then be processed to generate an augmented language output that is descriptive of the scene recognition of the language output with the specificity and/or particularity of the object recognition output. For example, the language output may include an identification of a particular object class for the object depicted in the input image, while the augmented language output may include a specific indication of an instance-level identification of the depicted object (e.g., a brand and model name for a product, a name of a depicted person, a name for a piece of art, and/or a species and subspecies identification for a plant or animal).” ¶ 32 “Pairing instance level object recognition with vision language model processing can be utilized to generate detailed captions, queries, and/or prompts. 
Combining scene understanding with instance understanding can be leveraged for image searching, image indexing, automated content generation and/or understanding, and/or other image understanding tasks [i.e., automatically feeding the one or more images to the VLM and automatically commanding the VLM to generate output]. For example, the augmented language output can be leveraged as and/or to generate a detailed query and/or a detailed prompt to obtain and/or generate additional information. The particularity can lead to improved tailoring of search results and/or generative prompts.” ¶ 36 “The systems and methods disclosed herein can be leveraged to process a plurality of different data types (e.g., image data, text data, video data, audio data, statistical data, graph data, latent encoding data, and/or multimodal data) to generate outputs that may be in a plurality of different data formats (e.g., image data, text data, video data, audio data, statistical data, graph data, latent encoding data, and/or multimodal data). For example, the input data may include a video that can be processed to generate a summary of the video, which may include a natural language summary, a timeline, a flowchart, an audio file in the form of a podcast, and/or a comic book. The object recognition system can be leveraged for object specific details, while the scene understanding model (e.g., a vision language model) may be leveraged for scene recognition and/or frame group understanding. In some implementations, one or more additional models may be leveraged for context understanding. For example, a hierarchical video encoder may be utilized for frame understanding, frame sequence understanding, and/or full video understanding. Audio input processing may include the utilization of a text-to-speech model, which may be implemented as part of the language model.” ¶ 72 “Additionally and/or alternatively, the computing system can provide the augmented language output in an augmented-reality experience. The augmented-reality experience can include the augmented language output overlayed over a live video feed of an environment. The augmented-reality experience can be provided via a viewfinder of a mobile computing device, via a smart wearable, and/or via other computing devices. The augmented-reality experience can be leveraged to ask and receive additional information on a user's environment via an augmented-reality interface.” ¶ 149 “ In some implementations, the one or more datasets and/or the output(s) of the sensor processing system 60 may be processed with one or more generative models 90 to generate a model-generated content item that can then be provided to a user. The generation may be prompted based on a user selection and/or may be automatically performed [i.e., automatically feeding the one or more images to the VLM and automatically commanding the VLM to generate output] (e.g., automatically performed based on one or more conditions, which may be associated with a threshold amount of search results not being identified).” ¶ 158 “The processes may be performed iteratively and/or continuously. 
One or more user inputs to the provided user interface elements may condition and/or affect successive processing loops.” Also see citations above.); (c) receiving the output generated by said VLM; and based on said output, providing to said user, via said electronic device, information about one or more products that are depicted in said one or more images (¶ 6 “One example aspect of the present disclosure is directed to a computer-implemented method. The method can include obtaining, by a computing system including one or more processors, image data. The image data can include an input image. The method can include processing, by the computing system, the input image with an object recognition model to generate a fine-grained object recognition output. The fine-grained object recognition output can be descriptive of identification details for an object depicted in the input image. The method can include processing, by the computing system, the input image with a vision language model to generate a language output. The language output can include a set of predicted words predicted to be descriptive of the input image. In some implementations, the set of predicted words can include a coarse-grained term descriptive of predicted identification of the object depicted in the input image. The method can include processing, by the computing system, the fine-grained object recognition output and the language output to generate an augmented language output. The augmented language output can include the set of predicted words with the coarse-grained term replaced with the fine-grained object recognition output.” ¶ 28 “Generally, the present disclosure is directed to systems and methods for detailed instance-level scene recognition. In particular, the systems and methods disclosed herein can leverage an object recognition system and a vision language model to generate detailed captions, queries, and/or prompts associated with input images. For example, an object recognition system (e.g., a system with one or more object recognition models) can process an input image to generate an object recognition output descriptive of a recognition of a particular object of a particular object class. The object recognition output can be descriptive of a detailed identification of the specific object instance. Additionally, a vision language model can process the input image to generate a language output descriptive of a scene recognition for the scene depicted in the input image. The language output can include details descriptive of the environment and one or more objects in the environment. The language output may not include the granularity and/or specificity of the object recognition output. The object recognition model and the vision language model may process the input image in parallel to reduce latency. The object recognition output and the language output can then be processed to generate an augmented language output that is descriptive of the scene recognition of the language output with the specificity and/or particularity of the object recognition output. 
For example, the language output may include an identification of a particular object class for the object depicted in the input image, while the augmented language output may include a specific indication of an instance-level identification of the depicted object (e.g., a brand and model name for a product, a name of a depicted person, a name for a piece of art, and/or a species and subspecies identification for a plant or animal).” ¶ 32 “Pairing instance level object recognition with vision language model processing can be utilized to generate detailed captions, queries, and/or prompts. Combining scene understanding with instance understanding can be leveraged for image searching, image indexing, automated content generation and/or understanding, and/or other image understanding tasks [i.e., automatically feeding the one or more images to the VLM and automatically commanding the VLM to generate output]. For example, the augmented language output can be leveraged as and/or to generate a detailed query and/or a detailed prompt to obtain and/or generate additional information. The particularity can lead to improved tailoring of search results and/or generative prompts.” ¶ 36 “The systems and methods disclosed herein can be leveraged to process a plurality of different data types (e.g., image data, text data, video data, audio data, statistical data, graph data, latent encoding data, and/or multimodal data) to generate outputs that may be in a plurality of different data formats (e.g., image data, text data, video data, audio data, statistical data, graph data, latent encoding data, and/or multimodal data). For example, the input data may include a video that can be processed to generate a summary of the video, which may include a natural language summary, a timeline, a flowchart, an audio file in the form of a podcast, and/or a comic book. The object recognition system can be leveraged for object specific details, while the scene understanding model (e.g., a vision language model) may be leveraged for scene recognition and/or frame group understanding. In some implementations, one or more additional models may be leveraged for context understanding. For example, a hierarchical video encoder may be utilized for frame understanding, frame sequence understanding, and/or full video understanding. Audio input processing may include the utilization of a text-to-speech model, which may be implemented as part of the language model.” ¶ 72 “Additionally and/or alternatively, the computing system can provide the augmented language output in an augmented-reality experience. The augmented-reality experience can include the augmented language output overlayed over a live video feed of an environment. The augmented-reality experience can be provided via a viewfinder of a mobile computing device, via a smart wearable, and/or via other computing devices. The augmented-reality experience can be leveraged to ask and receive additional information on a user's environment via an augmented-reality interface.” ¶ 149 “ In some implementations, the one or more datasets and/or the output(s) of the sensor processing system 60 may be processed with one or more generative models 90 to generate a model-generated content item that can then be provided to a user. 
The generation may be prompted based on a user selection and/or may be automatically performed [i.e., automatically feeding the one or more images to the VLM and automatically commanding the VLM to generate output] (e.g., automatically performed based on one or more conditions, which may be associated with a threshold amount of search results not being identified).” ¶ 158 “The processes may be performed iteratively and/or continuously. One or more user inputs to the provided user interface elements may condition and/or affect successive processing loops.” Also see at least ¶ 34 and citations above.). While Kharbanda discloses all of the above limitations, including providing one or more images that are captures within an environment, Kharbanda does not appear to explicitly state that the environment is within a retailer venue. However, in the same field of endeavor, Chachek teaches this limitation in at least ¶ 4 “The present invention provides systems, devices, and methods for Augmented Reality (AR) based mapping of a venue and its real-time inventory, as well as real-time navigation within such venue. The system utilizes flexible and modular technology, that is continuously updated in real-time or in near-real-time, for localization of a user (having an end-user electronic device, such as smartphone or tablet or smart-watch) within a venue (e.g., a store, a supermarket, a shopping center, a mall, or the like), and for navigating or routing or guiding such user among products or elements in such venue based on the spatial localization and recognition of inventory elements and other elements that are located in such venue. The present invention is capable of operating in a continuously-changing environment of a busy store or mega-store of a retailer, in which shelves and aisles are continuously changing, and in which the views captured by store cameras and/or by electronic devices of users are continuously shopping, for example, due to customers taking or removing products from shelves, employees stocking or re-stocking products into shelves, employees moving and re-arranging items within the store (e.g., particularly before and after holidays and sale events), and many other changes that occur” Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date to modify the environment as disclosed by Kharbanda, for the retailer venue as taught by Chachek, because doing so would enable the system to route or guide or navigate a user among products (Chachek, ¶ 4). The combination would enable the system to help users find products on their shopping list. Additionally, Since each individual element and its function are shown in the prior art, albeit shown in separate reference, the difference between the claimed subject matter and the prior art rests not on any individual element or function, but in the very combination itself – that is in the substitution of the retailer venue of Chachek for the environment of Kharbanda. Thus, the simple substitution of one known element for another producing a predictable result renders the claim obvious (KSR Rationale B). As per Claim 2, Kharbanda discloses comprising: receiving from the user, via said electronic device, a question that pertains to one or more products that are depicted in said one or more images (¶ 29 “The augmented language output may then be leveraged as a query and/or a prompt to obtain additional information associated with the scene and/or objects depicted in the input image. 
In some implementations, input text may be received with the input image, and the language output and/or the augmented language output may be generated based in part on the input text. Therefore, a user may ask a question about a depicted scene, a detailed scene recognition can be generated, and a detailed query and/or prompt can be generated that includes the semantic intent of the question and the recognition information of the augmented language output. The augmented language output may be processed with a search engine and/or a generative model (e.g., a large language model, a vision language model, an image generation model, etc.) to generate the additional information, which may be responsive to the input question.” ¶ 30 “Vision language models can leverage learned image and language associations to generate natural language captions for images; however, vision language models can struggle with details including object particularity. The lack of particularity can lead to the generation of generalized queries and/or prompts, which may fail to provide results that are specific to and/or applicable to the features depicted in the image. For example, a user may provide an image with a question “how do I take care of this?” The vision language model may process the image to determine the image depicts a plant, which can be leveraged to generate a refined query of “what do plants need to stay alive and grow?” The refined query can be processed to determine search results that may be associated with general care instructions for plants, which may include watering twice a week, half a day of direct sunlight, and loamy soil. However, the generalized care instructions may not be suitable for the specific plant depicted in the image (e.g., a succulent (e.g., an agave plant) needs less water and different soil, and a shuttlecock fern may thrive in shade over direct sunlight). Therefore, the utilization of generalized information for the object class may be detrimental to the caretaking and counter to the original purpose of the inputs.” ¶ 82 “FIG. 4C depicts processing an input image 452 and input text to generate a refined query 458. The input image 453 can depict a particular humidifier. The input text may include a question about the depicted product. For example, the input text can include “what's it use?” Also see at least ¶¶ 85-90, 95-98, 153, and citations above.); automatically feeding said question into said VLM, and also feeding to said VLM the one or more images (¶¶ 29-30 and 82. Also see at least ¶¶ 85-90, 95-98, 153, and citations above.); and automatically commanding said VLM to generate a response to said question based on said one or more images (¶¶ 29-30 and 82. Also see at least ¶¶ 85-90, 95-98, 153, and citations above.); receiving from said VLM a VLM-generated response to said question, and providing said VLM-generated response to said user via said electronic device (¶¶ 29-30 and 82. Also see at least ¶¶ 85-90, 95-98, 153, and citations above.). As per Claim 3, Kharbanda discloses comprising: recognizing at said VLM a particular product that is depicted in said one or more images (¶ 28. Also see at least ¶¶ 34, 41, 43, 45, 54, 65-66, 82-87, and citations above.), wherein said one or more images do not depict a barcode of said particular product (¶¶ 82-83. 
Also see at least ¶¶ 28, 34, 41, 43, 45, 54, 65-66, 82-87, and citations above.), wherein said VLM is configured or automatically commanded to perform VLM-based image analysis that recognizes products […] based on external visual appearance of products and without recognizing or analyzing product barcodes (¶ 34. Also see at least ¶¶ 28, 41, 43, 45, 54, 65-66, 82-87, and citations above.). While Kharbanda analyzes every product in an aisle (¶ 34), Kharbanda does not appear to explicitly indicate that the products are on shelves. However, in the same field of endeavor, Chachek discloses this limitation in at least ¶¶ 12-13 and 63. Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date to modify the products that are displayed in aisles as disclosed by Kharbanda, to explicitly include product that are the shelves as taught by Chachek, because doing so would enable the system and/or customers to detect products that are missing on shelves in retail stores (Chachek, ¶¶ 112 and 139-140). The combination would enable customers to easily find and purchase products of interest by displaying the products on shelves. Additionally, since each individual element and its function are shown in the prior art, albeit shown in separate reference, the difference between the claimed subject matter and the prior art rests not on any individual element or function, but in the very combination itself – that is in the substitution of the on-shelf products of Chachek for the products displayed in aisles of Kharbanda. Thus, the simple substitution of one known element for another producing a predictable result renders the claim obvious (KSR Rationale B). As per Claim 4, Kharbanda discloses comprising: recognizing the particular product by said VLM, by taking into account information that said VLM deduced from said one or more images […] (¶ 28. Also see at least ¶¶ 34, 41, 43, 45, 54, 65-66, 82-87, and citations above.). While Kharbanda recognizes the particular product by taking account information that the VLM deduced from said one or more images, Kharbanda does not appear to disclose one or more images about another, neighboring, product. However, in the same field of endeavor, Chachek discloses this limitation in at least ¶ 139 “Some embodiments of the present invention may generate an Augmented Reality (AR) view, of an area or region of a venue (store, mall, department within a store, or the like), which shows an actual image of the region (e.g., captured in real time via an imager of the smartphone or electronic device of the user), and Augmented with additional information or visual elements (graphics, animation, text, video, labels, tags, prices, filters, emphasis or highlighting of specific products or features, or the like). This may be performed by an AR Element Adding Unit 134, which may generate such additional elements and add them or overlay them onto a real-time imaging output of the venue or region. 
For example, the AR Element Adding Unit 134 may generate and add: (a) a price and/or a name of a particular product that is seen within the image or the ongoing AR view; (b) a tag or label of a product (e.g., “Diet Coke, 1.5 liters, 2 dollars”), or for an entire section or shelf or table of products (e.g., “Breakfast Cereals” or “Sugar” or “Rice”); (c) an emphasis or highlighting of a particular product that is within the region of interest or the viewed image (e.g., highlighting “Diet Coke 1.5 liters, which you have purchased in the past!”; or highlighting “Diet Pepsi 2 liters, which is on sale today!”; or highlighting “Sprite Zero is calorie free!” because a manufacturer of this product had paid or performed real-time bidding to have this product highlighted or emphasized in the view); (d) an AR avatar or AR decorations or additions, such as an avatar that is walking or flying around the store, animated or static decorations of flowers or bees or birds; (e) a navigation trail, shown in first-person view or third-person view or perspective view, the trail indicating a common or popular shopping trail of this particular user and/or of a group of users and/or of the population of visitors, and/or the trail including one or more segments or points-of-interest that are included in this trail due to their being on a current or past shopping list or wish list of this user, or due to their having a product that is promoted or on sale or that its manufacturer had paid or had performed real-time bidding in order for this product to be included in the navigation trail; (f) an on-screen virtual compass, optionally as an overlay onto the floor or ground of the store or venue, indicating the relative directions of departments (e.g., “dairy”, “fruits”) or product-types (e.g., “mineral water”, “rice”) or specific products (e.g., “Diet Coke” or “Pepsi), and allowing the user to interact with such on-screen compass by rotating and/or selecting one or more of its rings or ring-portions, and/or automatically rotating or spinning or updating such on-screen compass in response to detection that the user has moved or turned or rotated in real life; (g) an on-screen addition of particular products that are currently Missing or our “out of stock”, by performing image analysis and computer vision analysis of the captured view, detecting that a first product is indeed shown on the shelf, whereas a second, nearby, product, is lacking from the image and instead of it there is a void or an empty space on the shelf, and then obtaining from the Store Map and from the Inventor Database the data about the missing product or out-of-stock product and then generating an overlay AR element that shows a virtual image of that missing product in the exact location of the empty shelf as viewed on the user's device; and/or other AR elements which may be generated and/or added on-the-fly as the user walks within the store.” ¶ 140 “Reference is made to FIG. 18, which is a schematic illustration of two images 1801-1802 displayed on a user device, in accordance with some demonstrative embodiments of the present invention. For example, image 1801 shows a non-augmented view or a non-modified view of a shelf: the user points the camera of his smartphone (or other electronic device) towards a shelf, and image 1801 demonstrates the view as originally captured and shown to the user. 
The user may notice that the top-left portion is empty and appears to be missing items; and may provide to the user a request or a query of “show me which products are missing from this top shelf”. The system may check the store's planogram of map, and may determine which products are typically intended to be stored on that shelf-region; and may then generated the modified image or the Augmented Reality (AR) image 1802, which depicts stock images or stock photos of the missing products, optionally with a label or tag of their name and/or the fact that they are currently missing or out-of-stock. In some embodiments, system may automatically perform computer vision analysis of the original image 1800, and may recognize or detect that there is an empty shelf space which is greater than a pre-defined threshold and thus indicates a large quantity of missing products; and may perform the AR-based augmentation or addition of the missing products onto the screen of the user device, automatically upon recognizing such missing products or such empty shelf-space.” Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date to modify the step of recognizing products by taking into account information that the VLM deduced from one or more images as disclosed by Kharbanda, to include the one or more images about another neighboring product as taught by Chachek, because doing so would enable customers and retailers to determine when products are out-of-stock (Chachek, ¶ 139). The combination would also enable retailers to determine when products are out-of-stock so that a store clerk can restock the shelf in a timely manner. The combination is also merely a combination of old elements, and in the combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable (KSR Rationale A). As per Claim 5, Kharbanda discloses comprising: firstly, invoking a Machine Learning (ML) product detection process and image slicing process, to slice an image that was captured by the electronic device and that depicts a plurality of […] product, into a corresponding plurality of discrete image-portions, each image-portion depicting only a single ML-detected product or ML-detected object (¶¶ 34, 44, 55-56, 66-67, 94, and 140. Also see citations above.); then, feeding each of said discrete image-portions into the VLM, and automatically commanding the VLM to perform VLM-based product recognition on each of said discrete image-portions; wherein said image-portions do not depict barcodes of products (¶¶ 34, 44, 55-56, 66-67, 94, and 140. Also see citations above.). While Kharbanda analyzes every product in an aisle (¶ 34), Kharbanda does not appear to explicitly indicate that the products are on-shelf. However, in the same field of endeavor, Chachek discloses this limitation in at least ¶¶ 12-13 and 63. Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date to modify the products that are displayed in aisles as disclosed by Kharbanda, to explicitly include product that are the shelves as taught by Chachek, because doing so would enable the system and/or customers to detect products that are missing on shelves in retail stores (Chachek, ¶¶ 112 and 1
Read full office action
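For readers parsing the §103 mapping above: the Kharbanda technique the rejection leans on is caption augmentation, in which a fine-grained object recognition output (e.g., a brand and model name) replaces the coarse-grained term in the VLM's caption (Kharbanda ¶¶ 6, 28). The following is a minimal sketch of that replacement step, with both models stubbed; the function names and outputs are hypothetical and do not reflect Kharbanda's actual implementation.

```python
# Sketch of the coarse-to-fine caption augmentation described in Kharbanda
# (¶¶ 6, 28): an object recognition model supplies an instance-level label,
# and that label replaces the coarse object-class term in the VLM caption.
# Both models are stubbed; recognize_instance and caption_image are hypothetical.

def recognize_instance(image: bytes) -> str:
    # Fine-grained object recognition output (stubbed).
    return "Acme MistMaster 3000 ultrasonic humidifier"

def caption_image(image: bytes):
    # VLM language output: a caption plus the coarse-grained term it used (stubbed).
    return ("the humidifier on a store shelf next to a row of air purifiers",
            "humidifier")

def augment_caption(image: bytes) -> str:
    fine_grained = recognize_instance(image)
    caption, coarse_term = caption_image(image)
    # Replace the coarse-grained term with the fine-grained recognition output,
    # yielding the "augmented language output" used for detailed queries/prompts.
    return caption.replace(coarse_term, fine_grained, 1)

if __name__ == "__main__":
    print(augment_caption(b"<input image>"))
    # -> "the Acme MistMaster 3000 ultrasonic humidifier on a store shelf ..."
```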

Prosecution Timeline

Oct 07, 2024
Application Filed
Nov 07, 2025
Non-Final Rejection — §101, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597047
SYSTEM AND METHOD FOR PROVIDING EXTERNAL NOTIFICATIONS OF EVENTS IN A VIRTUAL SPACE TO USERS
2y 5m to grant • Granted Apr 07, 2026
Patent 12586102
HEURISTIC CLUSTERING
2y 5m to grant • Granted Mar 24, 2026
Patent 12548070
DYNAMIC AUGMENTED REALITY AND GAMIFICATION EXPERIENCE FOR IN-STORE SHOPPING
2y 5m to grant • Granted Feb 10, 2026
Patent 12462276
METHODS, SYSTEMS, AND MEDIA FOR IDENTIFYING AUTOMATICALLY REFRESHED ADVERTISEMENTS
2y 5m to grant • Granted Nov 04, 2025
Patent 12443973
DEEP LEARNING-BASED REVENUE-PER-CLICK PREDICTION MODEL FRAMEWORK
2y 5m to grant • Granted Oct 14, 2025
Study what changed in these cases to get past this examiner. Based on the 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
34%
Grant Probability
42%
With Interview (+7.4%)
3y 2m
Median Time to Grant
Low
PTA Risk
Based on 427 resolved cases by this examiner. Grant probability derived from career allow rate.
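As a sanity check on how the headline projections appear to be derived, the displayed percentages can be reproduced from the career figures shown earlier; the exact model is not disclosed on this page, so the calculation below is an assumption consistent with the caption.

```python
# Sketch: reproduce the displayed projections from the career figures above.
# Assumption (per the caption): grant probability = career allow rate, and
# "With Interview" adds the +7.4 point interview lift before rounding.
granted, resolved = 146, 427
career_allow_rate = 100 * granted / resolved        # 34.2% -> shown as 34%
with_interview = career_allow_rate + 7.4            # 41.6% -> shown as 42%

print(f"Grant probability: {career_allow_rate:.1f}%")
print(f"With interview:    {with_interview:.1f}%")
```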
