Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant's claim for foreign priority based on an application filed in Japan on 09/26/2023. It is noted, however, that applicant has not filed a certified copy of the JP2023-162954 application as required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 08/06/2024, 02/04/2025, and 09/05/2025 have been considered by the examiner.
Drawings
The drawings submitted on 08/06/2024 have been considered by the examiner.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 11-30 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by William et al., “Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language”.
Regarding Claims 11 and 21, William teaches: An explanation generating system comprising a database (Fig. 2, frozen LLM and a set of “vision modules”) which can be searched by a worker in natural language; a processor which stores a sentence to said database (Introduction: Large Language Models (LLMs) … capabilities in semantic understanding, question answering and text generation … Figure 2: The LENS framework. LENS executes computer vision and visual reasoning tasks through a frozen LLM and a set of “vision modules”. LENS leverages these vision modules to retrieve a textual description for an image, which is used by the “reasoning module” (LLM) to generate a response for a given query); wherein said processor: receives scene information indicating a scene recognized by a generative model (LENS) which can analyze an image from a camera and output in natural language (3.2, LENS Components: LENS consists of 3 distinct vision modules and 1 reasoning module, each serving a specific purpose based on the task at hand. Tag Module. Given an image, this module identifies and assigns tags to the image. To accomplish this, we employ a vision encoder (CLIP) that selects the most suitable tags for each image. In our work, we adopt a common prompt: "A photo of {classname}" … Attributes Module. We utilize this module to identify and assign relevant attributes to the objects present in the image. Intensive Captioner. We utilize an image captioning model called BLIP and apply stochastic top-k sampling [12] to generate N captions per image. Reasoning Module.
We adopt a frozen LLM as our reasoning module, which is capable of generating answers based on the textual descriptions fed by the vision modules, along with the task-specific instructions.); inputs a prompt to said generative model, wherein said prompt corresponds to said scene information and said prompt is generated based on explanation necessity of an object set for each scene; receives a situation explanatory sentence generated by said generative model according to said prompt (3.3 Prompt Design: With the textual information obtained from the vision modules, we construct complete prompts for the LLM by combining them. We format the tags module as Tags: {Top-k tags}, the attributes module as Attributes: {Top-K attributes}, and the intensive captioning module as Captions: {Top-N Captions}. In particular, for the hateful-memes task, we incorporate an OCR prompt as OCR: this is an image with written "{meme text}" on it. Finally, we append the specific question prompt: Question: {task-specific prompt} \n Short Answer: at the end. Also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place,” and LENS generates an answer explaining the object in the scene: “The Great Wall of China is a fortification built by the ancient Chinese to keep out invaders.”); and associates said situation explanatory sentence with said image, and stores said situation explanatory sentence into said database (3.2 LENS Components: LENS consists of 3 distinct vision modules and 1 reasoning module, each serving a specific purpose based on the task at hand. 4.2 Implementation Details: We use OpenCLIP-H/14 and CLIP-L/14 as our default vision encoders in both tags and attributes modules.
We adopt a BLIP-large captioning checkpoint finetuned on COCO [36] in the intensive captioning module. In this module, we perform top-k sampling [12], where k represents the desired number of captions, and generate a maximum of k = 50 captions per image. Finally, we adopt Flan-T5 models as our default family of frozen LLMs [37]. To generate answers in line with the evaluation tasks, we employ beam search with the number of beams equal to 5.).
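The prompt format quoted from Section 3.3 of the cited reference can be sketched as follows. This is an illustrative reconstruction for clarity only; the function and variable names are hypothetical and are not part of the reference or the record:

```python
def build_lens_prompt(tags, attributes, captions, question, meme_text=None):
    """Assemble a LENS-style prompt from vision-module outputs,
    following the format quoted from Section 3.3 of the reference."""
    parts = [
        "Tags: " + ", ".join(tags),            # Tags: {Top-k tags}
        "Attributes: " + ", ".join(attributes),  # Attributes: {Top-K attributes}
        "Captions: " + " ".join(captions),     # Captions: {Top-N Captions}
    ]
    # For the hateful-memes task, the reference inserts an OCR line as well.
    if meme_text is not None:
        parts.append(f'OCR: this is an image with written "{meme_text}" on it.')
    # The task-specific question prompt is appended at the end.
    parts.append(f"Question: {question} \nShort Answer:")
    return "\n".join(parts)


prompt = build_lens_prompt(
    tags=["wall", "fortification"],
    attributes=["ancient", "stone"],
    captions=["a long stone wall winding over hills"],
    question="Tell me something about the history of this place.",
)
```

The assembled string is what the reference's frozen LLM (reasoning module) receives in place of any visual input.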
Regarding Claims 12 and 22, William teaches: The explanation generating system according to claim 11, wherein said generative model includes a language model (LLM) (See rejection of claim 11 and 3.2, LENS Components, Reasoning Module: We adopt a frozen LLM as our reasoning module, which is capable of generating answers based on the textual descriptions fed by the vision modules, along with the task-specific instructions.).
Regarding Claims 13 and 23, William teaches: The explanation generating system according to claim 11, wherein said prompt is generated based on said explanation necessity and recognition necessity (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place.”).
Regarding Claims 14 and 24, William teaches: The explanation generating system according to claim 11, wherein said prompt includes an image explanatory sentence (See rejection of claim 11 and Fig. 4).
Regarding Claims 15 and 25, William teaches: The explanation generating system according to claim 11, wherein said prompt includes a control explanatory sentence (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place.”).
Regarding Claims 16 and 26, William teaches: The explanation generating system according to claim 11, wherein said prompt includes a GPS (China) explanatory sentence (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place,” and LENS generates an answer explaining the object in the scene: “The Great Wall of China is a fortification built by the ancient Chinese to keep out invaders.”).
Regarding Claims 17 and 27, William teaches: The explanation generating system according to claim 12, wherein said prompt is generated based on said explanation necessity and recognition necessity (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place.”).
Regarding Claims 18 and 28, William teaches: The explanation generating system according to claim 17, wherein said prompt includes an image explanatory sentence (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place.”).
Regarding Claims 19 and 29, William teaches: The explanation generating system according to claim 18, wherein said prompt includes a control explanatory sentence (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place.”).
Regarding Claims 20 and 30, William teaches: The explanation generating system according to claim 19, wherein said prompt includes a GPS (China) explanatory sentence (See rejection of claim 11; also see Fig. 4, where a user queries a scene/image with “Tell me something about the history of this place,” and LENS generates an answer explaining the object in the scene: “The Great Wall of China is a fortification built by the ancient Chinese to keep out invaders.”).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art of record Chae et al. (KR 102785215 B1) teaches: System And Method For Providing Conversational Artificial Intelligence Service Using Complex Analysis Of Image And Query.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM whose telephone number is (571) 270-5878. The examiner can normally be reached Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Paras Shah, can be reached at 571-270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2653