Last updated: May 29, 2026
Application No. 18/588,278
INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND COMPUTER PROGRAM PRODUCT

Non-Final OA §101 §103 §112
Filed
Feb 27, 2024
Priority
Jul 04, 2023 — JP 2023-110203
Examiner
KOETH, MICHELLE M
Art Unit
2671
Tech Center
2600 — Communications
Assignee
Kabushiki Kaisha Toshiba
OA Round
1 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (77%) and a +16.3% interview lift. A written response may suffice.
Based on 434 resolved cases, 2023–2026
Examiner Intelligence

KOETH, MICHELLE M
Grants 77% — above average
Career Allowance Rate
335 granted / 434 resolved
+15.2% vs TC avg
Strong interview lift: +16.3% (resolved cases with an interview vs. without)
Fast prosecutor
2y 2m
Avg Prosecution
22 currently pending
Career history
465
Total Applications
across all art units
Statute-Specific Performance

§101: 1.5% (-38.5% vs TC avg)
§103: 91.2% (+51.2% vs TC avg)
§102: 1.2% (-38.8% vs TC avg)
§112: 4.8% (-35.2% vs TC avg)
Tech Center averages are estimates. Based on career data from 434 resolved cases.
Office Action

§101 §103 §112
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. The following title is suggested: Device, method and computer program product for visual question answering processing.

Claim Objections
Claim 8 is objected to because of the following informalities: the last line recites “based on a result of vote,” but vote is earlier introduced in the claim, and therefore, claim 8 should instead recite “based on a result of the vote.”  Appropriate correction is required.
Claim 3 is objected to because of the following: the limitation “wherein the detection unit is configured to detect at least one piece of object information including an object area,” is a duplicate limitation as this limitation is earlier recited in claim 1 from which claim 3 depends as “a detection unit configured to detect at least one piece of object information including an object area.” Therefore, the duplicate limitation in claim 3 must be deleted. Appropriate correction is required.
Claim 6, and therefore claim 7 which depends therefrom, is objected to because of the following informalities: line 2 recites “the acquisition unit is configure to,” but should instead recite “the acquisition unit is configured to.”  Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitations are: “detection unit configured to detect,” “a cut-out unit configured to generate,” “an acquisition unit configured to acquire,” and “a visual question answering (VQA) processing unit configured to perform,” in claim 1, “a storage unit configured to store,” in claim 2, “a transformation unit configured to transform,” in claim 5, “a display control unit configured to display,” in claim 7, and “a voting unit configured to vote,” in claim 8.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. Specifically, the “detection unit configured to detect,” “a cut-out unit configured to generate,” “an acquisition unit configured to acquire,” and “a visual question answering (VQA) processing unit configured to perform,” in claim 1, “a transformation unit configured to transform,” in claim 5, and “a voting unit configured to vote,” in claim 8 are interpreted as one or more processors executing functions by computer programming as given in ¶¶ 90–100 of the originally filed specification. The claimed “a storage unit configured to store,” in claim 2, is interpreted as a computer-readable storage medium as given in ¶¶ 92 and 94. The claimed “a display control unit configured to display,” in claim 7, is interpreted as a display device such as a liquid crystal display as given in ¶ 93.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 3 and 9 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 3, lines 5–6 recite “an object type read out from the storage unit,” however, claim 2 from which claim 3 depends, also recites “acquire …the object type … from the storage unit,” and therefore it is unclear and indefinite whether the object type read out from the storage unit in claim 3 is the same one in claim 2 that is “acquired” from the storage unit, or is a different object type altogether.
Regarding claim 9, the preamble recites the elements of the claim using the transitional phrase “the instructions causing a computer to function as:”, where the transitional term “as” is indefinite for failing to define the scope of claim 9 with respect to what unrecited additional components or steps, if any, are excluded from the scope of the claim. See MPEP 2111.03. The applicant must select a proper transitional phrase, with examples provided in MPEP 2111.03, to overcome this rejection. For the purposes of examination, claim 9 is interpreted as requiring the instructions to comprise the recited units and associated functionality, the recited units being interpreted as software units.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1–10 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without a practical application or significantly more.
Regarding claims 1, 9 and 10, these claims recite the following limitations which are found to be abstract ideas not reciting a practical application or significantly more, with claim 1 being exemplary:
detect at least one piece of object information including an object area containing an object to be detected and object identification information for identifying the object to be detected, from an image; generate at least one object image, by cutting out at least one object area from the image; acquire at least one question according to the object identification information; perform a VQA process with the at least one question, for each of the at least one object image (abstract idea as a mental process as a human mind practically performs detecting regions in an image where an object is pictured, and understands information regarding the object’s identification, can draw with physical aids (pen and paper) an object image by removing background information from an image, presuppose a question that would be asked about the object and presuppose an answer to that question about the object).
Claims 1, 9 and 10 further recite additional elements: claim 1 is directed towards a processing device, and claim 9 is directed towards a product comprising a non-transitory computer-readable medium, and both claims 1 and 9 recite a detection unit, a cut-out unit, an acquisition unit and a visual question answering processing unit. Claim 10 is directed towards a method, but also recites an information processing device. While the devices, units and non-transitory computer-readable medium of the claims are additional elements, they are not sufficient to recite a practical application of the abstract ideas recited in claims 1, 9 and 10 as they amount to mere generic computer elements and thus amount to no more than a recitation of the words "apply it" (or an equivalent) or are no more than mere instructions to implement an abstract idea or other exception on a computer. see MPEP §2106.05(f). 
Further, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because when considered separately and in combination, the above recited additional elements from claims 1, 9 and 10 do not add significantly more (also known as an “inventive concept”) to the exception. Rather, the additional elements disclosed above perform well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Therefore, independent claims 1, 9 and 10 are directed towards an abstract idea without a practical application or significantly more.
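The four steps recited in exemplary claim 1 (detect object information, cut out each object area, acquire questions according to the object identification information, and run a VQA process per object image) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the names `ObjectInfo`, `cut_out`, `run_pipeline`, the stub detector, the stub VQA function, and the sample question are hypothetical and are not drawn from the application or the cited references.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an image is a list of pixel rows.
Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


@dataclass
class ObjectInfo:
    object_id: str    # object identification information
    object_type: str  # e.g. "person"
    area: Box         # object area within the image


def cut_out(image: Image, area: Box) -> Image:
    """Generate an object image by cropping the object area from the image."""
    top, left, bottom, right = area
    return [row[left:right] for row in image[top:bottom]]


def run_pipeline(
    image: Image,
    detect: Callable[[Image], List[ObjectInfo]],
    questions_by_type: Dict[str, List[str]],
    vqa: Callable[[Image, str], str],
) -> List[Tuple[str, str, str]]:
    """Return (object_id, question, answer) for each detected object and question."""
    results = []
    for obj in detect(image):
        object_image = cut_out(image, obj.area)
        # Questions are acquired according to the detected object type.
        for question in questions_by_type.get(obj.object_type, []):
            results.append((obj.object_id, question, vqa(object_image, question)))
    return results


# Stubs standing in for a real detector and a real VQA model (assumptions).
def stub_detect(image: Image) -> List[ObjectInfo]:
    return [ObjectInfo("obj-1", "person", (0, 0, 2, 2))]


def stub_vqa(object_image: Image, question: str) -> str:
    return "yes" if sum(map(sum, object_image)) > 0 else "no"


image = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
questions = {"person": ["Is the person wearing a helmet?"]}
print(run_pipeline(image, stub_detect, questions, stub_vqa))
```

The sketch only fixes the order of operations; how detection and question answering are actually performed is left to the stubs.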
Regarding claim 2, the limitations further narrowing the object identification information is merely an extension of the abstract ideas recited in claim 1, and the limitations of acquiring a question from a storage unit are merely directed towards further abstract ideas, specifically mental process as the human mind can practically recall/acquire/imagine questions according to object type. As for the claimed “store at least one question for each object type indicating a type of object to be detected,” these additional elements, while not necessarily being abstract ideas, are insignificant extra solution activity since they are merely data gathering and data output (see MPEP §2106.05(g)). Moreover, these elements amount to receiving and outputting data in a computer based system and are well understood, routine, conventional activity. See MPEP 2106.05(d), subsection II. Further, the claimed “storage unit” and “acquisition unit” are not sufficient to recite a practical application of the abstract ideas recited in claim 2 as they amount to mere generic computer elements and thus amount to no more than a recitation of the words "apply it" (or an equivalent) or are no more than mere instructions to implement an abstract idea or other exception on a computer. see MPEP §2106.05(f). Nor are they sufficient to amount to significantly more than the judicial exception because when considered separately and in combination, the above recited additional elements in claim 2 do not add significantly more (also known as an “inventive concept”) to the exception. Rather, the additional elements disclosed above perform well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Regarding claim 3, the limitations of detecting at least one piece of object information including an object area, and the object identification information from the image are merely directed towards further abstract ideas, specifically mental process as the human mind can practically detect object information from an image.
Regarding claim 4, the limitations of receive a question sentence on the image from a user, identify the object type from the question sentence, and detect at least one piece of object information including an object area and the object identification information, the object area containing an object to be detected of the identified object type are merely directed towards further abstract ideas, specifically mental process as the human mind can practically receive questions from a person (in communicating with someone for example), and detect information about an object including its area and type.
Regarding claim 5, the limitations of transform the object area according to at least one of the object type and the question sentence, and generate at least one object image, by cutting out at least one object area or at least one object area transformed by the transformation unit, from the image are merely directed towards further abstract ideas, specifically mental process as the human mind can practically, at least with pen and paper physical aids, transform an object area according to the object type and the question sentence and generate (i.e. draw) at least one object image by cutting and transforming an object area. The transformation unit and cut-out unit are not sufficient to recite a practical application of the abstract ideas recited in claim 5 as they amount to mere generic computer elements and thus amount to no more than a recitation of the words "apply it" (or an equivalent) or are no more than mere instructions to implement an abstract idea or other exception on a computer. See MPEP §2106.05(f). Nor are they sufficient to amount to significantly more than the judicial exception because when considered separately and in combination, the above recited additional elements in claim 5 do not add significantly more (also known as an “inventive concept”) to the exception. Rather, the additional elements disclosed above perform well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Regarding claim 6, the limitations of assign question identification information to at least one question applied to each object image, and output VQA process result information in which the object identification information, the question identification information, and an answer to a question identified by the question identification information are associated with one another, are merely directed towards further abstract ideas, specifically mental process as the human mind can practically, at least with pen and paper physical aids, assign information to a question (i.e. draw a graph or chart) and output an answer. As previously discussed above regarding claim 1 from which claim 6 depends, the acquisition unit and the VQA processing unit are not a practical application or significantly more.
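The association recited in claim 6 (object identification information, question identification information, and an answer tied together in one result) can be illustrated with a small record type. This is a sketch under stated assumptions: the `VqaResult` record, the `Q1, Q2, ...` identifier scheme, and the `answer_fn` callback are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class VqaResult:
    object_id: str    # object identification information
    question_id: str  # question identification information
    answer: str       # answer to the identified question


def assign_question_ids(questions: List[str]) -> Dict[str, str]:
    """Assign a sequential identifier (Q1, Q2, ...) to each question (assumed scheme)."""
    return {f"Q{i}": text for i, text in enumerate(questions, start=1)}


def build_results(
    object_id: str,
    questions_by_id: Dict[str, str],
    answer_fn: Callable[[str], str],
) -> List[VqaResult]:
    """Associate each answer with its object and question identifiers."""
    return [
        VqaResult(object_id, question_id, answer_fn(text))
        for question_id, text in questions_by_id.items()
    ]
```

Keeping the identifiers in the result lets downstream code look up which question an answer belongs to without re-parsing the question text.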
Regarding claim 7, the limitations of display information in which the VQA process result information is assigned to an object to be detected identified by the object identification information included in the VQA process result information, are merely directed towards further abstract ideas, specifically mental process as the human mind can practically, at least with pen and paper physical aids, display information resulting from question and answering process, and assign to an object to be detected identified by the object identification information. As for the claimed “display control unit,” and “display device” these additional elements, are not sufficient to recite a practical application of the abstract ideas recited in claim 7 as they amount to mere generic computer elements and thus amount to no more than a recitation of the words "apply it" (or an equivalent) or are no more than mere instructions to implement an abstract idea or other exception on a computer. see MPEP §2106.05(f). Nor are they sufficient to amount to significantly more than the judicial exception because when considered separately and in combination, the above recited additional elements in claim 7 do not add significantly more (also known as an “inventive concept”) to the exception. Rather, the additional elements disclosed above perform well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Regarding claim 8, the limitations further narrowing the image as a frame included in a moving image is merely an extension of the abstract ideas recited in claim 1, and the limitations of “detect at least one piece of object information from the frame,” and “vote an answer to a question obtained by performing the VQA process on the object image generated for each frame, and determine an answer to the question for each object image, based on a result of vote,” are merely directed towards further abstract ideas, specifically mental process as the human mind can practically, at least with pen and paper physical aids, detect information from a frame, vote an answer and determine an answer based on a vote. As for the claimed “detection unit,” and “voting unit” these additional elements, are not sufficient to recite a practical application of the abstract ideas recited in claim 8 as they amount to mere generic computer elements and thus amount to no more than a recitation of the words "apply it" (or an equivalent) or are no more than mere instructions to implement an abstract idea or other exception on a computer. see MPEP §2106.05(f). Nor are they sufficient to amount to significantly more than the judicial exception because when considered separately and in combination, the above recited additional elements in claim 8 do not add significantly more (also known as an “inventive concept”) to the exception. Rather, the additional elements disclosed above perform well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
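The voting step recited in claim 8 can be sketched as a tally of per-frame answers for each object and question. The claim language quoted above says only that answers are voted and a final answer is determined from the result; the simple majority rule, the first-seen tie-break, and the function name `vote_answers` below are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def vote_answers(
    per_frame_answers: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], str]:
    """Pick the most frequent answer per (object_id, question) across frames.

    Each input item is (object_id, question, answer) obtained from one frame.
    Ties resolve to the answer that was seen first (assumed tie-break).
    """
    ballots: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for object_id, question, answer in per_frame_answers:
        ballots[(object_id, question)][answer] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in ballots.items()}
```

For example, three frames answering "yes", "no", "yes" for the same object and question would yield "yes" under this rule.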

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1–4, 6–7, 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Costabello et al., US Patent No. US 10,949,718 B2 (herein “Costabello”) in view of Pham et al., US Patent Application Publication No. US 2022/0129693 A1 (herein “Pham”).
Regarding claims 1, 9 and 10, with deficiencies of Costabello noted in square brackets [], and significant differences between the claims noted with curly brackets {}, Costabello teaches { an information processing device comprising – claim 1 / A computer program product comprising a non-transitory computer-readable medium including programmed instructions, the instructions causing a computer to function as: - claim 9 / An information processing method comprising: - claim 10} (Costabello col. 15, ll. 5–25, system 100 including hardware and software combinations, the hardware including a processor, where col. 3, ll. 4–6 and 21–26 teaches a system processing an input image and query and outputting a response (information processing)): 
{a detection unit configured to – claims 1 and 9 / by an information processing device – claim 10} detect at least one piece of object information including an object area containing an object to be detected (Costabello col. 3, ll. 31–34, and  col. 3, l. 47–col. 4, l. 2, computer vision techniques are applied to generate symbolic representations of information in the image such as the pixel locations in the image that identify a region of the input image that corresponds to an object shown in the image) and object identification information for identifying the object to be detected, from an image (Costabello col. 3, ll. 53–64, symbolic representations including content classification for detected content in the image which can include type or category of an object); 
{a cut-out unit configured to – claims 1 and 9 / by the information processing device – claim 10} generate at least one object image, by cutting out at least one object area from the image (Costabello col. 4, ll. 10–23, col. 6, ll. 15–37, and col. 12, ll. 7–28, image-processing framework, further described in fig. 2, encodes a portion of the image as a sub-symbolic feature vector/embedding, the portion being for example that of a cat (object) in the image, therefore “cutting” out of the image the cat object by generating the sub-symbolic embedding data for the pixels including the cat region of the image);
[{an acquisition unit configured to – claims 1 and 9 / by the information processing device – claim 10} acquire at least one question according to the object identification information]; and 
{a visual question answering (VQA) processing unit configured to – claims 1 and 9/ by the information processing device – claim 10} perform a {visual question answering – claim 10} VQA process with the at least one question, for each of the at least one object image (Costabello fig. 1, col. 5, l. 38–col. 6, l. 14, at inference time, an inference controller receives an inference query (question) and provides a natural language answer regarding the inquired object, for example an input query of “Which animal in this image is able to climb trees,” and the answer being provided as “The cat can climb the trees.”).
Costabello does not, but Pham teaches {an acquisition unit configured to – claims 1 and 9/ by the information processing device – claim 10} acquire at least one question according to the object identification information (Pham ¶¶39–40 and 42, question and answer acquisition unit extracts a set of questions (at least one question) based on an image feature, where ¶99 teaches the image feature calculated from a region of the image corresponding to an object (object identification information)).
Therefore, taking the teachings of Costabello and Pham together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the image processing of Costabello to include the question acquisition unit acquiring questions as disclosed in Pham at least because doing so would allow for answers to be provided that are sourced in official authoritative guides/manuals such as a work manual or safety manual, and thus provide more accurate answers. See Pham ¶52.
Regarding claim 2, with deficiencies of Costabello noted in square brackets [], Costabello teaches further comprising a storage unit configured to store [at least one question] (Costabello col. 4, ll. 46–62, fig. 1, multi-modal embedding model shown in fig. 1 as a database symbol, where col. 8, ll. 64–66 teach the multi-modal embeddings being stored in the multi-modal embedding model, and where col. 15, ll. 41–49 teaches that all of the system and its logic and data structures (of which the multi-modal embedding model would be understood to be a data structure) is stored on non-transitory computer readable storage media) for each object type indicating a type of object to be detected, wherein the object identification information includes information indicating an object type (Costabello col. 4, ll. 48–50, multi-modal embeddings include an aggregation of the symbolic embeddings and the sub-symbolic embeddings, where col. 3, ll. 50–60 teach the symbolic embeddings including a type of object detected in the content of the image), and [the acquisition unit is configured to acquire at least one question] according to the object type included in the object identification information, from the storage unit (Costabello col. 5, ll. 9–18, multi-modal embedding framework identifies (acquires) an embeddings result set having specific multi-modal embeddings based on an input query from the multi-modal embedding model).
Costabello does not but Pham teaches storage including at least one question and the acquisition unit is configured to acquire at least one question (Pham ¶¶39–40 and 42, fig. 3, question and answer acquisition unit extracts a set of questions (acquire at least one question) from a table storing the questions (storage) based on an image feature, where ¶99 teaches the image feature calculated from a region of the image corresponding to an object).
Therefore, taking the teachings of Costabello and Pham together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the image processing of Costabello to include the question acquisition unit acquiring questions and question storage as disclosed in Pham at least because doing so would allow for answers to be provided that are sourced in official authoritative guides/manuals such as a work manual or safety manual, and thus provide more accurate answers. See Pham ¶52.
Regarding claim 3, Costabello teaches wherein the detection unit is configured to detect at least one piece of object information including an object area (Costabello col. 3, ll. 31–34, and  col. 3, l. 47–col. 4, l. 2, computer vision techniques are applied to generate symbolic representations of information in the image such as the pixel locations in the image that identify a region of the input image that corresponds to an object shown in the image), and the object identification information, from the image, the object area containing an object to be detected of an object type read out from the storage unit (Costabello col. 5, ll. 9–18, col. 4, ll. 48–50, multi-modal embeddings include an aggregation of the symbolic embeddings and the sub-symbolic embeddings, where the multi-modal embedding framework identifies (acquires) an embeddings result set having specific multi-modal embeddings based on an input query from the multi-modal embedding model (storage unit), and where noted above the symbolic representations include regions (area) corresponding to detected object, and object types).
Regarding claim 4, Costabello teaches wherein the detection unit is configured to receive a question sentence on the image from a user, identify the object type from the question sentence, and detect at least one piece of object information including an object area and the object identification information, the object area containing an object to be detected of the identified object type (Costabello fig. 1, col. 5, ll. 9–18, and col. 5, l. 57–col. 6, l. 5, the multi-modal embedding framework 116 identifies an embeddings result set having specific multi-modal embeddings based on an input query received (question sentence on the image from the user) from the multi-modal embedding model (storage unit), and where noted above the symbolic representations include regions (area) corresponding to detected object, and object types).
Regarding claim 6, Costabello teaches wherein the acquisition unit is configured to assign question identification information to at least one question applied to each object image (Costabello col. 5, ll. 19–62, fig. 1, inference query assigned parameters of ?, has_skill, climb_trees to an input unstructured query that corresponds to an input image with objects), and the VQA processing unit is configured to output VQA process result information in which the object identification information, the question identification information, and an answer to a question identified by the question identification information are associated with one another (Costabello col. 5, l. 57–col. 6, l. 14, inference query parameters are input to the inference controller to generate an embedding query with the content classifications (associating object identification information with the question identification information) and identifying a specific multi-modal embedding that is a best replacement for the ? parameter of the inference query parameters, thus associating a multi-modal embedding that defines the embedding result as an inference response (answer to a question) with the embedding query and content classifications).
Regarding claim 7, Costabello teaches further comprising a display control unit configured to display, on a display device, display information in which the VQA process result information is assigned to an object to be detected identified by the object identification information included in the VQA process result information (Costabello col. 14, ll. 1–6, the system displays the natural language response on a display through a graphical user interface, where col. 6, ll. 5–14, teaches that the natural language response is the result of the conversion of the inference response generated as a result of processing the image and input query (question) regarding (assigned to) a detected object in the image (such as “cat” having a skill of tree climbing)).
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Costabello in view of Pham, further in view of Kodama, US Patent No. US 12,361,670 B2 (herein “Kodama”).
Regarding claim 5, Costabello as modified by Pham does not teach, but Kodama teaches further comprising a transformation unit configured to transform the object area according to at least one of the object type and the question sentence (Kodama col. 10, ll. 3–12, numbers 200–209 as functional blocks realized by processors (unit), where first variable magnification unit enlarges or reduces (transform) image data of a target region (object area), where col. 10, ll. 39–42 teaches that the target region is determined through object detection before the first magnification, the object detection also obtaining the category of the object (according to object type)), wherein the cut-out unit is configured to generate at least one object image, by cutting out at least one object area or at least one object area transformed by the transformation unit, from the image (Kodama col. 10, ll. 13–15, fig. 8, the image cutting-out unit cuts out the region of the target object from the region division map on which semantic segmentation is performed, after the first variable magnification unit enlarges or reduces).
Therefore, taking the teachings of Costabello as modified by Pham and Kodama together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the image processing of Costabello to include the enlarging/reducing and the cutting-out unit as disclosed in Kodama at least because doing so would allow for appropriately setting the size of processing stages to accommodate the size of the image, thereby reducing the need for increased calculation and memory size. See Kodama col. 10, ll. 55–63.
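For illustration only, the transformation and cut-out limitations discussed for claim 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation and not Kodama's: the image is modeled as a nested list of pixels, the object area as an `(x0, y0, x1, y1)` box, and the transformation as scaling the box about its center by a factor looked up from the object type. The names `SCALE_BY_TYPE`, `transform_area`, and `cut_out`, and the scale values, are hypothetical.

```python
# Hypothetical per-type scale factors; not taken from the application or the cited art.
SCALE_BY_TYPE = {"person": 1.5, "gauge": 1.0}


def transform_area(box, object_type, img_w, img_h):
    """Enlarge or reduce an object area about its center according to the object type.

    box is (x0, y0, x1, y1) in pixel coordinates; the result is clamped to the image.
    """
    x0, y0, x1, y1 = box
    scale = SCALE_BY_TYPE.get(object_type, 1.0)  # unknown types are left unchanged
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * scale / 2
    half_h = (y1 - y0) * scale / 2
    return (
        max(0, int(cx - half_w)),
        max(0, int(cy - half_h)),
        min(img_w, int(cx + half_w)),
        min(img_h, int(cy + half_h)),
    )


def cut_out(image, box):
    """Cut the (possibly transformed) object area out of the image as an object image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Usage: an 8x8 image whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(8)] for r in range(8)]
area = transform_area((2, 2, 6, 6), "person", img_w=8, img_h=8)
object_image = cut_out(image, area)
```

In this sketch the enlarged area gives the downstream VQA step more surrounding context for some object types, while other types are cut out at their detected size.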
Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Costabello in view of Pham, further in view of Bharadwaj et al., US Patent Application Publication No. US 2024/0242029 A1 (herein “Bharadwaj”).
Regarding claim 8, with deficiencies of Costabello noted in square brackets [], Costabello teaches wherein the image is a frame included in a moving image (Costabello col. 3, ll. 11–12, the input image as a video frame), and the detection unit is configured to detect at least one piece of object information from the frame (Costabello col. 6, ll. 15–37, image features are extracted from the image and content classifications for the detected objects are associated to pixel regions), [and the device further comprises a voting unit configured to vote an answer to a question obtained by performing the VQA process on the object image generated for each frame, and determine an answer to the question for each object image, based on a result of vote]. Costabello as modified by Pham does not teach, but Bharadwaj teaches and the device further comprises a voting unit configured to vote an answer to a question obtained by performing the VQA process on the object image generated for each frame, and determine an answer to the question for each object image, based on a result of vote (Bharadwaj ¶¶160–166, VQA process resulting in answers which are scored by a common sense scorer (CSS) module 602 that also applies a majority voting approach, where the majority vote wins to determine the best answer to a question).
Therefore, taking the teachings of Costabello as modified by Pham and Bharadwaj together as a whole, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the image processing of Costabello to include the voting on answers as disclosed in Bharadwaj at least because doing so would allow for more accurate and stable predictions in a VQA answer. See Bharadwaj ¶166 and Abstract.
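For illustration only, the majority-vote limitation discussed for claim 8 might be sketched as follows. This is a minimal sketch, assuming each frame-level VQA result is available as an `(object_id, question_id, answer)` tuple; the function name `vote_answers` and the tuple layout are hypothetical, and neither the application nor Bharadwaj is being quoted for this code.

```python
from collections import Counter, defaultdict


def vote_answers(per_frame_results):
    """Pick one answer per (object, question) pair by majority vote across frames.

    per_frame_results is an iterable of (object_id, question_id, answer) tuples,
    one tuple per VQA result obtained from a single frame.
    """
    ballots = defaultdict(Counter)
    for object_id, question_id, answer in per_frame_results:
        ballots[(object_id, question_id)][answer] += 1
    # most_common(1) returns the answer with the highest count; on a tie,
    # the answer that was seen first is kept.
    return {key: counts.most_common(1)[0][0] for key, counts in ballots.items()}


# Usage: three frames answer the same question about obj1, one frame about obj2.
results = [
    ("obj1", "q1", "yes"),
    ("obj1", "q1", "no"),
    ("obj1", "q1", "yes"),
    ("obj2", "q1", "no"),
]
final_answers = vote_answers(results)
```

Voting over several frames is one way a single misread frame can be outweighed by the others, which is consistent with the stability rationale the Office Action attributes to Bharadwaj.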

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Kuo et al., US Patent Application Publication No. US 2024/0289981 A1, directed towards processing object localization directed natural language questions for an image.
Lee et al., US Patent Application Publication No. US 2023/0214418 A1, directed towards visual question answering by analyzing video data, identifying objects and metadata thereof.
Ziaeefard et al., US Patent No. US 11,599,749 B1, directed towards visual question answering using a knowledge graph with pre-defined questions and answers.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Thursday, 09:00-17:00, Friday 09:00-13:00, EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MICHELLE M. KOETH
Primary Examiner
Art Unit 2671



/MICHELLE M KOETH/Primary Examiner, Art Unit 2671
Prosecution Timeline

Feb 27, 2024
Application Filed
Apr 13, 2026
Non-Final Rejection mailed — §101, §103, §112 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/242,213
Patent 12626495
MULTIMODAL EMBEDDINGS
2y 8m to grant Granted May 12, 2026
18/297,396
Patent 12586221
METHOD AND APPARATUS FOR ESTIMATING DEPTH INFORMATION OF IMAGES
2y 11m to grant Granted Mar 24, 2026
17/886,027
Patent 12579651
IMPEDED DIFFUSION FRACTION FOR QUANTITATIVE IMAGING DIAGNOSTIC ASSAY
3y 7m to grant Granted Mar 17, 2026
17/988,795
Patent 12567241
Method For Generating Training Data Used To Learn Machine Learning Model, System, And Non-Transitory Computer-Readable Storage Medium Storing Computer Program
3y 3m to grant Granted Mar 03, 2026
18/132,751
Patent 12567177
METHOD, ELECTRONIC DEVICE, AND COMPUTER PROGRAM PRODUCT FOR IMAGE PROCESSING
2y 10m to grant Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 77%
With Interview (+16.3%): 94%
Median Time to Grant: 2y 2m (~0m remaining)
PTA Risk: Low
Based on 434 resolved cases by this examiner. Grant probability derived from career allowance rate.