Prosecution Insights
Last updated: April 19, 2026
Application No. 18/479,108

METHOD OF GENERATING LANGUAGE FEATURE EXTRACTION MODEL, INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND PROGRAM

Non-Final OA: §102, §103, §112
Filed: Oct 01, 2023
Examiner: TRAN, DUY ANH
Art Unit: 2674
Tech Center: 2600 (Communications)
Assignee: Fujifilm Corporation
OA Round: 1 (Non-Final)
Grant Probability: 81% (Favorable)
OA Rounds: 1-2
To Grant: 3y 1m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 81% (104 granted / 128 resolved), +19.3% vs TC avg (above average)
Interview Lift: strong, +17.5% (allowance rate for resolved cases with an interview vs. without)
Avg Prosecution (typical timeline): 3y 1m; 29 applications currently pending
Total Applications (career history): 157, across all art units

Statute-Specific Performance

§101: 12.9% (-27.1% vs TC avg)
§103: 42.0% (+2.0% vs TC avg)
§102: 26.7% (-13.3% vs TC avg)
§112: 11.3% (-28.7% vs TC avg)
Tech Center averages are estimates. Based on career data from 128 resolved cases.
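These aggregates are simple ratios over the examiner's 128 resolved cases. The exact aggregation behind this report is not published, so the sketch below is only one plausible reconstruction of the headline numbers, assuming per-case records that carry a granted flag, an interview flag, and the statutes cited in at least one office action (all field names are illustrative, not the report's actual schema).

```python
# One plausible reconstruction of the dashboard metrics from per-case records.
# Field names (granted, had_interview, statutes) are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ResolvedCase:
    granted: bool
    had_interview: bool
    statutes: set = field(default_factory=set)  # e.g. {"102", "103", "112"}

def allow_rate(cases):
    """Share of resolved cases that ended in a grant."""
    return sum(c.granted for c in cases) / len(cases) if cases else 0.0

def interview_lift(cases):
    """Allowance-rate difference between cases with and without an interview."""
    with_iv = [c for c in cases if c.had_interview]
    without_iv = [c for c in cases if not c.had_interview]
    return allow_rate(with_iv) - allow_rate(without_iv)

def statute_share(cases, statute):
    """Share of resolved cases in which the given statute was ever cited."""
    return sum(statute in c.statutes for c in cases) / len(cases) if cases else 0.0

# Synthetic example sized like the report: 128 resolved cases, 104 granted.
cases = [ResolvedCase(granted=i < 104, had_interview=i % 3 == 0) for i in range(128)]
print(f"career allow rate: {allow_rate(cases):.1%}")   # -> 81.2%
print(f"interview lift:    {interview_lift(cases):+.1%}")
```

Under the same assumptions, each per-statute row would be statute_share(cases, "103") and so on, compared against the corresponding figure computed over the Tech Center's examiners.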

Office Action

§102 §103 §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. JP-2022161178, filed on 11/21/2023.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 10/01/2023 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Status

Claims 3 and 8-9 are objected to because of minor informalities. Claims 2-4, 6 and 13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph. Claims 1-3, 5, 6-14 and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Nishida et al. (WO-2021171732 A1; Nishida). Claims 4, 6 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Nishida et al. (WO-2021171732 A1; Nishida), in view of Ma (U.S. 20220300706 A1). Examiner Note: See PDF translation of WO-2021171732 provided by Examiner and/or U.S. 20230076576 A1.

Claim Objections

Claims 3 and 8-9 are objected to because of the following informalities: In claim 3, line 6, “the position information” should read “the first position information”. In claim 8, line 2, “the position information” should read “the first position information”. In claim 9, line 2, “the position information” should read “the first position information”. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 2-4, 6 and 13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as failing to set forth the subject matter which the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the applicant regards as the invention.

Claim 2 recites the limitation “a correct answer” in line 11. There is insufficient antecedent basis for this limitation in the claim. It is not clear to the Examiner whether this is the same as “a correct answer” in claim 1, line 13. Claim 3 is also rejected under 35 USC 112(b) because it is dependent on claim 2. Appropriate correction is required.

Claim 4 recites the limitation “a correct answer” in line 11. There is insufficient antecedent basis for this limitation in the claim. It is not clear to the Examiner whether this is the same as “a correct answer” in claim 1, line 13. Claim 6 is also rejected under 35 USC 112(b) because it is dependent on claim 4. Appropriate correction is required.

Claim 13 recites the limitation “a language feature amount” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. It is not clear to the Examiner whether this is the same as “a language feature amount” in claim 11, line 8. Appropriate correction is required.
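For orientation before the element-by-element §102 mapping that follows, the training flow recited in claim 1 (a first model extracts a feature from the first text, a second model takes the first image plus that feature and estimates the region of interest, and both models are trained so the estimate matches the region given by the first position information) can be sketched roughly as below. This is a minimal, hypothetical PyTorch-style illustration of the claim language only, not the applicant's or Nishida's actual implementation; all module names, architectures, and sizes are assumptions made for the sketch.

```python
# Illustrative sketch of the claim 1 training loop: a first model encodes the
# first text, a second model takes the first image plus that text feature and
# predicts the region of interest, and both models are trained so the predicted
# region matches the region indicated by the first position information.
# Architectures, sizes, and names are hypothetical.
import torch
import torch.nn as nn

class FirstModel(nn.Module):  # the language feature extraction model
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        _, h = self.encoder(self.embed(token_ids))
        return h.squeeze(0)  # first feature amount: one vector per text

class SecondModel(nn.Module):  # the region-of-interest estimator
    def __init__(self, dim=128):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
        )
        self.head = nn.Linear(2 * dim, 4)  # predicts (x1, y1, x2, y2)

    def forward(self, image, text_feature):
        fused = torch.cat([self.image_encoder(image), text_feature], dim=-1)
        return self.head(fused)

first_model, second_model = FirstModel(), SecondModel()
optimizer = torch.optim.Adam(
    [*first_model.parameters(), *second_model.parameters()], lr=1e-3)

# One synthetic step standing in for "a plurality of pieces of training data".
first_image = torch.rand(8, 3, 64, 64)          # first image
first_text = torch.randint(0, 1000, (8, 20))    # first text (token ids)
first_position = torch.rand(8, 4)               # first position information (ground-truth ROI)

text_feature = first_model(first_text)                   # first feature amount
estimated_roi = second_model(first_image, text_feature)  # estimated region of interest
loss = nn.functional.smooth_l1_loss(estimated_roi, first_position)
loss.backward()
optimizer.step()
```

In the mapping below, the examiner reads Nishida's text analysis unit 103 together with the language-with-visual-effect understanding unit 104 onto the first model, and the related feature region determination unit 108 onto the second model.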
Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention. Claim(s) 1-3, 5, 6-14 and 17-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Nishida et al (WO-2021171732 A1; Nishida). Regarding claim 1, Nishida discloses a method of generating a language feature extraction model that causes a computer to execute processing of extracting a feature from a text related to an image, (Paragraph 5: “a learning device, a text generation device, a learning method, a text generation method, and a program.”; Paragraph 13: “the question answering device 10 according to the present embodiment is not only the position and size of the text in the image, but also visual information such as a graph and a photograph included in the image (in other words, an aid to help understanding the text).” ) the method comprising: by a system including one or more processors, (Paragraph 220: “ Each functional unit included in the question answering device 10 is implemented, for example, through processing that the one or more programs stored in the memory device 206 causes the processor 205 to execute.”) with performing of machine learning using a plurality of pieces of training data including a first image, ((Figs. 1 and 10: Image including text) first position information related to a region of interest in the first image, (Figs. 10: set of correct feature regions); and a first text (Figs. 1 and 10: question text related to image) that describes the region of interest (Paragraph 17; Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) to input the first text (Figs. 1 and 10: question text related to image) into a first model (Figs. 
1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to cause the first model to output a first feature amount representing a feature of the first text, (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”) input the first image (Figs. 1 and 10: image including text) and the first feature amount (Figs. 1 and 10: Text analysis unit: output a sequence of tokens from input question text; Language with visual effect understand unit 104: coded sequence in which visual information is taken into consideration is obtained) into a second model (Figs. 1 and 10: related feature region determination unit 108) different from the first model to cause the second model to estimate the region of interest in the first image, (Paragraphs 125: “The related feature region determination unit 108 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, calculates a probability indicating whether or not a feature region extracted by the feature region extraction unit 101 is information necessary to answer a question.”; Paragraphs 154-155: “the language understanding unit 104 with visual effects converts the coded sequence H obtained in step S802 above into a vector sequence H' by the Transformer Encoder of the M layer (step S803). That is, the language understanding unit 104with visual effects sets H'= TransformerEncoder (H) … the related feature area determination unit 108 calculates a probability indicating whether or not the feature area is a region necessary for answer generation (step S804). That is, if the element of H' corresponding to the subword token x (however, the area token or the document token) in the input token series is h', the related feature area determination unit 108 corresponds to the subword token x. The probability that the characteristic area to be used is necessary for the correct answer is calculated”) and train the first model and the second model such that an estimated region of interest output from the second model matches the region of interest of a correct answer indicated by the first position information, (Paragraphs 125-126: “ the learning model parameters stored in the parameter storage unit 107 also include the learning model parameters of the neural network model that realizes the related feature region determination unit 108. 
The parameter learning unit 106 calculates a loss by using also the probability calculated by the related feature region determination unit 108 and the set of correct feature regions, and updates the model parameters being learned that are stored in the parameter storage unit 107.”; Paragraph 182: “the language understanding unit 104 with visual effects, the answer text generation unit 105, and the related feature area determination unit 108 use the learned model parameters stored in the parameter storage unit 107 … The related feature area determination unit 108 is calculated or determined from the probability indicating whether or not the feature area extracted by the feature area extraction unit 101 is information necessary for answering a question ( Related feature area score) may be output.) generating the first model (Figs. 1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104), which is the language feature extraction model. (Paragraph 22-25: “the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration … The parameter storage unit 107 stores the model parameters being learned (that is, model parameters to be learned) of the neural network models that implement the language-with-visual-effect understanding unit 104 and the answer text generation unit 105.”) Regarding claim 2, Nishida discloses comprising: by the system, using a third model (Fig. 10: answer text generation unit) that receives inputs of an image feature amount extracted from the image and a language feature amount extracted from the text and outputs a degree of association between the two feature amounts; (Paragraphs 23-24: “The answer text generation unit 105 is realized by a neural network, and the generation probability of the answer text is calculated from the coded sequence obtained by the language understanding unit 104 with visual effects by using the learning model parameters stored in the parameter storage unit 107. Calculate the probability distribution to represent. …the parameter learning unit 106 updates the learning model parameters stored in the parameter storage unit 107 by using the loss between the answer text generated by the answer text generation unit 105 and the input correct answer text.”, it shows that the answer text generated from image tokens read as “second feature amount” and sequences of text token read as “first feature amount”) in the machine learning, inputting of a second feature amount extracted from the first image (Paragraph 39: “FIG. 4 shows an example of the extraction of feature regions by the feature region extraction unit 101. In the example shown in FIG. 4, a case is illustrated in which five feature regions including a feature region 1100, a feature region 1200, a feature region 1300, a feature region 1400, and a feature region 1500 are extracted from an image 1000 including text.”) and the first feature amount into the third model (Fig. 
10: answer text generation unit) to cause the third model to estimate a degree of association between the first image and the first text; (Paragraph 23: “The answer text generation unit 105 is realized by a neural network, and the generation probability of the answer text is calculated from the coded sequence obtained by the language understanding unit 104 with visual effects by using the learning model parameters stored in the parameter storage unit 107. Calculate the probability distribution to represent.”, it shows that the answer text generated from image tokens read as “second feature amount” and sequences of text token read as “first feature amount”), and training of the first model and the third model such that an estimated degree of association output from the third model matches a degree of association of a correct answer.(Paragraph 24: “the parameter learning unit 106 updates the learning model parameters stored in the parameter storage unit 107 by using the loss between the answer text generated by the answer text generation unit 105 and the input correct answer text.”) Regarding claim 3, Nishida discloses comprising by the system, using a fourth model (Figs. 1 and 10: feature region extraction unit 101) that extracts the second feature amount from the input first image, (Paragraph 39: “in FIG. 4, a case is illustrated in which five feature regions including a feature region 1100, a feature region 1200, a feature region 1300, a feature region 1400, and a feature region 1500 are extracted from an image 1000 including text.”) in the machine learning, inputting of the first image and the position information into the fourth model to cause the fourth model to output the second feature amount, (Paragraph 39: “in FIG. 4, a case is illustrated in which five feature regions including a feature region 1100, a feature region 1200, a feature region 1300, a feature region 1400, and a feature region 1500 are extracted from an image 1000 including text.”; Paragraphs 132-133: “the feature region extraction unit 101 extracts K feature regions from an image included in the read training data (step S702). … for the location information, any information may be used as long as the information can specify a location of the feature region, …these nine types of regions are examples, and other region types may be set… at least two types are set, that is, an area type indicating that the feature area does not contain text and an area type indicating that the feature area contains text”) and training of the first model, the third model, and the fourth model such that the estimated degree of association output from the third model matches the degree of association of the correct answer. (Paragraphs 23-24: “The answer text generation unit 105 is realized by a neural network, and the generation probability of the answer text is calculated from the coded sequence obtained by the language understanding unit 104 with visual effects by using the learning model parameters stored in the parameter storage unit 107. Calculate the probability distribution to represent. 
…the parameter learning unit 106 updates the learning model parameters stored in the parameter storage unit 107 by using the loss between the answer text generated by the answer text generation unit 105 and the input correct answer text.”, it shows that the answer text generated from image tokens read as “second feature amount” and sequences of text token read as “first feature amount”) Regarding claim 5, Nishida discloses the text and the first text are structured texts. (Paragraph 210: “a model that generates an answer to a question sentence by inputting a question sentence, a feature area, and a token of an OCR token”; Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) Regarding claim 7, Nishida discloses comprising by the system, performing of processing of displaying the region of interest estimated by the second model. (Paragraphs 125-126: “ the learning model parameters stored in the parameter storage unit 107 also include the learning model parameters of the neural network model that realizes the related feature region determination unit 108. The parameter learning unit 106 calculates a loss by using also the probability calculated by the related feature region determination unit 108 and the set of correct feature regions, and updates the model parameters being learned that are stored in the parameter storage unit 107.”; Paragraphs 216-217: “As shown in FIG. 17, the question-and-answer device 10 according to the embodiment is realized by a general computer or a computer system, and includes an input device 201, a display device 202 … The question answering device 10 can read or write the recording medium 203a via the external I / F 203. … one or more programs that realize the related feature area determination unit 108) may be stored.”) Regarding claim 8, Nishida discloses the position information includes coordinate information that specifies a position of the region of interest in the first image. (Paragraph 37: “the feature region extraction unit 101 extracts K feature regions from the image included in the read training data (step S202). The feature area is an area based on visual features, and is represented by a rectangular area in the present embodiment. The k-th feature area includes position information (7 dimensions in total) including upper left coordinates, lower right coordinates, width, height, and area, a rectangular image representation (D dimension), and an area type (C type).”) Regarding claim 9, Nishida discloses the first image is a cropped image including the position information. (Paragraph 37: “the feature region extraction unit 101 extracts K feature regions from the image included in the read training data (step S202). The feature area is an area based on visual features, and is represented by a rectangular area in the present embodiment. 
The k-th feature area includes position information (7 dimensions in total) including upper left coordinates, lower right coordinates, width, height, and area, a rectangular image representation (D dimension), and an area type (C type).”) Regarding claim 10, Nishida discloses an information processing apparatus (Paragraph 5: “a learning device, a text generation device, a learning method, a text generation method, and a program.”) comprising: one or more storage devices that store a program including the language feature extraction model generated by the method of generating a language feature extraction model according to claim 1; and one or more processors that execute the program. (Paragraph 220: “ Each functional unit included in the question answering device 10 is implemented, for example, through processing that the one or more programs stored in the memory device 206 causes the processor 205 to execute.”) Regarding claim 11, Nishida discloses an information processing apparatus (Paragraph 5: “a learning device, a text generation device, a learning method, a text generation method, and a program.” comprising: one or more processors; and one or more storage devices that store a command executed by the one or more processors, wherein the one or more processors (Paragraph 220: “ Each functional unit included in the question answering device 10 is implemented, for example, through processing that the one or more programs stored in the memory device 206 causes the processor 205 to execute.”) are configured to: acquire a text that describes a region of interest in an image; (Paragraph 17: “set of training data (training data set) including an image including text, a question text related to this image, and a correct answer text indicating a correct answer to this question text is input to the question answering device 10 at the time of learning”) and execute processing of inputting the text into a first model (Figs. 1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to cause the first model to output a language feature amount representing a feature of the text, (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”) and the first model is a model obtained by performing machine learning using a plurality of pieces of training data including a first image (Figs. 1 and 10: Image including text) for training, first position information related to a region of interest in the first image, (Figs. 10: set of correct feature regions), and a first text that describes the region of interest (Figs. 1 and 10: question text related to image) ; (Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. 
The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) to input the first text (Figs. 1 and 10: question text related to image) into the first model (Figs. 1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to cause the first model to output a first feature amount representing a feature of the first text (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”)and inputting of the first image (Figs. 1 and 10: image including text) and the first feature amount (Figs. 1 and 10: Text analysis unit output a sequence of tokens from input question text; Language with visual effect understand unit 104: coded sequence in which visual information is taken into consideration is obtained) into a second model (Fig. 10: The related feature region determination unit 108) different from the first model to cause the second model to estimate the region of interest in the first image, (Paragraphs 125: “The related feature region determination unit 108 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, calculates a probability indicating whether or not a feature region extracted by the feature region extraction unit 101 is information necessary to answer a question.”; Paragraphs 154-155: “the language understanding unit 104 with visual effects converts the coded sequence H obtained in step S802 above into a vector sequence H'by the Transformer Encoder of the M layer (step S803). That is, the language understanding unit 104with visual effects sets H'= TransformerEncoder (H) … the related feature area determination unit 108 calculates a probability indicating whether or not the feature area is a region necessary for answer generation (step S804). That is, if the element of H'corresponding to the subword token x (however, the area token or the document token) in the input token series is h', the related feature area determination unit 108 corresponds to the subword token x. The probability that the characteristic area to be used is necessary for the correct answer is calculated”) and train the first model and the second model such that an estimated region of interest output from the second model matches the region of interest of a correct answer indicated by the first position information. (Paragraphs 125-126: “ the learning model parameters stored in the parameter storage unit 107 also include the learning model parameters of the neural network model that realizes the related feature region determination unit 108. 
The parameter learning unit 106 calculates a loss by using also the probability calculated by the related feature region determination unit 108 and the set of correct feature regions, and updates the model parameters being learned that are stored in the parameter storage unit 107.”; Paragraph 182: “the language understanding unit 104 with visual effects, the answer text generation unit 105, and the related feature area determination unit 108 use the learned model parameters stored in the parameter storage unit 107 … The related feature area determination unit 108 is calculated or determined from the probability indicating whether or not the feature area extracted by the feature area extraction unit 101 is information necessary for answering a question ( Related feature area score) may be output.) Regarding claim 12, Nishida discloses the one or more processors are configured to: input an image feature amount extracted from a second image (Paragraph 39: “in FIG. 4, a case is illustrated in which five feature regions including a feature region 1100, a feature region 1200, a feature region 1300, a feature region 1400, and a feature region 1500 are extracted from an image 1000 including text.”; Paragraph 143: “it is assumed that the token corresponding to the area type with the k-th feature region. For example, if the area type of the kth feature area is "image"”), and a language feature amount extracted from the text (The text analysis unit 103 divides the input question text into token series) into a third model to cause the third model to output a degree of association between the second image and the text. (Paragraphs 21-23: “The text analysis unit 103 divides the text output by the text recognition unit 102 and the input question text into token series, respectively. … The answer text generation unit 105 is realized by a neural network, and the generation probability of the answer text is calculated from the coded sequence obtained by the language understanding unit 104 with visual effects by using the learning model parameters stored in the parameter storage unit 107. Calculate the probability distribution to represent.”, it shows that the answer text generated from image tokens which is k-regions read as “feature amount extract from a second image” and sequences of text token read as “ a language feature amount extract from text”) Regarding claim 13, Nishida discloses the one or more processors are configured to: input an image feature amount extracted from a second image (Paragraph 39: “in FIG. 4, a case is illustrated in which five feature regions including a feature region 1100, a feature region 1200, a feature region 1300, a feature region 1400, and a feature region 1500 are extracted from an image 1000 including text.”; Paragraph 143: “it is assumed that the token corresponding to the area type with the k-th feature region. For example, if the area type of the kth feature area is "image"”), and a language feature amount extracted from the text (The text analysis unit 103 divides the input question text into token series) into a third model to cause the third model to output a degree of association between the second image and the text. (Paragraphs 21-23: “The text analysis unit 103 divides the text output by the text recognition unit 102 and the input question text into token series, respectively. 
… The answer text generation unit 105 is realized by a neural network, and the generation probability of the answer text is calculated from the coded sequence obtained by the language understanding unit 104 with visual effects by using the learning model parameters stored in the parameter storage unit 107. Calculate the probability distribution to represent.”, it shows that the answer text generated from image tokens which is k-regions read as “feature amount extract from a second image” and sequences of text token read as “ a language feature amount extract from text”) Regarding claim 14, Nishida discloses the one or more processors are configured to: acquire the second image and second position information related to a region of interest in the second image; (Paragraph 39: “in FIG. 4, a case is illustrated in which five feature regions including a feature region 1100, a feature region 1200, a feature region 1300, a feature region 1400, and a feature region 1500 are extracted from an image 1000 including text.”; Paragraph 143: “it is assumed that the token corresponding to the area type with the k-th feature region. For example, if the area type of the kth feature area is "image"”), and input the second image and the second position information into a fourth model (Figs.1 and 10: the feature region extraction unit 101) to cause the fourth model to output the image feature amount. (Paragraph 136; Paragraph 37: “the feature region extraction unit 101 extracts K feature regions from the image included in the read training data (step S202). The feature area is an area based on visual features, and is represented by a rectangular area. The k-th feature area includes position information (7 dimensions in total) including upper left coordinates, lower right coordinates, width, height, and area, a rectangular image representation (D dimension), and an area type (C type). It shall be represented as an image token i .sup.k with. However, as the position information, any information may be used as long as the position of the feature area can be specified (for example, at least one information of width, height and area may not be used, and the upper left coordinates and the lower right may be used. Instead of the coordinates, the upper right and lower left coordinates may be used, or the center coordinates may be used).”, it shows that the feature extraction unit extract K feature regions read as “plurality of image or second image” and include position information such as upper and/or lower left/right coordinate is interpreted as “plurality position information and/or second position information);. Regarding claim 17, Nishida discloses the text and the first text are structured texts. (Paragraph 210: “a model that generates an answer to a question sentence by inputting a question sentence, a feature area, and a token of an OCR token”; Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) Regarding claim 18, Nishida discloses the text and the first text are structured texts. 
(Paragraph 210: “a model that generates an answer to a question sentence by inputting a question sentence, a feature area, and a token of an OCR token”; Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) Regarding claim 19, Nishida discloses An information processing method (Paragraph 5: “a learning device, a text generation device, a learning method, a text generation method, and a program.” ) comprising: by one or more processors, (Paragraph 220: “ Each functional unit included in the question answering device 10 is implemented, for example, through processing that the one or more programs stored in the memory device 206 causes the processor 205 to execute.”) acquiring a text that describes a region of interest in an image; (Paragraph 17: “set of training data (training data set) including an image including text, a question text related to this image, and a correct answer text indicating a correct answer to this question text is input to the question answering device 10 at the time of learning”) and executing processing of inputting the text into a first model (Figs. 1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to cause the first model to output a language feature amount representing a feature of the text, (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”) wherein the first model is a model obtained by performing machine learning using training data including a first image for training (Figs. 1 and 10: Image including text), a first text that describes a region of interest in the first image (Figs. 1 and 10: question text related to image), and first position information related to the region of interest in the first image (Figs. 10: set of correct feature regions) ; (Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) to input the first text (Figs. 1 and 10: question text related to image) into the first model to cause the first model (Figs. 
1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to output a first feature amount representing a feature of the first text (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”) and inputting of the first image (Figs. 1 and 10: image including text) and the first feature amount (Figs. 1 and 10: Text analysis unit output a sequence of tokens from input question text; Language with visual effect understand unit 104: coded sequence in which visual information is taken into consideration is obtained) into a second model (Fig. 10: The related feature region determination unit 108) different from the first model to cause the second model to estimate the region of interest in the first image, (Paragraphs 125: “The related feature region determination unit 108 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, calculates a probability indicating whether or not a feature region extracted by the feature region extraction unit 101 is information necessary to answer a question.”; Paragraphs 154-155: “the language understanding unit 104 with visual effects converts the coded sequence H obtained in step S802 above into a vector sequence H'by the Transformer Encoder of the M layer (step S803). That is, the language understanding unit 104with visual effects sets H'= TransformerEncoder (H) … the related feature area determination unit 108 calculates a probability indicating whether or not the feature area is a region necessary for answer generation (step S804). That is, if the element of H'corresponding to the subword token x (however, the area token or the document token) in the input token series is h', the related feature area determination unit 108 corresponds to the subword token x. The probability that the characteristic area to be used is necessary for the correct answer is calculated”) and train the first model and the second model such that the region of interest estimated by the second model matches the region of interest indicated by the first position information. (Paragraphs 125-126: “ the learning model parameters stored in the parameter storage unit 107 also include the learning model parameters of the neural network model that realizes the related feature region determination unit 108. 
The parameter learning unit 106 calculates a loss by using also the probability calculated by the related feature region determination unit 108 and the set of correct feature regions, and updates the model parameters being learned that are stored in the parameter storage unit 107.”; Paragraph 182: “the language understanding unit 104 with visual effects, the answer text generation unit 105, and the related feature area determination unit 108 use the learned model parameters stored in the parameter storage unit 107 … The related feature area determination unit 108 is calculated or determined from the probability indicating whether or not the feature area extracted by the feature area extraction unit 101 is information necessary for answering a question ( Related feature area score) may be output.) Regarding claim 20, Nishida discloses a non-transitory, computer-readable tangible recording medium which records thereon a program that causes a computer to realize a function Paragraph 220: “ Each functional unit included in the question answering device 10 is implemented, for example, through processing that the one or more programs stored in the memory device 206 causes the processor 205 to execute.”) of extracting a feature from a text related to an image, (Paragraph 5: “a learning device, a text generation device, a learning method, a text generation method, and a program.”) the program causing the computer to realize: a function of acquiring a text that describes a region of interest in the image; (Paragraph 17: “set of training data (training data set) including an image including text, a question text related to this image, and a correct answer text indicating a correct answer to this question text is input to the question answering device 10 at the time of learning”) and a function of inputting the text into a first model (Figs. 1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to cause the first model to output a language feature amount representing a feature of the text, (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”) wherein the first model is a model obtained by performing machine learning using training data including a first image for training (Figs. 1 and 10: Image including text), first position information related to a region of interest in the first image (Figs. 10: set of correct feature regions), and a first text that describes the region of interest in the first image (Figs. 1 and 10: question text related to image); (Paragraph 121: “It is assumed that training data input into a question answering device 10 in the learning time includes a set of correct feature regions, in addition to an image including text, a question text, and a correct answer. The set of correct feature regions is a set of feature regions necessary to obtain the correct answer, among feature regions extracted from the image.”) to input the first text (Figs. 
1 and 10: question text related to image) into the first model to cause the first model (Figs. 1 and 10: text analysis unit 103 and Language with visual effect understanding unit 104) to output a first feature amount representing a feature of the first text (Paragraphs 22-23: “The text analysis unit 103 divides each of the text output from the text recognition unit 102 and an input question text into a sequence of tokens … the language-with-visual-effect understanding unit 104 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, encodes sequences of tokens obtained by the text analysis unit 103. Thus, an encoded sequence can be obtained that takes visual information into consideration. In other words, language understanding can be achieved that also takes a visual effect in the image into consideration.”) and inputting of the first image (Figs. 1 and 10: image including text) and the first feature amount (Figs. 1 and 10: Text analysis unit output a sequence of tokens from input question text; Language with visual effect understand unit 104: coded sequence in which visual information is taken into consideration is obtained) into a second model (Fig. 10: The related feature region determination unit 108) different from the first model to cause the second model to estimate the region of interest in the first image, (Paragraphs 125: “The related feature region determination unit 108 is implemented by a neural network and, by using model parameters being learned that are stored in the parameter storage unit 107, calculates a probability indicating whether or not a feature region extracted by the feature region extraction unit 101 is information necessary to answer a question.”; Paragraphs 154-155: “the language understanding unit 104 with visual effects converts the coded sequence H obtained in step S802 above into a vector sequence H'by the Transformer Encoder of the M layer (step S803). That is, the language understanding unit 104with visual effects sets H'= TransformerEncoder (H) … the related feature area determination unit 108 calculates a probability indicating whether or not the feature area is a region necessary for answer generation (step S804). That is, if the element of H'corresponding to the subword token x (however, the area token or the document token) in the input token series is h', the related feature area determination unit 108 corresponds to the subword token x. The probability that the characteristic area to be used is necessary for the correct answer is calculated”) and train the first model and the second model such that an estimated region of interest output from the second model matches the region of interest indicated by the first position information. (Paragraphs 125-126: “ the learning model parameters stored in the parameter storage unit 107 also include the learning model parameters of the neural network model that realizes the related feature region determination unit 108. 
The parameter learning unit 106 calculates a loss by using also the probability calculated by the related feature region determination unit 108 and the set of correct feature regions, and updates the model parameters being learned that are stored in the parameter storage unit 107.”; Paragraph 182: “the language understanding unit 104 with visual effects, the answer text generation unit 105, and the related feature area determination unit 108 use the learned model parameters stored in the parameter storage unit 107 … The related feature area determination unit 108 is calculated or determined from the probability indicating whether or not the feature area extracted by the feature area extraction unit 101 is information necessary for answering a question (Related feature area score) may be output.”)

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 4, 6 and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Nishida et al. (WO-2021171732 A1; Nishida), in view of Ma (U.S. 20220300706 A1).

Regarding claim 4, Nishida discloses all of the claimed invention except: comprising by the system, using a fifth model that receives an input of a language feature amount extracted from each of a plurality of texts and outputs a degree of association between the plurality of the texts, in the machine learning, inputting of a third feature amount, which is extracted, by the first model from a second text different from the first text by inputting the second text into the first model, and the first feature amount into the fifth model to cause the fifth model to estimate a degree of association between the first text and the second text, and training of the first model and the fifth model such that an estimated degree of association output from the fifth model matches a degree of association of a correct answer.
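The "fifth model" added by claim 4, in rough sketch form, takes the language feature amounts of two texts (both produced by the first model) and scores their association against a correct-answer degree of association. The snippet below is purely illustrative of that claim element under assumed shapes and names; it is not the applicant's or either reference's implementation.

```python
# Rough illustration of the claim 4 "fifth model": given the language feature
# amounts of two texts, output a degree of association, trained against a
# correct-answer degree of association. All names and shapes are hypothetical.
import torch
import torch.nn as nn

class FifthModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, first_feature, third_feature):
        return torch.sigmoid(self.score(torch.cat([first_feature, third_feature], dim=-1)))

fifth_model = FifthModel()
first_feature = torch.rand(8, 128)   # from the first text, via the first model
third_feature = torch.rand(8, 128)   # from a second text, via the same first model
correct_association = torch.randint(0, 2, (8, 1)).float()  # correct-answer degree of association
loss = nn.functional.binary_cross_entropy(
    fifth_model(first_feature, third_feature), correct_association)
loss.backward()  # in joint training, gradients would also flow into the first model
```

The examiner finds this element absent from Nishida and turns to Ma for it, as set out next.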
Ma discloses comprising by the system, using a fifth model (Fig.1: PS model training unit 102) that receives an input of a language feature amount extracted from each of a plurality of texts and outputs a degree of association between the plurality of the texts, (Paragraph 30: “The input text is a text sentence described in a natural language. For example, the training dataset creation unit 101 may create input text by randomly extracting a passage having a length equal to or longer than a predetermined length from a known unlabeled corpus (for example, document database or long text sentence).” ; Paragraph 55: “he PS model training unit 102 uses the first training dataset to train the PS model. Input text of the first training dataset may be referred to as first input data. Furthermore, an application result of the first training dataset may be referred to as second input data.”) in the machine learning, inputting of a third feature amount (Fig.3: second input data), which is extracted, by the first model (Fig.1: training dataset creation unit 101) from a second text different from the first text by inputting the second text into the first model, (Paragraph 30: “The input text is a text sentence described in a natural language. For example, the training dataset creation unit 101 may create input text by randomly extracting a passage having a length equal to or longer than a predetermined length from a known unlabeled corpus (for example, document database or long text sentence).” and the first feature amount (Fig. 3: first input data) into the fifth model to cause the fifth model to estimate a degree of association between the first text and the second text, (Paragraphs 56-58: “he PS model training unit 102 performs training (machine learning) on the PS model by using the input text (first input data) and the execution result (second input data) of the first training dataset as training data …. The PS model training unit 102 may optimize parameters by updating a parameter of tensor decomposition and a parameter of a neural network in a direction for decreasing a loss function that defines an error between the inference result of the machine learning model, with respect to the training data, and correct answer data, for example, using a gradient descent method.”; Paragraph 67: “an example is given in which the PS model is trained using a training dataset that combines the above input text “ . . . Lassen county had a population of 34,895. The racial makeup of Lassen county was 25,532 (73.2%) white (U.S. census), 2,834 (8.1%) African American (U.S. census) . . . ”, the application instruction sequence “DIFF (9, SUM (10, 12))”, and the application result “6529”) and training of the first model and the fifth model such that an estimated degree of association output from the fifth model matches a degree of association of a correct answer. (Paragraphs 57-58: “PS model training unit 102 performs reinforcement learning (reinforcement training) so that the instruction sequence estimated by the PS model approaches the application instruction sequence (correct answer data). 
… The PS model training unit 102 may optimize parameters by updating a parameter of tensor decomposition and a parameter of a neural network in a direction for decreasing a loss function that defines an error between the inference result of the machine learning model, with respect to the training data, and correct answer data, for example, using a gradient descent method.”; Paragraph 69: “The PS model training unit 102 performs training based on a similarity between instruction sequences on the basis of the output “SUM (9, DIFF (11, 12))” and the correct answer data “DIFF (9, SUM (10, 12))” of the PS model.”)

Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Nishida by including a program synthesis (PS) model training unit that is taught by Ma, to make the invention that generates a natural language processing (NLP) model, which is a machine learning model that executes processing on a document written in a natural language; thus, one of ordinary skill in the art would have been motivated to combine the references since this will improve a natural language processing model using a neural network as well as enhance the program output from the NLP model. Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.

Regarding claim 6, Nishida, as modified by Ma, discloses all of the claimed invention. Ma further discloses the second text is a structured text. (Fig. 8 and Paragraph 146: “example has been given in which an English sentence is used as the text sentence. However, the embodiment is not limited to this, and may be applied to languages other than English, and may be variously modified and implemented.”)

Regarding claim 15, Nishida discloses all of the claimed invention except that the one or more processors are configured to: input a language feature amount extracted from each of a plurality of texts by the first model into a fifth model to cause the fifth model to output a degree of association between the plurality of the texts. Ma discloses input of a language feature amount extracted from each of a plurality of texts by the first model (Fig.1: training dataset creation unit 101) (Paragraph 30: “The input text is a text sentence described in a natural language. For example, the training dataset creation unit 101 may create input text by randomly extracting a passage having a length equal to or longer than a predetermined length from a known unlabeled corpus (for example, document database or long text sentence).”) into a fifth model (Fig. 1 and 3: PS model training unit 102) to cause the fifth model to output a degree of association between the plurality of the texts. (Paragraph 55: “the PS model training unit 102 uses the first training dataset to train the PS model. Input text of the first training dataset may be referred to as first input data. Furthermore, an application result of the first training dataset may be referred to as second input data.”; Paragraph 67: “an example is given in which the PS model is trained using a training dataset that combines the above input text “ . . . Lassen county had a population of 34,895. The racial makeup of Lassen county was 25,532 (73.2%) white (U.S. census), 2,834 (8.1%) African American (U.S. census) . . .
”, the application instruction sequence “DIFF (9, SUM (10, 12))”, and the application result “6529”)

Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Nishida by including a program synthesis (PS) model training unit that is taught by Ma, to make the invention that generates a natural language processing (NLP) model, which is a machine learning model that executes processing on a document written in a natural language; thus, one of ordinary skill in the art would have been motivated to combine the references since this will improve a natural language processing model using a neural network as well as enhance the program output from the NLP model. Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.

Regarding claim 16, Nishida discloses all of the claimed invention except that the one or more processors are configured to: input a language feature amount extracted from each of a plurality of texts by the first model into a fifth model to cause the fifth model to output a degree of association between the plurality of the texts. Ma discloses input of a language feature amount extracted from each of a plurality of texts by the first model (Fig.1: training dataset creation unit 101) (Paragraph 30: “The input text is a text sentence described in a natural language. For example, the training dataset creation unit 101 may create input text by randomly extracting a passage having a length equal to or longer than a predetermined length from a known unlabeled corpus (for example, document database or long text sentence).”) into a fifth model (Fig. 1 and 3: PS model training unit 102) to cause the fifth model to output a degree of association between the plurality of the texts. (Paragraph 55: “the PS model training unit 102 uses the first training dataset to train the PS model. Input text of the first training dataset may be referred to as first input data. Furthermore, an application result of the first training dataset may be referred to as second input data.”; Paragraph 67: “an example is given in which the PS model is trained using a training dataset that combines the above input text “ . . . Lassen county had a population of 34,895. The racial makeup of Lassen county was 25,532 (73.2%) white (U.S. census), 2,834 (8.1%) African American (U.S. census) . . . ”, the application instruction sequence “DIFF (9, SUM (10, 12))”, and the application result “6529”)

Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Nishida by including a program synthesis (PS) model training unit that is taught by Ma, to make the invention that generates a natural language processing (NLP) model, which is a machine learning model that executes processing on a document written in a natural language; thus, one of ordinary skill in the art would have been motivated to combine the references since this will improve a natural language processing model using a neural network as well as enhance the program output from the NLP model. Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Izuka et al (U.S. 20100189366 A1), “Diagnosis Support Apparatus and Control Method Therefor”, teaches a diagnosis support apparatus that supports the creation of findings in image interpretation or image diagnosis. The apparatus acquires image feature information of a target area designated on an image to be interpreted, searches the storage unit for image feature information similar to the acquired image feature information, acquires the finding sentence stored in correspondence with the retrieved image feature information from the storage unit, and creates a finding sentence concerning interpretation of the designated target area by changing a description of the acquired finding sentence based on the image feature information of the designated target area.

Lv et al (U.S. 20210406619 A1), “Method and Apparatus for Visual Question Answering, Computer Device and Medium”, teaches a method for visual question answering comprising: acquiring an input image and an input question; detecting visual information and position information of each of at least one text region in the input image; determining semantic information and attribute information of each of the at least one text region based on the visual information and the position information; determining a global feature of the input image based on the visual information, the position information, the semantic information, and the attribute information; determining a question feature based on the input question; and generating a predicted answer for the input image and the input question based on the global feature and the question feature.

Lubbers et al (U.S. 20180329892 A1), “Captioning a Region of an Image”, teaches a method that comprises providing a dataset of triplets, each triplet including a respective image, a respective region of the respective image, and a respective caption of the respective region. The method also comprises learning, with the dataset of triplets, a function that is configured to generate an output caption based on an input image and on an input region of the input image.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Duy A Tran, whose telephone number is (571) 272-4887. The examiner can normally be reached Monday-Friday, 8:00 am - 5:00 pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, ONEAL R MISTRY, can be reached at (313) 446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/DUY TRAN/
Examiner, Art Unit 2674

/ONEAL R MISTRY/
Supervisory Patent Examiner, Art Unit 2674

Prosecution Timeline

Oct 01, 2023
Application Filed
Jan 29, 2026
Non-Final Rejection — §102, §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12573024
IMAGE AUGMENTATION FOR MACHINE LEARNING BASED DEFECT EXAMINATION
2y 5m to grant Granted Mar 10, 2026
Patent 12561934
AUTOMATIC ORIENTATION CORRECTION FOR CAPTURED IMAGES
2y 5m to grant Granted Feb 24, 2026
Patent 12548284
METHOD FOR ANALYZING ONE OR MORE ELEMENT(S) OF ONE OR MORE PHOTOGRAPHED OBJECT(S) IN ORDER TO DETECT ONE OR MORE MODIFICATION(S), AND ASSOCIATED ANALYSIS DEVICE
2y 5m to grant Granted Feb 10, 2026
Patent 12530798
LEARNED FORENSIC SOURCE SYSTEM FOR IDENTIFICATION OF IMAGE CAPTURE DEVICE MODELS AND FORENSIC SIMILARITY OF DIGITAL IMAGES
2y 5m to grant Granted Jan 20, 2026
Patent 12505539
CELL BODY SEGMENTATION USING MACHINE LEARNING
2y 5m to grant Granted Dec 23, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
81%
Grant Probability
99%
With Interview (+17.5%)
3y 1m
Median Time to Grant
Low
PTA Risk
Based on 128 resolved cases by this examiner. Grant probability derived from career allow rate.
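For readers checking the arithmetic, the headline projections are consistent with a simple derivation from the career allow rate; the grant count, rounding, and cap used in the sketch below are assumptions about how the tool computes its figures, not a documented formula.

# Rough reconstruction of the projection figures (assumptions, not the tool's
# documented methodology): allow rate over resolved cases, plus interview lift.
granted, resolved = 104, 128          # grant count implied by an ~81% allow rate over 128 cases
interview_lift = 17.5                 # percentage-point lift when an interview is held

base_grant_prob = 100 * granted / resolved                      # 81.25 -> shown as 81%
with_interview = min(base_grant_prob + interview_lift, 99.0)    # 98.75 -> shown as 99%
print(round(base_grant_prob), round(with_interview))            # 81 99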
