DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments and amendments in the Amendment filed January 28, 2026 (herein “Amendment”), with respect to the rejection of claim 11 under 35 U.S.C. § 101 as being directed to non-statutory subject matter, have been fully considered and are persuasive. The rejection of claim 11 under 35 U.S.C. § 101 has been withdrawn.
Applicant's arguments filed in the Amendment regarding the rejection of independent claims 1, 11, and 12, and claims depending therefrom, under 35 U.S.C. § 103 have been fully considered but are not persuasive.
Applicant presents two main arguments: 1) the Huang1 and Chan references do not teach or suggest “the output image is a reproduction of a corresponding image from the training dataset;” and 2) the motivation to combine Huang1 and Chan employs impermissible hindsight reconstruction.
Regarding applicant’s first argument that Huang1 and Chan do not teach or suggest “the output image is a reproduction of a corresponding image from the training dataset,” it is noted that MPEP 2111.01(IV) requires a special definition, clearly stated in the specification, in order for a claim limitation to be excepted from its plain meaning under the broadest reasonable interpretation. In the present application, applicant has not provided a special definition for the word “reproduction,” and therefore the plain meaning controls the broadest reasonable interpretation. Per the Merriam-Webster dictionary, “reproduction” has a plain meaning of “something reproduced: copy,” where the definition of “copy” includes “an imitation, transcript, or reproduction of an original work (such as a letter, a painting, a table, or a dress).” Therefore, Huang1’s generative AI framework, which generates an image with a neural network trained to optimize its output against ground-truth images, is literally trained and designed to create a “reproduction.” Moreover, the examiner’s understanding is that the present application’s written description supports the claimed “reproduction” in the same way, i.e., through generative AI models. Accordingly, applicant’s remarks are not persuasive.
Regarding applicant’s second argument, applicant contends that the motivational reasoning employs impermissible hindsight simply because Huang1 mentions that the “Inception score” metric is not used as a training metric because it can present an overfitting situation. At best, this citation to Huang1 would be evidence of a teaching away, which is a different theory of non-obviousness than impermissible hindsight (see MPEP 2145(X)(D) vs. MPEP 2145(X)(A)). Notwithstanding, because the rationale to combine Huang1 and Chan comes directly from Chan’s own teaching in ¶19, the record reflects that the motivation did not come from applicant’s specification/disclosure and therefore does not constitute impermissible hindsight analysis. It must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning, but so long as it takes into account only knowledge which was within the level of ordinary skill in the art at the time the claimed invention was made, and does not include knowledge gleaned only from applicant's disclosure, such a reconstruction is proper. See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971). Further, it is noted that Huang1’s observation that one training metric among many is undesirable for its overfitting qualities is not sufficient to overcome the finding that a person having ordinary skill in the art would have understood that overfitting has advantages, as suggested by Chan.
Accordingly, in view of the above, while all of applicant’s arguments have been fully considered, they are not found persuasive, and the rejection of claims 1, 11 and 12, and claims depending therefrom under 35 U.S.C. 103 has been maintained.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and content of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 6, 10–12, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al., “Unifying Multimodal Transformer for Bi-directional Image and Text Generation,” arXiv:2110.09753v1 [cs.CV], October 19, 2021, https://doi.org/10.48550/arXiv.2110.09753 (herein “Huang1”) in view of Chan et al., WIPO International PCT Application Publication No. WO 2025/019263 A1 (herein “Chan”).
Regarding claims 1, 11 and 12, with substantive differences between the claims noted in curly brackets {}, deficiencies of Huang1 noted in square brackets [], and claim 1 as exemplary, Huang1 teaches {a computer-implemented method – claim 1 / a computer program product, comprising: one or more tangible non-transitory computer-readable storage media and program instructions stored on at least one of the one or more tangible non-transitory computer-readable storage media, the program instructions executable by a processor to cause the processor to perform operations – claim 11 / a system comprising: [a memory; and at least one processor, coupled to said memory,] and operative to perform operations – claim 12} (Huang1 section 5.1, implementation details, the computer code for executing the disclosed functionality is available in a GitHub repository (a computer program product stored on a storage medium)) comprising:
training a text-to-image machine learning model by using a training dataset (Huang1 sections 3.1, 4 and 5.1, text-to-image generation task performed by a multimodal transformer which is trained and in the experimental setup, the MS-COCO training dataset was used to train the model), wherein the training dataset comprises images and respective natural language descriptions of the images (Huang1 section 5.1, MS-COCO dataset comprised of images with five annotated captions for each image), [wherein the training causes the text-to-image machine learning model to be overfit on the training dataset];
[storing the trained overfit text-to-image machine learning model;]
submitting a given natural language description to the [stored overfit] text-to-image machine learning model (Huang1 figure 2, section 5.2, text-to-image generation, text is given/input into the framework including the multimodal transformer model); and
in response to the submitting, receiving as output from the [stored overfit] text-to-image machine learning model an output image, wherein the output image is a reproduction of a corresponding image from the training dataset and corresponds to the submitted natural language description (Huang1 figure 2, sections 4.2, 5.1 and 5.2, the text-to-image generation generates an image based on the text using a testing set of the MS-COCO dataset having five annotated captions for each image, the output image from the text-to-image model under test is a generated image (reproduction) corresponding to the ground truth image from the MS-COCO dataset).
Huang1 does not explicitly teach, but Chan teaches a memory; and at least one processor, coupled to said memory (Chan ¶72, computing system including one or more processors with a memory storing instructions executed by the processor), wherein the training causes the text-to-image machine learning model to be overfit on the training dataset (Chan ¶19, text-to-image synthesis models are designed to overfit the target subject present in the sample input images and are fine-tuned as such (trained)), storing the trained overfit text-to-image machine learning model, and the stored overfit model (Chan ¶¶ 19 and 73, machine learning models are trained and then stored at the computing system, the text-to-image model being designed to overfit).
Therefore, taking the teachings of Huang1 and Chan together as a whole, it would have been obvious to a person having ordinary skill in the art (herein “PHOSITA”) before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include the teachings of Chan of an overfit model, and storage of the model, at least because doing so would provide accurate recreations of a target subject. See Chan ¶19. Further, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to modify the bi-directional image and text generation system of Huang1 to include the processor and memory of Chan at least because, conventionally, machine learning models and software executing the models are stored in memory and executed by a processor, and as such the modification would be a use of a known technique to improve similar devices (methods, or products) in the same way. See MPEP § 2143(I)(C).
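For illustration of the examiner's understanding only, the following minimal sketch shows the mapped train/store/submit/receive sequence, with training deliberately run past convergence so the model memorizes (overfits) its training images; the names, sizes, and bag-of-words encoder are hypothetical stand-ins, not Huang1's multimodal transformer or Chan's fine-tuning code.

```python
# Illustrative sketch only: a toy text-to-image model deliberately trained to
# overfit (memorize) a tiny captioned dataset, then stored and queried.
import torch
import torch.nn as nn

# Toy "training dataset": captions paired with 8x8 grayscale images.
captions = ["a red square", "a blue circle"]
images = [torch.rand(8, 8) for _ in captions]
vocab = {w: i for i, w in enumerate(sorted({w for c in captions for w in c.split()}))}

def encode(caption: str) -> torch.Tensor:
    # Bag-of-words text encoding, a stand-in for a real text encoder.
    v = torch.zeros(len(vocab))
    for w in caption.split():
        v[vocab[w]] = 1.0
    return v

model = nn.Sequential(nn.Linear(len(vocab), 64), nn.ReLU(), nn.Linear(64, 64))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Train far past convergence so the model memorizes (overfits) the dataset.
for _ in range(2000):
    for cap, img in zip(captions, images):
        loss = nn.functional.mse_loss(model(encode(cap)).view(8, 8), img)
        opt.zero_grad()
        loss.backward()
        opt.step()

torch.save(model.state_dict(), "overfit_t2i.pt")  # store the overfit model

# Submit a natural language description; the output image reproduces the
# corresponding training image up to optimization error.
out = model(encode("a red square")).detach().view(8, 8)
print(torch.allclose(out, images[0], atol=0.05))  # expect True once memorized
```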
Regarding claims 6 and 17, with claim 6 as exemplary, Huang1 teaches the submitting and receiving steps as given above, but does not explicitly teach, where Chan teaches, further comprising: receiving the overfit text-to-image machine learning model (Chan ¶19, text-to-image synthesis models are designed to overfit the target subject present in the sample input images and are fine-tuned as such (trained)) at a remote node via a network (Chan fig. 4A, ¶¶64–66, the user computing device (remote node) receives the image synthesis model from the server computing system), wherein the submitting and the receiving steps are performed with the text-to-image machine learning model positioned at the remote node (Chan ¶¶66–71, the user computing device (remote node) stores and implements (interacting with, such as via Huang1’s submitting and receiving) the image synthesis models).
Therefore, taking the teachings of Huang1 and Chan together as a whole, it would have been obvious to a person having ordinary skill in the art (herein “PHOSITA”) before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include the teachings of Chan of an overfit model, and storage and implementation of the model at least because doing so would provide accurate recreations of a target subject. See Chan ¶19.
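Again for illustration only, a hedged sketch of the claims 6 and 17 arrangement as mapped to Chan: a remote node receives the stored overfit model over a network and then performs the submitting and receiving steps locally. The URL, file name, and architecture below are hypothetical and would have to match whatever was actually stored server-side.

```python
# Hedged sketch: a remote node fetches the stored overfit model via a network,
# then the submit/receive steps run locally at that node.
import urllib.request
import torch
import torch.nn as nn

MODEL_URL = "https://example.com/models/overfit_t2i.pt"  # hypothetical server

def fetch_model(url: str, path: str = "local_model.pt") -> nn.Module:
    urllib.request.urlretrieve(url, path)                # receive via network
    model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 64))
    model.load_state_dict(torch.load(path))              # load the stored weights
    return model

# model = fetch_model(MODEL_URL)  # then submit descriptions to `model` locally
```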
Regarding claim 10, Huang1 teaches wherein the text-to-image machine learning model produces embeddings selected from the group consisting of token embeddings, visual feature embeddings, segment embeddings, and sequence position embeddings (Huang1, section 3.1, text-to-image generation by the trained models using text tokens and image tokens, where the text tokens are represented by position embedding and word embedding (token embeddings)).
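For illustration of the mapped embedding scheme only, the following minimal sketch represents each text token as the sum of a word (token) embedding and a sequence position embedding, consistent with the examiner's reading of Huang1 section 3.1; the vocabulary and model sizes are illustrative.

```python
# Minimal sketch: token embeddings combined with sequence position embeddings.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 32, 64
token_emb = nn.Embedding(vocab_size, d_model)  # word/token embeddings
pos_emb = nn.Embedding(max_len, d_model)       # sequence position embeddings

token_ids = torch.tensor([[5, 42, 7]])         # a toy tokenized caption
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
x = token_emb(token_ids) + pos_emb(positions)  # combined input representation
print(x.shape)                                 # torch.Size([1, 3, 64])
```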
Claims 2 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Huang1 in view of Chan, and further in view of Huang et al., “Turbo Learning for CaptionBot and DrawingBot,” 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada (herein “Huang2”).
Regarding claims 2 and 13, with claim 2 as exemplary, Huang1 does not explicitly teach the limitations of claims 2 and 13. Huang2 teaches further comprising producing the natural language descriptions of the training dataset via inputting the images of the training dataset into an image-to-text machine learning model and, in response, receiving the natural language descriptions as output from the image-to-text machine learning model (Huang2 page 6, section 4.3, and page 3, figure 1, in step 2, for a given ground-truth/gold training image I*, the CaptionBot (image-to-text machine learning model) generates a sentence Ŝʳ, and the generated sentence is supplied (receiving the natural language description) to the DrawingBot).
Therefore, taking the teachings of Huang1 as modified above by Chan, and Huang2, together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include the use of an image-to-text model to generate captions from training images as disclosed in Huang2 at least because doing so would provide improved performance in both image-to-text and text-to-image processing tasks. See Huang2, from the bottom of page 1 to the end of the paragraph at the top of page 2.
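For illustration of the examiner's understanding of the cited Huang2 step only, the following hedged sketch pairs each ground-truth training image with a caption received from an image-to-text model; the `caption_model` callable is a hypothetical stand-in for the CaptionBot.

```python
# Hedged sketch: an image-to-text model generates a caption for each training
# image, and the generated sentence is received as the description used later.
from typing import Callable, List, Tuple
import torch

def build_caption_pairs(
    images: List[torch.Tensor],
    caption_model: Callable[[torch.Tensor], str],
) -> List[Tuple[str, torch.Tensor]]:
    # For each training image I*, receive a generated sentence from the
    # image-to-text model and pair it with the image for later training.
    return [(caption_model(img), img) for img in images]

# Toy usage with a dummy captioner standing in for a real image-to-text model.
pairs = build_caption_pairs([torch.rand(8, 8)], lambda img: "a generated caption")
print(pairs[0][0])
```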
Claims 3–4 and 14–15 are rejected under 35 U.S.C. 103 as being unpatentable over Huang1 in view of Chan, and further in view of Mei et al., US Patent No. 12,271,399 B2 (herein “Mei”).
Regarding claims 3 and 14, with claim 3 as exemplary, Huang1 in combination with Chan teaches the stored overfit text-to-image machine learning model (Chan ¶¶ 19 and 73, machine learning models are trained and then stored at the computing system, the text-to-image model being designed to overfit); however, Huang1 as modified by Chan does not teach the remaining limitations of claims 3 and 14.
Mei teaches further comprising: storing the natural language descriptions of the training dataset as a text collection (Mei col. 21, ll. 12–31, the second modality database stores attribute information of the second modality data, including first modality description information, which col. 8, ll. 59–66 teaches is a text description (natural language) for an image);
searching the text collection based on a first text input; and in response to the searching, retrieving the given natural language description from the text collection (Mei col. 21, l. 51–col. 22, l. 17, a search is performed from text input into a search box, where the input text is recommended text generated from the image description in the stored attribute information of the second modality data) for the submitting of the given natural language description to the stored text-to-image machine learning model (Mei col. 22, ll. 12–23, the text is searched to find an image matching the text).
Therefore, taking the teachings of Huang1 as modified by Chan, and Mei, together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include searching of stored text information associated with searchable images as disclosed in the cited passages of Mei above, at least because doing so would improve cross-modal search efficiency and the diversity and comprehensiveness of a cross-modal search result. See Mei col. 1, ll. 34–37.
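For illustration only, a minimal sketch of the mapped store/search/retrieve flow: the natural language descriptions are kept as a searchable text collection, a first text input selects a stored description, and that description is what would then be submitted to the stored text-to-image model. The naive substring search and all names are hypothetical stand-ins for Mei's cross-modal search.

```python
# Illustrative sketch: store descriptions, search on a first text input,
# retrieve the matching description for submission to the stored model.
text_collection = {
    "img_001": "a red square on a white background",
    "img_002": "a blue circle on a black background",
}

def search(collection: dict, query: str) -> str:
    # Return the first stored description matching the text input.
    for image_id, description in collection.items():
        if query.lower() in description.lower():
            return description
    raise KeyError(f"no stored description matches {query!r}")

given_description = search(text_collection, "red square")
# submit_to_model(given_description)  # hypothetical submission step
print(given_description)
```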
Regarding claims 4 and 15, with claim 4 as exemplary, Huang1 does not explicitly teach, but Mei teaches, further comprising storing the natural language descriptions of the training dataset as a text collection, wherein the natural language descriptions are stored with at least one of a respective image label and respective image metadata (Mei col. 21, ll. 12–31, the second modality database storing attribute information and the second modality data, the attribute information including first modality description information associated with the second modality data (natural language description), a category label (image label), and first modality description information recognized from the second modality data (metadata)).
Therefore, taking the teachings of Huang1 as modified by Chan, and Mei, together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include searching of stored information associated with searchable images as disclosed in the cited passages of Mei above, at least because doing so would improve cross-modal search efficiency and the diversity and comprehensiveness of a cross-modal search result. See Mei col. 1, ll. 34–37.
Claims 7 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Huang1 in view of Chan, and further in view of Zia et al., “Text-to-Image Generation with Attention Based Recurrent Neural Networks,” arXiv:2001.06658v1 [cs.CV], Jan. 18, 2020, https://doi.org/10.48550/arXiv.2001.06658 (herein “Zia”).
Regarding claims 7 and 18, Huang1 in view of Chan as modified above teaches the overfit text-to-image machine learning model (Chan ¶¶ 19 and 73, machine learning models are trained and then stored at the computing system, the text-to-image model being designed to overfit), but does not explicitly teach, where Zia teaches, wherein the training the text-to-image machine learning model is performed using sequence-to-sequence training (Zia pages 4 and 6–7, sections 4 and 6, for text-to-image generation, a model using an encoder-decoder framework is trained as a sequence-to-sequence model).
Therefore, taking the teachings of Huang1 as modified by Chan, and Zia, together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include the sequence-to-sequence training as disclosed in the cited passages of Zia above, at least because doing so would improve image quality in the generated images. See Zia Abstract.
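For illustration of the mapped sequence-to-sequence training only, a minimal encoder-decoder sketch in which a text token sequence conditions the decoding of an output token sequence; all sizes are illustrative and this is not Zia's actual architecture.

```python
# Minimal sketch, assuming a generic encoder-decoder: sequence-to-sequence
# training maps an input token sequence (text) to an output token sequence.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab: int = 100, d: int = 32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)  # encodes the text sequence
        self.decoder = nn.GRU(d, d, batch_first=True)  # decodes the output sequence
        self.head = nn.Linear(d, vocab)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        _, h = self.encoder(self.emb(src))             # context from the input
        out, _ = self.decoder(self.emb(tgt), h)        # conditioned decoding
        return self.head(out)

model = Seq2Seq()
src = torch.randint(0, 100, (1, 5))  # toy text tokens
tgt = torch.randint(0, 100, (1, 7))  # toy output (e.g., image) tokens
logits = model(src, tgt[:, :-1])     # teacher forcing on shifted targets
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tgt[:, 1:].reshape(-1))
loss.backward()
```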
Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Huang1 in view of Chan, and further in view of Nagao et al., EP 0467527 (herein “Nagao”).
Regarding claims 8 and 19, with claim 8 as exemplary, Huang1 as modified by Chan does not explicitly teach, where Nagao teaches, further comprising storing the natural language descriptions in one or more of a data heap and a data tree (Nagao page 5, ll. 37–43, a knowledge base for natural language analysis comprised of stored trees of data (data tree) representing sentences (natural language descriptions)).
Therefore, taking the teachings of Huang1 as modified by Chan, and Nagao together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include the storing of sentences in a tree of data for a knowledge base for natural language analysis as disclosed in the cited passages above in Nagao at least because doing so would provide semantic information to overcome processing bottlenecks in analyzing natural language. See Nagao page 3, l. 48 – page 4, l. 12.
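For illustration only, a hedged sketch of storing natural language descriptions in a data tree; the node layout is an assumption in the spirit of Nagao's knowledge base of sentence trees, not Nagao's actual structure.

```python
# Hedged sketch: a sentence stored as a small parse-like data tree, with the
# stored description recoverable from the tree's leaf labels.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                       # e.g., a syntactic category or a word
    children: List["Node"] = field(default_factory=list)

# "a red square" stored as a small parse-like tree.
sentence_tree = Node("S", [Node("NP", [Node("a"), Node("red"), Node("square")])])

def leaves(node: Node) -> List[str]:
    # Recover the stored sentence from the tree's leaf labels.
    if not node.children:
        return [node.label]
    return [word for child in node.children for word in leaves(child)]

print(" ".join(leaves(sentence_tree)))  # "a red square"
```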
Claims 9 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Huang1 in view of Chan, and further in view of Lu et al., “Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering,” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI'18), 2018, AAAI Press, Article 884, pp. 7218–7225 (herein “Lu”).
Regarding claims 9 and 20, with claim 9 as exemplary, Huang1 teaches the text-to-image machine learning model (Huang1 sections 3.1, 4 and 5.1, text-to-image generation task performed by a multimodal transformer), but does not explicitly teach, where Lu teaches, that the model comprises a Fast Region-based Convolutional Neural Network that generates appearance features and geometry embeddings for a respective image that is input (Lu page 7220, section 3.1, a Faster R-CNN (region-based convolutional neural network) is used to obtain object detection boxes (geometry embeddings) in the image, including visual (appearance) features).
Therefore, taking the teachings of Huang1 as modified by Chan, and Lu together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the bi-directional image and text generation system of Huang1 to include the processing by the faster-RCNN as disclosed in the cited passages above in Lu at least because doing so would provide better fusing of features from different modalities leading to better output accuracy. See Lu page 7219, last paragraph of the Introduction section, and page 7220, next to last paragraph of section 3.
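For illustration of the examiner's understanding only, the following sketch obtains detection boxes (region geometry) and associated per-region predictions from an off-the-shelf Faster R-CNN in torchvision; Lu's actual feature-extraction pipeline differs in detail, and the pretrained weights are downloaded on first use.

```python
# Illustrative sketch: region detection boxes and per-region labels/scores
# from a stock Faster R-CNN, in the spirit of the mapping to Lu.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 224, 224)  # a toy input image in [0, 1]
with torch.no_grad():
    pred = model([image])[0]     # one prediction dict per input image

# Bounding boxes give region geometry; labels/scores summarize appearance.
print(pred["boxes"].shape, pred["labels"].shape, pred["scores"].shape)
```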
Allowable Subject Matter
Claims 5 and 16 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The reasons for allowability for claims 5 and 16 have already been set forth in the Non-Final Office Action issued 10/30/2025 on page 12.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Thursday, 09:00-17:00, Friday 09:00-13:00, EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHELLE M KOETH/Primary Examiner, Art Unit 2671