DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
2. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on January 13, 2026 has been entered.
Response to Amendment
3. The amendment filed January 13, 2026 has been entered. Claims 1-20 remain pending in the application.
Response to Arguments
4. Applicant's arguments filed January 13, 2026 have been fully considered but they are not persuasive.
5. Applicant argues that Li et al. ("StoryGAN: A Sequential Conditional GAN for Story Visualization"), hereinafter referred to as Li, fails to teach using the first generated image to generate a second image as recited in the amended independent claims.
Examiner replies that Li is no longer relied upon to teach using the first generated image to generate a second image as recited in the amended independent claims. Instead, Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, is relied upon for this limitation.
Guo’s WIPO Publication No. 2024/207872 A1 translation was used in the previous rejections to reject claims 1-18 and 20. The Examiner now relies on Guo’s US 2025/0259466 A1 publication, which offers a clearer translation that more plainly teaches using the first generated image to generate a second image. Guo Paragraph 221 teaches using the historical image, i.e., the first generated image, in a similarity calculation to select the second image to be output for the second prompt. Selecting the second image to be output teaches generating a second image using the first generated image, as recited in the amended independent claims.
Claim Rejections - 35 USC § 103
6. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
7. The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
8. Claim(s) 1, 6-8, 10, 14-16, and 19-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1 -- entitled to foreign priority to Chinese Application 202310399237.0, published as CN118781225A, which supports the subject matter in the U.S. Publication), hereinafter referred to as Guo.
9. Regarding claim 1, Zeng teaches a computer-implemented method for generating contextually-persistent images across a text document (Abstract teaches keeping global consistency across scenes when generating a sequence of images from a multi-sentence paragraph or text document), comprising: generating a first entity identifier for a first entity identified in the text document (Section 3.2 teaches creating a vector for each word and measuring the importance of words. One of the most relevant words is the first entity identified, and the vector for each word is the first entity identifier);
identifying a set of semantic text chunks from the text document including a first text chunk and a second text chunk (Section 3.1 teaches using an aligned sentence encoder to identify sentences and extract semantic vectors for each sentence. The sentence is a semantic text chunk. The encoder identifies all sentences so all the sentences identified are a set of semantic text chunks that include a first and second text chunk for a first and second sentence; Figure 2 shows more than two sentences extracted and passed into the aligned sentence encoder. One sentence can be the first text chunk and the other the second text chunk);
generating a first synthetic image utilizing the first text chunk and the first entity identifier associated with the first entity (Section 3, Paragraph 1 teaches that after using the text chunk, or sentence, and the first entity identifier, or word feature w_t, a synthetic image is generated by the image generator; Figure 2 teaches using the first text chunk and word vector, which is the first entity identifier associated with the first entity, to create a first synthetic image x̂_1);
However, Zeng is not relied on for the following claim language: associating the first entity identifier with a first instance of the first entity within the first text chunk and with a second instance of the first entity within the second text chunk; determining a first visual entity embedding for the first entity identifier from the first synthetic image; and generating a second synthetic image by providing the second text chunk with the first synthetic image and the first entity identifier associated with the first entity and the first visual entity embedding to an image generation model, wherein the first entity in the first synthetic image matches the first entity in the second synthetic image.
Guo teaches associating the first entity identifier with a first instance of the first entity within the first text chunk and with a second instance of the first entity within the second text chunk (Paragraphs 206-207 and Figure 11 teach “each prompt may be split for recognizing a noun element included in the prompt…the noun element is recorded and stored in a sequence element library 1101”. The split prompt is stored in the sequence element library with the noun element identifier. The noun element identifier teaches the first entity identifier. Thus, this teaches identifying a first instance of the first entity in the first text chunk; Paragraph 218 teaches “if it is determined that a noun element appears in the second prompt…whether the noun element exists in the historical elements (the noun elements of the first prompt) stored in the sequence element library is queried.” This teaches associating the noun element of the first text chunk with the same entity in the second text chunk);
determining a first visual entity embedding for the first entity identifier from the first synthetic image (Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the first visual entity embedding for the first entity identifier from the first synthetic image; Paragraph 197 and Figure 10 teach the image feature can be obtained through a CLIP model used on the generated image);
and generating a second synthetic image by providing the second text chunk (Paragraph 215 teaches generating a second image for a second prompt. The second prompt is the second text chunk) with the first synthetic image and the first entity identifier associated with the first entity and the first visual entity embedding to an image generation model (Paragraph 218 teaches “when the noun element in the second prompt appears in the corresponding retained image, whether the noun element exists in historical elements (the noun elements of the first prompt) stored in the sequence element library is queried. If the noun element exists, cumulative correlations of elements are scored based on the retained image.” The historical element or noun elements of the first prompt teaches the first entity identifier associated with a first entity. Paragraph 221 teaches “a similarity between the current retained image and the historical image is calculated and used as a historical-image similarity. That is, a similarity between an image feature of an image (the historical image) generated based on a previous prompt and a current image (the retained image of the second prompt) is calculated” and Paragraph 223 teaches based on the similarity scores, the second synthetic image is chosen and output. The second synthetic image is then “stored in the sequence element library 1101 as a new historical image.” The historical image teaches the first synthetic image and the image feature teaches the first visual entity embedding. The Figure 11 process teaches an image generation model. This teaches that the first entity identifier, first synthetic image, and first visual entity embedding are stored in the sequence element library and provided to the image generation model in order to generate the second synthetic image),
wherein the first entity in the first synthetic image matches the first entity in the second synthetic image (Paragraph 219 teaches “image-text similarities between the historical elements of all element types … and the current image … are respectively calculated” and Paragraph 221 teaches “a similarity between an image feature of an image (the historical image) generated based on a previous prompt and a current image (the retained image of the second prompt) is calculated”. This teaches comparing the entities in the first and second images. Paragraph 223 teaches the image with the highest fusion score or similarity score is selected as the second synthetic image to be output. Having a high similarity score teaches the entities match).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with generating a second synthetic image using the first entity identifier, first visual entity embedding, and first synthetic image taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
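For illustration only, the similarity-based selection Guo's Paragraphs 218-223 describe can be sketched as follows. The function names, the candidate/historical data layout, and the use of cosine similarity over CLIP-style image features are assumptions of this sketch, not Guo's actual implementation:

```python
# Hypothetical sketch: score each candidate ("retained") image for the
# second prompt against the feature of the historical image generated for
# the first prompt, and output the highest-scoring candidate as the second
# image. Names and the cosine-similarity metric are illustrative assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two image-feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_second_image(candidate_features: list, historical_feature: np.ndarray) -> int:
    """Return the index of the candidate most similar to the historical image."""
    scores = [cosine_similarity(f, historical_feature) for f in candidate_features]
    return int(np.argmax(scores))
```

Under this reading, a high similarity score between the selected candidate and the historical image is what corresponds to the claimed "matching" of the first entity across the two synthetic images.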
10. Regarding claim 6, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng is not relied on for the following claim language: the method further comprising associating the first visual entity embedding with the first entity identifier in an entity table.
Guo teaches the method further comprising associating the first visual entity embedding with the first entity identifier in an entity table (Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the first visual entity embedding and the first entity identifier is the prop-element. The sequence element library teaches an entity table. Thus, Guo teaches the first visual entity embedding is associated with a first entity identifier in an entity table).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with the entity table taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
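For illustration only, the entity-table association Guo's Paragraph 213 describes (prop-element 1-cat-[prompt ID-image ID-image feature]) can be sketched as a keyed record store. The dictionary layout and field names are assumptions of this sketch, not Guo's disclosed data structure:

```python
# Hypothetical sketch of the sequence element library as an "entity table":
# each noun element (entity identifier) is associated with the prompt ID,
# image ID, and image feature (visual entity embedding) of the image in
# which it appeared. All names are illustrative assumptions.
sequence_element_library = {}

def record_element(element_type, noun, prompt_id, image_id, image_feature):
    """Associate a noun element with its prompt, image, and image feature."""
    key = (element_type, noun)
    sequence_element_library.setdefault(key, []).append(
        {"prompt_id": prompt_id, "image_id": image_id, "feature": image_feature})

# e.g. prop-element 1 - cat - [prompt ID - image ID - image feature]
record_element("prop-element", "cat", prompt_id=1, image_id=7,
               image_feature=[0.12, 0.58, 0.31])
```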
11. Regarding claim 7, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng is not relied on for the following claim language: the method further comprising determining the first visual entity embedding from the first synthetic image using a visual entity embedding extraction model that generates visual entity embeddings for entities detected in digital images.
Guo teaches the method further comprising determining the first visual entity embedding from the first synthetic image using a visual entity embedding extraction model that generates visual entity embeddings for entities detected in digital images (Paragraph 197 and Figure 10 teach the image feature can be obtained through a CLIP model used on the generated image. The CLIP model teaches a visual entity embedding model that generates visual entity embeddings or image features; Paragraphs 200-202 teach extracting image features through the CLIP model and comparing them to the noun elements in the prompts. The noun elements are entities. Thus, the visual entity embeddings are generated through the visual entity embedding model for the entities detected in the digital images; Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the first visual entity embedding and the first entity identifier is the prop-element. Thus, a visual entity embedding is generated for entities detected in the images).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with the visual entity embedding extraction model taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
12. Regarding claim 8, Zeng in view of Guo teaches the limitations of claim 6. Zeng further teaches generating the second synthetic image using the image generation model based on the second text chunk (Section 3, Paragraph 1 teaches creating a synthetic image using the image generator and the text chunk or sentence; Figure 2 teaches using the second text chunk to create a second synthetic image x̂_2).
However, Zeng is not relied on for the following claim language: generating the second synthetic image using the image generation model based on the first synthetic image and the first entity identifier, wherein the first entity identifier includes the first entity and the first visual entity embedding.
Guo teaches generating the second synthetic image using the image generation model based on the second text chunk (Paragraph 215 teaches generating a second image for a second prompt. The second prompt is the second text chunk), the first synthetic image, and the first entity identifier, wherein the first entity identifier includes the first entity and the first visual entity embedding (Paragraph 218 teaches “when the noun element in the second prompt appears in the corresponding retained image, whether the noun element exists in historical elements (the noun elements of the first prompt) stored in the sequence element library is queried. If the noun element exists, cumulative correlations of elements are scored based on the retained image.” The historical element or noun elements of the first prompt teaches the first entity identifier associated with a first entity. Paragraph 221 teaches “a similarity between an image feature of an image (the historical image) generated based on a previous prompt and a current image (the retained image of the second prompt) is calculated” and Paragraph 223 teaches based on the similarity scores, the second synthetic image is chosen and output. The second synthetic image is then “stored in the sequence element library 1101 as a new historical image.” The historical image teaches the first synthetic image and the image feature teaches the first visual entity embedding. The Figure 11 process teaches an image generation model. This teaches that the first entity identifier, first synthetic image, and first visual entity embedding are stored in the sequence element library and provided to the image generation model in order to generate the second synthetic image).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with generating the second image using the first entity identifier, visual entity embedding, and synthetic image taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
13. Regarding claim 10, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng is not relied on for the following claim language: the method further comprising determining the first visual entity embedding from the first synthetic image based on receiving the first visual entity embedding extracted as an output from the image generation model in connection with receiving the first synthetic image.
Guo teaches the method further comprising determining the first visual entity embedding from the first synthetic image based on receiving the first visual entity embedding extracted as an output from the image generation model in connection with receiving the first synthetic image (Paragraph 197 and Figure 10 teach “the CLIP model is called based on an inputted sentence and corresponding generated images … to obtain … CLIP image features of the generated images”. The CLIP image feature is the first visual entity embedding and the generated image is the first synthetic image. This teaches obtaining the visual entity embedding extracted as an output from the image generation model in connection with receiving the synthetic image).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with the first visual entity embedding taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
14. Regarding claim 14, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng is not relied on for the following claim language: the method further comprising providing a user interface element with a passage of the text document to request a synthetic image of the passage, wherein the synthetic image is previously generated or is generated in response to detecting a selection of a request.
Guo teaches the method further comprising providing a user interface element with a passage of the text document to request a synthetic image of the passage, wherein the synthetic image is previously generated or is generated in response to detecting a selection of a request (Paragraph 37 teaches the user can input a prompt to an application, and “the server 200 receives the prompt sent by the terminal, first obtains a plurality of generated images of the prompt to determine an illustration of the prompt from the plurality of generated images.” The application teaches the user interface which generates a synthetic image in response to detecting the user’s request; Paragraph 56 teaches the user can input a prompt like a novel or script article, which teaches a passage of a text document being used to request a synthetic image).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with generating an image in response to a user’s request as taught by Guo in order to generate a sequence of images for stories that maintain the consistency of the previous image (Guo Paragraph 116).
15. Regarding claim 15, Zeng teaches a system for generating contextually-persistent images across a text document (Abstract teaches keeping global consistency across scenes when generating a sequence of images from a multi-sentence paragraph or text document), comprising: computer-based models including an entity recognition model, a semantic text chunking model, and an image generation model (Section 3 teaches the PororoGAN model which contains an entity recognition model through the attentional word encoder, the semantic text chunking model through the aligned sentence encoder, and the image generation model through the image generator),
generating a first entity identifier for a first entity identified in the text document utilizing the entity recognition model (Section 3.2 teaches using the Attentional Word Encoder, or entity recognition model, to create a vector for each word and measuring the importance of words. One of the most relevant words is the first entity identified, and the vector for each word is the first entity identifier);
identifying a set of semantic text chunks from the text document including a first text chunk and a second text chunk utilizing the semantic text chunking model (Section 3.1 teaches using the aligned sentence encoder, or semantic text chunking model, to identify sentences and extract semantic vectors for each sentence. The sentence is a semantic text chunk. The encoder identifies all sentences so all the sentences identified are a set of semantic text chunks that include a first and second text chunk for a first and second sentence; Figure 2 shows more than two sentences extracted and passed into the aligned sentence encoder. One sentence can be the first text chunk and the other the second text chunk);
generating a first synthetic image utilizing the first text chunk and the first entity identifier associated with the first entity using the image generation model (Section 3, Paragraph 1 teaches that after using the text chunk, or sentence, and the first entity identifier, or word feature w_t, a synthetic image is generated by the image generator; Figure 2 teaches using the first text chunk and word vector, which is the first entity identifier associated with the first entity, to create a first synthetic image x̂_1);
However, Zeng is not relied on for the following claim language: system comprising: a visual entity embedding extraction model; a processing system comprising a processor; and a computer memory comprising instructions that, when executed by the processing system, cause the system to perform operations comprising: associating the first entity identifier with a first instance of the first entity within the first text chunk and with a second instance of the first entity within the second text chunk; determining a first visual entity embedding for the first entity identifier from the first synthetic image using the visual entity embedding extraction model; and generating a second synthetic image by providing the first synthetic image and the first entity identifier associated with the first entity and the first visual entity embedding to the image generation model, wherein the first entity in the first synthetic image is continuous with the first entity in the second synthetic image.
Guo teaches a system comprising: a visual entity embedding extraction model (Paragraph 197 and Figure 10 teach the image feature can be obtained through a CLIP model used on the generated image. The CLIP model is the visual entity embedding extraction model); a processing system comprising a processor; and a computer memory comprising instructions that, when executed by the processing system, cause the system to perform operations comprising (Paragraph 8 teaches the system consists of a memory with instructions and a processor configured to execute those instructions): associating the first entity identifier with a first instance of the first entity within the first text chunk and with a second instance of the first entity within the second text chunk (Paragraphs 206-207 and Figure 11 teach “each prompt may be split for recognizing a noun element included in the prompt…the noun element is recorded and stored in a sequence element library 1101”. The split prompt is stored in the sequence element library with the noun element identifier. The noun element identifier teaches the first entity identifier. Thus, this teaches identifying a first instance of the first entity in the first text chunk; Paragraph 218 teaches “if it is determined that a noun element appears in the second prompt…whether the noun element exists in the historical elements (the noun elements of the first prompt) stored in the sequence element library is queried.” This teaches associating the noun element of the first text chunk with the same entity in the second text chunk);
determining a first visual entity embedding for the first entity identifier from the first synthetic image using the visual entity embedding extraction model (Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the first visual entity embedding for the first entity identifier from the first synthetic image; Paragraph 197 and Figure 10 teach the image feature can be obtained through a CLIP model used on the generated image. The CLIP model is the visual entity embedding extraction model);
and generating a second synthetic image by providing the second text chunk (Paragraph 215 teaches generating a second image for a second prompt. The second prompt is the second text chunk) with the first synthetic image and the first entity identifier associated with the first entity and the first visual entity embedding to the image generation model (Paragraph 218 teaches “when the noun element in the second prompt appears in the corresponding retained image, whether the noun element exists in historical elements (the noun elements of the first prompt) stored in the sequence element library is queried. If the noun element exists, cumulative correlations of elements are scored based on the retained image.” The historical element or noun elements of the first prompt teaches the first entity identifier associated with a first entity. Paragraph 221 teaches “a similarity between an image feature of an image (the historical image) generated based on a previous prompt and a current image (the retained image of the second prompt) is calculated” and Paragraph 223 teaches based on the similarity scores, the second synthetic image is chosen and output. The second synthetic image is then “stored in the sequence element library 1101 as a new historical image.” The historical image teaches the first synthetic image and the image feature teaches the first visual entity embedding. The Figure 11 process teaches an image generation model. This teaches that the first entity identifier, first synthetic image, and first visual entity embedding are stored in the sequence element library and provided to the image generation model in order to generate the second synthetic image),
wherein the first entity in the first synthetic image is continuous with the first entity in the second synthetic image (Paragraph 219 teaches “image-text similarities between the historical elements of all element types … and the current image … are respectively calculated” and Paragraph 221 teaches “a similarity between an image feature of an image (the historical image) generated based on a previous prompt and a current image (the retained image of the second prompt) is calculated”. This teaches comparing the entities in the first and second images. Paragraph 223 teaches the image with the highest fusion score or similarity score is selected as the second synthetic image to be output. Having a high similarity score teaches the entities match).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the system of generating contextually-persistent images taught by Zeng with generating the second synthetic image using the first entity identifier, first visual entity embedding, and first synthetic image taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
16. Regarding claim 16, Zeng in view of Guo teaches the limitations of claim 15. However, Zeng is not relied on for the following claim language: the system wherein the operations further include utilizing an image tagging model, in connection with the visual entity embedding extraction model, to determine the first visual entity embedding of the first entity identifier within the first synthetic image.
Guo teaches the system wherein the operations further include utilizing an image tagging model, in connection with the visual entity embedding extraction model, to determine the first visual entity embedding of the first entity identifier within the first synthetic image (Paragraph 197 and Figure 10 teach the image feature can be obtained through a CLIP model used on the generated image. The CLIP model teaches a visual entity embedding model and an image tagging model that generates visual entity embeddings or image features. Applicant does not define the image tagging model, so it can be interpreted to be the same as the visual entity embedding extraction model, as both determine the first visual entity embedding; Paragraphs 200-202 teach extracting image features through the CLIP model and comparing them to the noun elements in the prompts. The noun elements are entities. Thus, the visual entity embeddings are generated through the visual entity embedding model for the entities detected in the digital images; Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the first visual entity embedding and the first entity identifier is the prop-element. Thus, a visual entity embedding is generated for entities detected in the images).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the system of generating contextually-persistent images taught by Zeng with the image tagging model taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
17. Regarding claim 19, Zeng teaches a computer-implemented method for generating contextually-persistent images across a text document, comprising (Abstract teaches keeping global consistency across scenes when generating a sequence of images from a multi-sentence paragraph or text document):
generating a set of entity identifiers for a set of entities identified in the text document (Section 3.2 teaches creating a vector for each word and measuring the importance of words. The vectors for each word are the set of entity identifiers for a set of entities identified);
identifying a set of semantic text chunks from the text document (Section 3.1 teaches using an aligned sentence encoder to identify sentences and extract semantic vectors for each sentence. The sentence is a semantic text chunk. The encoder identifies all sentences so all the sentences identified are a set of semantic text chunks);
associating the set of entity identifiers within the set of semantic text chunks (Section 3.3 teaches passing the semantic text chunk and entity identifiers into the context encoder’s GRU layer. Combining each semantic text chunk of the set with its entity identifiers associates the entity identifiers within the set of semantic text chunks);
for a text chunk of the set of semantic text chunks, generating a synthetic image by providing the text chunk with one or more entity identifiers associated with entities in the text chunk (Section 3, Paragraph 1 teaches after using the text chunk, or sentence, and the entity identifier, or word feature w_t, a synthetic image is generated by the image generator; Figure 2 teaches for each text chunk, the model uses the text chunk and word vector to create a synthetic image x̂);
and providing, throughout the text document, a set of synthetic images having a common artistic style and contextually-persistent entity (Figure 2 teaches images being generated that maintain a common artistic style and a contextually-persistent entity by having the Text2Gist and GRU latent vectors passed into each generation of the next image to be synthesized; Section 4.3 teaches evaluating the model for visual quality, consistency, and relevance, which ensures a common artistic style and a contextually-persistent entity).
However, Zeng is not relied on for the following claim language: generating a synthetic image by providing the text chunk with the visual entity embeddings associated with the entities in the text chunk to an image generation model.
Guo teaches for a text chunk of the set of semantic text chunks, generating a synthetic image by providing the text chunk with one or more entity identifiers associated with entities in the text chunk and visual entity embeddings associated with the entities in the text chunk to an image generation model (Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the visual entity embedding and the entity identifier is the prop-element. Thus, Guo teaches the visual entity embedding is associated with an entity identifier, which are entities in a text chunk; Paragraph 218 teaches “when the noun element in the second prompt appears in the corresponding retained image, whether the noun element exists in historical elements (the noun elements of the first prompt) stored in the sequence element library is queried. If the noun element exists, cumulative correlations of elements are scored based on the retained image.” The historical element or noun elements of the first prompt teaches entity identifiers associated with entities in the text chunk; Paragraph 221 teaches “a similarity between an image feature of an image (the historical image) generated based on a previous prompt and a current image (the retained image of the second prompt) is calculated” and Paragraph 223 teaches based on the similarity scores, the second synthetic image is chosen and output. The second synthetic image is then “stored in the sequence element library 1101 as a new historical image.” The image feature teaches the first visual entity embedding, which is associated with entities in the text chunk. The Figure 11 process teaches an image generation model. Thus, Guo teaches that the entity identifier and visual entity embeddings are stored in the sequence element library and provided to the image generation model in order to select and generate the second synthetic image).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with the visual entity embeddings taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
18. Regarding claim 20, Zeng in view of Guo teaches the limitations of claim 19. However, Zeng is not relied on for the following claim language: the method further comprising associating a first visual entity embedding from the visual entity embeddings with a first entity from the set of entities and a first entity identifier from the set of entity identifiers in an entity table.
Guo teaches the method further comprising associating a first visual entity embedding from the visual entity embeddings with a first entity from the set of entities and a first entity identifier from the set of entity identifiers in an entity table (Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the first visual entity embedding and the first entity identifier is the prop-element. The sequence element library teaches an entity table. Thus, Guo teaches the first visual entity embedding is associated with a first entity identifier in an entity table).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with the entity table taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
19. Claims 2 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, as applied to claims 1 and 15 above, and further in view of Nanda et al. (“Story Visualization: Generation of Scenes Sequentially for a Given Fable Using NPL and Image Processing” – cited in IDS), hereinafter referred to as Nanda.
20. Regarding claim 2, Zeng in view of Guo teaches the limitations of claim 1. Zeng further teaches the method further comprising generating a set of entity identifiers including the first entity identifier from the text document utilizing an entity recognition model (Section 3.2 teaches generating a set of entity identifiers through the attentional word encoder which creates vectors for relevant or important words in the text chunk. The word vectors are a set of entity identifiers).
However, Zeng and Guo are not relied on for the following claim language: wherein the set of entity identifiers correspond to people, objects, or places within the text document, and wherein the first entity identifier is associated with a sub-entity identifier corresponding to a characteristic or attribute of the first entity.
Nanda teaches the method further comprising generating a set of entity identifiers including the first entity identifier from the text document utilizing an entity recognition model, wherein the set of entity identifiers correspond to people, objects, or places within the text document, and wherein the first entity identifier is associated with a sub-entity identifier corresponding to a characteristic or attribute of the first entity (Section 6 ‘Implementation’, Subsection A, Paragraph 3-4 teaches extracting nouns, verbs, adjectives from the text using an NLP model or entity recognition model. The extracted entities are placed in a JSON object which has an entry for the entity like “tiger” and sub-entity identifiers like their action, position, orientation, color and more which correspond to the entity’s characteristics and attributes).
Zeng, Guo, and Nanda are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo with the sub-entity identifiers taught in Nanda in order to generate scenes with necessary attributes for the story (Nanda Section 4, Paragraph 3).
21. Regarding claim 18, Zeng in view of Guo teaches the limitations of claim 15. However, Zeng and Guo are not relied on for the following claim language: the system wherein the operations further include re-writing the set of semantic text chunks to resolve co-reference terms.
Nanda teaches the system wherein the operations further include re-writing the set of semantic text chunks to resolve co-reference terms (Section 6 ‘Implementation’, Subsection A, Paragraph 2 teaches using the StanfordCoreNLP method to perform co-reference resolution which replaces pronouns with their corresponding references in the fragments or text chunks).
Zeng, Guo, and Nanda are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the system of generating contextually-persistent images taught by Zeng in view of Guo with the co-reference resolution taught in Nanda in order to generate scenes with necessary attributes for the story (Nanda Section 4, Paragraph 3).
22. Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, and Nanda et al. (“Story Visualization: Generation of Scenes Sequentially for a Given Fable Using NPL and Image Processing” – cited in IDS), hereinafter referred to as Nanda, as applied to claim 2 above, and further in view of Hearst (“TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages”).
Regarding claim 3, Zeng in view of Guo and Nanda teach the limitations of claim 2. However, Zeng, Guo, and Nanda are not relied on for the following claim language: the method further comprising generating the set of semantic text chunks from the text document utilizing a semantic text chunking model that determines how to separate portions of the text document based on semantic differences.
Hearst teaches the method further comprising generating the set of semantic text chunks from the text document utilizing a semantic text chunking model that determines how to separate portions of the text document based on semantic differences (Abstract teaches a semantic text chunking model called ‘TextTiling’ which separates portions of the text into text chunks based on subtopic shifts or semantic differences; Page 40, Section 4 teaches that lexical items or vocabulary changes when the subtopic changes. Thus, detecting a lexical change is a semantic difference; Page 43, Section 4.1 teaches grouping sentences, or separating portions of the text, into blocks when they have a high lexical score. Low lexical scores would indicate a gap or subtopic change; Page 51, Section 5.3 teaches segmenting boundaries based on depth scores which depend on lexical scores. The segmented boundaries can represent the separated portions of the text document based on semantic differences).
Zeng, Guo, and Nanda are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Hearst is considered analogous to the claimed invention because it is in the same field of organizing text semantically. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo and Nanda with the semantic text chunking model as taught by Hearst in order to enable better text analysis tasks and detect when a subtopic has shifted (Hearst Abstract and Introduction Paragraphs 1-2).
23. Claims 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, Nanda et al. (“Story Visualization: Generation of Scenes Sequentially for a Given Fable Using NPL and Image Processing” – cited in IDS), hereinafter referred to as Nanda, and Hearst (“TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages”) as applied to claim 3 above, and further in view of Sawane et al. (“An Approach to Extract the Relation and Location from the Short Stories”), hereinafter referred to as Sawane.
24. Regarding claim 4, Zeng in view of Guo, Nanda, and Hearst teach the limitations of claim 3. However, Zeng, Guo, and Hearst are not relied on for the following claim language: the method further comprising re-writing the first text chunk utilizing a semantic text recharacterization model to mark entities with corresponding entity identifiers, resolving co-reference terms, and removing non-contextual information.
Nanda teaches the method further comprising re-writing the first text chunk utilizing a semantic text recharacterization model (Section 6 ‘Implementation’, Subsection A, Paragraph 2 teaches using StanfordCoreNLP, a semantic text recharacterization model, to perform co-reference resolution which replaces pronouns with their corresponding references in the fragments or text chunks).
Zeng, Guo, and Nanda are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Hearst is considered analogous to the claimed invention because it is in the same field of organizing text semantically. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo and Hearst with the co-reference resolution taught in Nanda in order to generate scenes with necessary attributes for the story (Nanda Section 4, Paragraph 3).
However, Zeng, Guo, Nanda, and Hearst are not relied on for the following claim language: re-writing the first text chunk utilizing a semantic text recharacterization model to mark entities with corresponding entity identifiers, and removing non-contextual information.
Sawane teaches re-writing the first text chunk utilizing a semantic text recharacterization model to mark entities with corresponding entity identifiers, and removing non-contextual information (Section 4.5 and Figure 6 teaches marking the text chunk with entity identifiers; Section 4.6 teaches rewriting the text chunk into a summary that compresses the text without changing meaning. This is removing non-contextual information by only retaining the important words or sentences).
Zeng, Guo, and Nanda are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Hearst is considered analogous to the claimed invention because it is in the same field of organizing text semantically. Sawane is considered analogous to the claimed invention because it is in the same field of extracting features or entities in the text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo, Nanda, and Hearst with the markings and removal of non-contextual information as taught by Sawane in order to reduce the source text into an exact form (Sawane Introduction Paragraph 1).
25. Regarding claim 5, Zeng in view of Guo, Nanda, Hearst, and Sawane teach the limitations of claim 4. Zeng further teaches the method further comprising generating the first synthetic image using an image generation model based on the first text chunk and the first entity identifier associated with the first entity (Section 3, Paragraph 1 teaches using an image generator that takes in as inputs the text chunk, or sentence, and the first entity identifier, which is the word vector w_t; Figure 2 teaches the first text chunk and word vector or first entity identifier generates an image x̂_1 through the image generator).
26. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, as applied to claim 8 above, and further in view of Chen et al. ("Character-Centric Story Visualization via Visual Planning and Token Alignment"), hereinafter referred to as Chen.
Regarding claim 9, Zeng in view of Guo teaches the limitations of claim 8. However, Zeng and Guo are not relied on for the following claim language: the method wherein a first instance of a person in the first synthetic image associated with the first entity identifier is contextually persistent with a second instance of the person in the second synthetic image based on the image generation model using the first visual entity embedding from the first synthetic image when generating the second instance of the person in the second synthetic image.
Chen teaches the method wherein a first instance of a person in the first synthetic image associated with the first entity identifier is contextually persistent with a second instance of the person in the second synthetic image based on the image generation model using the first visual entity embedding from the first synthetic image when generating the second instance of the person in the second synthetic image (Section 4.2 ‘Character Region Extraction’ subsection teaches extracting characters and creating tokens, or visual entity embeddings, from an image. This can be done on the first synthetic image; Section 4.2 ‘Visual Token Completion’ subsection teaches using visual tokens for entities to create the image; Figure 11 shows in the VP-CSV row that the person in the first column is continuous with the second column with the same identity ‘Betty’).
Zeng, Guo, and Chen are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo with the use of the visual entity embedding to ensure the person is continuous between two images, as taught by Chen, in order to preserve the characters essential in the stories when generating images (Chen Abstract).
27. Claims 11 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, as applied to claims 1 and 15 above, and further in view of Hossain et al. (“Text to Image Synthesis for Improved Image Captioning”), hereinafter referred to as Hossain.
28. Regarding claim 11, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng is not relied on for the following claim language: determining the first visual entity embedding from the first synthetic image by: generating tag candidate entities in the first synthetic image; generating a first image caption from the first synthetic image; comparing the first image caption to the first text chunk to determine a correlation between a first tag candidate entity and the first entity; and associating the first visual entity embedding generated for the first tag candidate entity with the first entity identifier.
Guo teaches the method of determining the first visual entity embedding from the first synthetic image by: generating tag candidate entities in the first synthetic image (Paragraph 197 and Figure 10 teach the image feature can be obtained through a CLIP model used on the generated image. The CLIP model teaches a visual entity embedding model that generates image features. The CLIP image features teach the tag candidate entity and visual entity embedding; Paragraphs 200-202 teach extracting image features through the CLIP model and comparing them to the noun elements in the prompts. The noun elements are entities. Thus, the visual entity embeddings are generated through the visual entity embedding model for the entities detected in the digital images; Paragraph 213 teaches an association format in the sequence element library may be: prop-element 1-cat-[prompt ID-image ID-image feature]. The image feature is the visual entity embedding and tag candidate entity generated for the prop-element. The prop-element is the first entity identifier).
Zeng and Guo are considered analogous to the claimed invention because both are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng with the determination of the first visual entity embedding taught by Guo in order to generate a sequence of images that maintain the consistency of the previous image (Guo Paragraph 116).
However, Zeng and Guo are not relied on for the following claim language: generating a first image caption from the first synthetic image and comparing the first image caption to the first text chunk to determine a correlation between a first tag candidate entity and the first entity.
Hossain teaches generating a first image caption from the first synthetic image (Section 1, Paragraph 7 teaches creating captions for synthetic images; Figure 1 teaches creating a caption output from a synthetically generated image and its image features or tag candidate entities x_t) and comparing the first image caption to the first text chunk to determine a correlation between a first tag candidate entity and the first entity (Section 4, Subsection B1 ‘Qualitative Analysis’ teaches comparing the generated caption to the original caption, or first text chunk. Comparing the two captions includes noting which words are present in the first text chunk and which are present in the first image caption. Comparing the captions and words includes determining a correlation between the entities and the tag candidate entities since the tag candidate entities x_t shown in Figure 1 create the output captions; Section 4, Subsection B2 ‘Quantitative Analysis’ teaches comparing the generated captions to the ground-truth captions using BLEU metrics).
Zeng, Guo, and Hossain are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo with the generation of captions taught by Hossain in order to generate captions that can help visually impaired people understand the context of the image (Hossain Introduction Paragraph 1).
29. Regarding claim 17, Zeng in view of Guo teaches the limitations of claim 15. However, Zeng and Guo are not relied on for the following claim language: the system wherein the operations further include utilizing an image captioner model, in connection with the visual entity embedding extraction model, to generate a caption of the first synthetic image and determine the first visual entity embedding of the first entity identifier within the first synthetic image.
Hossain teaches wherein the operations further include utilizing an image captioner model, in connection with the visual entity embedding extraction model, to generate a caption of the first synthetic image (Section 1, Paragraph 7 teaches creating captions for synthetic images using the caption generator model; Figure 1 teaches creating a caption output from a synthetically generated image using the caption generation module) and determine the first visual entity embedding of the first entity identifier within the first synthetic image (Section 4, ‘Implementation details’ subsection, Paragraph 1 teaches using a CNN in the caption generation module or image captioner model to extract the image feature vectors. These image feature vectors can be considered visual entity embeddings. Thus, the image captioner model can also be considered to be a visual entity embedding extraction model; Figure 1 teaches image features being extracted by the image encoder within the caption generation module which is the image captioner model).
Zeng, Guo, and Hossain are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the system of generating contextually-persistent images taught by Zeng in view of Guo with the image captioner model taught by Hossain in order to generate captions that can help visually impaired people understand the context of the image (Hossain Introduction Paragraph 1).
30. Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, as applied to claim 1 above, and further in view of Chowdhury et al. (“Story-Oriented Image Selection and Placement”), hereinafter referred to as Chowdhury.
Regarding claim 12, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng and Guo are not relied on for the following claim language: the method further comprising: providing the first synthetic image in a first location of the text document corresponding to the first text chunk; and providing the second synthetic image in a second location of the text document corresponding to the second text chunk.
Chowdhury teaches the method further comprising: providing the first synthetic image in a first location of the text document corresponding to the first text chunk; and providing the second synthetic image in a second location of the text document corresponding to the second text chunk (Section 4.2 teaches aligning an image with a text unit or text chunk and then placing it within the text document; Section 5.2 teaches placing images throughout the text document. Thus, one of the images placed is the first synthetic image and the text chunk it is placed next to is the first text chunk. Another of the images placed is the second synthetic image and the text chunk it is placed next to is the second text chunk).
Zeng, Guo, and Chowdhury are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo with the image placement taught by Chowdhury in order to build a multimodal commentary for digital consumption that consists of a combination of words and pictures (Chowdhury Abstract and Introduction Paragraph 1).
31. Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Zeng et al. ("PororoGAN: An Improved Story Visualization Model on Pororo-SV Dataset" -- cited in IDS), hereinafter referred to as Zeng, in view of Guo et al. (U.S. Patent Application Publication No. 2025/0259466 A1), hereinafter referred to as Guo, as applied to claim 1 above, and further in view of Chowdhury et al. (“Story-Oriented Image Selection and Placement”), hereinafter referred to as Chowdhury, and Hearst (“TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages”).
Regarding claim 13, Zeng in view of Guo teaches the limitations of claim 1. However, Zeng and Guo are not relied on for the following claim language: the method further comprising analyzing the text document for semantic changes that satisfy an image location threshold to determine where in the text document to place synthetic images.
Chowdhury teaches the method further comprising analyzing the text document for semantic changes (Section 1, Paragraph 7 teaches the SANDI method taught by Chowdhury captures the semantic coherence between paragraphs to place images; Section 5.2 teaches placing images throughout the text document based on semantic similarity).
Zeng, Guo, and Chowdhury are considered analogous to the claimed invention because all are in the same field of generating coherent images based on text. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo with the image placement taught by Chowdhury in order to build a multimodal commentary for digital consumption that consists of a combination of words and pictures (Chowdhury Abstract and Introduction Paragraph 1).
However, Zeng, Guo, and Chowdhury are not relied on for the following claim language: analyzing the text document for semantic changes that satisfy an image location threshold.
Hearst teaches the method further comprising analyzing the text document for semantic changes that satisfy an image location threshold (Abstract teaches a semantic text chunking model called ‘TextTiling’ which separates portions of the text into text chunks based on subtopic shifts or semantic differences; Page 40, Section 4 teaches that lexical items or vocabulary change when the subtopic changes. Thus, detecting a lexical change is detecting a semantic difference; Page 43, Section 4.1 teaches grouping sentences, or separating portions of the text, into blocks when they have a high lexical score. Low lexical scores would indicate a gap or subtopic change; Page 51, Section 5.3 teaches segmenting boundaries based on depth scores which depend on lexical scores. The segmented boundaries can represent the separated portions of the text document based on semantic differences; Page 52, Section 5.5 teaches a cutoff for depth scores to have boundaries between text chunks).
Zeng, Guo, and Chowdhury are considered analogous to the claimed invention because all are in the same field of outputting images based on text. Hearst is considered analogous to the claimed invention because it is in the same field of organizing text semantically. Thus, it would have been obvious to a person having ordinary skill in the art before the effective filing date to modify the method of generating contextually-persistent images taught by Zeng in view of Guo and Chowdhury with the analysis of semantic changes satisfying a threshold as taught by Hearst in order to enable better text analysis tasks and detect when a subtopic has shifted (Hearst Abstract and Introduction Paragraphs 1-2).
Conclusion
32. The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Gong et al. (“TaleCrafter: Interactive Story Visualization with Multiple Characters”) teaches using tokens for identified characters and passing in the first generated image in order to generate the second image.
33. Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHRISTINE Y AHN whose telephone number is (571)272-0672. The examiner can normally be reached M-F 8-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alicia Harrington can be reached at (571)272-2330. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHRISTINE YERA AHN/Examiner, Art Unit 2615
/ALICIA M HARRINGTON/Supervisory Patent Examiner, Art Unit 2615