DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
The application claims priority to foreign application CN202311118784.3, filed 08/31/2023. Receipt is acknowledged of the certified copies of the priority papers required by 37 CFR 1.55. Priority under 35 U.S.C. 119(a)-(d) is acknowledged.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 6, 8-10, 15, and 17-19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shin (KR 20220046373 A) in view of Mou et al. ("T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models". arXiv preprint (20 Mar 2023). https://arxiv.org/abs/2302.08453, hereinafter "Mou") and Xie et al. (US 20220138249 A1, hereinafter "Xie").
Regarding claim 1, Shin teaches: A comic image generation method, comprising:
acquiring pose text information corresponding to a comic storyboard, the pose text information being used for describing target action poses of at least one target object ([0177] and [0178] describe how “basic comic information” may be input as text;
[0184] describes how the comic text is divided into sections corresponding to individual panels, analogous to a storyboard;
[0187] describes keywords being extracted from the comic text;
[0203] describes that keywords can indicate “expression items” or “action items”;
[0223] explains that “expression items” and “action items” correspond to facial expressions and body poses, respectively); and
generating comic images corresponding to the comic storyboard according to the pose text information, target objects in the comic images having the target action poses ([0223] describes generating an image for a character using an action object image indicating a pose;
[0226] describes placing the character image into the scene;
see figs. 10-11 and paragraphs [0233] to [0234] for completed examples).
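Examiner's note: for clarity of the record, the pipeline cited above (the text is split into one section per panel, keywords are extracted, and tagged object images are selected) can be summarized by the following minimal Python sketch. The sketch is the examiner's own paraphrase of Shin's disclosure, not code from Shin; all identifiers and the toy database are hypothetical.

    # Illustrative paraphrase of Shin's keyword-driven pipeline
    # (Shin [0184]-[0187], [0198]-[0203]); all names are hypothetical.
    import re

    # Hypothetical stand-in for Shin's tagged object-image database (300).
    OBJECT_DB = {
        "background": {"coffee shop": "bg_coffee_shop.png"},
        "person": {"man": "char_man.png"},
        "action": {"wave": "pose_wave.png"},
    }

    def split_into_panels(comic_text):
        # One comic panel per sentence (Shin [0184]-[0185]).
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", comic_text) if s.strip()]

    def extract_keywords(sentence):
        # Shin [0187]: at least one keyword is extracted per sentence.
        # Naive stand-in: match the sentence against known object tags.
        text = sentence.lower()
        return [tag for tags in OBJECT_DB.values() for tag in tags if tag in text]

    def select_object_images(keywords):
        # Shin [0198]-[0203]: keywords are compared against object tags
        # to select background, character, and action object images.
        return [tags[kw] for tags in OBJECT_DB.values() for kw in keywords if kw in tags]

    for panel in split_into_panels("A man waits in a coffee shop. He gives a wave."):
        print(panel, "->", select_object_images(extract_keywords(panel)))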
Shin does not explicitly teach: generating comic images corresponding to the comic storyboard using an artificial intelligence model according to the pose text information and the pose assistance image.
Mou teaches: generating an image using an artificial intelligence model according to the pose text information and the pose assistance image (figs. 1, 6, and 8 each provide an example of an image generated based on a text prompt and a reference pose; fig. 3 shows the structure of the artificial intelligence model).
Mou is analogous to the claimed invention because it pertains to the same issue of generating an image from a text prompt and pose input. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Shin with the teachings of Mou to use an artificial intelligence model to generate the comic image, rather than relying on a database of pre-drawn objects as taught by Shin. The motivation would have been to generate a more unique or customized image for each comic panel, as a regular reader would likely notice the reuse of assets that results from Shin’s database-driven method.
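Examiner's note: the combination proposed above (a text prompt plus a pose map conditioning a diffusion model) corresponds to Mou's publicly released T2I-Adapter. A minimal sketch using the Hugging Face diffusers port is given below for illustration only; the checkpoint names are assumptions based on the public release and form no part of the claim mapping.

    # Minimal sketch of Mou's adapter conditioning (Mou fig. 3), assuming
    # the Hugging Face diffusers port and the public TencentARC checkpoint.
    import torch
    from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
    from PIL import Image

    adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_openpose_sd14v1")
    pipe = StableDiffusionAdapterPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", adapter=adapter, torch_dtype=torch.float32
    )

    pose_map = Image.open("pose_assistance_image.png")  # keypose/skeletal map
    image = pipe(
        prompt="a man waving in a coffee shop, comic style",  # pose text information
        image=pose_map,                                       # pose assistance image
        num_inference_steps=30,
    ).images[0]
    image.save("comic_panel.png")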
The combination of Shin in view of Mou is not relied upon to teach: determining a pose assistance image corresponding to the comic storyboard according to the pose text information, reference action poses of a reference object in the pose assistance image matching the target action poses.
Xie teaches: determining a pose assistance image corresponding to the comic storyboard according to the pose text information, reference action poses of a reference object in the pose assistance image matching the target action poses (fig. 5, keyword query (“pose text information”) is used to search for reference images with poses corresponding to the keywords, and a virtual mannequin showing the pose is generated from a selected reference image and displayed on the screen; see paragraphs [0089] to [0094] for a detailed explanation).
Xie is analogous to the claimed invention because it pertains to the issue of generating human poses from text input. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Shin in view of Mou to implement the teachings of Xie as a pose assistance image selection system. The motivation would have been to add a pose assistance image to improve a user’s control over the output of the image generation model, and to add an interface to select the pose generation image which prioritizes accuracy, efficiency, and flexibility (Xie [0002] to [0004]).
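Examiner's note: the retrieval step of Xie (a keyword query ranks pose-tagged reference images) is the tag-matching pattern sketched below; the library contents and scoring are the examiner's hypothetical stand-ins for Xie's pose search system 102.

    # Hypothetical sketch of Xie's keyword pose query (Xie fig. 5).
    POSE_LIBRARY = [
        {"file": "ref_001.jpg", "tags": {"standing", "hands overhead", "one leg"}},
        {"file": "ref_002.jpg", "tags": {"sitting", "drinking coffee"}},
    ]

    def search_pose_images(query_keywords, top_k=1):
        # Rank reference images by keyword/tag overlap; the selected image
        # supplies the reference action poses for the virtual mannequin.
        ranked = sorted(POSE_LIBRARY,
                        key=lambda entry: len(entry["tags"] & query_keywords),
                        reverse=True)
        return [entry["file"] for entry in ranked[:top_k]]

    print(search_pose_images({"hands overhead", "standing"}))  # ['ref_001.jpg']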
Regarding claim 6, the combination of Shin in view of Mou and Xie teaches: The method according to claim 1, wherein the pose assistance image comprises a skeletal point image (Xie fig. 5 virtual mannequin); and the determining the pose assistance image corresponding to the comic storyboard according to the pose text information comprises:
acquiring an initial three-dimension (3D) skeletal point model (Xie [0094] “In one or more embodiments, the pose search system 102 utilizes a 2D-to-3D neural network to generate a three-dimensional virtual mannequin. Indeed, the pose search system 102 utilizes the 2D-to-3D neural network to generate or determine a three-dimensional pose of a selected digital image (e.g., as selected in the act 506) and to generate a three-dimensional virtual mannequin with three-dimensional, manipulable joints and segments arranged according to the determined pose.”);
performing pose adjustment on the initial 3D skeletal point model according to the target action poses of at least one target object to obtain a target 3D skeletal point model corresponding to each target object (Xie [0092] “In addition, the pose search system 102 generates a virtual mannequin from the determined pose of the selected digital image. Indeed, the pose search system 102 generates a virtual mannequin that includes manipulable joints (as indicated by dots or circles) and segments (as indicated by line segments connecting the dots or circles) at locations corresponding to locations of joints and segments of the selected digital image. As shown in FIG. 5, for instance, the virtual mannequin includes joints and segments arranged according to the pose of the human figure in the selected digital image from the act 506 (e.g., with the hands overhead, standing on one leg).”); and
determining the skeletal point image according to the target 3D skeletal point model (Xie fig. 5, the virtual mannequin is displayed on the screen).
The motivation to combine the additional teachings of Xie with the invention of Shin in view of Mou would have been similar to the motivations described for claim 1.
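Examiner's note: the sequence mapped above (acquire a 3D skeletal point model, adjust its joints to the target pose, render the result as a 2D skeletal point image) is illustrated by the following hypothetical sketch; the joint set, adjusted coordinates, and pinhole camera are the examiner's assumptions, not Xie's disclosure.

    # Illustrative sketch of claim 6 as mapped: adjust a 3D skeletal point
    # model, then project it into a 2D skeletal point image.
    import numpy as np

    # Initial 3D skeletal point model: joint name -> (x, y, z).
    skeleton = {
        "head": np.array([0.0, 1.7, 0.0]),
        "hand_l": np.array([-0.4, 1.0, 0.0]),
        "hand_r": np.array([0.4, 1.0, 0.0]),
    }

    # Pose adjustment per the target action pose (e.g., "hands overhead").
    skeleton["hand_l"] = np.array([-0.2, 2.0, 0.0])
    skeleton["hand_r"] = np.array([0.2, 2.0, 0.0])

    def project_to_2d(joint, focal=1.0, camera_depth=3.0):
        # Simple pinhole projection of a 3D joint onto the image plane.
        x, y, z = joint
        return (focal * x / (z + camera_depth), focal * y / (z + camera_depth))

    # 2D skeletal points to be drawn as the pose assistance image.
    points_2d = {name: project_to_2d(j) for name, j in skeleton.items()}
    print(points_2d)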
Regarding claim 8, the combination of Shin in view of Mou and Xie teaches: The method according to claim 1, wherein the determining the pose assistance image corresponding to the comic storyboard according to the pose text information comprises:
determining pose types of the target action poses described in the pose text information (Shin [0198] to [0199] “To select an object image, the database lookup module (122) can compare the information of the keyword and the object tag to select an object image related to the keyword. For example, when the keyword is ‘man’, the database lookup module (122) determines the type of object as ‘person’ and can select one of the object images related to ‘man’ from the detailed item ‘gender item’.”;
[0203] describes that keywords can indicate “expression items” or “action items”;
[0223] explains that “expression items” and “action items” correspond to facial expressions and body poses, respectively); and
determining the pose assistance image according to the pose text information in a case where the pose types meet a preset type condition, wherein the preset type condition is used for indicating a pose type that the target action poses need to conform to when the pose assistance image needs to be used (Shin, see previously cited paragraphs [0198] and [0223], also fig. 13;
Shin thus suggests that the type of object image, corresponding to the claimed “pose assistance image”, is selected based on a “preset type condition”, namely whether a keyword corresponds to a facial expression or to a body/action image).
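Examiner's note: the suggested reading, under which a pose assistance image is fetched only when a keyword is of a pose-related type, may be paraphrased by the hypothetical sketch below; the keyword-to-type mapping stands in for Shin's "expression item" and "action item" classifications.

    # Illustrative paraphrase of the "preset type condition" reading of
    # Shin [0198]/[0223]: fetch a pose image only for pose-type keywords.
    POSE_TYPES = {"expression", "action"}  # stand-ins for Shin's detailed items

    def keyword_pose_type(keyword):
        # Hypothetical classifier mapping a keyword to its detailed item type.
        if keyword in {"smile", "frown"}:
            return "expression"
        if keyword in {"wave", "run", "sit"}:
            return "action"
        return "other"

    def maybe_fetch_pose_image(keyword):
        pose_type = keyword_pose_type(keyword)
        if pose_type in POSE_TYPES:          # preset type condition is met
            return f"pose_assets/{pose_type}/{keyword}.png"
        return None                          # no pose assistance image needed

    print(maybe_fetch_pose_image("wave"))   # pose_assets/action/wave.png
    print(maybe_fetch_pose_image("table"))  # None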
Regarding claim 9, the combination of Shin in view of Mou and Xie teaches: The method according to claim 1, further comprising:
acquiring background text information corresponding to the comic storyboard, the background text information being used for indicating a target image background corresponding to the comic storyboard (Shin [0202] “At this time, the object image selected for the corresponding comic panel may be at least one of the background object image, character object image, prop object image, and speech bubble object image of the corresponding panel.”
[0203] “Therefore, if the keywords for the comic panel are 'coffee shop' and 'man', the object image can be selected from among the object images classified as 'background' using the keyword 'coffee shop' by considering the 'place item', 'location item' and 'color item', and the object image can be selected from among the object images classified as 'person' using the keyword 'man' by considering the 'categorization item', 'expression item' and 'action item'.” – “background text information” is identified from the set of keywords);
wherein the generating comic images corresponding to the comic storyboard using the artificial intelligence model according to the pose text information and the pose assistance image comprises:
generating the comic images using the artificial intelligence model according to the pose text information (see claim 1), the pose assistance image (see claim 1), and the background text information (Shin [0206] “At this time, the background object image may be selected randomly or by considering weights for the background that indicate the degree of association with the keyword.”),
wherein the target objects in the comic image have the target action poses (see claim 1) and the comic images have the target image background (see Shin figs. 12 and 14 for examples).
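Examiner's note: in the combination, the background text information is simply an additional text condition supplied to the model alongside the pose inputs; under the diffusers sketch for claim 1 above, it would be appended to the prompt. A hypothetical one-step illustration:

    # Hypothetical prompt assembly for claim 9: background text information
    # is combined with the pose text information before invoking the model.
    def build_prompt(pose_text, background_text):
        return f"{pose_text}, in {background_text}, comic style"

    prompt = build_prompt("a man waving", "a coffee shop")
    # -> "a man waving, in a coffee shop, comic style"; passed to the image
    # generation pipeline together with the pose assistance image.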
Claim(s) 2-4, 11-13, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shin (KR 20220046373 A) in view of Mou ("T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models") and Xie (US 20220138249 A1) as applied to claims 1, 10, and 19 above, and further in view of Kim (KR 20230114415 A).
Regarding claim 2, the combination of Shin in view of Mou and Xie teaches: The method according to claim 1, wherein determining the pose text information comprises:
acquiring a target novel to be converted into a comic (Shin [0162] “At this time, the comic information input screen (W11) is a screen for inputting comic information, such as words or sentences related to the comic content to be produced, and the comic information input may be at least one word or at least one sentence.”; this description implicitly includes longer texts such as novels);
splitting the target novel to obtain comic text information of each comic storyboard corresponding to the target novel (Shin [0183] to [0185] “Therefore, the input analysis module (242) can separate the basic comic information into sentence units and store it in the storage unit (13) as the final comic information (S110).
Then, the input analysis module (242) can extract the number of sentences from the final comic information, determine the number of comic panels of the comic content using the extracted number of sentences, and then store the determined number of comic panels in the storage unit (13).
At this time, the sequence number of the comic panels can be determined according to the order of each sentence, and the sequence number of the comic panels can be stored corresponding to the order of the sentences.”), the comic text information containing text keywords of the comic storyboard (Shin [0187] “In this way, when the number of comic panels for the final comic information is determined, the input analysis module (242) can extract at least one keyword for each of the remaining sentences starting from the first sentence corresponding to the first comic panel and store it in the storage unit (13).”) under multiple image generation dimensions (Shin [0202] to [0203] gives several examples of dimensions/categories for image generation prompted by the keywords: “At this time, the object image selected for the corresponding comic panel may be at least one of the background object image, character object image, prop object image, and speech bubble object image of the corresponding panel.
Therefore, if the keywords for the comic panel are 'coffee shop' and 'man', the object image can be selected from among the object images classified as 'background' using the keyword 'coffee shop' by considering the 'place item', 'location item' and 'color item', and the object image can be selected from among the object images classified as 'person' using the keyword 'man' by considering the 'categorization item', 'expression item' and 'action item'.”); and
generating the pose text information corresponding to the comic storyboard according to the text keywords in the comic text information ([0203] describes that keywords can indicate “expression items” or “action items”;
[0223] explains that “expression items” and “action items” correspond to facial expressions and body poses, respectively;
where the keywords may also be considered the “pose text information” because they are used to directly select the pose images).
The combination of Shin in view of Mou and Xie does not explicitly teach: splitting the target novel according to novel content of the target novel to obtain comic text information of each comic storyboard corresponding to the target novel – instead, Shin describes a simple split of one comic panel per sentence.
Kim teaches: splitting the target novel according to novel content of the target novel to obtain comic text information of each comic storyboard corresponding to the target novel ([0029] “The scenario analysis unit (110) can analyze the background composition of the scenario, the composition of paragraphs or sections according to the beginning, middle, and end, and the composition of characters including the protagonist. The scenario analysis unit (110) can divide the scenario into multiple cut units based on the analysis results.”;
[0031] “The storyboard composition unit (120) can compose a storyboard for a webtoon based on the analysis results of the scenario analysis unit (110), that is, on the scenario for each of the multiple cuts divided by the scenario analysis unit (110). The storyboard composition unit (120) can generate a text storyboard for each of a plurality of cuts and, based on the text storyboard, can compose an image storyboard containing one or more objects per cut.”).
Kim is analogous to the claimed invention because it is in the same field of generating a comic from text input. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Shin in view of Mou and Xie with the teachings of Kim to divide the text input into panels based on the content of the story, rather than individual sentences. The motivation would have been to create a more logical flow to the comic, condensing and emphasizing plot points rather than putting equal emphasis on every sentence.
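Examiner's note: the difference between Shin's per-sentence split and Kim's content-based split may be made concrete with the hypothetical sketch below; the paragraph-break heuristic is merely a stand-in for Kim's scenario analysis unit (110).

    # Hypothetical contrast of the two split strategies for claim 2:
    # Shin [0184], one panel per sentence; Kim [0029], split by content.
    import re

    def split_per_sentence(novel):
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", novel) if s.strip()]

    def split_by_content(novel):
        # Stand-in for Kim's scenario analysis: start a new cut when the
        # background or cast changes (here, naively, at paragraph breaks).
        return [p.strip() for p in novel.split("\n\n") if p.strip()]

    novel = "He entered the shop. He ordered.\n\nOutside, it began to rain."
    print(len(split_per_sentence(novel)))  # 3 panels, one per sentence (Shin)
    print(len(split_by_content(novel)))    # 2 cuts, one per scene (Kim)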
Regarding claim 3, the combination of Shin in view of Mou and Xie and further in view of Kim teaches: The method according to claim 2, wherein the generating the pose text information corresponding to the comic storyboard according to the text keywords in the comic text information comprises:
screening out at least one target keyword related to the target action poses of the target object from the text keywords ([0198] to [0199] teaches that keywords are associated with a particular object category: “To select an object image, the database lookup module (122) can compare the information of the keyword and the object tag to select an object image related to the keyword. For example, when the keyword is ‘man’, the database lookup module (122) determines the type of object as ‘person’ and can select one of the object images related to ‘man’ from the detailed item ‘gender item’.”;
[0223] teaches that there is a separate object category for “action object images”, which show body poses); and
generating the pose text information according to at least one target keyword (the target keywords may also be considered the “pose text information” because they are used to directly select the pose images).
Regarding claim 4, the combination of Shin in view of Mou and Xie and further in view of Kim teaches: The method according to claim 2, wherein the determining the pose assistance image corresponding to the comic storyboard according to the pose text information comprises:
searching for the pose assistance image matching the target action poses of at least one target object from a picture information library according to the pose text information and/or the text keywords corresponding to the pose text information (Shin [0057] to [0058] “The database lookup module (122) receives a keyword transmitted from the user terminal (200) through the control module (121), searches for an object image suitable for the keyword in the database unit (300), and transmits object image identification information for the searched object image to the user terminal (200) through the control module (121).
The operation of such database query module (122) can be performed through artificial intelligence learning.”;
[0203] describes that keywords can indicate “expression items” or “action items”;
[0223] explains that “expression items” and “action items” correspond to facial expressions and body poses, respectively).
Claim(s) 5 and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shin (KR 20220046373 A) in view of Mou ("T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models") and Xie (US 20220138249 A1) as applied to claims 1 and 10 above, and further in view of Peshkov et al. (US 20240135619 A1, hereinafter "Peshkov").
Regarding claim 5, the combination of Shin in view of Mou and Xie teaches: The method according to claim 1, wherein the generating comic images corresponding to the comic storyboard using the artificial intelligence model according to the pose text information and the pose assistance image comprises:
calling the object generation model to obtain a target object image (Mou fig. 3, pre-trained diffusion model generates an image given a text input);
performing image recognition on the pose assistance image using a pose recognition model to obtain the reference action poses (Mou fig. 3, trainable adapters add additional conditioning to the diffusion model; pg. 4 subsection “Structure controlling” describes how an adapter can perform keypose recognition: “Our proposed T2I-Adapter has a good generalization to support various structural control, including sketch, depth map, semantic segmentation map, and keypose. The condition maps of these modes are directly input into task-specific adapters to extract condition features Fc”); and
generating the comic images using the artificial intelligence model according to the target object image, the reference action poses, and the pose text information (Mou figs. 1, 6, and 8 each provide an example of an image generated based on a text prompt and a reference pose; fig. 1 shows that other types of reference/guide inputs are possible, including images).
The combination of Shin in view of Mou and Xie does not explicitly teach performing the above method in response to the existence of an object generation model matching the target object.
Peshkov teaches a system of avatar models for simulated video communication, where the system checks whether an avatar model already exists for a particular user so that it need not generate a new one ([0030] “In some embodiments, avatar models obtained by a user device can be stored locally in a storage location, database, memory, or cache accessible to the user device. When the user device receives a request to initiate an avatar communication session with a remote user, the user device can determine whether the storage location, database, memory, or cache includes an avatar model for the remote user. If so, that avatar model may be used for the avatar communication session.”).
Peshkov is analogous to the claimed invention because it pertains to the same issue of generating images of people and storing associated models. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Shin in view of Mou and Xie with the teachings of Peshkov to reuse stored models for character generation. The motivation would have been to improve efficiency by saving on repeat processing.
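Examiner's note: the Peshkov-style check, under which a stored generation model is reused when one already exists for the target object, is the cache-lookup pattern sketched below; all identifiers are hypothetical.

    # Hypothetical cache-lookup pattern for claim 5, mirroring Peshkov
    # [0030]: build an object generation model only when none is stored.
    _model_cache = {}

    def get_object_generation_model(target_object):
        if target_object in _model_cache:      # a matching model exists: reuse it
            return _model_cache[target_object]
        model = train_object_model(target_object)  # hypothetical expensive step
        _model_cache[target_object] = model
        return model

    def train_object_model(target_object):
        return f"model<{target_object}>"  # placeholder for an expensive build

    get_object_generation_model("man")  # trains and caches the model
    get_object_generation_model("man")  # cache hit; no repeat processing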
Claim(s) 7 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Shin (KR 20220046373 A) in view of Mou ("T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models") and Xie (US 20220138249 A1) as applied to claims 1 and 10 above, and further in view of Fei et al. (US 20180285636 A1, hereinafter "Fei").
Regarding claim 7, the combination of Shin in view of Mou and Xie teaches: The method according to claim 1, wherein the target action poses comprise a first action pose of the target object and a second action pose when the target object interacts with props; the reference action poses comprise a first reference pose matching the first action pose and a second reference pose matching the second action pose (Shin figs. 11 and 15, panels may include multiple characters and therefore multiple action poses; one or both characters may be interacting with props such as the flowers in fig. 11 or the coffee in fig. 15; see claim 1 for discussion concerning identifying a pose based on keywords and generating a reference pose image); and
the generating comic images corresponding to the comic storyboard using the artificial intelligence model according to the pose text information and the pose assistance image comprises:
generating the comic images corresponding to the comic storyboard according to the first reference pose, the second reference pose and the pose text information, wherein the target objects in the comic image have the first action pose and the second action pose (Shin figs. 11 and 15 show finished comic images; see claim 1 for discussion concerning using a machine learning model for image generation).
The combination of Shin in view of Mou and Xie does not explicitly teach:
the generating comic images corresponding to the comic storyboard using the artificial intelligence model according to the pose text information and the pose assistance image comprises:
performing skeletal point recognition on the reference object in the pose assistance image using an artificial intelligence model by calling a skeletal point recognition function, to obtain the first reference pose, and performing contour recognition on the reference object in the pose assistance image by calling a contour recognition function, to determine the second reference pose.
Fei teaches identification of the pose of a human arm holding an object including:
performing skeletal point recognition on the reference object in the pose assistance image using an artificial intelligence model by calling a skeletal point recognition function, to obtain the first reference pose (fig. 7; [0042] “(block 504) in response to determining that the physical hand holds an object: executing a 3D Hand Skeleton Joints And Robust Object Pose Recognition algorithm to determine 3D skeleton joints of the physical hand in 26 degrees-of-freedom and a 3D object pose of the object in the current frame”), and performing contour recognition on the reference object in the pose assistance image by calling a contour recognition function, to determine the second reference pose (fig. 7; [0043] explains that the object pose recognition algorithm relies on Gengyu et al. “Methods and systems for 3D contour recognition and 3D mesh generation” (US 20180025540 A1), which is incorporated by reference into Fei; this application is included in the “References Cited” section of this office action).
Fei is analogous to the claimed invention because it pertains to the same issue of identifying a pose. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the invention of Shin in view of Mou and Xie to use the methodology of Fei to determine the pose of a person interacting with an object. The motivation would have been to be able to handle poses of humans interacting with props rather than just humans on their own, in order to generate more varied/immersive comic scenes.
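Examiner's note: the two recognition steps mapped above may be sketched as follows; the keypoint detector is a hypothetical stand-in (Mou's keypose adapter or Fei's 3D skeleton algorithm would fill that role in the combination), while the contour step uses standard OpenCV calls of the kind relied on by Fei and Gengyu.

    # Illustrative sketch of the claim 7 mapping: skeletal point recognition
    # for the body pose and contour recognition for the held prop.
    import cv2

    def detect_skeleton_keypoints(image):
        # Hypothetical stand-in for a learned keypose detector yielding the
        # first reference pose as named 2D joints.
        return {"hand_r": (120, 80)}  # placeholder; a real detector goes here

    def detect_object_contour(image):
        # Contour recognition for the prop interaction (second reference
        # pose), cf. Fei [0043]; standard OpenCV contour extraction.
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        return max(contours, key=cv2.contourArea) if contours else None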
Regarding claims 10-18, they are rejected with the same references, rationale, and motivation to combine as claims 1-9 respectively because their limitations substantially correspond to the limitations of claims 1-9 respectively, with the additional limitations of:
A computer device (Shin fig. 2), comprising a processor (Shin [0054] “The service control unit (12) is a part that controls the overall operation of the comic production service server (100) and may be a processor made in the form of a chip.”) and a memory (Shin [0066] “The storage unit (13) is a storage medium that stores data required for the operation of the service control unit (12) or data generated during the operation, and may be a memory.”), wherein the memory stores machine-readable instructions executable by the processor, the processor is used for executing the machine-readable instructions stored in the memory, and when the machine-readable instructions are executed by the processor, the processor executes a comic image generation method (Shin [0069] to [0070] describes executing the comic production application on a standard computational device such as a smartphone or PC, suggesting that its instructions are stored in memory and executed by the device’s processor).
Regarding claims 19-20, they are rejected with the same references, rationale, and motivation to combine as claims 1-2 respectively because their limitations substantially correspond to the limitations of claims 1-2 respectively, with the additional limitations of:
A non-transitory computer-readable storage medium (Shin [0067] lists several possible memory types including non-transitory memory), wherein a computer program is stored on the non-transitory computer-readable storage medium (Shin [0066] “The storage unit (13) is a storage medium that stores data required for the operation of the service control unit (12) or data generated during the operation, and may be a memory.”), and when the computer program is run on a computer device, the computer device executes a comic image generation method (Shin [0069] “The user terminal (200) may be an electronic device equipped with hardware that communicates with the comic service server (100) using a communication network to perform data transmission and reception operations, and executes the comic production application after the comic production application provided by the comic production service server (100) is installed.”).
References Cited
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Zhang et al. ("Adversarial Synthesis of Human Pose from Text". DAGM German Conference on Pattern Recognition (17 Mar 2021). https://link.springer.com/chapter/10.1007/978-3-030-71278-5_11) teaches a method of using a neural network to generate a human pose using only a text description as input.
Babanin et al. (US 20250054210 A1) teaches a method of using a neural network to generate an image using a text description and a pose reference image (see fig. 8).
Gengyu et al. (US 20180025540 A1) is incorporated by reference into Fei and teaches a method of contour recognition, which is used by Fei to identify the pose of an object.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN STATZ whose telephone number is (571)272-6654. The examiner can normally be reached Mon-Fri 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at (571)272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BENJAMIN TOM STATZ/Examiner, Art Unit 2611
/TAMMY PAIGE GODDARD/Supervisory Patent Examiner, Art Unit 2611