DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Obligation Under 37 CFR 1.56 – Joint Inventors
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Response to Amendment
Applicant’s amendment filed on September 9, 2025 has been entered. Claims 1, 2, 4, 6, 9-14, 16, 18 and 20 have been amended. No new claims have been added. Thus, claims 1-20 are still pending in this application, with claims 1, 13 and 20 being independent.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
Determining the scope and contents of the prior art;
Ascertaining the differences between the prior art and the claims at issue;
Resolving the level of ordinary skill in the pertinent art; and
Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-2, 13-14 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over HONG et al. (Hong F, Zhang M, Pan L, Cai Z, Yang L, Liu Z. “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.” arXiv preprint arXiv:2205.08535. 2022 May 17 – hereinafter, “HONG”) in view of well-known prior art (OFFICIAL NOTICE IS TAKEN).
Regarding claim 13, HONG discloses a system, comprising:
one or more processors (p. 161:7, 2nd paragraph: “32GB GPU”), the one or more processors configured to (p. 161:1, 1st paragraph: “Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users” p. 161:1, 1st paragraph: “Our key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation. Specifically, driven by natural language descriptions, we initialize 3D human geometry generation with a shape VAE network.” NOTE: In other words, “AvatarCLIP” is software executing on one or more processors. Furthermore, one of ordinary skill would understand that 3D computer graphics driven by neural networks inherently requires one or more processors.):
receive a natural language request (p. 161:1, 1st paragraph: “driven by natural language descriptions”) to generate a visualization of a wearer (e.g., such as each of the generated visualizations of the wearers shown in Fig. 1.) wearing one or more garments in a given pose (e.g., As shown in Fig. 1, each of the visualizations is wearing one or more garments in a given pose corresponding to the received natural language descriptions. For instance, the natural language request, “I want to generate a skinny ninja that is raising arms” generates a skinny avatar wearing ninja garments in a raised-arm pose.), the natural language request comprising a description of the given pose and the one or more garments (p. 161:2, § 1, 1st paragraph: “AvatarCLIP, which is capable of generating and animating 3D avatars solely from natural language descriptions as shown in Fig. 1.” See the example natural language requests in Fig. 1, e.g., “I want to generate a skinny ninja that is raising arms.” and/or “I want to generate an overweight sumo wrestler that is sitting.”);
execute, based on the received natural language request, a machine learning model (p. 161:1, 1st paragraph: “driven by natural language descriptions, we initialize 3D human geometry generation with a shape VAE network.”), wherein executing comprises:
adjusting a pose of a wearer based on the given pose (p. 161:3, § 3, 1st paragraph: “The inputs are natural languages, text = {𝑡shape, 𝑡app, 𝑡motion}. The three texts correspond to the descriptions of the desired body shape, appearance and motion. The output is two-part, including a) an animatable 3D avatar represented as a mesh 𝑀 = {𝑉 , 𝐹,𝐶}, where 𝑉 is the vertices, 𝐹 stands for faces, 𝐶 represents the vertex colors; b) a sequence of poses Θ = {𝜃𝑖 } 𝐿 𝑖=1 comprising the desired motion, where 𝐿 is the length of the sequence.” p. 161:4, Fig 2 caption: “Fig. 2. An Overview of the Pipeline of AvatarCLIP. The whole pipeline is divided into two parts: a) Static Avatar Generation; b) Motion Generation. Assume the user want to generate ‘a tall and fat Iron Man that is running’. An animatable avatar is generated guided by 𝑡shape = ‘a tall and fat man’ and 𝑡app = ‘Iron Man’. Then a motion sequence matching the description 𝑡motion = ‘running’ is generated to animate the generated avatar.” p. 161:5, 1st paragraph: “For the second part, the natural language description of the motion 𝑡motion is used to generate candidate poses from a pre-calculated code-book. Then the candidate poses are used as references for the optimization of a pre-trained motion VAE to get the desired motion sequence.”); and
conditioning, using a respective machine learning layer of the machine learning model (p. 161:5, § 3.3.2, 2nd paragraph: “Both 𝑓 (𝑝) and 𝑐(𝑝) are implemented using MLPs.”), the one or more garments to the adjusted pose of the wearer (p. 161:4, § 3.2, 1st paragraph: “Guided by the appearance description 𝑡app, 𝑁 is further optimized by CLIP in a shape-preserving way for shape sculpting and texture generation.” p. 161:5, § 3.3.2: “To generate high-quality 3D avatars, the shape and texture need to be further sculpted and generated to match the description of the appearance 𝑡app. As discussed previously, we choose to use an implicit representation, i.e. NeuS, as the base 3D representation in this step for its advantages in both geometry and colors.” p. 161:5, § 3.3.2, 2nd paragraph: “Specifically, as shown in Fig. 4, the NeuS 𝑁 = {𝑓 (𝑝), 𝑐(𝑝)} comprises of two subnetworks. The SDF network 𝑓 (𝑝) takes some point 𝑝 as input and outputs the signed distance to its nearest surface. The color network 𝑐(𝑝) takes some point 𝑝 as input and outputs the color at that point. Both 𝑓 (𝑝) and 𝑐(𝑝) are implemented using MLPs.” NOTE: In other words, the garments are textured (i.e., “conditioned”) onto the 3D avatar poses by MLP (i.e., multi-layer perceptron) subnetworks. See Fig. 2.); and
provide for output (e.g., See Fig. 1, Fig. 2 and Fig. 11 – which show outputs of garments on 3D avatars in poses, wherein the output 3D avatar body shapes, garments and poses correspond to the shape, appearance, and poses (motion) described in each corresponding natural language input.).
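As an aid to understanding the mapping above, the cited two-subnetwork NeuS representation 𝑁 = {𝑓(𝑝), 𝑐(𝑝)} can be sketched as a pair of small MLPs. The layer widths, activations, and random weights below are illustrative assumptions only and do not reproduce HONG's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def mlp(x, weights):
    # Minimal multi-layer perceptron: hidden layers use tanh, output is linear.
    for w in weights[:-1]:
        x = np.tanh(w @ x)
    return weights[-1] @ x

# Hypothetical widths: 3-D query point -> hidden width 16 -> output.
w_sdf = [rng.standard_normal((16, 3)), rng.standard_normal((1, 16))]    # f(p)
w_color = [rng.standard_normal((16, 3)), rng.standard_normal((3, 16))]  # c(p)

p = np.array([0.1, -0.2, 0.3])   # a point in space
sdf = mlp(p, w_sdf)              # signed distance to the nearest surface
rgb = mlp(p, w_color)            # color at the point
```

Each query point is mapped by the SDF network to a signed distance and by the color network to an RGB value, consistent with the passage quoted from p. 161:5, § 3.3.2.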
HONG fails to disclose a distributed processing system having a user device for providing natural language input to the one or more processors and receiving output from the one or more processors.
However, before the effective filing date of the claimed invention, distributed processing systems comprising user device clients and connected remote computer server processors were very well known in the art. OFFICIAL NOTICE IS TAKEN.
Thus, in order to enable user devices having lower processing power to perform the computer-implemented method described by HONG, it would have been obvious to one of ordinary skill in the art to have implemented the system taught by HONG in a distributed computing environment comprising a client user device providing natural language inputs to and receiving corresponding 3D animated outputs from one or more server processors, as is very well-known in the art.
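The distributed arrangement relied upon above can be sketched as a minimal client/server round trip, in which a user device submits the natural language request and receives the generated output from one or more remote processors. The endpoint, payload format, and placeholder pipeline below are assumptions for illustration; neither HONG nor the rejection specifies a particular network API:

```python
import json, threading, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class AvatarHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        request_text = self.rfile.read(length).decode("utf-8")
        # Server side: stand-in for the AvatarCLIP pipeline (shape VAE,
        # CLIP-guided sculpting, motion VAE) running on the server processors.
        body = json.dumps({"request": request_text,
                           "mesh": "<mesh>",
                           "motion": "<pose sequence>"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), AvatarHandler)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the user device submits the natural language request.
url = f"http://127.0.0.1:{server.server_address[1]}/generate"
req = urllib.request.Request(
    url, data=b"a skinny ninja that is raising arms", method="POST")
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

This split lets a user device of lower processing power delegate generation to the server, the rationale stated above.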
Regarding claim 14 (depends on claim 13), HONG discloses:
wherein when the one or more garments includes two or more garments, each of the two or more intended garments are conditioned simultaneously to the adjusted pose of the user (As shown at least in Fig. 1, the output female soldier avatar has two or more garments, i.e., a hat, a shirt, and pants. Also see Fig. 11.).
Regarding claim 1, claim 1 is directed to the method implemented by the system of claim 13 and, as such, is rejected for the same reasons applied above in the rejection of claim 13.
Regarding claim 2, claim 2 is directed to the method implemented by the system of claim 14 and, as such, is rejected for the same reasons applied above in the rejection of claim 14.
Regarding claim 20, claim 20 is directed to a non-transitory computer-readable medium storing instructions, which when executed by one or more processors, cause the one or more processors to execute the method of claim 1, and, as such, claim 20 is rejected for the same reasons applied above in the rejection of claim 1.
Claims 4-6 and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over HONG et al. (Hong F, Zhang M, Pan L, Cai Z, Yang L, Liu Z. “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.” arXiv preprint arXiv:2205.08535. 2022 May 17) in view of well-known prior art, further in view of GUPTA et al. (US 2024/0257470, hereinafter “GUPTA”).
Regarding claim 16 (depends on claim 13), whereas HONG may not be entirely explicit as to the following limitations, GUPTA further teaches:
the natural language request further comprises environmental details (¶ [0005]: “The operations include to receive a text prompt describing a scene and interactions of the avatar within the scene,” ¶ [0072]: “The text prompt may describe a scene and interactions of an avatar within the scene.”), and
wherein the output further includes an environment for the wearer based on the environmental details (¶ [0006]: “receive a text input describing an avatar in a scene, generating a body pose based on the text prompt, re-targeting the body pose to a target avatar body, generating a personalized avatar based on a predefined avatar configuration being applied to the target avatar body, and generating an image of the avatar in the scene based on the personalized avatar, the image including the avatar being in the generated body pose with the predefined avatar configuration.” ¶ [0021]: “Finally, an image of the avatar contextualized in a scene is generated using an image generation model conditioned on the input text (e.g., describing the avatar's action and the scene) and the rendered avatar.” ¶ [0047]: “The scene generator 260 generates a personalized image of a user's avatar in a scene corresponding to the input text. In some implementations, the personalized image includes the users' avatar interacting with objects in the scene. The scene generator 260 may utilize the personalized image generation model to generate the image of the user's avatar and/or the scene including the avatar. Given a rendered image of a posed avatar and text prompt describing the scene and interactions, the scene generator 260 performs conditional stable diffusion inpainting to generate a personalized image by outpainting from the avatar to fill in the rest of the scene and objects. The personalized image generation model is conditioned on the rendered avatar in the target pose as well as the input text prompt describing the avatar's action and the scene.”).
Thus, in order to obtain a more user friendly and versatile system, it would have been obvious to one of ordinary skill in the art to have modified the 3D avatar system taught by HONG so as to incorporate the functionality of generating an output including an environment for the outputted avatar based on the natural language request further including environmental details, as further taught by GUPTA.
Regarding claim 17 (depends on claim 16), GUPTA further discloses:
the environmental details correspond to at least one of lighting, texture, background (e.g., ¶ [0047]: “the scene”; ¶ [0047]: “outpainting from the avatar to fill in the rest of the scene and objects” NOTE: In other words, the text prompt describes the scene, which is the background for the user’s avatar.), color, angle, or image filter (¶ [0006]: “receive a text input describing an avatar in a scene, generating a body pose based on the text prompt, re-targeting the body pose to a target avatar body, generating a personalized avatar based on a predefined avatar configuration being applied to the target avatar body, and generating an image of the avatar in the scene based on the personalized avatar, the image including the avatar being in the generated body pose with the predefined avatar configuration.” ¶ [0021]: “Finally, an image of the avatar contextualized in a scene is generated using an image generation model conditioned on the input text (e.g., describing the avatar's action and the scene) and the rendered avatar.” ¶ [0047]: “The scene generator 260 generates a personalized image of a user's avatar in a scene corresponding to the input text. In some implementations, the personalized image includes the users' avatar interacting with objects in the scene. The scene generator 260 may utilize the personalized image generation model to generate the image of the user's avatar and/or the scene including the avatar. Given a rendered image of a posed avatar and text prompt describing the scene and interactions, the scene generator 260 performs conditional stable diffusion inpainting to generate a personalized image by outpainting from the avatar to fill in the rest of the scene and objects. The personalized image generation model is conditioned on the rendered avatar in the target pose as well as the input text prompt describing the avatar's action and the scene.”).
Regarding claim 18 (depends on claim 16), GUPTA further discloses:
executing the machine learning model further comprises adjusting an output environment based on the environmental details (¶ [0047]: “The scene generator 260 generates a personalized image of a user's avatar in a scene corresponding to the input text. In some implementations, the personalized image includes the users' avatar interacting with objects in the scene. The scene generator 260 may utilize the personalized image generation model to generate the image of the user's avatar and/or the scene including the avatar. Given a rendered image of a posed avatar and text prompt describing the scene and interactions, the scene generator 260 performs conditional stable diffusion inpainting to generate a personalized image by outpainting from the avatar to fill in the rest of the scene and objects. The personalized image generation model is conditioned on the rendered avatar in the target pose as well as the input text prompt describing the avatar's action and the scene.” ¶ [0050]: “the personalized image generation model trains a convolutional neural network (UNet) on a learned latent space of an image autoencoder.” ¶ [0079]: “The system 500 may further include one or more modules configured to generate an image of a scene in a virtual environment based on the personalized avatar image and the text prompt using the personalized image generation model. The scene may include the avatar interacting with objects in the scene in accordance with the text prompt. In some implementations, the rendered image of the avatar in the target pose may be overlapped with or a pasted onto the image of the scene. As described, the personalized image generation model may perform conditional stable diffusion inpainting to generate the image by outpainting from the avatar to fill in an existing scene and objects in the existing scene.”).
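The outpainting behavior quoted from GUPTA ¶ [0047] can be illustrated with a minimal compositing sketch: avatar pixels are kept, and the remainder of the frame is filled with generated scene content. The constant-value fill below is a stand-in assumption; GUPTA performs conditional stable diffusion inpainting:

```python
import numpy as np

# Toy 8x8 grayscale "render": the avatar occupies the centre region,
# and a boolean mask marks avatar pixels.
avatar = np.zeros((8, 8))
mask = np.zeros((8, 8), dtype=bool)
avatar[2:6, 3:5] = 1.0
mask[2:6, 3:5] = True

def outpaint(avatar_img, avatar_mask, scene_value):
    # Keep avatar pixels; fill everything outside the mask with generated
    # scene content (here a constant stand-in for the diffusion model).
    scene = np.full_like(avatar_img, scene_value)
    return np.where(avatar_mask, avatar_img, scene)

composed = outpaint(avatar, mask, scene_value=0.5)
```

The output image thus includes an environment for the wearer surrounding the unchanged rendered avatar, matching the mapping of claim 18 above.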
Regarding claims 4-6, claims 4-6 are directed, respectively, to the method(s) implemented by the system(s) of claims 16-18, and, as such, claims 4-6 are rejected for the same reasons applied above in the rejection of claims 16-18, respectively.
Claims 3, 7-8, 10-12, 15 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over HONG et al. (Hong F, Zhang M, Pan L, Cai Z, Yang L, Liu Z. “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.” arXiv preprint arXiv:2205.08535. 2022 May 17) in view of well-known prior art, further in view of GUPTA et al. (US 2024/0257470), and still further in view of COLBERT (US 2019/0287301).
Regarding claim 15 (depends on claim 14), whereas HONG and GUPTA might not be entirely explicit as to the following limitations, COLBERT nevertheless teaches:
the one or more processors, when executing the machine learning model (FIG. 7), are further configured to simultaneously condition, using respective machine learning layers of the machine learning model (See FIG. 7, e.g., layer 721, 731, layer 722, 732, …, and layer 729, 739), the two or more intended garments (FIG. 7: jacket 711, pants 712, blouse 713, …, and shoes 719) to the adjusted pose of the wearer (e.g., ¶ [0038]: An embodiment of the invention may combine a garment selection control with controls for other aspects of image generation (skin tone, pose, body shape, accessories, etc.) Then, by manipulating individual controls of this plurality of controls, the user can change the garment (leaving the skin tone, pose, and accessories alone), or switch among accessories (leaving skin tone, pose and garment alone).” ¶ [0048]: “When the input networks are separated in this manner, they can be used separately and trained/re-trained independently. This arrangement also simplifies the identification of elements of the various Z vectors that affect aspects of the output image. These elements may be identified via Principal Component Analysis, PCA, as described above. Elements that control useful characteristics of the synthetic image may be described as “results-effective.” For example, one Z vector element (or a set of covarying elements) may be effective to control model skin tone. Other results-effective variables may change the model's pose (e.g. turning left or right, or changing arm or leg position, etc.) Some results-effective variables may control characteristics of an item of apparel—for example, the softness or stiffness of the fabric drape, the length of a sleeve or hem, or the height of a shoe heel.” ) (See FIG. 7. ¶ [0046]: “two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. 
(The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. images from the front and side.) Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . , 759 through the compression-like process described earlier.” ¶ [0047]: “From these Z-vectors, a plurality of output networks 731, 732, . . . , 739, operated independently according to the basic inventive principles, could be trained to generate synthetic images involving a model and their respective input(s). The “garment” network could produce an image of a model wearing the input garment, and the “shoe” network could produce an image of a model's leg (or a pair of legs) wearing the input shoe. But in this preferred embodiment, the separate Z vectors 751, 752, . . . , 759 are combined (e.g., concatenated) to form a combined Z-vector 760, which is delivered to a multi-element image synthesizer neural network 770. This output network creates an image of a model wearing the garment and the shoes (plus any other accessories whose images are provided to other input networks and concatenated into the composite Z vector 760, for example pants 712 whose image was provided through input neural network 722).” NOTE: As shown in FIG. 7, each network layer, i.e., layer 721, 731, layer 722, 732, layer 723, 733, etc., provides garment layers corresponding to jacket 711 layer, pants 712 layer and blouse 713 layer outputted in corresponding layers in the outputted image of the model wearing the different garment layers, i.e., the blouse 713 inside the jacket 711, and the pants 712 overlapped by the jacket 711 in the output.
¶ [0051]: “It is appreciated that adding new categories of inputs multiplies the complexity of the training problem and the number of training images necessary for the neural networks to be able to recognize and synthesize the desired images. For example, with one dress, one shoe and one handbag, it is preferable to present “model” images with the dress only (model barefoot and without the bag), model with shoes but no bag, model with bag but no shoes, and model with shoes and bag. If the system is to learn and synthesize multiple model poses, then even more model images may be required—the requirement for training images grows exponentially in the number of articles of apparel and model shapes, sizes and poses that the system is desired to handle.”), the respective machine learning layers being different machine learning layers (¶ [0046]: “two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. (The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. images from the front and side.) Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . , 759 through the compression-like process described earlier.” ¶ [0047]: “From these Z-vectors, a plurality of output networks 731, 732, . . . , 739, operated independently according to the basic inventive principles, could be trained to generate synthetic images involving a model and their respective input(s). The “garment” network could produce an image of a model wearing the input garment, and the “shoe” network could produce an image of a model's leg (or a pair of legs) wearing the input shoe. 
But in this preferred embodiment, the separate Z vectors 751, 752, . . . , 759 are combined (e.g., concatenated) to form a combined Z-vector 760, which is delivered to a multi-element image synthesizer neural network 770. This output network creates an image of a model wearing the garment and the shoes (plus any other accessories whose images are provided to other input networks and concatenated into the composite Z vector 760, for example pants 712 whose image was provided through input neural network 722).” NOTE: As shown in FIG. 7, each network layer, i.e., layer 721, 731, layer 722, 732, layer 723, 733, etc., provides garment layers corresponding to jacket 711 layer, pants 712 layer and blouse 713 layer outputted in corresponding layers in the outputted image of the model wearing the different garment layers, i.e., the blouse 713 inside the jacket 711, and the pants 712 overlapped by the jacket 711 in the output.).
Thus, in order to obtain a virtual try-on system having the cumulative features and/or functionalities taught by HONG, GUPTA and COLBERT, it would have been obvious to one of ordinary skill in the art to have modified the system taught by the combination of HONG and GUPTA so as to incorporate simultaneously conditioning, using different respective machine learning layers of the machine learning model, two or more intended garments to the adjusted pose of the wearer, as taught by COLBERT.
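The Z-vector arrangement quoted from COLBERT ¶¶ [0046]-[0047] can be sketched numerically: each per-garment input network compresses its input into a Z-vector, the separate Z-vectors are concatenated into a combined vector, and the combined vector drives a synthesizer network. All dimensions and weights below are illustrative assumptions; COLBERT does not specify layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image_vec, weights):
    # Stand-in for one per-garment input network (e.g., 721 or 722):
    # compresses its input into a Z-vector.
    return np.tanh(weights @ image_vec)

# Hypothetical dimensions: 64-element garment images, 8-element Z-vectors.
img_dim, z_dim = 64, 8
jacket, pants, shoes = (rng.standard_normal(img_dim) for _ in range(3))
w_jacket, w_pants, w_shoes = (rng.standard_normal((z_dim, img_dim)) for _ in range(3))

# Separate Z-vectors (cf. 751, 752, ..., 759) ...
z = [encode(jacket, w_jacket), encode(pants, w_pants), encode(shoes, w_shoes)]
# ... concatenated into the combined Z-vector (cf. 760) ...
z_combined = np.concatenate(z)
# ... delivered to the multi-element image synthesizer network (cf. 770).
w_synth = rng.standard_normal((32, z_combined.size))
image = np.tanh(w_synth @ z_combined)
```

Because every garment's Z-vector is present in the single combined vector when the synthesizer runs, the garments are handled together in one forward pass, the basis for the "simultaneously condition" mapping above.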
Regarding claim 19 (depends on claim 13), whereas HONG and GUPTA may not be explicit as to the following limitations, COLBERT teaches:
the machine learning model comprises one or more machine learning layers (See FIG. 7. ¶ [0046]: “two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. (The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. images from the front and side.) Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . , 759 through the compression-like process described earlier.” ), each machine learning layer corresponds to a type of garment (See FIG. 7. ¶ [0046]: “two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. (The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. images from the front and side.) Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . , 759 through the compression-like process described earlier.”), and each machine learning layer, when executed, provides as output a corresponding garment layer (See FIG. 7. ¶ [0046]: “In a preferred embodiment, two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. (The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. 
images from the front and side.) Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . , 759 through the compression-like process described earlier.” ¶ [0047]: “From these Z-vectors, a plurality of output networks 731, 732, . . . , 739, operated independently according to the basic inventive principles, could be trained to generate synthetic images involving a model and their respective input(s). The “garment” network could produce an image of a model wearing the input garment, and the “shoe” network could produce an image of a model's leg (or a pair of legs) wearing the input shoe. But in this preferred embodiment, the separate Z vectors 751, 752, . . . , 759 are combined (e.g., concatenated) to form a combined Z-vector 760, which is delivered to a multi-element image synthesizer neural network 770. This output network creates an image of a model wearing the garment and the shoes (plus any other accessories whose images are provided to other input networks and concatenated into the composite Z vector 760, for example pants 712 whose image was provided through input neural network 722).” NOTE: As shown in FIG. 7, each network layer, i.e., layer 721, 731, layer 722, 732, layer 723, 733, etc., provides garment layers corresponding to jacket 711 layer, pants 712 layer and blouse 713 layer outputted in corresponding layers in the outputted image of the model wearing the different garment layers, i.e., the blouse 713 inside the jacket 711, and the pants 712 overlapped by the jacket 711 in the output.).
Thus, in order to obtain a more versatile virtual try-on system having the cumulative features and/or functionalities taught by HONG, GUPTA and COLBERT, it would have been obvious to one of ordinary skill in the art to have modified the system taught by the combination of HONG and GUPTA so as to incorporate one or more machine learning layers wherein each machine learning layer corresponds to a type of garment and each machine learning layer, when executed, provides as output a corresponding garment layer, as taught by COLBERT.
Regarding claim 3, claim 3 is directed to the method implemented by the system of claim 15, and, as such, claim 3 is rejected for the same reasons applied above in the rejection of claim 15.
Regarding claim 7, claim 7 is directed to the method implemented by the system of claim 19, and, as such, claim 7 is rejected for the same reasons applied above in the rejection of claim 19.
Regarding claim 8 (depends on claim 7), COLBERT further teaches:
the type of garment includes at least one of a lower body garment (e.g., FIG. 7: pants 712), an upper body garment (e.g., FIG. 7: jacket 711 and/or blouse 713 ), an accessory (e.g., FIG. 6: bag 611), or shoes (e.g., FIG. 7: shoes 719a,719b).
Regarding claim 10 (depends on claim 1), whereas HONG and GUPTA may not be entirely explicit as to the following limitations, COLBERT teaches:
receiving, by the one or more processors, a second natural language request (e.g., FIG. 7: pants input 712 and/or blouse input 713 being a second input – with garment input 711 being first input.); and
executing, by the one or more processors based on the received second natural language request, the machine learning model (e.g., executing the machine learning model (i.e., neural network) 700 shown in FIG. 7 ), wherein executing the machine learning model further comprises:
adjusting a depiction of a second garment in a second garment layer based on the second natural language request (e.g., as shown in FIG. 7, in the output of the trained model (i.e., neural network) 700, the pants 712 and/or blouse 713 are fit to the model and partially covered by garment (i.e., jacket) 711 in the output.).
Thus, in order to obtain a more versatile 3D avatar generation system having the cumulative features and/or functionalities taught by HONG, GUPTA and COLBERT, it would have been obvious to one of ordinary skill in the art to have modified the system taught by the combination of HONG and GUPTA so as to incorporate receiving a second input and adjusting a depiction of a second garment in a second garment layer based on the second input, as taught by COLBERT.
Regarding claim 11 (depends on claim 10), COLBERT further teaches:
adjusting the second garment in the second garment layer does not change the depiction of the one or more garments in a first garment layer (¶ [0038]: “by manipulating individual controls of this plurality of controls, the user can change the garment (leaving the skin tone, pose, and accessories alone), or switch among accessories (leaving skin tone, pose and garment alone).” ¶ [0046]: “In a preferred embodiment, two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. (The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. images from the front and side). Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . , 759 through the compression-like process described earlier.” ¶ [0047]: “From these Z-vectors, a plurality of output networks 731, 732, . . . , 739, operated independently according to the basic inventive principles, could be trained to generate synthetic images involving a model and their respective input(s). The “garment” network could produce an image of a model wearing the input garment, and the “shoe” network could produce an image of a model's leg (or a pair of legs) wearing the input shoe. But in this preferred embodiment, the separate Z vectors 751, 752, . . . , 759 are combined (e.g., concatenated) to form a combined Z-vector 760, which is delivered to a multi-element image synthesizer neural network 770. 
This output network creates an image of a model wearing the garment and the shoes (plus any other accessories whose images are provided to other input networks and concatenated into the composite Z vector 760, for example pants 712 whose image was provided through input neural network 722).” ¶ [0048]: “When the input networks are separated in this manner, they can be used separately and trained/re-trained independently. This arrangement also simplifies the identification of elements of the various Z vectors that affect aspects of the output image. These elements may be identified via Principal Component Analysis, PCA, as described above. Elements that control useful characteristics of the synthetic image may be described as “results-effective.” For example, one Z vector element (or a set of covarying elements) may be effective to control model skin tone. Other results-effective variables may change the model's pose (e.g. turning left or right, or changing arm or leg position, etc.) Some results-effective variables may control characteristics of an item of apparel—for example, the softness or stiffness of the fabric drape, the length of a sleeve or hem, or the height of a shoe heel.” NOTE: Changing the height of the heel of shoe 719a,b does not change the depiction of the garment 711. ¶ [0049]: “In this connection, it is appreciated that two especially useful results-effective variables control the body size or proportions of the (imaginary, synthesized) model, and the tightness of the clothes. These variables together allow the user to visualize the fit of different garment sizes.” ¶ [0058]: “Once the neural network has been trained (which may require thousands or tens of thousands of automatically-produced training images), photographic images of several real items of apparel are obtained (870). As mentioned earlier, these images should be made in conditions similar to those in which the training apparel items were prepared at 810. 
These photographic images are provided to the trained neural network, which delivers a corresponding synthetic image that appears to show a model dressed in the apparel of the photographic images (880). In this synthetic photo, some or all of the apparel may be real (their images having been provided to the neural network), but the model is not real. The neural network, through its training, is able to properly layer clothing, for example by drawing a jacket over a shirt, or shoes over socks. Finally, the synthetic image may be displayed (890).” NOTE: Clearly, each of the individual clothing inputs can be changed by the user. Thus, changing the shirt to be worn under the jacket (or socks inside the shoes) does not change the depiction of the jacket worn over the shirt (nor the depiction of the shoes worn over socks). ¶ [0059]: “Note that training networks with photorealistic renderings of 3D models of garments, shoes, accessories and the like is quite time- and resource-intensive, but the resulting trained networks can produce the desired synthetic images from inputs that are significantly easier and less-expensive to obtain. Once trained, the network can operate on basic “flat” images of garments (i.e., an image of a garment laid flat against a planar background), front-and-side images of shoes, and one or a few images of each accessory type. The training process can include learning “no accessory,” “no shoes” options, so that the synthesizer can generate those images, as well as images with any desired combination of shoes, bags, jewelry, and other features. 
Composite networks according to an embodiment of the invention may include input sub-networks to receive images and associated information including: Flat garment images, Mannikin garment images, Shoe images (preferably two or three views), Handbag images (preferably two or three views), Hat images, Necklace images, Bracelet images, Ring images, Scarf images, Necktie images” ¶ [0061]: “It is appreciated that in a multi-input neural network like that described here, one of the inputs may be an image of a person (either naked or in a neutral body suit), standing in a neutral pose. If this image is provided to the input recognizer, and the network is trained to produce output images that match, then the overall system can synthesize images that appear to be a specific individual person wearing a selection of apparel. Thus, instead of just showing a generic model (having controllable height, weight, skin tone, pose, and other characteristics), the synthetic image could show a particular individual whose neutral-pose photo is provided, wearing the same apparel. In this embodiment, it would still be possible to adjust the pose of the (real) model or her apparent height or weight, using the same Z-vector controls that affect any other synthesized image produced by an embodiment. Thus, the output is not merely a morphing or interpolation of the garments so as to occlude the model image. Instead, all of the input images (including the image of the real person) are “compressed” down to Z vectors, and the generative network synthesizes a new image based on the (possibly modified) combined Z vector. A clear distinguishing characteristic of such a synthesized image is that the model is in a different pose than the input model image. For example, the input model image may be standing straight and facing forward, but the synthesized image may be turned left or right, or may have its arms or legs in different positions. 
However, the Z vector can also be adjusted to produce images showing what the person might look like if she gained or lost weight. This usage model may be of particular value.” ¶ [0062]: “The inventor has identified one additional input-data type that can be provided to the network during training to improve the quality of synthesized images. This is “pose” data. When a 3D model of a human figure is used, it is relatively common to specify its pose via reference points for skeletal joints. For example, the position of the head, shoulders, elbows, wrists, hips, knees and ankles can only vary over a limited range governed by the nature of the relevant joints. By specifying angles, orientations and joint-to-joint distances, the training figure pose can be defined efficiently. If this information is provided during training with the garments, tags and other information, then the neural network can learn to produce synthetic images that depict the imaginary model in a variety of poses. It is even possible to produce images of a sequence of successive poses, which may be displayed successively to produce an animation of the model showing off the garments and accessories as in a fashion show. Body-part girth data may be used like pose or joint-position data to help the neural network learn to depict models of different weights.” ).
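For clarity of the mapping above, the multi-input arrangement of ¶¶ [0046]–[0047] — separate input networks each distilling one apparel image into a Z-vector, with the Z-vectors concatenated into a combined Z-vector for a single synthesizer — may be sketched as follows. This is an illustrative simplification only; the function names are hypothetical and do not appear in COLBERT, and seeded random projections stand in for the trained encoder networks:

```python
import numpy as np

def make_encoder(seed, z_dim=4):
    """Stand-in for one input network (e.g. 721 or 729): maps an
    apparel image to a fixed-length Z-vector. A real encoder would be
    a multi-layer neural network with decreasing node counts; a seeded
    random projection suffices here to illustrate the data flow."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((z_dim, 8))
    return lambda image: w @ image

# One encoder per apparel category (cf. garment 711, pants 712, shoe 719).
garment_enc = make_encoder(0)
pants_enc = make_encoder(1)
shoe_enc = make_encoder(2)

def combined_z(garment_img, pants_img, shoe_img):
    """Concatenate the per-category Z-vectors into the combined
    Z-vector delivered to the image synthesizer (cf. ¶ [0047])."""
    return np.concatenate([garment_enc(garment_img),
                           pants_enc(pants_img),
                           shoe_enc(shoe_img)])

garment = np.ones(8)
pants = np.ones(8)
shoe_a = np.ones(8)
shoe_b = 2 * np.ones(8)  # a different shoe, e.g. a different heel height

z1 = combined_z(garment, pants, shoe_a)
z2 = combined_z(garment, pants, shoe_b)

# Swapping the shoe changes only the shoe segment of the combined
# Z-vector; the garment and pants segments are bit-identical, so the
# depiction of those layers is driven by unchanged inputs.
assert np.array_equal(z1[:8], z2[:8])
assert not np.array_equal(z1[8:], z2[8:])
```

The sketch illustrates why adjusting one apparel input (the shoe) leaves the Z-vector segments of the other garment layers untouched, consistent with the claim 11 mapping.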
Regarding claim 12 (depends on claim 10), COLBERT further teaches:
the second natural language request comprises a description of at least one of: an addition of a second intended garment (¶ [0038]: “by manipulating individual controls of this plurality of controls, the user can change the garment (leaving the skin tone, pose, and accessories alone), or switch among accessories (leaving skin tone, pose and garment alone).”), removal of the one or more garments (¶ [0038]: “by manipulating individual controls of this plurality of controls, the user can change the garment (leaving the skin tone, pose, and accessories alone), or switch among accessories (leaving skin tone, pose and garment alone).” ), or an adjustment of an order of the layers of the one or more garments (¶ [0020]: “Now, in typical use, a garment image is provided to the trained network (130). The garment image need not be (and preferably is not) one of the training images, but instead is an image without a corresponding mate. The network synthesizes an image of this garment on a model, based on the network's training and parameters of the Z vector (140). The synthetic image is displayed to the user (150). The user may adjust a Z vector parameter (160) and a new image is synthesized (140) and displayed (150). The adjustment and re-synthesizing may be repeated as often as desired to produce a variety of synthetic images of a model wearing the garment.” ¶ [0036]: “When a customer visits the operator's system (e.g., when she accesses an e-commerce web site), she may search or browse the catalog of garments using any suitable prior-art method (430). For example, garments may be grouped and presented by color, style, weight, designer, size, price, or any other desired arrangement. When the user selects a garment (440), the system synthesizes and displays an image of the garment on a model (450). The user may be offered an array of controls which are connected to suitable elements of the Z-vector, and she may adjust those parameters as desired (460). 
When a parameter is adjusted, the system synthesizes and displays a new image of the garment on the model (450).” ¶ [0038]: “An embodiment of the invention may combine a garment selection control with controls for other aspects of image generation (skin tone, pose, body shape, accessories, etc.). Then, by manipulating individual controls of this plurality of controls, the user can change the garment (leaving the skin tone, pose, and accessories alone), or switch among accessories (leaving skin tone, pose and garment alone). This embodiment permits quick, self-directed comparisons among complete “outfits” or “looks,” a capability that is currently provided at significant expense by human fashion coordinators, and consequently mostly unavailable to shoppers of ordinary or modest means.” ¶ [0044]: “An embodiment of this system accepts at least two different types of input apparel data, and produces at least one synthetic image that combines the input in a way that is plausible given the training of the network. Here, “plausible” does not mean “a copy of a previously seen training image, with certain parts replaced by portions of the input images.” Instead, a “plausible” image is one that the recognizer (after training) evaluates as being an image that shows a model wearing the input garment and accessories.” ¶ [0046]: “In a preferred embodiment, two or more input networks 721, 721, . . . , 729 receive the different categories of input. One network 721 may receive an image of a garment 711, while another network 729 may receive an image of a shoe 719a, b. (The inventor has noted that “shoe” network performance can be improved by providing multiple images of the same shoe, e.g. images from the front and side). Each input network may be a separate, multi-layer neural network with decreasing numbers of nodes at each level, which distills the information in its respective input(s) into a corresponding Z-vector 751, 752, . . . 
, 759 through the compression-like process described earlier.” ¶ [0047]: “From these Z-vectors, a plurality of output networks 731, 732, . . . , 739, operated independently according to the basic inventive principles, could be trained to generate synthetic images involving a model and their respective input(s). The “garment” network could produce an image of a model wearing the input garment, and the “shoe” network could produce an image of a model's leg (or a pair of legs) wearing the input shoe. But in this preferred embodiment, the separate Z vectors 751, 752, . . . , 759 are combined (e.g., concatenated) to form a combined Z-vector 760, which is delivered to a multi-element image synthesizer neural network 770. This output network creates an image of a model wearing the garment and the shoes (plus any other accessories whose images are provided to other input networks and concatenated into the composite Z vector 760, for example pants 712 whose image was provided through input neural network 722).” NOTE: In the system shown in FIG. 7, for each category of apparel, an input for the corresponding input network can be changed to generate different combinations of inputted apparel items in the output image, or an input for a particular apparel category can be removed (i.e., omitted) to produce an output image without an item of the category of apparel. In other words, in the network shown in FIG. 7, it is clear that each input can be changed independently, and that not every input network has to have an input (i.e., for each category of apparel, an apparel category input can be added, changed, or removed entirely such that a respective output image would include the added apparel item, substitute a different apparel item of the same category in the output image, or remove (or not include) an apparel item in the output image). 
¶ [0058]: “Once the neural network has been trained (which may require thousands or tens of thousands of automatically-produced training images), photographic images of several real items of apparel are obtained (870). As mentioned earlier, these images should be made in conditions similar to those in which the training apparel items were prepared at 810. These photographic images are provided to the trained neural network, which delivers a corresponding synthetic image that appears to show a model dressed in the apparel of the photographic images (880). In this synthetic photo, some or all of the apparel may be real (their images having been provided to the neural network), but the model is not real. The neural network, through its training, is able to properly layer clothing, for example by drawing a jacket over a shirt, or shoes over socks. Finally, the synthetic image may be displayed (890).” ¶ [0059]: “Note that training networks with photorealistic renderings of 3D models of garments, shoes, accessories and the like is quite time- and resource-intensive, but the resulting trained networks can produce the desired synthetic images from inputs that are significantly easier and less-expensive to obtain. Once trained, the network can operate on basic “flat” images of garments (i.e., an image of a garment laid flat against a planar background), front-and-side images of shoes, and one or a few images of each accessory type. The training process can include learning “no accessory,” “no shoes” options, so that the synthesizer can generate those images, as well as images with any desired combination of shoes, bags, jewelry, and other features. 
Composite networks according to an embodiment of the invention may include input sub-networks to receive images and associated information including: Flat garment images, Mannikin garment images, Shoe images (preferably two or three views), Handbag images (preferably two or three views), Hat images, Necklace images, Bracelet images, Ring images, Scarf images, Necktie images” NOTE: In the system shown in FIG. 7, inputs for each different category of apparel can clearly be provided independently (i.e., for each input network 721, 722, …, 729, a corresponding input can be independently added, changed and/or excluded).).
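The independent add/change/remove behavior noted above, together with the learned “no accessory” / “no shoes” options of ¶ [0059], may be sketched as follows. All names are hypothetical and illustrative only; zero vectors stand in for the trained "no item" codes:

```python
import numpy as np

Z_DIM = 4

def encode(image, seed):
    """Stand-in for a trained category input network: image -> Z-vector."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((Z_DIM, 8)) @ image

# Stand-ins for the learned "no accessory" / "no shoes" codes (¶ [0059]).
NO_ITEM = {cat: np.zeros(Z_DIM) for cat in ("garment", "pants", "shoe")}
SEEDS = {"garment": 0, "pants": 1, "shoe": 2}

def combined_z(inputs):
    """Build the combined Z-vector from whichever category inputs are
    supplied; an omitted category falls back to its 'no item' code, so
    each category can be added, changed, or removed independently."""
    parts = []
    for cat in ("garment", "pants", "shoe"):
        parts.append(encode(inputs[cat], SEEDS[cat])
                     if cat in inputs else NO_ITEM[cat])
    return np.concatenate(parts)

dress = np.ones(8)
boots = 2 * np.ones(8)

z_with_shoes = combined_z({"garment": dress, "shoe": boots})  # shoes added
z_removed = combined_z({"garment": dress})                    # shoes removed

# Removing the shoe input only reverts the shoe segment to the
# "no shoes" code; the garment segment is unchanged.
assert np.array_equal(z_with_shoes[:Z_DIM], z_removed[:Z_DIM])
assert np.array_equal(z_removed[2*Z_DIM:], NO_ITEM["shoe"])
```

This illustrates the claim 12 mapping: per-category inputs can be added or omitted without disturbing the segments of the combined Z-vector that drive the remaining garment layers.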
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over HONG et al. (Hong F, Zhang M, Pan L, Cai Z, Yang L, Liu Z. “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars.” arXiv preprint arXiv:2205.08535. 2022 May 17) in view of well-known prior art, in view of GUPTA et al. (US 2024/0257470), further in view of COLBERT (US 2019/0287301), and still further, in view of CUI et al. (Cui, Aiyu; Daniel McKee; and Svetlana Lazebnik. "Dressing in order: Recurrent person image generation for pose transfer, virtual try-on and outfit editing." In Proceedings of the IEEE/CVF international conference on computer vision; pp. 14638-14647; 2021; hereinafter referred to as “CUI”).
Regarding claim 9 (depends on claim 7), COLBERT arguably teaches:
executing the machine learning model further comprises ordering the garment layers based on the received natural language request (As shown in FIG. 7, the received input garment layers are ordered in the output image (i.e., the inputted jacket 711, blouse 713, and pants 712 are ordered such that the blouse 713 and pants 712 are beneath the jacket 711 in the output image).).
Nevertheless, whereas the combination of HONG, GUPTA and COLBERT may not be entirely explicit as to the following, CUI teaches:
executing the machine learning model further comprises ordering the garment layers based on the received natural language request (Abstract: “a novel recurrent generation pipeline to sequentially put garments on a person, so that trying on the same garments in different orders will result in different looks.” See FIG. 2: “DiOr generation pipeline (see Section 3 for details). We represent a person as a (pose, body, {garments}) tuple. Generation starts by encoding the target pose as Zpose and the source body as texture map Tbody. Then the body is generated as Zbody by the generator module Gbody. Zbody serves as Z0 for the recurrent garment generator Ggar, which receives the garments in order, each encoded by a 2D texture feature map Tgk and soft shape mask Mgk. In addition to masked source images, the body and garment encoders take in estimated flow fields f to warp the sources to the target pose. We can decode at any step to get an output showing the garments put on so far.” 3rd paragraph of § 1: “ By contrast, our system incorporates a novel recurrent generation module to produce different looks depending on the order of putting on garments. This is why we call our system DiOr, for Dressing in Order.” 2nd paragraph of § 4.4: “When we run our model, the garment dressing order is set to (hair, top, bottom, jacket).”).
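The order-dependent recurrent generation that CUI describes — starting from a body state and folding each garment in sequentially, so that different dressing orders yield different looks — may be sketched with the following minimal analogue. The state here is a pixel-like list rather than a feature map, and all names are illustrative, not CUI's actual pipeline:

```python
def dress_in_order(body, garments):
    """Minimal analogue of a recurrent garment generator: begin from
    the body state (cf. Z0 in CUI's FIG. 2) and apply each garment in
    order, with later garments occluding earlier ones where their
    masks overlap. None of this is CUI's actual API."""
    state = list(body)
    for name, mask in garments:
        for i, covered in enumerate(mask):
            if covered:            # this garment occludes whatever
                state[i] = name    # was generated at position i so far
    return state

body = ["skin"] * 4
top = ("top", [1, 1, 0, 0])
jacket = ("jacket", [1, 1, 1, 0])

# Jacket after top: the jacket is drawn over the top.
a = dress_in_order(body, [top, jacket])
# Top after jacket: the top now covers part of the jacket -- a
# different look, as CUI's abstract states.
b = dress_in_order(body, [jacket, top])

assert a == ["jacket", "jacket", "jacket", "skin"]
assert b == ["top", "top", "jacket", "skin"]
assert a != b
```

The non-commutative fold is the point of the mapping: the same garments applied in different orders produce different outputs, i.e., the layer ordering is itself an input to the generation.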
Thus, in order to obtain a more versatile virtual try-on system having the cumulative features and/or functionalities taught by HONG, GUPTA, COLBERT and CUI, it would have been obvious to one of ordinary skill in the art to have modified the system taught by the combination of HONG, GUPTA and COLBERT so as to incorporate the functionality of ordering the garment layers based on the received input in the machine learning model, as taught by CUI.
Response to Arguments
Applicant's arguments filed September 9, 2025 with respect to the amended claims have been considered but are moot in view of the new ground(s) of rejection.
Conclusion
At present, it is not apparent to the examiner which part of the application could serve as a basis for new and allowable claims. However, should the applicant nevertheless regard some particular matter as patentable, the examiner encourages applicant to appropriately amend the claims to include such matter and to indicate in the REMARKS the difference(s) between the prior art and the claimed invention as well as the significance thereof.
Furthermore, should applicant decide to amend the claims, examiner respectfully requests that the applicant please indicate in the REMARKS from which page(s), line(s) or claim(s) of the originally filed application that any amendments are derived. See MPEP § 2163(II)(A) (There is a strong presumption that an adequate written description of the claimed invention is present in the specification as filed, Wertheim, 541 F.2d at 262, 191 USPQ at 96; however, with respect to newly added or amended claims, applicant should show support in the original disclosure for the new or amended claims.).
Action is Final
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT PEREN who can be reached by telephone at (571) 270-7781, or via email at vincent.peren@uspto.gov. The examiner can normally be reached on Monday-Friday from 10:00 A.M. to 6:00 P.M.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING POON, can be reached at telephone number (571)272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from Patent Center. Status information for published applications is available to the public, while status information for unpublished applications is available through Patent Center to authorized users only. Should you have questions about access to Patent Center, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) Form at https://www.uspto.gov/patents/uspto-automated-interview-request-air-form.
/VINCENT PEREN/
Examiner, Art Unit 2617
/KING Y POON/Supervisory Patent Examiner, Art Unit 2617