Last updated: May 29, 2026

Application No. 18/734,454

METHOD FOR VIRTUAL FITTING, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Non-Final OA §103

Filed

Jun 05, 2024

Priority

Apr 01, 2024 — CN 202410392719.8

Examiner

LETT, THOMAS J

Art Unit

2611

Tech Center

2600 — Communications

Assignee

Xiao-I Plus Inc.

OA Round

1 (Non-Final)

Interview Optional

Interview lift is -35.9%, below the 15.0% threshold. A written response is recommended.

Based on 725 resolved cases, 2023–2026

Examiner Intelligence

LETT, THOMAS J

Grants 84% — above average

Career Allowance Rate

606 granted / 725 resolved

+21.6% vs TC avg

-35.9%

Interview Lift (minimal)

Typical timeline

2y 10m

Avg Prosecution

21 currently pending

Career history

748

Total Applications

across all art units

Statute-Specific Performance

§101

5.3%

-34.7% vs TC avg

§103

41.0%

+1.0% vs TC avg

§102

51.0%

+11.0% vs TC avg

§112

2.4%

-37.6% vs TC avg

Based on career data from 725 resolved cases
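The statute-specific figures above can be cross-checked with a little arithmetic. The sketch below assumes each "vs TC avg" delta is the examiner's rate minus the Tech Center average, in percentage points; the page does not define the metric, so the function name `implied_tc_average` and that interpretation are illustrative only.

```python
# Examiner rate and reported delta vs. Tech Center average, in percent,
# copied from the Statute-Specific Performance figures.
stats = {
    "101": (5.3, -34.7),
    "103": (41.0, 1.0),
    "102": (51.0, 11.0),
    "112": (2.4, -37.6),
}


def implied_tc_average(rate, delta):
    # Assumption: delta = examiner rate - Tech Center average (percentage points).
    return round(rate - delta, 1)


implied = {statute: implied_tc_average(r, d) for statute, (r, d) in stats.items()}
for statute, avg in implied.items():
    print(f"§{statute}: implied Tech Center average {avg}%")
```

Under that assumption all four statutes imply the same baseline, which suggests the page compares every statute against a single estimated Tech Center figure rather than a per-statute average.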

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 7-11, 14-17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu et al. (TryOnDiffusion: A Tale of Two UNets) in view of Choi et al. (VITON-HD: High-Resolution Virtual Try-On) and further in view of Rombach et al. (High-Resolution Image Synthesis with Latent Diffusion Models).
Regarding claim 1, Zhu et al. (TryOn) discloses a method for virtual fitting (apparel try-on results with a significant body shape and pose modification, figure 1), comprising:
obtaining a first person image and a garment image (inputs: for both person and garment images using off-the-shelf methods [11, 28]. For garment image, we further segment out the garment Ic using the parsing map. For person image, we generate clothing-agnostic RGB image Ia which removes the original clothing but retains the person identity, section 3, see top of Figure 2);
inputting the second person image and the garment image into a virtual fitting model obtained by pre-training to obtain a virtual fitting image (the model takes as input 256×256 try-on result from previous Parallel-UNet model and synthesizes the final try on result Itr at 1024×1024 resolution, section 3.1. The model takes as input the agnostic person image and target garment, section 3); 
wherein the virtual fitting model is a dual U-Net structure (achieve it via two UNets that handle the garment and the person respectively, section 3.2, page 5) which comprises an image encoder, two U-Nets, and an image decoder, and
the two U-Nets are respectively used as a garment characterization network and a latent diffusion network (the model takes as input 256×256 try-on result from previous Parallel-UNet model and synthesizes the final try on result Itr at 1024×1024 resolution, section 3.1);
Zhu et al. does not expressly disclose performing a masking process of garment information on the first person image to obtain a second person image.
Choi et al. teaches generating the clothing-agnostic image Ia and the clothing-agnostic segmentation map Sa, which allow the model to remove the original clothing information thoroughly and preserve the rest of the image (section 3.1, page 14134).
Zhu et al. and Choi et al. are analogous art because they are from the similar problem-solving area of virtual fitting. At the time of the invention, it would have been obvious to a person of ordinary skill in the art to add the masking feature of Choi et al. to the method of Zhu et al. in order to obtain a masking process of garment information. The motivation for doing so would be to differentiate image data.
Zhu et al. does not expressly disclose wherein the two U-Nets have a same network structure that comprises one or more down-sampling layers, one or more intermediate layers, and one or more up-sampling layers and an image encoder, two U-Nets, and an image decoder.
Rombach et al. teaches advanced sampling, undersampling, and downsampling blocks, section 3.2 and teaches an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity, a decoder D reconstructs the image from the latent and a U-Net performing denoising, section 3.
Zhu et al. and Rombach et al. are analogous art because they are from the similar problem-solving area of image synthesis. At the time of the invention, it would have been obvious to a person of ordinary skill in the art to add the one or more down-sampling layers, one or more intermediate layers, and one or more up-sampling layers and an image encoder, two U-Nets, and an image decoder of Rombach et al. to the method of Zhu et al. in order to obtain an image formation process. The motivation for doing so would be to achieve state-of-the-art synthesis results on image data.
Regarding claim 2, Zhu et al. (TryOn) discloses the method according to claim 1, wherein the inputting the second person image and the garment image into a virtual fitting model obtained by pre-training to obtain a virtual fitting image comprises:
inputting the garment image into the image encoder to obtain a garment latent feature (Examiner articulates that virtual try-on with a diffusion model employs a network module (e.g., a CLIP image encoder or ReferenceNet) to extract garment features, which are injected into the diffusion denoising process to preserve the identity and details of the garment), and
taking the garment latent feature as an input of the garment characterization network (similarity between the target person and the source garment, providing a learnable way to represent correspondence for the try-on task, section 3.2);
recording a feature of the up-sampling layers, the intermediate layers, and the down-sampling layers when performing a spatial self-attention operation (garment is warped implicitly via a cross attention mechanism, page 1; pose embeddings are then fused to the person-UNet through the attention mechanism, which is implemented by concatenating pose embeddings to the key-value pairs of each self attention layer, section 3.2);
inputting the second person image into the image encoder to obtain a person latent feature and mask region information, and taking the person latent feature, the mask region information and a random noise obeying Gaussian distribution as an input of the latent diffusion network (Choi et al.: generating the clothing-agnostic image Ia and the clothing-agnostic segmentation map Sa, which allow the model to remove the original clothing information thoroughly, and preserve the rest of the image, section 3.1, page 14134);
respectively concatenating the feature, recorded by the garment characterization network, of the up-sampling layers, the intermediate layers, and the down-sampling layers when performing the spatial self-attention operation with a feature of the up-sampling layers, the intermediate layers, and the down-sampling layers at a corresponding position of the latent diffusion network when performing the spatial self-attention operation in a process of performing iterative denoising, to obtain a concatenated feature, and taking the concatenated feature as a feature of the latent diffusion network at the corresponding position (person-UNet takes the clothing-agnostic RGB Ia and the noisy image zt as input. Since Ia and zt are pixel-wise aligned, we directly concatenate them along the channel dimension at the beginning of UNet processing, section 3.2); and
inputting a feature output by the latent diffusion network into the image decoder to output the virtual fitting image (Output from 256×256 Parallel-UNet is sent to standard super resolution diffusion to create the 1024×1024 image, figure 2).
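The feature concatenation recited in claim 2 can be illustrated with a toy, pure-Python sketch. Everything here is an assumption made for illustration: the claim does not state the concatenation axis, so the sketch joins features along the token axis; projections are identity; and only the positions belonging to the denoising network are kept after attention. The function names are hypothetical and do not come from the application or the cited references.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (toy simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out


def fused_self_attention(denoise_tokens, garment_tokens):
    """Concatenate along the token axis, attend, keep the denoising positions.

    Assumption: keeping only the first len(denoise_tokens) outputs is one
    plausible reading of "at the corresponding position"; it is not confirmed
    by the source.
    """
    concatenated = denoise_tokens + garment_tokens
    attended = self_attention(concatenated)
    return attended[: len(denoise_tokens)]


denoise = [[1.0, 0.0], [0.0, 1.0]]              # features from the denoising U-Net
garment = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]  # features recorded from the garment U-Net
fused = fused_self_attention(denoise, garment)
print(fused)
```

By contrast, the passage of Zhu et al. quoted for this limitation describes concatenating Ia and zt along the channel dimension at the network input, which is a different operation from the per-layer concatenation sketched here.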
Regarding claim 3, Rombach et al. discloses the method according to claim 1, wherein a training process of the virtual fitting model comprises:
adding a random noise to a training sample in a diffusion step (applies JPEG compressions noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels and Gaussian noise in a random order to an image, D.6.1, page 23) based on Markov chain (learning the reverse process of a fixed Markov Chain of length T), recovering a clean sample from a noise sample in a reverse process (gradually denoising a normally distributed variable, section 3.2), calculating a loss between a real noise and an estimated noise (ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, section 3.2, Eq. 2), back propagating and updating a model parameter of the latent diffusion network until convergence, saving the model parameter and taking the model parameter as a model parameter of the garment characterization network.
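The training steps recited in claim 3 follow the usual noise-prediction recipe for denoising diffusion models: noise a sample at a random step, then penalize the squared error between the real and the estimated noise. The sketch below is a minimal pure-Python illustration, assuming a linear beta schedule and a toy four-value "latent"; the schedule values, function names, and the zero-valued placeholder prediction are all assumptions, not details from the application or the references.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def noise_loss(true_eps, predicted_eps):
    """Mean squared error between the real and the estimated noise."""
    return sum((a - b) ** 2 for a, b in zip(true_eps, predicted_eps)) / len(true_eps)


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -0.25, 0.0, 1.0]          # toy "latent" training sample
t = rng.randrange(len(alpha_bars))   # random diffusion step
xt, eps = add_noise(x0, alpha_bars[t], rng)
zero_guess = [0.0] * len(eps)        # placeholder for a model's noise estimate
loss = noise_loss(eps, zero_guess)
print(t, loss)
```

In a real training loop the placeholder would be replaced by the network's prediction, and the loss would be back-propagated to update the network parameters.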
Regarding claim 4, Choi et al. (ViTOn) discloses the method according to claim 1, wherein the performing a masking process of garment information on the first person image to obtain a second person image comprises:
inputting the first person image into a pre-trained deep learning image semantic segmentation neural network model (the image I by utilizing the pre-trained networks [7, 3], where L is a set of integers indicating the semantic labels, section 3.1) for semantic segmentation to obtain a semantic segmented person image (a clothing-agnostic image Ia and a clothing-agnostic segmentation map Sa as inputs of each stage, which truly eliminate the shape of clothing item and preserve the body parts that need to be reproduced, section 3.1),
wherein the semantic segmented person image at least comprises an image divided into a human body information region and a garment information region (remove the clothing region to be replaced and preserve the rest of the image, section 3.1); and
performing the mask processing on the garment information region in the semantic segmented person image to obtain the second person image (remove clothing, figure 3).
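The masking step of claim 4 can be sketched as follows, assuming a segmentation map is already available. The label ids, the grey fill value, and the function name `mask_garment` are hypothetical; an actual parsing network defines its own label set, and neither the application nor Choi et al. is quoted here as specifying these values.

```python
# Hypothetical label ids; real human-parsing models define their own label sets.
BACKGROUND, BODY, GARMENT = 0, 1, 2
MASK_VALUE = (128, 128, 128)  # assumed neutral fill colour for masked pixels


def mask_garment(image, seg_map, garment_label=GARMENT, fill=MASK_VALUE):
    """Return (masked image, binary mask) with garment pixels replaced by fill."""
    masked, mask = [], []
    for img_row, seg_row in zip(image, seg_map):
        masked.append([fill if lab == garment_label else px
                       for px, lab in zip(img_row, seg_row)])
        mask.append([1 if lab == garment_label else 0 for lab in seg_row])
    return masked, mask


# 2x3 toy "first person image" (RGB tuples) and its segmentation map.
person = [
    [(10, 10, 10), (200, 0, 0), (200, 0, 0)],
    [(10, 10, 10), (220, 180, 160), (200, 0, 0)],
]
seg = [
    [BACKGROUND, GARMENT, GARMENT],
    [BACKGROUND, BODY, GARMENT],
]
second_person, region = mask_garment(person, seg)
print(region)
```

The returned binary mask corresponds loosely to the "mask region information" that claim 2 feeds to the latent diffusion network alongside the person latent feature.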
Regarding claim 7, Zhu et al. (TryOn) discloses a method for virtual fitting, comprising:
obtaining virtual fitting images by using the method for virtual fitting according to claim 1,
wherein the first person image comprises a user image, there are at least two garment images, and the virtual fitting images respectively correspond to the garment images (randomly selected 2804 input pairs out of the 6K test set, ran all four methods on those pairs, section 4); and
selecting at least one target virtual fitting image from at least two virtual fitting images for display or recommendation (randomly selected 2804 input pairs out of the 6K test set, ran all four methods on those pairs, and presented to raters. 15 non-expert raters (on crowdsource platform) have been asked to select the best result out of four, section 4).
Claim 8, an electronic device, is rejected for the same reason as claim 1.
Claim 9, an electronic device, is rejected for the same reason as claim 2.
Claim 10, an electronic device, is rejected for the same reason as claim 3.
Claim 11, an electronic device, is rejected for the same reason as claim 4.
Claim 14, an electronic device, is rejected for the same reason as claim 7.
Claim 15, a storage medium claim, is rejected for the same reason as claim 1.
Claim 16, a storage medium claim, is rejected for the same reason as claim 2.
Claim 17, a storage medium claim, is rejected for the same reason as claim 3.
Claim 20, a storage medium claim, is rejected for the same reason as claim 1.

Allowable Subject Matter
Claims 5, 6, 12, 13, 18 and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to THOMAS J LETT whose telephone number is (571)272-7464. The examiner can normally be reached Mon-Fri 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard, can be reached at (571) 272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/THOMAS J LETT/Primary Examiner, Art Unit 2611

Prosecution Timeline

Jun 05, 2024

Application Filed

Apr 13, 2026

Non-Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/566,523

Patent 12633014

GENERATING IMAGE METHOD AND APPARATUS, DEVICE, AND MEDIUM

2y 5m to grant Granted May 19, 2026

18/468,301

Patent 12627947

APPARATUSES, COMPUTER-IMPLEMENTED METHODS, AND COMPUTER PROGRAM PRODUCTS FOR IMPROVED DATA TRANSMISSION AND TRACKING

2y 8m to grant Granted May 12, 2026

17/935,077

Patent 12620181

DETERMINING AN ASSIGNMENT OF VIRTUAL OBJECTS TO POSITIONS IN A USER FIELD OF VIEW TO RENDER IN A MIXED REALITY DISPLAY

3y 7m to grant Granted May 05, 2026

18/529,268

Patent 12619774

CONTROLLED EXPOSURE TO LOCATION-BASED VIRTUAL CONTENT

2y 5m to grant Granted May 05, 2026

18/382,917

Patent 12602714

LIGHTING AND INTERNET OF THINGS DESIGN USING AUGMENTED REALITY

2y 5m to grant Granted Apr 14, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2

Expected OA Rounds

84%

Grant Probability

48%

With Interview (-35.9%)

2y 10m (~10m remaining)

Median Time to Grant

Low

PTA Risk

Based on 725 resolved cases by this examiner. Grant probability derived from career allowance rate.
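The projection figures can be reproduced from the career counts. The sketch below assumes the interview lift is applied as additive percentage points to the rounded grant probability; the page does not state its formula, so this is one reading that happens to match the displayed numbers.

```python
granted = 606
resolved = 725
interview_lift_pts = -35.9  # reported interview lift, in percentage points

# Career allowance rate, which the page uses as the grant probability.
allowance_rate = 100 * granted / resolved
grant_probability = round(allowance_rate)

# Assumption: the lift is added to the grant probability as percentage points.
with_interview = round(grant_probability + interview_lift_pts)

print(f"allowance rate: {allowance_rate:.1f}%")
print(f"grant probability: {grant_probability}%")
print(f"with interview: {with_interview}%")
```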