DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Status
Claims 1-30, as presented in the claim set filed February 28, 2025, were pending for examination in Application No. 17/389,113, filed July 29, 2021. In the remarks and amendments received December 15, 2025, claims 1-7, 9-13, 16, 19-22, 25, and 27-30 were amended. Accordingly, claims 1-30 are currently pending for examination in the application.
Response to Amendment
Applicant’s amendments to the claims, filed December 15, 2025, have overcome each objection and each 35 U.S.C. 112(b) rejection previously set forth in the Non-Final Office Action mailed July 15, 2025. The examiner thanks Applicant for addressing the objections and the suggested amendments to the disclosure.
Response to Arguments
Applicant’s arguments filed December 15, 2025, have been fully considered but are not persuasive.
The examiner respectfully disagrees with Applicant’s argument that the claims are allowable over the previously cited prior art of Zhang and Vedantam on the ground that Zhang and Vedantam do not disclose, teach, or suggest the newly amended limitation “determin[ing], using the combined probability distribution, and one or more features extracted from the respective encoders, a latent code from the combined probability distribution corresponding to features in a latent space” because “neither reference teaches generating a latent code from a combined probability distribution” (pgs. 12-13 of Applicant’s Remarks). As detailed in the previously set forth Office Action, Vedantam teaches generating a latent code from a combined probability distribution formed as a product of experts (PoE), i.e., a product of Gaussians, as disclosed in Fig. 2 and the subheading “Handling missing attributes” on pg. 4 of Vedantam. Nonetheless, the examiner finds the independent claims allowable over the previously cited prior art of Zhang and Vedantam because that prior art does not reasonably disclose, teach, or suggest the claimed features of the independent claims, structurally and functionally interconnected with the other limitations newly amended into the claims, in the manner recited in the independent claims (see the examiner’s statement of reasons for allowance below).
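For clarity of the record only, the closed-form combination of Gaussian experts underlying a product-of-experts model of the kind described in Vedantam may be summarized as follows; this is offered as general background and is not a reproduction of any equation of Vedantam or of Applicant’s disclosure. For Gaussian experts $q_m(z) = \mathcal{N}(z;\, \mu_m, \Sigma_m)$, one per modality $m$, the normalized product is itself Gaussian, $q(z) = \mathcal{N}(z;\, \mu, \Sigma)$, with
$$\Sigma^{-1} = \sum_m \Sigma_m^{-1}, \qquad \mu = \Sigma \Big( \sum_m \Sigma_m^{-1} \mu_m \Big),$$
from which a latent code may then be drawn, e.g., $z \sim \mathcal{N}(\mu, \Sigma)$.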
Claim Objections
Claims 1, 7, 13, 19, and 25 are objected to because of the following informalities, which fail to comply with the requirement of 37 CFR 1.71(a) for "full, clear, concise, and exact terms" (see MPEP § 608.01(m)):
In lines 1-2 of claims 1, 7, 13, 19, and 25, the examiner respectfully suggests amending the phrase “comprising: circuitry to:…” to recite “comprising[[:]] circuitry to:…”.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-30 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claims 1, 7, 13, 19, and 25, each claim recites the phrase "continuous latent image space". The phrase includes the term “continuous”, which is a term of degree. This term renders the claims indefinite because the term is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. For example, it is unclear to the examiner whether a “continuous latent image space” differs from an ordinary “latent image space” as that phrase is understood in the art and/or defined in Applicant’s Specification. For examination purposes, the phrase will be read with the term omitted, i.e., as “[[continuous ]]latent image space”. Furthermore, claims 2-6, 8-12, 14-18, 20-24, and 26-30 inherit this indefiniteness in view of their dependency on these claims.
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.
The following is a quotation of pre-AIA 35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA 35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.
Claims 3-6, 9-12, 15-18, 21-24, and 27-30 are rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claims upon which they depend, or for failing to include all the limitations of the claims upon which they depend.
Claims 3-6 do not further limit claim 1 because claims 3-6 recite the same limitations as claim 1. Similarly, claims 9-12, 15-18, 21-24, and 27-30 (which are similar to claims 3-6) do not further limit claims 7, 13, 19, and 25 (which are similar to claim 1), respectively.
Claims 3, 9, 15, 21, and 27 recite "determine the individual probability distributions of image features for the conditional input". Claims 1, 7, 13, 19, and 25 also recite "determine, based on the modality-specific encoded features, an individual probability distribution for each different modality".
Claims 4, 10, 16, 22, and 28 recite "determine the combined probability distribution from the individual probability distributions" using modality-specific "encoded features". Claims 1, 7, 13, 19, and 25 also recite "determine, using each of the distribution in a continuous latent image space of the individual probability distributions, to produce a combined probability distribution".
Claims 5, 11, 17, 23, and 29 recite "select the latent code based, at least in part, upon the combined probability distribution". Claims 1, 7, 13, 19, and 25 also recite "determine, using the combined probability distribution and one or more features extracted from the respective encoders, a latent code from the combined probability distribution corresponding to features in a latent image space".
Claims 6, 12, 18, 24, and 30 recite "generate the image based, at least in part, upon image features corresponding to the selected latent code". Claims 1, 7, 13, 19, and 25 also recite "generate an image using the image generation decoder based, at least in part, upon the latent code corresponding to the combined probability distribution with at least one of the modality-specific encoded features relating to the semantic segmentation map and the edge map".
Therefore, claims 3-6, 9-12, 15-18, 21-24, and 27-30 recite the same limitations as claims 1, 7, 13, 19, and 25, respectively. Applicant may cancel the claims, amend the claims to place them in proper dependent form, rewrite the claims in independent form, or present a sufficient showing that the dependent claims comply with the statutory requirements.
Allowable Subject Matter
Claims 1-30 would be allowable if rewritten or amended to overcome the rejections under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, and 35 U.S.C. 112(d) or 35 U.S.C. 112 (pre-AIA), fourth paragraph, set forth in this Office action.
The following is an examiner’s statement of reasons for allowance.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Regarding independent claims 1, 7, 13, 19, and 25: using independent claim 1 as a representative example, the examiner finds that the cited prior art, whether considered individually or in combination, does not teach or suggest, and provides no motivation to combine the cited references to arrive at, the following combination of limitations in the context of the claim as a whole (a non-limiting, purely illustrative sketch of this claimed flow is provided below, following the quoted limitations):
“…receive conditional input for two or more modalities, the two or more modalities comprising at least one of a semantic segmentation map and an edge map;
encode features for the conditional input for two or more modalities using respective encoders for each different modality of the two or more modalities to produce modality-specific encoded features;
determine, based on the modality- specific encoded features, an individual probability distribution for each different modality;
determine, using each of the distribution in a continuous latent image space of the individual probability distributions, to produce a combined probability distribution;
determine, using the combined probability distribution and one or more features extracted from the respective encoders, a latent code from the combined probability distribution corresponding to features in a latent image space;
provide the latent code and at least one of the modality-specific encoded features relating to the semantic segmentation map and the edge map to an image generation decoder;
fuse, in the image generation decoder, the latent code corresponding to the combined probability distribution with at least one of the modality-specific encoded features relating to the semantic segmentation map and the edge map;
generate an image using the image generation decoder based, at least in part, upon the latent code corresponding to the combined probability distribution with at least one of the modality-specific encoded features relating to the semantic segmentation map and the edge map; and
provide the image for presentation.”
Thus, the dependent claims are also allowable in view of their dependency on the independent claims.
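Solely as a non-limiting illustration of the claimed data flow quoted above, and not as a characterization of Applicant’s actual implementation, the sequence of operations may be sketched as follows; every function and variable name in the sketch (e.g., encode, to_gaussian, product_of_gaussians) is hypothetical, and the numerical operations are placeholders.

# Non-limiting, purely illustrative sketch of the claimed flow; all names are
# hypothetical placeholders and do not represent Applicant's implementation.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, dim=8):
    # Stand-in for a modality-specific encoder: maps a conditional input map
    # to modality-specific encoded features (here, a random projection).
    proj = rng.standard_normal((x.size, dim))
    return x.flatten() @ proj

def to_gaussian(feat):
    # Stand-in for determining an individual probability distribution
    # (diagonal Gaussian) from the modality-specific encoded features.
    mu = np.tanh(feat)
    var = np.exp(-np.abs(feat)) + 1e-3
    return mu, var

def product_of_gaussians(params):
    # Combine the individual distributions into a combined distribution
    # (precision-weighted product of Gaussian experts).
    precisions = [1.0 / var for _, var in params]
    var = 1.0 / np.sum(precisions, axis=0)
    mu = var * np.sum([p * m for (m, _), p in zip(params, precisions)], axis=0)
    return mu, var

# Conditional input for two modalities: a semantic segmentation map and an edge map.
seg_map = rng.integers(0, 5, size=(4, 4)).astype(float)
edge_map = rng.integers(0, 2, size=(4, 4)).astype(float)

# Encode each modality with its respective encoder.
seg_feat, edge_feat = encode(seg_map), encode(edge_map)

# Individual distribution per modality, combined into one distribution.
mu, var = product_of_gaussians([to_gaussian(seg_feat), to_gaussian(edge_feat)])

# Latent code determined from the combined distribution and the extracted features.
z = mu + np.sqrt(var) * rng.standard_normal(mu.shape)

# "Decoder": fuse the latent code with the modality-specific encoded features
# and emit a placeholder "image" for presentation.
fused = np.concatenate([z, seg_feat, edge_feat])
image = np.tanh(fused[:16]).reshape(4, 4)
print(image)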
As a non-limiting example, a close prior art, Zhang et al. (Zhang; “UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis,” 2021; cited in previous Office Action(s)), discloses in Fig. 1, section 1 on pg. 1, 3rd para. of pg. 2, and section 4.2 on pg. 7:
[media_image1.png (greyscale): reproduction of the cited figure from Zhang]
[1 Introduction] “Conditional image synthesis aims to create an image according to the given control signals. With the increasing demand for flexible conditional image synthesis, various kinds of control signals have been introduced into this field, which can be divided into three main modalities: (i) textual controls (TC), including the class labels [1] and natural language descriptions [62, 54]; (ii) visual controls (VC), such as a spatially-aligned sketch map for reference [17, 60] or another image for style transfer [15, 27]; (iii) preservation controls (PC), which require the synthesized image to preserve some given image blocks, e.g., image outpainting and inpainting [63, 67].”
[3rd para. of pg. 2] “Based on the aforementioned observations, we propose UFC-BERT, a novel BERT-based two-stage framework to UniFy any number of multi-modal Controls for conditional image synthesis. Concretely, the textual, visual, and preservation control signals, as well as the generated image, are uniformly represented as a sequence of discrete tokens, as shown in Figure 2. The textual control consists of word tokens for class labels or natural language descriptions. The visual control(s) and the generated image are both represented as discrete tokens due to the first stage, where each token corresponds to a block within the reference image(s) or the generated image. Zero, one, or more reference images are supported. To preserve a given image block within the generated image, we encode the given image block into discrete tokens and fix corresponding parts of the generated sequence to the tokens.”
[4.2 Flexibility of Multi-Modal Controls for Conditional Image Synthesis] “In this section, we qualitatively verify the synthesis ability of UFC-BERT with three modalities of control signals, i.e., textual, visual, and preservation controls. The textual controls are the texts paired with the images, which are already provided by the two datasets, while the visual controls are code sequences of cropped regions, e.g. regions that represent logos or texture of clothes.
In Figure 3, we synthesize images conditioned on combinations of the three types of control signals. The results demonstrate UFC-BERT can unify any number of multi-modal controls to synthesize high-quality images. Further, UFC-BERT supports one or multiple visual controls for more flexible synthesis, as shown in Figure 4 where we generate images given 2-3 visual controls. We observe that UFC-BERT can reasonably fuse multiple visual elements and produce a harmonious image.”
Although Zhang discloses multimodal conditional inputs of at least three modalities, including a segmentation reference (i.e., the “preservation control”, which includes at least image blocks) and an edge map (i.e., the “visual control”, which includes at least a sketch map), Zhang does not disclose or reasonably teach the allowable claims as a whole, particularly “two or more modalities comprising at least one of a semantic segmentation map and an edge map” (emphasis added) together with the remaining claim elements. Therefore, the claims are allowable over Zhang.
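As a further non-limiting aside, the quoted passages describe representing the textual, visual, and preservation controls uniformly as a single sequence of discrete tokens; a hypothetical sketch of that data representation is provided below (this is not a reproduction of Zhang’s code, and all names, token values, and vocabulary offsets are assumed solely for illustration).

# Hypothetical sketch of unifying multi-modal controls as one discrete token
# sequence, as described in the quoted passages; not Zhang's actual code.

def build_control_sequence(text_tokens, visual_tokens, preserved_blocks,
                           text_vocab_size=30000):
    # Textual control: word tokens used as-is.
    sequence = list(text_tokens)
    # Visual control(s): image-block codes offset past the text vocabulary so
    # both token types share one vocabulary without colliding.
    for codes in visual_tokens:
        sequence += [text_vocab_size + c for c in codes]
    # Preservation control: output positions fixed to the tokens of a given
    # image block; the remaining positions are left to be generated.
    fixed = {pos: text_vocab_size + c for pos, c in preserved_blocks}
    return sequence, fixed

seq, fixed = build_control_sequence(
    text_tokens=[101, 2054, 102],        # e.g., a short caption
    visual_tokens=[[7, 42, 7, 3]],       # e.g., codes for a sketch-map reference
    preserved_blocks=[(0, 7), (1, 42)],  # e.g., image blocks to keep verbatim
)
print(len(seq), fixed)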
As another non-limiting example, a close prior art, Vedantam et al. (Vedantam; "Generative Models of Visually Grounded Imagination," 2018; cited in previous Office Action(s)), discloses in Fig. 2 and subheading “Handling missing attributes” on pg. 4:
[media_image2.png and media_image3.png (greyscale): reproductions of Vedantam, Fig. 2 and the cited passage on pg. 4]
Although Vedantam discloses determining a latent code from a combined probability distribution (i.e., a “product of experts” (PoE), or product of Gaussians) corresponding to features in a latent image space, and one or more features extracted from the respective encoders (i.e., the individual probabilities, “experts”, or Gaussians), Vedantam does not disclose or reasonably teach the allowable claims as a whole, particularly “using respective encoders for each different modality of two or more modalities” comprising “at least a semantic segmentation map and an edge map” together with the remaining claim elements. Therefore, the claims are allowable over Vedantam.
As another non-limiting example, a close prior art, Wu et al. (Wu; “Multimodal Generative Models for Compositional Representation Learning,” 2019; cited as pertinent art in the previous Office Action mailed January 29, 2024), discloses in the abstract and section 3.1 on pg. 7:
[abstract] “As deep neural networks become more adept at traditional tasks, many of the most exciting new challenges concern multimodality—observations that combine diverse types, such as image and text. In this paper, we introduce a family of multimodal deep generative models derived from variational bounds on the evidence (data marginal likelihood). As part of our derivation we find that many previous multimodal variational autoencoders used objectives that do not correctly bound the joint marginal likelihood across modalities. We further generalize our objective to work with several types of deep generative model (VAE, GAN, and flow-based), and allow use of different model types for different modalities. We benchmark our models across many image, label, and text datasets, and find that our multimodal VAEs excel with and without weak supervision. Additional improvements come from use of GAN image models with VAE language models. Finally, we investigate the effect of language on learned image representations through a variety of downstream tasks, such as compositionality, bounding box prediction, and visual relation prediction. We find evidence that these image representations are more abstract and compositional than equivalent representations learned from only visual data.”
[3.1 Product of Experts] “As presented, the joint variational posterior qφ(z|x, y) could be a separate neural network with independent parameters with respect to qφ(z|x) and qφ(z|y). While this is undoubtedly expressive, it does not scale well to applications with more than two modalities: we would need to define 2^K inference networks for K modalities, (x1, ..., xK), which can quickly make learning infeasible. Previous work (Wu and Goodman, 2018; Vedantam et al., 2017) posed an elegant solution to this problem of scalability: define qφ(z|x, y) as a product of the unimodal posteriors. Precisely, they use a product-of-experts (Hinton, 1999), also called PoE, to relate the true joint posterior to the true unimodal posteriors:…”
However, Wu does not disclose or reasonably teach the allowable subject matter stated above in the context of the claim as a whole. Therefore, the claims are allowable over Wu.
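The specific product-of-experts equation is elided in the quotation above; as general background only, and not as a quotation of Wu (whose exact formulation may differ), such an approximation of the joint posterior is commonly written as
$$q_\phi(z \mid x_1, \ldots, x_K) \;\propto\; p(z) \prod_{k=1}^{K} \tilde{q}_\phi(z \mid x_k),$$
where each $\tilde{q}_\phi(z \mid x_k)$ is a unimodal “expert” and missing modalities are handled by omitting the corresponding factors from the product.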
As another non-limiting example, a close prior art, Lee et al. (Lee; “Private-Shared Disentangled Multimodal VAE for Learning of Hybrid Latent Representations,” 2020), discloses in the abstract and Fig. 1 on pg. 2:
[abstract] “Multi-modal generative models represent an important family of deep models, whose goal is to facilitate representation learning on data with multiple views or modalities. However, current deep multi-modal models focus on the inference of shared representations, while neglecting the important private aspects of data within individual modalities. In this paper, we introduce a disentangled multimodal variational autoencoder (DMVAE) that utilizes disentangled VAE strategy to separate the private and shared latent spaces of multiple modalities. We specifically consider the instance where the latent factor may be of both continuous and discrete nature, leading to the family of general hybrid DMVAE models. We demonstrate the utility of DMVAE on a semi-supervised learning task, where one of the modalities contains partial data labels, both relevant and irrelevant to the other modality. Our experiments on several benchmarks indicate the importance of the private-shared disentanglement as well as the hybrid latent representation.”
[media_image4.png (greyscale): reproduction of Lee, Fig. 1]
Although Lee discloses using respective encoders to encode at least two or more modalities (e.g., “private latent” spaces per modality) and using a generator that receives a latent code from a combined probability distribution based on the conditional inputs (e.g., the “PoE”), Lee does not disclose or reasonably teach the allowable claims as a whole, particularly that the two or more modalities comprise “at least a semantic segmentation map and an edge map” together with the remaining claim elements. Therefore, the claims are allowable over Lee.
As another non-limiting example, a close prior art, Vasco et al. (Vasco; “MHVAE: A Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning,” 2020), discloses in the abstract and Fig. 2 on pg. 3:
[abstract] “Humans are able to create rich representations of their external reality. Their internal representations allow for cross-modality inference, where available perceptions can induce the perceptual experience of missing input modalities. In this paper, we contribute the Multimodal Hierarchical Variational Auto-encoder (MHVAE), a hierarchical multimodal generative model for representation learning. Inspired by human cognitive models, the MHVAE is able to learn modality-specific distributions, of an arbitrary number of modalities, and a joint-modality distribution, responsible for cross-modality inference. We formally derive the model’s evidence lower bound and propose a novel methodology to approximate the joint-modality posterior based on modality-specific representation dropout. We evaluate the MHVAE on standard multimodal datasets. Our model performs on par with other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.”
[media_image5.png (greyscale): reproduction of Vasco, Fig. 2]
Although Vasco discloses using respective encoders to encode at least two or more modalities (e.g., “modality-specific distributions”), Vasco does not disclose or reasonably teach the allowable claims as a whole, particularly that the two or more modalities comprise “at least a semantic segmentation map and an edge map” together with the remaining claim elements. Therefore, the claims are allowable over Vasco.
As another non-limiting example, a close prior art, Lysenko et al. (Lysenko; “MVAESynth: a unified framework for multimodal data generation, modality restoration, and controlled generation,” 2021), discloses in the abstract and Fig. 1 on pg. 424:
[abstract] “Synthetic data generation is used nowadays in a number of applications with privacy issues, such as training and testing of systems for analyzing the behavior of social network users or bank customers. Very often, personal data is complex and describes different aspects of a person, some of which may be missing for some records, which makes it very hard to deal with. In this paper, we present MVAESynth, a novel framework for the data-driven generation of multimodal synthetic data. It contains our implementation of a multimodal variational auto-encoder (MVAE), which is capable of generating user multimodal personal profiles (for example, social media profiles data and transactional data) and training even with missing modalities. Extensive experimental studies of MVAESynth performance were conducted demonstrating its effectiveness compared with the available solutions for the following tasks 1) training on data with missing modalities; 2) generating realistic social network profiles; 3) restoring missing profile modalities; 4) generating profiles with the specified characteristics.”
[media_image6.png (greyscale): reproduction of Lysenko, Fig. 1]
Although Lysenko discloses encoding at least two or more modalities (e.g., “2 modalities”) and determining a combined probability distribution (e.g., the “product-of-experts (PoE)”), Lysenko does not disclose or reasonably teach the allowable claims as a whole, particularly that the two or more modalities comprise “at least a semantic segmentation map and an edge map” together with the remaining claim elements. Therefore, the claims are allowable over Lysenko.
As another non-limiting example, a close prior art, Gong et al. (Gong; “Variational Selective Autoencoder: Learning from Partially-Observed Heterogenous Data,” 2021), discloses in the abstract and Fig. 1 on pg. 4:
[abstract] “Learning from heterogeneous data poses challenges such as combining data from various sources and of different types. Meanwhile, heterogeneous data are often associated with missingness in real-world applications due to heterogeneity and noise of input sources. In this work, we propose the variational selective autoencoder (VSAE), a general framework to learn representations from partially observed heterogeneous data. VSAE learns the latent dependencies in heterogeneous data by modeling the joint distribution of observed data, unobserved data, and the imputation mask which represents how the data are missing. It results in a unified model for various downstream tasks including data generation and imputation. Evaluation on both low-dimensional and high-dimensional heterogeneous datasets for these two tasks shows improvement over state-of-the-art models.”
[media_image7.png (greyscale): reproduction of Gong, Fig. 1]
Although Gong discloses using respective encoders (i.e., the “Encoder[s]” in Fig. 1) to encode at least two or more modalities (e.g., “attribute[s]” of “heterogeneous data”) and using a generator (i.e., the “Decoder[s]” in Fig. 1) that receives a latent code from the two or more modalities as input, Gong does not disclose or reasonably teach the allowable claims as a whole, particularly that the two or more modalities comprise “at least a semantic segmentation map and an edge map” together with the remaining claim elements. Therefore, the claims are allowable over Gong.
Applicant’s claims presented December 15, 2025, form the basis for the reasons for allowance: the current prior art of record, considered individually or in combination, fails to teach or reasonably suggest the claimed features of the independent claims, structurally and functionally interconnected with the other limitations, in the manner recited in the independent and dependent claims. The examiner notes that the claimed invention, as recited in the independent claims, is indicated as allowable in its entirety. Each limitation, working in concert with the others, realizes the novelty of the claimed invention; no single limitation alone establishes allowability, rather, each limitation of the claims and their recited relationships are integral.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JULIA Z YAO whose telephone number is (571)272-2870. The examiner can normally be reached Monday - Friday (8:30AM - 5PM).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell can be reached on (571)270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/J.Z.Y./Examiner, Art Unit 2666
/EMILY C TERRELL/Supervisory Patent Examiner, Art Unit 2666