Prosecution Insights
Last updated: April 19, 2026
Application No. 18/129,778

USER INTERFACE FOR GENERATING AND MANIPULATING MOLECULAR IMAGES WITH NATURAL LANGUAGE INSTRUCTIONS

Final Rejection — §101, §103, §112
Filed: Mar 31, 2023
Examiner: KY, KEVIN
Art Unit: 2671
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 6m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% — above average (420 granted / 549 resolved; +14.5% vs TC avg)
Interview Lift: +25.3% for resolved cases with interview — strong
Typical Timeline: 2y 6m average prosecution (33 currently pending)
Career History: 582 total applications across all art units
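A quick arithmetic check of the figures above (an editor's sketch, not the dashboard's actual code; the base rates are backed out from the stated deltas):

```python
# Back out the headline examiner statistics from the raw counts shown above.
granted, resolved = 420, 549
allow_rate = granted / resolved             # 0.765 → displayed as 76%

delta_vs_tc = 0.145                         # "+14.5% vs TC avg" (stated on the page)
implied_tc_avg = allow_rate - delta_vs_tc   # ≈ 0.620, the implied Tech Center average

# Interview lift is the with-interview rate minus the without-interview rate:
# 99% with interview and a +25.3% lift imply a ~73.7% base rate for
# resolved cases without an interview.
with_interview = 0.99
without_interview = with_interview - 0.253  # ≈ 0.737
```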

Statute-Specific Performance

§101: 17.6% (-22.4% vs TC avg)
§103: 46.5% (+6.5% vs TC avg)
§102: 20.8% (-19.2% vs TC avg)
§112: 9.9% (-30.1% vs TC avg)
Tech Center averages are estimates • Based on career data from 549 resolved cases
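The per-statute deltas are consistent with a single Tech Center baseline, which can be backed out from the figures shown (an illustrative check, not the dashboard's code):

```python
# statute: (examiner rate %, delta vs Tech Center average %) — from the figures above
stats = {
    "§101": (17.6, -22.4),
    "§103": (46.5, +6.5),
    "§102": (20.8, -19.2),
    "§112": (9.9, -30.1),
}
# delta = rate - tc_avg, so tc_avg = rate - delta
implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}
# Every statute backs out to the same 40.0% Tech Center baseline.
```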

Office Action

§101 §103 §112
DETAILED ACTION

Claim Interpretation

Claims 28 and 30-32 have been analyzed under 35 USC § 101. Paragraph 78 of the specification discloses "Thus, computer-readable storage media excludes media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se." Therefore, claims 28 and 30-32 are eligible under 35 USC § 101.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 30 and 31 recite the limitation "input molecular image". It is unclear whether this is the same input molecular image recited in claim 28 or a newly introduced input molecular image. There is insufficient antecedent basis for this limitation in the claims.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 28 and 30-32 are rejected under 35 U.S.C. 103 as being unpatentable over Rombach et al. (NPL: High-Resolution Image Synthesis with Latent Diffusion Models, see IDS) in view of Edwards et al. (NPL: Translation between Molecules and Natural Language, see IDS & examiner-provided copy), in further view of Saharia et al. (NPL: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding).

Regarding claim 28, Rombach discloses computer-readable storage media comprising instructions that, when executed by a processing unit (pg. 28: "we train all our models on a single NVIDIA A100 GPU"; computer-readable storage media would be needed for the GPU to execute instructions), cause a computing device to perform acts comprising: receiving a user input comprising natural language text describing a molecular characteristic of a molecule and an input molecular image, wherein the natural language text comprises an intent edit that describes a physical property without specifying a specific structural modification (pg. 7, § 4.3.1, Transformer Encoders for LDMs: "We employ the BERT-tokenizer [14] and implement τθ as a transformer [97] to infer a latent code which is mapped into the UNet via (multi-head) cross-attention (Sec. 3.3). This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts"; see Fig. 5, samples for user-defined text prompts from the model for text-to-image synthesis, LDM-8 (KL), trained on the LAION [78] database; user-defined text prompts are open-ended and thus can include a molecular characteristic of the molecule); providing the user input to a generative machine learning model trained on pairs of molecular images and associated text (pg. 7, § 4.3.1: "For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]"); receiving from the generative machine learning model an output molecular image representing a molecule that has the physical property, wherein the output molecular image is generated by the generative machine learning model using diffusion conditioned on an encoding of the natural language text describing the molecular characteristic of the molecule (pg. 2, Generative Models for Image Synthesis; pg. 7, § 4.3.1; Fig. 5: "Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION [78] database. Samples generated with 200 DDIM steps and η = 1.0. We use unconditional guidance [32] with s = 10.0"; the diffusion model is conditioned on language prompts on LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, which would include molecular images); and providing the molecular image for displaying (Figs. 5-7, samples for user-defined text prompts from the model for text-to-image synthesis).

Rombach does not specifically teach that the image is a molecular image, nor that the training data comprises molecular images and associated text describing molecular characteristics.

Edwards teaches generating and using natural language text describing a molecular characteristic of a molecule (pg. 3, § 2.2, Text-Based de Novo Molecule Generation: "we propose generating molecules based on a natural language description of the desired molecule–this is essentially swapping the input and output for the captioning task"; see Fig. 5, examples of molecules generated by different models), a machine learning model trained on pairs of molecular images and associated text (pg. 3, § 3.1, Text2Mol Metric: "Since the ranking function uses cosine similarity between embeddings, a trained model can be repurposed for evaluating the similarity between the ground truth molecule/description and the generated description/molecule (respectively). To this end, we first train a base multi-layer perceptron (MLP) model from Text2Mol. This model is then used to generate similarities of the candidate molecule-description pairs, which can be compared to the average similarity of the ground truth molecule-description pairs."), and providing the molecular image for displaying (Fig. 5, examples of molecules generated by different models).

Saharia teaches receiving a user input comprising natural language text describing a molecular characteristic of a molecule and an input molecular image, wherein the natural language text comprises an intent edit that describes a physical property without specifying a specific structural modification (pg. 27, Figure A.12: "Super-resolution variations for some 64 × 64 generated images. We first generate the 64×64 image using 'A photo of ... .'. Given generated 64 × 64 images, we condition both the super-resolution models on different prompts in order to generate different upsampled variations. e.g. for oil painting we condition the super-resolution models on the prompt 'An oil painting of ... .'").

Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of generating and using natural language text describing molecular characteristics, and a machine learning model trained on pairs of molecular images and associated text, from Edwards, and the teaching of receiving a user input comprising natural language text describing a characteristic and an input image, wherein the natural language text comprises an intent edit that describes a physical property without specifying a specific structural modification, from Saharia, into the computer-readable storage media disclosed by Rombach. The motivation for doing so arises from the known benefit of adapting powerful generative models for specialized applications, as discussed in Rombach, Edwards, and Saharia. Substituting molecular images for molecular structure representations would have been an obvious, predictable adaptation because both representations serve as canonical forms for visualizing molecules in cheminformatics. The combination would predictably result in generating molecular images from molecular text descriptions.

Claims 30-32 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Rombach, Edwards, and Saharia as applied to claim 28 above, and further in view of Agarwal et al. (US 20220051479).

Regarding claim 30, the combination of Rombach, Edwards, and Saharia discloses the computer-readable storage media of claim 28, wherein the user input further comprises an input molecular image (Edwards pg. 6, Fig. 4, having an input molecular image), but fails to teach wherein the output molecular image is identified by the machine learning model by proximity in a latent space to an encoding of the input molecular image and an encoding of the natural language text. Agarwal teaches this limitation (¶78: "The objective function may aim to minimize the distance between the image representation (e.g., image representations used as training data, such as image representations generated by the system 100 of FIG. 1 or the system 200 of FIG. 2, or image representations received from an external source, such as a cloud-based server or database) and the text representation from a character-level convolutional neural network or a long short-term memory (LSTM) network. Stated another way, the vector encoding for the image classification may be used to guide the text encodings based on similarity to similar images. With latent space additions, a latent vector z may be used to interpolate new instances of image representations"). Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the output image is identified by the machine learning model by proximity in a latent space to an encoding of the input image and an encoding of the natural language text from Agarwal into the computer-readable storage media for generating a molecular image of a molecule from a natural language input as disclosed by the combination of Rombach, Edwards, and Saharia. The motivation for doing this is to improve automated image design using deep learning techniques.
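The "proximity in a latent space" limitation mapped to Agarwal above can be pictured with a short sketch (an editor's illustration, not code from any cited reference; the `nearest_candidate` helper, the toy 2-D embeddings, and the additive scoring rule are all assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_candidate(candidates, image_emb, text_emb):
    """Select the candidate embedding that is jointly closest to the
    input-image encoding and the text encoding (summed cosine similarity)."""
    return max(candidates, key=lambda c: cosine(c, image_emb) + cosine(c, text_emb))

# Toy 2-D embeddings: the third candidate sits between the image and text encodings.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(nearest_candidate(candidates, [1.0, 0.0], [0.0, 1.0]))  # → [0.7, 0.7]
```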
Regarding claim 31, the combination of Rombach, Edwards, Saharia, and Agarwal discloses the computer-readable storage media of claim 30, wherein the input molecular image is the output molecular image from a previous iteration (Agarwal ¶66: "The design improver 208 may be configured to support various controls for the virtual changing room, and the user may select one or more controls to change the virtual changing room, such as approving a selected apparel design, rejecting a selected apparel design, selecting a previously displayed apparel design, selecting a next apparel design, updating apparel designs, etc. In some implementations, the design improver 208 may be configured to perform gesture recognition, speech recognition, or a combination thereof, to identify selected controls; the design improver 208 may be configured to use one or more ML models, such as a convolutional neural network, to perform the gesture recognition, and the design improver 208 may be configured to use one or more other ML models (e.g., one or more ML models used by the speech-to-text converter 202 and/or the natural language processor 204) to perform the speech recognition. Alternatively, the user may enter text at the user device, and the design improver 208 may identify selected controls in text data received from the user"). Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the input molecular image is the output molecular image from a previous iteration from Agarwal into the computer-readable storage media as disclosed by the combination of Rombach, Edwards, and Saharia. The motivation for doing this is to improve automated image design using deep learning techniques.
Regarding claim 32, the combination of Rombach, Edwards, Saharia, and Agarwal discloses the computer-readable storage media of claim 30, wherein the user input further comprises an indication of a mask and the machine learning model interprets the natural language text based on a portion of the input molecular image indicated by the mask (Rombach pg. 8, § 4.5, Inpainting with Latent Diffusion: "Inpainting is the task of filling masked regions of an image with new content either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task"; see pg. 9, Fig. 11 & pg. 33, Fig. 22; see Fig. 5, samples for user-defined text prompts from the model for text-to-image synthesis).

Allowable Subject Matter

Claims 1-8, 21-27 and 33-34 are allowed.

Regarding claim 1, the prior art of record, alone or in combination, fails to teach at least "receiving from the generative machine learning model an output molecular image, wherein the mask limits how the generative machine learning model interprets the natural language text and the generative machine learning model interprets the natural language text based on the portion of the input molecular image indicated by the mask such that the output molecular image is generated by the generative machine learning model using diffusion conditioned on an encoding of the natural language text as interpreted based on the portion indicated by the mask". At best, Rombach teaches inpainting at pg. 8, § 4.5 ("Inpainting is the task of filling masked regions of an image with new content either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task"; see pg. 9, Fig. 11 & pg. 33, Fig. 22) and samples for user-defined text prompts in Fig. 5.

Regarding claim 21, the prior art of record, alone or in combination, fails to teach at least "providing the user input to a generative machine learning model trained on training data comprising pairs of molecular images and associated text, wherein the training data is generated by: creating training prompts from human-generated text using a generative text model, and pairing a one of the training prompts describing a molecular characteristic of a molecular image with the molecular image". At best, Edwards teaches at pg. 3, § 3.1, Text2Mol Metric: "Since the ranking function uses cosine similarity between embeddings, a trained model can be repurposed for evaluating the similarity between the ground truth molecule/description and the generated description/molecule (respectively). To this end, we first train a base multi-layer perceptron (MLP) model from Text2Mol. This model is then used to generate similarities of the candidate molecule-description pairs, which can be compared to the average similarity of the ground truth molecule-description pairs."

Response to Arguments

Applicant's arguments with respect to claims 28 and 30-32 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Applicant's arguments with respect to claims 1-8, 21-27 and 33-34 have been fully considered and are persuasive. The rejections of claims 1-8, 21-27 and 33-34 under 35 U.S.C. 101 and 35 U.S.C. 103 have been withdrawn. Claims 1-8, 21-27 and 33-34 are allowable.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN KY whose telephone number is (571) 272-7648. The examiner can normally be reached Monday-Friday, 9 AM-5 PM. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.

For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KEVIN KY/
Primary Examiner, Art Unit 2671
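The mask limitation that distinguishes allowed claim 1 follows the standard inpainting pattern Rombach describes: the model regenerates only the masked region, and the rest of the input image passes through unchanged. A toy sketch of that compositing step (an editor's illustration; real latent-diffusion inpainting operates on latents and noise schedules, not directly on pixel lists):

```python
def composite(original, generated, mask):
    """Keep the original value where mask == 0; take the model's output
    where mask == 1. All three are equal-length flat lists of pixel values."""
    return [g if m else o for o, g, m in zip(original, generated, mask)]

# Only the masked middle region is replaced by generated content.
print(composite([1, 2, 3, 4], [9, 9, 9, 9], [0, 1, 1, 0]))  # → [1, 9, 9, 4]
```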

Prosecution Timeline

Mar 31, 2023 — Application Filed
Aug 19, 2025 — Non-Final Rejection — §101, §103, §112
Aug 28, 2025 — Applicant Interview (Telephonic)
Aug 28, 2025 — Examiner Interview Summary
Nov 05, 2025 — Response Filed
Feb 24, 2026 — Examiner Interview (Telephonic)
Feb 25, 2026 — Final Rejection — §101, §103, §112
Apr 09, 2026 — Examiner Interview Summary
Apr 09, 2026 — Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597158 — POSE ESTIMATION — 2y 5m to grant; granted Apr 07, 2026
Patent 12597291 — IMAGE ANALYSIS FOR PERSONAL INTERACTION — 2y 5m to grant; granted Apr 07, 2026
Patent 12586393 — KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION — 2y 5m to grant; granted Mar 24, 2026
Patent 12586559 — METHOD AND APPARATUS FOR GENERATING SPEECH OUTPUTS IN A VEHICLE — 2y 5m to grant; granted Mar 24, 2026
Patent 12579382 — NATURAL LANGUAGE GENERATION USING KNOWLEDGE GRAPH INCORPORATING TEXTUAL SUMMARIES — 2y 5m to grant; granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 76%
With Interview: 99% (+25.3%)
Median Time to Grant: 2y 6m
PTA Risk: Moderate
Based on 549 resolved cases by this examiner. Grant probability derived from career allow rate.
