Prosecution Insights
Last updated: April 19, 2026
Application No. 18/129,778

USER INTERFACE FOR GENERATING AND MANIPULATING MOLECULAR IMAGES WITH NATURAL LANGUAGE INSTRUCTIONS

Final Rejection — §101, §103, §112
Filed: Mar 31, 2023
Examiner: KY, KEVIN
Art Unit: 2671
Tech Center: 2600 — Communications
Assignee: Microsoft Technology Licensing, LLC
OA Round: 2 (Final)
Grant Probability: 76% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 2y 6m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 76% — above average (420 granted / 549 resolved; +14.5% vs TC avg)
Interview Lift: +25.3% for resolved cases with interview — strong
Typical Timeline: 2y 6m average prosecution (33 currently pending)
Career History: 582 total applications across all art units
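A quick arithmetic check of the figures above (an editor's sketch, not the dashboard's actual code; the base rates are backed out from the stated deltas):

```python
# Back out the headline examiner statistics from the raw counts shown above.
granted, resolved = 420, 549
allow_rate = granted / resolved             # 0.765 → displayed as 76%

delta_vs_tc = 0.145                         # "+14.5% vs TC avg" (stated on the page)
implied_tc_avg = allow_rate - delta_vs_tc   # ≈ 0.620, the implied Tech Center average

# Interview lift is the with-interview rate minus the without-interview rate:
# 99% with interview and a +25.3% lift imply a ~73.7% base rate for
# resolved cases without an interview.
with_interview = 0.99
without_interview = with_interview - 0.253  # ≈ 0.737
```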

Statute-Specific Performance

§101: 17.6% (-22.4% vs TC avg)
§103: 46.5% (+6.5% vs TC avg)
§102: 20.8% (-19.2% vs TC avg)
§112: 9.9% (-30.1% vs TC avg)
Tech Center averages are estimates • Based on career data from 549 resolved cases
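The per-statute deltas are consistent with a single Tech Center baseline, which can be backed out from the figures shown (an illustrative check, not the dashboard's code):

```python
# statute: (examiner rate %, delta vs Tech Center average %) — from the figures above
stats = {
    "§101": (17.6, -22.4),
    "§103": (46.5, +6.5),
    "§102": (20.8, -19.2),
    "§112": (9.9, -30.1),
}
# delta = rate - tc_avg, so tc_avg = rate - delta
implied_tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in stats.items()}
# Every statute backs out to the same 40.0% Tech Center baseline.
```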

Office Action

§101 §103 §112
DETAILED ACTION

Claim Interpretation

Claims 28 and 30-32 have been analyzed under 35 USC § 101. Paragraph 78 of the specification discloses "Thus, computer-readable storage media excludes media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se." Therefore, claims 28 and 30-32 are eligible under 35 USC § 101.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 30 and 31 recite the limitation "input molecular image". It is unclear whether this is the same input molecular image recited in claim 28 or a newly introduced input molecular image. There is insufficient antecedent basis for this limitation in the claims.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 28 and 30-32 are rejected under 35 U.S.C. 103 as being unpatentable over Rombach et al. (NPL: High-Resolution Image Synthesis with Latent Diffusion Models, see IDS) in view of Edwards et al. (NPL: Translation between Molecules and Natural Language, see IDS & examiner-provided copy), in further view of Saharia et al. (NPL: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding).

Regarding claim 28, Rombach discloses computer-readable storage media comprising instructions that, when executed by a processing unit (pg. 28: "we train all our models on a single NVIDIA A100 GPU"; computer-readable storage media would be needed for the GPU to execute instructions), cause a computing device to perform acts comprising: receiving a user input comprising natural language text describing a molecular characteristic of a molecule and an input molecular image, wherein the natural language text comprises an intent edit that describes a physical property without specifying a specific structural modification (pg. 7, § 4.3.1, Transformer Encoders for LDMs: "We employ the BERT-tokenizer [14] and implement τθ as a transformer [97] to infer a latent code which is mapped into the UNet via (multi-head) cross-attention (Sec. 3.3). This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts"; see Fig. 5, samples for user-defined text prompts from the model for text-to-image synthesis, LDM-8 (KL), trained on the LAION [78] database; user-defined text prompts are open-ended and thus can include a molecular characteristic of the molecule); providing the user input to a generative machine learning model trained on pairs of molecular images and associated text (pg. 7, § 4.3.1: "For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]"); receiving from the generative machine learning model an output molecular image representing a molecule that has the physical property, wherein the output molecular image is generated by the generative machine learning model using diffusion conditioned on an encoding of the natural language text describing the molecular characteristic of the molecule (pg. 2, Generative Models for Image Synthesis; pg. 7, § 4.3.1; Fig. 5: "Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION [78] database. Samples generated with 200 DDIM steps and η = 1.0. We use unconditional guidance [32] with s = 10.0"; the diffusion model is conditioned on language prompts on LAION-400M, a dataset of 400 million CLIP-filtered image-text pairs, which would include molecular images); and providing the molecular image for displaying (Figs. 5-7, samples for user-defined text prompts from the model for text-to-image synthesis).

Rombach does not specifically teach that the image is a molecular image, nor that the training data comprises molecular images and associated text describing molecular characteristics.

Edwards teaches generating and using natural language text describing a molecular characteristic of a molecule (pg. 3, § 2.2, Text-Based de Novo Molecule Generation: "we propose generating molecules based on a natural language description of the desired molecule–this is essentially swapping the input and output for the captioning task"; see Fig. 5, examples of molecules generated by different models), a machine learning model trained on pairs of molecular images and associated text (pg. 3, § 3.1, Text2Mol Metric: "Since the ranking function uses cosine similarity between embeddings, a trained model can be repurposed for evaluating the similarity between the ground truth molecule/description and the generated description/molecule (respectively). To this end, we first train a base multi-layer perceptron (MLP) model from Text2Mol. This model is then used to generate similarities of the candidate molecule-description pairs, which can be compared to the average similarity of the ground truth molecule-description pairs."), and providing the molecular image for displaying (Fig. 5, examples of molecules generated by different models).

Saharia teaches receiving a user input comprising natural language text describing a molecular characteristic of a molecule and an input molecular image, wherein the natural language text comprises an intent edit that describes a physical property without specifying a specific structural modification (pg. 27, Figure A.12: "Super-resolution variations for some 64 × 64 generated images. We first generate the 64×64 image using 'A photo of ... .'. Given generated 64 × 64 images, we condition both the super-resolution models on different prompts in order to generate different upsampled variations. e.g. for oil painting we condition the super-resolution models on the prompt 'An oil painting of ... .'").

Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of generating and using natural language text describing molecular characteristics, and a machine learning model trained on pairs of molecular images and associated text, from Edwards, and the teaching of receiving a user input comprising natural language text describing a characteristic and an input image, wherein the natural language text comprises an intent edit that describes a physical property without specifying a specific structural modification, from Saharia, into the computer-readable storage media disclosed by Rombach. The motivation for doing so arises from the known benefit of adapting powerful generative models for specialized applications, as discussed in Rombach, Edwards, and Saharia. Substituting molecular images for molecular structure representations would have been an obvious, predictable adaptation because both representations serve as canonical forms for visualizing molecules in cheminformatics. The combination would predictably result in generating molecular images from molecular text descriptions.

Claims 30-32 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Rombach, Edwards, and Saharia as applied to claim 28 above, and further in view of Agarwal et al. (US 20220051479).

Regarding claim 30, the combination of Rombach, Edwards, and Saharia discloses the computer-readable storage media of claim 28, wherein the user input further comprises an input molecular image (Edwards pg. 6, Fig. 4, having an input molecular image), but fails to teach wherein the output molecular image is identified by the machine learning model by proximity in a latent space to an encoding of the input molecular image and an encoding of the natural language text. Agarwal teaches this limitation (¶78: "The objective function may aim to minimize the distance between the image representation (e.g., image representations used as training data, such as image representations generated by the system 100 of FIG. 1 or the system 200 of FIG. 2, or image representations received from an external source, such as a cloud-based server or database) and the text representation from a character-level convolutional neural network or a long short-term memory (LSTM) network. Stated another way, the vector encoding for the image classification may be used to guide the text encodings based on similarity to similar images. With latent space additions, a latent vector z may be used to interpolate new instances of image representations"). Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the output image is identified by the machine learning model by proximity in a latent space to an encoding of the input image and an encoding of the natural language text from Agarwal into the computer-readable storage media for generating a molecular image of a molecule from a natural language input as disclosed by the combination of Rombach, Edwards, and Saharia. The motivation for doing this is to improve automated image design using deep learning techniques.
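The "proximity in a latent space" limitation mapped to Agarwal above can be pictured with a short sketch (an editor's illustration, not code from any cited reference; the `nearest_candidate` helper, the toy 2-D embeddings, and the additive scoring rule are all assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_candidate(candidates, image_emb, text_emb):
    """Select the candidate embedding that is jointly closest to the
    input-image encoding and the text encoding (summed cosine similarity)."""
    return max(candidates, key=lambda c: cosine(c, image_emb) + cosine(c, text_emb))

# Toy 2-D embeddings: the third candidate sits between the image and text encodings.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(nearest_candidate(candidates, [1.0, 0.0], [0.0, 1.0]))  # → [0.7, 0.7]
```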
Regarding claim 31, the combination of Rombach, Edwards, Saharia, and Agarwal discloses the computer-readable storage media of claim 30, wherein the input molecular image is the output molecular image from a previous iteration (Agarwal ¶66: "The design improver 208 may be configured to support various controls for the virtual changing room, and the user may select one or more controls to change the virtual changing room, such as approving a selected apparel design, rejecting a selected apparel design, selecting a previously displayed apparel design, selecting a next apparel design, updating apparel designs, etc. In some implementations, the design improver 208 may be configured to perform gesture recognition, speech recognition, or a combination thereof, to identify selected controls; the design improver 208 may be configured to use one or more ML models, such as a convolutional neural network, to perform the gesture recognition, and the design improver 208 may be configured to use one or more other ML models (e.g., one or more ML models used by the speech-to-text converter 202 and/or the natural language processor 204) to perform the speech recognition. Alternatively, the user may enter text at the user device, and the design improver 208 may identify selected controls in text data received from the user"). Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the invention to have implemented the teaching of wherein the input molecular image is the output molecular image from a previous iteration from Agarwal into the computer-readable storage media as disclosed by the combination of Rombach, Edwards, and Saharia. The motivation for doing this is to improve automated image design using deep learning techniques.
Regarding claim 32, the combination of Rombach, Edwards, Saharia, and Agarwal discloses the computer-readable storage media of claim 30, wherein the user input further comprises an indication of a mask and the machine learning model interprets the natural language text based on a portion of the input molecular image indicated by the mask (Rombach pg. 8, § 4.5, Inpainting with Latent Diffusion: "Inpainting is the task of filling masked regions of an image with new content either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task"; see pg. 9, Fig. 11 & pg. 33, Fig. 22; see Fig. 5, samples for user-defined text prompts from the model for text-to-image synthesis).

Allowable Subject Matter

Claims 1-8, 21-27 and 33-34 are allowed.

Regarding claim 1, the prior art of record, alone or in combination, fails to teach at least "receiving from the generative machine learning model an output molecular image, wherein the mask limits how the generative machine learning model interprets the natural language text and the generative machine learning model interprets the natural language text based on the portion of the input molecular image indicated by the mask such that the output molecular image is generated by the generative machine learning model using diffusion conditioned on an encoding of the natural language text as interpreted based on the portion indicated by the mask". At best, Rombach teaches inpainting at pg. 8, § 4.5 ("Inpainting is the task of filling masked regions of an image with new content either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task"; see pg. 9, Fig. 11 & pg. 33, Fig. 22) and samples for user-defined text prompts in Fig. 5.

Regarding claim 21, the prior art of record, alone or in combination, fails to teach at least "providing the user input to a generative machine learning model trained on training data comprising pairs of molecular images and associated text, wherein the training data is generated by: creating training prompts from human-generated text using a generative text model, and pairing a one of the training prompts describing a molecular characteristic of a molecular image with the molecular image". At best, Edwards teaches at pg. 3, § 3.1, Text2Mol Metric: "Since the ranking function uses cosine similarity between embeddings, a trained model can be repurposed for evaluating the similarity between the ground truth molecule/description and the generated description/molecule (respectively). To this end, we first train a base multi-layer perceptron (MLP) model from Text2Mol. This model is then used to generate similarities of the candidate molecule-description pairs, which can be compared to the average similarity of the ground truth molecule-description pairs."

Response to Arguments

Applicant's arguments with respect to claims 28 and 30-32 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Applicant's arguments with respect to claims 1-8, 21-27 and 33-34 have been fully considered and are persuasive. The rejections of claims 1-8, 21-27 and 33-34 under 35 U.S.C. 101 and 35 U.S.C. 103 have been withdrawn. Claims 1-8, 21-27 and 33-34 are allowable.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN KY whose telephone number is (571) 272-7648. The examiner can normally be reached Monday-Friday, 9 AM-5 PM. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Vincent Rudolph, can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.

For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KEVIN KY/
Primary Examiner, Art Unit 2671
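The mask limitation that distinguishes allowed claim 1 follows the standard inpainting pattern Rombach describes: the model regenerates only the masked region, and the rest of the input image passes through unchanged. A toy sketch of that compositing step (an editor's illustration; real latent-diffusion inpainting operates on latents and noise schedules, not directly on pixel lists):

```python
def composite(original, generated, mask):
    """Keep the original value where mask == 0; take the model's output
    where mask == 1. All three are equal-length flat lists of pixel values."""
    return [g if m else o for o, g, m in zip(original, generated, mask)]

# Only the masked middle region is replaced by generated content.
print(composite([1, 2, 3, 4], [9, 9, 9, 9], [0, 1, 1, 0]))  # → [1, 9, 9, 4]
```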

Prosecution Timeline

Mar 31, 2023 — Application Filed
Aug 19, 2025 — Non-Final Rejection — §101, §103, §112
Aug 28, 2025 — Applicant Interview (Telephonic)
Aug 28, 2025 — Examiner Interview Summary
Nov 05, 2025 — Response Filed
Feb 24, 2026 — Examiner Interview (Telephonic)
Feb 25, 2026 — Final Rejection — §101, §103, §112
Apr 09, 2026 — Examiner Interview Summary
Apr 09, 2026 — Applicant Interview (Telephonic)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12597158 — POSE ESTIMATION — 2y 5m to grant; granted Apr 07, 2026
Patent 12597291 — IMAGE ANALYSIS FOR PERSONAL INTERACTION — 2y 5m to grant; granted Apr 07, 2026
Patent 12586393 — KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION — 2y 5m to grant; granted Mar 24, 2026
Patent 12586559 — METHOD AND APPARATUS FOR GENERATING SPEECH OUTPUTS IN A VEHICLE — 2y 5m to grant; granted Mar 24, 2026
Patent 12579382 — NATURAL LANGUAGE GENERATION USING KNOWLEDGE GRAPH INCORPORATING TEXTUAL SUMMARIES — 2y 5m to grant; granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 76%
With Interview: 99% (+25.3%)
Median Time to Grant: 2y 6m
PTA Risk: Moderate
Based on 549 resolved cases by this examiner. Grant probability derived from career allow rate.
