Last updated: May 29, 2026
Application No. 19/057,629
SYSTEMS AND METHODS FOR PERFORMING MEDICAL TASKS USING A MEDICAL ARTIFICIAL INTELLIGENCE SYSTEM

Non-Final OA: §101, §102, §103
Filed
Feb 19, 2025
Priority
Feb 20, 2024 — provisional 63/555,589 +1 more
Examiner
ILAGAN, VINCENT CAESAR
Art Unit
3686
Tech Center
3600 — Transportation & Electronic Commerce
Assignee
President and Fellows of Harvard College
OA Round
1 (Non-Final)
This examiner grants 42% of cases after interview (+63.6% interview lift). A telephonic interview to clarify the technical implementation could significantly improve the outcome.
Based on 12 resolved cases, 2023–2026
Examiner Intelligence

Examiner: ILAGAN, VINCENT CAESAR
Career allowance rate: 42% of resolved cases (5 granted / 12 resolved), -10.3% vs TC avg
Interview lift: +63.6%
Average prosecution time: 2y 8m
Currently pending: 16
Statute-Specific Performance

§101: 1.2% (-38.8% vs TC avg)
§103: 90.5% (+50.5% vs TC avg)
§102: 7.1% (-32.9% vs TC avg)
§112: 1.2% (-38.8% vs TC avg)
Based on career data from 12 resolved cases
Office Action

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
The office action is in response to the claims filed on February 19, 2025, for the application filed on February 19, 2025, which claims priority to Provisional Application Nos. 63/555,589 filed on February 20, 2024 and 63/647,326 filed on May 14, 2024. Claims 1 – 20 are currently pending and have been examined as discussed below.

Claim Objections
Claim 12 is objected to because of the following informalities: the limitation of “the text input” in line 2 of claim 12 should be changed to “a text input”; and the limitation of “the multi-modal output” in line 4 of claim 12 should be changed to “a multi-modal output”. Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1 – 20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
	Examiners should determine whether a claim satisfies the criteria for subject matter eligibility by evaluating the claim in accordance with the flowchart in MPEP 2106(III).

Eligibility Step 1:
Under Step 1 of the 2019 Revised Patent Subject Matter Eligibility Guidance, it must be determined whether each claim as a whole falls within one of the statutory categories of invention (i.e., a process, machine, manufacture, or composition of matter). See MPEP 2106.03. In the instant application, claims 1 – 18 are directed to a method (i.e., a process); claim 19 is directed to a method (i.e., a process); and claim 20 is directed to a system (i.e., a machine).
	While each one of claims 1 – 20 appears to fall within one or more statutory categories of invention, the Office has determined that the full eligibility analysis is required because there is doubt as to whether the applicant is effectively seeking coverage for a judicial exception itself. The eligibility of each claim is not self-evident at least because each claim as a whole did not appear to clearly improve a technology or computer functionality. To the contrary, each claim as a whole appeared to merely apply one or more judicial exceptions on a computer.
	Accordingly, it has been determined that each one of claims 1 – 20 as a whole falls within one or more statutory categories under Step 1, and the Office proceeds with the full eligibility analysis (the Alice/Mayo test described in MPEP 2106(III)) as discussed below.

Eligibility Step 2A, Prong One:
Under Step 2A, Prong One of the 2019 Revised Patent Subject Matter Eligibility Guidance, it must be determined whether each claim is directed to one or more of the judicial exceptions (i.e., an abstract idea, law of nature, or natural phenomenon). See MPEP 2106.04(II)(A)(1). After evaluation, it has been determined that claims 1 – 20 are directed to judicial exceptions because claims 1 – 20 recite an abstract idea. (The Office will not determine that a claim is not directed to a judicial exception under Step 2A, Prong One for the mere reason that the claim further recites one or more additional elements beyond the judicial exception.)

Independent claims 1 and 20 are determined to be directed to a judicial exception including abstract ideas (i.e., mental processes). Representative claim 20 recites the mental process identified in bold as:
A system comprising:
at least one computer hardware processor; and
at least one non-transitory computer-readable storage medium having encoded thereon instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a trained large language model (LLM) and a plurality of task-specific software tools, the method comprising: 
using the at least one computer hardware processor to perform:
receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input; 
processing at least a portion of the multi-modal input using the trained LLM to obtain LLM output, the LLM output indicating that zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools; 
when the LLM output indicates that zero tasks are to be additionally performed, outputting at least some of the LLM output as a response to the request; and
when the LLM output indicates that one or multiple tasks are to be additionally performed,
processing at least some of the multi-modal input, using the at least one of the plurality of task-specific software tools and the LLM output, to obtain at least one task-specific output; 
outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request. 
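
For readers following the claim language, the branching recited in representative claim 20 (zero tasks versus one or multiple tasks) can be sketched roughly as follows. This is an illustration only, not the applicant's implementation: the function `respond`, the stand-in `toy_llm`, the tool registry `toy_tools`, and the way outputs are combined are all hypothetical names and choices made for this sketch.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a tool receives the multi-modal input and the LLM text
# and returns a task-specific output string.
Tool = Callable[[dict, str], str]


def respond(multi_modal_input: dict,
            llm: Callable[[dict], Tuple[str, List[str]]],
            tools: Dict[str, Tool]) -> str:
    """Follow the zero / one-or-multiple branching described in claim 20."""
    llm_text, requested_tasks = llm(multi_modal_input)
    if not requested_tasks:
        # Zero additional tasks: the LLM output itself is the response.
        return llm_text
    # One or multiple tasks: run each requested tool, then combine outputs.
    tool_outputs = [tools[name](multi_modal_input, llm_text)
                    for name in requested_tasks]
    return llm_text + "\n" + "\n".join(tool_outputs)


def toy_llm(inp: dict) -> Tuple[str, List[str]]:
    """Stand-in for a trained LLM: requests segmentation only when asked."""
    tasks = ["segment"] if "segment" in inp["request"] else []
    return "Findings: no acute abnormality.", tasks


# Hypothetical registry with a single task-specific tool.
toy_tools: Dict[str, Tool] = {
    "segment": lambda inp, llm_text: f"mask for {inp['image']}",
}
```

In this sketch the response for a plain description request is the LLM text alone, while a segmentation request appends the tool output to it.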

Independent claim 19 recites the mental process identified in bold as:
A method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a plurality of modules including a multi-modal input coordinator module, an orchestrator module comprising a trained large language model (LLM), and a plurality of task-specific software tools, the method comprising:
executing the multi-modal input coordinator module, using at least one computer hardware processor, to perform: 
receiving multi-modal input comprising image input and text input indicating a request that the MAI system perform at least one medical task on the multi-modal input; 
processing the multi-modal input to obtain a tokenized representation of the image input and the text input; and
processing the tokenized representation using the trained LLM to obtain LLM output at least partially responsive to the request, the LLM output comprising latent embeddings and textual output, the LLM output indicating zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools; 
when the LLM output indicates that zero tasks are to be additionally performed by the at least one of the plurality of task-specific software tools, outputting the textual output as a response to the request of the MAI system; and
when the LLM output indicates that one or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools,
identifying, based on the LLM output and from among the plurality of task-specific software tools, a first task-specific software tool; 
generating, from the latent embeddings and the multi-modal input, first input for the first task-specific software tool and processing the first input with the first task-specific software tool to obtain a first task-specific output; 
generating an integrated response to the request of the MAI system using the textual output produced by the trained LLM and the first task-specific output generated by the first task-specific software tool; and
outputting the integrated response as a response to the request of the MAI system.
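
Claim 19 adds a tokenization step, latent embeddings, and an integrated response to the flow of claim 20. A minimal sketch of that sequence, under stated assumptions, might look like the following. The `LLMOutput` container, the toy tokenization scheme, the `toy_llm` stand-in, and the bracketed integration format are all hypothetical and are not drawn from the application.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LLMOutput:
    text: str                                        # textual output
    embeddings: List[float]                          # latent embeddings
    tasks: List[str] = field(default_factory=list)   # zero, one, or multiple


def tokenize(image: List[int], text: str) -> List[str]:
    """Coordinator step: one token sequence covering image and text (toy scheme)."""
    visual_tokens = [f"<img:{pixel}>" for pixel in image]
    return visual_tokens + text.lower().split()


def run_claim19(image: List[int], text: str,
                llm: Callable[[List[str]], LLMOutput],
                tools: Dict[str, Callable[[dict], str]]) -> str:
    tokens = tokenize(image, text)
    out = llm(tokens)
    if not out.tasks:
        return out.text                      # zero tasks: textual output only
    first_task = out.tasks[0]
    first_tool = tools[first_task]           # identify the first tool
    # Tool input is built from the latent embeddings and the original input.
    tool_input = {"embeddings": out.embeddings, "image": image, "text": text}
    tool_output = first_tool(tool_input)
    return f"{out.text} [{first_task}: {tool_output}]"   # integrated response


def toy_llm(tokens: List[str]) -> LLMOutput:
    """Stand-in LLM: requests detection when the word 'detect' is present."""
    tasks = ["detect"] if "detect" in tokens else []
    return LLMOutput(text="Report drafted.", embeddings=[0.1, 0.2], tasks=tasks)


toy_tools = {
    "detect": lambda inp: f"{len(inp['embeddings'])}-dim embedding, {len(inp['image'])} pixels",
}
```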

	Claim 20 recites the combination of limitations identified as “perform a method for performing medical tasks,” “processing at least a portion of the multi-modal input … to obtain … output, the … output indicating that zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools,” “when the … output indicates that zero tasks are to be additionally performed, outputting at least some of the … output as a response to the request,” “when the … output indicates that one or multiple tasks are to be additionally performed,” and “processing at least some of the multi-modal input, using … the … output, to obtain at least one task-specific output.” Claim 19 recites the combination of limitations identified as “a method for performing medical tasks,” “a request that … perform at least one medical task on the multi-modal input,” “processing the multi-modal input to obtain a tokenized representation of the image input and the text input,” “processing the tokenized representation … to obtain … output at least partially responsive to the request, the … output comprising latent embeddings and textual output, the … output indicating zero, one, or multiple tasks are to be additionally performed,” “when the … output indicates that zero tasks are to be additionally performed …, outputting the textual output as a response to the request,” “when the … output indicates that one or multiple tasks are to be additionally performed,” “identifying, based on the … output …, a first task-specific software tool,” “generating, from the latent embeddings and the multi-modal input, first input … and processing the first input … to obtain a first task-specific output,” and “generating an integrated response to the request of the MAI system using the textual output produced by the trained LLM and the first task-specific output generated.” A broadest reasonable interpretation of each combination amounts to determining a response to a request for a medical task (i.e., determining a multi-modal output such as a text output and an image output, including one or more of a medical report, a segmented image, an indication of a classified medical condition, a comparison of longitudinal study images, an indication of a detected anatomical structure and/or an indication of a detected abnormality). This activity may be practically performed in the human mind using observation, evaluation, judgment, and opinion, and thus represents an abstract idea falling in the “mental process” grouping. With the exception of generic computer-implemented steps, there is nothing in the claims themselves that forecloses these steps from being performed by a human, mentally or with tools such as pen and paper.
	Accordingly, claims 1 and 19 – 20 recite judicial exceptions under Step 2A, Prong One.

Dependent claims 2 – 18 are directed to one or more judicial exceptions (i.e., abstract idea exceptions) under Step 2A, Prong One of the full eligibility analysis as follows:
	Claims 2 – 18 recite the combinations of limitations identified in bold as “the image input comprises one or more medical images comprising one or more radiographs, dermoscopy images, computed tomography scans, pathology images, ultrasound images, endoscopy images, and/or magnetic resonance imaging (MRI) images” in claim 2, “the image input comprises one or more two-dimensional medical images, one or more three-dimensional images, and/or one or more videos” in claim 3, “the image input comprises a single image” in claim 4, “the image input comprises multiple images” in claim 5, “the multiple images comprise multiple views of at least a portion of patient” in claim 6, “the multiple images comprise a same view of at least a portion of a patient at multiple points in time” in claim 7, “the response comprises a multi-modal output” in claim 8, “the multi-modal output comprises a text output and an image output” in claim 9, “the LLM output comprises the text output and the at least one task-specific output comprises the image output” in claim 10, “the at least one medical task comprises one or more vision-language tasks comprising one or more of medical report generation, longitudinal study comparison, region-of-interest captioning, open-ended visual question answering, and/or abnormality (e.g., skin lesion) classification, and/or one or more vision-centric tasks comprising one or more medical image analysis tasks including one or more of anatomical structure identification, abnormality characterization, chest abnormality detection, lesion segmentation, and/or organ segmentation” in claim 11, “processing the multi-modal input to obtain a tokenized representation of the image input and the text input at least in part by processing the text input to generate text tokens and processing the image input to generate visual tokens, and wherein the at least a portion of the multi-modal output comprises the tokenized representation of the image input and the text input” in claim 12 (i.e., a manual tokenized representation of an image by a human is the translation of visual information into structured, discrete semantic concepts or labels), “processing the multi-modal input to obtain the tokenized representation of the image input and the text input further comprises adapting the visual tokens into text tokens using a model trained to transform visual tokens into text tokens” in claim 13, “when the image input comprises a two-dimensional (2D) image, the processing the image input comprises processing the 2D image using a 2D vision encoder” and “when the image input comprises a three-dimensional (3D) image, the processing the image input comprises processing the 3D image using a 3D vision encoder” in claim 14, “determining whether the LLM output indicates zero, one, or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools” in claim 15, “determining whether the LLM output indicates zero, one, or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools comprises determining whether the LLM output comprises at least one tag associated with at least one respective task” in claim 16, “identifying the at least one of the plurality of task-specific software tools using the at least one tag, wherein the at least one tag comprises a first tag associated with a first task, and the at least one of the plurality of task-specific software tools is trained to perform the first task” in claim 17, and “the image input comprises a two-dimensional image,” “the LLM output indicates that a detection task is to be additionally performed by the at least one of the plurality of task-specific software tools,” “the at least one of the plurality of task-specific software tools comprises a task-specific software tool trained to perform the detection task on the two-dimensional image,” “the at least one task-specific output comprises a second textual output, wherein obtaining the at least one task-specific output comprises generating the second textual output,” and “the response comprises the LLM output and the second textual output, wherein outputting the response comprises outputting the LLM output and the second textual output” in claim 18. A broadest reasonable interpretation of each combination further defines the activity of determining a response to a request for a medical task (i.e., determining a multi-modal output such as a text output and an image output, including one or more of a medical report, a segmented image, an indication of a classified medical condition, a comparison of longitudinal study images, an indication of a detected anatomical structure and/or an indication of a detected abnormality). This activity may be practically performed in the human mind using observation, evaluation, judgment, and opinion, and thus represents an abstract idea falling in the “mental process” grouping. With the exception of generic computer-implemented steps, there is nothing in the claims themselves that forecloses these steps from being performed by a human, mentally or with tools such as pen and paper.
	Accordingly, claims 2 – 18 recite judicial exceptions under Step 2A, Prong One.
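
The tag-based determination recited in claims 15 – 17 (checking whether the LLM output contains at least one tag associated with a task, then identifying the tool by that tag) can be illustrated with a short sketch. The tag syntax `<task:name>`, the function names, and the registry contents are assumptions made for illustration; the claims do not specify any tag format.

```python
import re
from typing import Dict, List

# Hypothetical tag syntax; the claims do not define one.
TAG_PATTERN = re.compile(r"<task:([a-z_]+)>")


def extract_task_tags(llm_text: str) -> List[str]:
    """Return task tags in order of first appearance, without duplicates."""
    seen: List[str] = []
    for tag in TAG_PATTERN.findall(llm_text):
        if tag not in seen:
            seen.append(tag)
    return seen


def select_tools(llm_text: str, registry: Dict[str, str]) -> List[str]:
    """Map each recognized tag to the tool registered for that task."""
    return [registry[tag] for tag in extract_task_tags(llm_text) if tag in registry]
```

An empty result corresponds to the zero-task branch, a single entry to one task, and several entries to multiple tasks.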

Eligibility Step 2A, Prong Two:
Claims 1 and 20 recite additional limitations beyond the judicial exceptions. Representative claim 20 recites the additional limitations identified in bold as:
A system comprising:
at least one computer hardware processor; and
at least one non-transitory computer-readable storage medium having encoded thereon instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a trained large language model (LLM) and a plurality of task-specific software tools, the method comprising: 
using the at least one computer hardware processor to perform:
receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input; 
processing at least a portion of the multi-modal input using the trained LLM to obtain LLM output, the LLM output indicating that zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools; 
when the LLM output indicates that zero tasks are to be additionally performed, outputting at least some of the LLM output as a response to the request; and
when the LLM output indicates that one or multiple tasks are to be additionally performed,
processing at least some of the multi-modal input, using the at least one of the plurality of task-specific software tools and the LLM output, to obtain at least one task-specific output; 
outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request. 

Claim 19 recites the additional limitations identified in bold as:
A method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a plurality of modules including a multi-modal input coordinator module, an orchestrator module comprising a trained large language model (LLM), and a plurality of task-specific software tools, the method comprising:
executing the multi-modal input coordinator module, using at least one computer hardware processor, to perform: 
receiving multi-modal input comprising image input and text input indicating a request that the MAI system perform at least one medical task on the multi-modal input; 
processing the multi-modal input to obtain a tokenized representation of the image input and the text input; and
processing the tokenized representation using the trained LLM to obtain LLM output at least partially responsive to the request, the LLM output comprising latent embeddings and textual output, the LLM output indicating zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools; 
when the LLM output indicates that zero tasks are to be additionally performed by the at least one of the plurality of task-specific software tools, outputting the textual output as a response to the request of the MAI system; and
when the LLM output indicates that one or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools,
identifying, based on the LLM output and from among the plurality of task-specific software tools, a first task-specific software tool; 
generating, from the latent embeddings and the multi-modal input, first input for the first task-specific software tool and processing the first input with the first task-specific software tool to obtain a first task-specific output; 
generating an integrated response to the request of the MAI system using the textual output produced by the trained LLM and the first task-specific output generated by the first task-specific software tool; and
outputting the integrated response as a response to the request of the MAI system.

	Claim 20 recites the additional limitations identified in bold as “a system,” “at least one computer hardware processor,” “at least one non-transitory computer-readable storage medium having encoded thereon instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor … using a medical artificial intelligence (MAI) system, the MAI system comprising a trained large language model (LLM) and a plurality of task-specific software tools,” “using the at least one computer hardware processor to perform,” “receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input,” “the trained LLM,” “the LLM output,” “using the at least one of the plurality of task-specific software tools and the LLM output,” and “outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request.” Claim 19 recites the additional limitations identified in bold as “a medical artificial intelligence (MAI) system, the MAI system comprising a plurality of modules including a multi-modal input coordinator module, an orchestrator module comprising a trained large language model (LLM), and a plurality of task-specific software tools,” “executing the multi-modal input coordinator module, using at least one computer hardware processor, to perform,” “receiving multi-modal input comprising image input and text input indicating … the MAI system,” “using the trained LLM to obtain LLM output,” “the LLM output,” “at least one of the plurality of task-specific software tools,” “the MAI system,” “the LLM output and from among the plurality of task-specific software tools, a first task-specific software tool,” “the trained LLM,” and “outputting the integrated response as a response to the request of the MAI system.” Each of these elements is an additional limitation beyond the judicial exception, i.e., the mental process of determining a response to a request for a medical task. At best, looking at the combination of all additional elements and the judicial exception, claim 20 as a whole amounts to a general purpose computer (i.e., having a computer hardware processor, a non-transitory computer-readable storage medium (CRM), a medical artificial intelligence (MAI) system comprising a plurality of modules including a multi-modal input coordinator module, an orchestrator module comprising a trained large language model (LLM), and a plurality of task-specific software tools) added post-hoc to the abstract idea of determining a response to a request for a medical task. 
	MPEP 2106.05(a) states: “In determining patent eligibility, examiners should consider whether the claim ‘purport(s) to improve the functioning of the computer itself’ or ‘any other technology or technical field.’… [A]n improvement in the abstract idea itself is not an improvement in technology.” Furthermore, MPEP 2106.05(a)(II) states: “Merely adding generic computer components to perform the method is not sufficient.” In the instant application, each one of claims 1 and 19 – 20 as a whole does not improve the functioning of computer components (i.e., the processor, the non-transitory CRM, the MAI system, the multi-modal input coordinator module, the orchestrator module comprising the trained LLM, and the task-specific software tools); nor does each claim as a whole improve any other technology or technical field. The computer components are general purpose computer components added post-hoc to the abstract idea of determining a response to a request for a medical task, i.e., determining a multi-modal output such as a text output and an image output. The claim as a whole improves exclusively upon the abstract idea itself by using conventional and generic computer technology in the nascent but well-known environment of artificial intelligence for merely automating the manual process of determining a response to a request for a medical task. See MPEP 2106.05(a). The claim as a whole represents mere instructions to apply the abstract idea to conventional and generic computer technology recited at a high level of generality. See MPEP 2106.05(f).
Regarding the consideration under MPEP 2106.05(g), the limitations of “receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input,” “receiving multi-modal input comprising image input and text input indicating a request that the MAI system perform at least one medical task on the multi-modal input,” “outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request,” and “outputting the integrated response as a response to the request of the MAI system” are determined to add no more than insignificant extra-solution activity to the judicial exception. The limitations of “receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input” and “receiving multi-modal input comprising image input and text input indicating a request that the MAI system perform at least one medical task on the multi-modal input” represent the well-known pre-solution activity of necessary data gathering because each limitation represents an activity incidental to the primary process of each claim as a whole (i.e., determining a response to a request for a medical task), and thus those limitations are merely nominal or tangential additions to the claim. The limitations of “outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request” and “outputting the integrated response as a response to the request of the MAI system” represent the well-known post-solution activity of data outputting because each limitation represents an activity incidental to the primary process of each claim as a whole, and thus those limitations are merely nominal or tangential additions to the claim. 
Regarding the consideration under MPEP 2106.05(h), the additional limitations, individually or in combination, also amount to merely indicating a field of use or technological environment in which to apply the judicial exception, i.e., artificial intelligence. In the instant application, the additional limitations (i.e., the processor, the non-transitory CRM, the MAI system, the multi-modal input coordinator module, the orchestrator module comprising the trained LLM, and the task-specific software tools) do no more than link the abstract idea (i.e., determining a response to a request for a medical task) to the particular technological environment of artificial intelligence. Thus, the additional limitations fail to add an inventive concept to the claims.
	Accordingly, in view of these considerations, the Office has determined that each one of claims 1 and 19 – 20 as a whole does not integrate the abstract idea exception into a practical application under Step 2A, Prong Two, and thus each claim as a whole is directed to a judicial exception under Step 2A.

Dependent claims 2 – 18 present additional information in tandem with further details regarding elements and the abstract idea from an associated one of independent claims 1 and 19 – 20 and are therefore directed to an abstract idea for similar reasons as given under Step 2A, Prong One above. Claims 2 – 12 and 15 – 18 do not recite any additional limitations beyond the abstract idea of determining a response to a request for a medical task. Claims 13 – 14 recite additional limitations beyond the judicial exception.
	Claims 13 – 14 recite the combinations of limitations identified in bold as “processing the multi-modal input to obtain the tokenized representation of the image input and the text input further comprises adapting the visual tokens into text tokens using a model trained to transform visual tokens into text tokens” in claim 13 and “when the image input comprises a two-dimensional (2D) image, the processing the image input comprises processing the 2D image using a 2D vision encoder” and “when the image input comprises a three-dimensional (3D) image, the processing the image input comprises processing the 3D image using a 3D vision encoder” in claim 14. Each of these elements is an additional limitation beyond the judicial exception (i.e., the mental process of determining a response to a request for a medical task). At best, looking at the combination of all additional elements and the judicial exception, each claim as a whole amounts to using a general purpose computer (i.e., using a model trained to transform visual tokens into text tokens, using a 2D vision encoder, and using a 3D vision encoder) added post-hoc to the abstract idea of determining a response to a request for a medical task. See MPEP 2106.05(a).
	Accordingly, the Office has determined that each one of claims 2 – 18 as a whole does not integrate the abstract idea exception into a practical application under Step 2A, Prong Two, and thus each claim as a whole is directed to a judicial exception under Step 2A.
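
The conditional encoder selection quoted from claim 14 (a 2D vision encoder for 2D images, a 3D vision encoder for 3D images) amounts to a dispatch on the dimensionality of the image input. A minimal sketch using nested Python lists as stand-in images follows. The functions `array_rank`, `encode_2d`, `encode_3d`, and `encode_image` are hypothetical placeholders, and the "encoders" merely report the input shape rather than performing any real encoding.

```python
def array_rank(a) -> int:
    """Count nesting depth of a list-of-lists image (2 for 2D, 3 for 3D)."""
    rank = 0
    while isinstance(a, (list, tuple)):
        rank += 1
        a = a[0] if a else None
    return rank


def encode_2d(img) -> str:
    # Placeholder for a 2D vision encoder: reports height x width only.
    return f"2d-encoder:{len(img)}x{len(img[0])}"


def encode_3d(vol) -> str:
    # Placeholder for a 3D vision encoder: reports depth x height x width only.
    return f"3d-encoder:{len(vol)}x{len(vol[0])}x{len(vol[0][0])}"


def encode_image(image) -> str:
    """Route the image to the encoder matching its dimensionality."""
    rank = array_rank(image)
    if rank == 2:
        return encode_2d(image)
    if rank == 3:
        return encode_3d(image)
    raise ValueError(f"unsupported image rank: {rank}")
```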

Eligibility Step 2B:
Regarding independent claims 1 and 19 – 20, the Office carries over its identification of the additional elements (and combinations thereof) from Step 2A, Prong Two so as to apply the same additional elements in Step 2B. See MPEP 2106.05(II). The Office further carries over its conclusions from the considerations discussed in MPEP 2106.05(a) through (c), (e) through (h) in Step 2A, Prong Two so as to apply the same considerations in Step 2B. 
	Under Step 2B of the 2019 Revised Patent Subject Matter Eligibility Guidance, it must be determined whether the claim provides an inventive concept by determining if the claims include additional elements or a combination of elements that are sufficient to amount to significantly more than the judicial exception. After evaluation, there is no indication that an additional element or combination of elements is sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, each claim as a whole does not provide an improvement to technology or technical field under MPEP 2106.05(a). Each claim as a whole improves exclusively upon the abstract idea itself by using conventional and generic computer technology in the nascent but well-known environment of artificial intelligence for merely automating the manual process of determining a response to a request for a medical task (i.e., determining a multi-modal output such as a text output and an image output). The additional limitations amount to mere instructions to apply an abstract idea under MPEP 2106.05(f) and/or necessary data gathering and/or outputting under MPEP 2106.05(g). Each claim as a whole recites the processor, the non-transitory CRM, the MAI system, the multi-modal input coordinator module, the orchestrator module comprising the trained LLM, and the task-specific software tools at a high level of generality, with their functions claimed in a merely generic manner such that each claim as a whole represents the well‐understood, routine, and conventional functions of a computer system (i.e., having a processor, a non-transitory CRM, an artificial intelligence system including a plurality of modules, a trained LLM, and task-specific software tools) for determining a response to a request for a medical task, i.e., determining a multi-modal output such as a text output and an image output. 
Evidence that the computer-implemented method of generating one or more prompts to be processed by a content machine-learning model for generating the customized content is a well‐understood, routine, and conventional function is provided by UzZaman (U.S. Pub. No. 2024/0249081 A1). 
	Furthermore, looking at the limitations individually or as any ordered combination adds nothing that is not already present when looking at each claim as a whole. There is no indication that the individual elements or combinations of elements amount to an inventive concept.
	Therefore, claims 1 and 19 – 20 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.

Regarding claims 2 – 18, the Office carries over its determination from Step 2A, Prong Two that claims 2 – 18 do not further recite additional limitations beyond the judicial exception (i.e., the abstract idea of determining a response to a request for a medical task, i.e., determining a multi-modal output such as a text output and an image output) so as to apply the same determination in Step 2B. See MPEP 2106.05(II). The Office further carries over its conclusions from the considerations discussed in MPEP 2106.05(a) through (c), (e) through (h) in Step 2A, Prong Two so as to apply the same considerations in Step 2B. The dependent claims merely present additional abstract information in tandem with further details regarding the elements from the independent claims and are, therefore, directed to an abstract idea for similar reasons as given above. Claims 2 – 18 do not recite any additional limitations beyond the abstract idea of determining a subject’s medical status and the accuracy or quality of the medical status. 
	Each claim as a whole does not provide an improvement to technology or technical field under MPEP 2106.05(a), but rather only improves the abstract idea itself. Each claim as a whole amounts to mere instructions to apply the abstract idea to generic computer components (i.e., the processor, the non-transitory CRM, the MAI system, the multi-modal input coordinator module, the orchestrator module comprising the trained LLM, and the task-specific software tools) under MPEP 2106.05(f). The limitations of “the image input comprises one or more medical images comprising one or more radiographs, dermoscopy images, computed tomography scans, pathology images, ultrasound images, endoscopy images, and/or magnetic resonance imaging (MRI) images” in claim 2, “the image input comprises one or more two-dimensional medical images, one or more three-dimensional images, and/or one or more videos” in claim 3, “the image input comprises a single image” in claim 4, “the image input comprises multiple images” in claim 5, “the multiple images comprise multiple views of at least a portion of patient” in claim 6, and “the multiple images comprise a same view of at least a portion of a patient at multiple points in time” in claim 7 are determined to add no more than insignificant extra-solution activity to the judicial exception. These limitations represent the well-known pre-solution activity of necessary data gathering because they represent an activity incidental to the primary process of the claim as a whole (i.e., determining a response to a request for a medical task) and thus those limitations are merely nominal or tangential additions to the claim. 
Each claim as a whole recites general computer components (e.g., the processor, the non-transitory CRM, the MAI system, the multi-modal input coordinator module, the orchestrator module comprising the trained LLM, and the task-specific software tools, the model trained to transform visual tokens into text tokens, the 2D vision encoder, and the 3D vision encoder) at a high level of generality, with their functions claimed in a merely generic manner such that each claim as a whole represents the well‐understood, routine, and conventional functions of a computer system (i.e., having the processor, the non-transitory CRM, the MAI system, the multi-modal input coordinator module, the orchestrator module comprising the trained LLM, and the task-specific software tools, the model trained to transform visual tokens into text tokens, the 2D vision encoder, and the 3D vision encoder) for determining a response to a request for a medical task, i.e., determining a multi-modal output such as a text output and an image output. Evidence that the computer-implemented method of generating one or more prompts to be processed by a content machine-learning model for generating the customized content is a well‐understood, routine, and conventional function is provided by UzZaman (U.S. Pub. No. 2024/0249081 A1).
	Therefore, claims 2 – 18 are rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1 – 8, 15 – 16, and 20 are rejected under 35 U.S.C. 102 as being anticipated by Molenda (U.S. Pub. No. 2023/0368878 A1).
Regarding independent claims 1 and 20, Molenda teaches the limitations of representative claim 20 identified in bold as: 
A system (Abstract and Paragraph [0128] of Molenda. In the instant application, the broadest reasonable interpretation of “a system” reads on the system in Molenda (Abstract and Paragraph [0128]) applying multidimensional language and vision models and maps to categorize, label and track health data, such as morphology (i.e., identifying abnormal structural features via imaging or microscopy) and symptoms and treatment recommendations.) comprising:

at least one computer hardware processor (Paragraphs [0128] – [0129] of Molenda. In the instant application, the broadest reasonable interpretation of “at least one computer hardware processor” reads on the processor in Molenda (Paragraphs [0128] – [0129]).):

at least one non-transitory computer-readable storage medium having encoded thereon instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a trained large language model (LLM) and a plurality of task-specific software tools (Paragraphs [0130], [0427], and [0434] of Molenda. In the instant application, the broadest reasonable interpretation of “at least one non-transitory computer-readable storage medium having encoded thereon instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a trained large language model (LLM) and a plurality of task-specific software tools” reads on the tangible, physical, non-transitory computer readable medium in Molenda (Paragraphs [0130], [0427], and [0434]) with processor-executable instructions stored thereon, with the medium enabling models (e.g., large language models), engines (i.e., a plurality of task-specific software tools), and artificial intelligence.), the method comprising: 

using the at least one computer hardware processor to perform (Paragraphs [0130] – [0131] of Molenda. In the instant application, the broadest reasonable interpretation of “using the at least one computer hardware processor to perform” reads on the processor in Molenda (Paragraphs [0130] – [0131]) executing the processor-executable instructions stored on the medium.):

receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input (Paragraphs [0133] and [0138] of Molenda. In the instant application, the broadest reasonable interpretation of “receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input” reads on the activities in Molenda (Paragraphs [0133] and [0138]) of receiving text, images, visualizations, videos, and/or selected procedure on the GUI.)

processing at least a portion of the multi-modal input using the trained LLM to obtain LLM output, the LLM output indicating that zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools (Paragraphs [0133], [0138], [0303], and [0345] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “processing at least a portion of the multi-modal input using the trained LLM to obtain LLM output, the LLM output indicating that zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools” reads on the activity in Molenda (Paragraphs [0133], [0138], [0303], and [0345] – [0349]) of using a digital shadow chart (e.g., enabled by the input device) and physical markup on a paper shadow chart at least partially responsive to the selected procedure, with the medium enabling models (e.g., large language models) for automation of healthcare record generation, and with the shadow chart including historical information (e.g., diagnoses, treatments, follow-ups needed, e.g., enabled by the generation module).);

when the LLM output indicates that zero tasks are to be additionally performed, outputting at least some of the LLM output as a response to the request (Paragraphs [0133], [0138], [0272], and [0348] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “when the LLM output indicates that zero tasks are to be additionally performed, outputting at least some of the LLM output as a response to the request” reads on the activity in Molenda (Paragraphs [0133], [0138], [0272], and [0348] – [0349]) of outputting digital shadow chart information, when the shadow chart has features from past historical visits (but no deferred diagnosis and no treatments or issues needing follow-up), as a response to a selected procedure.); and

when the LLM output indicates that one or multiple tasks are to be additionally performed processing at least some of the multi-modal input, using the at least one of the plurality of task-specific software tools and the LLM output, to obtain at least one task-specific output (Paragraphs [0133], [0347] – [0349], and [0358] of Molenda. In the instant application, the broadest reasonable interpretation of “when the LLM output indicates that one or multiple tasks are to be additionally performed processing at least some of the multi-modal input, using the at least one of the plurality of task-specific software tools and the LLM output, to obtain at least one task-specific output” reads on the activities in Molenda (Paragraphs [0133], [0347] – [0349], and [0358]) of placing, when the shadow chart identifies issues needing follow-up (e.g., diagnoses, treatments, follow-ups needed), anatomy data and non-anatomy data (i.e., at least one task-specific output) into appropriate places in a digital shadow chart (i.e., the LLM output), with the digital shadow chart being directly printed into a paper shadow chart or modified in electronic form with different selections, filters, and interaction modifiers (i.e., at least one of the plurality of task-specific software tools).); 

outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request (Paragraphs [0133] and [0345] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “outputting a response generated using the at least some of the LLM output and the at least one task-specific output as a response to the request” reads on the activity in Molenda (Paragraphs [0138], [0303], and [0345] – [0349]) of generating a healthcare record using a digital shadow chart and physical markup on a paper shadow chart after the fact with a photographic or scanned capture of the marked-up paper shadow chart, as a response to user input (e.g., a selected procedure).).
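
For illustration only, the conditional flow recited in representative claim 20 (zero tasks versus one or multiple tasks to be additionally performed) can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Molenda or from the specification of the instant application; the names `LLMOutput`, `requested_tasks`, `respond`, and the tool registry are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class LLMOutput(NamedTuple):
    # Hypothetical container for the "LLM output" of the claim.
    text: str                   # textual portion of the LLM output
    requested_tasks: List[str]  # tasks to be additionally performed (may be empty)


# Hypothetical signature for a "task-specific software tool":
# it receives the multi-modal input and the LLM output and returns its own output.
Tool = Callable[[dict, LLMOutput], str]


def respond(llm_output: LLMOutput, multimodal_input: dict,
            tools: Dict[str, Tool]) -> dict:
    """Return a response following the zero / one-or-multiple branches of the claim."""
    if not llm_output.requested_tasks:
        # Zero tasks: output at least some of the LLM output as the response.
        return {"text": llm_output.text, "task_outputs": []}
    # One or multiple tasks: run each indicated tool on the multi-modal input.
    task_outputs = [tools[name](multimodal_input, llm_output)
                    for name in llm_output.requested_tasks]
    # Response generated from both the LLM output and the task-specific output(s).
    return {"text": llm_output.text, "task_outputs": task_outputs}
```

In this sketch the branch taken depends only on whether the LLM output names any task, which is the feature against which the cited shadow-chart passages are read.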

Regarding claim 2, Molenda as applied to claim 1 teaches the limitation identified in bold as “the image input comprises one or more medical images comprising one or more radiographs, dermoscopy images, computed tomography scans, pathology images, ultrasound images, endoscopy images, and/or magnetic resonance imaging (MRI) images” (Paragraphs [0162] and [0169] of Molenda.  In the instant application, the broadest reasonable interpretation of “the image input comprises one or more medical images comprising one or more radiographs, dermoscopy images, computed tomography scans, pathology images, ultrasound images, endoscopy images, and/or magnetic resonance imaging (MRI) images” reads on the input in Molenda (Paragraphs [0162] and [0169]) enabled by physical input devices including scans/images in any form of multimedia including but not limited to illustrations, photos, videos, digital images, and/or 3D scans (e.g., medical imaging, such as X-rays, CT scans, MRIs, or Ultrasounds) and dermoscopy scans.);

Regarding claim 3, Molenda as applied to claim 1 teaches the limitation identified in bold as “the image input comprises one or more two-dimensional medical images, one or more three-dimensional images, and/or one or more videos” (Paragraphs [0162] and [0169] of Molenda.  In the instant application, the broadest reasonable interpretation of “the image input comprises one or more two-dimensional medical images, one or more three-dimensional images, and/or one or more videos” reads on the anatomic maps and/or visualizations in Molenda (Paragraphs [0162] and [0169]) including actual patient multimedia and can be two- or three-dimensional.).

Regarding claim 4, Molenda as applied to claim 1 teaches the limitation identified in bold as “the image input comprises a single image” (Paragraph [0242] of Molenda.  In the instant application, the broadest reasonable interpretation of “the image input comprises a single image” reads on the single picture or scan in Molenda (Paragraph [0242]).);

Regarding claim 5, Molenda as applied to claim 1 teaches the limitation identified in bold as “the image input comprises multiple images” (Paragraphs [0162] and [0169] of Molenda.  In the instant application, the broadest reasonable interpretation of “the image input comprises multiple images” reads on the input in Molenda (Paragraphs [0162] and [0169]) enabled by physical input devices including scans/images in any form of multimedia including but not limited to illustrations, photos, videos, digital images, and/or 3D scans (e.g., medical imaging, such as X-rays, CT scans, MRIs, or Ultrasounds).);

Regarding claim 6, Molenda as applied to claim 5 teaches the limitation identified in bold as “the multiple images comprise multiple views of at least a portion of patient” (Paragraph [0180] of Molenda.  In the instant application, the broadest reasonable interpretation of “the multiple images comprise multiple views of at least a portion of patient” reads on the multiple views in Molenda (Paragraph [0180]) of the face at different angles.);

Regarding claim 7, Molenda as applied to claim 5 teaches the limitation identified in bold as “the multiple images comprise a same view of at least a portion of a patient at multiple points in time” (Paragraph [0372] of Molenda.  In the instant application, the broadest reasonable interpretation of “the multiple images comprise a same view of at least a portion of a patient at multiple points in time” reads on the x-rays and imaging in Molenda (Paragraph [0372]) on the location of a fracture from different time points.);

Regarding claim 8, Molenda as applied to claim 1 teaches the limitation identified in bold as “the response comprises a multi-modal output” (Paragraphs [0367] and [0369] of Molenda.  In the instant application, the broadest reasonable interpretation of “the response comprises a multi-modal output” reads on the report in Molenda (Paragraphs [0367] and [0369]) including x-rays, CT scans, MRIs, ultrasounds, PET scans, and their variants.);

Regarding claim 15, Molenda as applied to claim 1 teaches the limitation identified in bold as “determining whether the LLM output indicates zero, one, or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools” (Paragraphs [0133], [0138],  [0303], and [0345] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “determining whether the LLM output indicates zero, one, or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools” reads on the activity in Molenda (Paragraphs [0133], [0138],  [0303], and [0345] – [0349]) of outputting digital shadow chart information, when the shadow chart has features from past historical visits (but no deferred diagnosis and no treatments or issues needing follow-up, e.g., enabled by the generation module).);

Regarding claim 16, Molenda as applied to claim 15 teaches the limitation identified in bold as “determining whether the LLM output indicates zero, one, or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools comprises determining whether the LLM output comprises at least one tag associated with at least one respective task” (Paragraphs [0133], [0347] – [0349],  and [0358] of Molenda. In the instant application, the broadest reasonable interpretation of “determining whether the LLM output indicates zero, one, or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools comprises determining whether the LLM output comprises at least one tag associated with at least one respective task” reads on the activities in Molenda (Paragraphs [0133], [0347] – [0349], and [0358]) of outputting digital shadow chart information, when the shadow chart has issues needing follow-up (e.g., diagnoses, treatments, follow-ups needed).);
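
For illustration only, the determination recited in claims 15 and 16 (whether the LLM output comprises at least one tag associated with a respective task) can be sketched as a scan of the LLM's text for tags. This is a minimal sketch; the tag syntax `<task:name>` and the function name `tasks_from_tags` are assumptions made for the example and do not appear in the claims or in Molenda.

```python
import re
from typing import List

# Hypothetical tag format: "<task:segment_lesion>"; the claims do not specify one.
TAG_PATTERN = re.compile(r"<task:([a-z_]+)>")


def tasks_from_tags(llm_text: str) -> List[str]:
    """Return the task names tagged in the LLM output, in order of appearance.

    An empty list corresponds to "zero tasks"; one or more entries correspond
    to "one or multiple tasks" to be additionally performed.
    """
    return TAG_PATTERN.findall(llm_text)
```

Under this assumed format, an output with no tag yields an empty list, and each tag found maps to one task-specific tool to invoke.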

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

	The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

	This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 9 – 13 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Molenda in view of Natarajan (U.S. Pub. No. 2025/0232872 A1).

Regarding claim 9, Molenda as modified by Natarajan and applied to claim 8 teaches the limitation identified in bold as “the multi-modal output comprises a text output and an image output” (Paragraph [0095] of Natarajan.  In the instant application, the broadest reasonable interpretation of “the multi-modal output comprises a text output and an image output” reads on the output data in Natarajan (Paragraph [0095]) including multiple modalities of data, such as image data (e.g., images and associated descriptions) and text data and audio data (e.g., audio tracks and associated descriptions).).
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda such that the multi-modal output comprises a text output and an image output, as taught by Natarajan (Paragraph [0095]), in order to be used by medical professionals or by patients to answer questions, identify treatment plans, create checklists, generate or populate documents or document templates, etc. (Paragraph [0067] of Natarajan).

Regarding claim 10, Molenda as modified by Natarajan and applied to claim 9 teaches the limitation identified in bold as “the LLM output comprises the text output and the at least one task-specific output comprises the image output” (Paragraph [0095] of Natarajan.  In the instant application, the broadest reasonable interpretation of “the LLM output comprises the text output and the at least one task-specific output comprises the image output” reads on the text data and audio data (e.g., audio tracks and associated descriptions) and image data (e.g., images and associated descriptions) in Natarajan (Paragraph [0095]).).

Regarding claim 11, Molenda as applied to claim 1 does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “the at least one medical task comprises one or more vision-language tasks comprising one or more of medical report generation, longitudinal study comparison, region-of-interest captioning, open-ended visual question answering, and/or abnormality (e.g., skin lesion) classification, and/or one or more vision-centric tasks comprising one or more medical image analysis tasks including one or more of anatomical structure identification, abnormality characterization, chest abnormality detection, lesion segmentation, and/or organ segmentation” (Paragraph [0017] of Natarajan.  In the instant application, the broadest reasonable interpretation of “the at least one medical task comprises one or more vision-language tasks comprising one or more of medical report generation, longitudinal study comparison, region-of-interest captioning, open-ended visual question answering, and/or abnormality (e.g., skin lesion) classification, and/or one or more vision-centric tasks comprising one or more medical image analysis tasks including one or more of anatomical structure identification, abnormality characterization, chest abnormality detection, lesion segmentation, and/or organ segmentation” reads on the one or more of the tasks in Natarajan (Paragraph [0017]) including question answering; report summarization; visual question answering; report generation; and image classification.);
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda to implement the at least one medical task comprising one or more vision-language tasks comprising one or more of medical report generation, longitudinal study comparison, region-of-interest captioning, open-ended visual question answering, and/or abnormality (e.g., skin lesion) classification, and/or one or more vision-centric tasks comprising one or more medical image analysis tasks including one or more of anatomical structure identification, abnormality characterization, chest abnormality detection, lesion segmentation, and/or organ segmentation, as taught by Natarajan (Paragraph [0017]), in order to be used by medical professionals or by patients to answer questions, identify treatment plans, create checklists, generate or populate documents or document templates, etc. (Paragraph [0067] of Natarajan).

Regarding claim 12, Molenda as applied to claim 1 does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “processing the multi-modal input to obtain a tokenized representation of the image input and the  text input  at least in part by processing the text input to generate text tokens and processing the image input to generate visual tokens, and wherein the at least a portion of the  multi-modal output  comprises the tokenized representation of the image input and the text input” (Paragraphs [0101], [0464], [0550] – [0551], and [0612] of Natarajan.  In the instant application, the broadest reasonable interpretation of “processing the multi-modal input to obtain a tokenized representation of the image input and the  text input  at least in part by processing the text input to generate text tokens and processing the image input to generate visual tokens” reads on the activities in Natarajan (Paragraphs [0101], [0464], [0550] – [0551], and [0612]) embedding data from the query (e.g., projecting data elements from the query input into a latent space, such as by generating a vector embedding characterized by the dimensions of the latent space), with the query input including query instruction data from a first modality and query context data from a second modality. Query context data can be in a different modality from query instruction data. For instance, query instruction data can be in a text or audio modality (e.g., corresponding to a natural language instruction). Query context data can be in an image, video, audio, text, waveform, sensor time history, etc. modality. 
The broadest reasonable interpretation of “wherein the at least a portion of the multi-modal output comprises the tokenized representation of the image input and the text input” reads on the activities in Natarajan (Paragraph [0537]) of processing a given portion of an input source and outputting a series of tokens (e.g., tokenizing textual input source(s) and tokenizing image-based input source(s) by extracting and serializing patches from an image input).);
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda to include the activity of processing the multi-modal input to obtain a tokenized representation of the image input and the  text input  at least in part by processing the text input to generate text tokens and processing the image input to generate visual tokens, and wherein the at least a portion of the  multi-modal output  comprises the tokenized representation of the image input and the text input, as taught by Natarajan (Paragraphs [0101], [0464], [0550] – [0551], and [0612]), in order to be used by medical professionals or by patients to answer questions, identify treatment plans, create checklists, generate or populate documents or document templates, etc. (Paragraph [0067] of Natarajan).

Regarding claim 13, Molenda as modified by Natarajan and applied to claim 12 teaches the limitation identified in bold as “processing the multi-modal input to obtain the tokenized representation of the image input and the text input further comprises adapting the visual tokens into text tokens using a model trained to transform visual tokens into text tokens” (Paragraphs [0543], [0597], and [0612] of Natarajan.  In the instant application, the broadest reasonable interpretation of “processing the multi-modal input to obtain the tokenized representation of the image input and the text input further comprises adapting the visual tokens into text tokens using a model trained to transform visual tokens into text tokens” reads on the activity in Natarajan (Paragraphs [0543], [0597], and [0612]) of executing machine-learned model(s) to generate an embedding for input data (e.g. input audio or visual data, with the input including audio data representing a spoken utterance) and performing a speech recognition task to output a text output which is mapped to the spoken utterance.);

Regarding independent claim 19, Molenda teaches the limitations identified in bold as: 
A method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a plurality of modules including a multi-modal input coordinator module, an orchestrator module comprising a trained large language model (LLM), and a plurality of task-specific software tools (Paragraphs [0012], [0125], [0130] – [0132], [0135], [0215], and [0427] of Molenda. In the instant application, the broadest reasonable interpretation of “a method for performing medical tasks using a medical artificial intelligence (MAI) system, the MAI system comprising a plurality of modules including a multi-modal input coordinator module, an orchestrator module comprising a trained large language model (LLM), and a plurality of task-specific software tools” reads on the method in Molenda (Paragraphs [0012], [0125], [0130] – [0132], [0135], [0215], and [0427]) for generating a medical record using a system enabling artificial intelligence, with the system including the database processing module configured to process and convert data received from the input device (e.g., a large language model), and the digital shadow chart being directly printed into a paper shadow chart or modified in electronic form with different selections, filters, and interaction modifiers (i.e., at least one of the plurality of task-specific software tools).), the method comprising:

executing the multi-modal input coordinator module, using at least one computer hardware processor, to perform: receiving multi-modal input comprising image input and text input indicating a request that the MAI system perform at least one medical task on the multi-modal input (Paragraphs [0133], [0135], and [0138] of Molenda. In the instant application, the broadest reasonable interpretation of “receiving multi-modal input comprising image input and a request that the MAI system perform at least one medical task on the multi-modal input” reads on the activity in Molenda (Paragraphs [0133], [0135], and [0138]) of executing, using the processor, the database processing module to receive data (e.g., text, images, illustrations, videos, and/or the selected procedure, etc.).); 

processing the multi-modal input to obtain  a tokenized representation of the image input and the text input; 

executing the orchestrator module, using the at least one computer hardware processor, to perform (Paragraph [0135] of Molenda. In the instant application, the broadest reasonable interpretation of “executing the orchestrator module, using the at least one computer hardware processor, to perform” reads on the activities in Molenda (Paragraph [0135]) of processing and converting, using the processor, data (e.g., text, images, illustrations, videos, and/or the selected procedure, etc.) received from the input device.): 

processing the tokenized representation using the trained LLM to obtain LLM output at least partially responsive to the request, the LLM output  comprising latent embeddings and textual output, the LLM output indicating zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools (Paragraphs [0138] and [0345] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “processing at least a portion of the multi-modal input using the trained LLM to obtain LLM output, the LLM output indicating that zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools” reads on the activity in Molenda (Paragraphs [0138], [0303], and [0345] – [0349]) of using models (e.g., large language models) stored on the medium for automation of healthcare record generation responsive to the selected procedure, with the system comprising the shadow chart including historical information (e.g., diagnoses, treatments, follow-ups needed, e.g., enabled by the generation module).); 

when the LLM output indicates that zero tasks are to be additionally performed by the at least one of the plurality of task-specific software tools, outputting the textual output as a response to the request of the MAI system (Paragraphs [0130], [0133], [0138], [0272], [0348], [0427], and [0434] of Molenda. In the instant application, the broadest reasonable interpretation of “when the LLM output indicates that zero tasks are to be additionally performed by the at least one of the plurality of task-specific software tools, outputting the textual output as a response to the request of the MAI system” reads on the activity in Molenda (Paragraphs [0130], [0133], [0138], [0272], [0348], [0427], and [0434]) of outputting digital shadow chart information, when the shadow chart has features from past historical visits (but no deferred diagnosis and no treatments or issues needing follow-up), as a response to a selected procedure, with the medium enabling models (e.g., large language models), engines (i.e., a plurality of task-specific software tools), and artificial intelligence.); 

when the LLM output indicates that one or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools identifying, based on the LLM output and from among the plurality of task-specific software tools, a first task-specific software tool (Paragraph [0348] of Molenda. In the instant application, the broadest reasonable interpretation of “when the LLM output indicates that one or multiple tasks are to be additionally performed …, based on the LLM output” reads on the shadow chart in Molenda (Paragraph [0348]) having deferred diagnosis or treatments or issues needing follow-up.);

generating, from the latent embeddings and the multi-modal input, first input for the first task-specific software tool and processing the first input with the first task-specific software tool to obtain a first task-specific output; 

generating an integrated response to the request of the MAI system using the textual output produced by the trained LLM and the first task-specific output generated by the first task-specific software tool (Paragraphs [0133] and [0138] of Molenda. In the instant application, the broadest reasonable interpretation of “generating an integrated response to the request of the MAI system” reads on the shadow chart in Molenda (Paragraphs [0133] and [0138]), which can have features from deferred diagnosis or treatments or issues needing follow-up.);

outputting the integrated response as a response to the request of the MAI system (Paragraphs [0133] and [0345] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “outputting the integrated response as a response to the request of the MAI system” reads on the activity in Molenda (Paragraphs [0138], [0303], and [0345] – [0349]) of generating a healthcare record as a response to user input (e.g., a selected procedure).).

	Molenda does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “processing the multi-modal input to obtain a tokenized representation of the image input and the text input” (Paragraph [0082] of Natarajan. In the instant application, the broadest reasonable interpretation of “processing the multi-modal input to obtain a tokenized representation of the image input and the text input” reads on the activity in Natarajan (Paragraph [0082]) of processing, by tokenizers, combined input (including a natural language string), to generate the series of tokens representing information embedded into a latent space.).
	Molenda does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “processing the tokenized representation using the trained LLM to obtain LLM output at least partially responsive to the request, the LLM output comprising latent embeddings and textual output, the LLM output indicating zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools” (Paragraphs [0597] and [0606] – [0607] of Natarajan. In the instant application, the broadest reasonable interpretation of “processing the tokenized representation …, the LLM output comprising latent embeddings and textual output” reads on the activity in Natarajan (Paragraphs [0597] and [0606] – [0607]) of processing the natural language data to generate a latent text embedding output.).
	Molenda does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “when the LLM output indicates that one or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools identifying, based on the LLM output and from among the plurality of task-specific software tools, a first task-specific software tool” (Paragraphs [0113] – [0114] of Natarajan. In the instant application, the broadest reasonable interpretation of “the at least one of the plurality of task-specific software tools” reads on the tool repository in Natarajan (Paragraphs [0113] – [0114]) including a registry of available tools for the machine-learned model to use when processing a query. The broadest reasonable interpretation of “identifying, … from among the plurality of task-specific software tools, a first task-specific software tool” reads on the activities in Natarajan (Paragraphs [0114] – [0116]) of selecting an appropriate tool for performing a task, with example tools including tools for database lookups, internet searches, media processing/generation (e.g., image, video, audio, etc.), machine interfaces (e.g., sensor interfaces, test device interfaces, interfaces with other computing systems, etc.).).
	Molenda does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “generating, from the latent embeddings and the multi-modal input, first input for the first task-specific software tool and processing the first input with the first task-specific software tool to obtain a first task-specific output” (Paragraphs [0081] – [0085], [0090], and [0335] of Natarajan. In the instant application, the broadest reasonable interpretation of “generating, from the latent embeddings and the multi-modal input, first input for the first task-specific software tool” reads on the activities in Natarajan (Paragraphs [0081] – [0085], [0090], and [0335]) of processing, by one or more tokenizers, combined input including natural language and generating, from embedding data from the query (e.g., projecting data elements from the query into a latent space, such as by generating a vector embedding), the query input including query instruction data (from a first modality) and query context data (from a second modality). The broadest reasonable interpretation of “processing the first input with the first task-specific software tool to obtain a first task-specific output” reads on the activities in Natarajan (Paragraphs [0114] – [0115]) of adding a tool pointer to combined input to bias the machine-learned model toward using a particular tool, with examples of the tools including tools for database lookups, internet searches, media processing/generation (e.g., image, video, audio, etc.), machine interfaces (e.g., sensor interfaces, test device interfaces, interfaces with other computing systems, etc.). For instance, a database lookup or internet search tool can be used by the system to retrieve citations for information in output.).
	Molenda does not appear to explicitly disclose, but Natarajan teaches the limitation identified in bold as “generating an integrated response to the request of the MAI system using the textual output produced by the trained LLM and the first task-specific output generated by the first task-specific software tool” (Paragraphs [0095], [0114], and [0390] of Natarajan. In the instant application, the broadest reasonable interpretation of “using the textual output produced by the trained LLM and the first task-specific output generated by the first task-specific software tool” reads on the output of data (e.g., text data) in Natarajan (Paragraphs [0095], [0114], and [0390]) of the model (e.g., PaLM being a densely-connected decoder-only Transformer based large language model (LLM) trained using Pathways) to input to the selected tool, including instructions for the tool to perform or queries to obtain data from the tool.).
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda to: include the activity of processing the multi-modal input to obtain  a tokenized representation of the image input and the text input, include the activity of processing the tokenized representation using the trained LLM to obtain LLM output at least partially responsive to the request, the LLM output comprising latent embeddings and textual output, the LLM output indicating zero, one, or multiple tasks are to be additionally performed by at least one of the plurality of task-specific software tools, include the activity, when the LLM output indicates that one or multiple tasks are to be additionally performed by the at least one of the plurality of task-specific software tools, of identifying, based on the LLM output and from among the plurality of task-specific software tools, a first task-specific software tool, and include the activity of generating, from the latent embeddings and the multi-modal input, first input for the first task-specific software tool and processing the first input with the first task-specific software tool to obtain a first task-specific output, as taught by Natarajan (Paragraphs [0081] – [0085], [0090], [0095], [0113] – [0114], [0335], [0390], [0597], and [0606] – [0607]), in order to be used by medical professionals or by patients to answer questions, identify treatment plans, create checklists, generate or populate documents or document templates, etc. (Paragraph [0067] of Natarajan).
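
For readers tracing the mapping above, the control flow recited in the claim 12 limitations (tokenize the input, run the LLM, then either return the textual output or route to a task-specific tool and integrate its output) can be sketched roughly as follows. This is an illustrative sketch only. It is not taken from the application, Molenda, or Natarajan, and every name in it (`LLMOutput`, `tokenize`, `stub_llm`, `orchestrate`, the `"segment"` task) is a hypothetical stand-in; the tokenizer and LLM are toy stubs, not real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LLMOutput:
    """Hypothetical container for the claimed LLM output."""
    text: str                      # textual output
    embeddings: List[float]        # latent embeddings
    tasks: List[str] = field(default_factory=list)  # zero, one, or multiple tasks


def tokenize(image: List[List[int]], text: str) -> List[str]:
    """Toy tokenizer: one token per image row, plus lowercased words of the text."""
    image_tokens = [f"<img:{sum(row)}>" for row in image]
    return image_tokens + text.lower().split()


def stub_llm(tokens: List[str]) -> LLMOutput:
    """Stand-in for a trained LLM: requests a tool only when asked to 'segment'."""
    tasks = ["segment"] if "segment" in tokens else []
    return LLMOutput(text="Preliminary findings.",
                     embeddings=[float(len(tokens))],
                     tasks=tasks)


def orchestrate(image: List[List[int]], text: str,
                llm: Callable[[List[str]], LLMOutput],
                tools: Dict[str, Callable[[dict], str]]) -> str:
    tokens = tokenize(image, text)
    out = llm(tokens)
    if not out.tasks:
        # Zero additional tasks: the textual output is the response.
        return out.text
    # One or more tasks: identify the first tool named by the LLM output.
    tool = tools[out.tasks[0]]
    # Build the tool input from the latent embeddings and the original input.
    tool_output = tool({"embeddings": out.embeddings, "image": image})
    # Integrated response: LLM text plus the task-specific output.
    return f"{out.text} {tool_output}"


# Hypothetical tool registry with a single stub tool.
tools = {"segment": lambda x: f"Segmented {len(x['image'])} rows."}
print(orchestrate([[1, 2], [3, 4]], "Describe the scan", stub_llm, tools))
print(orchestrate([[1, 2], [3, 4]], "Segment the lesion", stub_llm, tools))
```

The two calls exercise the two claimed branches: the first returns the LLM text alone, the second returns the LLM text combined with the stub tool's output.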

Claim 14 is rejected under 35 U.S.C. 103(a) as being unpatentable over Molenda as modified by Natarajan and applied to claim 12, and further in view of Han (U.S. Pub. No. 2019/0030371 A1).
Regarding claim 14, Molenda as modified by Natarajan and applied to claim 12 does not appear to disclose, but Han teaches the limitation identified in bold as “when the image input comprises a two-dimensional (2D) image, the processing the image input comprises processing the 2D image using a 2D vision encoder; and when the image input comprises a three-dimensional (3D) image, the processing the image input comprises processing the 3D image using a 3D vision encoder” (Paragraphs [0079] and [0082] of Han. In the instant application, the broadest reasonable interpretation of “when the image input comprises a two-dimensional (2D) image, the processing the image input comprises processing the 2D image using a 2D vision encoder” reads on the activities in Han (Paragraph [0079]) of processing the 2D images using a fully convolutional network (FCN) for predicting a 2D label map corresponding to a middle image of an input stack of adjacent 2D images (when the image input includes 2D images), with the encoding portion in the CNN model being used for extracting an activation map or a feature map as an output. The broadest reasonable interpretation of “when the image input comprises a three-dimensional (3D) image, the processing the image input comprises processing the 3D image using a 3D vision encoder” reads on the activities in Han (Paragraph [0082]) of using the trained CNN model to predict the anatomical structure of each voxel of an input 3D image or label each voxel of an input 3D medical image to an anatomical structure (when the image input includes 3D images).);
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda as modified by Natarajan such that when the image input comprises a two-dimensional (2D) image, the processing the image input comprises processing the 2D image using a 2D vision encoder; and when the image input comprises a three-dimensional (3D) image, the processing the image input comprises processing the 3D image using a 3D vision encoder, as taught by Han (Paragraphs [0079] and [0082]), in order to provide fully automated segmentation of x-ray computed tomography (CT) or magnetic resonance (MR) images and provide an ability to incorporate prior anatomical information about structure shapes and their geometric relationships (Paragraph [0003] of Han).
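
The conditional recited in claim 14 (a 2D image goes to a 2D vision encoder, a 3D image goes to a 3D vision encoder) amounts to a dispatch on image dimensionality. A minimal sketch of that dispatch follows, assuming images are represented as nested Python lists. The function names and the mean-intensity "encoders" are hypothetical placeholders and do not reflect the encoders of the application or the CNN of Han.

```python
from typing import Any, List


def image_rank(image: Any) -> int:
    """Nesting depth of a list-of-lists image: 2 for a 2D image, 3 for a 3D volume."""
    rank = 0
    while isinstance(image, list):
        rank += 1
        image = image[0] if image else None
    return rank


def encode_2d(image: List[List[float]]) -> List[float]:
    """Hypothetical 2D encoder stub: mean intensity of each row."""
    return [sum(row) / len(row) for row in image]


def encode_3d(volume: List[List[List[float]]]) -> List[float]:
    """Hypothetical 3D encoder stub: mean intensity of each slice."""
    return [
        sum(sum(row) for row in sl) / sum(len(row) for row in sl)
        for sl in volume
    ]


def encode_image(image: Any) -> List[float]:
    """Route the image to the encoder that matches its dimensionality."""
    rank = image_rank(image)
    if rank == 2:
        return encode_2d(image)
    if rank == 3:
        return encode_3d(image)
    raise ValueError(f"unsupported image rank: {rank}")


print(encode_image([[1, 3], [5, 7]]))
print(encode_image([[[1, 3], [5, 7]], [[2, 2], [2, 2]]]))
```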

Claim 17 is rejected under 35 U.S.C. 103(a) as being unpatentable over Molenda as applied to claim 16, and further in view of Zaremoodi (U.S. Pub. No. 2023/0098783 A1).
Regarding claim 17, Molenda as applied to claim 16 does not appear to explicitly disclose, but Zaremoodi teaches the limitation identified in bold as “identifying the at least one of the plurality of task-specific software tools using the at least one tag, wherein the at least one tag comprises a first tag associated with a first task, and the at least one of the plurality of task-specific software tools is trained to perform the first task” (Paragraphs [0118] – [0119] of Zaremoodi. In the instant application, the broadest reasonable interpretation of “identifying the at least one of the plurality of task-specific software tools using the at least one tag, wherein the at least one tag comprises a first tag associated with a first task, and the at least one of the plurality of task-specific software tools is trained to perform the first task” reads on the activities in Zaremoodi (Paragraphs [0118] – [0119]) of providing the intermediate language model examples labeled for the auxiliary task (e.g., labels for positive, negative, or neutral sentiment) and optimizing the parameters inside the intermediate language model to output the correct prediction or inference for the auxiliary task (e.g., a correct sentence-level sentiment classification). The selection of the auxiliary task for the behavior focusing may be implemented via hypertuning of the sequential focusing framework.);
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda to: include the activity of identifying the at least one of the plurality of task-specific software tools using the at least one tag, wherein the at least one tag comprises a first tag associated with a first task, and the at least one of the plurality of task-specific software tools is trained to perform the first task, as taught by Zaremoodi (Paragraphs [0118] – [0119]), in order to facilitate improved performance by a model that uses output from the focused model on a downstream task such as classifying intents in patient health related questions (Paragraph [0032] of Zaremoodi).
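
The tag-based identification recited in claim 17 can be pictured as a lookup from tags to tools. The sketch below assumes, purely for illustration, that tags appear as bracketed markers such as `<DET>` in the LLM's textual output and that each tag keys a registry entry. Claim 16 is not reproduced in this section, so the tag format, the registry, and the tool stubs are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry: each tag is associated with one task-specific tool stub.
TOOL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "<SEG>": lambda payload: f"segmentation of {payload}",
    "<DET>": lambda payload: f"detection on {payload}",
}

# Assumed tag syntax: uppercase letters between angle brackets.
TAG_PATTERN = re.compile(r"<[A-Z]+>")


def find_tags(llm_text: str) -> List[str]:
    """Return registered tags in order of appearance, ignoring unknown ones."""
    return [tag for tag in TAG_PATTERN.findall(llm_text) if tag in TOOL_REGISTRY]


def run_tagged_tools(llm_text: str, payload: str) -> List[str]:
    """Run every tool whose tag appears in the LLM text on the given payload."""
    return [TOOL_REGISTRY[tag](payload) for tag in find_tags(llm_text)]


print(find_tags("Possible nodule. <DET>"))
print(run_tagged_tools("<SEG> then <DET> <XYZ>", "chest x-ray"))
```

An LLM output with no registered tag yields an empty list, which corresponds to the zero-task branch of the independent claims.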

Claim 18 is rejected under 35 U.S.C. 103(a) as being unpatentable over Molenda as applied to claim 1, and further in view of Han and Seah (U.S. Pub. No. 2023/0274420 A1).
Regarding claim 18, Molenda as applied to claim 1 teaches the limitation identified in bold as:
the image input comprises a two-dimensional image;

the LLM output indicates that a detection task is to be additionally performed by the at least one of the plurality of task-specific software tools (Paragraphs [0133], [0138], and [0345] – [0349] of Molenda. In the instant application, the broadest reasonable interpretation of “the LLM output indicates that a detection task is to be additionally performed by the at least one of the plurality of task-specific software tools” reads on the digital shadow chart in Molenda (Paragraphs [0133], [0138], [0303], and [0345] – [0349]) (e.g., enabled by the input device) and physical markup on a paper shadow chart at least partially responsive to the selected procedure, with the shadow chart having diagnoses and follow-ups needed, e.g., enabled by the generation module.);

the at least one of the plurality of task-specific software tools comprises a task-specific software tool trained to perform the detection task on the two-dimensional image;

the at least one task-specific output comprises a second textual output, wherein obtaining the at least one task-specific output comprises generating the second textual output (Paragraph [0259] of Molenda. In the instant application, the broadest reasonable interpretation of “the at least one task-specific output comprises a second textual output, wherein obtaining the at least one task-specific output comprises generating the second textual output” reads on the description in Molenda (Paragraph [0259]) of an anatomic site (such as a code string, linguistic description in any language, or a symbolic description such as an ear emoji to represent the ear).); and

the response comprises the LLM output and the second textual output, wherein outputting the response comprises outputting the LLM output and the second textual output (Paragraph [0358] of Molenda. In the instant application, the broadest reasonable interpretation of “the LLM output indicates that a detection task is to be additionally performed by the at least one of the plurality of task-specific software tools” reads on the anatomy data and/or non-anatomy data in Molenda (Paragraph [0358]) incorporated into an electronic medical record or the digital shadow chart.).

	Molenda as applied to claim 1 does not appear to explicitly disclose, but Han teaches the limitation identified in bold as “the image input comprises a two-dimensional image” (Paragraphs [0079] and [0082] of Han. In the instant application, the broadest reasonable interpretation of “the image input comprises a two-dimensional image” reads on the 2D images in Han (Paragraph [0079]) used in the fully convolutional network (FCN) for predicting a 2D label map corresponding to a middle image of an input stack of adjacent 2D images.).
	Molenda as applied to claim 1 does not appear to explicitly disclose, but Seah teaches the limitation identified in bold as “the at least one of the plurality of task-specific software tools comprises a task-specific software tool trained to perform the detection task on the two-dimensional image” (Paragraphs [0084] and [0090] – [0094] of Seah. In the instant application, the broadest reasonable interpretation of “the at least one of the plurality of task-specific software tools comprises a task-specific software tool trained to perform the detection task” reads on the hybrid AI model in Seah (Paragraphs [0084] and [0090] – [0094]) comprising an image processing component (typically a CNN) and a language processing component (typically an NLP model, preferably a transformer-based model) trained to perform image analysis tasks.). 
	Molenda as applied to claim 1 does not appear to explicitly disclose, but Han teaches the limitation identified in bold as “the at least one of the plurality of task-specific software tools comprises a task-specific software tool trained to perform the detection task on the two-dimensional image” (Paragraph [0079] of Han. In the instant application, the broadest reasonable interpretation of “the two-dimensional image” reads on one of the 2D images in Han (Paragraph [0079]).).
	Therefore, it would have been obvious to one of ordinary skill in the art of medical data mining at the time of filing to modify the method of Molenda to implement the image input comprising a two-dimensional image, as taught by Han (Paragraphs [0079] and [0082]), in order to provide specific, enhanced, reproducible, translated, visualized, dynamic, descriptive, and encoded anatomic sites that facilitate improved communication, documentation, tracking, understanding, and descriptions of anatomic sites or regions affected by diseases, treatments, symptoms, and morphologies across different systems and languages (Paragraph [0164] of Han); and to implement the at least one of the plurality of task-specific software tools comprising a task-specific software tool trained to perform the detection task on the two-dimensional image, as taught by Seah (Paragraphs [0084] and [0090] – [0094]), in order to attend to regions of the medical image that were responsible for generating specific words and to pay attention to these words in the medical report and this section of the image, thereby improving explainability of the system and the predictions generated by the model (Paragraph [0089] of Seah).

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT CAESAR ILAGAN whose telephone number is (703) 756-1639. The examiner can normally be reached Monday - Friday 8:30 am - 6:00 pm.
	Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
	If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jason B. Dunham, can be reached on (571) 272-8109. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
	Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/V.C.I./Examiner, Art Unit 3686 

/DEVIN C HEIN/Examiner, Art Unit 3686
Prosecution Timeline

Feb 19, 2025
Application Filed
May 15, 2026
Non-Final Rejection mailed — §101, §102, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/223,006
Patent 12626820
MODERATED COMMUNICATION SYSTEM FOR INFERTILITY TREATMENT
2y 10m to grant Granted May 12, 2026
17/877,767
Patent 12548645
COMPUTER ARCHITECTURE FOR IDENTIFYING LINES OF THERAPY
3y 6m to grant Granted Feb 10, 2026
Study what changed to get past this examiner. Based on 2 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 42%
With Interview (+63.6%): 99%
Median Time to Grant: 2y 8m (~1y 5m remaining)
PTA Risk: Low
Based on 12 resolved cases by this examiner. Grant probability derived from career allowance rate.