Prosecution Insights
Last updated: April 19, 2026
Application No. 18/332,989

METHOD, APPARATUS, AND SYSTEM FOR MULTI-MODAL MULTI-TASK PROCESSING

Non-Final OA — §101, §102

Filed: Jun 12, 2023
Examiner: NGUYEN, VAN H
Art Unit: 2199
Tech Center: 2100 — Computer Architecture & Software
Assignee: Alibaba Damo (Hangzhou) Technology Co., Ltd.
OA Round: 1 (Non-Final)
Grant Probability: 89% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 4m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 89% (759 granted / 851 resolved), above average; +34.2% vs TC avg
Interview Lift: +18.4% higher allowance among resolved cases with an interview than without
Typical Timeline: 3y 4m average prosecution; 18 applications currently pending
Career History: 869 total applications across all art units

Statute-Specific Performance

§101: 23.1% (-16.9% vs TC avg)
§103: 24.0% (-16.0% vs TC avg)
§102: 27.2% (-12.8% vs TC avg)
§112: 10.9% (-29.1% vs TC avg)
Black line = Tech Center average estimate • Based on career data from 851 resolved cases
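The examiner metrics above reduce to simple arithmetic. As a minimal sketch of how such dashboard figures could be derived (the vendor's exact methodology is not stated, so the ratio and percentage-point formulas here are assumptions, and the 55.0% Tech Center average is a hypothetical value chosen only to illustrate the delta):

```python
# Illustrative derivation of the dashboard's examiner metrics.
# Assumption: allow rate = granted / resolved, and the "vs TC avg"
# figures are simple percentage-point differences.

def allow_rate(granted: int, resolved: int) -> float:
    """Career allow rate as a percentage of resolved applications."""
    return round(100 * granted / resolved, 1)

def delta_vs_tc(rate: float, tc_avg: float) -> float:
    """Percentage-point difference against the Tech Center average."""
    return round(rate - tc_avg, 1)

rate = allow_rate(759, 851)         # 89.2 -- matches the "89%" shown above
delta = delta_vs_tc(rate, 55.0)     # +34.2 if the TC average were 55.0%
print(rate, delta)
```

Under these assumptions, 759 granted out of 851 resolved gives 89.2%, which rounds to the 89% headline figure.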

Office Action

§101 §102
DETAILED ACTION

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is responsive to the application filed 06/12/2023. Claims 1-19 are presented for examination.

Priority

2. Receipt is acknowledged of papers submitted under 35 U.S.C. 119(a)-(d), based on application No. 202210746272.0 filed in CHINA on 06/29/2022, which papers have been placed of record in the file.

Information Disclosure Statement

3. The Applicant's Information Disclosure Statements (filed 07/21/2023 and 11/30/2023) have been received, entered into the record, and considered.

Drawings

4. The drawings filed 06/12/2023 are acceptable for examination purposes.

Specification

5. The specification has not been checked to the extent necessary to determine the presence of all possible minor errors. Applicant's cooperation is requested in correcting any errors of which applicant may become aware in the specification.

Claim Rejections - 35 USC § 101

6. 35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1, the limitations “determine a task representation element corresponding to a task representation framework, wherein the task representation framework is used to define a content format for describing a to-be-processed task, and the task representation element comprises: an element used to define task description information, an element used to define task input information, and an element used to define task output information”; “determine an encoding sequence corresponding to each of the plurality of to-be-processed tasks”; and “process each of the plurality of to-be-processed tasks based on the encoding sequence corresponding to each of the plurality of to-be-processed tasks to obtain a task processing result corresponding to each of the plurality of to-be-processed tasks,” as drafted, are functions that, under their broadest reasonable interpretation, recite the abstract idea of a mental process. The limitations encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, these limitations fall within the “Mental Processes” grouping of abstract ideas under Prong 1. Under Prong 2, this judicial exception is not integrated into a practical application.
The additional elements “a system for multi-modal multi-task processing, comprising: a task representation component having circuitry”, “a data conversion component, communicatively coupled to the task representation component, and having circuitry”, and “a data processing component, communicatively coupled to the data conversion component, and having circuitry” are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “based on the task representation element, acquire task description information, task input information, and task output information corresponding to each of a plurality of to-be-processed tasks in different modalities” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. Accordingly, the additional elements do not integrate the recited judicial exception into a practical application, and the claim is therefore directed to the judicial exception. See MPEP 2106.05(g). Under Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above with respect to integration of the abstract idea into a practical application, the additional elements “a system for multi-modal multi-task processing, comprising: a task representation component having circuitry”, “a data conversion component, communicatively coupled to the task representation component, and having circuitry”, and “a data processing component, communicatively coupled to the data conversion component, and having circuitry” amount to no more than mere instructions, or generic computer/computer components, to carry out the exception, and for the limitation “based on the task representation element, acquire task description information, task input information, and task output information corresponding to each of a plurality of to-be-processed tasks in different modalities” the courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. Accordingly, the claim is not patent eligible under 35 U.S.C. 101.

Regarding claim 2, the limitations “determine a target conversion module corresponding to each of the plurality of to-be-processed tasks among all the data conversion modules” and “process a corresponding to-be-processed task using the target conversion module to obtain an encoding sequence corresponding to each of the plurality of to-be-processed tasks” encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process.
The additional elements “the data conversion component having circuitry configured to determine the encoding sequence corresponding to each of the plurality of to-be-processed tasks, the data conversion component includes circuitry” are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “acquire all data conversion modules configured to process the plurality of to-be-processed tasks” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. The courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.
Regarding claim 3, the limitations “process the task description information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a first encoding sequence”; “process the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a second encoding sequence”; and “process the task output information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a third encoding sequence” encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process. The additional elements “the data conversion component having circuitry configured to process the corresponding to-be-processed task using the target conversion module to obtain the encoding sequence corresponding to each of the plurality of to-be-processed tasks, the data conversion module includes circuitry” and “based on the first encoding sequence, the second encoding sequence, and the third encoding sequence” are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “acquire the task description information, the task input information, and the task output information corresponding to each of the plurality of to-be-processed tasks” and “obtain the encoding sequence corresponding to each of the plurality of to-be-processed tasks” do nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. The courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d).
The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.

Regarding claim 4, the limitations “determine a plurality of target samples in different modalities among the training samples” and “perform learning training on the plurality of target samples in different modalities to obtain the data processing component” encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process. The additional element “a learning training component having circuitry” is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “acquire training samples through the task representation framework, the training samples corresponding to a plurality of data modalities, and each training sample corresponding to a standard processing result” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. The courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d).
The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.

Regarding claim 5, the limitation “add the additional sample to the plurality of target samples to obtain adjusted samples that are used for training the multi-modal task processing system” encompasses a human mind carrying out the function through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process. The additional element “a learning training component having circuitry” is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “acquire an additional sample through the task representation framework” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. The courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept.
After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.

Regarding claim 6, the limitations “determining a task representation element corresponding to a task representation framework, wherein the task representation framework is used to define a content format for describing a to-be-processed task, and the task representation element comprises: an element used to define task description information, an element used to define task input information, and an element used to define task output information”; “determining an encoding sequence corresponding to each of the plurality of to-be-processed tasks”; and “processing each of the plurality of to-be-processed tasks based on the encoding sequence corresponding to each of the plurality of to-be-processed tasks, to obtain a task processing result corresponding to each of the plurality of to-be-processed tasks,” as drafted, are functions that, under their broadest reasonable interpretation, recite the abstract idea of a mental process. The limitations encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, these limitations fall within the “Mental Processes” grouping of abstract ideas under Prong 1. Under Prong 2, this judicial exception is not integrated into a practical application.
The additional element “multi-modal multi-task processing” is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “acquiring, based on the task representation element, the task description information, the task input information and the task output information corresponding to each of a plurality of to-be-processed tasks in different modalities” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. Accordingly, the additional elements do not integrate the recited judicial exception into a practical application, and the claim is therefore directed to the judicial exception. See MPEP 2106.05(g). Under Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element “multi-modal multi-task processing” amounts to no more than mere instructions, or generic computer/computer components, to carry out the exception, and for the limitation “acquiring, based on the task representation element, the task description information, the task input information and the task output information corresponding to each of a plurality of to-be-processed tasks in different modalities” the courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. Accordingly, the claim is not patent eligible under 35 U.S.C. 101. Regarding claims 7, 10, and 12, they correspond to claims 2-4.
Therefore, they are rejected for the same reasons.

Regarding claim 8, the limitations “detecting whether an adaptive conversion module that matches the data modality exists among all the data conversion modules” and “when the adaptive conversion module that matches the data modality exists, determining the adaptive conversion module as the target conversion module configured to process the to-be-processed task corresponding to the data modality” encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process. The additional element “acquiring a data modality corresponding to each of the plurality of to-be-processed task” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. The courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.
Regarding claim 9, the limitations “when an adaptive conversion module that matches the data modality does not exist among all the data conversion modules, generating an adaptive conversion module that matches the data modality” and “the adaptive conversion module that matches the data modality exists, determining the adaptive conversion module as the target conversion module configured to process the to-be-processed task corresponding to the data modality” encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.

Regarding claim 11, the limitations “determining a data type of task input data in the task input information” and “in response to the data type being discrete data, processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain the second encoding sequence” encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, the claim recites a further mental process. The additional element “in response to the data type being continuous data, acquiring a glossary used for processing the task input data” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception.
The courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. After considering all claim elements individually and as an ordered combination, it is determined that the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, for the reasons given above with respect to integration of the abstract idea into a practical application. Therefore, the claim is not patent eligible.

Regarding claim 13, the limitations “determining a task representation element corresponding to a task representation framework, wherein the task representation framework is used to define a content format for describing a to-be-processed task, and the task representation element comprises: an element used to define task description information, an element used to define task input information, and an element used to define task output information”; “determining an encoding sequence corresponding to each of the plurality of to-be-processed tasks”; and “processing each of the plurality of to-be-processed tasks based on the encoding sequence corresponding to each of the plurality of to-be-processed tasks, to obtain a task processing result corresponding to each of the plurality of to-be-processed tasks,” as drafted, are functions that, under their broadest reasonable interpretation, recite the abstract idea of a mental process. The limitations encompass a human mind carrying out the functions through observation, evaluation, judgment, and/or opinion, or even with the aid of pen and paper. Thus, these limitations fall within the “Mental Processes” grouping of abstract ideas under Prong 1.
Under Prong 2, this judicial exception is not integrated into a practical application. The additional elements “an apparatus comprising: a memory configured to store instructions; and one or more processors configured to execute the instructions to cause the apparatus to perform operations for multi-modal multi-task processing” are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer and/or generic computer components (MPEP 2106.05(f)), and “acquiring, based on the task representation element, the task description information, the task input information and the task output information corresponding to each of a plurality of to-be-processed tasks in different modalities” does nothing more than add insignificant extra-solution activity, merely gathering data, to the judicial exception. Accordingly, the additional elements do not integrate the recited judicial exception into a practical application, and the claim is therefore directed to the judicial exception. See MPEP 2106.05(g). Under Step 2B, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
As discussed above with respect to integration of the abstract idea into a practical application, the additional elements “an apparatus comprising: a memory configured to store instructions; and one or more processors configured to execute the instructions to cause the apparatus to perform operations for multi-modal multi-task processing” amount to no more than mere instructions, or generic computer/computer components, to carry out the exception, and for the limitation “acquiring, based on the task representation element, the task description information, the task input information and the task output information corresponding to each of a plurality of to-be-processed tasks in different modalities” the courts have identified that mere data/information gathering is well-understood, routine, and conventional activity. See MPEP 2106.05(d). The recitation of generic computer instructions and computer components to apply the judicial exception, and of mere data/information gathering, does not amount to significantly more and thus cannot provide an inventive concept. Accordingly, the claim is not patent eligible under 35 U.S.C. 101. Regarding claims 14, 17, and 19, they correspond to claims 2-4; therefore, they are rejected for the same reasons. Regarding claims 15, 16, and 18, they correspond to claims 8, 9, and 11; therefore, they are rejected for the same reasons.

Claim Rejections - 35 USC § 102

7. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C.
102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hu et al., “UniT: Multimodal Multitask Learning with a Unified Transformer”. The reference was cited by Applicant in the IDS filed 11/30/2023. It is noted that any citations to specific pages, columns, paragraphs, lines, or figures in the prior art references, and any interpretation of the references, should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. See MPEP 2123.

As to claim 1: Hu teaches a system for multi-modal multi-task processing (Title: Multimodal Multitask; Abstract: a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning), comprising: a task representation component (Fig. 2, page 3) having circuitry configured to: determine a task representation element (Fig. 2: task-specific output head) corresponding to a task representation framework, wherein the task representation framework is used to define a content format for describing a to-be-processed task, and the task representation element comprises: an element used to define task description information (Fig. 2: task-specific query embedding and task index), an element used to define task input information, and an element used to define task output information (page 3, right column: After encoding input modalities into hidden state sequences, we apply the transformer decoder on either a single encoded modality or the concatenated sequence of both encoded modalities, depending on whether the task is uni-modal (i.e. vision-only or language-only) or multimodal); and based on the task representation element, acquire task description information, task input information, and task output information corresponding to each of a plurality of to-be-processed tasks in different modalities (Fig. 2: image input, text input, task index; page 4, left column: We apply a visual transformer encoder E_v with N_v layers and hidden size d^e_v on top of the feature map x_v to further encode it to visual hidden states h_v of size L × d^e_v (where L = H_v × W_v is the length of the encoded visual hidden states). In addition, given that different tasks (such as object detection and VQA) might require extracting different types of information, we also add a task embedding vector w^task_v into the transformer encoder to allow it to extract task-specific information in its output; page 4, left column: Similar to the image encoder, in the text encoder, we also add a learned task embedding vector w^task_t as part of the BERT input); a data conversion component, communicatively coupled to the task representation component, and having circuitry configured to determine an encoding sequence corresponding to each of the plurality of to-be-processed tasks (page 3 and Figure 2: An overview of our UniT model, which jointly handles a wide range of tasks in different domains with a unified transformer encoder-decoder architecture. Our model uses an image encoder to encode the visual inputs (Sec. 3.1), a text encoder to encode the language inputs (Sec. 3.2), and a joint decoder with per-task query embedding (Sec. 3.3) followed by task-specific heads (Sec. 3.4) to make the final outputs for each task); and a data processing component, communicatively coupled to the data conversion component, and having circuitry configured to process each of the plurality of to-be-processed tasks based on the encoding sequence corresponding to each of the plurality of to-be-processed tasks to obtain a task processing result corresponding to each of the plurality of to-be-processed tasks (section 3.3, Domain-agnostic UniT decoder: After encoding the input modalities, we apply on them a transformer decoder D with hidden size d^t_d and number of layers N_d to output a sequence of decoded hidden states h^dec for predictions on each task. Unlike the image and text encoders with specific architectural designs for each modality, our decoder is built upon the same domain-agnostic transformer decoder architecture [59] across all tasks... The transformer decoder D takes the encoded input sequence h^enc and a task-specific query embedding sequence q^task of length q. It outputs a sequence of decoded hidden states h^dec_l for each l-th transformer decoder layer, which has the same length q as the query embedding q^task; section 3.4: Task-specific output heads... A task-specific prediction head is applied over the decoder hidden states {h^dec_t} for each task t).
As to claim 2: Hu teaches when the data conversion component having circuitry configured to determine the encoding sequence corresponding to each of the plurality of to-be-processed tasks, the data conversion component includes circuitry further configured to: acquire all data conversion modules (Fig.2: image encoder, text encoder) configured to process the plurality of to-be- processed tasks; determine a target conversion module corresponding to each of the plurality of to-be- processed tasks among all the data conversion modules; and process a corresponding to-be-processed task using the target conversion module to obtain an encoding sequence corresponding to each of the plurality of to-be-processed tasks (Fig.2 and page 3, section 3). As to claim 3: Hu teaches when the data conversion component having circuitry configured to process the corresponding to-be-processed task using the target conversion module to obtain the encoding sequence corresponding to each of the plurality of to- be-processed tasks, the data conversion module includes circuitry further configured to: acquire the task description information, the task input information, and the task output information corresponding to each of the plurality of to-be-processed tasks; process the task description information corresponding to each of the plurality of to-be- processed tasks using the target conversion module to obtain a first encoding sequence (page 4, section 3.3); process the task input information corresponding to each of the plurality of to-be- processed tasks using the target conversion module to obtain a second encoding sequence (page 4, section 3.3); process the task output information corresponding to each of the plurality of to-be- processed tasks using the target conversion module to obtain a third encoding sequence (page 4, section 3.3); and based on the first encoding sequence, the second encoding sequence, and the third encoding sequence, obtain the encoding sequence corresponding to 
each of the plurality of to-be-processed tasks (page 4, section 3.3). As to claim 4: Hu teaches a learning training component having circuitry configured to: acquire training samples through the task representation framework, the training samples corresponding to a plurality of data modalities, and each training sample corresponding to a standard processing result; determine a plurality of target samples in different modalities among the training samples; and perform learning training on the plurality of target samples in different modalities to obtain the data processing component (Fig.2, pages 4-5, section 3.4, and page 5, section 3.5). As to claim 5: Hu teaches after determining the plurality of target samples in different modalities, the learning training component includes circuitry further configured to: acquire an additional sample through the task representation framework; and add the additional sample to the plurality of target samples to obtain adjusted samples that are used for training the multi-modal task processing system (Fig.2, pages 4-5, section 3.4, and page 5, section 3.5). 
As to claim 6: Hu teaches a method for multi-modal multi-task (Title: Multimodal Multitask; Abstract: a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning) processing comprising: determining a task representation element corresponding to a task representation framework, wherein the task representation framework is used to define a content format for describing a to-be-processed task, and the task representation element comprises: an element used to define task description information (Fig.2: task-specific query embedding and task index), an element used to define task input information, and an element used to define task output information (page 3, right column: After encoding input modalities into hidden state sequences, we apply the transformer decoder on either a single encoded modality or the concatenated sequence of both encoded modalities, depending on whether the task is uni-modal (i.e. vision-only or language-only) or multimodal); acquiring, based on the task representation element, the task description information, the task input information and the task output information corresponding to each of a plurality of to-be-processed tasks in different modalities (Fig.2: image input, text input, task index; page 4, left column: We apply a visual transformer encoder E_v with N_v layers and hidden size d_v^e on top of the feature map x_v to further encode it to visual hidden states h_v of size L × d_v^e (where L = H_v × W_v is the length of the encoded visual hidden states). 
In addition, given that different tasks (such as object detection and VQA) might require extracting different types of information, we also add a task embedding vector w_v^task into the transformer encoder to allow it to extract task-specific information in its output; page 4, left column: Similar to the image encoder, in the text encoder, we also add a learned task embedding vector w_t^task as part of the BERT input); determining an encoding sequence corresponding to each of the plurality of to-be-processed tasks (page 3 and Figure 2: An overview of our UniT model, which jointly handles a wide range of tasks in different domains with a unified transformer encoder-decoder architecture. Our model uses an image encoder to encode the visual inputs (Sec. 3.1), a text encoder to encode the language inputs (Sec. 3.2), and a joint decoder with per-task query embedding (Sec. 3.3) followed by task-specific heads (Sec. 3.4) to make the final outputs for each task); and processing each of the plurality of to-be-processed tasks based on the encoding sequence corresponding to each of the plurality of to-be-processed tasks, to obtain a task processing result corresponding to each of the plurality of to-be-processed tasks (3.3. Domain-agnostic UniT decoder: After encoding the input modalities, we apply on them a transformer decoder D with hidden size d_t^d and number of layers N_d to output a sequence of decoded hidden states h^dec for predictions on each task. Unlike the image and text encoders with specific architectural designs for each modality, our decoder is built upon the same domain-agnostic transformer decoder architecture [59] across all tasks... The transformer decoder D takes the encoded input sequence h^enc and a task-specific query embedding sequence q^task of length q. It outputs a sequence of decoded hidden states h^dec_l for the l-th transformer decoder layer, which has the same length q as the query embedding q^task; 3.4: Task-specific output heads... 
A task-specific prediction head is applied over the decoder hidden states {h^dec_t} for each task t). As to claim 7: Hu teaches determining the encoding sequence corresponding to each of the plurality of to-be-processed tasks further comprises: acquiring all data conversion modules (Fig.2: image encoder, text encoder) configured to process the plurality of to-be-processed tasks; determining a target conversion module corresponding to each of the plurality of to-be-processed tasks among all the data conversion modules; and processing a corresponding to-be-processed task using the target conversion module to obtain an encoding sequence corresponding to each of the plurality of to-be-processed tasks (Fig.2 and page 3, section 3). As to claim 8: Hu teaches determining the target conversion module corresponding to each of the plurality of to-be-processed tasks among all the data conversion modules further comprises: acquiring a data modality corresponding to each of the plurality of to-be-processed tasks; and detecting whether an adaptive conversion module that matches the data modality exists among all the data conversion modules; when the adaptive conversion module that matches the data modality exists, determining the adaptive conversion module as the target conversion module configured to process the to-be-processed task corresponding to the data modality (Fig.2, page 3, section 3 and page 4, section 3.3). As to claim 9: Hu teaches when an adaptive conversion module that matches the data modality does not exist among all the data conversion modules, generating an adaptive conversion module that matches the data modality; and determining the adaptive conversion module as the target conversion module configured to process the to-be-processed task corresponding to the data modality (Fig.2, page 3, section 3 and page 4, section 3.3). 
As to claim 10: Hu teaches processing the corresponding to-be-processed task using the target conversion module to obtain the encoding sequence corresponding to each of the plurality of to-be-processed tasks further comprises: acquiring the task description information, the task input information and the task output information corresponding to each of the plurality of to-be-processed tasks; processing the task description information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a first encoding sequence (page 4, section 3.3); processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a second encoding sequence (page 4, section 3.3); processing the task output information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a third encoding sequence (page 4, section 3.3); and based on the first encoding sequence, the second encoding sequence and the third encoding sequence, obtaining the encoding sequence corresponding to each of the plurality of to-be-processed tasks (page 4, section 3.3). 
As to claim 11: Hu teaches processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain the second encoding sequence further comprises: determining a data type of task input data in the task input information; in response to the data type being discrete data, processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain the second encoding sequence; and in response to the data type being continuous data, acquiring a glossary used for processing the task input data; and processing the task input information corresponding to each of the plurality of to-be-processed tasks using the glossary and the target conversion module to obtain the second encoding sequence (Fig.2 and page 4, section 3.3). As to claim 12: Hu teaches acquiring training samples through the task representation framework, the training samples corresponding to a plurality of data modalities, and each training sample corresponding to a standard processing result; determining a plurality of target samples in different modalities among the training samples; and performing learning training on the plurality of target samples in different modalities (Fig.2, pages 4-5, section 3.4, and page 5, section 3.5). 
As to claim 13: Hu teaches an apparatus comprising: a memory configured to store instructions; and one or more processors configured to execute the instructions to cause the apparatus to perform operations for multi-modal multi-task processing (Title: Multimodal Multitask; Abstract: a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning), wherein the operations comprise: determining a task representation element corresponding to a task representation framework, wherein the task representation framework is used to define a content format for describing a to-be-processed task, and the task representation element comprises: an element used to define task description information (Fig.2: task-specific query embedding and task index), an element used to define task input information, and an element used to define task output information (page 3, right column: After encoding input modalities into hidden state sequences, we apply the transformer decoder on either a single encoded modality or the concatenated sequence of both encoded modalities, depending on whether the task is uni-modal (i.e. vision-only or language-only) or multimodal); acquiring, based on the task representation element, the task description information, the task input information and the task output information corresponding to each of a plurality of to-be-processed tasks in different modalities (Fig.2: image input, text input, task index; page 4, left column: We apply a visual transformer encoder E_v with N_v layers and hidden size d_v^e on top of the feature map x_v to further encode it to visual hidden states h_v of size L × d_v^e (where L = H_v × W_v is the length of the encoded visual hidden states). 
In addition, given that different tasks (such as object detection and VQA) might require extracting different types of information, we also add a task embedding vector w_v^task into the transformer encoder to allow it to extract task-specific information in its output; page 4, left column: Similar to the image encoder, in the text encoder, we also add a learned task embedding vector w_t^task as part of the BERT input); determining an encoding sequence corresponding to each of the plurality of to-be-processed tasks (page 3 and Figure 2: An overview of our UniT model, which jointly handles a wide range of tasks in different domains with a unified transformer encoder-decoder architecture. Our model uses an image encoder to encode the visual inputs (Sec. 3.1), a text encoder to encode the language inputs (Sec. 3.2), and a joint decoder with per-task query embedding (Sec. 3.3) followed by task-specific heads (Sec. 3.4) to make the final outputs for each task); and processing each of the plurality of to-be-processed tasks based on the encoding sequence corresponding to each of the plurality of to-be-processed tasks, to obtain a task processing result corresponding to each of the plurality of to-be-processed tasks (section 3.3, Domain-agnostic UniT decoder: After encoding the input modalities, we apply on them a transformer decoder D with hidden size d_t^d and number of layers N_d to output a sequence of decoded hidden states h^dec for predictions on each task. Unlike the image and text encoders with specific architectural designs for each modality, our decoder is built upon the same domain-agnostic transformer decoder architecture [59] across all tasks... The transformer decoder D takes the encoded input sequence h^enc and a task-specific query embedding sequence q^task of length q. 
It outputs a sequence of decoded hidden states h^dec_l for the l-th transformer decoder layer, which has the same length q as the query embedding q^task; section 3.4: Task-specific output heads... A task-specific prediction head is applied over the decoder hidden states {h^dec_t} for each task t). As to claim 14: Hu teaches determining the encoding sequence corresponding to each of the plurality of to-be-processed tasks further comprises: acquiring all data conversion modules (Fig.2: image encoder, text encoder) configured to process the plurality of to-be-processed tasks; determining a target conversion module corresponding to each of the plurality of to-be-processed tasks among all the data conversion modules; and processing a corresponding to-be-processed task using the target conversion module to obtain an encoding sequence corresponding to each of the plurality of to-be-processed tasks (Fig.2 and page 3, section 3). As to claim 15: Hu teaches determining the target conversion module corresponding to each of the plurality of to-be-processed tasks among all the data conversion modules further comprises: acquiring a data modality corresponding to each of the plurality of to-be-processed tasks; and detecting whether an adaptive conversion module that matches the data modality exists among all the data conversion modules; when the adaptive conversion module that matches the data modality exists, determining the adaptive conversion module as the target conversion module configured to process the to-be-processed task corresponding to the data modality (Fig.2, page 3, section 3 and page 4, section 3.3). 
As to claim 16: Hu teaches when an adaptive conversion module that matches the data modality does not exist among all the data conversion modules, generating an adaptive conversion module that matches the data modality; and determining the adaptive conversion module as the target conversion module configured to process the to-be-processed task corresponding to the data modality (Fig.2, page 3, section 3 and page 4, section 3.3). As to claim 17: Hu teaches processing the corresponding to-be-processed task using the target conversion module to obtain the encoding sequence corresponding to each of the plurality of to-be-processed tasks further comprises: acquiring the task description information, the task input information and the task output information corresponding to each of the plurality of to-be-processed tasks; processing the task description information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a first encoding sequence (page 4, section 3.3); processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a second encoding sequence (page 4, section 3.3); processing the task output information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain a third encoding sequence; and based on the first encoding sequence, the second encoding sequence and the third encoding sequence, obtaining the encoding sequence corresponding to each of the plurality of to-be-processed tasks (page 4, section 3.3). 
As to claim 18: Hu teaches processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain the second encoding sequence further comprises: determining a data type of task input data in the task input information; in response to the data type being discrete data, processing the task input information corresponding to each of the plurality of to-be-processed tasks using the target conversion module to obtain the second encoding sequence; and in response to the data type being continuous data, acquiring a glossary used for processing the task input data; and processing the task input information corresponding to each of the plurality of to-be-processed tasks using the glossary and the target conversion module to obtain the second encoding sequence (Fig.2 and page 4, section 3.3). As to claim 19: Hu teaches acquiring training samples through the task representation framework, the training samples corresponding to a plurality of data modalities, and each training sample corresponding to a standard processing result; determining a plurality of target samples in different modalities among the training samples; and performing learning training on the plurality of target samples in different modalities (Fig.2, pages 4-5, section 3.4, and page 5, section 3.5). Conclusion 7. The prior art made of record, listed on the form PTO-892 provided to Applicant, is considered relevant to the claimed invention. Applicant should review each identified reference carefully before responding to this office action to properly advance the case in light of the prior art. Contact Information 8. Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAN H. NGUYEN whose telephone number is (571) 272-3765. The examiner can normally be reached on Monday-Friday from 9:00 AM to 5:30 PM. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, LEWIS BULLOCK, can be reached at telephone number (571) 272-3759. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from Patent Center and the Private Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from Patent Center or Private PAIR. Status information for unpublished applications is available through Patent Center or Private PAIR to authorized users only. Should you have questions about access to Patent Center or the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) Form at https://www.uspto.gov/patents/uspto-automated-interview-request-air-form. /VAN H NGUYEN/ Primary Examiner, Art Unit 2199

Prosecution Timeline

Jun 12, 2023
Application Filed
Jan 08, 2026
Non-Final Rejection — §101, §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602262
SHARED RESOURCE POOL WITH PERIODIC REBALANCING IN A MULTI-CORE SYSTEM
2y 5m to grant Granted Apr 14, 2026
Patent 12591467
SYSTEM AND METHOD FOR HALTING PROCESSING CORES IN A MULTICORE SYSTEM
2y 5m to grant Granted Mar 31, 2026
Patent 12591456
METHOD AND APPARATUS FOR CONTROLLING HARDWARE ACCELERATOR
2y 5m to grant Granted Mar 31, 2026
Patent 12591468
DYNAMIC MANAGEMENT OF FEATURES FOR PROCESSES EXECUTABLE ON AN INFORMATION HANDLING SYSTEM
2y 5m to grant Granted Mar 31, 2026
Patent 12585496
METHOD, APPARATUS AND COMPUTER PROGRAM FOR ACTIVATING A SCHEDULING CONFIGURATION
2y 5m to grant Granted Mar 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
89%
Grant Probability
99%
With Interview (+18.4%)
3y 4m
Median Time to Grant
Low
PTA Risk
Based on 851 resolved cases by this examiner. Grant probability derived from career allow rate.
