DETAILED ACTION
Status of Claims
This Office action is responsive to communications filed on 2025-12-17. Claims 1-20 are pending and are examined herein.
Claims 1-20 are rejected under 35 USC 101.
Claims 1-20 are rejected under 35 USC 103.
Notice of Pre-AIA or AIA Status
The present application, filed on or after 2013-03-16, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Regarding rejections under 35 USC 101, the applicant’s arguments have been fully considered but they are not persuasive. The applicant attempts to argue that the pending claims provide an improvement.
The applicant asserts that the invention claimed by pending claims “can incorporate both text and image data during implementation and identify content (text, images, and so forth) more accurately based on a textual query that includes a complicated or rare combination of terms” [remarks, page 9]. This remark is not persuasive on several counts. First, the pending independent claims recite nothing about text data or textual queries, so the remark does not appear to address the substance of the pending claims. Second, even if the independent claims did recite text data or textual queries, this would merely be a further description of data to be provided as input to the claim (cf. the 101 rejection of dependent claim 9). Third, determining whether a combination of terms in a textual query is “complicated or rare” would be a subjective assessment for which the applicant has provided no objective criteria. Fourth, the applicant has not indicated what additional elements, if any, actually provide this purported improvement.
The applicant asserts, without any justification, that the “transformation of the image data” which now appears in the independent claims is an “additional (i.e., non-abstract)” element [remarks, pages 9-10]. The examiner respectfully disagrees. The specification (e.g., [specification, 0032]) makes clear that the transformations of the claims are those involved in orthogonal super-positioning, which, as indicated in the previous Office action and again below, is an abstract idea, not an additional element.
The additional elements recited in the pending claims are: generic computing equipment, a generic machine learning model, a description of the input to the machine learning model, and a step of providing input to the machine learning model. These are all generic in the sense that any use of machine learning involves these elements (any machine learning model operates on generic computing equipment, has particular input data, and that input data must be provided to the model). Since the additional elements are all generic, any purported improvement to the art of machine learning provided by the pending independent claims is necessarily provided entirely by the abstract ideas that are recited in the claims. The pending claims therefore do not meet the requirements of the improvements analysis: MPEP 2106.05(a) indicates that one of the requirements of the improvements analysis is that a “judicial exception alone cannot provide the improvement”.
The complete 101 analysis, updated in view of the applicant’s amendments, is given below.
Regarding rejections under 35 USC 103, the applicant’s arguments have been fully considered but they are not persuasive. The applicant’s remarks [remarks, page 12] are focused on the amendments to the independent claims, which amount to requiring that the orthogonal super-position be part of a “transformation of the image data based on input requirements of” the model, where the transformation results in inputs “which comply with the input requirements” of the model [claims 1, 10, and 19]. The applicant asserts that Zhang in view of Cheung fails to disclose these features. The examiner respectfully disagrees: the superpositioning disclosed by Cheung involves a transformation of the inputs. More precisely, it involves a rotation of input vectors, and since rotating vectors preserves their lengths, the results of this transformation do in fact “comply with the input requirements” of the model that is described therein. Further quotes and details are provided in the updated prior art mapping below.
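The length-preservation property relied upon above can be verified numerically with a short sketch (illustrative only; the 2-D rotation, angle, and variable names below are the examiner’s hypothetical and are not drawn from Cheung):

```python
import math

# A rotation matrix is orthogonal, so applying it preserves the norm of
# the input vector; a rotated input therefore still satisfies any
# norm-based input requirement that the original input satisfied.
def rotate(v, theta):
    """Rotate a 2-D vector v by angle theta (an orthogonal transformation)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def norm(v):
    return math.hypot(v[0], v[1])

x = (3.0, 4.0)            # input vector with norm 5.0
x_rot = rotate(x, 0.7)    # transformed input

assert math.isclose(norm(x), norm(x_rot))  # length is preserved
```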
Examiner’s Remarks
The claims use the word “permutation” in a way that deviates from its typical usage in the art. While a person of ordinary skill in the art would understand a “permutation” to refer to a linear (re)arrangement of the elements of a set or list, the applicant’s specification repeatedly uses the word “permutation” to refer to a “possible solution” (e.g., “a plurality of permutations or possibilities may be generated, such as possibilities or permutations of images that may correspond to possible answers to the problem matrices of the RAVEN training dataset” [specification, 0030], “permutations 210 or possible solutions to the problem matrices of the training dataset” [specification, 0032], etc). The broadest reasonable interpretation of the word “permutation” in view of the applicant’s disclosure therefore appears to encompass at least a possible solution/output to a given problem/input.
Claim Rejections - 35 USC 101
35 USC 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 USC 101 because the claimed inventions are directed to abstract ideas without significantly more.
Claim 1
Step 1. The claim and its dependents 2-9 fall under the statutory category of processes (methods). An analysis of Step 2 for each of the rejected claims follows.
Step 2A Prong 1. The claim recites the following abstract ideas:
generating inputs… the generating comprising: performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model… the transformation of the image data resulting in the inputs which comply with the input requirements; (This recites a mathematical concept. The specification [specification, 0032] indicates that the “transformation” is orthogonal super-position, which encompasses multiplication by context matrices C_k^{-1} and a summation of matrices [specification, 0034-0036], which are mathematical concepts. See MPEP 2106.04(a)(2)(I).)
by 1) determining a plurality of permutations based on the one or more images included in the image data, (This recites a mental process that can be performed in the human mind or by a human using pen and paper. The specification explains that the scope of the “inputs” of the claim encompasses Raven’s Progressive Matrices (RPMs), and the scope of the “permutations” of the claim encompasses possible solutions to RPMs [specification, 0026]. A human being assisted with pen and paper can generate RPMs and possible solutions. In fact, RPMs were first developed in 1936, before the advent of modern general-purpose computers (cf. conclusion of a previous Office action). See MPEP 2106.04(a)(2)(III).)
and 2) applying orthogonal super-positioning relative to the plurality of permutations; (This recites a mathematical concept. The specification indicates that this encompasses multiplication by context matrices C_k^{-1} and a summation of matrices [specification, 0034-0036], which are mathematical concepts. See MPEP 2106.04(a)(2)(I).)
and generating, [by the multimodal learning based machine learning model,] a prediction specific to at least one image of the one or more images. (This recites a mental process that can be performed in the human mind or by a human using pen and paper. As noted above, RPMs were first developed in 1936, well before the advent of modern computers, and are used to measure human intelligence (cf. conclusion of a previous Office action). The examiner notes that the reference Zhang also specifically discloses measuring “human performance” on the RAVEN dataset [Zhang, abstract], which means that a human mind is also able to “generat[e]… a prediction” as recited by the claim. See MPEP 2106.04(a)(2)(III).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
A computer-implemented method comprising: (This recites generic computing components for performing an abstract idea. See MPEP 2106.05(f)(2).)
[inputs] specific to a multimodal learning based machine learning model from a training dataset comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
providing the inputs that are generated, based on the orthogonal super-positioning, into the multimodal learning based machine learning model; (This recites insignificant extra-solution activity. See MPEP 2106.05(g).)
by the multimodal learning based machine learning model, (This recites merely applying (or equivalent) an abstract idea, or implementing an abstract idea on a computer, or using a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
A computer-implemented method comprising: (This recites generic computing components for performing an abstract idea. See MPEP 2106.05(f)(2).)
[inputs] specific to a multimodal learning based machine learning model from a training dataset comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
providing the inputs that are generated, based on the orthogonal super-positioning, into the multimodal learning based machine learning model; (This insignificant extra-solution activity is well-understood, routine, conventional as it is mere data transfer. See MPEP 2106.05(d)(II), “Receiving or transmitting data over a network” and/or “Storing and retrieving information in memory”.)
by the multimodal learning based machine learning model, (This recites merely applying (or equivalent) an abstract idea, or implementing an abstract idea on a computer, or using a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).)
Claim 2
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] the multimodal learning based machine learning model is a contrastive language-image pre-training model. (This recites insignificant extra-solution activity. See MPEP 2106.05(g).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] the multimodal learning based machine learning model is a contrastive language-image pre-training model. (Contrastive language-image pre-training (CLIP) models are well-understood, routine, conventional. For example, Lüddecke as cited in the conclusion of a previous Office action mentions the “well-known CLIP transformer” [Lüddecke, section 1 paragraph beginning “Contributions”] and thus provides evidence in support of this determination.)
Claim 3
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images corresponds to a color alteration rule. (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images corresponds to a color alteration rule. (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
Claim 4
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images corresponds to a Boolean rule. (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images corresponds to a Boolean rule. (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
Claim 5
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images comprises an orientation specific to each of the one or more images. (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images comprises an orientation specific to each of the one or more images. (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
Claim 6
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
[The computer-implemented method of claim 1, wherein the applying of the orthogonal super-positioning relative to the plurality of permutations comprises:] associating at least a first subset of the plurality of permutations with a first parameter; (This recites a mental process that can be performed in the human mind or by a human using pen and paper. See MPEP 2106.04(a)(2)(III).)
and associating at least a second subset of the plurality of permutations with a second parameter, (This recites a mental process that can be performed in the human mind or by a human using pen and paper. See MPEP 2106.04(a)(2)(III).)
wherein the second parameter is oriented orthogonally with respect to the first parameter. (This recites a mathematical concept. See MPEP 2106.04(a)(2)(I).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
Claim 7
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
[The computer-implemented method of claim 6, wherein the applying of the orthogonal super-positioning relative to the plurality of permutations further comprises] associating at least a third subset of the plurality of permutations with a third parameter. (This recites a mental process that can be performed in the human mind or by a human using pen and paper. See MPEP 2106.04(a)(2)(III).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
Claim 8
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
[The computer-implemented method of claim 7, further comprising] associating at least a fourth subset of the plurality of permutations with a fourth parameter, (This recites a mental process that can be performed in the human mind or by a human using pen and paper. See MPEP 2106.04(a)(2)(III).)
wherein the fourth parameter is oriented orthogonally with respect to the third parameter. (This recites a mathematical concept. See MPEP 2106.04(a)(2)(I).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
Claim 9
Step 2A Prong 1. The claim recites the following abstract ideas:
The abstract idea(s) in the parent claim(s).
[and wherein the generating of the prediction specific to the at least one image of the one or more images comprises] identifying text that is representative of the one or more objects of the at least one image. (This recites a mental process that can be performed in the human mind or by a human using pen and paper. See MPEP 2106.04(a)(2)(III).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, further comprising:] receiving a query regarding the at least one image of the one or more images, wherein the at least one image of the one or more images comprises one or more objects; (This recites insignificant extra-solution activity. See MPEP 2106.05(g).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
The additional element(s) in the parent claim(s).
[The computer-implemented method of claim 1, further comprising:] receiving a query regarding the at least one image of the one or more images, wherein the at least one image of the one or more images comprises one or more objects; (This insignificant extra-solution activity is well-understood, routine, conventional as it is mere data transfer. See MPEP 2106.05(d)(II), “Receiving or transmitting data over a network” and/or “Storing and retrieving information in memory”.)
Claim 10
Step 1. The claim and its dependents 11-18 fall under the statutory category of machines.
Step 2A Prong 1. The claim recites the following abstract ideas:
generating inputs… the generating comprising: performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model… the transformation of the image data resulting in the inputs which comply with the input requirements; (This recites a mathematical concept. The specification [specification, 0032] indicates that the “transformation” is orthogonal super-position, which encompasses multiplication by context matrices C_k^{-1} and a summation of matrices [specification, 0034-0036], which are mathematical concepts. See MPEP 2106.04(a)(2)(I).)
by 1) determining a plurality of permutations based on the one or more images included in the image data, (This recites a mental process that can be performed in the human mind or by a human using pen and paper. The specification explains that the scope of the “inputs” of the claim encompasses Raven’s Progressive Matrices (RPMs), and the scope of the “permutations” of the claim encompasses possible solutions to RPMs [specification, 0026]. A human being assisted with pen and paper can generate RPMs and possible solutions. In fact, RPMs were first developed in 1936, before the advent of modern general-purpose computers (cf. conclusion of a previous Office action). See MPEP 2106.04(a)(2)(III).)
and 2) applying orthogonal super-positioning relative to the plurality of permutations; (This recites a mathematical concept. The specification indicates that this encompasses multiplication by context matrices C_k^{-1} and a summation of matrices [specification, 0034-0036], which are mathematical concepts. See MPEP 2106.04(a)(2)(I).)
and generating, [by the multimodal learning based machine learning model,] a prediction specific to at least one image of the one or more images. (This recites a mental process that can be performed in the human mind or by a human using pen and paper. As noted above, RPMs were first developed in 1936, well before the advent of modern computers, and are used to measure human intelligence (cf. conclusion of a previous Office action). The examiner notes that the reference Zhang also specifically discloses measuring “human performance” on the RAVEN dataset [Zhang, abstract], which means that a human mind is also able to “generat[e]… a prediction” as recited by the claim. See MPEP 2106.04(a)(2)(III).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
A system comprising: at least one data processor; and at least one memory storing instructions, which when executed by the at least one data processor, cause operations comprising: (This recites generic computing components for performing an abstract idea. See MPEP 2106.05(f)(2).)
[inputs] specific to a multimodal learning based machine learning model from a training dataset comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
providing the inputs that are generated, based on the orthogonal super-positioning, into the multimodal learning based machine learning model; (This recites insignificant extra-solution activity. See MPEP 2106.05(g).)
by the multimodal learning based machine learning model, (This recites merely applying (or equivalent) an abstract idea, or implementing an abstract idea on a computer, or using a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
A system comprising: at least one data processor; and at least one memory storing instructions, which when executed by the at least one data processor, cause operations comprising: (This recites generic computing components for performing an abstract idea. See MPEP 2106.05(f)(2).)
[inputs] specific to a multimodal learning based machine learning model from a training dataset comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
providing the inputs that are generated, based on the orthogonal super-positioning, into the multimodal learning based machine learning model; (This insignificant extra-solution activity is well-understood, routine, conventional as it is mere data transfer. See MPEP 2106.05(d)(II), “Receiving or transmitting data over a network” and/or “Storing and retrieving information in memory”.)
by the multimodal learning based machine learning model, (This recites merely applying (or equivalent) an abstract idea, or implementing an abstract idea on a computer, or using a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).)
Claims 11-18 inherit limitations from claim 10 and recite additional limitations which are substantially similar to those recited by claims 2-9, respectively, so they are rejected by the same rationale.
Claim 19
Step 1. The claim and its dependent claim 20 fall under the statutory category of articles of manufacture.
Step 2A Prong 1. The claim recites the following abstract ideas:
generating inputs… the generating comprising: performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model… the transformation of the image data resulting in the inputs which comply with the input requirements; (This recites a mathematical concept. The specification [specification, 0032] indicates that the “transformation” is orthogonal super-position, which encompasses multiplication by context matrices C_k^{-1} and a summation of matrices [specification, 0034-0036], which are mathematical concepts. See MPEP 2106.04(a)(2)(I).)
by 1) determining a plurality of permutations based on the one or more images included in the image data, (This recites a mental process that can be performed in the human mind or by a human using pen and paper. The specification explains that the scope of the “inputs” of the claim encompasses Raven’s Progressive Matrices (RPMs), and the scope of the “permutations” of the claim encompasses possible solutions to RPMs [specification, 0026]. A human being assisted with pen and paper can generate RPMs and possible solutions. In fact, RPMs were first developed in 1936, before the advent of modern general-purpose computers (cf. conclusion of a previous Office action). See MPEP 2106.04(a)(2)(III).)
and 2) applying orthogonal super-positioning relative to the plurality of permutations; (This recites a mathematical concept. The specification indicates that this encompasses multiplication by context matrices C_k^{-1} and a summation of matrices [specification, 0034-0036], which are mathematical concepts. See MPEP 2106.04(a)(2)(I).)
and generating, [by the multimodal learning based machine learning model,] a prediction specific to at least one image of the one or more images. (This recites a mental process that can be performed in the human mind or by a human using pen and paper. As noted above, RPMs were first developed in 1936, well before the advent of modern computers, and are used to measure human intelligence (cf. conclusion of a previous Office action). The examiner notes that the reference Zhang also specifically discloses measuring “human performance” on the RAVEN dataset [Zhang, abstract], which means that a human mind is also able to “generat[e]… a prediction” as recited by the claim. See MPEP 2106.04(a)(2)(III).)
Step 2A Prong 2. The claim recites the following additional elements which, considered individually and as an ordered combination, do not integrate the abstract idea into a practical application:
A non-transitory computer-readable storage medium comprising programming code, which when executed by at least one data processor, causes operations comprising: (This recites generic computing components for performing an abstract idea. See MPEP 2106.05(f)(2).)
[inputs] specific to a multimodal learning based machine learning model from a training dataset comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
providing the inputs that are generated, based on the orthogonal super-positioning, into the multimodal learning based machine learning model; (This recites insignificant extra-solution activity. See MPEP 2106.05(g).)
by the multimodal learning based machine learning model, (This recites merely applying (or equivalent) an abstract idea, or implementing an abstract idea on a computer, or using a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).)
Step 2B. The claim recites the following additional elements which, considered individually and as an ordered combination, do not amount to significantly more than the abstract idea:
A non-transitory computer-readable storage medium comprising programming code, which when executed by at least one data processor, causes operations comprising: (This recites generic computing components for performing an abstract idea. See MPEP 2106.05(f)(2).)
[inputs] specific to a multimodal learning based machine learning model from a training dataset comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, (This recites data of a particular type or source, merely linking an abstract idea to a particular field of use. See MPEP 2106.05(h).)
providing the inputs that are generated, based on the orthogonal super-positioning, into the multimodal learning based machine learning model; (This insignificant extra-solution activity is well-understood, routine, conventional as it is mere data transfer. See MPEP 2106.05(d)(II), “Receiving or transmitting data over a network” and/or “Storing and retrieving information in memory”.)
by the multimodal learning based machine learning model, (This recites merely applying (or equivalent) an abstract idea, or implementing an abstract idea on a computer, or using a computer as a tool to perform an abstract idea. See MPEP 2106.05(f).)
Claim 20 inherits limitations from claim 19 and recites additional limitations which are substantially similar to those recited by claim 2, so it is rejected by the same rationale.
Claim Rejections - 35 USC 103
The following is a quotation of 35 USC 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 USC 102(b)(2)(C) for any potential 35 USC 102(a)(2) prior art against the later invention.
Claim(s) 1, 3-8, 10, 12-17, and 19 is/are rejected under 35 USC 103 as being unpatentable over Chi ZHANG et al. (RAVEN: A Dataset for Relational and Analogical Visual rEasoNing, published 2019-03-07; hereafter, “Zhang”) in view of Brian CHEUNG et al. (Superposition of many models into one, published 2019-06-17; hereafter, “Cheung”).
Claim 1
Zhang discloses:
A computer-implemented method comprising: generating inputs specific to a multimodal learning based machine learning model from a training dataset ([Zhang, abstract and section 6]: Zhang discloses a “new dataset, built in the context of Raven’s Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning” [Zhang, abstract]. It also discloses “adopt[ing] several representative models suitable for RPM and test[ing] their performances” [Zhang, section 6.1 first paragraph]. The dataset disclosed by Zhang is the “training dataset” of the claim, and any one of the machine learning models suitable for RPM and combining visual and structural reasoning maps to the “multimodal learning based machine learning model” of the claim.) comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, the generating comprising: ([Zhang, figures 1-2]: Zhang depicts an example RPM [Zhang, figure 1]. The images in the problem matrices [Zhang, figure 1(a)] map to the “image data” of the claim, and the “rules” [Zhang, figure 1(c) and caption] map to the “plurality of rules” of the claim. Zhang also indicates that the rules include orientation attributes [Zhang, figure 2(c)]. In other words, the rules also map to the “plurality of characterizations representative of orientations” as recited by the claim.)
1) determining a plurality of permutations based on the one or more images included in the image data, and ([Zhang, figure 1]: The images in the answer sets [Zhang, figure 1(b)] map to the “plurality of permutations” of the claim (cf. examiner’s remarks).)
providing the inputs that are generated, [based on the orthogonal super-positioning,] into the multimodal learning based machine learning model; and generating, by the multimodal learning based machine learning model, a prediction specific to at least one image of the one or more images. ([Zhang, section 6]: As noted above, Zhang discloses “adopt[ing] several representative models suitable for RPM and test[ing] their performances on RAVEN” [Zhang, section 6.1 first paragraph]. Problem matrices being input into one of these models maps to the “providing” step of the claim, and the answer chosen by the model for a given problem matrix maps to the “prediction specific to at least one image of the one or more images” of the claim.)
Zhang does not distinctly disclose:
performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model by [1) determining a plurality of permutations based on the one or more images included in the image data, and] 2) applying orthogonal super-positioning relative to the plurality of permutations, the transformation of the image data resulting in the inputs which comply with the input requirements; [providing the inputs that are generated,] based on the orthogonal super-positioning,
Cheung is in the field of machine learning. Moreover, Zhang in view of Cheung discloses:
performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model by [1) determining a plurality of permutations based on the one or more images included in the image data, and] 2) applying orthogonal super-positioning relative to the plurality of permutations, the transformation of the image data resulting in the inputs which comply with the input requirements; [providing the inputs that are generated,] based on the orthogonal super-positioning, ([Cheung, abstract and section 2]: Cheung discloses a “superposition” of models [Cheung, abstract]. The method involves a modification to “the fundamental operation performed in all neural networks – multiplying the inputs (x in R^N) by a weight matrix (W in R^{M times N}) to compute features (y = Wx)” [Cheung, section 2 first paragraph]. Given weight matrices W_1, …, W_K for each of K tasks, the modification involves multiplying each W_k by a “task-specific linear transformation C_k^{-1} (that we call as context) such that rows of each W_kC_k^{-1} occupy mutually orthogonal subspace in R^N” [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”]. Moreover, “[i]n the special case of C_k^{-1} = C_k^T, each C_k would be an orthogonal matrix representing a rotation. As matrix multiplication is associative, y_k = (WC_k)x can be rewritten as y_k = W(C_k x)… In this form, PSP can be thought of as learning a single set of parameters W for multiple tasks after the rotating the inputs (x) into orthogonal sub-spaces of R^N” [Cheung, section 2 paragraph beginning “In the special case”]. The rotations produced by this method map to the “transformation of the image data” of the claim. Since the input vectors are required to be in R^N in Cheung, the property of being a vector of dimension N maps to the “input requirements” of the claim.
Rotation maps a vector in R^N to another vector in R^N (and does not change its length), so “the transformation of the image data result[s] in the inputs which comply with the input requirements” as required by the claim. In the combination, the superposition method of training models as disclosed by Cheung is applied to training models suitable for RPM as discussed in Zhang, each of the “K tasks” of Cheung [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”] corresponding to one of the images in the answer set of Zhang [Zhang, figure 1(b)]. In other words, the method disclosed by Cheung maps to the “orthogonal super-positioning” of the claim because it is “relative to the plurality of permutations” as mapped above.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to use superposition as disclosed by Cheung for solving RPM tasks as disclosed by Zhang because superposition allows “a surprisingly large number of models [to] be effectively stored within a single parameter instance” and “each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition” [Cheung, abstract], thereby resulting in a more efficient system.
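For clarity, the examiner offers the following sketch of the mapped superposition operation. The sketch is the examiner's own illustration (all names, dimensions, and values are hypothetical), not code from Cheung or Zhang; it shows a single shared weight matrix applied after rotating an input into task-specific orthogonal subspaces, and that such a rotation leaves the input an N-dimensional vector of unchanged length.

```python
import numpy as np

# Hypothetical illustration of parameter superposition with rotational
# contexts in the style of Cheung section 2. Names and dimensions are the
# examiner's assumptions, chosen only for illustration.
rng = np.random.default_rng(0)
N, K = 8, 3  # input dimension and number of tasks

# One orthogonal context matrix C_k per task (a rotation of R^N),
# obtained from the QR factorization of a random matrix.
contexts = [np.linalg.qr(rng.standard_normal((N, N)))[0] for _ in range(K)]

# A single shared weight matrix W holds the superposition of all K task models.
W = rng.standard_normal((4, N))

x = rng.standard_normal(N)

# For task k, features are computed as y_k = W (C_k x): the input is rotated
# into a task-specific subspace before the shared weights are applied.
features = [W @ (C @ x) for C in contexts]

# Rotation preserves both the dimension and the norm of the input vector.
for C in contexts:
    assert (C @ x).shape == (N,)
    assert np.isclose(np.linalg.norm(C @ x), np.linalg.norm(x))
```

Because each C_k is orthogonal, the rotated input C_k x remains a vector in R^N of the same length, which is the sense in which the transformed inputs comply with the mapped input requirement.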
Claim 3
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images corresponds to a color alteration rule. ([Zhang, figure 1]: Zhang discloses the rules including a “Color” rule [Zhang, figure 1(c); see also, Zhang, figure 2(a, c)]. The color rule maps to the “color alteration rule” of the claim.)
The same motivation to combine applies.
Claim 4
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images corresponds to a Boolean rule. ([Zhang, figure 2 and section 3.1]: Zhang discloses the rules including a “Uniformity” attribute [Zhang, figure 2(c)] and explains that “Uniformity, set false, will not constrain Entities in a Layout to look the same” [Zhang, section 3.1 paragraph beginning “To increase”; emphasis added]. In other words, Uniformity is a “Boolean rule” as recited by the claim because it can be set to either false or true depending on whether or not Entities in a Layout are to be constrained to look the same. The examiner notes that code in the GitHub repository corresponding to Zhang corroborates this mapping, as explained in the conclusion of a previous Office action.)
The same motivation to combine applies.
Claim 5
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 1, wherein] at least one of the plurality of rules specific to the one or more images comprises an orientation specific to each of the one or more images. ([Zhang, figure 2]: Zhang discloses the rules including an “Orientation” attribute [Zhang, figure 2(c)]. This orientation attribute maps to the “orientation specific to each of the one or more images” of the claim.)
The same motivation to combine applies.
Claim 6
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 1, wherein the applying of the orthogonal super-positioning relative to the plurality of permutations comprises:] associating at least a first subset of the plurality of permutations with a first parameter; and associating at least a second subset of the plurality of permutations with a second parameter, wherein the second parameter is oriented orthogonally with respect to the first parameter. ([Cheung, section 2]: Cheung discloses defining “W_1, W_2, …, W_K [to] be the set of parameters required for each of the K tasks” and “transform[ing] each W_k using a task-specific linear transformation C_k^{-1} (that we call as context), such that rows of each W_kC_k^{-1} occupy a mutually orthogonal subspace in R^N” [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”]. See also: [Cheung, figure 1 right-hand side and caption]. As noted under the parent claim, the K tasks of Cheung correspond to the images in the answer set (i.e., the “permutations” of the claim) of Zhang [Zhang, figure 1(b)]. In other words, a weight in W_kC_k^{-1} for one value of k maps to the “first parameter” of the claim, and a weight in W_kC_k^{-1} for a different value of k maps to the “second parameter” of the claim, thereby ensuring that “the second parameter is oriented orthogonally with respect to the first parameter” as required by the claim.)
The same motivation to combine applies.
Claim 7
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 6, wherein the applying of the orthogonal super-positioning relative to the plurality of permutations further comprises] associating at least a third subset of the plurality of permutations with a third parameter. ([Zhang, figure 1; Cheung, section 2 and figure 1]: The examiner notes that the broadest reasonable interpretation of this limitation does not require the “third parameter” to be distinct from either the “first parameter” or the “second parameter” as mapped under the parent claim. Nonetheless, as explained under the parent claim above, in the combination, the K tasks of Cheung correspond to the images in the answer set (i.e., the “permutations” of the claim) of Zhang. Zhang discloses an answer set with at least three “permutations” [Zhang, figure 1(b)], and Cheung also depicts a situation with three tasks [Cheung, figure 1 right-hand side and caption], so a non-redundant mapping is also disclosed by the same combination of references.)
The same motivation to combine applies.
Claim 8
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 7, further comprising] associating at least a fourth subset of the plurality of permutations with a fourth parameter, wherein the fourth parameter is oriented orthogonally with respect to the third parameter. ([Zhang, figure 1; Cheung, section 2 and figure 1]: As explained under the parent claim, the broadest reasonable interpretation of the limitations appearing here and in the parent do not require the “third parameter” and the “fourth parameter” to be distinct from the “first parameter” and the “second parameter”, respectively. In other words, one could map the “third parameter” and “fourth parameter” to, respectively, the “first parameter” and the “second parameter” as described above, thereby ensuring that “the fourth parameter is oriented orthogonally with respect to the third parameter” as required by the claim. Nonetheless, as explained above, such a redundant mapping is not necessary because the K tasks of Cheung correspond to the images in the answer set (i.e., the “permutations” of the claim) of Zhang, and Zhang discloses an answer set having at least four “permutations” [Zhang, figure 1(b)], so a non-redundant mapping is also disclosed by the same combination of references.)
The same motivation to combine applies.
Claim 10
Zhang discloses:
A system comprising: at least one data processor; and at least one memory storing instructions, which when executed by the at least one data processor, cause operations comprising: ([Zhang, section 6]: The methods disclosed in Zhang are implemented on a computer (cf. GitHub, as explained in the conclusion of a previous Office action). A computer on which the methods disclosed therein are implemented maps to the “system” of the claim, and the processor and memory of the computer map to the “at least one data processor” and the “at least one memory” of the claim.)
generating inputs specific to a multimodal learning based machine learning model from a training dataset ([Zhang, abstract and section 6]: Zhang discloses a “new dataset, built in the context of Raven’s Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning” [Zhang, abstract]. It also discloses “adopt[ing] several representative models suitable for RPM and test[ing] their performances” [Zhang, section 6.1 first paragraph]. The dataset disclosed by Zhang is the “training dataset” of the claim, and any one of the machine learning models suitable for RPM and combining visual and structural reasoning maps to the “multimodal learning based machine learning model” of the claim.) comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, the generating comprising: ([Zhang, figures 1-2]: Zhang depicts an example RPM [Zhang, figure 1]. The images in the problem matrices [Zhang, figure 1(a)] map to the “image data” of the claim, and the “rules” [Zhang, figure 1(c) and caption] map to the “plurality of rules” of the claim. Zhang also indicates that the rules include orientation attributes [Zhang, figure 2(c)]. In other words, the rules also map to the “plurality of characterizations representative of orientations” as recited by the claim.)
1) determining a plurality of permutations based on the one or more images included in the image data, and ([Zhang, figure 1]: The images in the answer sets [Zhang, figure 1(b)] map to the “plurality of permutations” of the claim (cf. examiner’s remarks).)
providing the inputs that are generated, [based on the orthogonal super-positioning,] into the multimodal learning based machine learning model; and generating, by the multimodal learning based machine learning model, a prediction specific to at least one image of the one or more images. ([Zhang, section 6]: As noted above, Zhang discloses “adopt[ing] several representative models suitable for RPM and test[ing] their performances on RAVEN” [Zhang, section 6.1 first paragraph]. Problem matrices being input into one of these models maps to the “providing” step of the claim, and the answer chosen by the model for a given problem matrix maps to the “prediction specific to at least one image of the one or more images” of the claim.)
Zhang does not distinctly disclose:
performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model by [1) determining a plurality of permutations based on the one or more images included in the image data, and] 2) applying orthogonal super-positioning relative to the plurality of permutations, the transformation of the image data resulting in the inputs which comply with the input requirements; [providing the inputs that are generated,] based on the orthogonal super-positioning,
Cheung is in the field of machine learning. Moreover, Zhang in view of Cheung discloses:
performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model by [1) determining a plurality of permutations based on the one or more images included in the image data, and] 2) applying orthogonal super-positioning relative to the plurality of permutations, the transformation of the image data resulting in the inputs which comply with the input requirements; [providing the inputs that are generated,] based on the orthogonal super-positioning, ([Cheung, abstract and section 2]: Cheung discloses a “superposition” of models [Cheung, abstract]. The method involves a modification to “the fundamental operation performed in all neural networks – multiplying the inputs (x in R^N) by a weight matrix (W in R^{M times N}) to compute features (y = Wx)” [Cheung, section 2 first paragraph]. Given weight matrices W_1, …, W_K for each of K tasks, the modification involves multiplying each W_k by a “task-specific linear transformation C_k^{-1} (that we call as context) such that rows of each W_kC_k^{-1} occupy mutually orthogonal subspace in R^N” [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”]. Moreover, “[i]n the special case of C_k^{-1} = C_k^T, each C_k would be an orthogonal matrix representing a rotation. As matrix multiplication is associative, y_k = (WC_k)x can be rewritten as y_k = W(C_k x)… In this form, PSP can be thought of as learning a single set of parameters W for multiple tasks after the rotating the inputs (x) into orthogonal sub-spaces of R^N” [Cheung, section 2 paragraph beginning “In the special case”]. The rotations produced by this method map to the “transformation of the image data” of the claim. Since the input vectors are required to be in R^N in Cheung, the property of being a vector of dimension N maps to the “input requirements” of the claim.
Rotation maps a vector in R^N to another vector in R^N (and does not change its length), so “the transformation of the image data result[s] in the inputs which comply with the input requirements” as required by the claim. In the combination, the superposition method of training models as disclosed by Cheung is applied to training models suitable for RPM as discussed in Zhang, each of the “K tasks” of Cheung [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”] corresponding to one of the images in the answer set of Zhang [Zhang, figure 1(b)]. In other words, the method disclosed by Cheung maps to the “orthogonal super-positioning” of the claim because it is “relative to the plurality of permutations” as mapped above.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to use superposition as disclosed by Cheung for solving RPM tasks as disclosed by Zhang because superposition allows “a surprisingly large number of models [to] be effectively stored within a single parameter instance” and “each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition” [Cheung, abstract], thereby resulting in a more efficient system.
Claims 12-17 inherit limitations from claim 10 and recite additional limitations which are substantially similar to those recited by claims 3-8, respectively, so they are rejected by the same rationale.
Claim 19
Zhang discloses:
A non-transitory computer-readable storage medium comprising programming code, which when executed by at least one data processor, causes operations comprising: ([Zhang, section 6]: The methods disclosed in Zhang are implemented on a computer via Python code (cf. GitHub, as explained in the conclusion of a previous Office action). The Python code maps to the “programming code” of the claim, and the hard drive of the computer on which the methods disclosed therein are implemented maps to the “non-transitory computer-readable storage medium” of the claim. Alternatively, the server on which the GitHub repository is stored could also map to the “non-transitory computer-readable storage medium” of the claim.)
generating inputs specific to a multimodal learning based machine learning model from a training dataset ([Zhang, abstract and section 6]: Zhang discloses a “new dataset, built in the context of Raven’s Progressive Matrices (RPM) and aimed at lifting machine intelligence by associating vision with structural, relational, and analogical reasoning” [Zhang, abstract]. It also discloses “adopt[ing] several representative models suitable for RPM and test[ing] their performances” [Zhang, section 6.1 first paragraph]. The dataset disclosed by Zhang is the “training dataset” of the claim, and any one of the machine learning models suitable for RPM and combining visual and structural reasoning maps to the “multimodal learning based machine learning model” of the claim.) comprising image data, the image data comprising a plurality of rules specific to one or more images and a plurality of characterizations representative of orientations of the one or more images, the generating comprising: ([Zhang, figures 1-2]: Zhang depicts an example RPM [Zhang, figure 1]. The images in the problem matrices [Zhang, figure 1(a)] map to the “image data” of the claim, and the “rules” [Zhang, figure 1(c) and caption] map to the “plurality of rules” of the claim. Zhang also indicates that the rules include orientation attributes [Zhang, figure 2(c)]. In other words, the rules also map to the “plurality of characterizations representative of orientations” as recited by the claim.)
1) determining a plurality of permutations based on the one or more images included in the image data, and ([Zhang, figure 1]: The images in the answer sets [Zhang, figure 1(b)] map to the “plurality of permutations” of the claim (cf. examiner’s remarks).)
providing the inputs that are generated, [based on the orthogonal super-positioning,] into the multimodal learning based machine learning model; and generating, by the multimodal learning based machine learning model, a prediction specific to at least one image of the one or more images. ([Zhang, section 6]: As noted above, Zhang discloses “adopt[ing] several representative models suitable for RPM and test[ing] their performances on RAVEN” [Zhang, section 6.1 first paragraph]. Problem matrices being input into one of these models maps to the “providing” step of the claim, and the answer chosen by the model for a given problem matrix maps to the “prediction specific to at least one image of the one or more images” of the claim.)
Zhang does not distinctly disclose:
performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model by [1) determining a plurality of permutations based on the one or more images included in the image data, and] 2) applying orthogonal super-positioning relative to the plurality of permutations, the transformation of the image data resulting in the inputs which comply with the input requirements; [providing the inputs that are generated,] based on the orthogonal super-positioning,
Cheung is in the field of machine learning. Moreover, Zhang in view of Cheung discloses:
performing a transformation of the image data based on input requirements of the multimodal learning based machine learning model by [1) determining a plurality of permutations based on the one or more images included in the image data, and] 2) applying orthogonal super-positioning relative to the plurality of permutations, the transformation of the image data resulting in the inputs which comply with the input requirements; [providing the inputs that are generated,] based on the orthogonal super-positioning, ([Cheung, abstract and section 2]: Cheung discloses a “superposition” of models [Cheung, abstract]. The method involves a modification to “the fundamental operation performed in all neural networks – multiplying the inputs (x in R^N) by a weight matrix (W in R^{M times N}) to compute features (y = Wx)” [Cheung, section 2 first paragraph]. Given weight matrices W_1, …, W_K for each of K tasks, the modification involves multiplying each W_k by a “task-specific linear transformation C_k^{-1} (that we call as context) such that rows of each W_kC_k^{-1} occupy mutually orthogonal subspace in R^N” [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”]. Moreover, “[i]n the special case of C_k^{-1} = C_k^T, each C_k would be an orthogonal matrix representing a rotation. As matrix multiplication is associative, y_k = (WC_k)x can be rewritten as y_k = W(C_k x)… In this form, PSP can be thought of as learning a single set of parameters W for multiple tasks after the rotating the inputs (x) into orthogonal sub-spaces of R^N” [Cheung, section 2 paragraph beginning “In the special case”]. The rotations produced by this method map to the “transformation of the image data” of the claim. Since the input vectors are required to be in R^N in Cheung, the property of being a vector of dimension N maps to the “input requirements” of the claim.
Rotation maps a vector in R^N to another vector in R^N (and does not change its length), so “the transformation of the image data result[s] in the inputs which comply with the input requirements” as required by the claim. In the combination, the superposition method of training models as disclosed by Cheung is applied to training models suitable for RPM as discussed in Zhang, each of the “K tasks” of Cheung [Cheung, section 2 paragraph beginning “Let W_1, W_2, …, W_K”] corresponding to one of the images in the answer set of Zhang [Zhang, figure 1(b)]. In other words, the method disclosed by Cheung maps to the “orthogonal super-positioning” of the claim because it is “relative to the plurality of permutations” as mapped above.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to use superposition as disclosed by Cheung for solving RPM tasks as disclosed by Zhang because superposition allows “a surprisingly large number of models [to] be effectively stored within a single parameter instance” and “each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition” [Cheung, abstract], thereby resulting in a more efficient system.
Claim(s) 2, 9, 11, 18, and 20 is/are rejected under 35 USC 103 as being unpatentable over Zhang in view of Cheung, further in view of Alec RADFORD et al. (Learning Transferable Visual Models From Natural Language Supervision, published 2021-02-26; hereafter, “Radford”).
Claim 2
Zhang in view of Cheung discloses the elements of the parent claim(s). It does not distinctly disclose:
[The computer-implemented method of claim 1, wherein] the multimodal learning based machine learning model is a contrastive language-image pre-training model.
Radford is in the field of machine learning. Moreover, Zhang in view of Cheung and Radford discloses:
[The computer-implemented method of claim 1, wherein] the multimodal learning based machine learning model is a contrastive language-image pre-training model. ([Radford, section 1]: Radford discloses “a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training” [Radford, section 1 paragraph beginning “A crucial difference”]. In the combination, the models disclosed by Zhang in view of Cheung are taken to be CLIP models as in Radford.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine the method for solving RPM tasks disclosed by Zhang in view of Cheung with the use of CLIP as disclosed by Radford because it “is an efficient method of learning from natural language supervision” and “learns to perform a wide set of tasks during pre-training” [Radford, section 1 paragraph beginning “A crucial difference”], thereby resulting in a more efficient and effective system.
Claim 9
Zhang in view of Cheung discloses the elements of the parent claim(s). It also discloses:
[The computer-implemented method of claim 1, further comprising:] receiving a query regarding the at least one image of the one or more images, wherein the at least one image of the one or more images comprises one or more objects; ([Zhang, figure 1]: A problem matrix [Zhang, figure 1(a)] maps to a “query regarding the at least one image of the one or more images” of the claim.)
Zhang in view of Cheung might not distinctly disclose:
[and wherein the generating of the prediction specific to the at least one image of the one or more images comprises] identifying text that is representative of the one or more objects of the at least one image.
Radford is in the field of machine learning. Moreover, Zhang in view of Cheung and Radford discloses:
[and wherein the generating of the prediction specific to the at least one image of the one or more images comprises] identifying text that is representative of the one or more objects of the at least one image. ([Radford, figure 1]: Radford discloses “an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples” [Radford, figure 1 caption], depicting an example in which a picture of a dog, provided as input to the model, produces the text output “A photo of a dog” [Radford, figure 1 part (3)]. In the combination, the image input is a problem matrix [Zhang, figure 1(a)], and the text output produced by the model maps to the “text that is representative of the one or more objects” of the claim.)
Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art to combine the method for solving RPM tasks disclosed by Zhang in view of Cheung with the use of CLIP as disclosed by Radford because it “is an efficient method of learning from natural language supervision” and “learns to perform a wide set of tasks during pre-training” [Radford, section 1 paragraph beginning “A crucial difference”], thereby resulting in a more efficient and effective system.
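For clarity only, the examiner sketches the contrastive image-text pairing depicted in [Radford, figure 1]. The “encoders” below are random linear projections chosen by the examiner for illustration (all names, dimensions, and values are hypothetical) and are not Radford's actual networks; the sketch shows only how normalized image and text embeddings are paired by cosine similarity, with the predicted text for an image being the best-matching column.

```python
import numpy as np

# Hypothetical sketch of contrastive language-image pairing in the style of
# Radford figure 1. The "encoders" are stand-in random projections.
rng = np.random.default_rng(1)
batch, d_img, d_txt, d_emb = 4, 16, 12, 8

img_proj = rng.standard_normal((d_emb, d_img))  # stand-in image encoder
txt_proj = rng.standard_normal((d_emb, d_txt))  # stand-in text encoder

images = rng.standard_normal((batch, d_img))  # a batch of image features
texts = rng.standard_normal((batch, d_txt))   # the paired text features

def embed(x, proj):
    """Project into the shared space and L2-normalize, so that dot
    products between embeddings are cosine similarities."""
    z = x @ proj.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)

img_emb = embed(images, img_proj)
txt_emb = embed(texts, txt_proj)

# Pairwise cosine-similarity logits; training would push the diagonal
# (correct image-text pairs) above the off-diagonal entries.
logits = img_emb @ txt_emb.T

# At inference, the predicted text for image i is the column maximizing row i.
predictions = logits.argmax(axis=1)
```

In the mapped combination, the image input would be a problem matrix and the candidate texts would describe the objects depicted, so the selected column corresponds to the identified text.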
Claims 11, 18, and 20 inherit limitations from claims 10, 10, and 19, respectively, and recite additional limitations which are substantially similar to those recited by claims 2, 9, and 2, respectively, so they are rejected by the same rationale.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Shishir AGRAWAL whose telephone number is +1 703-756-1183. The examiner can normally be reached Monday through Thursday, 08:30-14:30 Pacific Time.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey SHMATOV, can be reached at +1 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is +1 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at +1 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call +1 800-786-9199 (IN USA OR CANADA) or +1 571-272-1000.
/S.A./Examiner, Art Unit 2123
/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123