Prosecution Insights
Last updated: April 19, 2026
Application No. 17/910,431

MULTIMODAL OBJECT CLASSIFICATION APPARATUS, METHOD, AND COMPUTER-READABLE MEDIUM

Final Rejection (§101, §103, §112)
Filed: Sep 09, 2022
Examiner: SUSSMAN MOSS, JACOB ZACHARY
Art Unit: 2122
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Rakuten Group Inc.
OA Round: 2 (Final)

Grant Probability: 14% (At Risk)
OA Rounds: 3-4
To Grant: 3y 3m
With Interview: -6%

Examiner Intelligence

Career Allow Rate: 14% (grants only 14% of cases; 1 granted / 7 resolved; -40.7% vs TC avg)
Interview Lift: -20.0% (minimal; based on resolved cases with interview)
Avg Prosecution: 3y 3m (typical timeline); 26 currently pending
Total Applications: 33 (career history, across all art units)

Statute-Specific Performance

§101: 37.3% (-2.7% vs TC avg)
§103: 35.2% (-4.8% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 15.5% (-24.5% vs TC avg)

Tech Center averages are estimates. Based on career data from 7 resolved cases.

Office Action

Rejections: §101, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to amendments filed November 17, 2025, in which claims 1, 2, 5, 7, 9, 11, 13, and 14 have been amended, claims 16-17 have been added, and claims 4, 6, 8, 12, and 15 have been cancelled. The amendments have been entered, and claims 1-3, 5, 7, 9-11, 13, 14, and 16-17 are currently pending in the case. Claims 1, 13, and 14 are independent claims.

Specification

The disclosure is objected to because of the following informalities: In ¶6, “Tiangang Zhu, et. al. " Multimodal Joint…”, the extra period after “et” in “et. al.” should be deleted. Furthermore, the extra space between the double quote and “Multimodal” should be deleted. Therefore, the citation should be corrected to “Tiangang Zhu, et al. "Multimodal Joint…”. Appropriate correction is required.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 13 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 13 recites the limitation "the fully connected network" in line 4 of the claim. There is insufficient antecedent basis for this limitation in the claim.
For examination purposes, this limitation has been interpreted as “a fully connected network”.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-3, 5, 7, 9-11, 13, 14, and 16-17 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1:

Step 1: Claim 1 is directed to an apparatus; therefore, it falls under the statutory category of a machine.

Step 2A Prong 1: The claim recites, in part:

“apply the plurality of modalities to…generate feature values for each of the plurality of modalities”: this limitation is a mathematical concept.

“apply the feature values output from the fully connected network to…derive weights corresponding to each of the plurality of modalities based on the feature values for each of the plurality of modalities and information identifying the object”: this encompasses the mental derivation of weights for modalities based on observed feature values and observed identifying information.

“predict an attribute of the object from a concatenated value of the feature values for each of the plurality of modalities, weighted by the corresponding weights”: this encompasses the mental prediction of attributes of an object from observed feature values, with that prediction weighted by observed weights. Further, this limitation is a mathematical concept.

“apply a concatenated value of the feature values for each of the plurality of modalities, weighted by the corresponding weights from the neural network, to…predict at least one attribute of the object”: this encompasses the mental prediction of attributes based on observed values. Further, this limitation is a mathematical concept.
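For orientation, the data flow recited in claim 1 and characterized above (fully connected feature generation, per-modality weight derivation, weighted concatenation, attribute prediction) can be sketched in plain Python. This is a minimal hypothetical sketch of the claimed arrangement, not the applicant's implementation; every function name and toy parameter below is invented for illustration:

```python
import math

def linear(W, x, b):
    # Dense ("fully connected") layer: y = W.x + b, one output per row of W.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_attribute(modalities, fc, gate, head):
    # 1. The fully connected network generates feature values per modality.
    feats = [linear(W, m, b) for (W, b), m in zip(fc, modalities)]
    # 2. A neural network derives one weight per modality from its features.
    weights = [sigmoid(linear(W, f, b)[0]) for (W, b), f in zip(gate, feats)]
    # 3. The weighted feature values are concatenated...
    fused = [w * v for w, f in zip(weights, feats) for v in f]
    # 4. ...and a final network predicts the attribute scores.
    W, b = head
    return linear(W, fused, b)
```

With identity feature layers and zeroed gate parameters (so each modality weight is sigmoid(0) = 0.5), two 2-dimensional modalities produce a 4-dimensional fused vector before the prediction head.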
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows:

“at least one processor configured to operate as instructed by the program code”, “acquisition code configured to cause the at least one processor to”, “feature generation code configured to cause the at least one processor to”, “weight derivation code configured to cause the at least one processor to”, “classification code configured to cause the at least one processor to”, “the fully connected network is configured to”, “the neural network is configured to”, “a deep neural network configured to”: these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

“acquire a plurality of modalities associated with an object and information identifying the object”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g).

“at least one memory configured to store program code”, “a fully connected network”, “a neural network”: these limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The claim does not contain significantly more than the judicial exception.
The limitations “at least one processor configured to operate as instructed by the program code”, “acquisition code configured to cause the at least one processor to”, “feature generation code configured to cause the at least one processor to”, “weight derivation code configured to cause the at least one processor to”, “classification code configured to cause the at least one processor to”, “the fully connected network is configured to”, “the neural network is configured to”, and “a deep neural network configured to” are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

“acquire a plurality of modalities associated with an object and information identifying the object”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional element is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d.

“at least one memory configured to store program code”, “a fully connected network”, “a neural network”: these limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Regarding claim 2, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “derive, from the plurality of feature values and the information identifying the object, attention weights that indicate an importance of each of the plurality of modalities to the attribute prediction, as the weights corresponding to each of the plurality of feature values”: this encompasses the mental derivation of weights indicating the importance of observed modalities to a prediction.
Further, this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitation of the claim is as follows: “the neural network is further configured to”: this limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitation “the neural network is further configured to” is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

Regarding claim 3, the rejection of claim 2 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “the attention weights of the plurality of modalities are 1 in total”: a continuation of the abstract idea identified in the parent claim.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.

Regarding claim 5, the rejection of claim 4 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “as input, the plurality of feature values and information identifying the object, and outputs, as the weights corresponding to each of the plurality of feature values, attention weights indicating an importance of each of the plurality of modalities to the attribute prediction.”: this encompasses the mental observation of feature values and identifying information and the outputting of weights corresponding to those observed values, further indicating the importance of observed modalities to a prediction.
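As an aside, the claim 3 property that the attention weights "are 1 in total" is exactly what a softmax normalization provides. A minimal illustrative sketch (the function name is hypothetical, not from the record):

```python
import math

def attention_weights(scores):
    # Softmax: each weight is positive and the weights sum to exactly 1,
    # matching the claim 3 recitation that the attention weights
    # of the plurality of modalities are 1 in total.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Any real-valued per-modality scores can be fed in; the modality with the largest score receives the largest weight, and the weights always sum to 1.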
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitation of the claim is as follows: “the neural network takes”: this limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitation “the neural network takes” is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Regarding claim 7, the rejection of claim 4 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “as input, the plurality of modalities and outputs the feature values of the plurality of modalities by mapping them to a latent space common to the plurality of modalities”: this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitation of the claim is as follows: “the fully connected network takes”: this limitation is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitation “the fully connected network takes” is an additional element that generally links the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Regarding claim 9, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “encode the plurality of modalities to acquire a plurality of encoded modalities”: this limitation is a mathematical concept.
“generate the feature values for each of the plurality of encoded modalities”: this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “the acquisition code is further configured to cause the at least one processor to”, “the feature generation fully connected network is further configured to”: these limitations are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitations “the acquisition code is further configured to cause the at least one processor to” and “the feature generation fully connected network is further configured to” are additional elements that generally link the use of the judicial exception to a particular technological environment or field of use. See MPEP § 2106.05(h).

Regarding claim 10, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “the object is a commodity, and the plurality of modalities includes two or more of data of an image representing the commodity, data of text describing the commodity, and data of sound describing the commodity”: a continuation of the abstract idea identified in the parent claim.

Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.

Regarding claim 11, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “the attribute of the object includes color information of a product”: a continuation of the abstract idea identified in the parent claim.
Step 2A Prong 2: The claim does not recite any additional limitations, and thus does not recite any additional elements that integrate the judicial exception into a practical application or amount to significantly more.

Regarding claim 13:

Step 1: Claim 13 is directed to a method; therefore, it falls under the statutory category of a process.

Step 2A Prong 1: The claim recites, in part:

“applying the feature values output from the fully connected network to…generate feature values for each of the plurality of modalities”: this limitation is a mathematical concept.

“applying the feature values output from the fully connected network to…derive weights corresponding to each of the plurality of feature values”: this encompasses the mental derivation of weights for modalities based on observed feature values and observed identifying information.

“applying a concatenated value of the feature values for each of the plurality of modalities, weighted by the corresponding weights from the neural network, to…predict at least one attribute of the object”: this encompasses the mental prediction of attributes based on observed values. Further, this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “a neural network configured to”, “a neural network configured to”, “a deep neural network configured to”: these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “acquiring a plurality of modalities associated with an object and information identifying the object”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g).
Step 2B: The claim does not contain significantly more than the judicial exception. The limitations “a neural network configured to”, “a neural network configured to”, and “a deep neural network configured to” are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “acquiring a plurality of modalities associated with an object and information identifying the object”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional element is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d.

Regarding claim 14:

Step 1: Claim 14 is directed to a non-transitory computer-readable storage medium; therefore, it falls under the statutory category of a manufacture.

Step 2A Prong 1: The claim recites, in part:

“applying the feature values output from the fully connected network to…generate feature values for each of the plurality of modalities”: this limitation is a mathematical concept.

“applying the feature values output from the fully connected network to…derive weights corresponding to each of the plurality of feature values”: this encompasses the mental derivation of weights for modalities based on observed feature values and observed identifying information.

“applying a concatenated value of the feature values for each of the plurality of modalities, weighted by the corresponding weights from the neural network, to…predict at least one attribute of the object”: this encompasses the mental prediction of attributes based on observed values. Further, this limitation is a mathematical concept.
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “a neural network configured to”, “a neural network configured to”, “a deep neural network configured to”: these limitations are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “acquiring a plurality of modalities associated with an object and information identifying the object”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitations “a neural network configured to”, “a neural network configured to”, and “a deep neural network configured to” are additional elements that amount to adding the words “apply it” (or an equivalent) with the judicial exception, or merely use a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “acquiring a plurality of modalities associated with an object and information identifying the object”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional element is directed to receiving or transmitting data over a network, e.g., using the Internet to gather data, Symantec, 838 F.3d at 1321, 120 USPQ2d.
Regarding claim 16, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: The claim recites, in part: “derive the weights according to the formula a_j = σ(W[f_θ(h_i^j), f_θ(h_t^j)] + b), where a_j is an attention weight for the object j, h_i^j is image data, h_t^j is text data, σ is an activation function, f_θ(h_i^j) is a feature value with respect to the image data, f_θ(h_t^j) is a feature value with respect to the text data, W[f_θ(h_i^j), f_θ(h_t^j)] is a concatenated value in which a weight coefficient W is applied to the feature values f_θ(h_i^j) and f_θ(h_t^j), and b is a bias value”: this limitation is a mathematical concept.

Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitation of the claim is as follows: “the neural network is configured to”: this limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitation “the neural network is configured to” is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2).

Regarding claim 17, the rejection of claim 1 is incorporated and further:

Step 2A Prong 1: a continuation of the abstract idea identified in the parent claim.
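Numerically, the claim 16 formula is a single gate over concatenated image and text features. A hedged sketch, assuming for illustration that σ is the logistic sigmoid (the claim only recites "an activation function") and that W is a row of per-element weight coefficients:

```python
import math

def claim16_style_weight(f_img, f_txt, W, b):
    # a_j = sigma(W[f_theta(h_i^j), f_theta(h_t^j)] + b):
    # concatenate the two feature vectors, apply the weight coefficients W
    # and bias b, then squash with the activation function sigma.
    concat = list(f_img) + list(f_txt)          # [f_img, f_txt] concatenation
    z = sum(w * v for w, v in zip(W, concat)) + b
    return 1.0 / (1.0 + math.exp(-z))           # sigma assumed to be a sigmoid
```

With a sigmoid activation, the derived weight always lies strictly between 0 and 1, which is consistent with its use as a per-modality attention weight.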
Step 2A Prong 2: The judicial exception is not integrated into a practical application; the remaining limitations of the claim are as follows: “output code configured to cause at least one of the at least one processor to”: this limitation is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “output the predicted at least one attribute”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g).

Step 2B: The claim does not contain significantly more than the judicial exception. The limitation “output code configured to cause at least one of the at least one processor to” is an additional element that amounts to adding the words “apply it” (or an equivalent) with the judicial exception, or merely uses a computer in its ordinary capacity as a tool to perform an existing process. See MPEP § 2106.05(f)(2). “output the predicted at least one attribute”: this limitation is an additional element that amounts to adding insignificant extra-solution activity to the judicial exception. See MPEP § 2106.05(g). Furthermore, the additional element is directed to storing and retrieving information in memory, Versata Dev. Group, Inc. v. SAP Am., Inc., 793 F.3d 1306, 1334, 115 USPQ2d 1681, 1701 (Fed. Cir. 2015). See MPEP § 2106.05(d)(II).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C.
103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 7, 9, 13, 14, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Nesta et al. (US20190354797A1) (as cited in the IDS, hereinafter “Nesta”) in view of Tzirakis et al. ("End-to-End Multimodal Emotion Recognition using Deep Neural Networks", 27 April 2017) (hereinafter "Tzirakis").

Regarding claim 1:

Nesta teaches [a]n information processing apparatus comprising: at least one memory configured to store program code (Nesta, claim 16 “a memory storing instructions;”); and at least one processor configured to operate as instructed by the program code (Nesta, claim 16 “a processor coupled to the memory and configured to execute the instructions to cause the system to perform operations comprising:”), the program code comprising: acquisition code configured to cause the at least one processor to acquire a plurality of modalities associated with an object and information identifying the object (Nesta, ¶1 “The system may use a variety of input modalities including images, video and/or audio, and the expert modules may comprise a corresponding image expert, a corresponding video expert and/or a corresponding audio expert.” In light of ¶26 of the specification, modalities associated with an object include images “An example of a plurality of modalities, which are information associated with an object, is image data showing an image of a
product (hereinafter simply referred to as image data) and text data describing the product (hereinafter simply referred to as text data)”); feature generation code configured to cause the at least one processor to apply the plurality of modalities to a fully connected network (Nesta, ¶29 “A preprocessing network (such as networks 223 and 224) may be used for a first feature transformation (e.g. by using an Inception V3 network or a VGG16 network), and the output layer is then fed to fully connected layers 225 to produce a logistic classification 226.”), wherein the fully connected network is configured to generate feature values for each of the plurality of modalities (Nesta, ¶12 “The system may be further configured to accept a variety of input modalities including images, video and/or audio, and the operations performed by the processor may further include extracting features associated with each input modality by a process that includes inputting the corresponding data stream to a trained neural network” here, the features extracted associated with each input modality can be considered the generated feature for a plurality of modalities); weight derivation code configured to cause the at least one processor to apply the feature values output from the fully connected network to a neural network, wherein the neural network is configured to derive weights corresponding to each of the plurality of modalities based on the feature values for each of the plurality of modalities and information identifying the object (Nesta, ¶7 “A gate expert is configured to receive the extracted features from the plurality of expert modules and output a set of weights for each of the input modalities.” Here, the weights output for each input modalities from extracted features can be considered the derived weights corresponding to the features); and Nesta does not teach "classification code configured to cause the at least one processor to apply a concatenated value of the feature 
values for each of the plurality of modalities, weighted by the corresponding weights from the neural network, to a deep neural network configured to predict at least one attribute of the object" However, Tzirakis teaches classification code configured to cause the at least one processor to apply a concatenated value of the feature values for each of the plurality of modalities (Tzirakis, page 2, col 2, ¶3 “The first model’s predictions are concatenated with the original feature vector and fed to the second regression model for the final prediction.”), weighted by the corresponding weights from the neural network, to a deep neural network (Tzirakis, page 4, col 1, section D, ¶4 “After training the visual and speech networks the LSTM layers are discarded and only the extracted features are considered. The speech network extracts 1280 features while the visual network 640 features. These are concatenated to form a 1920 dimensional feature vector and used to feed a 2-layer LSTM with 256 cells each.”) configured to predict at least one attribute of the object (Tzirakis, page 4, col 1-2, section D, ¶4 “The goal for each unimodal and the multimodal network is to minimize: L_c = (L_c^a + L_c^v) / 2, where L_c^a and L_c^v are the concordance of the arousal and valence, respectively.” Here, the arousal and valence can be considered the predicted attributes. Further, Tzirakis, page 7, col 1, ¶1 “Finally, to further demonstrate the benefits of our model for automatic prediction of arousal and valence Figure 3 illustrates results for single test subject from RECOLA.”). Nesta and Tzirakis are analogous art because both references concern methods for multimodal networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta’s multimodal network to incorporate the weighted concatenated features fed into a deep neural network taught by Tzirakis.
The motivation for doing so would have been to outperform traditional networks as stated in Tzirakis, page 1, abstract: “The system is then trained in an end-to-end fashion where—by also taking advantage of the correlations of each of the streams—we manage to significantly outperform the traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.”

Regarding claim 7:

Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1, wherein the fully connected network takes (Nesta, ¶29 “A preprocessing network (such as networks 223 and 224) may be used for a first feature transformation (e.g. by using an Inception V3 network or a VGG16 network), and the output layer is then fed to fully connected layers 225 to produce a logistic classification 226.”), as input, the plurality of modalities and outputs the feature values of the plurality of modalities by mapping them to a latent space common to the plurality of modalities (Nesta, ¶6 “In another embodiment a co-learning framework is defined to encourage co-adaptation of a subset of latent variables belonging to an expert network related to different modalities.” Here, the co-learning framework can be considered the first machine learning model. A co-adaption of a subset of latent variables can be considered a common latent space as they share common latent variables. The adaption of these variables can be considered a mapping).
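The Tzirakis fusion step relied on for the independent claims is easy to check dimensionally: 1280 speech features concatenated with 640 visual features yield the 1920-dimensional vector fed to the 2-layer LSTM, and the joint loss averages the arousal and valence concordance losses. A toy check (placeholder zero vectors stand in for real features):

```python
# Dimensions as reported in Tzirakis, section D.
speech_features = [0.0] * 1280   # extracted by the speech network
visual_features = [0.0] * 640    # extracted by the visual network

# Concatenation (not element-wise addition) produces the fused vector.
fused = speech_features + visual_features
assert len(fused) == 1920        # dimension fed to the 2-layer LSTM

def joint_loss(loss_arousal, loss_valence):
    # L_c = (L_c^a + L_c^v) / 2: the average of the arousal and
    # valence concordance losses (Tzirakis, section D).
    return (loss_arousal + loss_valence) / 2
```

The 1920 total also confirms that the quoted 1280 and 640 figures refer to per-stream feature counts, not shared dimensions.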
Regarding claim 9: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1, wherein the acquisition code is further configured to cause the at least one processor to encode the plurality of modalities to acquire a plurality of encoded modalities (Nesta, ¶24 “The feature extraction may be obtained, for example, by using the encoding part of a neural network trained on a classification task related to the specific modality, using an autoencoder (AE) compression scheme, and/or other feature extraction process” here, using the encoding part of a neural network related to the modality can be considered the acquisition of encoded modalities and the features are extracted from those encoded modalities), and the fully connected network is further configured to generate the feature values for each of the plurality of encoded modalities (Nesta, ¶29 “A preprocessing network (such as networks 223 and 224) may be used for a first feature transformation (e.g. by using an Inception V3 network or a VGG16 network), and the output layer is then fed to fully connected layers 225 to produce a logistic classification 226.”).
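The claim 9 flow mapped above (encode each modality first, then generate feature values from the encoded modalities) can be sketched abstractly. The encoder here is a deliberately simple stand-in for Nesta's autoencoder-style compression; every name and formula below is hypothetical:

```python
def encode(modality, levels=16):
    # Stand-in encoder: quantizes raw values onto a small grid, loosely
    # analogous to the encoding/compression part of an autoencoder.
    return [round(v * levels) / levels for v in modality]

def generate_features(encoded_modalities):
    # Stand-in feature generation over the encoded inputs: a fixed
    # two-number summary (mean and range) per encoded modality.
    return [[sum(enc) / len(enc), max(enc) - min(enc)]
            for enc in encoded_modalities]

# Encode first, then generate features from the encoded modalities.
encoded = [encode(m) for m in [[0.12, 0.48], [0.91, 0.33]]]
features = generate_features(encoded)
```

The point is only the ordering of the two steps: feature values are produced from the encoded modalities, not from the raw inputs.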
Regarding claim 13: Nesta teaches [a]n information processing method comprising: acquiring a plurality of modalities associated with an object and information identifying the object (Nesta, ¶1 “The system may use a variety of input modalities including images, video and/or audio, and the expert modules may comprise a corresponding image expert, a corresponding video expert and/or a corresponding audio expert.” In light of ¶26 of the specification, modalities associated with an object include images: “An example of a plurality of modalities, which are information associated with an object, is image data showing an image of a product (hereinafter simply referred to as image data) and text data describing the product (hereinafter simply referred to as text data)”);

applying the plurality of modalities to a fully connected network (Nesta, ¶29 “A preprocessing network (such as networks 223 and 224) may be used for a first feature transformation (e.g. by using an Inception V3 network or a VGG16 network), and the output layer is then fed to fully connected layers 225 to produce a logistic classification 226.”) configured to generate feature values for each of the plurality of modalities (Nesta, ¶12 “The system may be further configured to accept a variety of input modalities including images, video and/or audio, and the operations performed by the processor may further include extracting features associated with each input modality by a process that includes inputting the corresponding data stream to a trained neural network”. Here, the features extracted for each input modality can be considered the generated feature values for the plurality of modalities);

applying the feature values output from the fully connected network to a neural network configured to derive weights corresponding to each of the plurality of feature values (Nesta, ¶7 “A gate expert is configured to receive the extracted features from the plurality of expert modules and output a set of weights for each of the input modalities.” Here, the weights output for each input modality from the extracted features can be considered the derived weights corresponding to the feature values).

Nesta does not teach "applying a concatenated value of the feature values for each of the plurality of modalities, weighted by the corresponding weights from the neural network, to a deep neural network configured to predict at least one attribute of the object."

However, Tzirakis teaches applying a concatenated value of the feature values for each of the plurality of modalities (Tzirakis, page 2, col 2, ¶3 “The first model’s predictions are concatenated with the original feature vector and fed to the second regression model for the final prediction.”), weighted by the corresponding weights from the neural network, to a deep neural network (Tzirakis, page 4, col 1, section D, ¶4 “After training the visual and speech networks the LSTM layers are discarded and only the extracted features are considered. The speech network extracts 1280 features while the visual network 640 features. These are concatenated to form a 1920 dimensional feature vector and used to feed a 2-layer LSTM with 256 cells each.”) configured to predict at least one attribute of the object (Tzirakis, page 4, col 1-2, section D, ¶4 “The goal for each unimodal and the multimodal network is to minimize: L_c = (L_c^a + L_c^v)/2, where L_c^a and L_c^v are the concordance of the arousal and valence, respectively.” Here the arousal and valence can be considered the predicted attributes.).

Nesta and Tzirakis are analogous art because both references concern methods for multimodal networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta’s multimodal network to incorporate the weighted concatenated features fed into a deep neural network as taught by Tzirakis.
The motivation for doing so would have been to outperform traditional networks, as stated in Tzirakis, page 1, abstract: “The system is then trained in an end-to-end fashion where – by also taking advantage of the correlations of each of the streams – we manage to significantly outperform the traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.”

Regarding claim 14: Nesta teaches [a] non-transitory computer-readable storage medium storing computer executable instructions for causing a computer to implement an information processing method (Nesta, ¶41 “Software, in accordance with the present disclosure, such as program code and/or data, may be stored on one or more computer readable mediums.”), the information processing method comprising: acquiring a plurality of modalities associated with an object and information identifying the object (Nesta, ¶1 “The system may use a variety of input modalities including images, video and/or audio, and the expert modules may comprise a corresponding image expert, a corresponding video expert and/or a corresponding audio expert.” In light of ¶26 of the specification, modalities associated with an object include images: “An example of a plurality of modalities, which are information associated with an object, is image data showing an image of a product (hereinafter simply referred to as image data) and text data describing the product (hereinafter simply referred to as text data)”);

applying the plurality of modalities to a fully connected network (Nesta, ¶29 “A preprocessing network (such as networks 223 and 224) may be used for a first feature transformation (e.g. by using an Inception V3 network or a VGG16 network), and the output layer is then fed to fully connected layers 225 to produce a logistic classification 226.”) configured to generate feature values for each of the plurality of modalities (Nesta, ¶12 “The system may be further configured to accept a variety of input modalities including images, video and/or audio, and the operations performed by the processor may further include extracting features associated with each input modality by a process that includes inputting the corresponding data stream to a trained neural network”. Here, the features extracted for each input modality can be considered the generated feature values for the plurality of modalities);

applying the feature values output from the fully connected network to a neural network configured to derive weights corresponding to each of the plurality of modalities based on the feature values for each of the plurality of modalities and information identifying the object (Nesta, ¶7 “A gate expert is configured to receive the extracted features from the plurality of expert modules and output a set of weights for each of the input modalities.” Here, the weights output for each input modality from the extracted features can be considered the derived weights corresponding to the feature values).

Nesta does not teach "apply a concatenated value of the feature values for each of the plurality of modalities, weighted by the corresponding weights from the neural network, to a deep neural network configured to predict at least one attribute of the object."

However, Tzirakis teaches apply a concatenated value of the feature values for each of the plurality of modalities (Tzirakis, page 2, col 2, ¶3 “The first model’s predictions are concatenated with the original feature vector and fed to the second regression model for the final prediction.”), weighted by the corresponding weights from the neural network, to a deep neural network (Tzirakis, page 4, col 1, section D, ¶4 “After training the visual and speech networks the LSTM layers are discarded and only the extracted features are considered. The speech network extracts 1280 features while the visual network 640 features. These are concatenated to form a 1920 dimensional feature vector and used to feed a 2-layer LSTM with 256 cells each.”) configured to predict at least one attribute of the object (Tzirakis, page 4, col 1-2, section D, ¶4 “The goal for each unimodal and the multimodal network is to minimize: L_c = (L_c^a + L_c^v)/2, where L_c^a and L_c^v are the concordance of the arousal and valence, respectively.” Here the arousal and valence can be considered the predicted attributes.).

Nesta and Tzirakis are analogous art because both references concern methods for multimodal networks. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta’s multimodal network to incorporate the weighted concatenated features fed into a deep neural network as taught by Tzirakis. The motivation for doing so would have been to outperform traditional networks, as stated in Tzirakis, page 1, abstract: “The system is then trained in an end-to-end fashion where – by also taking advantage of the correlations of each of the streams – we manage to significantly outperform the traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.”
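For orientation, the data flow at issue in claims 13 and 14 (a fully connected network generating per-modality feature values, a gating network deriving per-modality weights, and a deep neural network predicting an attribute from the weighted concatenation) can be sketched as follows. This is a minimal illustrative sketch only; all dimensions, layer counts, and variable names are editorial assumptions, not taken from the claims or from Nesta or Tzirakis.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Illustrative dimensions (assumptions): two modalities (image, text),
# 8-dim inputs, 4-dim per-modality features, 3 attribute logits.
D_IN, D_FEAT, D_OUT = 8, 4, 3

# "Fully connected network": one affine + ReLU layer per modality.
W_img, b_img = rng.standard_normal((D_FEAT, D_IN)), np.zeros(D_FEAT)
W_txt, b_txt = rng.standard_normal((D_FEAT, D_IN)), np.zeros(D_FEAT)

# Gating network (cf. Nesta's "gate expert"): one weight per modality.
W_gate = rng.standard_normal((2, 2 * D_FEAT))

# Prediction network over the weighted concatenation (single layer here).
W_pred = rng.standard_normal((D_OUT, 2 * D_FEAT))

def predict(image_vec, text_vec):
    # 1) per-modality feature values from the fully connected network
    f_img = relu(W_img @ image_vec + b_img)
    f_txt = relu(W_txt @ text_vec + b_txt)
    concat = np.concatenate([f_img, f_txt])
    # 2) per-modality weights derived from the feature values
    a = softmax(W_gate @ concat)
    # 3) weighted concatenation fed to the prediction network
    weighted = np.concatenate([a[0] * f_img, a[1] * f_txt])
    return a, W_pred @ weighted

weights, logits = predict(rng.standard_normal(D_IN), rng.standard_normal(D_IN))
```

In this sketch the gating step plays the role the rejection assigns to Nesta's gate expert, and the weighted concatenation fed onward plays the role assigned to Tzirakis's fused feature vector; the single-layer predictor stands in for a deeper network.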
Regarding claim 17: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1, further comprising output code configured to cause at least one of the at least one processor to output the predicted at least one attribute (Tzirakis, page 7, col 1, ¶1 “Finally, to further demonstrate the benefits of our model for automatic prediction of arousal and valence Figure 3 illustrates results for single test subject from RECOLA.” Here, the predicted results can be considered the outputted predicted attributes). It would have been obvious to combine the teachings of Nesta and Tzirakis for the reasons set forth in connection with claim 1 above.

Claims 2, 3, and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Nesta in view of Tzirakis in further view of Gao et al. ("Attention driven multi-modal similarity learning", March 2018) (hereinafter “Gao”).

Regarding claim 2: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1. Nesta in view of Tzirakis does not teach “wherein the neural network is further configured to derive, from the plurality of feature values and the information identifying the object, attention weights that indicate an importance of each of the plurality of modalities to the attribute prediction, as the weights corresponding to each of the plurality of feature values”.

However, Gao teaches wherein the neural network is further configured to derive, from the plurality of feature values and the information identifying the object, attention weights that indicate an importance of each of the plurality of modalities to the attribute prediction, as the weights corresponding to each of the plurality of feature values (Gao, page 5, section 3.1.2, ¶3 “To identify the salient features that contribute more to the similarity score under a relation modality, we add dynamic attention weights to Eq. (4).” Here, the features that contribute more to the similarity score can be considered an indication of the importance of each feature.).

Nesta in view of Tzirakis and Gao are analogous art because both references concern multimodal learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta/Tzirakis’s recurrent multimodal system to incorporate the attention as a measure of importance taught by Gao. The motivation for doing so would have been to improve the accuracy of similarity learning, as stated in Gao, page 2, ¶2: “we propose an interaction-oriented attention mechanism to improve the accuracy of similarity learning, and meanwhile show that the attention weights returned by the mechanism are able to improve the model interpretability”.

Regarding claim 3: Nesta in view of Tzirakis in further view of Gao teaches [t]he information processing apparatus according to Claim 2, wherein the attention weights of the plurality of modalities are 1 in total (Gao, pages 5-6, section 3.1.2, ¶3 “In Eqs. (6) and (7), the softmax function is used to generate positive attention weights that sum to 1”). It would have been obvious to combine the teachings of Nesta in view of Tzirakis and Gao for the reasons set forth in connection with claim 2 above.
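The sum-to-1 property cited from Gao for claim 3 follows directly from the softmax construction; a minimal sketch (the per-modality scores below are arbitrary illustrative values, not from the references):

```python
import math

def softmax(scores):
    """Positive attention weights that sum to 1 (cf. Gao, Eqs. (6) and (7))."""
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary scores for two modalities, e.g. image vs. text (illustrative only).
weights = softmax([2.0, 0.5])
print(round(sum(weights), 10))  # prints 1.0
```

Because each weight is a positive exponential divided by the sum of all exponentials, the outputs are strictly positive and total exactly 1, which is the property the rejection maps onto the "1 in total" limitation.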
Regarding claim 5: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1. Nesta in view of Tzirakis does not teach “the neural network takes, as input, the plurality of feature values and information identifying the object, and outputs, as the weights corresponding to each of the plurality of feature values, attention weights indicating an importance of each of the plurality of modalities to the attribute prediction”.

However, Gao teaches the neural network takes, as input, the plurality of feature values and information identifying the object, and outputs, as the weights corresponding to each of the plurality of feature values, attention weights indicating an importance of each of the plurality of modalities to the attribute prediction (Gao, page 5, section 3.1.2, ¶3 “To identify the salient features that contribute more to the similarity score under a relation modality, we add dynamic attention weights to Eq. (4).” Here, the features that contribute more to the similarity score can be considered an indication of the importance of each feature.).

Nesta in view of Tzirakis and Gao are analogous art because both references concern multimodal learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta/Tzirakis’s recurrent multimodal system to incorporate the attention as a measure of importance taught by Gao. The motivation for doing so would have been to improve the accuracy of similarity learning, as stated in Gao, page 2, ¶2: “we propose an interaction-oriented attention mechanism to improve the accuracy of similarity learning, and meanwhile show that the attention weights returned by the mechanism are able to improve the model interpretability”.

Claims 10, 11 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Nesta in view of Tzirakis in further view of Zhu et al.
("Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product", 15 Sep 2020) (as cited in the IDS, hereinafter “Zhu”).

Regarding claim 10: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1. Nesta in view of Tzirakis does not teach “wherein the object is a commodity, and the plurality of modalities includes two or more of data of an image representing the commodity, data of text describing the commodity”.

However, Zhu teaches wherein the object is a commodity, and the plurality of modalities includes two or more of data of an image representing the commodity, data of text describing the commodity (Zhu, page 2, col 1, ¶2 “Furthermore, beyond the textual product descriptions, product images can provide additional clues for the attribute prediction and value extraction tasks.”), and data of sound describing the commodity. It is noted the claim recites alternative language, and Zhu teaches at least one of the alternatives.

Nesta in view of Tzirakis and Zhu are analogous art because both references concern multimodal learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta/Tzirakis’s recurrent multimodal system to incorporate the image and text data as taught by Zhu. The motivation for doing so would have been to more accurately extract attribute values, as stated in Zhu, page 2, col 1, ¶1: “Given a textual product description, we can extract attribute values more accurately with a known product attribute”.

Regarding claim 11: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1. Nesta in view of Tzirakis does not teach “wherein the attribute of the object includes color information of a product”.

However, Zhu teaches wherein the attribute of the object includes color information of a product (Zhu, page 5, col 1, ¶1 “Finally, we obtained 87,194 text-image instances consisting of the following categories of products: Clothes, Pants, Dresses, Shoes, Boots, Luggage, and Bags, and involving 26 types of product attributes such as “Material”, “Collar Type”, “Color”, etc.”).

Nesta in view of Tzirakis and Zhu are analogous art because both references concern multimodal learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta/Tzirakis’s recurrent multimodal system to incorporate the color attributes as taught by Zhu. The motivation for doing so would have been to add a crucial indication for attribute values, as stated in Zhu, page 4, col 1, section 2.6, ¶2: “We argue that the product attributes can provide crucial indications for the attribute values. For example, given a sentence “The red collar and golden buttons in the shirt form a colorful fashion topic” and the predicted product attribute “Color”, it is easy to recognize the value “golden” corresponding to attribute “Color” instead of “Material”. Thus, we incorporate the result of the product attribute prediction ŷ^a to improve the value extraction.”

Regarding claim 16: Nesta in view of Tzirakis teaches [t]he information processing apparatus according to Claim 1. Nesta in view of Tzirakis does not teach "wherein the neural network is configured to derive the weights according to the formula a^j = σ(W[f_θ(h_i^j), f_θ(h_t^j)] + b), where a^j is an attention weight for the object j, h_i^j is image data, h_t^j is text data, σ is an activation function, f_θ(h_i^j) is a feature value with respect to the image data, f_θ(h_t^j) is a feature value with respect to the text data, W[f_θ(h_i^j), f_θ(h_t^j)] is a concatenated value in which a weight coefficient W is applied to the feature values f_θ(h_i^j) and f_θ(h_t^j), and b is a bias value".

However, Zhu teaches wherein the neural network is configured to derive the weights according to the formula a^j = σ(W[f_θ(h_i^j), f_θ(h_t^j)] + b), with the terms defined as above (Zhu, page 4, col 2, section 2.4, ¶3 “Specifically, we feed the text and image representations hi and vk into the global-gated cross modality attention layer…” and Zhu, page 4-5, col 2-1, section 2.4, ¶4 “The global visual gate g_i^G is determined by the representation of the sentence and the image, which are obtained by the text encoder and the image encoder, respectively, as follows: [equation image omitted] where W1 and W2 are weight matrices.” Here, the global visual gate can be considered equivalent to the formula given, wherein the weight matrices have been distributed; h_i and v^G are the text and image data, respectively).

Nesta in view of Tzirakis and Zhu are analogous art because both references concern multimodal learning. Accordingly, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Nesta/Tzirakis’s recurrent multimodal system to incorporate the global visual gate as taught by Zhu. The motivation for doing so would have been to benefit the attribute prediction task with visually grounded semantics, as stated in Zhu, page 2, col 2, ¶1: “First, we selectively enhance the semantic representation of the textual product descriptions with a global gated cross-modality attention module that is anticipated to benefit attribute prediction task with visually grounded semantics.”

Response to Arguments

Applicant's arguments filed November 17th, 2025 (hereinafter “Remarks”) have been fully considered but they are not persuasive. Applicant’s amendments have overcome the claim and drawing objections and the prior art rejections of the previous office action. Applicant’s amendments to the claims no longer necessitate interpretation under 35 U.S.C. 112(f). Applicant’s arguments regarding the 35 U.S.C. 112(b) rejections of the previous office action have been fully considered. Those rejections have been withdrawn in view of the claim amendments; however, the amendments have necessitated additional indefiniteness rejections in this action.

Regarding the rejection under 35 U.S.C. § 101:

Argument 1: “A person of ordinary skill in the art would readily recognize that claim 1 provides a specific, technically implemented solution to a technological problem associated with product searching and attribute identification on Electronic Commerce (EC) websites.” (Remarks, page 13, ¶5).

Examiner's Response: Examiner respectfully disagrees. The MPEP states “it is important to keep in mind that an improvement in the abstract idea itself (e.g.
a recited fundamental economic concept) is not an improvement in technology.” See MPEP § 2106.05(a)(II). Here, the improvement to product searching and identification is an improvement to a mental process, which can be practically performed in the human mind.

Argument 2: “[T]he Specification describes - and the claims embody - a technically implemented improvement in the way product attributes are predicted. By enabling the system to assign different levels of importance to different modalities for different products, attribute prediction becomes significantly more accurate. For instance, color attributes of a smartphone are more accurately predicted when the image modality is weighted more heavily than the text modality because text descriptions often focus on specifications rather than color.” (Remarks, page 16, ¶2).

Examiner's Response: Examiner respectfully disagrees. The MPEP states “[a]n inventive concept "cannot be furnished by the unpatentable law of nature (or natural phenomenon or abstract idea) itself." Genetic Techs. v. Merial LLC, 818 F.3d 1369, 1376, 118 USPQ2d 1541, 1546 (Fed. Cir. 2016).” See MPEP § 2106.05(I). Furthermore, “[i]t is important to note, the judicial exception alone cannot provide the improvement.” See MPEP § 2106.05(a). Here, the improvement to the way attributes are predicted, through importance scores, is an improvement in an abstract idea. A person could observe products and predict various attributes by assigning importance to various modalities.

Argument 3: “As emphasized by Deputy Commissioner Charles Kim, Examiners are cautioned "not to oversimplify claim limitations and expand the application of the 'apply it' consideration" in determining whether a claim is integrated into a practical application, and that improvements in technical fields must be considered in determining patent eligibility.... Likewise, in Ex Parte Desjardins, the Appeals Review Panel (including Director Squires) reiterated the importance of adhering to the precedent under Enfish in determining patent eligibility in vacating the § 101 rejection - specifically, that claims directed to improvements in technical fields are patent-eligible.” (Remarks, page 16, ¶3).

Examiner's Response: Examiner respectfully disagrees. The MPEP states “[i]t should be noted that while this consideration is often referred to in an abbreviated manner as the "improvements consideration," the word "improvements" in the context of this consideration is limited to improvements to the functioning of a computer or any other technology/technical field, whether in Step 2A Prong Two or in Step 2B.” See MPEP § 2106.04(d)(1). The claimed improvements are not improvements to the functioning of a computer or to any other technology or technical field. As the MPEP states, “[t]he courts consider a mental process (thinking) that "can be performed in the human mind, or by a human using a pen and paper" to be an abstract idea. CyberSource Corp. v. Retail Decisions, Inc., 654 F.3d 1366, 1372, 99 USPQ2d 1690, 1695 (Fed. Cir. 2011).” See MPEP § 2106.04(a)(2)(III). A person could observe products and predict various attributes by assigning importance to various modalities.

Regarding the rejection under 35 U.S.C. § 103:

Argument 4: “[N]one of the cited references teaches or suggests the claimed three-network architecture - namely, a fully connected network for feature generation, a neural network for modality-specific weight derivation, and a deep neural network for attribute prediction. The cited art neither describes these networks individually nor provides any indication that they should be combined in the particular manner recited in claim 1 to perform the claimed sequence of operations.” (Remarks, page 19, ¶2).

Examiner's Response: Applicant's arguments with respect to the prior art rejections have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACOB Z SUSSMAN MOSS, whose telephone number is (571) 272-1579. The examiner can normally be reached Monday - Friday, 9 a.m. - 5 p.m. ET. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kakali Chaki, can be reached at (571) 272-3719.
The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /J.S.M./Examiner, Art Unit 2122 /KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122

Prosecution Timeline

Sep 09, 2022
Application Filed
Jul 09, 2025
Non-Final Rejection — §101, §103, §112
Oct 23, 2025
Examiner Interview Summary
Oct 23, 2025
Applicant Interview (Telephonic)
Nov 17, 2025
Response Filed
Feb 20, 2026
Final Rejection — §101, §103, §112 (current)


Prosecution Projections

3-4
Expected OA Rounds
14%
Grant Probability
-6%
With Interview (-20.0%)
3y 3m
Median Time to Grant
Moderate
PTA Risk
Based on 7 resolved cases by this examiner. Grant probability derived from career allow rate.
