Last updated: May 29, 2026
Application No. 17/817,373
Low-latency Captioning System

Final Rejection §101 §103
Filed: Aug 04, 2022
Examiner: MAHARAJ, DEVIKA S
Art Unit: 2123
Tech Center: 2100 - Computer Architecture & Software
Assignee: Mitsubishi Electric Research Laboratories Inc.
OA Round: 2 (Final)
Interview Optional

Interview lift: +11.3%. An interview has already been conducted in this application's prosecution history. This examiner has a 54% grant rate. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 80 resolved cases, 2023–2026
Examiner Intelligence

MAHARAJ, DEVIKA S
Career Allowance Rate: 54% of resolved cases (43 granted / 80 resolved; -1.2% vs TC avg)
Interview Lift: +11.3% (moderate; resolved cases with interview)
Avg Prosecution: 4y 7m (15 currently pending)
Total Applications: 108 across all art units
Statute-Specific Performance

§101: 9.5% (-30.5% vs TC avg)
§103: 82.3% (+42.3% vs TC avg)
§102: 2.7% (-37.3% vs TC avg)
§112: 5.1% (-34.9% vs TC avg)
Tech Center average is an estimate. Based on career data from 80 resolved cases.
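The profile figures can be cross-checked with simple arithmetic. The sketch below recomputes the grant rate from the raw counts and backs out the Tech Center average implied by each statute's delta. It assumes each "vs TC avg" delta is the examiner's figure minus the Tech Center figure, which the page does not state.

```python
# Figures copied from the examiner profile; the subtraction below is an
# assumption about how the "vs TC avg" deltas were derived.
granted, resolved = 43, 80
grant_rate = 100 * granted / resolved  # percent; the page shows it rounded

# statute -> (examiner figure in %, delta vs Tech Center average in points)
statute_stats = {
    "101": (9.5, -30.5),
    "103": (82.3, 42.3),
    "102": (2.7, -37.3),
    "112": (5.1, -34.9),
}

# Implied Tech Center average = examiner figure - delta
implied_tc_avg = {
    statute: round(share - delta, 1)
    for statute, (share, delta) in statute_stats.items()
}

print(f"grant rate: {grant_rate:.2f}%")
print(implied_tc_avg)
```

Under that assumption, every statute implies the same Tech Center baseline, which is consistent with the page describing the Tech Center average as an estimate.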
Office Action

§101 §103
DETAILED ACTION
1.	This communication is in response to the amendments filed on December 23, 2025 for Application No. 17/817,373 in which Claims 1-18 are presented for examination. 

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
3.	The amendments filed on December 23, 2025 have been considered. Claims 1, 4, 8-10, 13, and 17-18 have been amended. Thus, Claims 1-18 are pending and presented for examination.

4.	Applicant’s arguments filed December 23, 2025 with respect to the 35 U.S.C. 112(b) rejections have been fully considered and are persuasive. Thus, the 35 U.S.C. 112(b) rejections have been withdrawn. 

5.	Applicant's arguments filed December 23, 2025 with respect to the 35 U.S.C. 101 rejection have been fully considered but they are not persuasive.
	Applicant’s Arguments on Pgs. 10-12 of Arguments/Remarks state:
“Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. See Office Action Page 3. Applicant respectfully disagrees with the Examiner's assertion. 
Independent claims 1 and 10 clearly recites limitations that require execution of neural networks and joint training of neural networks, such as timing detector neural network and decoding neural network which cannot be performed by a human in mind. 
Neural networks are inherently complex computing structures that require high-dimensional numerical operations, iterative optimization, and coordinated parameter updates across multiple models. Such operations cannot practically be performed mentally, and therefore do not constitute a mental process. 
Moreover, as per MPEP 2106.07(b), claims must be interpreted in light of the specification. Applicant's specification gives several examples of architectures and implementation of the timing detector neural network and the decoder neural network. For example, FIG. 4 and FIG. 6 of Applicant's specification as filed and produced herein show architectures of the timing detector neural network and the decoder neural networks which follows a transformer architecture, including several feed forward layers, convolutional layers, computations of feature vectors from audio streams and video streams. 
Further a practical implementation of the outlined neural network architectures is defined for example in Paras. [0117]-[0124] were the outlined neural networks are used for low-latency video QA tasks by handling datasets which include 10k video clips or 243k question-answer pairs. Also, from the video clips, video features are extracted for processing by the timing detector neural network and the decoding neural network, and such features are in the form of 128-dimensional vector sequences. Applicant would like to assert that no human mind is capable of implementing such architectures for neural networks, such as timing detector neural network and the decoding neural network, nor for handling such large datasets. Therefore, Applicant would like to request the Examiner to withdraw the classification of claims 1 and 10 as mental processes, and withdraw the rejection of claims 1 and 10 under 35 U.S.C. 101.”
Examiner respectfully disagrees. First, it must be noted that the limitations “execute a timing detector neural network […]” and “execute a decoding neural network […]” of the independent claims are not considered to be a mental process at Step 2A Prong 1. Instead, the operations performed including “[…] identify an early subsequence of frames in the sequence of frames including at least a portion of the information indicative of the information”, “[…] decode the information from the portion of the information in the subsequence of frames”, and “iteratively identify the early sequence of frames having the smallest number of subframes […]” are considered to be mental processes at Step 2A Prong 1. For example, a user may observe/analyze a sequence of frames (video frames) and use judgement/evaluation to identify an “early” subsequence of frames indicative of at least a portion of the received information associated with the sequence of frames (a subset of video frames representative of the video as a whole). Further, the user may then observe/analyze the subsequence of frames and use judgement/evaluation to decode information from the portion of information in the subsequence of frames (i.e., provide captions for those video frames based on said analysis). Correspondingly, the user is able to iteratively identify the early subsequence of frames having the smallest number of subframes containing a portion of the training information sufficient to decode the training information (i.e., identifying the early subset of video frames having the smallest number of subframes representative of the video as a whole for accurate captioning). 
Examiner agrees that claims must be interpreted in light of the specification. However, the Independent claims merely recite “execute a timing detector neural network trained to […]” and “execute a decoding neural network trained to […]”, in which both neural networks are seemingly already trained/configured to perform the specific operations of the claim language without significantly more. This cannot provide an inventive concept. Moreover, although Applicant points to Figures 4 & 6 of the specification to show the architectures of these networks, this architecture is again not described by the instant claim language. As highlighted above, the claim merely recites executing two already trained/configured (seemingly black-box) neural networks. Thus, at Step 2A Prong 2 & Step 2B, this amounts to merely adding the words “apply it” with the judicial exception, as these neural networks are generically recited. 
Applicant further states that the neural networks handle datasets which include 10k video clips or 243k question-answer pairs and features in the form of 128-dimensional vector sequences – however, again, such limitations are not present in the currently drafted claim language. Instead, the Independent claims merely recite “collecting a sequence of frames jointly including information dispersed among at least some frames in the sequence of frames, wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof” – this amounts to merely collecting/receiving video frames. Again, a user is capable of performing the aforementioned limitations merely based on observing/analyzing such collected/received video frames. There is no language present in the currently drafted claims which would preclude the limitations from being performed by mental process. 

	Applicant’s Arguments on Pg. 12 of Arguments/Remarks further state:
“Additionally, claims 1-18 and their enabling description require joint training of the timing detector and the decode neural network which further encompasses determination of a multi-task loss including KL divergence loss, use of elements like future frame eliminator for training, execution of multi-head attention operations, feature extraction using multiple self-attention layers, processing of encoded feature vectors by stacked convolution layers to determine time convoluted sequences, conversion of time convoluted sequences to a probability indicative of whether the timing is correct to generate a caption or not, among others, which cannot be performed in human mind, nor can they mentally determine the "smallest number of subframes sufficient to decode training information," which itself results from machine-learned weights produced by backpropagation. 
Applicant has also amended the independent claims 1 and 10 as amended to clarify the information that is processed by the timing detector neural network and the decoding neural network to include multimodal sensor signals including video frames, sound data or a combination thereof, which cannot be processed by human mind alone. 
Moreover, Applicant's claims 1-18 also provide technical improvement to the technical solution for the technical problem of video captioning, as can be seen from FIG. 9, FIG. 10, FIG. 11A, and FIG. 11B of Applicant's specification as filed. 
Applicant has proposed to clarify the practical application of claim 1 by amending the claims 1 and 10 to recite "output a token sequence corresponding to the information associated with the sequence of frames". 
Therefore, Applicant would request the Examiner to withdraw the rejections of claim 1-18 under 35 U.S.C 101, at least for the reasons cited above.”
Examiner respectfully disagrees for substantially the same reasons as stated above. Again, Applicant states that the joint training “[…] further encompasses determination of a multi-task loss including KL divergence loss, use of elements like future frame eliminator for training, execution of multi-head attention operations […]” etc. – however, these limitations are not present in the currently drafted claim language. There is no explicit recitation of leveraging a Kullback-Leibler divergence in any of claims 1-18, no recitation of a future frame eliminator in any of claims 1-18, no recitation of executing multi-head attention operations in any of claims 1-18, etc. Applicant is arguing that the instant claims cannot be performed by mental process merely based on the recitation of a “neural network” in the claims and arbitrary functions of the neural networks based on Applicant’s specification – however, none of the aforementioned technical operations are present in the claims, and they are not considered to be mental processes. Only the limitations as listed (see preceding response to arguments & 35 U.S.C. 101 rejection below) are considered to be mental processes at Step 2A Prong 1, for the reasons already stated by Examiner. 
Further, the amendment which specifies “[…] wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof” is considered to be “field of use” or merely indicating a field of use or technological environment in which to apply a judicial exception and does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application at Step 2A Prong 2 and Step 2B. The newly added limitation “[…] output a token sequence corresponding to the information associated with the sequence of frames” is considered to be a mental process at Step 2A Prong 1, as the limitation merely amounts to a user observing/analyzing the sequence of frames and correspondingly determining/generating a token sequence (caption comprising words/sentences) based on said analysis. The claims still recite an abstract idea.
Thus, the 35 U.S.C. 101 rejection is maintained.

6.	Applicant’s arguments filed December 23, 2025 with respect to the 35 U.S.C. 103 rejection have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 101
7.	35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


8.	Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1:
Step 1: Claim 1 is a system type claim. Therefore, Claims 1-9 are directed to either a process, machine, manufacture, or composition of matter.
Step 2A Prong 1: If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation by mathematical calculation but for the recitation of generic computer components, then it falls within the “Mathematical Concepts” grouping of abstract ideas.
[…] identify an early subsequence of frames in the sequence of frames including at least a portion of the information indicative of the information (mental process – other than reciting “execute a timing detector neural network”, identifying an early subsequence of frames may be performed manually by a user observing/analyzing the sequence of frames indicative of information and accordingly using judgement/evaluation to identify an “early” subsequence of frames including at least a portion of the analyzed information. For example, a user may observe/analyze a sequence of video frames and accordingly identify an early subsequence (earlier frames in the sequence/time-series) of frames which represent the video and/or information about the video)
[…] decode the information from the portion of the information in the subsequence of frames […] (mental process – other than reciting “execute a decoding neural network”, decoding information from the portion of the information in the subsequence of frames may be performed manually by a user observing/analyzing the information in the subsequence of frames and accordingly using judgement/evaluation to decode the information from the portion of the information in the subsequence of frames. For example, a user may observe/analyze video frames and accordingly use judgement/evaluation to determine/generate captions for the video based on said analysis)
[…] iteratively identify the early subsequence of frames having the smallest number of subframes from the beginning of a training sequence of frames containing a portion of training information sufficient to decode the training information and cause to output a token sequence corresponding to the information associated with the sequence of frames (mental process – other than reciting “wherein the timing detector neural network is jointly trained with the decoding neural network”, iteratively identifying the smallest number of subframes may be performed manually by a user iteratively observing/analyzing the training sequence of frames containing a portion of training information and using judgement/evaluation to identify the smallest number of subframes containing a portion of training information sufficient to decode the training information. For example, the user may observe/analyze a set of video frames and use judgement/evaluation to determine the smallest number of frames representative of the video/information in the video and correspondingly determine/generate captions for these video frames)
Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
Additional elements:
an artificial intelligence (AI) low-latency processing system, the low-latency processing system comprising: a processor; and a memory having instructions stored thereon that, when executed by the processor, cause the low-latency processing system to: […] (recited at a high-level of generality (i.e., as a generic system comprising generic computer components) such that it amounts to no more than mere instructions to apply the exception using generic computer components)
collect a sequence of frames jointly including information dispersed among at least some frames in the sequence of frames (Adding insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g))
[…] wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying the information is associated with multimodal sensor signals comprising at least video frames, sound data, or a combination thereof without significantly more does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
execute a timing detector neural network trained to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of applying a generically trained machine learning model with previously determined data (information/sequence of frames) without significantly more)
execute a decoding neural network trained to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of applying a generically trained machine learning model with previously determined data (information/subsequence of frames) without significantly more)
wherein the timing detector neural network is jointly trained with the decoding neural network to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of training a machine learning model with previously determined data (training sequence of frames/training information) without significantly more)
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception.
Additional elements:
an artificial intelligence (AI) low-latency processing system, the low-latency processing system comprising: a processor; and a memory having instructions stored thereon that, when executed by the processor, cause the low-latency processing system to: […] (mere instructions to apply the exception using generic computer components cannot provide an inventive concept)
collect a sequence of frames jointly including information dispersed among at least some frames in the sequence of frames (MPEP 2106.05(d)(II) indicates that merely “Receiving or transmitting data over a network” is a well-understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim). Thereby, a conclusion that the claimed limitation is well-understood, routine, conventional activity is supported under Berkheimer)
[…] wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying the information is associated with multimodal sensor signals comprising at least video frames, sound data, or a combination thereof without significantly more does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
execute a timing detector neural network trained to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of applying a generically trained machine learning model with previously determined data (information/sequence of frames) without significantly more. This cannot provide an inventive concept)
execute a decoding neural network trained to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of applying a generically trained machine learning model with previously determined data (information/subsequence of frames) without significantly more. This cannot provide an inventive concept)
wherein the timing detector neural network is jointly trained with the decoding neural network to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner’s note: high level recitation of training a machine learning model with previously determined data (training sequence of frames/training information) without significantly more. This cannot provide an inventive concept)
For the reasons above, Claim 1 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 2-9. The additional limitations of the dependent claims are addressed below. 	
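For readers following the dispute over the limitation "iteratively identify the early subsequence of frames having the smallest number of subframes [...] sufficient to decode the training information", the idea can be pictured as a search over prefixes of the frame sequence. The following is only an illustrative sketch: `smallest_sufficient_prefix` and the `is_sufficient` callback are hypothetical names, and the toy sufficiency test stands in for whatever the jointly trained networks of the application actually compute.

```python
from typing import Callable, Optional, Sequence


def smallest_sufficient_prefix(
    frames: Sequence[str],
    is_sufficient: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the smallest n such that frames[:n] passes is_sufficient.

    Returns None if no prefix, including the full sequence, is sufficient.
    """
    for n in range(1, len(frames) + 1):
        if is_sufficient(frames[:n]):
            return n
    return None


# Toy stand-in for "sufficient to decode": the prefix must contain both cues.
frames = ["intro", "dog", "dog", "ball", "credits"]
required_cues = {"dog", "ball"}
n = smallest_sufficient_prefix(frames, lambda prefix: required_cues <= set(prefix))
print(n, frames[:n])
```

The sketch makes the contested point concrete: the claim language fixes the goal (the shortest sufficient prefix) while leaving open how sufficiency is judged, which is where the Examiner and Applicant disagree.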

Regarding Claim 2: 
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 2 depends on. 
minimize a multi-task loss function including a time detection loss and an information generation loss (mathematical process – other than reciting “timing detector neural network is jointly trained with the decoder neural network”, minimizing a multi-task loss function may be performed by mathematical process utilizing a multi-task loss function including a time detection loss and an information generation loss)
Step 2A Prong 2 & Step 2B:
wherein the timing detector neural network is jointly trained with the decoder neural network on features of different subsequences of the sequence of frames to […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner's note: high level recitation of training a machine learning model with previously determined data without significantly more)
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. 

Regarding Claim 3:
Step 2A Prong 1: 
See the rejection of Claim 2 above, which Claim 3 depends on.
Step 2A Prong 2 & Step 2B:
wherein the multi-task loss function includes three losses defining (1) an accuracy of decoded information, (2) a difference between the information decoded from the subsequence of frames and the information decoded from the full sequence of frames, and (3) an accuracy of prediction of the timing detector neural network (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying the loss functions without significantly more does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.
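To make the three-part loss recited in Claim 3 easier to picture, here is a minimal sketch of one way such a multi-task loss could be assembled. The specific choices are assumptions for illustration only: cross-entropy for term (1), a KL divergence for term (2) (the Remarks quoted earlier mention a KL divergence loss, but the claims do not recite it), binary cross-entropy for term (3), and the `weights` parameter. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target_index])


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binary_cross_entropy(pred: float, label: int) -> float:
    """Loss for a probability pred in (0, 1) against a 0/1 label."""
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))


def multi_task_loss(
    early_probs: Sequence[float],   # token distribution decoded from the early subsequence
    full_probs: Sequence[float],    # token distribution decoded from the full sequence
    target_index: int,              # index of the reference token
    timing_pred: float,             # timing detector's probability that decoding can start
    timing_label: int,              # 1 if the subsequence is sufficient, else 0
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # hypothetical weighting
) -> float:
    accuracy_term = cross_entropy(early_probs, target_index)    # (1) decoded-information accuracy
    consistency_term = kl_divergence(full_probs, early_probs)   # (2) early vs. full decoding
    timing_term = binary_cross_entropy(timing_pred, timing_label)  # (3) timing prediction accuracy
    w1, w2, w3 = weights
    return w1 * accuracy_term + w2 * consistency_term + w3 * timing_term


loss = multi_task_loss([0.7, 0.3], [0.9, 0.1], 0, timing_pred=0.8, timing_label=1)
print(round(loss, 4))
```

When the early and full distributions agree, the second term vanishes, so the loss penalizes an early decode only to the extent that it diverges from what the full sequence would have produced.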

Regarding Claim 4:
Step 2A Prong 1: See the rejection of Claim 1 above, which Claim 4 depends on. 
[…] extract features from each frame in the sequence of frames (mental process – other than reciting “feature extractor”, extracting features from each frame may be performed manually by a user observing/analyzing the sequence of frames and accordingly using judgement/evaluation to identify and extract features from each frame)
[…] encode the extracted features of each frame to produce a sequence of encoded features (mental process – other than reciting “feature encoder”, encoding extracted features may be performed manually by a user observing/analyzing each frame and correspondingly using judgement/evaluation to encode the extracted features of each frame to produce a sequence of encoded features)
[…] identify a subsequence of encoded features representing the subsequence of frames (mental process – other than reciting “timing detector neural network”, identifying a subsequence of encoded features representing the subsequence of frames may be performed manually by a user observing/analyzing the subsequence of frames and correspondingly using judgement/evaluation to identify a subsequence of encoded features based on said frames)
[…] decode the information (mental process – other than reciting “decoder neural network”, decoding the information may be performed manually by a user observing/analyzing the information and correspondingly using judgement/evaluation to decode said information)
Step 2A Prong 2 & Step 2B:
wherein the processor is configured to execute a feature extractor […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner's note: high level recitation of applying an already trained feature extractor without significantly more. This cannot provide an inventive concept)
execute a feature encoder […] (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner's note: high level recitation of applying an already trained feature encoder without significantly more. This cannot provide an inventive concept)
submit the sequence of encoded features to the timing detector neural network […] (MPEP 2106.05(d)(II) indicates that merely “Receiving or transmitting data over a network” is a well-understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim). Thereby, a conclusion that the claimed limitation is well-understood, routine, conventional activity is supported under Berkheimer)
submit the subsequence of encoded features to the decoder neural network […] (MPEP 2106.05(d)(II) indicates that merely “Receiving or transmitting data over a network” is a well-understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim). Thereby, a conclusion that the claimed limitation is well-understood, routine, conventional activity is supported under Berkheimer)
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1.

Regarding Claim 5:
Step 2A Prong 1: 
See the rejection of Claim 1 above, which Claim 5 depends on. 
Step 2A Prong 2 & Step 2B:
wherein the processor triggers the execution of modules of the AI low-latency processing system upon receiving a new input frame appended to the sequence of frames (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the processor triggers execution of modules upon receiving a new input frame does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. 
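The Claim 5 limitation, in which processing is triggered each time a new frame is appended, can be sketched as an event-driven loop. This is a hedged illustration rather than the application's implementation: `StreamingCaptioner`, `ready`, and `decode` are hypothetical names, with `ready` standing in for the timing detector and `decode` for the decoding network.

```python
from typing import Callable, List, Optional


class StreamingCaptioner:
    """Re-runs a readiness check every time a new frame arrives."""

    def __init__(
        self,
        ready: Callable[[List[str]], bool],   # stand-in for the timing detector
        decode: Callable[[List[str]], str],   # stand-in for the decoding network
    ) -> None:
        self.frames: List[str] = []
        self.ready = ready
        self.decode = decode

    def on_frame(self, frame: str) -> Optional[str]:
        """Append the frame, then decode only if the detector says it is time."""
        self.frames.append(frame)
        if self.ready(self.frames):
            return self.decode(self.frames)
        return None


# Toy stand-ins: "ready" after three frames, "decode" joins the frame labels.
captioner = StreamingCaptioner(
    ready=lambda frames: len(frames) >= 3,
    decode=lambda frames: " ".join(frames),
)
outputs = [captioner.on_frame(f) for f in ["a", "b", "c"]]
print(outputs)
```

In this simplified version the captioner keeps decoding on every later frame once the readiness check passes; a fuller design would reset or stop after emitting output.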
  
Regarding Claim 6:
Step 2A Prong 1: 
See the rejection of Claim 1 above, which Claim 6 depends on. 
Step 2A Prong 2 & Step 2B:
wherein the information is a caption for an audio scene, a video scene, or an audio-video scene (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the information is a caption for an audio/video/audio-video scene does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. 
 
Regarding Claim 7:
Step 2A Prong 1: 
See the rejection of Claim 1 above, which Claim 7 depends on. 
Step 2A Prong 2 & Step 2B:
wherein the information is an answer to a question about the sequence of frames. (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception does not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the information is an answer to a question about the sequence of frames does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. 

Regarding Claim 8:
Step 2A Prong 1: See the rejection of Claim 4 above, which Claim 8 depends on.
[…] encode the question (mental process – other than reciting “execute a text encoder neural network”, encoding the question may be performed manually by a user observing/analyzing the question and accordingly using judgement/evaluation to encode said question)
Step 2A Prong 2 & Step 2B:
execute a text encoder neural network (Adding the words “apply it” (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea - see MPEP 2106.05(f) – Examiner's note: high level recitation of applying an already trained machine learning model without significantly more. This cannot provide an inventive concept)
submit the encoded question to the timing detector neural network (MPEP 2106.05(d)(II) indicates that merely “Receiving or transmitting data over a network” is a well-understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim). Thereby, a conclusion that the claimed limitation is well-understood, routine, conventional activity is supported under Berkheimer)
submit the question or the encoded question to the decoding neural network (MPEP 2106.05(d)(II) indicates that merely “Receiving or transmitting data over a network” is a well-understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim). Thereby, a conclusion that the claimed limitation is well-understood, routine, conventional activity is supported under Berkheimer)
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. 
 
Regarding Claim 9:
Step 2A Prong 1: 
See the rejection of Claim 1 above, which Claim 9 depends on. 
Step 2A Prong 2 & Step 2B:
wherein the sequence of frames include multi-modal information coming from different sensors of different modalities (Field of Use – limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself, and cannot integrate a judicial exception into a practical application; in this case specifying that the sequence of frames include multi-modal information coming from different sensors of different modalities does not integrate the exception into a practical application nor amount to significantly more – See MPEP 2106.05(h))
Accordingly, under Step 2A Prong 2 and Step 2B, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea, as discussed above in the rejection of claim 1. 

	Independent Claim 10 recites substantially the same limitations as Claim 1, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale.
	For the reasons above, Claim 10 is rejected as being directed to an abstract idea without significantly more. This rejection applies equally to dependent claims 11-18. The additional limitations of the dependent claims are addressed below. 

	Claim 11 recites substantially the same limitations as Claim 2, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 12 recites substantially the same limitations as Claim 3, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 13 recites substantially the same limitations as Claim 4, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 14 recites substantially the same limitations as Claim 5, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 15 recites substantially the same limitations as Claim 6, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 16 recites substantially the same limitations as Claim 7, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 17 recites substantially the same limitations as Claim 9, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

	Claim 18 recites substantially the same limitations as Claim 8, in the form of a method. The claim is also directed to performing mental processes/mathematical calculations without significantly more, therefore it is rejected under the same rationale. 

Claim Rejections - 35 USC § 103
9.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

10.	Claims 1-6, 9-15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (hereinafter Chen) (“Less is More: Picking Informative Frames for Video Captioning”), in view of Lin et al. (hereinafter Lin) (US PG-PUB 20220014807). 
Regarding Claim 1, Chen teaches an artificial intelligence (AI) low-latency processing system (Chen, Pg. 7, Section 5.1 Comparison with the state-of-the-arts, “Instead, we estimate the running time by the complexity of visual feature extractors and the number of processed frames. The details of running time estimation are listed in supplemental materials. Thanks to the PickNet, our captioning model is 4∼33 times faster than other methods.”, therefore, an artificial intelligence (AI) low-latency caption processing system is disclosed by PickNet), the low-latency processing system comprising: a processor; and a memory having instructions stored thereon that, when executed by the processor cause the low-latency processing system (See introduction of Lin reference below for explicit teaching of the processing system comprising a processor and memory) to: 
collect a sequence of frames jointly including information dispersed among at least some frames in the sequence of frames, wherein the information is associated with multimodal sensor signals (See introduction of Lin reference below for explicit teaching of the information being “associated with” multimodal sensor signals) comprising at least: video frames, sound data, or a combination thereof (Chen, Pg. 4, Section 3.2.2 Rewards, “First of all, the picked frames should contain rich semantic information, which can be used to effectively generate language description. In the video captioning task, it is natural to use the evaluated language metrics as the language reward. Here, we choose CIDEr [32] score. Given a video vi and a collection of human generated reference sentences Si = {sij}, the goal of CIDEr is to measure the similarity of the machine generated sentence ci to a majority of how most people describe the video.”, therefore, a sequence of frames jointly including information dispersed among at least some frames in the sequence of frames, wherein the information is associated with a video/video frames is collected/obtained);
execute a timing detector neural network trained to identify an early subsequence of frames in the sequence of frames including at least a portion of the information indicative of the information (Chen, Pg. 2, Section 1. Introduction, “We propose PickNet to perform informative frame picking for video captioning. Specifically, the base model for visual-linguistic association in video captioning is a standard Encoder-Decoder framework [3]. We develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame picking action is designed by considering both visual and textual cues.” & Pg. 2, Section 1. Introduction, “We design a plug-and-play reinforcement learning-based PickNet to select informative frames which can pick informative frames for the next learning stage. A compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation.”, thus, a timing detector neural network (PickNet) is trained to identify an early subsequence of frames (informative frames) in the sequence of frames including at least a portion of the information indicative of the information (video frames that clearly convey the information in the video – For example, see Figure 1 on Chen Pg. 1));
execute a decoding neural network trained to decode the information from the portion of the information in the subsequence of frames (Chen, Pg. 3, Section 3.1 Preliminary, “Decoder and sentence generation. Once the representation of the video has been generated, the recurrent decoder can employ it to generate the corresponding description. At every time-step of the decoding phase, the decoder unit uses the encoded vector v, previous generated one-hot representation word wt−1 and previous internal state pt−1 as input, and outputs a new internal state pt.”, therefore, a decoding neural network (decoder) is executed to decode the information from the portion of the information in the subsequence of frames (video representation)), 
wherein the timing detector neural network is jointly trained with the decoding neural network (Chen, Pg. 6, Section 3.3 Training, “After the first two stages, the Encoder-Decoder and PickNet are well pretrained, but there exists a gap between them because the Encoder-Decoder use the full video frames as input while PickNet just selects a portion of frames. So we need a joint training stage to integrate this two parts together. However, the pick action is not differentiable, so the gradients introduced by cross-entropy loss cannot flow into PickNet. Hence, we follow the approximate joint training scheme.”, therefore, the timing detector neural network (PickNet) is jointly trained with the decoder (encoder-decoder)) to iteratively identify the early subsequence of frames having the smallest number of subframes from the beginning of a training sequence of frames containing a portion of training information sufficient to decode the training information (Chen, Pg. 2, Section 1. Introduction, “We answer the follow question: Is there a way to use as less number of frames as possible to well approximate the performance using all the frames for video captioning? We propose PickNet to perform informative frame picking for video captioning. Specifically, the base model for visual-linguistic association in video captioning is a standard Encoder-Decoder framework [3]. We develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame picking action is designed by considering both visual and textual cues. From visual perspective, we maximize the diversity between current picked frame candidate and the selected frames. From textual perspective, we minimize the discrepancy between ground truth caption and the generated one using current picked candidate. If the candidate is rewarded, it will be selected and the corresponding latent representation of Encoder-Decoder will be updated for future trials. This procedure goes on until the end of the video sequence.”, thus, the encoder-decoder and PickNet are jointly trained to iteratively (repeated procedure until end of video sequence) identify the early subsequence of frames having the smallest number of subframes (as less number of frames as possible) from the beginning of the training sequence of frames containing a portion of training information sufficient to decode the training information – also depicted by Figure 1 on Pg. 1 & Figure 3 on Pg. 5) and cause to output a token sequence corresponding to the information associated with the sequence of frames (Chen, Pg. 2, Section 1. Introduction, “Consequently, a compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation”, thus, a token sequence (See Decoder and Sentence Generation section on Pgs. 3-4) is outputted corresponding to the information associated with the sequence of frames (video captions)).

While Chen teaches an AI low-latency processing system as disclosed above, Chen does not explicitly disclose the low-latency processing system comprising: a processor; and a memory having instructions stored thereon that, when executed by the processor cause the low-latency processing system to: […]
However, Lin teaches the low-latency processing system comprising: a processor; and a memory having instructions stored thereon that, when executed by the processor cause the low-latency processing system (Lin, Claim 14, “An apparatus for generating captioning information of multimedia data, comprising: a memory storing one or more instructions; and a processor configured to execute the one or more instructions stored in the memory to: […]”, thus, a processing system (apparatus) comprising a processor and a memory storing instructions to be executed by the processor is disclosed) to: […]

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the AI low-latency processing system, as disclosed by Chen to include wherein the low-latency processing system comprising: a processor; and a memory having instructions stored thereon that, when executed by the processor cause the low-latency processing system to […], as disclosed by Lin. One of ordinary skill in the art would have been motivated to make this modification to enable the use of a system, comprising processor and memory, to efficiently and accurately provide low-latency processing and captioning for video data (Lin, Par. [0004], “Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium for generating video captioning information, so as to improve the accuracy of the generated video captioning information.”).

While Chen teaches collecting a sequence of frames jointly including information dispersed among at least some frames in the sequence of frames, wherein the information is associated with […] at least: video frames, sound data, or a combination thereof, Chen does not explicitly disclose wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof. 
However, Lin teaches wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof (Lin, Par. [0201], “The image may be acquired from a local storage or a local database as required or received from an external data source (such as, the Internet, a server, a database, etc.) through an input device or a transmission medium.” & Par. [0300], “The multimedia data may be sample data in a training image dataset or a training video dataset acquired from a local storage or a local database as required, or training sample in a training image dataset or a training video dataset received from an external data source through an input device or a transmission medium.”, thus, the information may be associated with multimodal (video data and image data) sensor signals which are propagated through the transmission medium).

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the AI low-latency processing system, as disclosed by Chen in view of Lin to include wherein the information is associated with multimodal sensor signals comprising at least: video frames, sound data, or a combination thereof, as disclosed by Lin. One of ordinary skill in the art would have been motivated to make this modification to enable AI low-latency processing and captioning for multimodal data, such that different features may be learned to improve model robustness and accuracy (Lin, Par. [0124], “By training the feature selection network, the network can select specific characteristic information for generating captioning information of the multimedia data for different multimedia data, that is, for a given multimedia data, it is possible to determine respective weights for different characteristic information through the feature selection network.”).
	
Regarding Claim 2, Chen in view of Lin teaches the AI low-latency processing system of claim 1, wherein the timing detector neural network is jointly trained with the decoder neural network on features of different subsequences of the sequence of frames to minimize a multi-task loss function including a time detection loss and an information generation loss (Chen, Pg. 5, Section 3.3 Training, “The training procedure is splitted into three stages. The first stage is to pretrain the Encoder-Decoder. We call it supervision stage. In the second stage, we fix the Encoder-Decoder and train PickNet by reinforcement learning. It is called reinforcement stage. And the final stage is the joint training of PickNet and the Encoder-Decoder. We call it adaptation stage. We use standard back-propagation to train the Encoder-Decoder, and REINFORCE [34] to train PickNet. Supervision stage. When training the Encoder-Decoder, traditional method maximizes the likelihood of the next ground-truth word given previous ground-truth words using back-propagation. However, this approach causes the exposure bias [26], which results in error accumulation during generation at test time, since the model has never been exposed to its own predictions. In order to alleviate this phenomenon, the schedule sampling [4] procedure is used, which feeds back the model’s own predictions and slowly increases the feedback probability during training. We use SGD with cross entropy loss to train the Encoder-Decoder. Given the ground-truth sentences y = (y1,y2,...,ym), the loss is defined as:”, therefore, the timing detector neural network (PickNet) is jointly trained with the decoder neural network on features of different subsequences of the sequence of frames to minimize a multi-task loss function including a time detection loss (REINFORCE method used to train PickNet) and an information generation loss (cross entropy loss)).

Regarding Claim 3, Chen in view of Lin teaches the AI low-latency processing system of claim 2, wherein the multi-task loss function includes three losses defining (1) an accuracy of decoded information (Lin, Par. [0242], “The specific form of the model loss function may be configured according to actual needs. For example, the model loss function commonly used in training a video captioning model or an image captioning model may be selected. During training, the value of the model loss function represents the difference between the captioning information of multimedia data predicted by the model and the captioning label information, or indicates whether the predicted captioning information meets other preset end conditions.”, therefore, an accuracy of decoded information (including the difference between predicted captioning information to actual captioning information) is minimized during training), (2) a difference between the information decoded from the subsequence of frames and the information decoded from the full sequence of frames (Lin, Par. [0384], “the model training module may be used to: train a preset captioning model based on the first sample multimedia data to obtain a value of the first loss function, and train the captioning model based on the second sample multimedia data to obtain a value of the second loss function; obtain a value of the final loss function based on the value of the first loss function and the value of the second loss function; and train the captioning model based on the value of the final loss function until the final loss function converges.”, therefore, difference between the information decoded from a subsequence of frames (first loss/second loss) and information decoded from the full sequence of frames (final loss) is minimized during training), and (3) an accuracy of prediction of the timing detector neural network (Lin, Par. [0260], “Where Jlabel(θ) represents the cross entropy loss, θ represents the model parameters of the video captioning model, t represents the current moment, T represents the maximum moment, yt represents the ground-truth corresponding to the current moment, and y1:t-1 represents the ground-truth corresponding from time 1 to time t−1, V represents the video, and pθ represents the probability that the output word is ground-truth. Specifically, pθ (yt|y1:t-1,V) represents the probability that the word predicted by the model at the current moment is the corresponding labeled word. The meaning of the loss function is that when the input of the current time is the correct word at each time before the current time, the probability that the output of the current time is also the correct word is maximized.”, thus, a cross entropy loss, as related to timing detection/time intervals, is disclosed).
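
For reference, the symbol definitions quoted from Lin, Par. [0260] are consistent with the standard sequence-level cross entropy loss. The form below is a sketch reconstructed from those definitions under that assumption; it is not reproduced from Lin, and the summation limits and sign convention are assumed:

```latex
J_{label}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{1:t-1}, V\right)
```

Under this assumed form, minimizing the loss maximizes the probability that the word output at each time is the ground-truth word, given the correct words at all earlier times, which matches the meaning stated in the quoted passage.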

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the AI low-latency processing system of claim 2, as disclosed by Chen in view of Lin to include wherein the multi-task loss function includes three losses defining (1) an accuracy of decoded information, (2) a difference between the information decoded from the subsequence of frames and the information decoded from the full sequence of frames, and (3) an accuracy of prediction of the timing detector neural network, as disclosed by Lin. One of ordinary skill in the art would have been motivated to make this modification to enable the use of a multi-task loss function, which may encourage model convergence and improve system accuracy (Lin, Par. [0243], “In order to improve the accuracy of the generated captioning information of the multimedia data, FIG. 24 illustrates a method for training a multimedia data captioning model provided in an alternative embodiment of the present disclosure. As shown in the figure, the training sample in the training method also includes a second sample of multimedia data without captioning labels. The model loss function includes a first loss function and a second loss function. When training the original captioning model based on the first sample multimedia data, the method may include the following steps S201 to S203.”).
	
Regarding Claim 4, Chen in view of Lin teaches the AI low-latency processing system of claim 1, wherein the processor is configured to: 
execute a feature extractor to extract features from each frame in the sequence of frames; execute a feature encoder to encode the extracted features of each frame to produce a sequence of encoded features (Chen, Pg. 4, Section 3.2.1 Architecture, “During training, we use stochastic policy, i.e., the action is sampled according to Equation (13). When testing, the policy becomes determined, hence the action with higher probability is chosen. If the policy decides to pick the current frame, the frame feature will be extracted by a pretrained CNN and embedded into a lower dimension, then passed to the encoder unit, and the template will be updated: g̃ ← g_t. (14)”, therefore, a feature extractor is executed to extract features from each frame in the sequence of frames and further a feature encoder is then executed to encode the extracted features to produce a sequence of encoded features); 
submit the sequence of encoded features to the timing detector neural network and to identify a subsequence of encoded features representing the subsequence of frames (Chen, Pg. 4, Section 3.2.1. Architecture, “We force PickNet to pick the first frame, thus the encoder will always process at least one frame, which makes the training procedure more robust. Figure 3 shows how PickNet works with the encoder. It is worth noting that the input of PickNet can be of any other forms, such as the difference between optical flow maps, which may handle the motion information more properly.”, therefore, the sequence of encoded features may be submitted to the timing detector neural network (PickNet) to identify a subsequence of encoded features representing the subsequence of frames (See Figure 3 on Chen Pg. 5)); and 
submit the subsequence of encoded features to the decoder neural network to decode the information (Chen, Pg. 3, Section 3.1 Preliminary, “Decoder and sentence generation. Once the representation of the video has been generated, the recurrent decoder can employ it to generate the corresponding description. At every time-step of the decoding phase, the decoder unit uses the encoded vector v, previous generated one-hot representation word wt−1 and previous internal state pt−1 as input, and outputs a new internal state pt.”, therefore, the subsequence of encoded features may be submitted to the decoder network to decode the information – also depicted by Chen Figure 2 on Pg. 3).

Regarding Claim 5, Chen in view of Lin teaches the AI low-latency processing system of claim 1, wherein the processor triggers the execution of modules of the AI low-latency processing system upon receiving a new input frame appended to the sequence of frames (Chen, Pg. 4, Section 3.2.1 Architecture, “During training, we use stochastic policy, i.e., the action is sampled according to Equation (13). When testing, the policy becomes determined, hence the action with higher probability is chosen. If the policy decides to pick the current frame, the frame feature will be extracted by a pretrained CNN and embedded into a lower dimension, then passed to the encoder unit, and the template will be updated: g̃ ← g_t. (14) We force PickNet to pick the first frame, thus the encoder will always process at least one frame, which makes the training procedure more robust.”, thus, execution of modules of the AI low-latency processing system may be triggered when receiving a new input frame, as the policy decides to pick frames and updates the template/representation accordingly).

Regarding Claim 6, Chen in view of Lin teaches the AI low-latency processing system of claim 1, wherein the information is a caption for an audio scene, a video scene, or an audio-video scene (Chen, Pg. 2, “To the best of our knowledge, this is the first study on frame selection for video captioning. In fact, our framework can go beyond the Encoder-Decoder framework in video captioning task, and serves as a complementary building block for other state-of-the-art solutions.”, thus, the information may comprise a caption for a video scene).

Regarding Claim 9, Chen in view of Lin teaches the AI low-latency processing system of claim 1, wherein the sequence of frames include multi-modal information coming from different sensors of different modalities (Lin, Par. [0201], “The image may be acquired from a local storage or a local database as required or received from an external data source (such as, the Internet, a server, a database, etc.) through an input device or a transmission medium.” & Par. [0300], “The multimedia data may be sample data in a training image dataset or a training video dataset acquired from a local storage or a local database as required, or training sample in a training image dataset or a training video dataset received from an external data source through an input device or a transmission medium.”, thus, the information may be associated with multimodal (video data and image data) sensor signals which are propagated through the transmission medium – thus, the sequence of frames may include multimodal information coming from different sensors of different modalities (different input devices/transmission mediums/local storage/external storage)).
The reasons for obviousness have been noted in the rejection of Claim 1 above and are applicable herein.

Regarding Claim 10, Chen in view of Lin teaches a computer-implemented method for an artificial intelligence (AI) low-latency processing system (Chen, Pg. 7, Section 5.1 Comparison with the state-of-the-arts, “Instead, we estimate the running time by the complexity of visual feature extractors and the number of processed frames. The details of running time estimation are listed in supplemental materials. Thanks to the PickNet, our captioning model is 4∼33 times faster than other methods.”, therefore, computer-implemented methods (See Chen Section 3. Method) for an artificial intelligence (AI) low-latency caption processing system are disclosed by PickNet), including a processor and a memory storing instructions of the computer-implemented method performing steps using the processor (Lin, Claim 14, “An apparatus for generating captioning information of multimedia data, comprising: a memory storing one or more instructions; and a processor configured to execute the one or more instructions stored in the memory to: […]”, thus, a processing system (apparatus) comprising a processor and a memory storing instructions to be executed by the processor is disclosed), the steps comprising: […]
The rest of the claim language in Claim 10 recites substantially the same limitations as Claim 1, in the form of a method, therefore it is rejected under the same rationale.
The reasons for obviousness have been noted in the rejection of Claim 1 above and are applicable herein.

Claim 11 recites substantially the same limitations as Claim 2 in the form of a method, therefore it is rejected under the same rationale.
	
Claim 12 recites substantially the same limitations as Claim 3 in the form of a method, therefore it is rejected under the same rationale.

Claim 13 recites substantially the same limitations as Claim 4 in the form of a method, therefore it is rejected under the same rationale.

Claim 14 recites substantially the same limitations as Claim 5 in the form of a method, therefore it is rejected under the same rationale.

Claim 15 recites substantially the same limitations as Claim 6 in the form of a method, therefore it is rejected under the same rationale.

Claim 17 recites substantially the same limitations as Claim 9 in the form of a method, therefore it is rejected under the same rationale.

11.	Claims 7-8, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (hereinafter Chen) (“Less is More: Picking Informative Frames for Video Captioning”), in view of Lin et al. (hereinafter Lin) (US PG-PUB 20220014807), further in view of Kim et al. (hereinafter Kim) (US PG-PUB 20210012222).
Regarding Claim 7, Chen in view of Lin teaches the AI low-latency processing system of claim 1. 
Chen in view of Lin does not explicitly disclose wherein the information is an answer to a question about the sequence of frames.
However, Kim teaches wherein the information is an answer to a question about the sequence of frames (Kim, Abstract, “In implementations of answering questions during video playback, a video system can receive a question related to a video at a timepoint of the video during playback of the video, and determine audio sentences of the video that occur within a segment of the video that includes the timepoint. The video system can generate a classification vector from words of the question and the audio sentences, and determine an answer to the question utilizing the classification vector. The video system can obtain answer candidates, and the answer to the question can be selected as one of the answer candidates based on matching the classification vector to one of the answer vectors.”, thus, the information may comprise an answer to a question about the sequence of frames in the video).

It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the AI low-latency processing system of claim 1, as disclosed by Chen in view of Lin to include wherein the information is an answer to a question about the sequence of frames, as disclosed by Kim. One of ordinary skill in the art would have been motivated to make this modification to enable accurate and reliable question answering during video playback and processing for a plurality of different domains (Kim, Par. [0006], “Accordingly, the video system can generate an answer to a question based on context of the video when the question is asked, and the answer is accurate and reliable since it is grounded to the domain knowledge base. Hence, the video system is not limited to answering questions having an answer that may be determined from the question itself, but can also answer questions with an answer based on the context of the video or domain knowledge about the video for any video pertaining to a domain.”).
	
Regarding Claim 8, Chen in view of Lin in view of Kim teaches the AI low-latency processing system of claim 7, wherein the processor is configured to:
execute a text encoder neural network to encode the question (Kim, Par. [0023], “For instance, the video system may include a question encoder that receives the question (e.g., text of the question in any suitable format, such as concatenated word vectors), and generates a question representation, such as a question feature vector based on the words of the question”, therefore, a text encoder neural network (question encoder) may be executed to encode the question);
submit the encoded question to the timing detector neural network (Chen, Pg. 2, Section 1. Introduction, “We propose PickNet to perform informative frame picking for video captioning. Specifically, the base model for visual-linguistic association in video captioning is a standard Encoder-Decoder framework [3]. We develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame picking action is designed by considering both visual and textual cues.” & Pg. 2, Section 1. Introduction, “We design a plug-and-play reinforcement learning-based PickNet to select informative frames which can pick informative frames for the next learning stage. A compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation.”, thus, a timing detector neural network (PickNet) may receive encoded information (which may comprise an encoded question as taught by Kim above) to be submitted to the network); and 
submit the question or the encoded question to the decoding neural network (Chen, Pg. 3, Section 3.1 Preliminary, “Decoder and sentence generation. Once the representation of the video has been generated, the recurrent decoder can employ it to generate the corresponding description. At every time-step of the decoding phase, the decoder unit uses the encoded vector v, previous generated one-hot representation word wt−1 and previous internal state pt−1 as input, and outputs a new internal state pt.”, therefore, the information/encoded information (which may comprise a question/encoded question as taught by Kim above) may be submitted to the decoder for decoding);
The reasons for obviousness have been noted in the rejection of Claim 7 above and are applicable herein.
	
Claim 16 recites substantially the same limitations as Claim 7 in the form of a method, therefore it is rejected under the same rationale.

Claim 18 recites substantially the same limitations as Claim 8 in the form of a method, therefore it is rejected under the same rationale.

Conclusion
12.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
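The reply windows described above can be sketched as simple date arithmetic. This is a minimal illustration, not a docketing tool: it assumes the mailing date is the May 08, 2026 date shown in the Prosecution Timeline, the helper names `add_months` and `reply_dates` are hypothetical, and it does not account for weekend or federal-holiday rollover or for the advisory-action adjustment described in the paragraph.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the same day-of-month `months` later, clamped to the month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_dates(mailing: date) -> dict:
    """Hypothetical helper: key dates measured from the mailing date of a final action."""
    return {
        # A first reply within two months triggers the advisory-action provision.
        "two_month_mark": add_months(mailing, 2),
        # End of the three-month shortened statutory period.
        "shortened_period_end": add_months(mailing, 3),
        # The statutory period cannot run later than six months.
        "statutory_maximum": add_months(mailing, 6),
    }


# Assumption: mailing date taken from the timeline entry for the Final Rejection.
dates = reply_dates(date(2026, 5, 8))
for label, value in dates.items():
    print(f"{label}: {value.isoformat()}")
```

The dates printed are nominal calendar dates only; confirm the actual due dates against the mailing date on the Office action itself.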

13.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Devika S Maharaj whose telephone number is (571)272-0829. The examiner can normally be reached Monday - Thursday 8:30am - 5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached at (571)270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/DEVIKA S MAHARAJ/ Examiner, Art Unit 2123
/ALEXEY SHMATOV/ Supervisory Patent Examiner, Art Unit 2123
Prosecution Timeline

Aug 04, 2022: Application Filed
Sep 23, 2025: Non-Final Rejection mailed (§101, §103)
Dec 10, 2025: Interview Requested
Dec 17, 2025: Examiner Interview Summary
Dec 17, 2025: Applicant Interview (Telephonic)
Dec 23, 2025: Response Filed
May 08, 2026: Final Rejection mailed (§101, §103; current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/792,747 (Patent 12619855): MODEL POOL FOR MULTIMODAL DISTRIBUTED LEARNING. 3y 9m to grant; granted May 05, 2026.
17/655,348 (Patent 12585948): NEURAL PROCESSING DEVICE AND METHOD FOR PRUNING THEREOF. 4y 0m to grant; granted Mar 24, 2026.
17/498,737 (Patent 12579426): Training a Neural Network having Sparsely-Activated Sub-Networks using Regularization. 4y 5m to grant; granted Mar 17, 2026.
17/090,724 (Patent 12572795): ANSWER SPAN CORRECTION. 5y 4m to grant; granted Mar 10, 2026.
17/085,593 (Patent 12561577): AUTOMATIC FILTER SELECTION IN DECISION TREE FOR MACHINE LEARNING CORE. 5y 3m to grant; granted Feb 24, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 54%
With Interview (+11.3%): 65%
Median Time to Grant: 4y 7m (~10m remaining)
PTA Risk: Moderate
Based on 80 resolved cases by this examiner. Grant probability derived from career allowance rate.
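
The arithmetic behind the grant-probability figures can be reproduced roughly as follows. This is a sketch under stated assumptions: the counts (43 granted out of 80 resolved) come from the examiner profile on this page, and treating the interview lift as an additive adjustment in percentage points is an assumption, since the page does not state how the two numbers are combined.

```python
# Figures reported on this page for the examiner's career record.
granted = 43
resolved = 80
interview_lift_points = 11.3  # assumption: applied as additive percentage points

# Career allowance rate, expressed as a percentage.
allowance_rate = 100 * granted / resolved

# Assumed combination: base rate plus the reported interview lift.
with_interview = allowance_rate + interview_lift_points

print(f"Grant probability: {round(allowance_rate)}%")
print(f"With interview: {round(with_interview)}%")
```

Under these assumptions the rounded values match the 54% and 65% shown above. Both are descriptive statistics over past cases, not a prediction for this application.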