Prosecution Insights
Last updated: April 19, 2026
Application No. 18/209,421

SAMPLING OPERATIONS IN A COMPUTER VISION TOOL TO REGULATE DOWNSTREAM TASKS

Non-Final Office Action (§101, §102, §103)

Filed: Jun 13, 2023
Examiner: KIM, SEHWAN
Art Unit: 2129
Tech Center: 2100 — Computer Architecture & Software
Assignee: Microsoft Technology Licensing, LLC
OA Round: 1 (Non-Final)

Grant Probability: 60% (Moderate)
OA Rounds: 1-2
To Grant: 4y 1m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 60% (86 granted / 144 resolved; +4.7% vs TC avg)
Interview Lift: +65.6% for resolved cases with interview (strong)
Avg Prosecution: 4y 1m (typical timeline); 35 currently pending
Total Applications: 179 across all art units (career history)

Statute-Specific Performance

§101: 20.8% (-19.2% vs TC avg)
§102: 6.3% (-33.7% vs TC avg)
§103: 46.2% (+6.2% vs TC avg)
§112: 23.3% (-16.7% vs TC avg)

Tech Center averages are estimates • Based on career data from 144 resolved cases
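
As a quick sanity check on the headline figures above, here is a minimal Python sketch of the arithmetic. The 86/144 counts and the "+4.7% vs TC avg" delta come directly from the card; treating that delta as percentage points, and back-deriving the Tech Center baseline from it, is an assumption of this sketch rather than reported data.

```python
# Sanity-check of the examiner-card figures above (counts taken from the card).
granted, resolved = 86, 144

career_allow_rate = granted / resolved                 # ~0.597, shown as 60%
delta_vs_tc = 0.047                                    # "+4.7% vs TC avg" (assumed percentage points)
implied_tc_average = career_allow_rate - delta_vs_tc   # derived estimate, not a reported figure

print(f"Career allow rate:  {career_allow_rate:.1%}")  # 59.7%
print(f"Implied TC average: {implied_tc_average:.1%}") # 55.0%
```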

Office Action

§101 §102 §103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner's Note

The Examiner encourages Applicant to schedule an interview to discuss issues related to, for example, the rejections noted below under 35 U.S.C § 101 and § 103, for moving forward allowance. Providing supporting paragraph(s) for each limitation of amended/new claim(s) in Remarks is strongly requested for clear and definite claim interpretations by Examiner.

For clarification, claim 17 may be amended (e.g., "non-transitory computer-readable medium") based on par 39 "The term "non-transitory computer-readable media" specifically excludes transitory propagating signals, carrier waves, and wave forms or other intangible or transitory media that may nevertheless be readable by a computer" to make sure that the claim falls within one of the four statutory categories.

For clarification, claim 18 may be amended (e.g., "A computer system comprising a hardware processor and non-transitory memory") based on par 40 "The innovations can be described in the general context of computer-executable instructions being executed in a computer system on a target real or virtual processor" to make sure that the claim falls within one of the four statutory categories.

Priority

Acknowledgment is made of applicant's claim for the present application filed on 06/13/2023.

Claim Objections

Claim(s) 11 is/are objected to because of the following informalities: it appears that "selecting which of the downstream tasks" (line 2) needs to read "selecting one of the downstream tasks" or something else. Appropriate correction is suggested. Claim(s) 11 each recite(s) limitations that raise issues of indefiniteness as set forth above, and their dependent claims are objected to at least based on their direct and/or indirect dependency from the claim listed above. Appropriate explanation and/or amendment is required.

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):

(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:

An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that do not use the word "means," but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.
Such claim limitation(s) is/are: Claim 18: “a sampling tool, implemented using the processing system of the computer system, configured to perform sampling operations” (Note that pars 21-41 of the present application along with fig 1 describe(s) a sufficient structure for performing the claimed function.) Claim 19: “downstream tools configured to perform operations for the downstream tasks, respectively” (Note that pars 21-41 of the present application along with fig 1 describe(s) a sufficient structure for performing the claimed function.) Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Claim Rejections - 35 USC § 101 35 U.S.C. 101 reads as follows: Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Regarding claim 1 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “…, a method of regulating downstream tasks, the method comprising: …; determining inputs for machine learning models in different channels using the encoded data; determining a set of event indicators for the given frame, including: …; and fusing results from the machine learning models; and based at least in part on the set of event indicators for the given frame, regulating downstream tasks for the given frame”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). 
In particular, the claim recites an additional element(s) (“In a computer system that implements a computer vision tool”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“receiving encoded data for a given frame of a video sequence”) – the act of receiving data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of receiving data is recited at a high-level of generality (i.e., as a generic act of receiving performing a generic act function of receiving data) such that it amounts no more than a mere act to apply the exception using a generic act of receiving. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“providing the inputs to the machine learning models, respectively”) – the act of providing (i.e. inputting) data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of inputting data is recited at a high-level of generality (i.e., as a generic act of inputting performing a generic act function of inputting data) such that it amounts no more than a mere act to apply the exception using a generic act of inputting. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). As discussed above, the claim recites the additional element(s) of receiving data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. 
As discussed above, the claim recites the additional element(s) of inputting data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g) – “Mere Data Gathering”. However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. Regarding claim 2 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein the inputs for the given frame are part of different time series that include: a time series of reconstructed frames; a time series of motion information; and a time series of residual information”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 3 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “determining whether the given frame is intra-coded; selectively decoding encoded data for the given frame, including: if the given frame is intra-coded, decoding encoded data for the given frame to produce a reconstructed version of the given frame, wherein the reconstructed version of the given frame is part of the time series of reconstructed frames; or otherwise, the given frame not being intra-coded, selecting, from the time series of reconstructed frames, a reconstructed version of a previous frame to use for the given frame; determining motion information for the given frame based at least in part on motion vector values decoded or derived from the encoded data; and determining residual information for the given frame based at least in part on residual values decoded or derived from the encoded data”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. 
That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 4 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element(s) (“providing first input, from the time series of reconstructed frames, to a first machine learning model among the machine learning models; providing second input, from the time series of motion information, to a second machine learning model among the machine learning models; and providing third input, from the time series of residual information, to a third machine learning model among the machine learning models”) – the act of providing (i.e. inputting) data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of inputting data is recited at a high-level of generality (i.e., as a generic act of inputting performing a generic act function of inputting data) such that it amounts no more than a mere act to apply the exception using a generic act of inputting. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“the first machine learning model having been trained to identify events in reconstructed frames”, “the second machine learning model having been trained to identify events in motion information”, “the third machine learning model having been trained to identify events in residual information”). The additional element is recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. 
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim recites the additional element(s) of inputting data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g) – “Mere Data Gathering”. However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. The additional elements regarding training are recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not amount to significantly more than the abstract idea. The claim is directed to an abstract idea. Regarding claim 5 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein the determining the inputs is performed with decoding of less than all frames of the video sequence, thereby reducing resource utilization to determine the inputs”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 6 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
In particular, the claim recites an additional element (“wherein each of the machine learning models uses: a two-dimensional convolutional neural network; a three-dimensional convolutional neural network; a video transformer; or a temporal dilated video transformer”). This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 7 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein one of the machine learning models uses a temporal dilated video transformer, the temporal dilated video transformer comprising: an initial stage, the initial stage having a patch embedding layer and an initial set of temporal dilated transformer blocks; and a set of successive stages, each of the set of successive stages having a patch merging layer and a successive set of temporal dilated transformer blocks”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 8 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element(s) (“wherein the machine learning models have been trained using encoded data in a specific video codec format”). 
The additional element is recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The additional elements regarding training are recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not amount to significantly more than the abstract idea. The claim is directed to an abstract idea. Regarding claim 9 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“uses a cross-attention layer”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 10 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. 
Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein the set of event indicators for the given frame are: a single classification for the given frame,…; or a score for each of multiple types of events, …”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) In particular, the claim recites an additional element(s) (“wherein different ones of the downstream tasks have been trained for different types of classification”, “wherein different ones of the downstream tasks have been trained for different types of events”). The additional element is recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). The additional elements regarding training are recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not amount to significantly more than the abstract idea. The claim is directed to an abstract idea. Regarding claim 11 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “selecting which of the downstream tasks, if any, to use for the given frame; or adjusting one or more of the downstream tasks for the given frame”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). 
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 12 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “wherein the regulating the downstream tasks reduces overall resource utilization by the downstream tasks, and wherein the regulating the downstream tasks includes, for each given downstream task among the downstream tasks: determining whether the given downstream task is to be used for the given frame; and selectively performing the given downstream task for the given frame, including: if the given downstream task is to be used for the given frame, performing the given downstream task for the given frame; or otherwise, the given downstream task not being used for the given frame, skipping the given downstream task for the given frame”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim does not recite additional elements. Thus, the claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. Thus, the claim is not patent eligible. Regarding claim 13 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element(s) (“accepting, as user input, a system resource constraint indicator”) – the act of receiving data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). 
The act of receiving data is recited at a high-level of generality (i.e., as a generic act of receiving performing a generic act function of receiving data) such that it amounts no more than a mere act to apply the exception using a generic act of receiving. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element (“wherein the regulating the downstream tasks is also based at least in part on the system resource constraint indicator, whereby the downstream tasks operate within a range of acceptable resource utilization”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim recites the additional element(s) of receiving data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 14 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element (“wherein the downstream tasks include a text or character recognition task, a face detection task, a person detection task, a vehicle detection task, an object detection task for another type of object, a face tracking task, a person tracking task, a vehicle tracking task, an object tracking task for another type of object, and/or an action recognition task for a type of action”). This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. 
See MPEP 2106.05(h) Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. This is a recitation of a particular type or source of model/data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of model/data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h). Regarding claim 15 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“wherein the downstream tasks are performed on a different computer system connected over a network to the computer system that implements the computer vision tool”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 16 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “…; or for the subsequent frame of the video sequence, performing …, the using, the determining, and/or the regulating for the subsequent frame concurrent with the same operation or operations for the given frame”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). 
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. In particular, the claim recites an additional element(s) (“for a subsequent frame of the video sequence, as the given frame, repeating the receiving, the using, the determining, and the regulating on a frame-by-frame basis”) – the act of repeating. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of repeating is recited at a high-level of generality (i.e., as a generic act of performing a generic act function of repeating) such that it amounts no more than a mere act to apply the exception using a generic act of repeating. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. In particular, the claim recites an additional element(s) (“the receiving”) – the act of receiving data. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of receiving data is recited at a high-level of generality (i.e., as a generic act of receiving performing a generic act function of receiving data) such that it amounts no more than a mere act to apply the exception using a generic act of receiving. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, the claim recites the additional element(s) of receiving data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. As discussed above, the claim recites the additional element(s) of repeating at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g). However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Performing repetitive calculations”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible. 
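
Practitioner note (not part of the Office Action): to keep the recited method steps straight while responding to the rejections above, the following minimal Python sketch lays out the kind of pipeline claim 1 recites, with the three input channels of claim 2 and the perform-or-skip regulation of claim 12. Every name, the averaging fusion, and the single threshold are hypothetical illustrations for discussion only, not the Applicant's disclosed implementation or the cited art.

```python
# Illustrative sketch only; all names are hypothetical, not the Applicant's code.
from typing import Callable, Dict

def sample_and_regulate(
    encoded_frame: dict,
    channel_models: Dict[str, Callable[[list], float]],
    downstream_tasks: Dict[str, Callable[[dict], None]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Mirrors the recited steps: determine per-channel inputs from the encoded
    data, run one model per channel, fuse the results into event indicators,
    then perform or skip each downstream task for the given frame."""
    # Determine inputs for the machine learning models in different channels
    # (reconstructed frames, motion information, residual information).
    inputs = {
        "reconstructed": encoded_frame.get("reconstructed_frame", []),
        "motion": encoded_frame.get("motion_vectors", []),
        "residual": encoded_frame.get("residuals", []),
    }
    # Provide the inputs to the machine learning models, respectively.
    per_channel = {ch: channel_models[ch](x) for ch, x in inputs.items()}
    # Fuse results from the models into a set of event indicators
    # (a plain average here, purely for illustration).
    event_score = sum(per_channel.values()) / len(per_channel)
    # Regulate downstream tasks for the given frame: perform or skip each one.
    for name, task in downstream_tasks.items():
        if event_score >= threshold:
            task(encoded_frame)   # perform the task for this frame
        # otherwise the task is skipped for this frame
    return per_channel

# Toy usage with stub models and a single stub task.
models = {ch: (lambda x: 0.7) for ch in ("reconstructed", "motion", "residual")}
tasks = {"person_detection": lambda frame: print("person detection runs")}
sample_and_regulate({"motion_vectors": [(1, 0)]}, models, tasks)
```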
Regarding claim 17 The claim recites “A computer-readable medium having stored thereon computer-executable instructions for causing a processing system, when programmed thereby, to perform operations of a computer vision tool to regulate downstream tasks, the operations comprising:” to perform precisely the method of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) and “Storing and retrieving information in memory” (see MPEP 2106.05(g) on Insignificant Extra-Solution Activity, and MPEP 2106.05(d) on Well-Understood, Routine, Conventional Activity) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1. Regarding claim 18 The claim recites “A computer system comprising a processing system and memory, wherein the computer system implements a computer vision tool comprising: a buffer, implemented using the memory of the computer system, configured to” and “a sampling tool, implemented using the processing system of the computer system, configured to” to perform precisely the method of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1. Regarding claim 19 The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. Step 1: The claim recites a method; therefore, it falls into the statutory category of processes. Step 2A Prong 1: The limitations of “… perform operations for the downstream tasks, respectively”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper). If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element(s) (“downstream tools configured to”) – using a device and/or a model to process data. The device and the model in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that it amounts no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. 
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer component to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. MPEP 2106.05(f). Regarding claim 20 The claim is rejected for the reasons set forth in the rejection of a combination of Claims 7 and 9 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception. Claim Rejections - 35 USC § 102 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action: A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention. Claim(s) 1-5, 8, 10-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al. (Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics) Regarding claim 1 Li teaches In a computer system that implements a computer vision tool, a method of regulating downstream tasks, the method comprising: (Li [fig(s) 4] [sec(s) 2] “We ran each technique on a new 10 minute clip from each camera under a sweep of resource configurations: a single core, 0.5-1.5 GHz CPU speed, and 128-1024 MB of RAM. Experiments were performed on a Macbook Pro laptop with a virtual machine that restricted resources to the specified parameters. Table 1 lists the filtering speeds in each setting. As shown, both NN models require at least 512 MB of RAM to operate, which precludes them from being used on many deployed cameras. Tiny YOLO is unable to achieve even 1 fps in any setting; note that even with the 11× speedup reported when also using background subtraction [36], Tiny YOLO is still far below real-time speeds. In contrast, when it has sufficient memory to run, the binary classification model consistently achieves real-time speeds, e.g., 28 fps and 86 fps with 0.5 GHz and 1.5 GHz processors, respectively.” [sec(s) 1] “On-camera filtering. In this paper, unlike prior filtering approaches that typically run on edge [54] or backend servers, we seek to filter frames at the beginning of the analytics pipeline – directly on cameras. Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. 
Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).”;) receiving encoded data for a given frame of a video sequence; (Li [fig(s) 4] [sec(s) 5] “We implemented Reducto on the Raspberry Pi using OpenCV [12] for feature extraction and frame differencing calculations, and a hash table lookup to make threshold selections and filtering decisions. Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi. Overall, we observed that Reducto’s filtering results for each video identically matched those from our VM-based implementation (i.e., results in Figure 11). More importantly, Reducto was able to operate at 47.8 fps on the Raspberry Pi, highlighting the ability to perform real-time filtering. Digging deeper, we found that the bulk of the processing overheads were due to per-frame feature extraction with OpenCV; this task could operate at 99.7 fps, as compared to frame differencing calculations and hash table lookups that ran at 129.5 and 318.6 fps, respectively.”;) determining inputs for machine learning models in different channels using the encoded data; (Li [fig(s) 4] [fig(s) 6, 20] [sec(s) 1] “To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3). Clustering is done over all pairs of observed difference values (in the training set) and their highest feature values that hit the accuracy target. For each observed difference value, the camera selects the cluster in which the value falls and performs filtering using the average filtering threshold of that cluster. Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … As this characterization data is collected, the profiler processes each frame in the video to extract the three low-level differencing features presented in §3. Our observation (Figure 5) is that there often exists a single feature that works the best for a query class across different videos, cameras, and accuracy targets. Hence, during profiling, the server finds the best feature for each query class that it wishes to support. At the end of this phase, the best feature for each query class is identified and stored at the server. … To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. 
The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used. The generated model is encoded as a hash table, where each entry represents a cluster of differencing values whose corresponding thresholds are within the same neighborhood — each key is the average differencing value and each value is the threshold for that cluster which delivers the required accuracy. Together with the selected feature, this hash table is also sent to the camera for each query.”; e.g., color videos read(s) on “different channels”.) determining a set of event indicators for the given frame, including: (Li [fig(s) 4] [fig(s) 6] “Car detection results for two sets of adjacent frames from the Southampton video; subcaptions list the corresponding differencing feature values. For bounding box detection queries, slight variations can change the query result; Edge picks up on these subtle changes (top) but Area does not. In contrast, counting queries are better served by Area, which reports significant differences when counts change (bottom), but not when counts stay fixed (top)” [sec(s) 1] “On-camera filtering. In this paper, unlike prior filtering approaches that typically run on edge [54] or backend servers, we seek to filter frames at the beginning of the analytics pipeline – directly on cameras. Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).” [sec(s) 4.1] “To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used.”;) providing the inputs to the machine learning models, respectively; and (Li [fig(s) 4] [fig(s) 6, 20] [table(s) 2] “On-camera tracking speed Server tracking speed” [sec(s) 1] “To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3). Clustering is done over all pairs of observed difference values (in the training set) and their highest feature values that hit the accuracy target. 
For each observed difference value, the camera selects the cluster in which the value falls and performs filtering using the average filtering threshold of that cluster. Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … As this characterization data is collected, the profiler processes each frame in the video to extract the three low-level differencing features presented in §3. Our observation (Figure 5) is that there often exists a single feature that works the best for a query class across different videos, cameras, and accuracy targets. Hence, during profiling, the server finds the best feature for each query class that it wishes to support. At the end of this phase, the best feature for each query class is identified and stored at the server. … To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used. The generated model is encoded as a hash table, where each entry represents a cluster of differencing values whose corresponding thresholds are within the same neighborhood — each key is the average differencing value and each value is the threshold for that cluster which delivers the required accuracy. Together with the selected feature, this hash table is also sent to the camera for each query.”;) fusing results from the machine learning models; and (Li [fig(s) 4] [fig(s) 6, 20] [sec(s) 1] “To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3). Clustering is done over all pairs of observed difference values (in the training set) and their highest feature values that hit the accuracy target. For each observed difference value, the camera selects the cluster in which the value falls and performs filtering using the average filtering threshold of that cluster. Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … As this characterization data is collected, the profiler processes each frame in the video to extract the three low-level differencing features presented in §3. Our observation (Figure 5) is that there often exists a single feature that works the best for a query class across different videos, cameras, and accuracy targets. Hence, during profiling, the server finds the best feature for each query class that it wishes to support. At the end of this phase, the best feature for each query class is identified and stored at the server. 
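For illustration only, the cluster-based threshold model described in the passages quoted above (K-means clustering over observed differencing values, encoded as a hash table keyed by a cluster's average differencing value) can be sketched as follows; the function names, the number of clusters, and the training interface are hypothetical and are not taken from Li:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_threshold_table(diff_values, best_thresholds, n_clusters=5):
    # diff_values[i]: an observed per-frame differencing value from the training video.
    # best_thresholds[i]: the largest filtering threshold that still met the accuracy
    # target for that observation (per the correlation described in the quoted text).
    diffs = np.asarray(diff_values, dtype=float).reshape(-1, 1)
    thresholds = np.asarray(best_thresholds, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(diffs)
    table = {}
    for c in range(n_clusters):
        members = km.labels_ == c
        key = float(diffs[members].mean())               # average differencing value of the cluster
        table[key] = float(thresholds[members].mean())   # threshold selected for that cluster
    return table

def lookup_threshold(table, observed_diff):
    # Camera-side lookup: pick the cluster whose key is nearest the observed value.
    nearest_key = min(table, key=lambda k: abs(k - observed_diff))
    return table[nearest_key]
```

On the camera side, lookup_threshold would be consulted for each observed differencing value, and frames whose difference falls below the returned threshold would be filtered out, consistent with the quoted description.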
… To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used. The generated model is encoded as a hash table, where each entry represents a cluster of differencing values whose corresponding thresholds are within the same neighborhood — each key is the average differencing value and each value is the threshold for that cluster which delivers the required accuracy. Together with the selected feature, this hash table is also sent to the camera for each query.”;) based at least in part on the set of event indicators for the given frame, regulating downstream tasks for the given frame. (Li [fig(s) 4] [sec(s) 1] “On-camera filtering. In this paper, unlike prior filtering approaches that typically run on edge [54] or backend servers, we seek to filter frames at the beginning of the analytics pipeline – directly on cameras. Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).”;) Regarding claim 2 Li teaches claim 1. Li further teaches wherein the inputs for the given frame are part of different time series that include: a time series of reconstructed frames; a time series of motion information; and a time series of residual information. (Li [sec(s) 1] “Reducto. Based on this insight, we developed Reducto, a simple and yet inexpensive solution to the real-time video analytics efficiency problem, that tackles three main challenges. (C1) What low-level video features to use? The computer vision (CV) community [20, 24, 44, 46, 59, 60, 66, 68, 82, 83] has discovered a slew of low-level video features that extract frame differences [21], such as Edge and Pixel. To find the most appropriate features for on-camera filtering, we carefully studied a representative set of them (§3). An important observation we make is that the “best” feature (i.e., the one that most closely tracks changes in query results) to use varies across query classes more so than across different videos (see §4.2). This is because each feature uniquely captures a certain low-level video property; different query classes are interested in different video properties, and hence fit the best with different features. Based on this observation, the Reducto server performs offline profiling of historical video data to determine the best feature for each query class. The server notifies the camera of the feature it should use for each new query. 
Note that this is in contrast to existing strategies that always use the Pixel feature [24].” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; e.g., “Pixel” read(s) on “reconstructed frames” since encoded frames are decoded before the object detection. In addition, e.g., “motion across frames” along with encoding based on “H.264 standard” read(s) on “residual information” since residual information is used for the encoding. For more details on “motion information” and “residual information”, please refer to Wu et al. (Compressed Video Action Recognition) [sec(s) 2.2] “The need for efficient video storage and transmission has led to highly efficient video compression algorithms, such as MPEG-4, H.264, and HEVC, some of which date back to 1990s [20]. Most video compression algorithms leverage the fact that successive frames are usually very similar. We can efficiently store one frame by reusing contents from another frame and only store the difference. Most modern codecs split a video into I-frames (intra-coded frames), P-frames (predictive frames) and zero or more B-frames (bi-directional frames). I-frames are regular images and compressed as such. P-frames reference the previous frames and encode only the ‘change’. A part of the change – termed motion vectors – is represented as the movements of block of pixels from the source frame to the target frame at time t, which we denote by T(t). Even after this compensation for block movement, there can be difference between the original image and the predicted image at time t. We denote this residual difference by ∆(t).”) Regarding claim 3 Li teaches claim 2. wherein the determining the inputs includes: (See claim 1) Li further teaches determining whether the given frame is intra-coded; (Li [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; e.g., “Pixel” and “motion across frames” along with “compressed using H.264” read(s) on “given frame is intra-coded” since encoded frames are decoded before the object detection. For more details on intra-coded frames, please refer to Wu et al. 
(Compressed Video Action Recognition) [sec(s) 2.2] “The need for efficient video storage and transmission has led to highly efficient video compression algorithms, such as MPEG-4, H.264, and HEVC, some of which date back to 1990s [20]. Most video compression algorithms leverage the fact that successive frames are usually very similar. We can efficiently store one frame by reusing contents from another frame and only store the difference. Most modern codecs split a video into I-frames (intra-coded frames), P-frames (predictive frames) and zero or more B-frames (bi-directional frames). I-frames are regular images and compressed as such. P-frames reference the previous frames and encode only the ‘change’.”) selectively decoding encoded data for the given frame, including: (Li [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; e.g., “Pixel” and “motion across frames” along with “compressed using H.264” read(s) on “decoding” since encoded frames are decoded before the object detection.) if the given frame is intra-coded, decoding encoded data for the given frame to produce a reconstructed version of the given frame, wherein the reconstructed version of the given frame is part of the time series of reconstructed frames; or otherwise, the given frame not being intra-coded, selecting, from the time series of reconstructed frames, a reconstructed version of a previous frame to use for the given frame; (Li [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; e.g., “Pixel” and “motion across frames” along with “compressed using H.264” read(s) on “if the given frame is intra-coded, decoding encoded data for the given frame to produce a reconstructed version of the given frame” since encoded frames are decoded before the object detection. For more details on intra-coded frames, please refer to Wu et al. 
(Compressed Video Action Recognition) [sec(s) 2.2] “The need for efficient video storage and transmission has led to highly efficient video compression algorithms, such as MPEG-4, H.264, and HEVC, some of which date back to 1990s [20]. Most video compression algorithms leverage the fact that successive frames are usually very similar. We can efficiently store one frame by reusing contents from another frame and only store the difference. Most modern codecs split a video into I-frames (intra-coded frames), P-frames (predictive frames) and zero or more B-frames (bi-directional frames). I-frames are regular images and compressed as such. P-frames reference the previous frames and encode only the ‘change’.”) determining motion information for the given frame based at least in part on motion vector values decoded or derived from the encoded data; and (Li [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; e.g., “Pixel” and “motion across frames” along with “compressed using H.264” read(s) on “motion vector values decoded” since encoded frames are decoded before the object detection. For more details on motion vector, please refer to Wu et al. (Compressed Video Action Recognition) [sec(s) 2.2] “The need for efficient video storage and transmission has led to highly efficient video compression algorithms, such as MPEG-4, H.264, and HEVC, some of which date back to 1990s [20]. Most video compression algorithms leverage the fact that successive frames are usually very similar. We can efficiently store one frame by reusing contents from another frame and only store the difference. Most modern codecs split a video into I-frames (intra-coded frames), P-frames (predictive frames) and zero or more B-frames (bi-directional frames). I-frames are regular images and compressed as such. P-frames reference the previous frames and encode only the ‘change’. A part of the change – termed motion vectors – is represented as the movements of block of pixels from the source frame to the target frame at time t”) determining residual information for the given frame based at least in part on residual values decoded or derived from the encoded data. (Li [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. 
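For illustration only, the selective decoding recited in claim 3, together with the I-frame/P-frame split and the motion-vector T(t) and residual Δ(t) terms quoted from Wu, can be sketched as follows; parse_frame and decode_frame are hypothetical placeholders for a codec parser and decoder, not an actual API from the cited references:

```python
def update_time_series(encoded_frame, recon_series, motion_series, residual_series,
                       parse_frame, decode_frame):
    # Hypothetical sketch: fully decode only intra-coded (I) frames; for predicted
    # frames, reuse the most recent reconstruction and read motion vectors and
    # residuals directly from the parsed bitstream (assumes the sequence starts
    # with an I-frame so recon_series is never empty here).
    header = parse_frame(encoded_frame)
    if header.frame_type == "I":
        recon = decode_frame(encoded_frame)      # reconstructed version of the given frame
        recon_series.append(recon)
        motion, residual = None, None            # I-frames carry no inter-frame prediction data
    else:
        recon = recon_series[-1]                 # reuse a previous reconstructed frame
        motion = header.motion_vectors           # T(t): block displacements from the bitstream
        residual = header.residual               # Δ(t): prediction error from the bitstream
    motion_series.append(motion)
    residual_series.append(residual)
    return recon, motion, residual
```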
As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; e.g., “Pixel” and “motion across frames” along with “compressed using H.264” read(s) on “residual values decoded” since encoded frames are decoded before the object detection. For more details on residual values, please refer to Wu et al. (Compressed Video Action Recognition) [sec(s) 2.2] “The need for efficient video storage and transmission has led to highly efficient video compression algorithms, such as MPEG-4, H.264, and HEVC, some of which date back to 1990s [20]. Most video compression algorithms leverage the fact that successive frames are usually very similar. We can efficiently store one frame by reusing contents from another frame and only store the difference. Most modern codecs split a video into I-frames (intra-coded frames), P-frames (predictive frames) and zero or more B-frames (bi-directional frames). I-frames are regular images and compressed as such. P-frames reference the previous frames and encode only the ‘change’. A part of the change – termed motion vectors – is represented as the movements of block of pixels from the source frame to the target frame at time t, which we denote by T(t). Even after this compensation for block movement, there can be difference between the original image and the predicted image at time t. We denote this residual difference by ∆(t).”) Regarding claim 4 Li teaches claim 2. wherein the providing the inputs to the machine learning models, respectively, includes: (See claim 1) Li further teaches providing first input, from the time series of reconstructed frames, to a first machine learning model among the machine learning models, the first machine learning model having been trained to identify events in reconstructed frames; (Li [fig(s) 4] [fig(s) 6, 20] [table(s) 2] “On-camera tracking speed Server tracking speed” [sec(s) 1] “To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3). Clustering is done over all pairs of observed difference values (in the training set) and their highest feature values that hit the accuracy target. For each observed difference value, the camera selects the cluster in which the value falls and performs filtering using the average filtering threshold of that cluster. Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. 
Training typically takes several seconds to finish due to the simple models used.” [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”;) providing second input, from the time series of motion information, to a second machine learning model among the machine learning models, the second machine learning model having been trained to identify events in motion information; and (Li [fig(s) 4] [fig(s) 6, 20] [table(s) 2] “On-camera tracking speed Server tracking speed” [sec(s) 1] “To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3). Clustering is done over all pairs of observed difference values (in the training set) and their highest feature values that hit the accuracy target. For each observed difference value, the camera selects the cluster in which the value falls and performs filtering using the average filtering threshold of that cluster. Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used.” [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. 
In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; Please refer to Wu et al. (Compressed Video Action Recognition) for more details on motion information.) providing third input, from the time series of residual information, to a third machine learning model among the machine learning models, the third machine learning model having been trained to identify events in residual information. (Li [fig(s) 4] [fig(s) 6, 20] [table(s) 2] “On-camera tracking speed Server tracking speed” [sec(s) 1] “To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3). Clustering is done over all pairs of observed difference values (in the training set) and their highest feature values that hit the accuracy target. For each observed difference value, the camera selects the cluster in which the value falls and performs filtering using the average filtering threshold of that cluster. Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used.” [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 3] “Low-level features such as pixel or edge differences can be observed directly from raw images, but contain moderate amounts of noise.” [sec(s) 5.2] “Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi.” [sec(s) 4.2] “Area compares the size of the areas of motion across frames, but does not consider the distance that those areas move. In contrast, Edge is finer-grained and observes changes in the locations of the edges of objects.”; Please refer to Wu et al. (Compressed Video Action Recognition) for more details on residual information.) Regarding claim 5 Li teaches claim 1. Li further teaches wherein the determining the inputs is performed with decoding of less than all frames of the video sequence, thereby reducing resource utilization to determine the inputs. (Li [sec(s) 1] “We evaluated Reducto using multiple datasets of live video feeds covering 24 hours from 7 live traffic and surveillance cameras. We consider three classes of queries that track people and cars: tagging, object counting, and bounding box detection. 
Running on both Raspberry Pi and VM environments similar to commodity camera settings, we find that Reducto is able to filter out 51–97% of frames compared to traditional pipelines, resulting in bandwidth savings of 21–86%, 50-96% reductions in backend computation overheads, and 66–73% lower query response times. Importantly, in our experiments, Reducto achieves such filtering benefits while always meeting the specified query accuracy targets. Reducto also outperforms two recent video analytics systems: Reducto filters out 93% more frames than the FilterForward [23] edge filtering system, and achieves 37% more backend processing improvements than Chameleon [41]. Source code and experimental data for Reducto are available at https://github.com/reducto-sigcomm-2020/reducto.”;) Regarding claim 8 Li teaches claim 1. Li further teaches wherein the machine learning models have been trained using encoded data in a specific video codec format. (Li [fig(s) 4] [sec(s) 5] “We implemented Reducto on the Raspberry Pi using OpenCV [12] for feature extraction and frame differencing calculations, and a hash table lookup to make threshold selections and filtering decisions. Unfiltered frames were encoded using Raspberry Pi’s hardware-accelerated video encoder for the H.264 standard. As we did with the VM, we fed in each recorded video in Table 3 sequentially and in real-time to the Raspberry Pi. Overall, we observed that Reducto’s filtering results for each video identically matched those from our VM-based implementation (i.e., results in Figure 11). More importantly, Reducto was able to operate at 47.8 fps on the Raspberry Pi, highlighting the ability to perform real-time filtering. Digging deeper, we found that the bulk of the processing overheads were due to per-frame feature extraction with OpenCV; this task could operate at 99.7 fps, as compared to frame differencing calculations and hash table lookups that ran at 129.5 and 318.6 fps, respectively.”;) Regarding claim 10 Li teaches claim 1. Li further teaches wherein the set of event indicators for the given frame are: a single classification for the given frame, wherein different ones of the downstream tasks have been trained for different types of classification; or a score for each of multiple types of events, wherein different ones of the downstream tasks have been trained for different types of events. (Li [fig(s) 4] [fig(s) 6] “Car detection results for two sets of adjacent frames from the Southampton video; subcaptions list the corresponding differencing feature values. For bounding box detection queries, slight variations can change the query result; Edge picks up on these subtle changes (top) but Area does not. In contrast, counting queries are better served by Area, which reports significant differences when counts change (bottom), but not when counts stay fixed (top)” [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 1] “Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. 
We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 11 Li teaches claim 1. Li further teaches wherein the regulating the downstream tasks includes: selecting which of the downstream tasks, if any, to use for the given frame; or adjusting one or more of the downstream tasks for the given frame. (Li [fig(s) 4] [fig(s) 6] “Car detection results for two sets of adjacent frames from the Southampton video; subcaptions list the corresponding differencing feature values. For bounding box detection queries, slight variations can change the query result; Edge picks up on these subtle changes (top) but Area does not. In contrast, counting queries are better served by Area, which reports significant differences when counts change (bottom), but not when counts stay fixed (top)” [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 1] “Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 12 Li teaches claim 1. Li further teaches wherein the regulating the downstream tasks reduces overall resource utilization by the downstream tasks, and wherein the regulating the downstream tasks includes, for each given downstream task among the downstream tasks: (Li [fig(s) 4] [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 2] “We ran each technique on a new 10 minute clip from each camera under a sweep of resource configurations: a single core, 0.5-1.5 GHz CPU speed, and 128-1024 MB of RAM. Experiments were performed on a Macbook Pro laptop with a virtual machine that restricted resources to the specified parameters. Table 1 lists the filtering speeds in each setting. As shown, both NN models require at least 512 MB of RAM to operate, which precludes them from being used on many deployed cameras. Tiny YOLO is unable to achieve even 1 fps in any setting; note that even with the 11× speedup reported when also using background subtraction [36], Tiny YOLO is still far below real-time speeds. In contrast, when it has sufficient memory to run, the binary classification model consistently achieves real-time speeds, e.g., 28 fps and 86 fps with 0.5 GHz and 1.5 GHz processors, respectively. Further, pixel-based frame differencing is able to hit real-time speeds across all camera settings. 
Thus, the rest of the section focuses on frame differencing and binary classification (which is at least tenable in some camera settings). Filtering efficacy. Now that we have identified potential filtering candidates for our resource-constrained environment, we ask, how effective are they at filtering out frames? We discuss the two candidates, binary classification and frame differencing, in turn.” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.” [sec(s) 1] “On-camera filtering. In this paper, unlike prior filtering approaches that typically run on edge [54] or backend servers, we seek to filter frames at the beginning of the analytics pipeline – directly on cameras. Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81].”;) determining whether the given downstream task is to be used for the given frame; and (Li [sec(s) 1] “Instead, threshold selection must be adaptive and be done by cameras, online. To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3).” [sec(s) 2] “To make effective use of frame differencing, the key question is whether it is possible to correlate frame differences with pipeline accuracy so that we can make a more informed decision as to whether a frame can be filtered out. We answer this question affirmatively in the next sections, where we describe how lightweight differencing features across video frames can serve as cheap monitoring signals that are highly correlated with changes in query results. If applied judiciously (and dynamically), these strong correlations enable large filtering benefits that are even comparable to those with the ideal baseline described earlier in this section.” [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.”;) selectively performing the given downstream task for the given frame, including: if the given downstream task is to be used for the given frame, performing the given downstream task for the given frame; or otherwise, the given downstream task not being used for the given frame, skipping the given downstream task for the given frame. (Li [sec(s) 1] “Instead, threshold selection must be adaptive and be done by cameras, online. To overcome these challenges, we use lightweight machine learning techniques to predict, at a fine granularity on the camera (e.g., every few frames), which threshold to use for the selected feature. 
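Purely as an illustration, the per-frame regulation recited in claim 12 — determining whether each downstream task is to be used and either performing or skipping it — can be sketched as follows; the event names and the score threshold are hypothetical:

```python
def regulate_downstream(event_scores, tasks, min_score=0.5):
    # event_scores: mapping from an event type (e.g., "person", "vehicle") to the
    #   fused score produced for the given frame.
    # tasks: mapping from the same event types to callables that run the
    #   corresponding downstream task on the frame.
    results = {}
    for event_type, run_task in tasks.items():
        if event_scores.get(event_type, 0.0) >= min_score:
            results[event_type] = run_task()   # perform the downstream task for this frame
        else:
            results[event_type] = None         # skip the downstream task for this frame
    return results
```

A stricter threshold could be substituted when a resource-constraint indicator of the kind recited in claim 13 calls for lower utilization.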
To do this, we train a cluster-based model for each query and server-specified feature, based on the observation that there is a strong correlation between the thresholds of the feature and the query accuracy (see §4.3).” [sec(s) 2] “To make effective use of frame differencing, the key question is whether it is possible to correlate frame differences with pipeline accuracy so that we can make a more informed decision as to whether a frame can be filtered out. We answer this question affirmatively in the next sections, where we describe how lightweight differencing features across video frames can serve as cheap monitoring signals that are highly correlated with changes in query results. If applied judiciously (and dynamically), these strong correlations enable large filtering benefits that are even comparable to those with the ideal baseline described earlier in this section.” [sec(s) 4.4] “Once such a table entry is found, the camera uses the threshold (i.e., the entry’s value) to filter out frames in the segment. The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 13 Li teaches claim 1. Li further teaches accepting, as user input, a system resource constraint indicator, wherein the regulating the downstream tasks is also based at least in part on the system resource constraint indicator, whereby the downstream tasks operate within a range of acceptable resource utilization. (Li [fig(s) 4] [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 2] “We ran each technique on a new 10 minute clip from each camera under a sweep of resource configurations: a single core, 0.5-1.5 GHz CPU speed, and 128-1024 MB of RAM. Experiments were performed on a Macbook Pro laptop with a virtual machine that restricted resources to the specified parameters. Table 1 lists the filtering speeds in each setting. As shown, both NN models require at least 512 MB of RAM to operate, which precludes them from being used on many deployed cameras. Tiny YOLO is unable to achieve even 1 fps in any setting; note that even with the 11× speedup reported when also using background subtraction [36], Tiny YOLO is still far below real-time speeds. In contrast, when it has sufficient memory to run, the binary classification model consistently achieves real-time speeds, e.g., 28 fps and 86 fps with 0.5 GHz and 1.5 GHz processors, respectively. Further, pixel-based frame differencing is able to hit real-time speeds across all camera settings. Thus, the rest of the section focuses on frame differencing and binary classification (which is at least tenable in some camera settings). Filtering efficacy. Now that we have identified potential filtering candidates for our resource-constrained environment, we ask, how effective are they at filtering out frames? We discuss the two candidates, binary classification and frame differencing, in turn.” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 14 Li teaches claim 1. 
Li further teaches wherein the downstream tasks include a text or character recognition task, a face detection task, a person detection task, a vehicle detection task, an object detection task for another type of object, a face tracking task, a person tracking task, a vehicle tracking task, an object tracking task for another type of object, and/or an action recognition task for a type of action. (Li [fig(s) 4] [fig(s) 6] “Car detection results for two sets of adjacent frames from the Southampton video; subcaptions list the corresponding differencing feature values. For bounding box detection queries, slight variations can change the query result; Edge picks up on these subtle changes (top) but Area does not. In contrast, counting queries are better served by Area, which reports significant differences when counts change (bottom), but not when counts stay fixed (top)” [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 1] “Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 15 Li teaches claim 1. Li further teaches wherein the downstream tasks are performed on a different computer system connected over a network to the computer system that implements the computer vision tool. (Li [fig(s) 4] “Reducto Camera”, “Reducto Server” [sec(s) 1] “Running on both Raspberry Pi and VM environments similar to commodity camera settings, we find that Reducto is able to filter out 51–97% of frames compared to traditional pipelines, resulting in bandwidth savings of 21–86%, 50-96% reductions in backend computation overheads, and 66–73% lower query response times” [sec(s) 4.1] “The key question at this point is how to know, for each pair of consecutive frames, if the difference between them is sufficiently insignificant so that if the camera sends only the first one to the server (which reuses its query result for the second), the accuracy would not drop below the target” [sec(s) 4.4] “The remaining frames are compressed using H.264 at the original video’s bitrate, and sent to the server.” [sec(s) 5.2] “Second, as segment size increases, bandwidth savings increase. This is because larger segments enable more aggressive bandwidth savings from standard video encodings: more frames can avoid redundant transmission due to fewer key frames.”;) Regarding claim 16 Li teaches claim 1. 
Li further teaches for a subsequent frame of the video sequence, as the given frame, repeating the receiving, the using, the determining, and the regulating on a frame-by-frame basis; or for the subsequent frame of the video sequence, performing the receiving, the using, the determining, and/or the regulating for the subsequent frame concurrent with the same operation or operations for the given frame. (Li [fig(s) 4] [fig(s) 6] “Car detection results for two sets of adjacent frames from the Southampton video; subcaptions list the corresponding differencing feature values. For bounding box detection queries, slight variations can change the query result; Edge picks up on these subtle changes (top) but Area does not. In contrast, counting queries are better served by Area, which reports significant differences when counts change (bottom), but not when counts stay fixed (top)” [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 1] “Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 17 The claim is a computer-readable medium claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 18 The claim is a system claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim. Regarding claim 19 Li teaches claim 18. Li further teaches downstream tools configured to perform operations for the downstream tasks, respectively. (Li [fig(s) 4] [fig(s) 6] “Car detection results for two sets of adjacent frames from the Southampton video; subcaptions list the corresponding differencing feature values. For bounding box detection queries, slight variations can change the query result; Edge picks up on these subtle changes (top) but Area does not. In contrast, counting queries are better served by Area, which reports significant differences when counts change (bottom), but not when counts stay fixed (top)” [fig(s) 9] “detection of different objects (i.e., people and car)” [sec(s) 1] “Like edge server approaches, on-camera filtering has the potential to alleviate not only backend computation overheads (by reducing the number of frames that must be processed by the backend object detector), but also end-to-end network bottlenecks between cameras and backend servers, particularly for wireless cameras [23, 33, 81]. Furthermore, an on-camera approach can also sidestep the management and cost overheads of operating edge servers [52, 63]. 
We note that in targeting on-camera filtering, our aim is to eliminate the reliance on edge servers for filtering by making use of currently unused resources. Our on-camera filtering techniques could also run on edge servers (if present), outperforming existing strategies while consuming fewer resources (§5).” [sec(s) 5.1] “We ran each query class across our entire video dataset for two types of objects: people and cars.”;) Regarding claim 20 The claim is a system claim corresponding to a combination of the method claims 7 and 9, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the combination of the method claims. Claim Rejections - 35 USC § 103 In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. Claim(s) 6-7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics) in view of Sun et al. (TDViT: Temporal Dilated Video Transformer for Dense Video Tasks) Regarding claim 6 Li teaches claim 1. Li further teaches wherein each of the machine learning models uses: a two-dimensional convolutional neural network; a [three-dimensional] convolutional neural network; a video transformer; or a [temporal dilated] video transformer. (Li [fig(s) 4] [fig(s) 6, 20] [sec(s) 1] “Note that such models are cheap regression models that can run in real time even under the camera’s tight resource constraints.” [sec(s) 4.1] “The profiler ➊ then runs traditional pipelines ➌ on that video and stores the results for subsequent feature selection. … As this characterization data is collected, the profiler processes each frame in the video to extract the three low-level differencing features presented in §3. 
Our observation (Figure 5) is that there often exists a single feature that works the best for a query class across different videos, cameras, and accuracy targets. Hence, during profiling, the server finds the best feature for each query class that it wishes to support. At the end of this phase, the best feature for each query class is identified and stored at the server. … To answer this question, the server uses a model trainer ➋ that quickly trains, for each query, a simple (regression) model characterizing the relationships between differencing values, filtering thresholds, and query result accuracy. The model is trained by performing K-means-based clustering over the original frames sent by the camera during a short period after the query arrives. Training typically takes several seconds to finish due to the simple models used. The generated model is encoded as a hash table, where each entry represents a cluster of differencing values whose corresponding thresholds are within the same neighborhood — each key is the average differencing value and each value is the threshold for that cluster which delivers the required accuracy. Together with the selected feature, this hash table is also sent to the camera for each query.” [sec(s) 5.4] “Reducto used the more expensive Faster R-CNN model, which Chameleon treats as ground truth.”;) However, Li does not appear to explicitly teach: a [three-dimensional] convolutional neural network; or a [temporal dilated] video transformer. Sun teaches a temporal dilated video transformer. (Sun [fig(s) 2] “Patch Embedding”, “TDTB x2”, “Patch Merging”, “(a) Overview. TDViT contains four stages which consist of several temporal dilated transformer blocks (TDTB). A memory structure (purple cuboids) is introduced into the TDTB, which stores features of previous frames (yellow rectangles) and enables our approach to dynamically establish temporal connections. The temporal dilation factor Dt is used to control the memory sampling process and reduce the video redundancy. (b) Details of a TDTB. For every time step, the query tokens are from the current frame It but the key and value tokens are derived from memory sampling. Finally, the memory structure saves the output features and deletes the oldest features. Best viewed in colour. (Color figure online)” [sec(s) 1] “In this paper, we propose a Temporal Dilated Video Transformer (TDViT) whose overall architecture is shown in Fig. 1(c). Inspired by visual transformers [27,32,42] which are naturally suitable for sequence modelling, we exploit transformers to extract spatiotemporal features. Unlike most video models that are based on self-attention transformers, we present a novel temporal dilated transformer block (TDTB) with two distinct designs.”;) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Li with the temporal dilated video transformer of Sun. One of ordinary skill in the art would have been motivated to combine in order to obtain accurate multi-frame spatiotemporal representations with a single-frame computational cost and effectively sample useful information from redundant frames. Moreover, the system captures long-range temporal correlations, further improving accuracy, by using hierarchical TDTBs.
(Sun [sec(s) 5] “The key component in TDViT is the temporal dilated transformer block (TDTB), which can obtain accurate multi-frame spatiotemporal representations with a single-frame computational cost and effectively sample useful information from redundant frames. Moreover, by using hierarchical TDTBs, our approach can capture long-range temporal correlations, further improving accuracy.”) Regarding claim 7 Li teaches claim 1. However, Li does not appear to explicitly teach: wherein one of the machine learning models uses a temporal dilated video transformer, the temporal dilated video transformer comprising: an initial stage, the initial stage having a patch embedding layer and an initial set of temporal dilated transformer blocks; and a set of successive stages, each of the set of successive stages having a patch merging layer and a successive set of temporal dilated transformer blocks. Sun teaches wherein one of the machine learning models uses a temporal dilated video transformer, the temporal dilated video transformer comprising: (Sun [sec(s) 1] “In this paper, we propose a Temporal Dilated Video Transformer (TDViT) whose overall architecture is shown in Fig. 1(c). Inspired by visual transformers [27,32,42] which are naturally suitable for sequence modelling, we exploit transformers to extract spatiotemporal features. Unlike most video models that are based on self-attention transformers, we present a novel temporal dilated transformer block (TDTB) with two distinct designs.”;) an initial stage, the initial stage having a patch embedding layer and an initial set of temporal dilated transformer blocks; and (Sun [fig(s) 2] “Patch Embedding”, “TDTB x2”, “Patch Merging”, “(a) Overview. TDViT contains four stages which consist of several temporal dilated transformer blocks (TDTB). A memory structure (purple cuboids) is introduced into the TDTB, which stores features of previous frames (yellow rectangles) and enables our approach to dynamically establish temporal connections. The temporal dilation factor Dt is used to control the memory sampling process and reduce the video redundancy. (b) Details of a TDTB. For every time step, the query tokens are from the current frame It but the key and value tokens are derived from memory sampling. Finally, the memory structure saves the output features and deletes the oldest features. Best viewed in colour. (Color figure online)” [sec(s) 1] “In this paper, we propose a Temporal Dilated Video Transformer (TDViT) whose overall architecture is shown in Fig. 1(c). Inspired by visual transformers [27,32,42] which are naturally suitable for sequence modelling, we exploit transformers to extract spatiotemporal features. Unlike most video models that are based on self-attention transformers, we present a novel temporal dilated transformer block (TDTB) with two distinct designs.”;) a set of successive stages, each of the set of successive stages having a patch merging layer and a successive set of temporal dilated transformer blocks. (Sun [fig(s) 2] “Patch Embedding”, “Patch Merging”, “TDTB x2”, “(a) Overview. TDViT contains four stages which consist of several temporal dilated transformer blocks (TDTB). A memory structure (purple cuboids) is introduced into the TDTB, which stores features of previous frames (yellow rectangles) and enables our approach to dynamically establish temporal connections. The temporal dilation factor Dt is used to control the memory sampling process and reduce the video redundancy. (b) Details of a TDTB. 
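For illustration only, a temporal dilated transformer block in the spirit of Sun's TDTB — queries drawn from the current frame's tokens, keys and values sampled from a memory of past frame features at a temporal dilation Dt — can be sketched as follows; the dimensions, dilation factor, and memory length are hypothetical, and this is a simplified reading of the cited figure rather than Sun's implementation:

```python
import torch
import torch.nn as nn
from collections import deque

class TemporalDilatedBlock(nn.Module):
    def __init__(self, dim=96, heads=3, memory_len=8, dilation=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.memory = deque(maxlen=memory_len)   # stores token features of recent frames
        self.dilation = dilation                 # temporal dilation factor Dt

    def forward(self, tokens):                   # tokens: (batch, num_tokens, dim) for frame t
        self.memory.append(tokens.detach())      # memory includes the current frame in this sketch
        sampled = list(self.memory)[::-self.dilation]          # every Dt-th stored frame, newest first
        keys_values = self.norm_kv(torch.cat(sampled, dim=1))  # concatenate sampled frames' tokens
        attended, _ = self.attn(self.norm_q(tokens), keys_values, keys_values)
        x = tokens + attended                    # residual connection around the attention
        return x + self.mlp(self.norm_mlp(x))    # feed-forward with a second residual connection
```

Stacking such blocks between patch-embedding and patch-merging stages would correspond loosely to the hierarchical arrangement recited in claim 7.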
Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (Reducto: On-Camera Filtering for Resource-Efficient Real-Time Video Analytics) in view of Chen et al. (CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification).

Regarding claim 9, Li teaches claim 1.

However, Li does not appear to explicitly teach: wherein the fusing the results from the machine learning models uses a cross-attention layer.

Chen teaches wherein the fusing the results from the machine learning models uses a cross-attention layer. (Chen [fig(s) 2] “Cross-Attention ⨉L” [sec(s) 3] “More specifically, we first introduce a dual-branch ViT where each branch operates at a different scale (or patch size in the patch embedding) and then propose a simple yet effective module to fuse information between the branches. Figure 2 illustrates the network architecture of our proposed Cross-Attention Multi-Scale Vision Transformer (CrossViT). Our model is primarily composed of K multiscale transformer encoders where each encoder consists of two branches: (1) L-Branch: a large (primary) branch that utilizes coarse-grained patch size (Pl) with more transformer encoders and wider embedding dimensions, (2) S-Branch: a small (complementary) branch that operates at fine-grained patch size (Ps) with fewer encoders and smaller embedding dimensions. Both branches are fused together L times and the CLS tokens of the two branches at the end are used for prediction. Note that for each token of both branches, we also add a learnable position embedding before the multi-scale transformer encoder for learning position information as in ViT [11].”;)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Li with the cross-attention fusion of Chen. One of ordinary skill in the art would have been motivated to combine in order to demonstrate that the system performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. (Chen [sec(s) 5] “With extensive experiments, we demonstrate that our proposed model performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models.”)
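As a rough illustration of fusing two branches with a cross-attention layer, in the spirit of the Chen passage above, the sketch below lets one branch's CLS token attend to the other branch's tokens and carries the result back through a residual connection. It is a toy placeholder under assumptions (a single shared embedding dimension, no projection layers, no repeated fusion), not CrossViT's implementation; all names are hypothetical.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse two token sequences: the primary branch's CLS token (query) attends
    to all tokens of the other branch (keys/values)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, primary_tokens, other_tokens):
        cls = self.norm_q(primary_tokens[:, :1, :])   # (B, 1, dim) CLS query
        kv = self.norm_kv(other_tokens)               # (B, M, dim) keys/values
        fused_cls, _ = self.attn(cls, kv, kv)
        # Residual update of the CLS token only; patch tokens pass through.
        return torch.cat([primary_tokens[:, :1, :] + fused_cls,
                          primary_tokens[:, 1:, :]], dim=1)

# Example: fuse a coarse-patch branch with a fine-patch branch (shared dim here).
fuse = CrossAttentionFusion(dim=256)
coarse = torch.randn(2, 1 + 49, 256)    # CLS + 7x7 patch tokens
fine = torch.randn(2, 1 + 196, 256)     # CLS + 14x14 patch tokens
fused = fuse(coarse, fine)              # (2, 50, 256)
```

In the claim-9 context, the same pattern could in principle fuse the outputs of different machine learning models rather than two patch scales, with each model's output serialized as a token sequence.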
Prior Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. KUMAR et al. (US 2023/0171407 A1) teaches decoding, by a video decoder, encoded video data for a frame of video. Simonyan et al. (Two-Stream Convolutional Networks for Action Recognition in Videos) teaches a two-stream ConvNet architecture which incorporates spatial and temporal networks.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Fri 9:00 AM - 5:00 PM.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SEHWAN KIM/
Examiner, Art Unit 2129
1/20/2026

/MICHAEL J HUNTLEY/
Supervisory Patent Examiner, Art Unit 2129

Prosecution Timeline

Jun 13, 2023
Application Filed
Jan 20, 2026
Non-Final Rejection — §101, §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602595
SYSTEM AND METHOD OF USING A KNOWLEDGE REPRESENTATION FOR FEATURES IN A MACHINE LEARNING CLASSIFIER
2y 5m to grant Granted Apr 14, 2026
Patent 12602580
Dataset Dependent Low Rank Decomposition Of Neural Networks
2y 5m to grant Granted Apr 14, 2026
Patent 12602581
Systems and Methods for Out-of-Distribution Detection
2y 5m to grant Granted Apr 14, 2026
Patent 12602606
APPARATUSES, COMPUTER-IMPLEMENTED METHODS, AND COMPUTER PROGRAM PRODUCTS FOR IMPROVED GLOBAL QUBIT POSITIONING IN A QUANTUM COMPUTING ENVIRONMENT
2y 5m to grant Granted Apr 14, 2026
Patent 12541722
MACHINE LEARNING TECHNIQUES FOR VALIDATING AND MUTATING OUTPUTS FROM PREDICTIVE SYSTEMS
2y 5m to grant Granted Feb 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

1-2
Expected OA Rounds
60%
Grant Probability
99%
With Interview (+65.6%)
4y 1m
Median Time to Grant
Low
PTA Risk
Based on 144 resolved cases by this examiner. Grant probability derived from career allow rate.
