Last updated: May 29, 2026
Application No. 17/676,153
SYSTEM AND METHOD FOR AUTOMATED DETECTION OF SITUATIONAL AWARENESS

Final Rejection: §101, §103
Filed: Feb 19, 2022
Priority: Aug 20, 2018 (provisional 62/719,849, +4 more)
Examiner: KIM, SEHWAN
Art Unit: 2129
Tech Center: 2100 (Computer Architecture & Software)
Assignee: Genesis Intelligence LLC
OA Round: 6 (Final)
Interview: Optional

An interview has already been conducted in this application's prosecution history. This examiner has a 60% grant rate with a +65.9% interview lift. Since an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 146 resolved cases, 2023–2026
Examiner Intelligence

KIM, SEHWAN
Career Allowance Rate: 60% of resolved cases (87 granted / 146 resolved), +4.6% vs TC avg
Interview Lift: +65.9% (resolved cases with interview)
Avg Prosecution: 4y 0m (26 currently pending)
Total Applications: 180 (across all art units)
Statute-Specific Performance (based on career data from 146 resolved cases)

§101: 5.2% (-34.8% vs TC avg)
§103: 87.7% (+47.7% vs TC avg)
§102: 2.2% (-37.8% vs TC avg)
§112: 4.7% (-35.3% vs TC avg)
Office Action

§101 §103
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner’s Note
The Examiner encourages Applicant to schedule an interview to discuss issues related to, for example, the rejections noted below under 35 U.S.C. § 101 and § 103, for moving toward allowance.
Applicant is strongly requested to provide, in the Remarks, supporting paragraph(s) for each limitation of amended/new claim(s), for clear and definite claim interpretation by the Examiner (e.g., to avoid rejections under 35 U.S.C. § 112(a), “Lack of written description”).
Applicant can schedule an interview at any stage of the prosecution (e.g., Non-Final, Final, and After-Final) to discuss any issues related to, for example, rejections under 35 U.S.C. § 101 and § 103, for moving toward allowance.

Priority
Acknowledgment is made of applicant's claim for the provisional application filed on 08/20/2018.

Response to Arguments
Applicant's arguments filed on 03/05/2026 have been fully considered but they are not persuasive.
In Remarks, pp. 17-35, Applicant contends: 
Therefore, just as Ex parte Desjardins brought forth the improvements of "effectively [learning] new tasks in succession whilst protecting knowledge about previous tasks," "[using] less of their storage capacity," and "[reducing] system complexity," the present invention combines multiple machine learning models to "minimize bias and variance" and/or "broaden the applicability of the solution."
…
One of ordinary skill in the art, then, would realize that mapping each step to one of the three layers based on the task objective would provide the requisite processing power for each step while minimizing the delay of the workflow, as reinforced by paragraph [0128], " ... the three layer computing infrastructure (cloud, gateway, sensors) may provide flexibility and adaptability for the entire workflow."
…
Paragraph [0075] then identifies the improvements provided by the combination of models of the claimed invention, stating that "[m]odel output fusion block 426 may obtain or receive features extracted from a plurality of models and may generate a combination, ensemble, or fusion of the features that may provide better accuracy, reliability, confidence, etc. over the features extracted from individual models." Paragraph [0112] further identifies the improvements provided by combination models, teaching that "[f]or any given situation, Selector 452 may not be constrained to using a single model, but may activate a combination of models for ensemble learning, for example, to minimize bias and variance. Embodiments may use various tools to determine models to combine. For example, embodiments may use cosine similarity, in which the results from different models are represented on a normalized vector space."
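
As an illustration only of the quoted passage from paragraph [0112] (results from different models represented on a normalized vector space and compared by cosine similarity), the comparison could be sketched as follows. This sketch is not taken from the application: the model names, the output vectors, and the helper `pairwise_similarity` are hypothetical, and the quoted text does not say whether similar or dissimilar models would be combined, so the sketch only computes the similarities.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pairwise_similarity(outputs):
    """Map each unordered pair of model names to the cosine similarity
    of their output vectors (hypothetical helper for illustration)."""
    names = sorted(outputs)
    return {
        (m, n): cosine_similarity(outputs[m], outputs[n])
        for i, m in enumerate(names)
        for n in names[i + 1:]
    }


# Hypothetical output vectors for three models on the same input.
model_outputs = {
    "model_a": [0.9, 0.1, 0.0],
    "model_b": [0.8, 0.2, 0.0],
    "model_c": [0.0, 0.1, 0.9],
}
similarities = pairwise_similarity(model_outputs)
```

In this toy data, `model_a` and `model_b` point in nearly the same direction, so their similarity is close to 1, while `model_a` and `model_c` are nearly orthogonal.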

Examiner’s response:
The examiner understands the applicant’s assertion. 
However, it appears that each processing step is just applying the abstract idea to a general field of endeavor with additional elements. In addition, improvements to technology or technical field are not necessarily reflected in the claims. Thus, the claim does not integrate the judicial exception into a practical application, and the claim does not amount to significantly more than the judicial exception.
In addition, note that claim eligibility under 35 U.S.C. § 101 is evaluated and determined based on the MPEP, not based on a comparison to other cases (e.g., the cases that the Applicant mentioned in the Remarks).

Step 2A: Prong One
The examiner understands the applicant’s assertion.
However, it appears that the limitations are mental processes, insignificant extra-solution activities, and generic computing components along with other additional elements. In addition, it appears that the alleged improvement is broad and not clear, and details of how the claims may achieve the alleged improvement are missing and/or not clear.

Step 2A: Prong Two
The examiner understands the applicant’s assertion “Therefore, just as Ex parte Desjardins brought forth the improvements of "effectively [learning] new tasks in succession whilst protecting knowledge about previous tasks," "[using] less of their storage capacity," and "[reducing] system complexity," the present invention combines multiple machine learning models to "minimize bias and variance" and/or "broaden the applicability of the solution."”
As Applicant pointed out, Desjardins clearly states the technical improvements of the whole invention in the specification. However, unlike Desjardins, the alleged improvements of the present invention are about "[minimizing] bias and variance," "[broadening] the applicability of the solution," "[providing] flexibility and adaptability," and/or minimizing the delay of the workflow. Note that “variance” is stated only in par 112, and “bias” is stated only in par 112 as a specific feature for “Model Combination”. Thus, it is not clear if [minimizing] bias and variance (e.g., even along with the cosine similarity) is the key improvement of the whole invention. In addition, it is not clear if “mapping each step to one of the three layers based on the task objective would provide the requisite processing power for each step while minimizing the delay of the workflow” is an improvement since it may be known that using some layers instead of all layers may reduce a delay. Furthermore, it appears that "[broadening] the applicability of the solution" and "[providing] flexibility and adaptability" are not actual improvements of the whole invention, but just improvements for the abstract ideas and/or just advantages of the technologies.

The examiner understands the applicant’s assertion “One of ordinary skill in the art, then, would realize that mapping each step to one of the three layers based on the task objective would provide the requisite processing power for each step while minimizing the delay of the workflow, as reinforced by paragraph [0128], " ... the three layer computing infrastructure (cloud, gateway, sensors) may provide flexibility and adaptability for the entire workflow."”
However, par 128 just states “the three layer computing infrastructure (cloud, gateway, sensors) may provide flexibility and adaptability for the entire workflow”. Even though par 131 explains mapping on different layers, it is not clear if flexibility and adaptability are actual improvements based on the mapping process. Rather, it appears that the three layer computing infrastructure (cloud, gateway, sensors) itself provides flexibility and adaptability. It does not appear that the specification clearly states that the mapping process provides flexibility and adaptability for the entire workflow. Even assuming, arguendo, they are considered improvements based on the mapping process, the alleged improvements are just results of the mapping process for the three layers, but not actual improvements based on the whole invention. 
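
For readers unfamiliar with the mapping under discussion, the following is a minimal sketch of delegating tasks to the three layers named in the claim (sensors, gateway, cloud). It is not the applicant's implementation: the `Task` type, the task kinds, and the lookup table are hypothetical, chosen only to mirror the claim language that the gateway layer is suitable for executing a neural network and the cloud layer for training and/or simulation tasks.

```python
from dataclasses import dataclass

# The three layers recited in the claim.
LAYERS = ("sensors", "gateway", "cloud")


@dataclass(frozen=True)
class Task:
    name: str
    kind: str  # hypothetical categories: "capture", "inference", "training", "simulation"


# Hypothetical mapping from task kind to layer; the record does not
# specify how complexity is measured or where the boundaries lie.
LAYER_FOR_KIND = {
    "capture": "sensors",     # acquiring raw data at the edge
    "inference": "gateway",   # executing a neural network
    "training": "cloud",      # training the neural network
    "simulation": "cloud",    # executing simulation tasks
}


def select_layer(task):
    """Return the layer a task would be delegated to under this sketch."""
    try:
        return LAYER_FOR_KIND[task.kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {task.kind!r}") from None


workflow = [
    Task("read camera frames", "capture"),
    Task("detect entities", "inference"),
    Task("retrain detector", "training"),
]
assignments = {task.name: select_layer(task) for task in workflow}
```

A static lookup of this kind shows why the examiner asks which part of the arrangement, the infrastructure itself or the mapping step, is asserted to produce the flexibility and adaptability.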

The examiner understands the applicant’s assertion “Paragraph [0075] then identifies the improvements provided by the combination of models of the claimed invention, stating that "[m]odel output fusion block 426 may obtain or receive features extracted from a plurality of models and may generate a combination, ensemble, or fusion of the features that may provide better accuracy, reliability, confidence, etc. over the features extracted from individual models."”
However, it is not clear if “better accuracy” based on the model combination is the key improvement of the whole invention. In addition, it is not clear if it is an improvement since it may be known that an ensemble model may provide better accuracy compared to individual models. Thus, it appears that the better accuracy is not an actual improvement of the whole invention, but an advantage of the known technologies.
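
The general effect referred to here, that combining several models can give more stable predictions than any single model, can be shown with a toy simulation. This is a sketch under stated assumptions, not anything from the record: each hypothetical model is taken to predict a true value plus independent Gaussian noise, and the function name `simulate` and all parameter values are illustrative.

```python
import random
import statistics


def simulate(n_models, n_trials, noise_sd, seed=0):
    """Compare the variance of one noisy predictor with the variance of
    the average of n_models independent noisy predictors."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_value = 1.0
    single, ensemble = [], []
    for _ in range(n_trials):
        # Each model's prediction is the true value plus independent noise.
        preds = [true_value + rng.gauss(0, noise_sd) for _ in range(n_models)]
        single.append(preds[0])
        ensemble.append(statistics.fmean(preds))
    return statistics.pvariance(single), statistics.pvariance(ensemble)


single_var, ensemble_var = simulate(n_models=5, n_trials=2000, noise_sd=0.5)
```

Under the independence assumption, the variance of the average falls roughly in proportion to the number of models. Correlated errors would reduce this benefit, and averaging does not by itself remove a bias shared by all models.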

Step 2B
The examiner understands the applicant’s assertion.
However, as rejected under Claim Rejections - 35 USC § 101, individually and/or as a whole, the limitations have been considered a combination of abstract ideas and different types of additional elements. It appears that the alleged benefits are just what the claim does, but not actual improvements. Currently, it is not clear i) how to resolve the technological problems presented by the conventional approaches, and ii) how to overcome the previous technical limitations based on the current inventive concept along with the latest claim set. Even assuming, arguendo, they are considered improvements, still it is not clear how the claims reflect the alleged improvements. Basically, it appears that the alleged improvements are broad and not clear, and details of how the claims may achieve the alleged improvements are missing and/or not clear.

The limitations do not clearly show, e.g., improvements in computer technology or improvements to other technical fields. It does not seem that the specification and/or the independent claims clearly show how the inventive concept of the claims enables improvements and how they are tied together. The applicant may need to amend the claims to show how the claim language and the improvements are tied together.
To find a valid improvement to a technology, MPEP 2106.04(d)(1) says the specification must explain the improvement and that the claim must reflect the disclosed improvement. Furthermore, the improvement should not be merely a consequence of the abstract idea. See MPEP 2106.05(a). An improvement in the abstract idea itself is not an improvement to technology. For example, describing a practical application may help overcome the current rejections. 
For at least these reasons, Applicant's arguments are not convincing.

The Examiner encourages Applicant to schedule an interview to discuss issues related to the rejections under 35 U.S.C. § 101.

Applicant’s arguments regarding 35 USC § 103 with respect to the independent claims have been considered but are moot because the arguments are directed to amended limitation(s) that has/have not been previously examined.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1
The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Step 1: The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1: 
The limitations of 
“…, the method comprising: 
…;
detecting at least one entity from the at least one sequence of images, the detecting the at least one entity including selecting training data sources, selecting image recognition sources, and/or …;
generating, …, an action prediction based on the at least one sequence of images, the action prediction predicting at least one probable action to be performed by an actor shown in the at least one sequence of images, the action prediction generated from the at least one sequence of images …; 
selecting, …, computing infrastructure upon which to execute tasks, the selected computing infrastructure including delegating the tasks to a plurality of layers depending on complexity of a task, …;
wherein the action prediction further includes automatically inferring regular patterns of activity and/or detecting outliers using process mining or outlier detection …;
…;
determining, …, an action responsive to the action prediction relating to tactical intent, determined using an intent extractor and actuator to infer the action responsive to the action prediction relating to tactical intent;
generating, …, a narrativization of the data characterizing the event captured in the data … wherein the combination models are determined based on a cosine similarity between the multiple models;
…, determining a strategic intent of the actor shown in the at least one sequence of images based on the action prediction relating to strategic intent, and the obtained ontology information, and 
determining an action responsive to the action prediction relating to strategic intent, determined using an intent extractor and actuator to infer an action responsive to the action prediction relating to strategic intent;
performing an action responsive to the determined strategic intent of the actor”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper).

If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites additional elements (“implemented in a computer system comprising a processor, memory accessible by the processor, and computer program instructions stored in the memory and executable by the processor”, “at the computer system”, “using at least one of an image object recognition model, an image movement recognition model, an image facial recognition model, an image situation recognition model, a video object recognition model, a video movement recognition model, a video facial recognition model, and a video situation recognition model”, “from the at least one ML model”, “using at least one model for each type of the data”). The device and the user interface in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
In particular, the claim recites an additional element(s) (“stored”). The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of storing data is recited at a high-level of generality (i.e., as a generic act of storing data) such that it amounts to no more than a mere act to apply the exception using a generic act of storing. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
In particular, the claim recites an additional element(s) (“receiving, at the computer system, data capturing an event, the received data capturing the event including at least one sequence of images”, “obtaining ontology information based on the generated narrativization and the detected at least one entity”). The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The act of receiving/obtaining data is recited at a high-level of generality (i.e., as a generic act of receiving/obtaining data) such that it amounts to no more than a mere act to apply the exception using a generic act of receiving/obtaining. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
In particular, the claim recites an additional element(s) (“training at least one machine learning (ML) model”). The additional element is recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
In particular, the claim recites an additional element (“the plurality of layers comprising a sensors layer, a gateway layer, and a cloud layer, the gateway layer is equipped with a computing capability suitable for executing a neural network, the cloud layer is equipped with a computing capability suitable for training the neural network and/or executing simulation tasks”, “wherein the action prediction is operable to be related to tactical intent or related to strategic intent”, “each of the at least one model for each type of the data stored in a models database comprising at least some of text models including at least one of enrichment models adapted to improve or refine text data, encoder models adapted to reduce input dimensions and compress input data into an encoded representation, and decoder models adapted to decompress encoded data into a plaintext representation, image models including at least one of inception models adapted to filter images with multiple sizes operating on a same level, captioning models adapted to automatically caption images, and segmentation models adapted to divide images into different portions for processing or based on objects in each segment, video models, audio models adapted to provide encoding and decoding and functions including at least one of machine translation, text summarization, conversational modeling, and image captioning, and combination models adapted to use multiple models to obtain better predictive performance than could be obtained from any of models individually, …, wherein the combination models are operable to reduce bias and variance of the generated narrativization”). This is a recitation of a particular type or source of data/model to be used in performing the abstract idea. 
Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h)

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. See MPEP 2106.05(f)
As discussed above, the claim recites the additional element(s) of storing data at a high-level of generality and is adding an insignificant extra-solution activity – see MPEP 2106.05(g) – storing data. However, the addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible.
In particular, the claim recites an additional element(s) of receiving data at a high-level of generality. The claim is adding an insignificant extra-solution activity to the judicial exception – see MPEP 2106.05(g). The addition of insignificant extra-solution activity does not amount to an inventive concept, particularly when the activity is well-understood, routine, and conventional. See MPEP 2106.05(d)(II) – “Receiving or transmitting data over a network” or “Storing and retrieving information in memory”. Accordingly, this additional element does not provide an inventive concept and significantly more than the abstract idea. Thus, the claim is not patent eligible.
The additional elements regarding training are recited at such a high level without any details as to how a model is trained such that it amounts to only the idea of a solution or outcome because it fails to recite details of how a solution to a problem is accomplished, and, therefore, represents no more than mere instructions to apply the judicial exception on a computer (see MPEP 2106.05(f)). Accordingly, this additional element does not amount to significantly more than the abstract idea. The claim is directed to an abstract idea.
This is a recitation of a particular type or source of data/model to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h).

Regarding claim 2
The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Step 1: The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 1.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
In particular, the claim recites an additional element (“the data capturing an event comprises at least one of image data, video data, text data, audio data, and sensor data”). This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h)
 
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h).

Regarding claim 3
The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Step 1: The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1: The claim recites the abstract idea identified above regarding claim 2.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
In particular, the claim recites an additional element (“the data capturing an event comprises at least one of real-time data relating to events occurring contemporaneously and stored data relating events that occurred in a past”). This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not integrate the abstract idea into a practical application. See MPEP 2106.05(h)
 
Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
This is a recitation of a particular type or source of data to be used in performing the abstract idea. Limiting the abstract idea to a particular type or source of data is an attempt to limit the abstract idea to a particular field of use or technological environment, which does not amount to significantly more than the abstract idea. See MPEP 2106.05(h).

Regarding claim 4
The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Step 1: The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1: 
The limitations of 
“generating the narrativization comprises at least one of:
captioning, …, image data,
captioning, …, video data,
recognizing, …, speech included in audio data,
generating summary data, …, characterizing text data, and
generating summary data, …, characterizing sensor data”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper).

If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites an additional element (“at the computer system”). The device and the user interface in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. See MPEP 2106.05(f)

Regarding claim 5
The claim is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
Step 1: The claim recites a method; therefore, it falls into the statutory category of processes.
Step 2A Prong 1: 
The limitations of 
“detecting the detected at least one entity comprises at least one of:
detecting, …, the at least one entity comprising at least one of an object, activity, situation, and person from image data …;
detecting, …, the at least one entity comprising at least one of an object, activity, situation, and person from video data …;
detecting, …, the at least one entity comprising at least one of an object, activity, situation, and person from audio data …;
detecting, …, the at least one entity comprising at least one of an object, activity, situation, and person from text data …; and
detecting, …, the at least one entity comprising at least one of an object, activity, situation, and person from sensor data …”, as drafted, are a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. That is, nothing in the claim element precludes the step from practically being performed in the mind. For example, the limitations in the context of this claim encompass the user mentally thinking with a physical aid (e.g., pencil and paper).

If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.

Step 2A Prong 2: This judicial exception is not integrated into a practical application. 
The claim recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea. See MPEP 2106.05(f). In particular, the claim recites additional elements (“at the computer system”, “using at least one of image object recognition models, image movement recognition models, image facial recognition models, and image situation recognition models”, “using at least one of video object recognition models, video movement recognition models, video facial recognition models, and video situation recognition models”, “using at least one of audio object recognition models, audio movement recognition models, audio speaker recognition models, and audio situation recognition models”, “using at least one of text object recognition models, text activity recognition models, text situation recognition models, and text person recognition models”, “using at least one of sensor object recognition models, sensor activity recognition models, sensor situation recognition models, and sensor person recognition models”). The device and the models in each step are recited at a high-level of generality (i.e., as a generic computer performing a generic computer function of processing data) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

Step 2B: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. 
As discussed above, with respect to integration of the abstract idea into a practical application, the additional elements of using a generic computer to perform each step amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claim is not patent eligible. See MPEP 2106.05(f).

Regarding claim 6
The claim recites “A system comprising a processor, memory accessible by the processor, and computer program instructions stored in the memory and executable by the processor” to perform precisely the method of Claim 1. As performance of an abstract idea on generic computer components (see MPEP 2106.05(f)) and “Storing and retrieving information in memory” (see MPEP 2106.05(g) on Insignificant Extra-Solution Activity, and MPEP 2106.05(d) on Well-Understood, Routine, Conventional Activity) cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself, the claim is rejected for reasons set forth in the rejection of Claim 1.

Regarding claim 7
The claim is rejected for the reasons set forth in the rejection of Claim 2 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 8
The claim is rejected for the reasons set forth in the rejection of Claim 3 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 9
The claim is rejected for the reasons set forth in the rejection of Claim 4 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 10
The claim is rejected for the reasons set forth in the rejection of Claim 5 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 11
The claim recites “A computer program product comprising a non-transitory computer readable storage having program instructions embodied therewith, the program instructions executable by a computer system, to cause the computer system to perform a method” to perform precisely the method of Claim 1. As performance of an abstract idea on generic computer components cannot integrate the abstract idea into a practical application nor provide significantly more than the abstract idea itself (see MPEP 2106.05(f)), the claim is rejected for the reasons set forth in the rejection of Claim 1.

Regarding claim 12
The claim is rejected for the reasons set forth in the rejection of Claim 2 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 13
The claim is rejected for the reasons set forth in the rejection of Claim 3 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 14
The claim is rejected for the reasons set forth in the rejection of Claim 4 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Regarding claim 15
The claim is rejected for the reasons set forth in the rejection of Claim 5 under 35 U.S.C. 101, mutatis mutandis, as reciting an abstract idea without integrating the judicial exception into a practical application nor providing significantly more than the judicial exception.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-3, 5-8, 10-13, 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cobb et al. (US20110044533A1) in view of Wang et al. (US 2021/0263508 A1) in view of Camilus et al. (US 20180314897 A1) in view of Karpathy et al. (Deep Visual-Semantic Alignments for Generating Image Descriptions)

Regarding claim 1
Cobb teaches
A method implemented in a computer system comprising a processor, memory accessible by the processor, and computer program instructions stored in the memory and executable by the processor, the method comprising:
(Cobb [fig(s) 1] “Computer system”, “CPU”, “MEMORY”)

receiving, at the computer system, data capturing an event, the received data capturing the event including at least one sequence of images; 
(Cobb [fig(s) 1] [par(s) 30-34] “In one embodiment, the machine learning engine 140 receives the video frames and the data generated by the computer vision engine 135. The machine learning engine 140 may be configured to analyze the received data, build semantic representations of events depicted in the video frames, detect patterns, and, ultimately, to learn from these observed patterns to identify normal and/or abnormal events. Additionally, data describing whether a normal/abnormal behavior/event has been determined and/or what such behavior/event is may be provided to output devices 118 to issue alerts, for example, an alert message presented on a GUI screen. In general, the computer vision engine 135 and the machine learning engine 140 both process video data in real-time. However, time scales for processing information by the computer vision engine 135 and the machine learning engine 140 may differ. For example, in one embodiment, the computer vision engine 135 processes the received video data frame-by-frame, while the machine learning engine 140 processes data every N-frames. In other words, while the computer vision engine 135 analyzes each frame in real-time to derive a set of information about what is occurring within a given frame, the machine learning engine 140 is not constrained by the real-time frame rate of the video input.”;)

detecting at least one entity from the at least one sequence of images, the detecting the at least one entity including selecting training data sources, selecting image recognition sources, and/or training at least one machine learning (ML) model;
(Cobb [fig(s) 1] [par(s) 20] “Embodiments of the invention provide an interface configured to visually convey information learned by a behavior-recognition system. The behavior-recognition system may be configured to identify, learn, and recognize patterns of behavior by observing and evaluating events depicted by a sequence of video frames. In a particular embodiment, the behavior-recognition system may include both a computer vision engine and a machine learning engine.” [par(s) 22] “the computer vision engine could initially recognize the car as a foreground object; classify it as being a vehicle, and output kinematic data describing the position, movement, speed, etc., of the car in the context event stream. In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as "vehicle appears," "vehicle turns," "vehicle slowing," and "vehicle stops" (once the kinematic information about the car indicated a speed of 0).” [par(s) 23] “the machine learning engine may include a mapper component configured to parse data coming from the context event stream and the primitive event stream and to supply portions of these streams as input to multiple neural networks (e.g., Adaptive Resonance Theory (ART) networks).” [par(s) 31] “Network 110 receives video data (e.g., video stream(s), video images, or the like) from the video input source 105. The video input source 105 may be a video camera, a VCR, DVR, DVD, computer, web-cam device, or the like. For example, the video input source 105 may be a stationary video camera aimed at a certain area (e.g., a subway station, a parking lot, a building entry/exit, etc.), which records the events taking place therein. Generally, the area visible to the camera is referred to as the "scene."” [par(s) 32-34] “In turn, the machine learning engine 140 may be configured to evaluate, observe, learn, and remember details regarding events (and types of events) that transpire within the scene over time. In one embodiment, the machine learning engine 140 receives the video frames and the data generated by the computer vision engine 135. The machine learning engine 140 may be configured to analyze the received data, build semantic representations of events depicted in the video frames, detect patterns, and, ultimately, to learn from these observed patterns to identify normal and/or abnormal events. Additionally, data describing whether a normal/abnormal behavior/event has been determined and/or what such behavior/event is may be provided to output devices 118 to issue alerts, for example, an alert message presented on a GUI screen.”;)

generating, at the computer system, an action prediction based on the at least one sequence of images, the action prediction predicting at least one probable action to be performed by an actor shown in the at least one sequence of images, the action prediction generated from the at least one sequence of images using at least one of an image object recognition model, an image movement recognition model, an image facial recognition model, an image situation recognition model, a video object recognition model, a video movement recognition model, a video facial recognition model, and a video situation recognition model;
(Cobb [fig(s) 1-2] “computer vision engine” and “machine learning engine” [par(s) 6] “retrieving an adaptive theory resonance (ART) network modeling the specified event type, wherein the ART network is generated from the sequence of video frames depicting the scene captured by a video camera, and wherein a location of each cluster in the ART network models a region of the scene were one or more events of the specified type has been to observed to occur” [par(s) 21-40] “Once identified, the object may be evaluated by a classifier configured to determine what is depicted by the foreground object (e.g., a vehicle or a person). Further, the computer vision engine may identify features (e.g., height/width in pixels, average color values, shape, area, and the like) used to track the object from frame-to-frame. Further still, the computer vision engine may derive a variety of information while tracking the object from frame-to-frame, e.g., position, current (and projected) trajectory, direction, orientation, velocity, acceleration, size, color, and the like. … In one embodiment, the machine learning engine may evaluate the context events to generate “primitive events” describing object behavior. Each primitive event may provide semantic meaning to a group of one or more context events. … In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” “vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0). As events occur, and re-occur, the machine learning engine may create, encode, store, retrieve, and reinforce patterns representing the events observed to have occurred, e.g., long-term memories representing a higher-level abstraction of a car parking in the scene—generated from the primitive events underlying multiple observations of different cars entering and parking. Further still, patterns representing an anomalous event (relative to prior observation) or events identified as an event of interest may result in alerts passed to users of the behavioral recognition system. In one embodiment, the machine learning engine may include a mapper component configured to parse data coming from the context event stream and the primitive event stream and to supply portions of these streams as input to multiple neural networks (e.g., Adaptive Resonance Theory (ART) networks). Each individual ART network may generate clusters from the set of inputs data specified for that ART network.” See also [par(s) 41-46]; e.g., predictions of objects on videos read(s) on “action prediction”. In addition, e.g., “computer vision engine” and “machine learning engine” along with “neural networks (e.g., Adaptive Resonance Theory (ART) networks)” and figs 1-2 read(s) on “model”.
Examiner notes that paragraph 81 of the Instant Specification describes “Action prediction in a sequence of images may be performed to predict what are the most probable actions performed by an actor in a sequence of images.”)

(Note: Hereinafter, if a limitation has bold brackets (i.e., [·]) around claim language, the bracketed language has not yet been taught by the current prior art reference but will be taught by another prior art reference afterwards.)

selecting, at the computer system, computing infrastructure upon which to execute tasks, the selected computing infrastructure including delegating the tasks to a plurality of layers depending on complexity of a task, [the plurality of layers comprising a sensors layer, a gateway layer, and a cloud layer, the gateway layer is equipped with a computing capability suitable for executing a neural network, the cloud layer is equipped with a computing capability suitable for training the neural network and/or executing simulation tasks]; 
(Cobb [fig(s) 1-2] [par(s) 30-46] “FIG. 1 illustrates components of a video analysis and behavior-recognition system 100, according to one embodiment of the invention. As shown, the behavior-recognition system 100 includes a video input source 105, a network 110, a computer system 115, and input and output devices 118 (e.g., a monitor, a keyboard, a mouse, a printer, and the like). The network 110 may transmit video data recorded by the video input 105 to the computer system 115. Illustratively, the computer system 115 includes a CPU 120, storage 125 (e.g., a disk drive, optical disk drive, floppy disk drive, and the like), and a memory 130 containing both a computer vision engine 135 and a machine learning engine 140. As described in greater detail below, the computer vision engine 135 and the machine learning engine 140 may provide software applications configured to analyze a sequence of video frames provided by the video input 105. … Note, however, FIG. 1 illustrates merely one possible arrangement of the behavior-recognition system 100. For example, although the video input source 105 is shown connected to the computer system 115 via the network 110, the network 110 is not always present or needed (e.g., the video input source 105 may be directly connected to the computer system 115). Further, various components and modules of the behavior-recognition system 100 may be implemented in other systems. For example, in one embodiment, the computer vision engine 135 may be implemented as a part of a video input device (e.g., as a firmware component wired directly into a video camera). In such a case, the output of the video camera may be provided to the machine learning engine 140 for analysis. Similarly, the output from the computer vision engine 135 and machine learning engine 140 may be supplied over computer network 110 to other computer systems. For example, the computer vision engine 135 and machine learning engine 140 may be installed on a server system and configured to process video from multiple input sources (i.e., from multiple cameras). In such a case, a client application 250, 270 running on another computer system may request (or receive) the results of over network 110.”;)
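
For illustration only, the limitation above (delegating tasks to a sensors layer, a gateway layer, or a cloud layer depending on the complexity of a task) can be pictured with the minimal sketch below. It is not drawn from the claims, the Instant Specification, or Cobb: the names Layer, Task, and select_layer, the numeric complexity score, and both thresholds are hypothetical assumptions chosen only to show the idea of complexity-based routing.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The three layers named in the claim limitation."""
    SENSORS = "sensors"
    GATEWAY = "gateway"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Task:
    name: str
    complexity: float  # hypothetical score in [0, 1]; not defined in the claims


def select_layer(task: Task, gateway_min: float = 0.3, cloud_min: float = 0.7) -> Layer:
    """Route a task to a layer by comparing its complexity to two thresholds.

    Both thresholds are assumptions made for this sketch.
    """
    if task.complexity >= cloud_min:
        return Layer.CLOUD      # e.g., training a neural network or running simulations
    if task.complexity >= gateway_min:
        return Layer.GATEWAY    # e.g., executing an already trained neural network
    return Layer.SENSORS        # e.g., simple capture or filtering at the device


if __name__ == "__main__":
    for t in (Task("capture frame", 0.1), Task("run inference", 0.5), Task("train model", 0.9)):
        print(t.name, "->", select_layer(t).value)
```

Under these assumptions, a task scored 0.5 is routed to the gateway layer and a task scored 0.9 to the cloud layer; a real system could use any other complexity measure.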

wherein the action prediction further includes automatically inferring regular patterns of activity and/or detecting outliers using process mining or outlier detection from the at least one ML model; 
(Cobb [fig(s) 1-2] “computer vision engine” and “machine learning engine” [par(s) 6] “retrieving an adaptive theory resonance (ART) network modeling the specified event type, wherein the ART network is generated from the sequence of video frames depicting the scene captured by a video camera, and wherein a location of each cluster in the ART network models a region of the scene were one or more events of the specified type has been to observed to occur” [par(s) 21-40] “Once identified, the object may be evaluated by a classifier configured to determine what is depicted by the foreground object (e.g., a vehicle or a person). Further, the computer vision engine may identify features (e.g., height/width in pixels, average color values, shape, area, and the like) used to track the object from frame-to-frame. Further still, the computer vision engine may derive a variety of information while tracking the object from frame-to-frame, e.g., position, current (and projected) trajectory, direction, orientation, velocity, acceleration, size, color, and the like. … In one embodiment, the machine learning engine may evaluate the context events to generate “primitive events” describing object behavior. Each primitive event may provide semantic meaning to a group of one or more context events. … In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” “vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0). As events occur, and re-occur, the machine learning engine may create, encode, store, retrieve, and reinforce patterns representing the events observed to have occurred, e.g., long-term memories representing a higher-level abstraction of a car parking in the scene—generated from the primitive events underlying multiple observations of different cars entering and parking. Further still, patterns representing an anomalous event (relative to prior observation) or events identified as an event of interest may result in alerts passed to users of the behavioral recognition system. In one embodiment, the machine learning engine may include a mapper component configured to parse data coming from the context event stream and the primitive event stream and to supply portions of these streams as input to multiple neural networks (e.g., Adaptive Resonance Theory (ART) networks). Each individual ART network may generate clusters from the set of inputs data specified for that ART network.” See also [par(s) 41-46]; e.g., predictions of objects on videos read(s) on “action prediction”.)

determining, at the computer system, an action responsive to the action prediction relating to [tactical intent], determined using [an intent extractor and actuator] to infer the action responsive to the action prediction relating to [tactical intent];
(Cobb [fig(s) 1] “computer vision engine” and “machine learning engine” [fig(s) 5-7] [par(s) 21-44] “In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0).” [par(s) 45-54] “if an object classified as person were to appear outside of one of the established clusters, then the mapper component 211 may generate an alert. Further, the relative significance of each cluster may be tied to the number of input instances that mapped to a given cluster. If the ART network which generated the clusters shown in FIG. 7 mapped an instance of input data to a cluster of low relative significance, an alert may be generated to represent the occurrence of an event that, while not resulting in a new cluster being created for this ART network, was nevertheless unusual relative to what has been observed to have occurred in the scene depicted by frame 710. Using the clusters, should an object (whether classified as “person,” “vehicle,” “unknown,” or “other”) appear, e.g., outside of any cluster, the mapper component 211 may recognize this as an unusual event, and in response, generate an alert indicating that something unusual has occurred. That is, the mapper is configured to only alert based on statistically infrequent events, according to one embodiment.”; e.g., “alert” read(s) on “action responsive to the action prediction”.
Examiner notes that paragraph 44 of the Instant Specification describes “It is possible for an intent to be a combination of strategic and tactical. For example, if the user says "I want to go to the gym daily", that is a longer-term intent, not just something they will do now, but at the same time, it is not a goal by itself. Most people don't go to the gym for fun, but because they want to be fit or they want to look different.”
Examiner notes that paragraph 61 of the Instant Specification describes “Embodiments may provide an Intent extractor and actuator to infer what the appropriate action for a given situation is, given the overall goal of the use case, for example, to keep a particular person safe. Embodiments may use methods that will integrate all the available information and be able to generate an action (even if the action is "do nothing"). Functionality may include expert systems, a policy generators for agents, reinforcement learning, etc. Embodiments may create controlled scenarios with the expected output including "Ideal" scenarios and noisy scenarios and may determine the best channel to express the action, such as text to voice (personal device, automated phone call), IoT device actuators (for example, closing an automated door, ringing an alarm), etc.”
Examiner notes that paragraph 85 of the Instant Specification describes “At 508, intent may be determined using the previously obtained characterizations from 502, the identifications from 504, and ontology information from 506. At 510, based on the determined intent, activities, situations, etc., appropriate action may be taken. For example, dangerous or threatening intent, activities, situations, etc. may cause embodiments to alert police, security, the fire department, etc.”)

generating, at the computer system, a narrativization of the data characterizing the event captured in the data using at least one model for each type of the data, each of the at least one model for each type of the data stored in a models database comprising at least some of text models including at least one of enrichment models adapted to improve or refine text data, encoder models adapted to reduce input dimensions and compress input data into an encoded representation, and decoder models adapted to decompress encoded data into a plaintext representation, image models including at least one of inception models adapted to filter images with multiple sizes operating on a same level, captioning models adapted to automatically caption images, and segmentation models adapted to divide images into different portions for processing or based on objects in each segment, video models, audio models adapted to provide encoding and decoding and functions including at least one of machine translation, text summarization, conversational modeling, and image captioning, and combination models adapted to use multiple models to obtain better predictive performance than could be obtained from any of models individually, wherein the combination models are determined based on [a cosine similarity between] the multiple models, wherein the combination models are operable to [reduce bias and variance of] the generated narrativization;
(Cobb [fig(s) 1] “computer vision engine” and “machine learning engine” [par(s) 21-40] “Further, the computer vision engine may identify features (e.g., height/width in pixels, average color values, shape, area, and the like) used to track the object from frame-to-frame. Further still, the computer vision engine may derive a variety of information while tracking the object from frame-to-frame, e.g., position, current (and projected) trajectory, direction, orientation, velocity, acceleration, size, color, and the like. In one embodiment, the computer vision outputs this information as a stream of “context events” describing a collection of kinematic information related to each foreground object detected in the video frames. … In such a case, the computer vision engine could initially recognize the car as a foreground object; classify it as being a vehicle, and output kinematic data describing the position, movement, speed, etc., of the car in the context event stream. In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” “vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0). … the primitive event detector 212 may be configured to receive the output of the computer vision engine 135 (i.e., the video images, the object classifications, and context event stream) and generate a sequence of primitive events-labeling the observed actions or behaviors in the video with semantic meaning. … the primitive event detector 212 may generate a semantic symbol stream providing a simple linguistic description of actions engaged in by the vehicle. For example, a sequence of primitive events related to observations of the computer vision engine 135 occurring at a parking lot could include formal language vectors representing the following: "vehicle appears in scene," "vehicle moves to a given location," "vehicle stops moving," …”; e.g., “BG/FG COMPONENT 205”, “TRACKER COMPONENT 210”, “PRIMITIVE EVENT DETECTOR 211”, “computer vision engine” and “machine learning engine” read(s) on “video models” since they use video images. In addition, e.g., descriptions of actions of objects on videos read(s) on “narrativization”.)

obtaining ontology information based on the generated narrativization and the at least one entity, 
(Cobb [fig(s) 1] “computer vision engine” and “machine learning engine” [par(s) 21-40] “the computer vision engine could initially recognize the car as a foreground object; classify it as being a vehicle, and output kinematic data describing the position, movement, speed, etc., of the car in the context event stream. In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” “vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0). … In turn, the primitive event detector 212 may generate a semantic symbol stream providing a simple linguistic description of actions engaged in by the vehicle. For example, a sequence of primitive events related to observations of the computer vision engine 135 occurring at a parking lot could include formal language vectors representing the following: "vehicle appears in scene," "vehicle moves to a given location," "vehicle stops moving," "person appears proximate to vehicle," "person moves," "person leaves scene," "person appears in scene," "person moves proximate to vehicle," "person disappears," "vehicle starts moving," and "vehicle disappears."”; e.g., objects on videos read(s) on “entity”. In addition, e.g., properties and relationships of objects in video frames read(s) on “ontology information”.
Examiner notes that par(s) 76 of the Instant Specification describe(s) “User domain 440 may include ontology data including concepts and categories relating to users of the system and may include data showing the properties and the relations between the users and their data. … Context domain 442 may include ontology data including concepts and categories relating to the context of data in the system and may include data showing the properties and the relations between the contexts and the other data. Intent domain 443 may include ontology data including concepts and categories relating to intents of people who may be monitored by the system and may include data showing the properties and the relations between the those people, their actions, their characteristics, etc.”)
determining [a strategic intent] of the actor shown in the at least one sequence of images based on the action prediction relating to [strategic intent], and the obtained ontology information, and 
(Cobb [fig(s) 1] “computer vision engine” and “machine learning engine” [par(s) 21-44] “That is, the episodic memory 235 may encode specific details of a particular event, i.e., “what, when, and where” something occurred within a scene, such as a particular vehicle (car A) moved to a location believed to be a parking space (parking space 5) at 9:43 AM. The long-term memory 225 may store data generalizing events observed in the scene. To continue with the example of a vehicle parking, the long-term memory 225 may encode information capturing observations and generalizations learned by an analysis of the behavior of objects in the scene such as “vehicles tend to park in a particular place in the scene,” “when parking vehicles tend to move a certain speed,” and “after a vehicle appears and parks, people tend to appear in the scene proximate to the vehicle,” etc. Thus, the long-term memory 225 stores observations about what happens within a scene with much of the particular episodic details stripped away. In this way, when a new event occurs, memories from the episodic memory 235 and the long-term memory 225 may be used to relate and understand a current event, i.e., the new event may be compared with past experience, leading to both reinforcement, decay, and adjustments to the information stored in the long-term memory 225, over time.”)
determining an action responsive to the action prediction relating to [strategic intent], determined using [an intent extractor and actuator] to infer the action responsive to the action prediction relating to [strategic intent] and the obtained ontology information; and
(Cobb [fig(s) 1] “computer vision engine” and “machine learning engine” [fig(s) 5-7] [par(s) 21-44] “In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0).” [par(s) 45-54] “if an object classified as person were to appear outside of one of the established clusters, then the mapper component 211 may generate an alert. Further, the relative significance of each cluster may be tied to the number of input instances that mapped to a given cluster. If the ART network which generated the clusters shown in FIG. 7 mapped an instance of input data to a cluster of low relative significance, an alert may be generated to represent the occurrence of an event that, while not resulting in a new cluster being created for this ART network, was nevertheless unusual relative to what has been observed to have occurred in the scene depicted by frame 710. Using the clusters, should an object (whether classified as “person,” “vehicle,” “unknown,” or “other”) appear, e.g., outside of any cluster, the mapper component 211 may recognize this as an unusual event, and in response, generate an alert indicating that something unusual has occurred. That is, the mapper is configured to only alert based on statistically infrequent events, according to one embodiment.”; e.g., “alert” read(s) on “action”.)

performing an action [responsive to the determined strategic intent] of the actor.
(Cobb [fig(s) 5-7] [par(s) 45-54] “if an object classified as person were to appear outside of one of the established clusters, then the mapper component 211 may generate an alert. Further, the relative significance of each cluster may be tied to the number of input instances that mapped to a given cluster. If the ART network which generated the clusters shown in FIG. 7 mapped an instance of input data to a cluster of low relative significance, an alert may be generated to represent the occurrence of an event that, while not resulting in a new cluster being created for this ART network, was nevertheless unusual relative to what has been observed to have occurred in the scene depicted by frame 710. Using the clusters, should an object (whether classified as “person,” “vehicle,” “unknown,” or “other”) appear, e.g., outside of any cluster, the mapper component 211 may recognize this as an unusual event, and in response, generate an alert indicating that something unusual has occurred. That is, the mapper is configured to only alert based on statistically infrequent events, according to one embodiment.”; e.g., “alert” read(s) on “action”.)
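
For illustration only, the alerting behavior quoted above can be sketched as follows. This is a minimal sketch and not Cobb's implementation: the `Cluster` fields, the radius-based membership test, and the `min_share` significance threshold are hypothetical names and values introduced here. Cobb describes ART-network clusters without these specifics.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Cluster:
    center: tuple   # hypothetical 2-D cluster prototype (x, y)
    radius: float   # hypothetical membership radius
    count: int      # number of inputs previously mapped to this cluster


def should_alert(point, clusters, min_share=0.05):
    """Return True when an observation looks statistically infrequent."""
    total = sum(c.count for c in clusters)
    for c in clusters:
        if dist(point, c.center) <= c.radius:
            # Inside a cluster: alert only if the cluster has low relative
            # significance (few of the observed inputs mapped to it).
            return total > 0 and c.count / total < min_share
    # Outside every established cluster: treated as an unusual event.
    return True


clusters = [Cluster((0.0, 0.0), 1.0, 98), Cluster((5.0, 5.0), 1.0, 2)]
```

Under these assumed values, a point near (0, 0) raises no alert, while a point near (5, 5) or far from both clusters does.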

However, Cobb does not appear to explicitly teach:
selecting, at the computer system, computing infrastructure upon which to execute tasks, the selected computing infrastructure including delegating the tasks to a plurality of layers depending on complexity of a task, [the plurality of layers comprising a sensors layer, a gateway layer, and a cloud layer, the gateway layer is equipped with a computing capability suitable for executing a neural network, the cloud layer is equipped with a computing capability suitable for training the neural network and/or executing simulation tasks]; 
wherein the action prediction is operable to be related to tactical intent or related to strategic intent;
determining, at the computer system, an action responsive to the action prediction relating to [tactical intent], determined using [an intent extractor and actuator] to infer the action responsive to the action prediction relating to [tactical intent];
wherein the combination models are determined based on [a cosine similarity between] the multiple models, wherein the combination models are operable to [reduce bias and variance of] the generated narrativization;
determining [a strategic intent] of the actor shown in the at least one sequence of images based on the action prediction relating to [strategic intent], and the obtained ontology information, and 
determining an action responsive to the action prediction relating to [strategic intent], determined using [an intent extractor and actuator] to infer the action responsive to the action prediction relating to [strategic intent] and the obtained ontology information;
performing an action [responsive to the determined strategic intent] of the actor.

(Note: Hereinafter, if a limitation has one or more bold underlines, the underlined claim language is taught by the current prior art reference, while the non-underlined claim language has already been taught by one or more previous prior art references.)

Wang teaches
selecting, at the computer system, computing infrastructure upon which to execute tasks, the selected computing infrastructure including delegating the tasks to a plurality of layers depending on complexity of a task, the plurality of layers comprising a sensors layer, a gateway layer, and a cloud layer, the gateway layer is equipped with a computing capability suitable for executing a neural network, the cloud layer is equipped with a computing capability suitable for training the neural network and/or executing simulation tasks; 
(Wang [fig(s) 1] [par(s) 82-104] “The embodiments of the present invention are applied to a prediction system framework diagram of a product performance prediction modeling method as shown in FIG. 1. The system framework diagram includes: a sensor 101, a sensor 102, a sensor 103, a gateway 104, a cloud platform 105 and a computer device 106. The sensor 101, the sensor 102 and the sensor 103 are configured to monitor a production line (not drawn in the figure) to obtain sample data. … Meanwhile, the cloud platform 105 is in communication connection with a plurality of gateways 104. Each gateway 104 may be disposed in different production lines of the same factory, or may also be disposed in factories at different locations in the same area, or may also be disposed in factories in different areas, so as to comb and classify the sample data acquired by the sensors and then to upload the data to the cloud platform 105. The computer device 106 operates a machine learning model, and is configured to learn the sample data collected by the cloud platform 105 and predict the performance of a product manufactured on the production line. The machine learning model is stored in the computer device 106. In an embodiment, the machine learning model is stored in the cloud platform 105.” [par(s) 109] “Specifically, a prediction model may be established based on a supervised machine learning method, the product performance prediction model may be trained by using a back propagation algorithm, and an error-minimum parameter value of an artificial neural network of the product performance prediction model may be solved by continuously using gradient descent, so as to obtain a product performance prediction model including a locally optimal artificial neural network. 
It should be noted that in the present embodiment, the product performance prediction model trained by using the training set data may be a product performance prediction model established by using a random forest algorithm and the like in advance.” ;)
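
For illustration only, the delegation of tasks across a sensors layer, a gateway layer, and a cloud layer recited in the limitation above can be sketched as follows. This is a minimal sketch under stated assumptions: the task kinds (`"train"`, `"simulate"`, `"infer"`) and the function `select_layer` are hypothetical. Neither the claim language nor Wang specifies a routing function of this form.

```python
from enum import Enum


class Layer(Enum):
    SENSORS = "sensors"
    GATEWAY = "gateway"
    CLOUD = "cloud"


def select_layer(task_kind: str) -> Layer:
    """Pick a layer for a task, using task kind as a stand-in for complexity."""
    if task_kind in {"train", "simulate"}:
        # Most demanding work: training a neural network or running simulations.
        return Layer.CLOUD
    if task_kind == "infer":
        # Executing an already trained neural network.
        return Layer.GATEWAY
    # Everything else (e.g. raw data capture) stays at the sensors layer.
    return Layer.SENSORS
```

For example, `select_layer("infer")` returns `Layer.GATEWAY` and `select_layer("train")` returns `Layer.CLOUD`.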

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the recommendation system of Cobb with the new model of Wang. 
One of ordinary skill in the art would have been motivated to combine in order to accurately predict the final product performance in time, thereby improving the accuracy of the product performance prediction.
(Wang, [pars 5-31] “The foregoing product performance prediction modeling method, product performance prediction method, product performance prediction modeling apparatus, product performance prediction system, computer device and storage medium simulate, according to device outlier data, a product performance and perform machine learning through a machine learning model, so that a mapping relationship between the device outlier data and the product performance can be obtained, a product performance prediction model is established, and therefore the product performance prediction model can accurately predict the final product performance in time in case of device exception, thereby improving the accuracy of the product performance prediction.”;)

However, the combination of Cobb, Wang does not appear to explicitly teach:
wherein the action prediction is operable to be related to tactical intent or related to strategic intent;
determining, at the computer system, an action responsive to the action prediction relating to [tactical intent], determined using [an intent extractor and actuator] to infer the action responsive to the action prediction relating to [tactical intent];
wherein the combination models are determined based on [a cosine similarity between] the multiple models, wherein the combination models are operable to [reduce bias and variance of] the generated narrativization;
determining [a strategic intent] of the actor shown in the at least one sequence of images based on the action prediction relating to [strategic intent], and the obtained ontology information, and 
determining an action responsive to the action prediction relating to [strategic intent], determined using [an intent extractor and actuator] to infer the action responsive to the action prediction relating to [strategic intent] and the obtained ontology information;
performing an action [responsive to the determined strategic intent] of the actor.

Camilus teaches
wherein the action prediction is operable to be related to tactical intent or related to a strategic intent;
(Camilus [fig(s) 2] “Alert for security threats” [par(s) 3-6] “The present invention concerns surveillance systems that flag the potential threats automatically using intelligent systems. It can then notify or automatically alert the security personnel of impending dangers. The inventive surveillance systems employ video analytics strategies to predict and possibly even avoid incidents before they happen. This is possible by predicting human behaviors and their intent by analyzing their actions over a period of time utilizing the videos captured by surveillance cameras. The surveillance systems analyze human actions in a video and audio feeds to obtain insights on their intentions.” [par(s) 17] “FIG. 1 illustrates the design of the system. Every person, person 1, person 2 . . . person n, of a scene provided in a video feed 110 is identified on surveillance feeds is monitored by a software solution. Individual's actions are recognized time to time in an action identification step 120-1, 120-n. If the action is determined to be harmful in steps 122-1, 122-n, then, based on the action type, the person's vulnerability score is updated in steps 124-1, 124-n. If the actions are unusual or dangerous, say for example pushing a man, hitting a man, taking a gun out, taking a knife out, his vulnerability score is increment. If the person's activity score is above a threshold as determined in step 126-1, 126-n, the video observations are informed to control rooms for further actions in steps 128-1, 128-n,.”; e.g., “predicting human behaviors and their intent” along with “Individual's actions are recognized time to time” read(s) on “tactical intent”.)

determining, at the computer system, an action responsive to the action prediction relating to tactical intent, determined using an intent extractor and actuator to infer the action responsive to the action prediction relating to tactical intent;
(Camilus [fig(s) 2] “Alert for security threats” [par(s) 3-6] “The present invention concerns surveillance systems that flag the potential threats automatically using intelligent systems. It can then notify or automatically alert the security personnel of impending dangers. The inventive surveillance systems employ video analytics strategies to predict and possibly even avoid incidents before they happen. This is possible by predicting human behaviors and their intent by analyzing their actions over a period of time utilizing the videos captured by surveillance cameras. The surveillance systems analyze human actions in a video and audio feeds to obtain insights on their intentions.” [par(s) 17] “FIG. 1 illustrates the design of the system. Every person, person 1, person 2 . . . person n, of a scene provided in a video feed 110 is identified on surveillance feeds is monitored by a software solution. Individual's actions are recognized time to time in an action identification step 120-1, 120-n. If the action is determined to be harmful in steps 122-1, 122-n, then, based on the action type, the person's vulnerability score is updated in steps 124-1, 124-n. If the actions are unusual or dangerous, say for example pushing a man, hitting a man, taking a gun out, taking a knife out, his vulnerability score is increment. If the person's activity score is above a threshold as determined in step 126-1, 126-n, the video observations are informed to control rooms for further actions in steps 128-1, 128-n,.”; e.g., “alert” read(s) on “action responsive to the action prediction”. In addition, e.g., “predicting human behaviors and their intent” along with “Individual's actions are recognized time to time” read(s) on “tactical intent”.)

determining a strategic intent of the actor shown in the at least one sequence of images based on the action prediction relating to strategic intent, and the obtained ontology information, and 
(Camilus [fig(s) 2] [par(s) 3-6] “The inventive surveillance systems employ video analytics strategies to predict and possibly even avoid incidents before they happen. This is possible by predicting human behaviors and their intent by analyzing their actions over a period of time utilizing the videos captured by surveillance cameras. The surveillance systems analyze human actions in a video and audio feeds to obtain insights on their intentions.” [par(s) 17] “Individual's actions are recognized time to time in an action identification step 120-1, 120-n. If the action is determined to be harmful in steps 122-1, 122-n, then, based on the action type, the person's vulnerability score is updated in steps 124-1, 124-n. If the actions are unusual or dangerous, say for example pushing a man, hitting a man, taking a gun out, taking a knife out, his vulnerability score is increment.”; e.g., “predicting human behaviors and their intent” along with “Individual's actions are recognized time to time” read(s) on “strategic intent”.)

determining an action responsive to the action prediction relating to strategic intent, determined using an intent extractor and actuator to infer the action responsive to the action prediction relating to strategic intent and the obtained ontology information;
(Camilus [fig(s) 2] “Alert for security threats” [par(s) 3-6] “The present invention concerns surveillance systems that flag the potential threats automatically using intelligent systems. It can then notify or automatically alert the security personnel of impending dangers. The inventive surveillance systems employ video analytics strategies to predict and possibly even avoid incidents before they happen. This is possible by predicting human behaviors and their intent by analyzing their actions over a period of time utilizing the videos captured by surveillance cameras. The surveillance systems analyze human actions in a video and audio feeds to obtain insights on their intentions.” [par(s) 17] “FIG. 1 illustrates the design of the system. Every person, person 1, person 2 . . . person n, of a scene provided in a video feed 110 is identified on surveillance feeds is monitored by a software solution. Individual's actions are recognized time to time in an action identification step 120-1, 120-n. If the action is determined to be harmful in steps 122-1, 122-n, then, based on the action type, the person's vulnerability score is updated in steps 124-1, 124-n. If the actions are unusual or dangerous, say for example pushing a man, hitting a man, taking a gun out, taking a knife out, his vulnerability score is increment. If the person's activity score is above a threshold as determined in step 126-1, 126-n, the video observations are informed to control rooms for further actions in steps 128-1, 128-n,.”; e.g., “alert” read(s) on “action responsive to the action prediction”. In addition, e.g., “predicting human behaviors and their intent” along with “Individual's actions are recognized time to time” read(s) on “strategic intent”.)

performing an action responsive to the determined strategic intent of the actor.
(Camilus [fig(s) 2] “Alert for security threats” [par(s) 3-6] “The present invention concerns surveillance systems that flag the potential threats automatically using intelligent systems. It can then notify or automatically alert the security personnel of impending dangers. The inventive surveillance systems employ video analytics strategies to predict and possibly even avoid incidents before they happen. This is possible by predicting human behaviors and their intent by analyzing their actions over a period of time utilizing the videos captured by surveillance cameras. The surveillance systems analyze human actions in a video and audio feeds to obtain insights on their intentions.” [par(s) 17] “FIG. 1 illustrates the design of the system. Every person, person 1, person 2 . . . person n, of a scene provided in a video feed 110 is identified on surveillance feeds is monitored by a software solution. Individual's actions are recognized time to time in an action identification step 120-1, 120-n. If the action is determined to be harmful in steps 122-1, 122-n, then, based on the action type, the person's vulnerability score is updated in steps 124-1, 124-n. If the actions are unusual or dangerous, say for example pushing a man, hitting a man, taking a gun out, taking a knife out, his vulnerability score is increment. If the person's activity score is above a threshold as determined in step 126-1, 126-n, the video observations are informed to control rooms for further actions in steps 128-1, 128-n,.”; e.g., “alert” read(s) on “action”.)
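
For illustration only, the per-person scoring flow quoted from Camilus (recognize an action, check whether it is harmful, update the person's score, and notify the control room when the score exceeds a threshold) can be sketched as follows. This is a minimal sketch: the action labels, the numeric weights in `HARMFUL_ACTIONS`, and the default `threshold` are hypothetical, since Camilus names example harmful actions but gives no numeric values.

```python
# Hypothetical per-action increments (assumed values, not from Camilus).
HARMFUL_ACTIONS = {"pushing": 1, "hitting": 2, "drawing_knife": 3, "drawing_gun": 4}


def update_scores(observations, threshold=3):
    """Accumulate per-person scores and list persons to report, in order."""
    scores = {}
    flagged = []
    for person, action in observations:
        weight = HARMFUL_ACTIONS.get(action, 0)
        if weight:
            # Harmful action: increment this person's score.
            scores[person] = scores.get(person, 0) + weight
            if scores[person] > threshold and person not in flagged:
                # Score above threshold: report once to the control room.
                flagged.append(person)
    return scores, flagged


observations = [
    ("p1", "walking"),
    ("p2", "pushing"),
    ("p2", "hitting"),
    ("p2", "pushing"),
    ("p1", "drawing_gun"),
]
scores, flagged = update_scores(observations)
```

With these assumed weights, both persons end with a score of 4 and are flagged, `p2` first.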

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Cobb, Wang with the intention-based action of Camilus.
One of ordinary skill in the art would have been motivated to combine in order to lower the cognitive load on the security personnel and assist them in prioritizing their attention to potential threats, thereby improving the overall efficiency of the system.
(Camilus [par(s) 3] “The present invention concerns surveillance systems that flag the potential threats automatically using intelligent systems. It can then notify or automatically alert the security personnel of impending dangers. Such a system can lower the cognitive load on the security personnel and can assist them to bring to prioritize their attention to potential threats and thereby improve the overall efficiency of the system. There could also be savings in labor cost.”)

However, the combination of Cobb, Wang, Camilus does not appear to explicitly teach:
wherein the combination models are determined based on [a cosine similarity between] the multiple models, wherein the combination models are operable to [reduce bias and variance of] the generated narrativization;

Karpathy teaches
wherein the combination models are determined based on a cosine similarity between the multiple models, wherein the combination models are operable to reduce bias and variance of the generated narrativization;
(Karpathy [fig(s) 2] “generate novel descriptions” [fig(s) 3] “Diagram for evaluating the image-sentence score $S_{kl}$. Object regions are embedded with a CNN (left). Words (enriched by their context) are embedded in the same multimodal space with a BRNN (right). Pairwise similarities are computed with inner products (magnitudes shown in grayscale) and finally reduced to image-sentence score with Equation 8.” [sec(s) 1] “We develop a deep neural network model that infers the latent alignment between segments of sentences and the region of the image that they describe. … • We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval based baselines, and produce sensible qualitative predictions.” [sec(s) 3.1] “The model of Karpathy et al. [24] interprets the dot product $v_i^T s_t$ between the i-th region and t-th word as a measure of similarity and use it to define the score between image k and sentence l as: $S_{kl} = \sum_{t \in g_l} \sum_{i \in g_k} \max(0, v_i^T s_t)$. (7) Here, $g_k$ is the set of image fragments in image k and $g_l$ is the set of sentence fragments in sentence l. The indices k, l range over the images and sentences in the training set. Together with their additional Multiple Instance Learning objective, this score carries the interpretation that a sentence fragment aligns to a subset of the image regions whenever the dot product is positive. We found that the following reformulation simplifies the model and alleviates the need for additional objectives and their hyperparameters: $S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t$. (8) Here, every word $s_t$ aligns to the single best image region. As we show in the experiments, this simplified model also leads to improvements in the final ranking performance. … This objective encourages aligned image-sentences pairs to have a higher score than misaligned pairs, by a margin. … To address this issue, we treat the true alignments as latent variables in a Markov Random Field (MRF) where the binary interactions between neighboring words encourage an alignment to the same region. … This parameter allows us to interpolate between single-word alignments (β = 0) and aligning the entire sentence to a single, maximally scoring region when β is large. We minimize the energy to find the best alignments a using dynamic programming.”; e.g., “dot product” read(s) on “cosine similarity”. In addition, e.g., semantic gap between images and sentences read(s) on “bias”. Furthermore, e.g., scattered and/or inconsistent results read(s) on “variance”.)
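
For illustration only, the image-sentence score described in the quoted passage (each word aligned to its single best image region by dot product, then accumulated over the words) can be sketched alongside the standard definition of cosine similarity, which is the dot product divided by the product of the two vector norms. The two quantities coincide when both vectors have unit length. This is a minimal sketch: the function names and the toy region and word vectors are hypothetical and are not taken from Karpathy.

```python
from math import sqrt


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product normalized by both vector norms."""
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


def image_sentence_score(regions, words):
    """Sum, over words, of the best (maximum) region-word dot product."""
    return sum(max(dot(v, s) for v in regions) for s in words)


# Toy embeddings (assumed values): two image regions and two words.
regions = [[3.0, 4.0], [1.0, 0.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
score = image_sentence_score(regions, words)
```

With these toy vectors, the first word's best dot product is 3 and the second word's is 8, so `score` is 11, whereas the cosine similarities of the same best-matching pairs are 0.6 and 0.8.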

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Cobb, Wang, Camilus with the intention-based action of Karpathy.
One of ordinary skill in the art would have been motivated to combine in order to surpass the prediction performance of the state-of-the-art in the image-sentence retrieval experiments.
(Karpathy [sec(s) 1] “We develop a deep neural network model that infers the latent alignment between segments of sentences and the region of the image that they describe. Our model associates the two modalities through a common, multimodal embedding space and a structured objective. We validate the effectiveness of this approach on image-sentence retrieval experiments in which we surpass the state-of-the-art. • We introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text. Our experiments show that the generated sentences significantly outperform retrieval based baselines, and produce sensible qualitative predictions. We then train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations.”)

Regarding claim 2
The combination of Cobb, Wang, Camilus, Karpathy teaches claim 1.

Cobb further teaches 
wherein the data capturing an event comprises at least one of image data, video data, text data, audio data, and sensor data.
(Cobb [fig(s) 1] [par(s) 30-34] “In one embodiment, the machine learning engine 140 receives the video frames and the data generated by the computer vision engine 135. The machine learning engine 140 may be configured to analyze the received data, build semantic representations of events depicted in the video frames, detect patterns, and, ultimately, to learn from these observed patterns to identify normal and/or abnormal events. Additionally, data describing whether a normal/abnormal behavior/event has been determined and/or what such behavior/event is may be provided to output devices 118 to issue alerts, for example, an alert message presented on a GUI screen. In general, the computer vision engine 135 and the machine learning engine 140 both process video data in real-time. However, time scales for processing information by the computer vision engine 135 and the machine learning engine 140 may differ. For example, in one embodiment, the computer vision engine 135 processes the received video data frame-by-frame, while the machine learning engine 140 processes data every N-frames. In other words, while the computer vision engine 135 analyzes each frame in real-time to derive a set of information about what is occurring within a given frame, the machine learning engine 140 is not constrained by the real-time frame rate of the video input.”;)
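
For illustration only, the two processing time scales quoted from Cobb (a computer vision engine that handles every frame and a machine learning engine that handles data every N frames) can be sketched as follows. This is a minimal sketch: the function `process_stream`, the string placeholders standing in for per-frame output, and the default `n=3` are hypothetical.

```python
def process_stream(frames, n=3):
    """Run a per-frame step on every frame and hand off a batch every n frames."""
    per_frame = []  # output of the (placeholder) per-frame vision step
    batches = []    # groups passed to the (placeholder) learning step
    for index, frame in enumerate(frames, start=1):
        # Per-frame step: derive information about the current frame.
        per_frame.append(f"context:{frame}")
        if index % n == 0:
            # Every n-th frame: pass the last n per-frame results onward.
            batches.append(per_frame[-n:])
    return per_frame, batches


per_frame, batches = process_stream(["f1", "f2", "f3", "f4", "f5", "f6", "f7"])
```

With seven frames and `n=3`, the per-frame step runs seven times and the batch step runs twice.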

Regarding claim 3
The combination of Cobb, Wang, Camilus, Karpathy teaches claim 2.

Camilus further teaches 
wherein the data capturing an event comprises at least one of real-time data relating to events occurring contemporaneously and stored data relating events that occurred in a past.
(Camilus [par(s) 24-27] “In the CNN, we use the parameters as listed in Table 1. We provide 16 frames at a time to model temporal information and use a biased training data that has walking and running video frames in the ratio of 1:3 to model the human walking pattern. We captured around 45 minutes of running videos coving over 72 subjects and over 2 hours of walking videos in the vicinity of Bangalore Johnson Controls campus. These videos were being used for training and testing of the 3D-CNN. To validate our method, we used 171 running, 229 walking videos as training data and 30 running, 30 walking videos as testing data, The run time is reported in Table 2. Use of CPUs' enhances the speed of execution. … Teaching a CNN with general image features is important to improve its accuracy. We use a pre-trained model of 487 classes of actions trained on sports-1 million data base [9] and we replace the final classification layer with 2 classes that we require for running and walking classification. The final layer of the network is trained again with our own running and walking images. By this way, the network has seen many images and their association with corresponding actions.”)

The combination of Cobb, Wang, Camilus, Karpathy is combinable with Camilus for the same rationale as set forth above with respect to claim 1.

Regarding claim 5
The combination of Cobb, Wang, Camilus, Karpathy teaches claim 1.

Cobb further teaches 
wherein detecting the at least one entity comprises at least one of: 
detecting, at the computer system, the at least one entity comprising at least one of an object, activity, situation, and person from image data using at least one of image object recognition models, image movement recognition models, image facial recognition models, and image situation recognition models; 
detecting, at the computer system, the at least one entity comprising at least one of an object, activity, situation, and person from video data using at least one of video object recognition models, video movement recognition models, video facial recognition models, and video situation recognition models; 
detecting, at the computer system, the at least one entity comprising at least one of an object, activity, situation, and person from audio data using at least one of audio object recognition models, audio movement recognition models, audio speaker recognition models, and audio situation recognition models; 
detecting, at the computer system, the at least one entity comprising at least one of an object, activity, situation, and person from text data using at least one of text object recognition models, text activity recognition models, text situation recognition models, and text person recognition models; and 
detecting, at the computer system, the at least one entity comprising at least one of an object, activity, situation, and person from sensor data using at least one of sensor object recognition models, sensor activity recognition models, sensor situation recognition models, and sensor person recognition models.
(Cobb [fig(s) 1] “computer vision engine” and “machine learning engine” [par(s) 21-40] “Once identified, the object may be evaluated by a classifier configured to determine what is depicted by the foreground object (e.g., a vehicle or a person). Further, the computer vision engine may identify features (e.g., height/width in pixels, average color values, shape, area, and the like) used to track the object from frame-to-frame. Further still, the computer vision engine may derive a variety of information while tracking the object from frame-to-frame, e.g., position, current (and projected) trajectory, direction, orientation, velocity, acceleration, size, color, and the like. In one embodiment, the computer vision outputs this information as a stream of “context events” describing a collection of kinematic information related to each foreground object detected in the video frames. Each context event may provide kinematic data related to a foreground object observed by the computer vision engine in the sequence of video frames. Data output from the computer vision engine may be supplied to the machine learning engine. In one embodiment, the machine learning engine may evaluate the context events to generate “primitive events” describing object behavior. Each primitive event may provide semantic meaning to a group of one or more context events. For example, assume a camera records a car entering a scene, and that the car turns and parks in a parking spot. In such a case, the computer vision engine could initially recognize the car as a foreground object; classify it as being a vehicle, and output kinematic data describing the position, movement, speed, etc., of the car in the context event stream. In turn, a primitive event detector could generate a stream of primitive events from the context event stream such as “vehicle appears,” vehicle turns,” “vehicle slowing,” and “vehicle stops” (once the kinematic information about the car indicated a speed of 0). 
As events occur, and re-occur, the machine learning engine may create, encode, store, retrieve, and reinforce patterns representing the events observed to have occurred, e.g., long-term memories representing a higher-level abstraction of a car parking in the scene—generated from the primitive events underlying multiple observations of different cars entering and parking.”;)
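
For illustration only, the step quoted from Cobb in which a primitive event detector turns a stream of kinematic context events into primitive events such as “vehicle appears,” “vehicle slowing,” and “vehicle stops” can be sketched as follows. This is a minimal sketch for a single tracked object: the dictionary keys `label` and `speed` and the detection rules are hypothetical simplifications, not Cobb's detector.

```python
def primitive_events(context_events):
    """Translate kinematic context events for one object into primitive events."""
    events = []
    previous_speed = None
    for ev in context_events:
        label, speed = ev["label"], ev["speed"]
        if previous_speed is None:
            # First observation of the object in the scene.
            events.append(f"{label} appears")
        elif speed == 0 and previous_speed > 0:
            # Kinematic data indicates the speed has reached 0.
            events.append(f"{label} stops")
        elif speed < previous_speed:
            events.append(f"{label} slowing")
        previous_speed = speed
    return events


stream = [
    {"label": "vehicle", "speed": 10.0},
    {"label": "vehicle", "speed": 6.0},
    {"label": "vehicle", "speed": 0.0},
]
result = primitive_events(stream)
```

For this stream, `result` is `["vehicle appears", "vehicle slowing", "vehicle stops"]`.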

Regarding claim 6
The claim is a system claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.
Note that Cobb teaches a processor and memory.
(Cobb [fig(s) 1] “computer system”, “CPU”, and “MEMORY”)

Regarding claim 7
The claim is a system claim corresponding to the method claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 8
The claim is a system claim corresponding to the method claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 10
The claim is a system claim corresponding to the method claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 11
The claim is a computer program product claim corresponding to the method claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.
Note that Cobb teaches a computer and computer readable storage.
(Cobb [fig(s) 1] “computer system”, “CPU”, and “MEMORY”)

Regarding claim 12
The claim is a computer program product claim corresponding to the method claim 2, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 13
The claim is a computer program product claim corresponding to the method claim 3, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 15
The claim is a computer program product claim corresponding to the method claim 5, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Claim(s) 4, 9, 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cobb et al. (US 2011/0044533 A1) in view of Wang et al. (US 2021/0263508 A1) in view of Camilus et al. (US 2018/0314897 A1) in view of Karpathy et al. (Deep Visual-Semantic Alignments for Generating Image Descriptions) in view of Vinyals et al. (Show and Tell: A Neural Image Caption Generator).

Regarding claim 4
The combination of Cobb, Wang, Camilus, and Karpathy teaches claim 2.

However, the combination of Cobb, Wang, Camilus, and Karpathy does not appear to explicitly teach:
wherein generating the narrativization comprises at least one of: 
captioning, at the computer system, image data, 
captioning, at the computer system, video data, 
recognizing, at the computer system, speech included in audio data, 
generating summary data, at the computer system, characterizing text data, and 
generating summary data, at the computer system, characterizing sensor data.

Vinyals teaches
wherein generating the narrativization comprises at least one of: (see the rejections of claim 1)
captioning, at the computer system, image data, 
captioning, at the computer system, video data, 
recognizing, at the computer system, speech included in audio data, 
generating summary data, at the computer system, characterizing text data, and 
generating summary data, at the computer system, characterizing sensor data.
(Vinyals [fig(s) 1] “NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language generating RNN. It generates complete sentences in natural language from an input image, as shown on the example above.” [sec(s) 1] “Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC.” [fig(s) 4-5] [sec(s) 4.3.6] “Figure 4 shows the result of the human evaluations of the descriptions provided by NIC, as well as a reference system and groundtruth on various datasets. We can see that NIC is better than the reference system, but clearly worse than the groundtruth, as expected. This shows that BLEU is not a perfect metric, as it does not capture well the difference between NIC and human descriptions assessed by raters. Examples of rated images can be seen in Figure 5. It is interesting to see, for instance in the second image of the first column, how the model was able to notice the frisbee given its size.”)

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system of Cobb, Wang, Camilus, and Karpathy with the image data captioning of Vinyals.
One of ordinary skill in the art would have been motivated to combine in order to automatically describe the content of an image using fluent English sentences with accuracy and better performance, qualitatively and quantitatively, compared to the conventional approaches.
(Vinyals [sec(s) Abs] “Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively.” [sec(s) 5] “Experiments on several datasets show the robustness of NIC in terms of qualitative results (the generated sentences are very reasonable) and quantitative evaluations, using either ranking metrics or BLEU, a metric used in machine translation to evaluate the quality of generated sentences.”)

Regarding claim 9
The claim is a system claim corresponding to the method claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.

Regarding claim 14
The claim is a computer program product claim corresponding to the method claim 4, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejections of the method claim.
Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Schiappa et al. (US 2016/0080419 A1) teaches threat detection.
Cobb et al. (US20100260376A1) teaches processing videos, generating alerts.
Koppula et al. (Anticipating Human Activities for Reactive Robotic Response) teaches predicting human activities.
Jan et al. (NEURAL NETWORK CLASSIFIERS FOR AUTOMATED VIDEO SURVEILLANCE) teaches alerts based on actor motions on videos.
Pei et al. (Learning and parsing video events with goal and intent prediction) teaches predicting intents based on video images.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
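The reply windows described above reduce to simple month arithmetic on the mailing date. The following is a minimal sketch only, assuming the April 23, 2026 mailing date shown for this action in the prosecution timeline; the helper names `add_months` and `reply_deadlines` are hypothetical and used for illustration. It does not model the advisory-action adjustment described above, nor the roll-forward that applies when a due date falls on a weekend or federal holiday.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day of month is clamped to the last day of the target month
    (for example, Aug 31 plus 6 months gives Feb 28 in a non-leap year).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def reply_deadlines(mailing_date: date) -> dict:
    """Compute the three reference dates named in the final-action notice."""
    return {
        # A first reply filed by this date triggers the advisory-action rule.
        "two_month_mark": add_months(mailing_date, 2),
        # End of the shortened statutory period (no extension fee).
        "shortened_statutory_period": add_months(mailing_date, 3),
        # Absolute outer limit, reachable only with paid extensions.
        "statutory_maximum": add_months(mailing_date, 6),
    }


# Assumed mailing date, taken from the timeline entry for the current action.
deadlines = reply_deadlines(date(2026, 4, 23))
for label, due in deadlines.items():
    print(f"{label}: {due.isoformat()}")
```

Dates computed this way are a starting point for docketing, not a substitute for it: the actual due date should be confirmed against the mailing date printed on the action itself.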
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409. The examiner can normally be reached Mon - Thu 7:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SEHWAN KIM/Examiner, Art Unit 2129
Prosecution Timeline

May 22, 2025: Final Rejection mailed (§101, §103)
Nov 19, 2025: Examiner Interview Summary
Nov 19, 2025: Applicant Interview (Telephonic)
Nov 20, 2025: Request for Continued Examination
Nov 30, 2025: Response after Non-Final Action
Dec 05, 2025: Non-Final Rejection mailed (§101, §103)
Mar 05, 2026: Response Filed
Apr 23, 2026: Final Rejection mailed (§101, §103) (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/046,492 (Patent 12619853): DECISION-MAKING DEVICE, UNMANNED SYSTEM, DECISION-MAKING METHOD, AND PROGRAM. 5y 6m to grant; granted May 05, 2026.
17/899,124 (Patent 12619921): PREDICTIVE FOG DATA CENTER MIGRATION. 3y 8m to grant; granted May 05, 2026.
18/000,845 (Patent 12608592): AUTOMATED ELECTRIC SUBMERSIBLE PUMP (ESP) FAILURE ANALYSIS. 3y 4m to grant; granted Apr 21, 2026.
15/360,454 (Patent 12602595): SYSTEM AND METHOD OF USING A KNOWLEDGE REPRESENTATION FOR FEATURES IN A MACHINE LEARNING CLASSIFIER. 9y 4m to grant; granted Apr 14, 2026.
16/453,380 (Patent 12602580): Dataset Dependent Low Rank Decomposition Of Neural Networks. 6y 9m to grant; granted Apr 14, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 7-8
Grant Probability: 60%
With Interview (+65.9%): 99%
Median Time to Grant: 4y 0m (~0m remaining)
PTA Risk: High
Based on 146 resolved cases by this examiner. Grant probability derived from career allowance rate.