DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
The following is a Non-Final Office Action in response to the correspondence filed on 05/21/2024.
Claims 1-20 are currently pending and are considered in this Office Action.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-13 and 15-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kamal Nasrollahi (US 20250252735 A1) (hereinafter Nasrollahi):
Regarding Claim 1, Nasrollahi teaches a video anomaly detection (VAD) system (The present disclosure more particularly relates to Video Anomaly Detection (VAD). [0002]) comprising:
an induction stage receiving a plurality of video frames as a reference ([0128] Next, in a step S310, the method comprises acquiring metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers) and deriving a rule for a normal event occurrence and a corresponding rule for an anomaly event occurrence by contrasting the corresponding rule for the anomaly event occurrence to the rule for the normal event occurrence ([0116] Generating the message, instruction, event or additional metadata may be triggered by a rules engine, as a first rules engine, based on the said user-defined semantic query. For instance, the rules engine may comprise a rule such as ‘send a text to person X upon detection of a person falling by camera no. 1’.); and
a deduction stage applying the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence to determine anomalies in non-reference video frames ([0129] Next, in a step S310, the method comprises identifying user-relevant content using either one of a user-defined semantic query and a user-selected Video Anomaly Detection—VAD—model, in combination with either one of a unique identifier and a subset of the unique identifiers. ).
Regarding Claim 2, Nasrollahi teaches the VAD system of Claim 1. Nasrollahi further teaches wherein the induction stage comprises large language models (LLMs) to induce the rule for the normal event occurrence from a representative set of normal scenarios from the plurality of video frames as the reference and which differentiates the normal event occurrence and the anomaly event occurrence ([0106] the user-selected VAD model used by the user according to the disclosure may comprise a MLM or a rules engine configured to perform VAD on the acquired video data corresponding to the said unique identifier or the said subset of the unique identifiers, as need be for the use case. [0088] The said MLM preferably comprises a Large Language Model, LLM, as a first LLM, the first LLM being configured to perform captioning of the video data.).
Regarding Claim 3, Nasrollahi teaches the VAD system of Claim 1. Nasrollahi further teaches wherein the induction stage comprises:
a visual perception unit converting visual features into textual descriptions from the plurality of video frames as the reference ([0041] acquire metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
a rules generation unit generating the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence from the textual descriptions ([0103] For instance, if the configuration includes ‘detecting a person falling’ in the video data from camera no. 1, a person lying in bed captured by the video camera no. 1 will not trigger a “fall” event (or alarm or alert) in the VMS, whereas a person lying on the floor captured by the video camera no. 1 will.); and
a rules aggregation unit applying randomized smoothing to generate the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence ([0122] According to the disclosure, fact-checking may be done before or after identification of the said user-relevant content. However, it is preferable to perform fact-checking before the said identification, in order to avoid presenting the user with false hits.).
Regarding Claim 4, Nasrollahi teaches the VAD system of Claim 3. Nasrollahi further teaches, wherein the visual perception module uses a Vision Language Model (VLM) to convert visual features into textual descriptions ([0014]: Large Vision Language Models (LVLMs) to detect previously unknown objects).
Regarding Claim 5, Nasrollahi teaches the VAD system of Claim 3. Nasrollahi further teaches wherein the visual perception unit decouples the plurality of video frames into multiple categories ([0085] The step S220′ may be triggered only when a predefined threshold is met (e.g. a threshold for detecting motion or specific objects such as humans or vehicles), for computational efficiency.).
Regarding Claim 6, Nasrollahi teaches the VAD system of Claim 3. Nasrollahi further teaches wherein the two categories are human activities and environmental objects ([0085] The step S220′ may be triggered only when a predefined threshold is met (e.g. a threshold for detecting motion or specific objects such as humans or vehicles), for computational efficiency.).
Regarding Claim 7, Nasrollahi teaches the VAD system of Claim 6. Nasrollahi further teaches wherein the rules generation unit generates rules for human activities and environmental objects (The metadata may also define what type of object has been detected e.g. person, car, dog, bicycle, and/or characteristics of the object (e.g. colour, speed of movement etc). [0066]).
Regarding Claim 8, Nasrollahi teaches the VAD system of Claim 3. Nasrollahi further teaches wherein the rules generation unit queries the textual descriptions and detects patterns to define the rule for the normal event occurrence ([0041] acquire metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers).
Regarding Claim 9, Nasrollahi teaches the VAD system of Claim 8. Nasrollahi further teaches wherein the rules generation unit derives the corresponding rule for the anomaly event occurrence based on the rule for the normal event occurrence that has been defined ([0116] Generating the message, instruction, event or additional metadata may be triggered by a rules engine, as a first rules engine, based on the said user-defined semantic query. For instance, the rules engine may comprise a rule such as ‘send a text to person X upon detection of a person falling by camera no. 1’.).
Regarding Claim 10, Nasrollahi teaches the VAD system of Claim 3. Nasrollahi further teaches wherein the rules generation unit uses analogical reasoning (i.e., comparison to evaluate analogously; [0121] Comparing the said first graph with a graph representing ground truth allows to eliminate fictional triples, and thus to reduce the risk of using erroneous data (e.g. for VAD or statistical purposes).).
Regarding Claim 11, Nasrollahi teaches the VAD system of Claim 3. Nasrollahi further teaches wherein the rules aggregation unit samples a plurality of batches of the video frames as the reference each containing a predefined number of frames, each of the plurality of batches of the video frames as the reference run independently through the visual perception unit and the rules generation unit to generate the rule for the normal event occurrence ([0070]: clips are a predefined number of frames [0012] object-level analysis is performed independently for each object on each frame,).
Regarding Claim 12, Nasrollahi teaches the VAD system of Claim 11. Nasrollahi further teaches wherein the rules aggregation unit uses a large language model (LLM) with a voting mechanism to generate the rule for the normal event occurrence based on appearance in the plurality of batches of the video frames as the reference ([0005] The identification of whether information is important is typically made by the viewer, although the viewer can be assisted by the alert and/or event identifying that the information could be important. Typically, the viewer is interested to view video data that depicts the motion of objects that are of particular interest, such as people or vehicles.).
Regarding Claim 13, Nasrollahi teaches the VAD system of Claim 1. Nasrollahi further teaches wherein the deduction stage comprises:
a visual perception unit processing the non-reference video frames and outputting the textual descriptions ([0079] In step S220′, (content) metadata is generated by performing captioning of the video data. The metadata comprises semantic data (according to a semantic data model, SDM) to represent content in the video data in combination with the unique identifiers (i.e. the respective IDs of the video surveillance cameras).);
a perception smoothing unit using exponential majority smoothing for perception error reduction and temporal consistency (Such a graph may serve as a historical record or log of detected subjects, predicates, and objects, and/or serve to fact-check the accuracy of the detection model. [0120]); and
a robust reasoning module with a double-check system to reduce false negative outputs (it is preferable to perform fact-checking before the said identification, in order to avoid presenting the user with false hits. [0122]).
Regarding Claim 15, Nasrollahi teaches the VAD system of Claim 13. Nasrollahi further teaches wherein the robust reasoning module uses a large language model (LLM) to take a modified description from each of the non-reference video frames and a dummy answer and checks to confirm if the dummy answer matches a description based on the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence ([0108] Relying on a LLM to directly detect anomalies from image descriptions as in the prior art is unpredictable. However, the present disclosure offers better control, and both model fine-tuning and embedding training have low computational requirements.).
Regarding Claim 16, Nasrollahi teaches a method for video anomaly detection (VAD) (the method according to the present disclosure, the said user-selected VAD model comprises at least one machine learning model, MLM, or a rules engine, configured to perform VAD, [0037]) comprising:
receiving a plurality of video frames as a reference ([0128] Next, in a step S310, the method comprises acquiring metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
deriving a rule for a normal event occurrence and a corresponding rule for an anomaly event occurrence by contrasting the corresponding rule for the anomaly event occurrence to the rule for the normal event occurrence ([0116] Generating the message, instruction, event or additional metadata may be triggered by a rules engine, as a first rules engine, based on the said user-defined semantic query. For instance, the rules engine may comprise a rule such as ‘send a text to person X upon detection of a person falling by camera no. 1’.); and
applying the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence to determine anomalies in non-reference video frames ([0129] Next, in a step S310, the method comprises identifying user-relevant content using either one of a user-defined semantic query and a user-selected Video Anomaly Detection—VAD—model, in combination with either one of a unique identifier and a subset of the unique identifiers.).
Regarding Claim 17, Nasrollahi teaches the method of Claim 16. Nasrollahi further teaches wherein deriving the rule for the normal event occurrence and the corresponding rule for the anomaly event comprises:
converting visual features in each of the plurality of video frames as the reference into textual descriptions ([0041] acquire metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
generating the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence from the textual descriptions ([0103] For instance, if the configuration includes ‘detecting a person falling’ in the video data from camera no. 1, a person lying in bed captured by the video camera no. 1 will not trigger a “fall” event (or alarm or alert) in the VMS, whereas a person lying on the floor captured by the video camera no. 1 will.); and
applying randomized smoothing to generate the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence ([0122] According to the disclosure, fact-checking may be done before or after identification of the said user-relevant content. However, it is preferable to perform fact-checking before the said identification, in order to avoid presenting the user with false hits.).
Regarding Claim 18, Nasrollahi teaches the method of Claim 17. Nasrollahi further teaches decoupling the plurality of video frames into multiple categories ([0085] The step S220′ may be triggered only when a predefined threshold is met (e.g. a threshold for detecting motion or specific objects such as humans or vehicles), for computational efficiency.).
Regarding Claim 19, Nasrollahi teaches the method of Claim 17. Nasrollahi further teaches processing the non-reference video frames to output textual descriptions of the non-reference video frames ([0079] In step S220′, (content) metadata is generated by performing captioning of the video data. The metadata comprises semantic data (according to a semantic data model, SDM) to represent content in the video data in combination with the unique identifiers (i.e. the respective IDs of the video surveillance cameras).);
applying exponential majority smoothing for perception error reduction and temporal consistency (Such a graph may serve as a historical record or log of detected subjects, predicates, and objects, and/or serve to fact-check the accuracy of the detection model. [0120]); and
applying a double-check system to reduce false negative outputs (it is preferable to perform fact-checking before the said identification, in order to avoid presenting the user with false hits. [0122]).
Regarding Claim 20, Nasrollahi teaches a method for video anomaly detection (VAD) (the method according to the present disclosure, the said user-selected VAD model comprises at least one machine learning model, MLM, or a rules engine, configured to perform VAD, [0037]), the method implemented using a computer system including a processor (comprising one or more processors [0039]) communicatively coupled to a memory device (a non-transitory computer readable storage medium storing a program for causing a computer to execute a method [0038]), the method comprising:
receiving a plurality of video frames as a reference ([0128] Next, in a step S310, the method comprises acquiring metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
converting visual features into textual descriptions from the plurality of video frames as the reference ([0041] acquire metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
querying the textual descriptions to detect patterns ([0041] acquire metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
generating a rule for a normal event occurrence and a corresponding rule for an anomaly event occurrence based on the detected patterns ([0041] acquire metadata generated by performing captioning of the video data, wherein the metadata comprises semantic data to represent content in the video data in combination with the unique identifiers);
applying randomized smoothing to generate the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence ([0122] According to the disclosure, fact-checking may be done before or after identification of the said user-relevant content. However, it is preferable to perform fact-checking before the said identification, in order to avoid presenting the user with false hits.);
processing non-reference video frames to output textual descriptions from the non-reference video frames ([0079] In step S220′, (content) metadata is generated by performing captioning of the video data. The metadata comprises semantic data (according to a semantic data model, SDM) to represent content in the video data in combination with the unique identifiers (i.e. the respective IDs of the video surveillance cameras).);
applying the rule for the normal event occurrence and the corresponding rule for the anomaly event occurrence to determine anomalies in the non-reference video frames ([0129] Next, in a step S310, the method comprises identifying user-relevant content using either one of a user-defined semantic query and a user-selected Video Anomaly Detection—VAD—model, in combination with either one of a unique identifier and a subset of the unique identifiers.);
applying exponential majority smoothing for perception error reduction and temporal consistency (Such a graph may serve as a historical record or log of detected subjects, predicates, and objects, and/or serve to fact-check the accuracy of the detection model. [0120]); and
applying a double-check system to reduce false negative outputs (it is preferable to perform fact-checking before the said identification, in order to avoid presenting the user with false hits. [0122]).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Kamal Nasrollahi (US 20250252735 A1) (hereinafter Nasrollahi) in view of Mia Siemon (US 20240303986 A1) (hereinafter Siemon):
Regarding Claim 14, Nasrollahi teaches the VAD system of Claim 13; however, Nasrollahi does not explicitly teach wherein the perception smoothing unit uses a moving average that places a higher weighted value on more recent data points and focuses on a single category.
However, in an analogous art, Siemon teaches wherein the perception smoothing unit uses a moving average that places a higher weighted value on more recent data points and focuses on a single category (Evaluations on macro-level take all test videos concatenated to a single recording into consideration, while those on macro-level report the results obtained through the weighted average after considering each test video individually. [0127]).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings disclosed by Nasrollahi to add the teachings of Siemon as above, in order to increase the accuracy of the anomaly detection (Siemon, [0132]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAHMOUD KAMAL ABOUZAHRA whose telephone number is (703)756-1694. The examiner can normally be reached M-F 7:00 AM to 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jamie Atala can be reached at (571) 272-7384. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MAHMOUD KAMAL ABOUZAHRA/Examiner, Art Unit 2486
/Justin W Rider/Primary Patent Examiner, Art Unit 2486