Last updated: May 29, 2026
Application No. 17/991,410
SYSTEM AND METHOD FOR SELF-SUPERVISED VIDEO TRANSFORMER

Non-Final OA §103
Filed: Nov 21, 2022
Examiner: HERNANDEZ, ALEJANDRO
Art Unit: 2661
Tech Center: 2600 (Communications)
Assignee: Mohamed Bin Zayed University Of Artificial Intelligence
OA Round: 2 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (78%) and a +27.5% interview lift. A written response may suffice.
Based on 41 resolved cases, 2023–2026
Examiner Intelligence

HERNANDEZ, ALEJANDRO
Grants 78% — above average
Career Allowance Rate
32 granted / 41 resolved
+16.0% vs TC avg
Interview Lift: +27.5%
Typical timeline
2y 10m
Avg Prosecution
9 currently pending
Statute-Specific Performance

§101: 1.2% (-38.8% vs TC avg)
§103: 85.2% (+45.2% vs TC avg)
§102: 6.2% (-33.8% vs TC avg)
§112: 7.4% (-32.6% vs TC avg)
Based on career data from 41 resolved cases
Office Action

§103
Detailed Action

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments
	The amendments to the claims filed 08/25/2025 have been acknowledged, accepted, and entered. Previously claims 1 – 20 were pending. Claims 1, 5, 11, 15, and 19 have been amended and now claims 1 – 20 are still pending.

Response to Arguments
Regarding Applicant’s Arguments/Remarks, filed on 08/25/2025, the applicant states that the art “Ryoo” is not eligible to be cited as prior art and attached a declaration in the form of “Affidavit-Rule 130(a)-America Invents Act (First Inventor to File) Only” filed on 08/25/2025. However, this declaration fails to provide a reasonable explanation as to why other people (the other people being “Ryoo”) are named as authors and inventors in the prior art, the prior art being Ryoo (Self-supervised Video Transformer). Please see MPEP section 2155.01 and section 717.01(a)(1) for further explanation and examples. As noted below, the declaration is ineffective; therefore, the independent claims have been updated to incorporate the teachings of Ryoo, as necessitated by the amendments.
The declaration under 37 CFR 1.130(a) filed 08/25/2025 is insufficient to overcome the rejection of claims 5 and 15 (now incorporated into the independent claims) based upon Caron in view of Qian and Xiong, further in view of Ryoo as set forth in the last Office action because the affidavit fails to provide reasonable explanation as to why other people (the other people being “Ryoo”) are named as authors and inventors in the prior art, the prior art being Ryoo (Self-supervised Video Transformer).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5, 11, 15, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Caron; Mathilde et al. (Emerging Properties in Self-Supervised Vision Transformers; as seen in IDS filed 11/21/2022; hereinafter simply referred to as Caron) in view of Qian; Rui et al. (Controllable Augmentations for Video Representation Learning; hereinafter simply referred to as Qian) and further in view of Xiong; Shaomin et al. (US 20220374635 A1; hereinafter simply referred to as Xiong) and further in view of Ryoo et al. (Self-supervised Video Transformer; hereinafter simply referred to as Ryoo).

Regarding independent claim 1, Caron teaches of: 

	A method of training a video transformer, using machine learning circuitry (The video transformer in the teachings of Caron uses training patches to train the video transformer “ViT”, which inherently uses machine learning circuitry as a transformer is a machine learning model, See Figure 1 Paragraph, Page 9651, Left Column, Section 1 Introduction, Paragraph 3, Abstract, and figure 1) (See Figure 1 Paragraph, “Self-attention from a Vision Transformer with 8 × 8 patches trained with no supervision.”);
	
sampling video clips from different spatiotemporal windows in local views (The transformer “ViT” takes as input (samples) video clips being contiguous images in several local views, the different spatiotemporal windows being the different images that make up a video which encompass a different time and space within the video, See Page 9653, Section 3 Approach, Subsection 3.2 Implementation and evaluation protocols, Paragraph 2 “Vision Transformer”, and Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation right column Paragraph 2) (See Page 9653, Section 3 Approach, Subsection 3.2 Implementation and evaluation protocols, Paragraph 2 “Vision Transformer”, and Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation right column Paragraph 2, “The ViT architecture takes as input a grid of non-overlapping contiguous image patches of resolution N × N.” … “adapt the problem in Eq. (2) to self-supervised learning. First, we construct different distorted views, or crops, of an image with multicrop strategy [9]. More precisely, from a given image, we generate a set V of different views. This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution”);
	
matching via the machine learning circuitry, the global and local views in a framework of student teacher network to learn cross-view correspondence between local and global views, (The global and local views are matched in a knowledge distillation framework with student and teacher networks to learn local-global correspondence (cross-view correspondence) between global and local views, See Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 1, 2, 3, and Paragraph “Teacher network”) (See Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 2, and 3, “This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution. All crops are passed through the student while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.” … “Knowledge distillation is a learning paradigm where we train a student network gθs to match the output of a given teacher network gθt , parameterized by θs and θt respectively. Given an input image x, both networks output probability distributions over K dimensions denoted by Ps and Pt.”).
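
For illustration only, the student/teacher matching of global and local views described in the passage quoted above can be sketched as follows. This is a minimal sketch, not the implementation of Caron or of the applicant: the function names, the temperature values, and the plain-list data layout are assumptions, and the centering and momentum-update steps of a full training loop are omitted.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over one logit vector."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits, temperature):
    """Numerically stable log of the temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]

def local_to_global_loss(student_logits, teacher_logits, num_global,
                         student_temp=0.1, teacher_temp=0.04):
    """Average cross-entropy between teacher and student outputs.

    student_logits: one logit vector per view, global views listed first.
    teacher_logits: one logit vector per global view only.
    Temperatures are assumed values chosen for this sketch.
    """
    total, pairs = 0.0, 0
    for t_idx in range(num_global):
        p_teacher = softmax(teacher_logits[t_idx], teacher_temp)
        for s_idx, logits in enumerate(student_logits):
            if s_idx == t_idx:
                continue  # a view is never matched against itself
            log_p_student = log_softmax(logits, student_temp)
            total += -sum(t * ls for t, ls in zip(p_teacher, log_p_student))
            pairs += 1
    return total / pairs
```

Every view passes through the student while only the global views pass through the teacher, so each local view is pulled toward the teacher's output for a global view. That is the "local-to-global" correspondence referred to in the quoted passage.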

	Caron does not explicitly disclose the use of human action recognition in a video, sampling video clips with varying temporal resolutions in global views, and learning motion correspondence between varying temporal resolutions.

	However, Qian teaches of human action recognition in a video (The framework of Qian is used for human action recognition in a video, See Page 9 Section 4 Experiment Subsection 4.1 Datasets, Paragraph 1, Abstract, Page 3, Section 1 Introduction Summary point 4, Page 9, Section 4 Experiments Subsection 4.2 Implementation details Paragraph “Action Recognition” and tables 1 and 2) (See Page 9 Section 4 Experiment Subsection 4.1 Datasets, Paragraph 1, “We use 4 video action recognition datasets”);
	
sampling video clips with varying temporal resolutions in global views (The global feature temporal resolution Tv as seen in figure 4 has varying values, See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4, “Two aspects were investigated, one is the number of local clips K, the other is the global video feature temporal resolution Tv, which is obtained by adjusting temporal convolution stride,”);
	
learn motion correspondence between varying temporal resolutions (The motion correspondence (spatiotemporal region correspondence) is found for varying temporal resolutions (Tv), See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4 , “By varying the number of local clips K from 1 to 4, we find that having more local clips tend to improve the performance due to more fine-grained feature alignment. And it is worth noting that when the ratio Tv/KTc < 1, the granularity of local-global correspondence becomes too coarse, which constricts the performance. Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching,”)

	As taught by Qian the use of sampling video clips with varying temporal resolutions in global views, and learning motion correspondence/spatio-temporal region correspondence between the varying temporal resolutions allows for significantly improved action recognition. (See Page 13 section 4.4 Ablation study first paragraph, “Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching, and significantly improves action recognition.”) As both the teachings of Caron and Qian deal with the technical field of processing different views of a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron with Qian to sample video clips with varying temporal resolutions in global views, and to learn motion correspondence/spatio-temporal region correspondence between the varying temporal resolutions in order to significantly improve human action recognition.
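
For illustration only, the claim language "sampling video clips with varying temporal resolutions in global views" and "from different spatiotemporal windows in local views" can be pictured with the following sketch of frame-index sampling. It is not the method of Qian (which obtains the global temporal resolution Tv by adjusting a temporal convolution stride) and not the applicant's implementation. The function names, frame counts, and window fractions are all hypothetical values chosen for this example.

```python
import random

def sample_clip(num_video_frames, num_clip_frames, window_fraction, rng):
    """Return increasing frame indices for one clip.

    window_fraction is the share of the video the clip spans
    (values near 1.0 give a global view, small values a local view).
    """
    window = max(num_clip_frames, int(num_video_frames * window_fraction))
    window = min(window, num_video_frames)
    start = rng.randint(0, num_video_frames - window)
    # Spread the requested number of frames evenly across the window.
    step = (window - 1) / (num_clip_frames - 1) if num_clip_frames > 1 else 0
    return [start + round(i * step) for i in range(num_clip_frames)]

def sample_views(num_video_frames, rng):
    """Two global views at different frame counts plus four short local views."""
    global_views = [sample_clip(num_video_frames, t, 0.9, rng) for t in (8, 16)]
    local_views = [sample_clip(num_video_frames, 4, 0.25, rng) for _ in range(4)]
    return global_views, local_views
```

Both global views cover the same 90% span of the video but at 8 and 16 frames, so they differ in temporal resolution. Each local view covers a short window at a random position.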

	Caron in view of Qian does not explicitly disclose displaying, via a display device, video clips  with the emphasis on the attention to the recognized human action.

	However, Xiong teaches of displaying, via a display device, video clips  with the emphasis on the attention to the recognized human action, (The video analysis subsystem, ‘156’, uses recognized human motion/action that triggers the displaying on a display device, (display subsystem ‘158’) video clips that are focused on the motion/action that triggered the video being displayed. See ¶ 45, Figures 1 and 4) (See ¶ 45, “For example, video analysis subsystem 156 may use motion, tripwire, object recognition, facial recognition, audio detection, speech recognition, and/or other algorithms to determine events occurring in a video stream and tag them in a corresponding metadata track and/or separate metadata table associated with the video data object. In some embodiments, video analysis subsystem 156 may include event handling logic for determining response to detection of one or more detected events, such as raising an alert to user device 170 or triggering selective display of a video stream including the detected event through video display subsystem 158.”)

	As taught by Xiong the use of displaying, via a display device, video clips with the emphasis on the attention to the recognized human action allows for the video clips to be displayed on a user device so they can see the video clips, (See ¶ 46, “For example, video display subsystem 158 may include a monitoring or display configuration for displaying one or more video streams in real-time or near real-time on a graphical user display of user device 170”) As both the teachings of Caron in view of Qian and Xiong deal with the technical field of determining human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian with Xiong to display the video clips with an emphasis on the human action in order for the user to be able to view the recognized human action video clips.

	Caron in view of Qian and Xiong does not explicitly disclose the motion and cross-view correspondence involve varying spatial and temporal resolutions, and the matching includes  a dynamic positional encoding of the video clips.

	However, Ryoo teaches that the motion and cross-view correspondences involve varying spatial and temporal resolutions (See Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1 wherein the motion and cross-view correspondences involve varying spatial and temporal resolutions)

and the matching includes a dynamic positional encoding of the video clips (See Page 2875 Left Column First Paragraph, and last 2 bullet points, Page 2878 Left Column Section 3.2.3 Dynamic Positional Embedding Paragraph 1,  Abstract, and Figure 1 wherein the matching includes a dynamic positional encoding of the video clips)

As taught by Ryoo the use of dynamic positional embedding allows for the model to process inputs of varying resolutions benefitting the tasks that require different sized inputs by being dynamic in nature. (See Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1 “This allows our single SVT model to process inputs of varying resolution while also giving the positional embedding a dynamic nature which is more suited for different sized inputs in the downstream tasks.”) As both the teachings of Caron in view of Qian and Xiong and Ryoo deal with the technical field of recognizing human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of  the claimed invention to combine the teachings of Caron in view of Qian and Xiong with Ryoo to have the motion and cross-view correspondences involve varying spatial and temporal resolutions and the matching includes a dynamic positional encoding of the video clips in order to have a model that is better suited to dealing with inputs of varying sizes.

Regarding dependent claim 5, Caron in view of Qian, Xiong, and Ryoo teaches: 

The varying spatial and temporal resolutions results in variable spatial and temporal input tokens, (The varying spatial and temporal resolutions of the motion and cross-view correspondences result in varying spatial and temporal input tokens, See Ryoo Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1)

 the dynamic positional encoding of the video clips includes using a separate positional encoding vector for spatial and temporal dimensions and fixing these vectors to a maximum resolution across each dimension; (Separate positional encoding vectors are used for the temporal and spatial dimensions and are fixed to a maximum resolution across each dimension, See Ryoo Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1)  

and varying the positional encoding vectors through interpolation to account for missing spatial or temporal tokens at lower frame rate or spatial size. (The positional encoding vectors are varied through interpolation to account for missing spatial or temporal tokens at lower frame rates or spatial size, See Ryoo Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1)
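
For illustration only, the three limitations of claim 5 (separate spatial and temporal positional encoding vectors, fixed to a maximum resolution, and interpolated for smaller inputs) can be sketched as follows. This is a simplified sketch under stated assumptions, not the implementation of Ryoo or of the applicant: the spatial positions are treated as a one-dimensional sequence, linear interpolation is used, the two encodings are combined by addition, and all function names are hypothetical.

```python
def interpolate_rows(table, new_len):
    """Linearly resample a list of embedding vectors to new_len rows."""
    old_len = len(table)
    if new_len == old_len:
        return [row[:] for row in table]
    if new_len == 1:
        return [table[0][:]]
    out = []
    for i in range(new_len):
        # Map the i-th output position onto the original index range.
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(table[lo], table[hi])])
    return out

def positional_encoding(spatial_table, temporal_table, num_patches, num_frames):
    """Build one encoding per (frame, patch) token.

    spatial_table and temporal_table are stored at the maximum resolution;
    smaller inputs reuse them through interpolation.
    """
    spatial = interpolate_rows(spatial_table, num_patches)
    temporal = interpolate_rows(temporal_table, num_frames)
    return [[[s + t for s, t in zip(spatial[p], temporal[f])]
             for p in range(num_patches)]
            for f in range(num_frames)]
```

A clip with fewer frames or a smaller spatial size therefore receives encodings interpolated from the same two maximum-resolution tables, rather than requiring a separate table per input size.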

Regarding independent claim 11, Caron teaches of: 

	Sampling video clips from different spatiotemporal windows in local views (The transformer “ViT” takes as input (samples) video clips being contiguous images in several local views, the different spatiotemporal windows being the different images that make up a video which encompass a different time and space within the video, See Page 9653, Section 3 Approach, Subsection 3.2 Implementation and evaluation protocols, Paragraph 2 “Vision Transformer”, and Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation right column Paragraph 2) (See Page 9653, Section 3 Approach, Subsection 3.2 Implementation and evaluation protocols, Paragraph 2 “Vision Transformer”, and Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation right column Paragraph 2, “The ViT architecture takes as input a grid of non-overlapping contiguous image patches of resolution N × N.” … “adapt the problem in Eq. (2) to self-supervised learning. First, we construct different distorted views, or crops, of an image with multicrop strategy [9]. More precisely, from a given image, we generate a set V of different views. This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution”);
	
matching via the machine learning circuitry, the global and local views in a framework of student teacher network to learn cross-view correspondence between local and global views, (The global and local views are matched in a knowledge distillation framework (machine learning framework that requires machine learning circuitry, which further requires processing circuitry) with student and teacher networks to learn local-global correspondence (cross-view correspondence) between global and local views, See Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 1, 2, 3, and Paragraph “Teacher network”, Page 9651, Section 1 Introduction, Left Column Paragraph 4 (last paragraph)) (See Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 2, and 3, Page 9651, Section 1 Introduction, Left Column Paragraph 4 (last paragraph), “This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution. All crops are passed through the student while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.” … “Knowledge distillation is a learning paradigm where we train a student network gθs to match the output of a given teacher network gθt , parameterized by θs and θt respectively. Given an input image x, both networks output probability distributions over K dimensions denoted by Ps and Pt.” … “In particular, training DINO with ViT takes just two 8-GPU servers over 3 days to achieve 76.1% on ImageNet linear benchmark,”).

	Caron does not explicitly disclose the use of a system for human action recognition in a video, sampling video clips with varying temporal resolutions in global views, and learning motion correspondence between varying temporal resolutions.

	However, Qian teaches of a system for human action recognition in a video (The framework of Qian is used for human action recognition in a video, See Page 9 Section 4 Experiment Subsection 4.1 Datasets, Paragraph 1, Abstract, Page 3, Section 1 Introduction Summary point 4, Page 9, Section 4 Experiments Subsection 4.2 Implementation details Paragraph “Action Recognition” and tables 1 and 2) (See Page 9 Section 4 Experiment Subsection 4.1 Datasets, Paragraph 1, “We use 4 video action recognition datasets”);
	sampling video clips with varying temporal resolutions in global views (The global feature temporal resolution Tv as seen in figure 4 has varying values, See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4, “Two aspects were investigated, one is the number of local clips K, the other is the global video feature temporal resolution Tv, which is obtained by adjusting temporal convolution stride,”);
	
learn motion correspondence between varying temporal resolutions (The motion correspondence (spatiotemporal region correspondence) is found for varying temporal resolutions (Tv), See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4 , “By varying the number of local clips K from 1 to 4, we find that having more local clips tend to improve the performance due to more fine-grained feature alignment. And it is worth noting that when the ratio Tv/KTc < 1, the granularity of local-global correspondence becomes too coarse, which constricts the performance. Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching,”)

	As taught by Qian the use of sampling video clips with varying temporal resolutions in global views, and learning motion correspondence/spatio-temporal region correspondence between the varying temporal resolutions allows for significantly improved action recognition. (See Page 13 section 4.4 Ablation study first paragraph, “Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching, and significantly improves action recognition.”) As both the teachings of Caron and Qian deal with the technical field of processing different views of a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron with Qian to sample video clips with varying temporal resolutions in global views, and to learn motion correspondence/spatio-temporal region correspondence between the varying temporal resolutions in order to significantly improve human action recognition.

	Caron in view of Qian does not explicitly disclose displaying, via a display device, video clips  with the emphasis on the attention to the recognized human action.

	However, Xiong teaches of displaying, via a display device, video clips  with the emphasis on the attention to the recognized human action, (The video analysis subsystem, ‘156’, uses recognized human motion/action that triggers the displaying on a display device, (display subsystem ‘158’) video clips that are focused on the motion/action that triggered the video being displayed. See ¶ 45, Figures 1 and 4) (See ¶ 45, “For example, video analysis subsystem 156 may use motion, tripwire, object recognition, facial recognition, audio detection, speech recognition, and/or other algorithms to determine events occurring in a video stream and tag them in a corresponding metadata track and/or separate metadata table associated with the video data object. In some embodiments, video analysis subsystem 156 may include event handling logic for determining response to detection of one or more detected events, such as raising an alert to user device 170 or triggering selective display of a video stream including the detected event through video display subsystem 158.”)

	As taught by Xiong the use of displaying, via a display device, video clips with the emphasis on the attention to the recognized human action allows for the video clips to be displayed on a user device so they can see the video clips, (See ¶ 46, “For example, video display subsystem 158 may include a monitoring or display configuration for displaying one or more video streams in real-time or near real-time on a graphical user display of user device 170”) As both the teachings of Caron in view of Qian and Xiong deal with the technical field of determining human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian with Xiong to display the video clips with an emphasis on the human action in order for the user to be able to view the recognized human action video clips.

Caron in view of Qian and Xiong does not explicitly disclose the motion and cross-view correspondence involve varying spatial and temporal resolutions, and the matching includes  a dynamic positional encoding of the video clips.

	However, Ryoo teaches that the motion and cross-view correspondences involve varying spatial and temporal resolutions (See Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1 wherein the motion and cross-view correspondences involve varying spatial and temporal resolutions)

and the matching includes a dynamic positional encoding of the video clips (See Page 2875 Left Column First Paragraph, and last 2 bullet points, Page 2878 Left Column Section 3.2.3 Dynamic Positional Embedding Paragraph 1,  Abstract, and Figure 1 wherein the matching includes a dynamic positional encoding of the video clips)

As taught by Ryoo the use of dynamic positional embedding allows for the model to process inputs of varying resolutions benefitting the tasks that require different sized inputs by being dynamic in nature. (See Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1 “This allows our single SVT model to process inputs of varying resolution while also giving the positional embedding a dynamic nature which is more suited for different sized inputs in the downstream tasks.”) As both the teachings of Caron in view of Qian and Xiong and Ryoo deal with the technical field of recognizing human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of  the claimed invention to combine the teachings of Caron in view of Qian and Xiong with Ryoo to have the motion and cross-view correspondences involve varying spatial and temporal resolutions and the matching includes a dynamic positional encoding of the video clips in order to have a model that is better suited to dealing with inputs of varying sizes.

Regarding dependent claim 15, claim 15 is a system claim corresponding to claim 5. Please see the discussion of claim 5 above.

Regarding independent claim 19, Caron teaches of: 

	Sampling video clips from different spatiotemporal windows in local views (The transformer “ViT” takes as input (samples) video clips being contiguous images in several local views, the different spatiotemporal windows being the different images that make up a video which encompass a different time and space within the video, See Page 9653, Section 3 Approach, Subsection 3.2 Implementation and evaluation protocols, Paragraph 2 “Vision Transformer”, and Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation right column Paragraph 2) (See Page 9653, Section 3 Approach, Subsection 3.2 Implementation and evaluation protocols, Paragraph 2 “Vision Transformer”, and Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation right column Paragraph 2, “The ViT architecture takes as input a grid of non-overlapping contiguous image patches of resolution N × N.” … “adapt the problem in Eq. (2) to self-supervised learning. First, we construct different distorted views, or crops, of an image with multicrop strategy [9]. More precisely, from a given image, we generate a set V of different views. This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution”);
	
matching via the machine learning circuitry, the global and local views in a framework of student teacher network to learn cross-view correspondence between local and global views, (The global and local views are matched in a knowledge distillation framework (machine learning framework requiring a machine learning engine) with student and teacher networks to learn local-global correspondence (cross-view correspondence) between global and local views, See Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 1, 2, 3, and Paragraph “Teacher network”) (See Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 2, and 3, “This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution. All crops are passed through the student while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.” … “Knowledge distillation is a learning paradigm where we train a student network gθs to match the output of a given teacher network gθt , parameterized by θs and θt respectively. Given an input image x, both networks output probability distributions over K dimensions denoted by Ps and Pt.”).

	Caron does not explicitly disclose the use of human action recognition in a video, sampling video clips with varying temporal resolutions in global views, and learning motion correspondence between varying temporal resolutions.

	However, Qian teaches of human action recognition in a video (The framework of Qian is used for human action recognition in a video, See Page 9 Section 4 Experiment Subsection 4.1 Datasets, Paragraph 1, Abstract, Page 3, Section 1 Introduction Summary point 4, Page 9, Section 4 Experiments Subsection 4.2 Implementation details Paragraph “Action Recognition” and tables 1 and 2) (See Page 9 Section 4 Experiment Subsection 4.1 Datasets, Paragraph 1, “We use 4 video action recognition datasets”);
	
sampling video clips with varying temporal resolutions in global views (The global feature temporal resolution Tv as seen in figure 4 has varying values, See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4, “Two aspects were investigated, one is the number of local clips K, the other is the global video feature temporal resolution Tv, which is obtained by adjusting temporal convolution stride,”);
	
learn motion correspondence between varying temporal resolutions (The motion correspondence (spatiotemporal region correspondence) is found for varying temporal resolutions (Tv), See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4 , “By varying the number of local clips K from 1 to 4, we find that having more local clips tend to improve the performance due to more fine-grained feature alignment. And it is worth noting that when the ratio Tv/KTc < 1, the granularity of local-global correspondence becomes too coarse, which constricts the performance. Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching,”)

	As taught by Qian the use of sampling video clips with varying temporal resolutions in global views, and learning motion correspondence/spatio-temporal region correspondence between the varying temporal resolutions allows for significantly improved action recognition. (See Page 13 section 4.4 Ablation study first paragraph, “Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching, and significantly improves action recognition.”) As both the teachings of Caron and Qian deal with the technical field of processing different views of a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron with Qian to sample video clips with varying temporal resolutions in global views, and to learn motion correspondence/spatio-temporal region correspondence between the varying temporal resolutions in order to significantly improve human action recognition.

	Caron in view of Qian does not explicitly disclose displaying, via a display device, video clips  with the emphasis on the attention to the recognized human action, a non-transitory computer readable storage medium storing program code, which when executed by a computer having a CPU, perform a method.

	However, Xiong teaches of displaying, via a display device, video clips  with the emphasis on the attention to the recognized human action, (The video analysis subsystem, ‘156’, uses recognized human motion/action that triggers the displaying on a display device, (display subsystem ‘158’) video clips that are focused on the motion/action that triggered the video being displayed. See ¶ 45, Figures 1 and 4) (See ¶ 45, “For example, video analysis subsystem 156 may use motion, tripwire, object recognition, facial recognition, audio detection, speech recognition, and/or other algorithms to determine events occurring in a video stream and tag them in a corresponding metadata track and/or separate metadata table associated with the video data object. In some embodiments, video analysis subsystem 156 may include event handling logic for determining response to detection of one or more detected events, such as raising an alert to user device 170 or triggering selective display of a video stream including the detected event through video display subsystem 158.”)
	
a non-transitory computer readable storage medium storing program code, which when executed by a computer having a CPU, perform a method (The hard disk drive storage (non-transitory computer readable storage medium) store programs that are executed by a CPU to perform a method, See ¶ 36 and 37) (See ¶ 36 and 37, “In some embodiments, storage device 140 may include one or more hard disk drives.” … “In some embodiments, each storage device 140 includes a device controller 144, which includes one or more processing units (also sometimes called CPUs or processors or microprocessors or microcontrollers) configured to execute instructions in one or more programs.”)

	As taught by Xiong the use of displaying, via a display device, video clips with the emphasis on the attention to the recognized human action allows for the video clips to be displayed on a user device so that the user can see the video clips. (See ¶ 46, “For example, video display subsystem 158 may include a monitoring or display configuration for displaying one or more video streams in real-time or near real-time on a graphical user display of user device 170”) As both the teachings of Caron in view of Qian and Xiong deal with the technical field of determining human action in a video, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian with Xiong to display the video clips with an emphasis on the human action in order for the user to be able to view the recognized human action video clips.

Caron in view of Qian and Xiong does not explicitly disclose that the motion and cross-view correspondences involve varying spatial and temporal resolutions, and that the matching includes a dynamic positional encoding of the video clips.

	However, Ryoo teaches that the motion and cross-view correspondences involve varying spatial and temporal resolutions (See Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1 wherein the motion and cross-view correspondences involve varying spatial and temporal resolutions)

and the matching includes a dynamic positional encoding of the video clips (See Page 2875 Left Column First Paragraph, and last 2 bullet points, Page 2878 Left Column Section 3.2.3 Dynamic Positional Embedding Paragraph 1,  Abstract, and Figure 1 wherein the matching includes a dynamic positional encoding of the video clips)

As taught by Ryoo the use of dynamic positional embedding allows for the model to process inputs of varying resolutions benefitting the tasks that require different sized inputs by being dynamic in nature. (See Page 2868, Section 3 Self-supervised Video Transformer, Subsection 3.2.3 Dynamic Positional Embedding Paragraph 1 “This allows our single SVT model to process inputs of varying resolution while also giving the positional embedding a dynamic nature which is more suited for different sized inputs in the downstream tasks.”) As both the teachings of Caron in view of Qian and Xiong and Ryoo deal with the technical field of recognizing human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of  the claimed invention to combine the teachings of Caron in view of Qian and Xiong with Ryoo to have the motion and cross-view correspondences involve varying spatial and temporal resolutions and the matching includes a dynamic positional encoding of the video clips in order to have a model that is better suited to dealing with inputs of varying sizes.

Claims 2, 3, 12, 13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Caron; Mathilde et al. (Emerging Properties in Self-Supervised Vision Transformers; hereinafter simply referred to as Caron) in view of Qian; Rui et al. (Controllable Augmentations for Video Representation Learning; hereinafter simply referred to as Qian) further in view of Xiong; Shaomin et al. (US 20220374635 A1; hereinafter simply referred to as Xiong) further in view of Ryoo et al. (Self-supervised Video Transformer; hereinafter simply referred to as Ryoo) and further in view of Bi; Jiarui et al. (Transformer in Computer Vision; hereinafter simply referred to as Bi).

Regarding dependent claim 2, Caron in view of Qian, Xiong, and Ryoo teaches:

	The matching comprising of randomly selecting one global view and passing the selected global view through the teacher network to generate a target; passing other global and local views through the student network to learn the cross - view correspondences (The cross-view correspondence or “local to global correspondence” is learned via the passing the global views through the teacher network and passing the global and local views through the student network, See Caron Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 1, 2, and 3) (See Caron Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 2, and 3 “This set contains two global views, x g 1 and x g 2 and several local views of smaller resolution. All crops are passed through the student while only the global views are passed through the teacher, therefore encouraging “local-to-global” correspondences.”)  and motion correspondences (The motion correspondence or “spatiotemporal region correspondence” as learned by Qian also incorporates local and global views that can be used in the student teacher network as taught by Caron to learn the motion correspondence, See Qian  Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4) (See Qian Page 12 and 13 section 4.4 Ablation study first paragraph and table 4 paragraph and table 4 , “By varying the number of local clips K from 1 to 4, we find that having more local clips tend to improve the performance due to more fine-grained feature alignment. And it is worth noting that when the ratio Tv/KTc < 1, the granularity of local-global correspondence becomes too coarse, which constricts the performance. Overall, accurate spatio-temporal region correspondence does provide reliable reference for appearance and motion pattern matching,”)
	updating student network weight parameter values by matching the student local and global views to the target generated by the teacher network; (The student weight parameters are matched to the target values as set by the teacher network that set target feature values of a higher quality, See Caron Page 9653, Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation, Paragraph “Teacher Network”) (See Caron Page 9653, Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation, Paragraph “Teacher Network”, “Unlike knowledge distillation, we do not have a teacher gθt given a priori and hence, we build it from past iterations of the student network. We study different update rules for the teacher in Appendix and show that freezing the teacher network over an epoch works surprisingly well in our framework, while copying the student weight for the teacher fails to converge. Of particular interest, using an exponential moving average (EMA) on the student weights, i.e., a momentum encoder [26], is particularly well suited for our framework. The update rule is θt ← λθt + (1 − λ)θs, with λ following a cosine schedule from 0.996 to 1 during training [23]. Originally the momentum encoder has been introduced as a substitute for a queue in contrastive learning [26]. However, in our framework, its role differs since we do not have a queue nor a contrastive loss, and may be closer to the role of the mean teacher used in self-training [52]. Indeed, we observe that this teacher performs a form of model ensembling similar to Polyak-Ruppert averaging with an exponential decay [41, 48]. Using Polyak-Ruppert averaging for model ensembling is a standard practice to improve the performance of a model [31]. We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality.”)
	predicting target features using multilayer perceptron (The output of the projection head consisting of a multilayer perceptron is the outputted features, See Caron Page 9653, Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation, Paragraph “Network Architecture”) (See Caron Page 9653, Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation, Paragraph “Network Architecture”, “The neural network g is composed of a backbone f (ViT [16] or ResNet [27]), and of a projection head h: g = h ◦ f. The features used in downstream tasks are the backbone f output. The projection head consists of a 3-layer multi-layer perceptron (MLP) with hidden dimension 2048 followed by ℓ2 normalization and a weight normalized fully connected layer [50] with K dimensions,”)

	Caron in view of Qian, Xiong, and Ryoo does not explicitly disclose the use of a video transformer having separate space-time attention for predicting target features.

	However, Bi teaches of a video transformer that uses separate space-time attention for predicting target features (The video transformer has separate space-time attention used in action detection which inherently requires target features to be predicted to predict the action being done, See Page 182, Section “TimeSformer” and figure 10) (See Page 182, Section “TimeSformer” and figure 10, “Inspired by ViT, Bertasius et al. [11] developed a complete transformer-based video classification model that includes no convolutional operation, and they call their model ‘TimeSformer’. Experimenting with various time-space transformer designs concluded that the “Divided Space-Time Attention” model (shown in Figure 10) exhibits the best outcome regarding the accuracy and computational cost. This “Divided Space-Time Attention” model first performs self-attention on all the patches at the same special location in the temporal domain.” … “Experiments suggest that the TimeSformer model outperforms all traditional action detectors on Kinetics-400, Kinetics-600, Diving-48, and HowTo100M, especially in long-term video modeling, which makes it one of the best long-term video action detector in the field.”)

	As taught by Bi, the use of the TimeSformer with its separate space-time attention requires less training as compared to other 3D Convolutional neural networks, and outperforms other traditional action detectors. (See Page 182, Section “TimeSformer” and figure 10, “Experiments suggest that the TimeSformer model outperforms all traditional action detectors on Kinetics-400, Kinetics-600, Diving-48, and HowTo100M, especially in long-term video modeling, which makes it one of the best long-term video action detector in the field. In addition, this outcome can be achieved with less training time comparing to 3D CNNs like SlowFast and I3D.”). As both the teachings of Caron in view of Qian, Xiong, and Ryoo and Bi deal with the technical field of determining human action in a video with a transformer it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, and Ryoo with Bi to use a video transformer having separate space-time attention for predicting target features in order to have an implementation that requires less training and outperforms other traditional action detectors.

Regarding dependent claim 3, Caron in view of Qian, Xiong, Ryoo and Bi teaches:
	
	The student network processes the local and global views to produce feature vectors, and the feature vectors are matched to the target through a loss consisting of a motion correspondence loss and a cross-view correspondence loss. (Loss functions are used in both the learning of the cross-view correspondence and motion correspondence which compare the local and global views to the target, See Caron Figure 2 Paragraph, Page 9652, Section 3 Approach, Subsection 3.1 SSL with Knowledge distillation, Paragraphs 2, 3 and 4 and Qian Table 7 paragraph) (See Caron Figure 2 Paragraph, “We illustrate DINO in the case of one single pair of views (x1, x2) for simplicity. The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each networks outputs a K dimensional feature that is normalized with a temperature softmax over the feature dimension. Their similarity is then measured with a cross-entropy loss.”) (See Qian Table 7 paragraph, “Ablation study on all learning objectives. Note that Lnce is the standard contrastive loss function in previous works.”)
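For illustration only, the similarity measure described in the quoted Caron Figure 2 paragraph (teacher output centered, both outputs normalized with a temperature softmax, similarity measured with a cross-entropy loss) can be sketched as follows. This is a minimal sketch and not code from any cited reference; the function names and the temperature values 0.1 and 0.04 are assumptions chosen for the example.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log of a temperature softmax over one feature vector."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]

def cross_entropy_loss(student_logits, teacher_logits, center,
                       student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered teacher output and the student output.

    The temperatures are hypothetical illustrative values, not taken from the
    office action or the quoted passages.
    """
    # Center the teacher output (the quoted passage uses a mean over the batch).
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))
```

Under these assumptions, a student output that agrees with the teacher yields a loss near zero, while a student output that favors a different dimension yields a larger loss.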

Regarding dependent claim 12, claim 12 is a system claim corresponding to claim 2. Please see the discussion of claim 2 above.

Regarding dependent claim 13, claim 13 is a system claim corresponding to claim 3. Please see the discussion of claim 3 above.

Regarding dependent claim 20, claim 20 is a non-transitory computer readable storage claim corresponding to claim 2. Please see the discussion of claim 2 above. Furthermore, Xiong teaches of a non-transitory computer readable storage medium, (The hard disk drive storage (non-transitory computer readable storage medium) store programs that are executed by a CPU to perform a method, See Xiong ¶ 36 and 37) (See Xiong ¶ 36 and 37, “In some embodiments, storage device 140 may include one or more hard disk drives.” … “In some embodiments, each storage device 140 includes a device controller 144, which includes one or more processing units (also sometimes called CPUs or processors or microprocessors or microcontrollers) configured to execute instructions in one or more programs.”)

Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Caron; Mathilde et al. (Emerging Properties in Self-Supervised Vision Transformers; hereinafter simply referred to as Caron) in view of Qian; Rui et al. (Controllable Augmentations for Video Representation Learning; hereinafter simply referred to as Qian) further in view of Xiong; Shaomin et al. (US 20220374635 A1; hereinafter simply referred to as Xiong) further in view of Ryoo et al. (Self-supervised Video Transformer; hereinafter simply referred to as Ryoo) and further in view of Chen; Hanting et al. (Learning Student Networks via Feature Embedding; hereinafter simply referred to as Chen).

Regarding dependent claim 4, Caron in view of Qian, Xiong, and Ryoo teaches:
	
	During each training step, the method includes updating weight parameter values of the student network while updating weight parameter values of the teacher as an exponential moving average of the student weight parameter values. (An exponential moving average (EMA) of the weight parameters is used for the updating of the weight parameters with regards to the teacher as an exponential moving average of the student weight parameter values, See Caron Page 9653, Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation, Paragraph “Teacher Network”) (See Caron Page 9653, Section 3 Approach, Subsection 3.1 SSL with Knowledge Distillation, Paragraph “Teacher Network”, “Unlike knowledge distillation, we do not have a teacher gθt given a priori and hence, we build it from past iterations of the student network. We study different update rules for the teacher in Appendix and show that freezing the teacher network over an epoch works surprisingly well in our framework, while copying the student weight for the teacher fails to converge. Of particular interest, using an exponential moving average (EMA) on the student weights, i.e., a momentum encoder [26], is particularly well suited for our framework. The update rule is θt ← λθt + (1 − λ)θs, with λ following a cosine schedule from 0.996 to 1 during training [23]. Originally the momentum encoder has been introduced as a substitute for a queue in contrastive learning [26]. However, in our framework, its role differs since we do not have a queue nor a contrastive loss, and may be closer to the role of the mean teacher used in self-training [52]. Indeed, we observe that this teacher performs a form of model ensembling similar to Polyak-Ruppert averaging with an exponential decay [41, 48]. Using Polyak-Ruppert averaging for model ensembling is a standard practice to improve the performance of a model [31]. We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality.”)
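For illustration only, the update rule quoted from Caron, θt ← λθt + (1 − λ)θs with λ following a cosine schedule from 0.996 to 1, can be sketched as follows. This is a minimal sketch under stated assumptions: weights are represented as flat lists of floats, and the function names and the half-cosine form of the schedule are assumptions made for the example rather than details taken from the reference.

```python
import math

def cosine_momentum(step, total_steps, start=0.996, end=1.0):
    """Momentum value rising from `start` to `end` along a half cosine (assumed form)."""
    progress = step / max(1, total_steps)
    return end - (end - start) * (math.cos(math.pi * progress) + 1.0) / 2.0

def ema_update(teacher_weights, student_weights, lam):
    """Apply theta_t <- lam * theta_t + (1 - lam) * theta_s element-wise."""
    return [lam * t + (1.0 - lam) * s
            for t, s in zip(teacher_weights, student_weights)]

# Hypothetical usage: the student weights would be changed by the optimizer
# at each step, and the teacher then tracks them through the moving average.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for step in range(3):
    teacher = ema_update(teacher, student, cosine_momentum(step, 3))
```

With λ close to 1 the teacher changes slowly, which is consistent with the quoted description of the teacher as an average over past iterations of the student.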

	Caron in view of Qian, Xiong, and Ryoo does not explicitly disclose the use of backpropagation for the updating of student weight parameters.

	However, Chen teaches of updating of a student network weight parameters with the use of backpropagation (Backpropagation is used to train the student networks and the results of the backpropagation are used to update the weight parameters in the step of gradient descent, See Page 30 Section 5, Experiments, Subsection 2 Compression results Paragraph 1, and algorithm 1, and table 1, and figure 2) (See Page 30 Section 5, Experiments, Subsection 2 Compression results Paragraph 1, “Table I reports the results of different networks on the MNIST data set by exploiting the proposed method. In order to illustrate the advantage of the introduced locality preserving loss, the performance of student networks with the same architecture trained by using standard backpropagation, KD [19], and FitNets [21] was also reported. It can be found in Table I that the student network trained using the standard backpropagation achieved a 1.90% error rate.”)

	As taught by Chen the use of the backpropagation for training the student network can achieve a low error rate below 2 percent (See Page 30 Section 5, Experiments, Subsection 2 Compression results Paragraph 1 “It can be found in Table I that the student network trained using the standard backpropagation achieved a 1.90% error rate.”) As both the teachings of Caron in view of Qian, Xiong, and Ryoo and Chen deal with student teacher knowledge distillation networks it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, and Ryoo with Chen to update a student network weight parameters with the use of backpropagation in order to achieve a low error rate.

Regarding dependent claim 14, claim 14 is a system claim corresponding to claim 4. Please see the discussion of claim 4 above.
	
Claims 6, 7, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Caron; Mathilde et al. (Emerging Properties in Self-Supervised Vision Transformers; hereinafter simply referred to as Caron) in view of Qian; Rui et al. (Controllable Augmentations for Video Representation Learning; hereinafter simply referred to as Qian) further in view of Xiong; Shaomin et al. (US 20220374635 A1; hereinafter simply referred to as Xiong) further in view of Ryoo et al. (Self-supervised Video Transformer; hereinafter simply referred to as Ryoo) and further in view of Ueda; Junko et al. (US 20210150220 A1; hereinafter simply referred to as Ueda).

Regarding dependent claim 6, Caron in view of Qian, Xiong, and Ryoo teaches:

	Displaying, via a display device selected video clips in which the predetermined human actions are recognized with the emphasis on the attention to the predetermined human action, (The video analysis subsystem, ‘156’, uses predetermined human motion/action that triggers the displaying on a display device, (display subsystem ‘158’) video clips that are focused on the motion/action that triggered the video being displayed. See Xiong ¶ 45, Figures 1 and 4) (See Xiong, ¶ 45, “For example, video analysis subsystem 156 may use motion, tripwire, object recognition, facial recognition, audio detection, speech recognition, and/or other algorithms to determine events occurring in a video stream and tag them in a corresponding metadata track and/or separate metadata table associated with the video data object. In some embodiments, video analysis subsystem 156 may include event handling logic for determining response to detection of one or more detected events, such as raising an alert to user device 170 or triggering selective display of a video stream including the detected event through video display subsystem 158.”)

	Caron in view of Qian, Xiong, and Ryoo does not explicitly disclose the analyzing and recognizing of predetermined human action in a video of a sporting event.

	However, Ueda teaches of the analyzing and recognizing of predetermined human action in a video of a sporting event, (The action determiner ‘103’, is able to analyze and recognize an “action type” or human action from a predetermined set of action types, via the analysis of moving image frames (video), See ¶ 9, 38, 40, 47, and 53, and figure 1) (See ¶ 47, 9, “Note that action determiner 103 may determine the action type on the basis of the trajectory change of the ball, the three-dimensional position and the speed of the ball, the rule of the ball game and the like. Action types of volleyball include serve, reception, dig, tossing, attack, and block. For example, when the trajectory of the ball that is detected first after the start of the analysis has a movement component of the Y-axis direction (the long side direction of the court illustrated in FIG. 1) and the speed component of the ball in the Y-axis direction is within a predetermined range, action determiner 103 determines that the action type is “serve”. For another example, when the trajectory of the ball after “serve” extends across the coordinate of net 11 in the Y axis, and the three-dimensional position of the ball has changed from a downward movement to an upward movement (i.e., the change in the coordinate in the Z-axis direction has become plus), action determiner 103 determines that the action type is “reception”. In the rule of volleyball, an action of receiving “serve” is “reception”, and thus “reception” and “dig” can be distinguished from each other by making a determination based on that rule.” … “A ball game image analysis apparatus according to an aspect of the present disclosure is configured to analyze an image of a ball game, the ball game image analysis apparatus including an image receptor configured to receive a plurality of moving image frames of the ball game captured by a plurality of cameras located at different positions;”)
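For illustration only, the two rules quoted from Ueda ¶ 47 (a first detected trajectory with a Y-axis speed component within a predetermined range is a “serve”; a ball that crosses the net coordinate after a “serve” and changes from downward to upward movement is a “reception”) can be sketched as follows. This is a minimal sketch and not code from the reference; the function names, argument names, and numeric values are hypothetical.

```python
def is_serve(is_first_trajectory, speed_y, speed_range):
    """Rule 1: first detected trajectory with Y-axis speed inside a predetermined range."""
    low, high = speed_range
    return is_first_trajectory and speed_y != 0 and low <= abs(speed_y) <= high

def is_reception(previous_action, y_start, y_end, net_y, vz_before, vz_after):
    """Rule 2: after a serve, the ball crosses the net and turns from downward to upward."""
    # The trajectory crosses the net if its end points lie on opposite sides of net_y.
    crossed_net = (y_start - net_y) * (y_end - net_y) < 0
    # Vertical (Z-axis) movement changes sign from negative to positive.
    turned_upward = vz_before < 0 < vz_after
    return previous_action == "serve" and crossed_net and turned_upward
```

For example, with an assumed predetermined range of 5 to 30 for the Y-axis speed, `is_serve(True, 15.0, (5.0, 30.0))` is true, and a later trajectory that starts at y = -8, ends at y = 4 with the net at y = 0, and turns upward satisfies `is_reception`.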

	As taught by Ueda the use of analyzing and recognizing predetermined actions in video sports clips allows for the predetermined actions to be tracked and further displayed to the user in the form of statistics, (Tracking the predetermined human actions via recognition from video clips allows for a user to see who performed a certain action, where they performed it and how they performed it, which provides analysis for the user to see the statistics of the human actions, See ¶ 44 and Figure 3) (See ¶ 44, and Figure 3, “as illustrated in FIG. 3. In this manner, with analysis result information 205, the user or other apparatuses can know that the player (actor) of the uniform number “14” has performed “attack” of a speed “ST (km/h)” for the ball at the three-dimensional position (x.sub.T, y.sub.T, z.sub.T) at frame time T.”) As both the teachings of Caron in view of Qian, Xiong, and Ryoo and Ueda deal with the technical field of recognizing human action in a video, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, and Ryoo with Ueda to analyze and recognize predetermined human action in a video of a sporting event in order to track the human actions and display them to a user as statistics for further analysis.

Regarding dependent claim 7, Caron in view of Qian, Xiong, and Ryoo, and Ueda teaches:

	The predetermined actions include ball transfer actions that are recognized from the video clips, (The ball transfer actions are the actions such as “serve” and “dig” which transfer a ball in a game of volleyball, which are recognized from video clips (moving image frames), See Ueda ¶ 47, 9) (See Ueda ¶ 47, 9, “Note that action determiner 103 may determine the action type on the basis of the trajectory change of the ball, the three-dimensional position and the speed of the ball, the rule of the ball game and the like. Action types of volleyball include serve, reception, dig, tossing, attack, and block. For example, when the trajectory of the ball that is detected first after the start of the analysis has a movement component of the Y-axis direction (the long side direction of the court illustrated in FIG. 1) and the speed component of the ball in the Y-axis direction is within a predetermined range, action determiner 103 determines that the action type is “serve”. For another example, when the trajectory of the ball after “serve” extends across the coordinate of net 11 in the Y axis, and the three-dimensional position of the ball has changed from a downward movement to an upward movement (i.e., the change in the coordinate in the Z-axis direction has become plus), action determiner 103 determines that the action type is “reception”. In the rule of volleyball, an action of receiving “serve” is “reception”, and thus “reception” and “dig” can be distinguished from each other by making a determination based on that rule.” … “A ball game image analysis apparatus according to an aspect of the present disclosure is configured to analyze an image of a ball game, the ball game image analysis apparatus including an image receptor configured to receive a plurality of moving image frames of the ball game captured by a plurality of cameras located at different positions;”)
	displaying via a display device video clips that emphasize attention to the ball transfer actions (The video analysis subsystem, ‘156’, uses predetermined human motion/action that triggers the displaying on a display device, (display subsystem ‘158’) video clips that are focused on the motion/action that triggered the video being displayed, (the motion/actions that trigger the displaying being the ball transfer actions as taught by Ueda, See Ueda ¶ 47),  See Xiong ¶ 45, Figures 1 and 4) (See Xiong, ¶ 45, “For example, video analysis subsystem 156 may use motion, tripwire, object recognition, facial recognition, audio detection, speech recognition, and/or other algorithms to determine events occurring in a video stream and tag them in a corresponding metadata track and/or separate metadata table associated with the video data object. In some embodiments, video analysis subsystem 156 may include event handling logic for determining response to detection of one or more detected events, such as raising an alert to user device 170 or triggering selective display of a video stream including the detected event through video display subsystem 158.”)

Regarding dependent claim 16, claim 16 is a system claim corresponding to claim 6. Please see the discussion of claim 6 above.

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Caron; Mathilde et al. (Emerging Properties in Self-Supervised Vision Transformers; hereinafter simply referred to as Caron) in view of Qian; Rui et al. (Controllable Augmentations for Video Representation Learning; hereinafter simply referred to as Qian) further in view of Xiong; Shaomin (US 20220374635 A1; hereinafter simply referred to as Xiong) further in view of Ryoo et al. (Self-supervised Video Transformer; hereinafter simply referred to as Ryoo) further in view of Ueda; Junko (US 20210150220 A1; hereinafter simply referred to as Ueda), and further in view of Lee; Keng Fai (US 11380100 B2; hereinafter simply referred to as Lee).

Regarding dependent claim 8, Caron in view of Qian, Xiong, and Ryoo, and Ueda teaches:
	
	Generating via processing circuitry, statistics for players in the sporting event based on the recognized ball transfer actions, (The recognized ball transfer action, such as “attack”, are associated with a player/actor such as “14”, in order to form statistics, using a processor, See Ueda ¶ 44, 114, Figure 3) (See Ueda ¶ 44, Figure 3, “Result outputter 108 generates analysis result information 205 by correlating ball trajectory information 202, action information 203, and actor information 204, and stores it in storage 109. For example, in the case where action frame time T and an action type “attack” are correlated with each other in action information 203, and action frame time T and the uniform number “14” of the actor are correlated with each other in actor information 204, result outputter 108 generates analysis result information 205 as illustrated in FIG. 3. Specifically, result outputter 108 generates analysis result information 205 in which the action type “attack” and the uniform number “14” of the actor are correlated with each other for frame time T of ball trajectory information 202 as illustrated in FIG. 3. In this manner, with analysis result information 205, the user or other apparatuses can know that the player (actor) of the uniform number “14” has performed “attack” of a speed “ST (km/h)” for the ball at the three-dimensional position (x.sub.T, y.sub.T, z.sub.T) at frame time T.” … “In addition, the way of achieving the integrated circuit is not limited to LSIs, and may also be achieved with a dedicated circuit or a general-purpose processor.”)
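For illustration only, the correlation described in the quoted Ueda ¶ 44 (action information and actor information joined on a shared action frame time) can be sketched as follows. This is a minimal sketch and not code from the reference; the dictionary layout and function names are hypothetical, and the per-player tally is an assumed example of a statistic rather than something the quoted passage spells out.

```python
from collections import Counter

def build_analysis_results(action_info, actor_info):
    """Join action types and uniform numbers that share the same frame time."""
    return [
        {"frame_time": t, "action": action, "uniform_number": actor_info[t]}
        for t, action in sorted(action_info.items())
        if t in actor_info
    ]

def count_actions_per_player(results):
    """Tally how often each (uniform number, action type) pair occurs."""
    return Counter((r["uniform_number"], r["action"]) for r in results)

# Hypothetical example data keyed by frame time.
action_info = {10: "attack", 20: "serve", 30: "attack"}
actor_info = {10: "14", 20: "7", 30: "14"}
results = build_analysis_results(action_info, actor_info)
stats = count_actions_per_player(results)
```

In this example the player with uniform number “14” is associated with two “attack” actions, which mirrors the kind of association shown in the quoted passage.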

	Caron in view of Qian, Xiong, and Ryoo and Ueda teaches of displaying the statistics as seen in Ueda figure 3, but does not explicitly disclose displaying on a mobile device.

	However, Lee teaches of displaying the generated statistics on a mobile device,  (See Lee Col 9, Lines 15-20, “FIG. 7 is a screen capture 700 of an additional detailed chart showing personal records by the player Colin Wan, according to one embodiment of the present invention. In implementations on a mobile device such as a smartphone, game statistics may be displayed on the screen in a scrollable fashion,”)

	As taught by Lee the use of displaying statistics on a mobile device allows for the statistics to be seen in a scrollable fashion, (See Lee Col 9, Lines 17-20, “In implementations on a mobile device such as a smartphone, game statistics may be displayed on the screen in a scrollable fashion,”) As both the teachings of Caron in view of Qian, Xiong, and Ryoo and Ueda and Lee deal with the technical field of recognizing human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, Ryoo and Ueda with Lee to display the generated statistics on a mobile device in order for the statistics to be available in a scrollable fashion.

Regarding dependent claim 17, Caron in view of Qian, Xiong, Ryoo, and Ueda teaches:

	The predetermined actions include ball transfer actions that are recognized from the video clips, (The ball transfer actions are the actions such as “serve” and “dig” which transfer a ball in a game of volleyball, which are recognized from video clips (moving image frames), See Ueda ¶ 47, 9) (See Ueda ¶ 47, 9, “Note that action determiner 103 may determine the action type on the basis of the trajectory change of the ball, the three-dimensional position and the speed of the ball, the rule of the ball game and the like. Action types of volleyball include serve, reception, dig, tossing, attack, and block. For example, when the trajectory of the ball that is detected first after the start of the analysis has a movement component of the Y-axis direction (the long side direction of the court illustrated in FIG. 1) and the speed component of the ball in the Y-axis direction is within a predetermined range, action determiner 103 determines that the action type is “serve”. For another example, when the trajectory of the ball after “serve” extends across the coordinate of net 11 in the Y axis, and the three-dimensional position of the ball has changed from a downward movement to an upward movement (i.e., the change in the coordinate in the Z-axis direction has become plus), action determiner 103 determines that the action type is “reception”. In the rule of volleyball, an action of receiving “serve” is “reception”, and thus “reception” and “dig” can be distinguished from each other by making a determination based on that rule.” … “A ball game image analysis apparatus according to an aspect of the present disclosure is configured to analyze an image of a ball game, the ball game image analysis apparatus including an image receptor configured to receive a plurality of moving image frames of the ball game captured by a plurality of cameras located at different positions;”)

	Generating via processing circuitry, statistics for players in the sporting event based on the recognized ball transfer actions, (The recognized ball transfer action, such as “attack”, are associated with a player/actor such as “14”, in order to form statistics, using a processor, See Ueda ¶ 44, 114, Figure 3) (See Ueda ¶ 44, Figure 3, “Result outputter 108 generates analysis result information 205 by correlating ball trajectory information 202, action information 203, and actor information 204, and stores it in storage 109. For example, in the case where action frame time T and an action type “attack” are correlated with each other in action information 203, and action frame time T and the uniform number “14” of the actor are correlated with each other in actor information 204, result outputter 108 generates analysis result information 205 as illustrated in FIG. 3. Specifically, result outputter 108 generates analysis result information 205 in which the action type “attack” and the uniform number “14” of the actor are correlated with each other for frame time T of ball trajectory information 202 as illustrated in FIG. 3. In this manner, with analysis result information 205, the user or other apparatuses can know that the player (actor) of the uniform number “14” has performed “attack” of a speed “ST (km/h)” for the ball at the three-dimensional position (x.sub.T, y.sub.T, z.sub.T) at frame time T.” … “In addition, the way of achieving the integrated circuit is not limited to LSIs, and may also be achieved with a dedicated circuit or a general-purpose processor.”)
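The correlation step quoted from Ueda ¶ 44 can be pictured as a join on frame time followed by a per-player count. The following is only an illustrative sketch, not Ueda's implementation: the function names, the dictionary layout, and the sample values are all assumptions introduced here.

```python
from collections import Counter

def correlate(action_info, actor_info):
    """Join action types and uniform numbers that share a frame time.

    Both arguments are hypothetical dicts keyed by frame time, standing in
    for "action information 203" and "actor information 204".
    """
    return [
        {"frame_time": t, "action": action, "uniform_number": actor_info[t]}
        for t, action in sorted(action_info.items())
        if t in actor_info
    ]

def player_statistics(results):
    """Count how often each (uniform number, action type) pair occurs."""
    return Counter((r["uniform_number"], r["action"]) for r in results)

# Made-up sample data for illustration only.
action_info = {10: "serve", 42: "attack", 57: "attack"}
actor_info = {10: "7", 42: "14", 57: "14"}

results = correlate(action_info, actor_info)
stats = player_statistics(results)
```

Here `stats[("14", "attack")]` gives the number of attacks attributed to uniform number "14" in the sample data.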

	Caron in view of Qian, Xiong, Ryoo and Ueda teaches of displaying the statistics as seen in Ueda figure 3, but does not explicitly disclose displaying on a mobile device.

	However, Lee teaches of displaying the generated statistics on a mobile device,  (See Lee Col 9, Lines 15-20, “FIG. 7 is a screen capture 700 of an additional detailed chart showing personal records by the player Colin Wan, according to one embodiment of the present invention. In implementations on a mobile device such as a smartphone, game statistics may be displayed on the screen in a scrollable fashion,”)

	As taught by Lee the use of displaying statistics on a mobile device allows for the statistics to be seen in a scrollable fashion, (See Lee Col 9, Lines 17-20, “In implementations on a mobile device such as a smartphone, game statistics may be displayed on the screen in a scrollable fashion,”) As both the teachings of Caron in view of Qian, Xiong, Ryoo and Ueda and Lee deal with the technical field of recognizing human action in a video it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, Ryoo and Ueda with Lee to display the generated statistics on a mobile device in order for the statistics to be available in a scrollable fashion.

Claims 9, 10, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Caron; Mathilde et al. (Emerging Properties in Self-Supervised Vision Transformers; hereinafter simply referred to as Caron) in view of Qian; Rui et al. (Controllable Augmentations for Video Representation Learning; hereinafter simply referred to as Qian) further in view of Xiong; Shaomin et al. (US 20220374635 A1; hereinafter simply referred to as Xiong) further in view of Ryoo et al. (Self-supervised Video Transformer; hereinafter simply referred to as Ryoo) further in view of Hioki; Toshikazu et al. (US 12221111 B2; hereinafter simply referred to as Hioki) and further in view of Felleisen; Juergen et al. (US 20230007914 A1; hereinafter simply referred to as Felleisen)

Regarding dependent claim 9, Caron in view of Qian, Xiong, and Ryoo does not teach:

	The video is captured by one or more cameras on a vehicle in motion and analyzing the video for a predetermined human motion wherein the predetermined human motion is a potential safety action and informing a vehicle control system of the potential safety action.

	However Hioki teaches of the video is captured by one or more cameras on a vehicle in motion and analyzing the video for a predetermined human motion wherein the predetermined human motion is a potential safety action and informing a vehicle control system of the potential safety action. (The video is captured by the use of camera ‘129A’ and is used to detect obstacles and humans in motion that may cause a safety action, if the motion is considered a safety risk, the safety system  ‘125’ is alerted and can cause the car to brake, indicating that the car is in motion, See Col 4, Lines 66-67, Col 5, Lines 1-9) (See Col 4, Lines 66-67, Col 5, Lines 1-9 “Active safety system 125 detects an obstacle (a pedestrian, a bicycle, a parked vehicle, a utility pole, or the like) in front or in the rear of the vehicle with the use of camera 129A and radar sensors 129B and 129C. Active safety system 125 determines whether or not vehicle 10 may collide with the obstacle based on a distance between vehicle 10 and the obstacle and a direction of movement of vehicle 10. Then, when active safety system 125 determines that there is possibility of collision, it outputs a braking command to brake system 121 through integrated control manager 115 so as to increase braking force of the vehicle.”) 

	As taught by Hioki, the use of recognizing the predetermined action as a potential safety action and informing a vehicle control system allows for the control system to take action such as braking. (The use of the safety system being informed of a potential safety action allows for the safety system to prevent collision by outputting a braking command, See Col 5, Lines 6-9) (See Col 5, Lines 6-9 “Then, when active safety system 125 determines that there is possibility of collision, it outputs a braking command to brake system 121 through integrated control manager 115 so as to increase braking force of the vehicle.”) As both the teachings of Caron in view of Qian, Xiong, and Ryoo and Hioki deal with the technical field of detecting human motion in a video, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, and Ryoo with Hioki to capture video by one or more cameras on a vehicle in motion, analyze the video for a predetermined human motion wherein the predetermined human motion is a potential safety action and inform a vehicle control system of the potential safety action in order to allow for a vehicle control system to take action and prevent collision by outputting a braking command.

	Caron in view of Qian, Xiong, Ryoo and Hioki does not explicitly disclose the cameras being in a vehicle.

	However, Felleisen teaches of the cameras being in the vehicle, (See ¶ 21, 24, “In some aspects of the disclosure, the principles and methods disclosed herein may be performed using one or more in-cabin sensors” … “These one or more in-cabin sensors may include, for instance, one or more cameras (e.g. image sensors, video cameras, depth cameras, etc.),”)

	As taught by Felleisen, in-vehicle cameras have a greater ability to perceive objects. (The use of in-vehicle sensors allows for better perception of objects, as cameras that are not in the vehicle often have limited ability to perceive objects, See ¶ 24) (See ¶ 24, “In many instances, it may be preferable to utilize in-vehicle sensors” … “This may be at least because the geometry of image sensors mounted on a side of the vehicle often have a limited ability to perceive objects (e.g. such as bicycles)”) As both the teachings of Caron in view of Qian, Xiong, Ryoo and Hioki and Felleisen deal with the technical field of determining human action from a video, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, Ryoo and Hioki with Felleisen to have the cameras in the car in order to have a greater ability to detect objects.

Regarding dependent claim 10, Caron in view of Qian, Xiong, Ryoo, Hioki, and Felleisen teaches:

	The predetermined human movement is a person riding a bicycle, wherein the predetermined movement is a potential safety action and the vehicle control system is informed of the potential safety action, (The detection of obstacles and humans in motion may cause a safety action, such as someone on a bike that may collide with the car, if the motion is considered a safety risk, the safety system  ‘125’ is alerted and can cause the car to brake, See Hioki Col 4, Lines 66-67, Col 5, Lines 1-9) (See Hioki, Col 4, Lines 66-67, Col 5, Lines 1-9 “Active safety system 125 detects an obstacle (a pedestrian, a bicycle, a parked vehicle, a utility pole, or the like) in front or in the rear of the vehicle with the use of camera 129A and radar sensors 129B and 129C. Active safety system 125 determines whether or not vehicle 10 may collide with the obstacle based on a distance between vehicle 10 and the obstacle and a direction of movement of vehicle 10. Then, when active safety system 125 determines that there is possibility of collision, it outputs a braking command to brake system 121 through integrated control manager 115 so as to increase braking force of the vehicle.”) 

Regarding dependent claim 18, Caron in view of Qian, Xiong, and Ryoo teaches:

	Uniformly sample two clips of the video, one with high spatial but low temporal resolution, and a second with low spatial but high temporal resolution; pass the two clips through a single network to generate two different feature vectors; combining the two feature vectors to obtain a joint vector (Two clips of the same video are sampled, one with high spatial but low temporal resolution and a second with low spatial but high temporal resolution in a “slow-fast inference” method which then passes the two clips through a network where they are jointly combined to obtain a joint vector, See Ryoo Figure 3 paragraph) (See Ryoo Figure 3 Paragraph, “we uniformly sample two clips of the same video at resolutions (8, 224, 224) and (64, 96, 96), pass through a shared network, and generate two different feature vectors (class tokens). These vectors are combined in a deterministic manner (with no learnable parameters), e.g. summation, to obtain a joint vector that is fed to the downstream task classifier.”)
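The sample, encode, and sum flow described in the quoted passage can be sketched as follows. This is a toy illustration under stated assumptions, not the method of the cited reference: `toy_encoder` is a hypothetical stand-in for the shared network, the video is a small nested list of grayscale frames, and the clip sizes are deliberately tiny rather than the resolutions quoted above.

```python
def uniform_indices(num_frames, num_samples):
    """Evenly spaced frame indices spanning the whole video."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

def sample_clip(video, num_samples, size):
    """Pick evenly spaced frames and downsample each to size x size
    by nearest-neighbour subsampling."""
    height, width = len(video[0]), len(video[0][0])
    rows = [r * height // size for r in range(size)]
    cols = [c * width // size for c in range(size)]
    return [
        [[video[t][r][c] for c in cols] for r in rows]
        for t in uniform_indices(len(video), num_samples)
    ]

def toy_encoder(clip):
    """Hypothetical shared network: maps a clip of any shape to a
    fixed-length feature vector (mean, max, min of all pixels)."""
    pixels = [p for frame in clip for row in frame for p in row]
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]

def slow_fast_features(video, encoder):
    """Encode two views with the same encoder and sum the vectors."""
    slow = sample_clip(video, num_samples=2, size=8)  # few frames, high spatial resolution
    fast = sample_clip(video, num_samples=8, size=4)  # many frames, low spatial resolution
    f_slow, f_fast = encoder(slow), encoder(fast)
    # Deterministic combination with no learnable parameters: element-wise sum.
    return [a + b for a, b in zip(f_slow, f_fast)]

# Synthetic 16-frame, 8x8 video used only to exercise the sketch.
video = [[[float(t + r + c) for c in range(8)] for r in range(8)] for t in range(16)]
joint = slow_fast_features(video, toy_encoder)
```

The only requirement on the encoder in this sketch is that it returns vectors of the same length for both clips, so that summation is well defined.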

	Caron in view of Qian, Xiong, and Ryoo does not explicitly disclose the video is captured by one or more cameras on a vehicle in motion and analyzing the video for a predetermined human motion wherein the predetermined human motion is a potential safety action and informing a vehicle control system of the potential safety action.

	However Hioki teaches of the video is captured by one or more cameras on a vehicle in motion and analyzing the video for a predetermined human motion wherein the predetermined human motion is a potential safety action and informing a vehicle control system of the potential safety action. (The video is captured by the use of camera ‘129A’ and is used to detect obstacles and humans in motion that may cause a safety action, if the motion is considered a safety risk, the safety system  ‘125’ is alerted and can cause the car to brake, indicating that the car is in motion, See Col 4, Lines 66-67, Col 5, Lines 1-9) (See Col 4, Lines 66-67, Col 5, Lines 1-9 “Active safety system 125 detects an obstacle (a pedestrian, a bicycle, a parked vehicle, a utility pole, or the like) in front or in the rear of the vehicle with the use of camera 129A and radar sensors 129B and 129C. Active safety system 125 determines whether or not vehicle 10 may collide with the obstacle based on a distance between vehicle 10 and the obstacle and a direction of movement of vehicle 10. Then, when active safety system 125 determines that there is possibility of collision, it outputs a braking command to brake system 121 through integrated control manager 115 so as to increase braking force of the vehicle.”) 

	As taught by Hioki, the use of recognizing the predetermined action as a potential safety action and informing a vehicle control system allows for the control system to take action such as braking. (The use of the safety system being informed of a potential safety action allows for the safety system to prevent collision by outputting a braking command, See Col 5, Lines 6-9) (See Col 5, Lines 6-9 “Then, when active safety system 125 determines that there is possibility of collision, it outputs a braking command to brake system 121 through integrated control manager 115 so as to increase braking force of the vehicle.”) As both the teachings of Caron in view of Qian, Xiong, and Ryoo and Hioki deal with the technical field of detecting human motion in a video, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, and Ryoo with Hioki to capture video by one or more cameras on a vehicle in motion, analyze the video for a predetermined human motion wherein the predetermined human motion is a potential safety action and inform a vehicle control system of the potential safety action in order to allow for a vehicle control system to take action and prevent collision by outputting a braking command.

	Caron in view of Qian, Xiong, Ryoo and Hioki does not explicitly disclose the cameras being in a vehicle.

	However, Felleisen teaches of the cameras being in the vehicle, (See ¶ 21, 24, “In some aspects of the disclosure, the principles and methods disclosed herein may be performed using one or more in-cabin sensors” … “These one or more in-cabin sensors may include, for instance, one or more cameras (e.g. image sensors, video cameras, depth cameras, etc.),”)

	As taught by Felleisen, in-vehicle cameras have a greater ability to perceive objects. (The use of in-vehicle sensors allows for better perception of objects, as cameras that are not in the vehicle often have limited ability to perceive objects, See ¶ 24) (See ¶ 24, “In many instances, it may be preferable to utilize in-vehicle sensors” … “This may be at least because the geometry of image sensors mounted on a side of the vehicle often have a limited ability to perceive objects (e.g. such as bicycles)”) As both the teachings of Caron in view of Qian, Xiong, Ryoo and Hioki and Felleisen deal with the technical field of determining human action from a video, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Caron in view of Qian, Xiong, Ryoo and Hioki with Felleisen to have the cameras in the car in order to have a greater ability to detect objects.

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See attached PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEJANDRO HERNANDEZ whose telephone number is (703)756-1876. The examiner can normally be reached M-F 8 am - 5 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John M Villecco can be reached on (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ALEJANDRO HERNANDEZ/Examiner, Art Unit 2661
/AARON W CARTER/Primary Examiner, Art Unit 2661
Prosecution Timeline

Nov 21, 2022: Application Filed
Feb 25, 2025: Non-Final Rejection mailed (§103)
Aug 25, 2025: Response Filed
Oct 29, 2025: Final Rejection mailed (§103)
Mar 05, 2026: Response after Non-Final Action
May 08, 2026: Request for Continued Examination
May 09, 2026: Response after Non-Final Action
Precedent Cases

Applications granted by this same examiner with similar technology

18/318,090
Patent 12626355
IMAGE PROCESSING APPARATUS, METHOD OF CONTROLLING THE SAME, AND STORAGE MEDIUM
2y 12m to grant Granted May 12, 2026
18/302,252
Patent 12597147
IMAGE CORRELATION PROCESSING BY ADDITION OF REAGGREGATION
2y 11m to grant Granted Apr 07, 2026
17/855,288
Patent 12573013
REGION-OF-INTEREST (ROI)-BASED IMAGE ENHANCEMENT USING A RESIDUAL NETWORK
3y 8m to grant Granted Mar 10, 2026
18/333,091
Patent 12573169
COMMON VIEW REGION IDENTIFICATION AND SCALE ALIGNMENT FOR FEATURE MATCHING IN IMAGE PAIRS
2y 9m to grant Granted Mar 10, 2026
18/553,578
Patent 12567268
AUTOMATED NANOSCOPY SYSTEM HAVING INTEGRATED ARTIFACT MINIMIZATION MODULES, INCLUDING EMBEDDED NANOMETER POSITION TRACKING BASED ON PHASOR ANALYSIS
2y 5m to grant Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 2-3
Grant Probability: 78%
With Interview (+27.5%): 99%
Median Time to Grant: 2y 10m (~0m remaining)
PTA Risk: Moderate
Based on 41 resolved cases by this examiner. Grant probability derived from career allowance rate.