CONTINUED EXAMINATION UNDER 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant’s submission filed on 06/16/2025 has been entered.
DETAILED OFFICE ACTION
The United States Patent & Trademark Office appreciates the response submitted for the current application on 06/16/2025. The Office has reviewed the submitted documents and provides the following comments below.
Amendment
Applicant submitted amendments on 06/16/2025. The Examiner acknowledges the amendment and has reviewed the claims accordingly.
Applicant Arguments:
Applicant states that the cited prior art does not teach the amended claims, specifically the limitation “clustering groups of keypoint persons in the video frames using contrastive learning and passing the clustering groups through multi-scale prediction, the scales including different granularities of the video frames”; therefore, the rejection under 35 U.S.C. 103 should be withdrawn.
Examiner’s Responses:
Applicant’s arguments and amendments, see Remarks, filed 06/16/2025, with respect to the rejection(s) of claim(s) 1, 8 and 15 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration of the amendments, a new ground of rejection is made in view of
Li et al. (Li, Shuaicheng, et al. "Groupformer: Group activity recognition with clustered spatial-temporal transformer." IEEE, published 08/28/2021, hereinafter, Li) in view of
Braso et al. (Brasó, Guillem et al. "The center of attention: Center-keypoint grouping via attention for multi-person pose estimation." IEEE, published 10/11/2021, hereinafter, Braso), in view of
Wang et al. (Wang, Wenhai, et al. "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions." IEEE, published 08/11/2021, hereinafter, Wang), in view of
Li-Yunfan (Li, Yunfan, et al. "Contrastive clustering." Proceedings of the AAAI Conference on Artificial Intelligence, published 05/18/2021, hereinafter, Li-Yunfan).
Applicant’s arguments and amendments, see Remarks, filed 06/16/2025, with respect to the rejection(s) of claim(s) 5, 12 and 19 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn.
Claim Status
Claims 1-4, 8-11 and 15-18 are rejected under 35 USC § 103 over Li in view of Braso, in view of Wang, in view of Li-Yunfan.
Claims 5-7, 12-14 and 19-20 are objected to.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-4, 8-11 and 15-18 is/are rejected under 35 U.S.C. 103 as being unpatentable over
Li et al. (Li, Shuaicheng, et al. "Groupformer: Group activity recognition with clustered spatial-temporal transformer." IEEE, published 08/28/2021, hereinafter, Li) in view of
Braso et al. (Brasó, Guillem et al. "The center of attention: Center-keypoint grouping via attention for multi-person pose estimation." IEEE, published 10/11/2021, hereinafter, Braso), in view of
Wang et al. (Wang, Wenhai, et al. "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions." IEEE, published 08/11/2021, hereinafter, Wang), in view of
Li-Yunfan (Li, Yunfan, et al. "Contrastive clustering." Proceedings of the AAAI Conference on Artificial Intelligence, published 05/18/2021, hereinafter, Li-Yunfan).
CLAIM 1
In regards to Claim 1, Li teaches a method for compositional reasoning of group activity (Li, Abstract: “To address these issues, we propose a novel group activity recognition network termed GroupFormer.”; Figure 4) in videos with keypoint-only modality (Li, Page 13651, section 3.1: “In addition, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features”), the method comprising: obtaining video frames from a video stream received from a plurality of video image capturing devices (Li, Page 13653, section 4.1 Datasets: “Volleyball Dataset. This dataset [25] contains 55 volleyball videos with 4,830 labeled frames (3493/1337 for training/testing)… Collective Activity Dataset. This dataset [14] contains 2481 activity clips of 44 video sequences captured by handheld cameras in the street and indoor scenes.”); tokenizing the keypoint data (Li, Page 13651, section 3.2: “video frames can be summarized by a set of feature vectors called visual tokens.”) with time (Li, Page 13651, section 3.2: “For scene feature Xg, we view the time dimension as batch dimension”) and segment information (Li, Page 13651, sections 3.1 and 3.2: “, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features…For aligned individual features X0 I, we feed a learned query shaped as T × D and individual features into a decoder to generate an individual token” The Examiner notes that the features of each person in the image frame, which read on the segment information, are obtained and later tokenized); clustering groups of keypoint persons in the video frames (Li, Page 13649, right col, second paragraph: “A clustered attention mechanism is introduced to assign individuals into groups and build inter- and intragroup relations to enrich the global activity context.”;
see sub-image of Li, Figure 2: [media_image1.png])
and performing a prediction to provide a group activity prediction of a scene in the video frames. (Li, Page 13652, left col, section Group Decoder: “It takes the enhanced individual representation XI and group representation XG as input. Motivated by the learned object query proposed by [10], we adopt group representation, termed as group query, to perform group activity context augmenting from individual representation termed as key. Thus, the group query summarize the overall context from augmented individual representation, and group activity prediction is realized by the updated group query.”; Figure 4)
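For clarity of the record, the tokenization mapped above (per-person keypoints combined with time and segment information) may be illustrated by the following hypothetical sketch. All function names, dimensions, and the random embedding tables are illustrative assumptions for explanation only and are not drawn from Li:

```python
import numpy as np

def tokenize_keypoints(keypoints, frame_ids, person_ids, d_model=64, seed=0):
    """Hypothetical sketch: turn per-person keypoints into tokens carrying
    time (frame index) and segment (person index) information, in the spirit
    of the visual-token formulation quoted above from Li, section 3.2.

    keypoints:  (N, J, 2) array of J (x, y) joints for N detections
    frame_ids:  (N,) frame index of each detection (time information)
    person_ids: (N,) person index of each detection (segment information)
    """
    rng = np.random.default_rng(seed)
    N, J, _ = keypoints.shape
    # Linear projection of the flattened joints into the token dimension.
    W = rng.standard_normal((J * 2, d_model)) / np.sqrt(J * 2)
    content = keypoints.reshape(N, J * 2) @ W
    # Stand-ins for learned embedding tables (random here, for illustration).
    time_emb = rng.standard_normal((frame_ids.max() + 1, d_model))
    seg_emb = rng.standard_normal((person_ids.max() + 1, d_model))
    # Each token = keypoint content + time embedding + segment embedding.
    return content + time_emb[frame_ids] + seg_emb[person_ids]

tokens = tokenize_keypoints(
    keypoints=np.zeros((6, 17, 2)),        # 6 detections, 17 COCO-style joints
    frame_ids=np.array([0, 0, 0, 1, 1, 1]),
    person_ids=np.array([0, 1, 2, 0, 1, 2]),
)
assert tokens.shape == (6, 64)
```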
Li does not explicitly teach extracting keypoints of all persons detected in the video frames to define keypoint data.
Braso is in the same field of art of tracking human activity. Further, Braso teaches extracting keypoints of all persons detected in the video frames to define keypoint data (Braso, Page 2, right col, sections Bottom-up methods and One-shot methods; Section 5.2. Keypoint and Center Detection: “We follow HigherHRNet to obtain identity-agnostic keypoint proposals for each of J joint types being considered. HigherHRNet uses an HRNet backbone, followed by two keypoint prediction heads that regress heatmaps at 1/4 and 1/2 of the original image scale for every joint type”).
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li by incorporating the method of extracting keypoints and feeding them to a vision transformer for contextual information that is taught by Braso, to make a group activity recognition method that can obtain contextual information from keypoints; thus, one of ordinary skill in the art would be motivated to combine the references since Braso’s method helps improve the accuracy, speed and efficiency of the detection task (Braso, Page 12, left col, third paragraph: “Regarding performance, we note that the increase in accuracy .... Overall, it outperforms the current state-of-the-art method…and being 2.5x faster, which confirms CenterGroup’s increased efficiency”).
The combination of Li and Braso does not explicitly disclose passing the clustering groups through multi-scale prediction, the scales including different granularities of the video frames.
Wang is in the same field of art of image prediction using vision transformers. Further, Wang teaches passing the clustering groups through multi-scale prediction (Wang, sections 3.1 and 3.2: “Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation)”, FIG. 3), the scales including different granularities of the video frames. (Wang, sections 3.1 and 3.2, FIG. 3, section 5.5, Pyramid Structure: “our model can process high-resolution feature maps in shallow stages and low-resolution feature maps in deep stages”)
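For clarity of the record, multi-scale prediction over different granularities may be illustrated by the following hypothetical sketch, in which the same frame is pooled to several scales and a per-scale head produces a prediction. The pooling factors and random linear heads are illustrative assumptions only and are not Wang's PVT architecture:

```python
import numpy as np

def multi_scale_predictions(frame, scales=(1, 2, 4), n_classes=3, seed=0):
    """Hypothetical sketch: process the frame at several granularities
    (coarser scales via average pooling) and fuse one prediction per scale,
    echoing the pyramid idea quoted above (high-resolution shallow stages,
    low-resolution deep stages)."""
    rng = np.random.default_rng(seed)
    H, W = frame.shape
    preds = []
    for s in scales:
        # Coarser granularity: average-pool the frame by factor s.
        cropped = frame[:H - H % s, :W - W % s]
        pooled = cropped.reshape(H // s, s, W // s, s).mean(axis=(1, 3))
        feat = pooled.reshape(-1)
        # Illustrative random linear head for this scale.
        head = rng.standard_normal((feat.size, n_classes)) / np.sqrt(feat.size)
        preds.append(feat @ head)
    # Fuse the per-scale predictions, e.g. by averaging the logits.
    return np.mean(preds, axis=0)

logits = multi_scale_predictions(np.ones((8, 8)))
assert logits.shape == (3,)
```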
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li and Braso by incorporating the multiscale prediction taught by Wang, to make a group activity recognition method that can process features at multiple scales; thus, one of ordinary skill in the art would be motivated to combine the references since said combination improves the performance of the system (Wang, Page 2, right col, last paragraph: “our PVT with different parameter scales can consistently achieve improved performance compared to the prior arts”).
The combination of Li, Braso and Wang does not explicitly disclose clustering using contrastive learning.
Li-Yunfan is in the same field of art of data clustering in computer vision. Further, Li-Yunfan teaches clustering using contrastive learning. (Li-Yunfan, Sections Instance-level Contrastive Head, Cluster-level Contrastive Head and Objective Function. The Examiner summarizes Li-Yunfan’s teaching: combining the losses of the instance-level and cluster-level contrastive heads to improve clustering:
Instance-level contrastive learning: focuses on learning better representations by pulling augmented views of the same data point closer and pushing different data points apart.
Cluster-level contrastive learning: improves the global structure by ensuring data points in the same cluster remain close, while different clusters are well-separated.)
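For clarity of the record, the two contrastive heads summarized above may be illustrated by the following hypothetical sketch. The simplified InfoNCE-style loss below is a stand-in for Li-Yunfan's actual objective; all names, shapes and the temperature value are illustrative assumptions:

```python
import numpy as np

def info_nce(a, b, tau=0.5):
    """Simplified InfoNCE: rows a[i] and b[i] form a positive pair;
    every other row of b serves as a negative for a[i]."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T / tau                                   # cosine similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # always >= 0

def contrastive_clustering_loss(z1, z2, p1, p2):
    """Instance-level term over row features (two augmented views of the
    same sample pulled together, other samples pushed apart) plus a
    cluster-level term over the COLUMNS of the soft cluster-assignment
    matrices (the same cluster under two views pulled together, different
    clusters pushed apart)."""
    return info_nce(z1, z2) + info_nce(p1.T, p2.T)

rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal((8, 16)), rng.standard_normal((8, 16))  # features
p1, p2 = rng.random((8, 3)), rng.random((8, 3))   # soft assignments, 3 clusters
loss = contrastive_clustering_loss(z1, z2, p1, p2)
assert loss >= 0.0
```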
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li, Braso and Wang by incorporating the clustering framework that is taught by Li-Yunfan, to make a group activity recognition method with an exceptional clustering framework; thus, one of ordinary skill in the art would be motivated to combine the references since Li-Yunfan’s method improves the performance of the clustering task, which consequently benefits many other computer-vision tasks such as pattern recognition and data preprocessing (Li-Yunfan, Abstract: “CC achieves … an up to 19% (39%) performance improvement compared with the best baseline”; Page 7, section Broader Impact).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
CLAIM 2
In regards to Claim 2, the combination of Li, Braso, Wang and Li-Yunfan teaches the method of claim 1. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches tokenizing includes defining a person keypoint token (Li, Page 13651, section 3.1: “In addition, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features”), a person token (Li, Page 13651, section 3.2, see reconstructed text below, red underline), a person-to-person interaction token (Li, Page 13652, left col, Group Decoder: “Summarizing individuals’ interactions in a multi-person scenario is critical to group activity inferring”; See Figure 4 (c) below, white lines represent interactions between individuals in a cluster), a group token (Li, Page 13651, section 3.2, see reconstructed text below, blue underline), a classification (CLS) token (Li, See Figure 4 (c) below, “moving”, “setting”, “waiting”, “standing” are task-related representative
information of a group; see Li, Figure 4(c): [media_image2.png]),
[media_image3.png: reconstructed text of Li, section 3.2, with colored underlines]
and an object keypoint token. (Li, Page 13651, section 3.2, see reconstructed text below, green underline)
CLAIM 3
In regards to Claim 3, the combination of Li, Braso, Wang and Li-Yunfan teaches the method of claim 2. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches the tokens are fed into a multiscale transformer (Wang, section 3.1 and 3.2: “Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation)”, FIG. 3) to perform relational reasoning (Li, Page 13648, Introduction: “The intuitive tactic to recognize group activity is to model relevant relations between individuals and infer their collective activity”) with four transformer encoders. (Wang, Figure 3, with annotation, four encoders are circled
red; see annotated Wang, Figure 3: [media_image4.png]).
CLAIM 4
In regards to Claim 4, the combination of Li, Braso, Wang and Li-Yunfan teaches the method of claim 3. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches each of the four transformer encoders represents a scale (Wang, Figure 3, green circles, see above) to provide attention-based reasoning (Wang, Page 4, section 3.3: “The Transformer encoder in the stage i has Li encoder layers, each of which is composed of an attention layer and a feed-forward layer [64].”; Figure 4) over the tokens at each scale. (Wang, Figure 3, green circles, see figure in rejection of claim 3)
CLAIM 8
In regards to Claim 8, Li teaches a method for compositional reasoning of group activity (Li, Abstract: “To address these issues, we propose a novel group activity recognition network termed GroupFormer.”; Figure 4) in videos with keypoint-only modality (Li, Page 13651, section 3.1: “In addition, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features”), the method comprising: obtaining video frames from a video stream received from a plurality of video image capturing devices (Li, Page 13653, section 4.1 Datasets: “Volleyball Dataset. This dataset [25] contains 55 volleyball videos with 4,830 labeled frames (3493/1337 for training/testing)… Collective Activity Dataset. This dataset [14] contains 2481 activity clips of 44 video sequences captured by handheld cameras in the street and indoor scenes.”); tokenizing the keypoint data (Li, Page 13651, section 3.2: “video frames can be summarized by a set of feature vectors called visual tokens.”) with time (Li, Page 13651, section 3.2: “For scene feature Xg, we view the time dimension as batch dimension”) and segment information (Li, Page 13651, sections 3.1 and 3.2: “, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features…For aligned individual features X0 I, we feed a learned query shaped as T × D and individual features into a decoder to generate an individual token” The Examiner notes that the features of each person in the image frame, which read on the segment information, are obtained and later tokenized); clustering groups of keypoint persons in the video frames (Li, Page 13649, right col, second paragraph: “A clustered attention mechanism is introduced to assign individuals into groups and build inter- and intragroup relations to enrich the global activity context.”;
see sub-image of Li, Figure 2: [media_image1.png])
and performing a prediction to provide a group activity prediction of a scene in the video frames. (Li, Page 13652, left col, section Group Decoder: “It takes the enhanced individual representation XI and group representation XG as input. Motivated by the learned object query proposed by [10], we adopt group representation, termed as group query, to perform group activity context augmenting from individual representation termed as key. Thus, the group query summarize the overall context from augmented individual representation, and group activity prediction is realized by the updated group query.”; Figure 4)
Li does not explicitly teach extracting keypoints of all persons detected in the video frames to define keypoint data.
Braso is in the same field of art of tracking human activity. Further, Braso teaches extracting keypoints of all persons detected in the video frames to define keypoint data (Braso, Page 2, right col, sections Bottom-up methods and One-shot methods; Section 5.2. Keypoint and Center Detection: “We follow HigherHRNet to obtain identity-agnostic keypoint proposals for each of J joint types being considered. HigherHRNet uses an HRNet backbone, followed by two keypoint prediction heads that regress heatmaps at 1/4 and 1/2 of the original image scale for every joint type”).
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li by incorporating the method of extracting keypoints and feeding them to a vision transformer for contextual information that is taught by Braso, to make a group activity recognition method that can obtain contextual information from keypoints; thus, one of ordinary skill in the art would be motivated to combine the references since Braso’s method helps improve the accuracy, speed and efficiency of the detection task (Braso, Page 12, left col, third paragraph: “Regarding performance, we note that the increase in accuracy .... Overall, it outperforms the current state-of-the-art method…and being 2.5x faster, which confirms CenterGroup’s increased efficiency”).
The combination of Li and Braso does not explicitly disclose passing the clustering groups through multi-scale prediction, the scales including different granularities of the video frames.
Wang is in the same field of art of image prediction using vision transformers. Further, Wang teaches passing the clustering groups through multi-scale prediction (Wang, sections 3.1 and 3.2: “Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation)”, FIG. 3), the scales including different granularities of the video frames. (Wang, sections 3.1 and 3.2, FIG. 3, section 5.5, Pyramid Structure: “our model can process high-resolution feature maps in shallow stages and low-resolution feature maps in deep stages”)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li and Braso by incorporating the multiscale prediction taught by Wang, to make a group activity recognition method that can process features at multiple scales; thus, one of ordinary skill in the art would be motivated to combine the references since said combination improves the performance of the system (Wang, Page 2, right col, last paragraph: “our PVT with different parameter scales can consistently achieve improved performance compared to the prior arts”).
The combination of Li, Braso and Wang does not explicitly disclose clustering using contrastive learning.
Li-Yunfan is in the same field of art of data clustering in computer vision. Further, Li-Yunfan teaches clustering using contrastive learning. (Li-Yunfan, Sections Instance-level Contrastive Head, Cluster-level Contrastive Head and Objective Function. The Examiner summarizes Li-Yunfan’s teaching: combining the losses of the instance-level and cluster-level contrastive heads to improve clustering:
Instance-level contrastive learning: focuses on learning better representations by pulling augmented views of the same data point closer and pushing different data points apart.
Cluster-level contrastive learning: improves the global structure by ensuring data points in the same cluster remain close, while different clusters are well-separated.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li, Braso and Wang by incorporating the clustering framework that is taught by Li-Yunfan, to make a group activity recognition method with an exceptional clustering framework; thus, one of ordinary skill in the art would be motivated to combine the references since Li-Yunfan’s method improves the performance of the clustering task, which consequently benefits many other computer-vision tasks such as pattern recognition and data preprocessing (Li-Yunfan, Abstract: “CC achieves … an up to 19% (39%) performance improvement compared with the best baseline”; Page 7, section Broader Impact).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
CLAIM 9
In regards to Claim 9, the combination of Li, Braso, Wang and Li-Yunfan teaches the medium of claim 8. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches tokenizing includes defining a person keypoint token (Li, Page 13651, section 3.1: “In addition, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features”), a person token (Li, Page 13651, section 3.2, see reconstructed text below, red underline), a person-to-person interaction token (Li, Page 13652, left col, Group Decoder: “Summarizing individuals’ interactions in a multi-person scenario is critical to group activity inferring”; See Figure 4 (c) below, white lines represent interactions between individuals in a cluster), a group token (Li, Page 13651, section 3.2, see reconstructed text below, blue underline), a classification (CLS) token (Li, See Figure 4 (c) below, “moving”, “setting”, “waiting”, “standing” are task-related representative
information of a group; see Li, Figure 4(c): [media_image2.png]),
[media_image3.png: reconstructed text of Li, section 3.2, with colored underlines]
and an object keypoint token. (Li, Page 13651, section 3.2, see reconstructed text below, green underline)
CLAIM 10
In regards to Claim 10, the combination of Li, Braso, Wang and Li-Yunfan teaches the medium of claim 9. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches the tokens are fed into a multiscale transformer (Wang, sections 3.1 and 3.2: “Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation)”, FIG. 3) to perform relational reasoning (Li, Page 13648, Introduction: “The intuitive tactic to recognize group activity is to model relevant relations between individuals and infer their collective activity”) with four transformer encoders. (Wang, Figure 3, with annotation, four encoders are circled red; see annotated Wang, Figure 3: [media_image4.png].)
CLAIM 11
In regards to Claim 11, the combination of Li, Braso, Wang and Li-Yunfan teaches the medium of claim 10. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches each of the four transformer encoders represents a scale (Wang, Figure 3, green circles, see above) to provide attention-based reasoning (Wang, Page 4, section 3.3: “The Transformer encoder in the stage i has Li encoder layers, each of which is composed of an attention layer and a feed-forward layer [64].”; Figure 4) over the tokens at each scale. (Wang, Figure 3, green circles, see figure in rejection of claim 10)
CLAIM 15
In regards to Claim 15, Li teaches a system for compositional reasoning of group activity (Li, Abstract: “To address these issues, we propose a novel group activity recognition network termed GroupFormer.”; Figure 4) in videos with keypoint-only modality (Li, Page 13651, section 3.1: “In addition, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features”), the system comprising: obtaining video frames from a video stream received from a plurality of video image capturing devices (Li, Page 13653, section 4.1 Datasets: “Volleyball Dataset. This dataset [25] contains 55 volleyball videos with 4,830 labeled frames (3493/1337 for training/testing)… Collective Activity Dataset. This dataset [14] contains 2481 activity clips of 44 video sequences captured by handheld cameras in the street and indoor scenes.”); tokenizing the keypoint data (Li, Page 13651, section 3.2: “video frames can be summarized by a set of feature vectors called visual tokens.”) with time (Li, Page 13651, section 3.2: “For scene feature Xg, we view the time dimension as batch dimension”) and segment information (Li, Page 13651, sections 3.1 and 3.2: “, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features…For aligned individual features X0 I, we feed a learned query shaped as T × D and individual features into a decoder to generate an individual token” The Examiner notes that the features of each person in the image frame, which read on the segment information, are obtained and later tokenized); clustering groups of keypoint persons in the video frames (Li, Page 13649, right col, second paragraph: “A clustered attention mechanism is introduced to assign individuals into groups and build inter- and intragroup relations to enrich the global activity context.”;
see sub-image of Li, Figure 2: [media_image1.png])
and performing a prediction to provide a group activity prediction of a scene in the video frames. (Li, Page 13652, left col, section Group Decoder: “It takes the enhanced individual representation XI and group representation XG as input. Motivated by the learned object query proposed by [10], we adopt group representation, termed as group query, to perform group activity context augmenting from individual representation termed as key. Thus, the group query summarize the overall context from augmented individual representation, and group activity prediction is realized by the updated group query.”; Figure 4)
Li does not explicitly teach extracting keypoints of all persons detected in the video frames to define keypoint data.
Braso is in the same field of art of tracking human activity. Further, Braso teaches extracting keypoints of all persons detected in the video frames to define keypoint data (Braso, Page 2, right col, sections Bottom-up methods and One-shot methods; Section 5.2. Keypoint and Center Detection: “We follow HigherHRNet to obtain identity-agnostic keypoint proposals for each of J joint types being considered. HigherHRNet uses an HRNet backbone, followed by two keypoint prediction heads that regress heatmaps at 1/4 and 1/2 of the original image scale for every joint type”).
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li by incorporating the method of extracting keypoints and feeding them to a vision transformer for contextual information that is taught by Braso, to make a group activity recognition method that can obtain contextual information from keypoints; thus, one of ordinary skill in the art would be motivated to combine the references since Braso’s method helps improve the accuracy, speed and efficiency of the detection task (Braso, Page 12, left col, third paragraph: “Regarding performance, we note that the increase in accuracy .... Overall, it outperforms the current state-of-the-art method…and being 2.5x faster, which confirms CenterGroup’s increased efficiency”).
The combination of Li and Braso does not explicitly disclose passing the clustering groups through multi-scale prediction, the scales including different granularities of the video frames.
Wang is in the same field of art of image prediction using vision transformers. Further, Wang teaches passing the clustering groups through multi-scale prediction (Wang, sections 3.1 and 3.2: “Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation)”, FIG. 3), the scales including different granularities of the video frames. (Wang, sections 3.1 and 3.2, FIG. 3, section 5.5, Pyramid Structure: “our model can process high-resolution feature maps in shallow stages and low-resolution feature maps in deep stages”)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li and Braso by incorporating the multiscale prediction taught by Wang, to make a group activity recognition method that can process features at multiple scales; thus, one of ordinary skill in the art would be motivated to combine the references since said combination improves the performance of the system (Wang, Page 2, right col, last paragraph: “our PVT with different parameter scales can consistently achieve improved performance compared to the prior arts”).
The combination of Li, Braso and Wang does not explicitly disclose clustering using contrastive learning.
Li-Yunfan is in the same field of art of data clustering in computer vision. Further, Li-Yunfan teaches clustering using contrastive learning. (Li-Yunfan, Sections Instance-level Contrastive Head, Cluster-level Contrastive Head and Objective Function. The Examiner summarizes Li-Yunfan’s teaching: combining the losses of the instance-level and cluster-level contrastive heads to improve clustering:
Instance-level contrastive learning: focuses on learning better representations by pulling augmented views of the same data point closer and pushing different data points apart.
Cluster-level contrastive learning: improves the global structure by ensuring data points in the same cluster remain close, while different clusters are well-separated.)
Therefore, it would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Li, Braso and Wang by incorporating the clustering framework that is taught by Li-Yunfan, to make a group activity recognition method with an exceptional clustering framework; thus, one of ordinary skill in the art would be motivated to combine the references since Li-Yunfan’s method improves the performance of the clustering task, which consequently benefits many other computer-vision tasks such as pattern recognition and data preprocessing (Li-Yunfan, Abstract: “CC achieves … an up to 19% (39%) performance improvement compared with the best baseline”; Page 7, section Broader Impact).
Thus, the claimed subject matter would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention.
CLAIM 16
Regarding Claim 16, the combination of Li, Braso, Wang and Li-Yunfan teaches the system of claim 15. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches that tokenizing includes defining a person keypoint token (Li, Page 13651, section 3.1: “In addition, the pose information of each actor is obtained by AlphaPose [19] and concatenated with the above individual features to provide the final individual features”), a person token (Li, Page 13651, section 3.2, see reconstructed text below, red underline), a person-to-person interaction token (Li, Page 5, left col, Group Decoder: “Summarizing individuals’ interactions in a multi-person scenario is critical to group activity inferring”; see Figure 4(c) below, where white lines represent interactions between individuals in a cluster), a group token (Li, Page 13651, section 3.2, see reconstructed text below, blue underline), a classification (CLS) token (Li, see Figure 4(c) below; “moving”, “setting”, “waiting”, “standing” are task-related representative
[media_image2.png: greyscale annotated reproduction of Li, Figure 4(c)]
information of a group),
[media_image3.png: greyscale reconstructed text from Li, Page 13651, section 3.2]
and an object keypoint token. (Li, Page 13651, section 3.2, see reconstructed text below, green underline)
CLAIM 17
Regarding Claim 17, the combination of Li, Braso, Wang and Li-Yunfan teaches the system of claim 16. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches that the tokens are fed into a multiscale transformer (Wang, sections 3.1 and 3.2: “Our goal is to introduce the pyramid structure into the Transformer framework, so that it can generate multi-scale feature maps for dense prediction tasks (e.g., object detection and semantic segmentation)”, FIG. 3) to perform relational reasoning (Li, Page 13648, Introduction: “The intuitive tactic to recognize group activity is to model relevant relations between individuals and infer their collective activity”) with four transformer encoders. (Wang, Figure 3, with annotation; the four encoders are circled in
[media_image4.png: greyscale annotated reproduction of Wang, Figure 3]
red, see below).
CLAIM 18
Regarding Claim 18, the combination of Li, Braso, Wang and Li-Yunfan teaches the system of claim 17. In addition, the combination of Li, Braso, Wang and Li-Yunfan teaches that each of the four transformer encoders represents a scale (Wang, Figure 3, green circles, see above) to provide attention-based reasoning (Wang, Page 4, section 3.3: “The Transformer encoder in the stage i has Li encoder layers, each of which is composed of an attention layer and a feed-forward layer [64].”; Figure 4) over the tokens at each scale. (Wang, Figure 3, green circles; see the figure in the rejection of claim 17)
Allowable Subject Matter
Claims 5, 12 and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The closest prior art references for Claims 5, 12 and 19 are:
Wang et al. (Wang, Wenhai, et al. "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions." IEEE, published 08/11/2021, hereinafter, Wang). Wang teaches a Pyramid Vision Transformer (PVT), which is a combination of the pyramid structure from CNNs and Vision Transformer, to be used as a versatile backbone for many computer vision tasks such as object detection (DET), instance and semantic segmentation (SEG).
Li-Yunfan et al. (Li-Yunfan, et al. "Contrastive clustering." Proceedings of the AAAI conference on artificial intelligence, published 05/18/2021, hereinafter, Li-Yunfan). Li-Yunfan teaches an online clustering method called Contrastive Clustering (CC) which explicitly performs the instance- and cluster-level contrastive learning. To be specific, for a given dataset, the positive and negative instance pairs are constructed through data augmentations and then projected into a feature space. Therein, the instance- and cluster-level contrastive learning are respectively conducted in the row and column space by maximizing the similarities of positive pairs while minimizing those of negative ones.
While both Wang and Li-Yunfan teach image analysis techniques for semantic segmentation and instance segmentation, neither Wang nor Li-Yunfan, nor the combination thereof, teaches “a cluster assignment of each scale is predicted, by a swapped prediction component, from a representation of another scale to capture an agreement of common semantic information hidden across the scales.” (emphasis added)
Claims 6-7, 13-14 and 20 are also objected to due to their dependence on objected claims 5, 12, and 19, respectively.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NHUT HUY (JEREMY) PHAM whose telephone number is (703)756-5797. The examiner can normally be reached Monday - Friday, 8:30am - 6pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, O'Neal Mistry can be reached on (313)446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NHUT HUY PHAM/Examiner, Art Unit 2674
/ONEAL R MISTRY/Supervisory Patent Examiner, Art Unit 2674