Last updated: May 29, 2026
Application No. 17/752,614
TRAINING A NEURAL NETWORK FOR ACTION RECOGNITION

Non-Final OA §103§112
Filed
May 24, 2022
Priority
Jun 11, 2021 — SE 2150749-6
Examiner
BAKER, EZRA JAMES
Art Unit
2126
Tech Center
2100 — Computer Architecture & Software
Assignee
Sony Group Corporation
OA Round
3 (Non-Final)
Office Action

§103 §112
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 02/02/2026 has been entered.
 
Status of Claims
	The present application is being examined under the claims filed 01/06/2026.
	Claims 1 and 3-20 are pending.

Response to Amendment
	This Office Action is in response to applicant’s communication filed 02/02/2026 in response to office action mailed 11/06/2025. The applicant’s remarks and any amendments to the claims or specification have been considered with the results that follow.


Response to Arguments
In Remarks page 9, Argument 1
(Examiner summarizes Applicant’s arguments) Applicant argues that claim amendments comply with 35 U.S.C. 112(b) requirements.
Examiner’s response to Argument 1
	One of the rejections of claim 4 under 35 U.S.C. 112(b) has been withdrawn; however, the rest remain.

In Remarks pages 9-11, Argument 2
(Examiner summarizes Applicant’s arguments) Applicant argues with particular rationale that the rejections under 35 U.S.C. 101 should be withdrawn in view of the claim amendments.
Examiner’s response to Argument 2
	The rejections under 35 U.S.C. 101 have been withdrawn thus rendering Applicant’s arguments moot.

In Remarks pages 13-14, Argument 3
(Examiner summarizes Applicant’s arguments) Applicant argues that Nanni does not teach the limitation “include both of the corresponding first and second augmented versions in both of the first and second input data such that the first and second untrained networks operate concurrently on the corresponding first and second augmented versions” because Nanni teaches utilizing multiple augmentation approaches separately in isolation and does not teach utilizing augmentations in conjunction with each other.
Examiner’s response to Argument 3
	Examiner no longer relies on Nanni, rendering Applicant’s arguments moot. However, Examiner notes that the broadest reasonable interpretation of the limitation includes processing augmented sequences independently of one another. The specification supports this interpretation. New art is applied accordingly.
(page 9) “Reverting to FIG. 3, it is to be noted that MAS1, MAS2 may be included in I1, I2 so as to be provided concurrently to NN1 and NN2. Thereby, NN1 and NN2 will jointly and concurrently operate on the pair of augmented action sequences, (MAS1, MAS2), and generate corresponding first and second representation data for processing by the first updating module 13. Further, as noted above, MAS1 and MAS2 are included in both I1 and I2, which means that NN1 will operate on MAS1 while NN2 operates on MAS2, and NN1 operates on MAS2 while NN2 operates on MAS1.”

    [Image: media_image1.png]


Claim Objections
Claim 17 is objected to because of the following informalities: “computer-readable instruction for a fourth neural network and a fourth updating module.” should read “computer-readable instruction for a fourth neural network and a fourth updating module,[[.]]” to replace the period with a comma.  Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
	The following appear to be the closest portions of the specification corresponding to the 35 U.S.C. 112(f) invocations:
updating module
(page 7 line 18) A first updating module 13 is arranged to receive first representation data from NN1 and second representation data from NN2. The first and second representation data may be in any format and are generated to represent I1 and I2, respectively. The first updating module 13 is configured to update control parameters of NN1 to minimize a difference between the first representation data and the second representation data. The module 13 may implement any conventional updating function, for example backpropagation, by use of any suitable classification loss function, including but not limited to cross-entropy loss, log loss, hinge loss, square loss, or variants or derivatives thereof. The backpropagation may include any suitable stochastic or non-stochastic optimization algorithm, including but not limited to gradient descent, nonlinear conjugate gradient, limited-memory BFGS, Levenberg-Marquardt algorithm, etc. A second updating module 14 is configured to update control parameters of NN2 as a function of the control parameters of NN1. In some embodiments, the second updating module 14 updates NN2 whenever the first updating module 13 has updated NN1. By the second updating module 14, NN2 is "bootstrapped" to NN2. In the pre-training system 1A, NN1 may be seen as an online neural network, and NN2 may be seen as a target network. In some embodiments, NN1 and NN2 share the same network architecture, at least in part, so that there is one-to-one correspondence between the control parameters of NN2 that are updated by module 14 and control parameters in NN1. In some embodiments, the module 14 may be configured to replace the control parameters in NN2 by the control parameters in NN1. In some embodiments, to stabilize the bootstrapping, module 14 generates the value of the respective control parameter in NN2 based on a temporal aggregation of values of the corresponding control parameter in NN1.
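For illustration only, the two updating functions described in the passage above could be sketched as follows. This is a minimal sketch and is not taken from the specification: it assumes parameters stored as flat lists of floats, a plain gradient-descent step for NN1, and an exponential moving average as one possible form of "temporal aggregation" for NN2. The names `update_online`, `update_target`, `lr`, and `tau` are hypothetical.

```python
def update_online(online, grad, lr=0.1):
    """First updating module (sketch): one gradient-descent step on NN1."""
    return [p - lr * g for p, g in zip(online, grad)]


def update_target(target, online, tau=0.99):
    """Second updating module (sketch): NN2 as a function of NN1's parameters.

    tau = 0 replaces NN2's parameters with NN1's; 0 < tau < 1 yields an
    exponential moving average, assumed here as the temporal aggregation.
    """
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


online = [1.0, -2.0]   # toy parameters of NN1
target = [0.0, 0.0]    # toy parameters of NN2
online = update_online(online, grad=[0.5, -0.5], lr=0.1)  # NN1 is updated first
target = update_target(target, online, tau=0.9)           # NN2 then follows NN1
```

With `tau=0.0` the second function reduces to copying NN1 into NN2, which corresponds to the "replace the control parameters" embodiment quoted above.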
augmentation module
(page 8) The system 1A in FIG. 3 further comprises an augmentation module 20, which is configured to operate on action sequences retrieved from the database 10 to generate I1 and I2. Data augmentation is a well-known concept in the field of neural networks and involves imparting selected modifications to the input data of a neural network during training to improve the ability of the neural network, when trained, to handle variations in the input data. The augmentation module 20 in FIG. 3 is configured to operate on pairs of action sequences to generate pairs of augmented action sequences and include the respective pair of augmented action sequences in both I1 and I2.
sub-module
(Specification pages 8-9) The augmentation module 20 in FIG. 3 is configured to operate on pairs of action sequences to generate pairs of augmented action sequences and include the respective pair of augmented action sequences in both I1 and I2. In the illustrated example, the augmentation module 20 comprises a first sub-module 21 which is configured to generate a first augmented action sequence based on a first action sequence in each pair, and second sub-module 22 which is configured to generate a second augmented action sequence based on a second action sequence in each pair. An output sub-module 23 is arranged to include the first and second augmented action sequences in I1 and I2.
[…]
FIG. 4 is a block diagram of sub-modules 21 and 22 in accordance with an example. Sub-module 21 is operable to apply a set of first augmentation functions on an incoming action sequence, AS1, to generate an augmented or modified action sequence, MAS1. The first augmentation functions are represented as F11, ..., F1m in FIG. 4. Sub-module 22 is operable to apply a set of second augmentation functions on an incoming action sequence, AS2, to generate an augmented or modified action sequence, MAS2. The second augmentation functions are represented as F21, F22, F23, ..., F2n in FIG. 4.
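For illustration only, the data flow through sub-modules 21, 22, and 23 described in the passages above might be sketched as below. The concrete augmentation functions (`scale`, `mirror_x`, `reverse_time`) are hypothetical stand-ins for the sets of first and second augmentation functions; the specification excerpts quoted here do not name them. A pose is assumed to be a list of (x, y, z) keypoints and an action sequence a list of poses.

```python
def scale(seq, factor):
    """Hypothetical augmentation: scale every keypoint coordinate."""
    return [[(x * factor, y * factor, z * factor) for (x, y, z) in pose]
            for pose in seq]


def mirror_x(seq):
    """Hypothetical augmentation: mirror each pose about the plane x = 0."""
    return [[(-x, y, z) for (x, y, z) in pose] for pose in seq]


def reverse_time(seq):
    """Hypothetical augmentation: play the sequence backwards."""
    return list(reversed(seq))


def apply_all(seq, functions):
    """Apply a set of augmentation functions one after another."""
    for f in functions:
        seq = f(seq)
    return seq


SUB_MODULE_21 = [lambda s: scale(s, 2.0)]  # stand-in for the first function set
SUB_MODULE_22 = [mirror_x, reverse_time]   # stand-in for the second function set


def augmentation_module(as1, as2):
    mas1 = apply_all(as1, SUB_MODULE_21)
    mas2 = apply_all(as2, SUB_MODULE_22)
    # Output sub-module 23 (sketch): both augmented versions go into both inputs,
    # so NN1 sees MAS1 while NN2 sees MAS2, and then the roles are swapped.
    i1 = [mas1, mas2]
    i2 = [mas2, mas1]
    return i1, i2


as1 = [[(1.0, 2.0, 3.0)], [(4.0, 5.0, 6.0)]]  # two frames, one keypoint each
i1, i2 = augmentation_module(as1, as1)         # the same sequence used as AS1 and AS2
```

The pairing of entries in `i1` and `i2` is the point of the sketch: at each position the two networks receive different augmented versions of the pair.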

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.



Claims 1 and 3-18 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.

Regarding 35 U.S.C. 112(f) invocations
The following claim limitations invoke 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
“execute the first updating module to update parameters of the first neural network” as recited in claim 1
“execute the second updating module to update parameters of the second neural network” as recited in claim 1
“execute the augmentation module to retrieve a plurality of corresponding first and second unlabeled action sequences” as recited in claim 1
Further recitations of the term “augmentation module” in claims 2 and 14
“execute the first sub-module to generate a first augmented version based on a respective first unlabeled action sequence” as recited in claim 1
Further recitations of the term “first sub-module” in claims 3-4, 13, and 16
“and execute the second sub-module to generate a second augmented version based on a respective second unlabeled action sequence, wherein the second sub-module differs from the first sub-module” as recited in claim 1
Further recitations of the term “second sub-module” in claims 3-4, 6-8, and 10-12
“execute the third updating module to update parameters of the third network to minimize a difference between the third representation data and activity label data associated with the third input data” and “by the third updating module, train the third network” as recited in claim 15
“execute the further augmentation module to retrieve third action sequences of one or more objects performing one or more activities” and “wherein the further augmentation module is configured in correspondence” as recited in claim 16
“execute the fourth updating module to update parameters of the fourth network” as recited in claim 17
However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. MPEP 2181 II. B. recites “However, if there is no corresponding structure disclosed in the specification (i.e., the limitation is only supported by software and does not correspond to an algorithm and the computer or microprocessor programmed with the algorithm), the limitation should be deemed indefinite as discussed above, and the claim should be rejected under 35 U.S.C. 112(b)  or pre-AIA  35 U.S.C. 112, second paragraph.” The portions of the specification identified above do not clearly link the claim language to a “computer or microprocessor programmed with the algorithm” for any of the modules recited in the claims.

 Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA  35 U.S.C. 112, second paragraph.
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Regarding Dependent Claims
Claims
2-18 are dependent upon claim 1
6-13 are dependent upon claim 5
9 is dependent upon claim 8
16-18 are dependent upon claim 15
are therefore similarly rejected for including the deficiencies of claims 1, 5, 8, 15, and 20 respectively.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 3, 5, 8-9, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over NPL reference Grill et al. “Bootstrap your own latent: A new approach to self-supervised Learning” herein referred to as Grill in view of NPL reference Xu et al. “Ensemble One-Dimensional Convolution Neural Networks for Skeleton-Based Action Recognition” herein referred to as Xu, Adama et al. “Adaptive Segmentation and Sequence Learning of human activities from skeleton data” herein referred to as Adama, and Wu et al. “MULTI-TEACHER KNOWLEDGE DISTILLATION FOR COMPRESSED VIDEO ACTION RECOGNITION ON DEEP NEURAL NETWORKS” herein referred to as Wu.

Regarding Claim 1
Grill teaches:
a first untrained neural network; a second untrained neural network;
(page 5 last paragraph) “We assess the performance of BYOL’s representation after self-supervised pretraining[*Examiner notes: untrained neural networks] on the training set of the ImageNet ILSVRC-2012 dataset [21].”; [*Examiner notes: The neural networks undergo pre-training, and are untrained prior to pre-training.] (Figure 2)

    [Image: media_image2.png (Grill, Figure 2)]


a first updating module
(page 4 above equation 3) “We symmetrize the loss Lθ,ξ in Eq. 2 by separately feeding v’ to the online network and v to the target network to compute L̃θ,ξ.”

a second updating module; and
[*Examiner notes: See annotations of equation 1 below.  θ are the parameters of the first neural network and ξ are the parameters of the second neural network]

    [Image: media_image3.png (annotated Equation 1 of Grill)]


wherein the memory storing further computer-executable instructions that, when executed by the processor, configure the processor to: execute the first untrained neural network on first input data to generate first representation data; execute the second untrained neural network on second input data to generate second representation data;
(page 1 abstract) “BYOL relies on two neural networks, referred to as online[*Examiner notes: first neural network] and target networks[*Examiner notes: second neural network], that interact and learn from each other.”; (page 3 last sentence) “BYOL produces two augmented views v ≜ t(x)[*Examiner notes: first input data] and v’ ≜ t’(x)[*Examiner notes: second input data] from x by applying respectively image augmentations t ∼ T and t’ ∼ T’.”; (page 4 first paragraph) “From the first augmented view v, the online network outputs a representation yθ ≜ fθ(v) and a projection zθ ≜ gθ(y). The target network outputs y’ξ ≜ fξ(v’) and the target projection z’ξ ≜ gξ(y’) from the second augmented view v’. We then output a prediction qθ(zθ) of z’ξ and l2-normalize both qθ(zθ) and z’ξ to q̄θ(zθ) ≜ qθ(zθ)/||qθ(zθ)||2[*Examiner notes: mapped to first representation] and z̄’ξ ≜ z’ξ/||z’ξ||2[*Examiner notes: mapped to second representation].”; [*Examiner notes: See figure 2 annotated below]

    [Image: media_image4.png (annotated Figure 2 of Grill)]



execute the first updating module to update parameters of the first untrained neural network to minimize a difference between the first representation data and the second representation data;
(page 4 paragraph 1) “From the first augmented view v, the online network outputs a representation yθ ≜ fθ(v) and a projection zθ ≜ gθ(y). The target network outputs y’ξ ≜ fξ(v’) and the target projection z’ξ ≜ gξ(y’) from the second augmented view v’. We then output a prediction qθ(zθ) of z’ξ and l2-normalize both qθ(zθ) and z’ξ to q̄θ(zθ) ≜ qθ(zθ)/||qθ(zθ)||2 and z̄’ξ ≜ z’ξ/||z’ξ||2. Note that this predictor is only applied to the online branch, making the architecture asymmetric between the online and target pipeline. Finally we define the following mean squared error between the normalized predictions and target projections”; (page 4 above equation 3) “We symmetrize the loss Lθ,ξ in Eq. 2 by separately feeding v’ to the online network and v to the target network to compute L̃θ,ξ. At each training step, we perform a stochastic optimization step to minimize LBYOLθ,ξ = Lθ,ξ + L̃θ,ξ with respect to θ only[*Examiner notes: update parameters of first neural network to minimize difference], but not ξ, as depicted by the stop-gradient in Figure 2. BYOL’s dynamics are summarized as”; [*Examiner note: see equations 2 and 3 annotated below]

    [Images: media_image6.png and media_image7.png (annotated Equations 2 and 3 of Grill)]
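For illustration only, the loss described in the quoted Grill passage (a squared error between l2-normalized predictions and target projections, symmetrized by swapping the two views) could be sketched as follows. The annotated equations themselves are images not reproduced in this text, so this is a reading of the quoted prose under stated assumptions: vectors are plain Python lists, and the function and argument names are hypothetical. The optimization step with respect to θ only is not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def byol_term(prediction, target_projection):
    """Squared error between the normalized prediction and normalized target.

    For unit vectors this equals 2 - 2 * cosine similarity.
    """
    q = l2_normalize(prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(q, z))


def symmetrized_loss(q_v, z_v_prime, q_v_prime, z_v):
    """Sum of the term for (v online, v' target) and the swapped term."""
    return byol_term(q_v, z_v_prime) + byol_term(q_v_prime, z_v)


aligned = byol_term([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = byol_term([1.0, 0.0], [0.0, 1.0])   # perpendicular directions
total = symmetrized_loss([1.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

Because both vectors are normalized first, only the direction of the prediction relative to the target affects the loss, not their magnitudes.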



execute the second updating module to update parameters of the second untrained neural network as a function of the parameters of the first untrained neural network; and 
[*Examiner notes: See annotations of equation 1 below.  θ are the parameters of the first neural network and ξ are the parameters of the second neural network]

    [Image: media_image3.png (annotated Equation 1 of Grill)]

and subsequently provide at least a subset of the parameters of the first untrained neural network, after execution on the one or more instances of the first and second input data, as a parameter definition of a pre-trained neural network.
(page 4 below equation 1) “At the end of training, we only keep the encoder fθ[*Examiner notes: provide parameters of first neural network as parameter definition of pre-trained neural network]; as in [9]”; (page 6 paragraph 2) “We first evaluate BYOL’s representation by training a linear classifier on top of the frozen representation, following the procedure described in [48, 74, 41, 10, 8], and appendix C.1; we report top-1 and top-5 accuracies in % on the test set in Table 1.”; [*Examiner notes: The broadest reasonable interpretation of a pre-trained neural network includes a neural network which is already trained. Therefore, since fθ is already trained, it is a pre-trained neural network]

Grill does not explicitly teach:
A system for pre-training of a neural network from only unlabeled data, said system comprising a processor coupled to a memory storing computer-executable instructions for: 
an augmentation module having a first sub-module and a second sub-module, 
execute the augmentation module to: retrieve a plurality of corresponding first and second unlabeled action sequences, each including a time sequence of poses being a time sequence of object representations depicting a respective object performing a respective activity where each object representation provides locations of predefined features of the object, and to generate the first and second input data to include augmented versions of the first and second unlabeled action sequences; and
include both of the corresponding first and second augmented versions in both of the first and second input data such that the first and second untrained networks operate concurrently on the corresponding first and second augmented versions,
wherein the processor is further configured to execute the first sub-module to generate a first augmented version based on a respective first unlabeled action sequence, and execute the second sub-module to generate a second augmented version based on a respective second unlabeled action sequence, wherein the second sub-module differs from the first sub-module, 
and wherein the processor is further configured to execute the first and second untrained neural networks on one or more instances of the first and second input data, generated by execution of the augmentation module

However, Xu teaches:
an augmentation module having a first sub-module and a second sub-module, 
(page 1045 column 2 last two paragraphs) “We design a Body-part Net to extract the features of different body parts. The human body can be divided into five parts naturally[*Examiner notes: augmentation module having first and second sub-module], including two arms, two legs, and a trunk [22] . Some actions are performed by few body parts. For instance, only one or two arms participate in the action of waving hands, and the rest body parts are stationary.”; (Fig. 3)

    [Image: media_image8.png (Xu, Fig. 3)]



execute the augmentation module to: retrieve a plurality of corresponding first and second […] action sequences, each including a time sequence of poses being a time sequence of object representations depicting a respective object performing a respective activity where each object representation provides locations of predefined features of the object, and to generate the first and second input data to include augmented versions of the first and second […] action sequences; and
(page 1045 column 2 paragraph 2) “As same as RGB videos, skeleton video sequences also have both spatial and temporal information[*Examiner notes: time sequence of poses]. The spatial information represents the interaction between joints, and the temporal information records the dynamic changes of motions.”; (page 1047 column 1 paragraph 2) “NTU RGB+D Dataset. This dataset is the largest skeleton-based human action dataset captured by three Kinect V2 cameras, including more than 56 000 sequences in 60 classes of actions performed by 40 subjects[*Examiner notes: depicting a respective activity].”; (page 1045 column 2 last paragraph) “The Body-part Net consists of five Base-Nets and SoftMax layers as shown in Fig. 3. The sequence data of five body parts are fed[*Examiner notes: generate input data including augmented versions of action sequences] into five Base-Nets to model the motions of every body parts. And the scores produced by the SoftMax layers are fused based on (3), where Q equals 5 in this subnet.”
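For illustration only, the body-part split that the quoted Xu passage describes (five body parts, each fed to its own Base-Net) might look like the sketch below. The joint indices assigned to each part are hypothetical and assume a 15-joint skeleton; real datasets use their own joint layouts. The Base-Nets and the score fusion of Xu's equation (3) are not reproduced here.

```python
# Hypothetical joint indices for a 15-joint skeleton; real datasets differ.
BODY_PARTS = {
    "left_arm": [0, 1, 2],
    "right_arm": [3, 4, 5],
    "left_leg": [6, 7, 8],
    "right_leg": [9, 10, 11],
    "trunk": [12, 13, 14],
}


def split_by_body_part(sequence):
    """Return one sub-sequence per body part, preserving frame order."""
    return {
        name: [[pose[j] for j in joints] for pose in sequence]
        for name, joints in BODY_PARTS.items()
    }


# Two frames of a 15-joint skeleton; joint j in frame t sits at (j, t, 0.0).
sequence = [[(float(j), float(t), 0.0) for j in range(15)] for t in range(2)]
parts = split_by_body_part(sequence)  # five sequences, one per network input
```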

include both of the corresponding first and second augmented versions in both of the first and second input data such that the first and second untrained networks operate concurrently on the corresponding first and second augmented versions,
(page 1045 column 2 last paragraph) “The Body-part Net consists of five Base-Nets[*Examiner notes: neural networks] and SoftMax layers as shown in Fig. 3. The sequence data of five body parts are fed into five Base-Nets to model the motions of every body parts[*Examiner notes: operating concurrently]. And the scores produced by the SoftMax layers are fused based on (3), where Q equals 5 in this subnet.”; (page 1047) “For the CV protocol, videos of two view-points are used for training and the rest one for testing. For the CS protocol, 20 subjects are used for training and the rest for testing.”; [*Examiner notes: Before the neural network is trained, it is an untrained model. The broadest reasonable interpretation of “concurrently” includes neural networks operating side-by-side in an ensemble manner (see figure 3 annotated below)]; (figure 3)

    [Image: media_image9.png (annotated Figure 3 of Xu)]


wherein the processor is further configured to execute the first sub-module to generate a first augmented version based on a respective first […] action sequence, and execute the second sub-module to generate a second augmented version based on a respective second […] action sequence, wherein the second sub-module differs from the first sub-module, 
[*Examiner notes: According to Applicant’s specification, the first action sequence and second action sequence may be identical (Specification page 10) “In some embodiments, the action sequences AS1, AS2 in at least some of the pairs are identical. For example, the augmentation module 20 may retrieve a single action sequence from the database 10 and duplicate it to form AS1, AS2”. Therefore, the broadest reasonable interpretation of “a respective first action sequence” and “a respective second action sequence” includes exactly the same action sequence]; (Figure 3)

    [Image: media_image10.png (Figure 3)]


and wherein the processor is further configured to execute the first and second untrained neural networks on one or more instances of the first and second input data, generated by execution of the augmentation module
(page 1045 column 2 last paragraph) “The sequence data of five body parts[*Examiner notes: execution of the augmentation module] are fed into five Base-Nets to model the motions of every body parts[*Examiner notes: execute the first and second neural networks on one or more instances of input data]. And the scores produced by the SoftMax layers are fused based on (3), where Q equals 5 in this subnet.”

	Grill, Xu, and the instant application are analogous because they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill with the keypoint action sequences, augmentation, and neural network input process taught by Xu because (Xu page 1044 abstract) “Experimental results show that the proposed Ensem-NN performs better than state-of-the-art methods on three widely used datasets”, (Xu page 1047 column 2 last paragraph) “We design four subnets based on a Base-Net (1-D CNN with residual structure) to extract diverse and complementary features”, and (Grill page 10 last paragraph) “To generalize BYOL to other modalities (e.g., audio, video, text, ...) it is necessary to obtain similarly suitable augmentations for each of them.” The body-part augmentations taught by Xu provide the types of augmentations required to generalize the BYOL method to action recognition, which is a stated goal of future work in Grill.

Adama teaches:
A system for pre-training of a neural network from only unlabeled data
(page 1 abstract) “This paper proposes a novel Adaptive Segmentation and Sequence Learning (ASSL) framework which aims at segmenting unlabelled observations of human activities from extracted 3D joint information. Learning from these obtained segments provides information about the underlying patterns of activity sequences needed in predicting subsequent actions.”

execute the augmentation module to: retrieve a plurality of corresponding first and second unlabeled action sequences, each including a time sequence of poses being a time sequence of object representations depicting a respective object performing a respective activity where each object representation provides locations of predefined features of the object
(page 3 column 2 section 3.1) “An activity pose J[*Examiner notes: object performing activity] as represented by; J = [j1,j2,…jm,…,jM], for J∈R3xm is a feature space which represents 3D human skeleton joints with coordinates. M represents the total number of joints in J with each joint, jm, with coordinates corresponding to horizontal, vertical and depth positions respectively[*Examiner notes: locations of predefined features].”; (page 4 column 1 definition 3) “Definition 3 Activity action sequence, S, is defined as the temporal ordering of all B key actions[*Examiner notes: time sequence of poses] obtained from activity a-n.”; (page 1 abstract) “This paper proposes a novel Adaptive Segmentation and Sequence Learning (ASSL) framework which aims at segmenting unlabelled observations of human activities from extracted 3D joint information. Learning from these obtained segments provides information about the underlying patterns of activity sequences needed in predicting subsequent actions.”

	Grill, Xu, Adama, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu with the unlabeled action sequences and unsupervised techniques of Adama because (Adama page 1 abstract) “This ASSL technique has been evaluated using both an experimental human activity dataset and a public activity dataset, and achieved a better performance when compared with other techniques including an Auto-regressive Integrated Moving Average, Support Vector Regression and Gaussian Mixture Regression Models in learning to predict patterns of activity sequences.”

And Wu teaches:
said system comprising a processor coupled to a memory storing computer-executable instructions for:
(page 2204 column 2 paragraph 1) “We trained and evaluated on a server with a 3.50-GHz Intel i7-7800K CPU[*Examiner notes: processor], 16 GB memory[*Examiner notes: memory], and NVIDIA GeForce GTX 1080 GPU.”

	Grill, Xu, Adama, Wu, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu and Adama with the non-transitory computer-readable medium taught by Wu because (Wu page 2204 column 2 paragraph 1) “We trained and evaluated on a server with a 3.50-GHz Intel i7-7800K CPU, 16 GB memory and NVIDIA GeForce GTX 1080 GPU”. That is, the computer memory and processor can be used for training and evaluating neural network models taught by Grill, Xu, and Adama.

Regarding Claim 3
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 1
(see rejection of claim 1)

And Xu further teaches:
wherein the first sub-module comprises a first set of augmentation functions which are operable on the respective first unlabeled action sequence to generate the first augmented version
wherein the second sub-module comprises a second set of augmentation functions which are operable on the respective second unlabeled action sequence to generate the second augmented version
wherein the first and second sets of augmentation functions differ by at least one augmentation function.
(page 1045 column 2 last two paragraphs) “We design a Body-part Net to extract the features of different body parts. The human body[*Examiner notes: first and second action sequences] can be divided into five parts naturally[*Examiner notes: first and second augmented versions], including two arms, two legs, and a trunk [22] . Some actions are performed by few body parts. For instance, only one or two arms participate in the action of waving hands, and the rest body parts are stationary. The Body-part Net consists of five Base-Nets and SoftMax layers as shown in Fig. 3. The sequence data of five body parts[*Examiner notes: augmentation functions differ by at least one augmentation function] are fed into five Base-Nets to model the motions of every body parts.”; (Figure 3); [*Examiner notes:  The first augmentation functions involve removing all joints except the joints of the right arm. The second augmentation functions involve removing all joints except the joints of the left arm. These are different functions. See fig. 3 annotated below.]

[Image: media_image10.png, greyscale (Xu Fig. 3, annotated)]



It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill, Adama, and Wu with Xu for the same reasons given in claim 1 above.
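For illustration only, the reading applied above (keeping only the joints of one body part is an augmentation function, and keeping a different body part is a different augmentation function) can be sketched as follows. The joint names, the `BODY_PARTS` grouping, and the function `keep_body_part` are hypothetical; they are not taken from Xu or from the application.

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]      # (x, y, z) location of one joint
Frame = Dict[str, Joint]                # joint name -> location
ActionSequence = List[Frame]            # time-ordered frames

# Hypothetical grouping of joints into body parts (illustrative only).
BODY_PARTS = {
    "right_arm": {"r_shoulder", "r_elbow", "r_wrist"},
    "left_arm": {"l_shoulder", "l_elbow", "l_wrist"},
}


def keep_body_part(seq: ActionSequence, part: str) -> ActionSequence:
    """Return a copy of seq that keeps only the joints of one body part."""
    keep = BODY_PARTS[part]
    return [{name: xyz for name, xyz in frame.items() if name in keep}
            for frame in seq]


frame = {
    "r_wrist": (0.4, 1.0, 2.0),
    "l_wrist": (-0.4, 1.0, 2.0),
    "head": (0.0, 1.7, 2.0),
}
sequence = [frame, frame]  # one sequence may be duplicated to serve as AS1 and AS2

first_aug = keep_body_part(sequence, "right_arm")   # first augmentation function
second_aug = keep_body_part(sequence, "left_arm")   # a different augmentation function
```

Under this sketch the two augmented versions come from the same input sequence but differ in which joints survive.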

Regarding Claim 5
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 1
(see rejection of claim 1)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein each of the first and second unlabeled action sequences comprise a time sequence of object representations, and wherein each of the object representations comprises locations of predefined features on the respective object.

Xu further teaches:
wherein each of the first and second unlabeled action sequences comprise a time sequence of object representations, and wherein each of the object representations comprises locations of predefined features on the respective object.
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 1046 column 2 paragraph 3) “Given a joint J = (x, y, z) in 3-D coordinate[*Examiner notes: locations of predefined features], the tth frame with K joints in a video can be represented as Qt={J1,J2,…,JK}. And a skeleton video with T frames can be represented as V={Q1,Q2,…,Qt,…,QT}.”

It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill, Adama, and Wu with Xu for the same reasons given in claim 1 above.
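As a minimal sketch of the representation quoted from Xu above (a joint J = (x, y, z), a frame Qt of K joints, and a video V of T frames), the nesting could be written as plain Python types. The type aliases and the helper `shape` are assumptions made for illustration.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # J = (x, y, z)
Frame = List[Joint]                  # Q_t = {J_1, ..., J_K}
SkeletonVideo = List[Frame]          # V = {Q_1, ..., Q_T}


def shape(video: SkeletonVideo) -> Tuple[int, int, int]:
    """Return (T, K, 3): frames, joints per frame, coordinates per joint."""
    t = len(video)
    k = len(video[0]) if t else 0
    return (t, k, 3)


video = [
    [(0.0, 1.7, 2.0), (0.4, 1.0, 2.0)],   # Q_1 with K = 2 joints
    [(0.0, 1.7, 2.0), (0.5, 1.1, 2.0)],   # Q_2
    [(0.0, 1.7, 2.0), (0.6, 1.2, 2.0)],   # Q_3
]
print(shape(video))  # (3, 2, 3)
```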

Regarding Claim 8
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Xu further teaches:
wherein the second sub-module, to generate the second augmented version, is operable to hide a subset of the respective object in the object representations in the respective second unlabeled action sequence.
[*Examiner notes: Adama above teaches unlabeled action sequences]; (figure 3)

[Image: media_image11.png, greyscale]

It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill, Adama, and Wu with Xu for the same reasons given in claim 1 above.

Regarding Claim 9
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 8
(see rejection of claim 8)

And Xu further teaches:
wherein the subset corresponds to said predefined features on one side of a geometric plane with a predefined arrangement through the respective object
[*Examiner notes: The broadest reasonable interpretation of “predefined features on one side of a geometric plane” includes only keeping joints which correspond to a particular body part (one could imagine a geometric plane or multiple planes dividing the space to only keep the arm joints)]

[Image: media_image12.png, greyscale]


It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill, Adama, and Wu with Xu for the same reasons given in claim 1 above.
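For illustration only, one way to read “predefined features on one side of a geometric plane” is a signed-distance test against a plane, with joints on the positive side hidden. The function `hide_side_of_plane`, the plane parameters, and the use of `None` to mark a hidden joint are all assumptions; neither Xu nor this action specifies how such a plane would be defined.

```python
from typing import List, Optional, Tuple

Joint = Tuple[float, float, float]


def hide_side_of_plane(frame: List[Joint],
                       normal: Joint,
                       offset: float = 0.0) -> List[Optional[Joint]]:
    """Replace a joint with None when normal . joint + offset > 0."""
    hidden: List[Optional[Joint]] = []
    for joint in frame:
        signed = sum(n * c for n, c in zip(normal, joint)) + offset
        hidden.append(None if signed > 0 else joint)
    return hidden


frame = [(-0.4, 1.0, 2.0), (0.0, 1.7, 2.0), (0.4, 1.0, 2.0)]
# Plane x = 0 with its normal along +x: joints with x > 0 are hidden.
masked = hide_side_of_plane(frame, normal=(1.0, 0.0, 0.0))
```

A single plane cannot isolate every body part, which is why the examiner note also contemplates multiple planes.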

Regarding Claim 20
Grill in view of Xu and Adama teaches:
the method of claim 19
(see rejection of claim 19)

And Wu further teaches:
A non-transitory computer-readable medium comprising computer instructions which, when executed by a processor, cause the processor to perform
(page 2204 column 2 paragraph 1) “We trained and evaluated on a server with a 3.50-GHz Intel i7-7800K CPU[*Examiner notes: processor], 16 GB memory[*Examiner notes: non-transitory computer-readable medium], and NVIDIA GeForce GTX 1080 GPU.”

	It would have been further obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu and Adama with the non-transitory computer-readable medium taught by Wu because (Wu page 2204 column 2 paragraph 1) “We trained and evaluated on a server with a 3.50-GHz Intel i7-7800K CPU, 16 GB memory and NVIDIA GeForce GTX 1080 GPU”. That is, the non-transitory computer-readable medium and processor can be used for training and evaluating neural network models taught by Grill and Xu.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu, and further in view of NPL reference Li et al. “Cascaded Deep Monocular 3D Human Pose Estimation with Evolutionary Training Data” herein referred to as Li.

Regarding Claim 4
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 1
(see rejection of claim 1)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the second sub-module is operable to apply more augmentation functions than the first sub-module

However, Li teaches:
wherein the second sub-module is operable to apply more augmentation functions than the first sub-module
[*Examiner notes: The second sub-module (children) is an evolved version of the first sub-module (parents) and thus applies more augmentation functions (in this case, a mutation function and a crossover function).]

[Image: media_image13.png, greyscale]

[Image: media_image14.png, greyscale]


	Grill, Xu, Adama, Wu, Li, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the further augmentation functions of Li because (Li page 2 column 1 paragraph 2) “With an augmented training dataset after evolution, we propose a cascaded model achieving state-of-the art accuracy under various evaluation settings.”

Claims 7 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu, and further in view of NPL reference Yao et al. “A data augmentation method for human action recognition using dense joint motion images” herein referred to as Yao.

Regarding Claim 7
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the second sub-module, to generate the second augmented version is operable to distort the object representations in the respective second unlabeled action sequence in a selected direction

However, Yao teaches:
wherein the second sub-module, to generate the second augmented version is operable to distort the object representations in the respective second unlabeled action sequence in a selected direction
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 5 column 2 bullet point (2)) “Human beings have similar skeletons. Regardless of a person’s external shape, his/her skeleton is similar to others. Thus, strategies for scaling an image up and down[*Examiner notes: selected direction] based on the original skeleton can be used to mimic people of different sizes. Increasing the values of the R, G and B channels by a certain amount could be a simple way to perform this task[*Examiner notes: distort object representations].”

Grill, Xu, Adama, Wu, Yao, and the instant application are analogous because they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the augmentations of Yao because (Yao page 9 column 1 section 5) “Compared with random-sample-augmentation strategies, our credible-sample-generating strategies can mimic specific actions performed by people of different sizes with different speeds. The generated samples were less noisy than random samples and were very helpful for model training.”

Regarding Claim 13
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the first sub-module, to generate the first augmented version, is operable to change a time distance between the object representations in the respective first unlabeled action sequence

However, Yao teaches:
wherein the first sub-module, to generate the first augmented version, is operable to change a time distance between the object representations in the respective first unlabeled action sequence
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 5 column 1 bullet point (1)) “The y-axis of a DJMI represents the time dimension. Keeping the width of the DJMI unchanged while scaling the height of the DJMI up/down can represent slowing down or speeding up the original action.”

Grill, Xu, Adama, Wu, Yao, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the augmentations of Yao because (Yao page 9 column 1 section 5) “Compared with random-sample-augmentation strategies, our credible-sample-generating strategies can mimic specific actions performed by people of different sizes with different speeds. The generated samples were less noisy than random samples and were very helpful for model training.”
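For illustration only, changing the time distance between object representations can be sketched as resampling a joint trajectory along the time axis. Yao scales the height of a DJMI image; the sketch below swaps in plain linear interpolation on coordinates instead, and the function `rescale_time` and its `factor` parameter are hypothetical.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]


def rescale_time(track: List[Joint], factor: float) -> List[Joint]:
    """Resample one joint's track to round(len(track) * factor) frames.

    factor > 1 slows the action down (more frames); factor < 1 speeds it up.
    Assumes the track has at least two frames.
    """
    n_in = len(track)
    n_out = max(2, round(n_in * factor))
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)   # fractional index into track
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo                         # interpolation weight
        out.append(tuple((1 - w) * a + w * b
                         for a, b in zip(track[lo], track[hi])))
    return out


track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
slow = rescale_time(track, 2.0)   # 6 frames covering the same motion
```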

Claims 6 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu, and further in view of NPL reference Varol et al. “Long-Term Temporal Convolutions for Action Recognition” herein referred to as Varol.

Regarding Claim 6
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the second sub-module, to generate the second augmented version, is operable to randomly select a coherent subset of the object representations in the respective second unlabeled action sequence

However, Varol teaches:
wherein the second sub-module, to generate the second augmented version, is operable to randomly select a coherent subset of the object representations in the respective second unlabeled action sequence
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 1512 column 1 paragraph 3) “Inspired by the random spatial cropping during training, we apply the corresponding augmentation to the temporal dimension as in [6], which we call random clipping. During training, given an input video, we randomly select a point (x,y,t)[*Examiner notes: randomly select] to sample a video clip of fixed size[*Examiner notes: a coherent subset].”

	Grill, Xu, Adama, Wu, Varol, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the augmentation of Varol because (Varol page 1513 column 1 section 4.2.2) “Table 1 demonstrates the contribution of data augmentation when training a large CNN with limited amount of data. Our baseline uses sliding window clips with 75% overlap and a dropout of 0.5 during training. We gain 3.1% with random clipping, 1.6% with multiscale cropping and 2% with higher dropout ratio. When combined, the data augmentation and a higher dropout results in a 4% gain for video classification on UCF101 split 1.”
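For illustration only, the temporal part of the random clipping quoted from Varol can be sketched as choosing a random start index and taking a contiguous, fixed-length window. The function `random_clip` and its parameters are hypothetical, and the spatial (x, y) part of Varol's sampling is left out.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def random_clip(seq: Sequence[T], clip_len: int, rng=random) -> List[T]:
    """Return clip_len consecutive items starting at a random index."""
    if clip_len > len(seq):
        raise ValueError("clip_len exceeds sequence length")
    start = rng.randrange(len(seq) - clip_len + 1)
    return list(seq[start:start + clip_len])


frames = list(range(10))                      # stand-ins for 10 object representations
clip = random_clip(frames, 4, random.Random(0))
```

Because the window is contiguous, the selected frames keep their original order and spacing.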

Regarding Claim 11
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the second sub-module, to generate the second augmented version, is operable to randomly select an object representation in the respective second unlabeled action sequence and rearrange the respective second unlabeled action sequence with the selected object representation as starting point

However, Varol teaches:
wherein the second sub-module, to generate the second augmented version, is operable to randomly select an object representation in the respective second unlabeled action sequence and rearrange the respective second unlabeled action sequence with the selected object representation as starting point
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 1512 column 1 paragraph 3) “Inspired by the random spatial cropping during training, we apply the corresponding augmentation to the temporal dimension as in [6], which we call random clipping. During training, given an input video, we randomly select a point (x,y,t)[*Examiner notes: randomly select an object representation] to sample a video clip of fixed size[*Examiner notes: rearrange the respective action sequence with the selected object representation as a starting point].”

	Grill, Xu, Adama, Wu, Varol, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the augmentation of Varol because (Varol page 1513 column 1 section 4.2.2) “Table 1 demonstrates the contribution of data augmentation when training a large CNN with limited amount of data. Our baseline uses sliding window clips with 75% overlap and a dropout of 0.5 during training. We gain 3.1% with random clipping, 1.6% with multiscale cropping and 2% with higher dropout ratio. When combined, the data augmentation and a higher dropout results in a 4% gain for video classification on UCF101 split 1.”



Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu, and further in view of NPL reference Mehta et al. “VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera” herein referred to as Mehta.

Regarding Claim 10
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the second sub-module, to generate the second augmented version, is operable to perform a temporal smoothing of the object representations in the respective second unlabeled action sequence

However, Mehta teaches:
wherein the second sub-module, to generate the second augmented version, is operable to perform a temporal smoothing of the object representations in the respective second unlabeled action sequence
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 6 column 1 section 4.2) “We combine the 2D and 3D joint positions in a joint optimization framework, along with temporal filtering and smoothing, to obtain an accurate, temporally stable and robust result.”
	
	Grill, Xu, Adama, Wu, Mehta, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the temporal smoothing of Mehta because (Mehta page 6 column 1 section 4.2) “We combine the 2D and 3D joint positions in a joint optimization framework, along with temporal filtering and smoothing, to obtain an accurate, temporally stable and robust result.”
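For illustration only, temporal smoothing of a joint track can be sketched with a centered moving average. This is a stand-in technique chosen for brevity; it is not represented to be the filter used by Mehta, and the function `smooth` and its `window` parameter are hypothetical.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]


def smooth(track: List[Joint], window: int = 3) -> List[Joint]:
    """Centered moving average over time; the window is truncated at the ends."""
    half = window // 2
    out = []
    for t in range(len(track)):
        lo = max(0, t - half)
        hi = min(len(track), t + half + 1)
        n = hi - lo
        out.append(tuple(sum(f[d] for f in track[lo:hi]) / n
                         for d in range(3)))
    return out


track = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
smoothed = smooth(track)   # the spike at the middle frame is damped
```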

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu, and further in view of NPL reference Agahian et al. “Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition” herein referred to as Agahian.

Regarding Claim 12
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 5
(see rejection of claim 5)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the second sub-module, to generate the second augmented version, is operable to flip the respective object in the object representations in the respective second unlabeled action sequence through a mirror plane

However, Agahian teaches:
wherein the second sub-module, to generate the second augmented version, is operable to flip the respective object in the object representations in the respective second unlabeled action sequence through a mirror plane
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 600 column 1 line 6) “In CAD-60 dataset, one of the four subjects was left-handed (subject number 3). We use mirroring operations before constructing the feature vector in order to convert laterality of the actions and to make it similar to the right-handed actions.”
	Grill, Xu, Adama, Wu, Agahian, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the flipping operation taught by Agahian because (Agahian page 597 column 2 paragraph 2 line 13) “The actions are performed with different laterality as one of the subjects is left-handed. In order to compensate the effect of laterality, some of the proposed methods [40, 47, 49] also added a mirrored version of these instances to the training data to achieve invariance toward handedness of the subjects.”
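For illustration only, a mirroring operation on a skeleton frame can be sketched as reflecting each joint through the plane x = 0 and swapping left and right joint labels. The `l_`/`r_` naming convention and the function `mirror` are assumptions; Agahian does not give an implementation in the passages cited.

```python
from typing import Dict, Tuple

Joint = Tuple[float, float, float]


def mirror(frame: Dict[str, Joint]) -> Dict[str, Joint]:
    """Reflect a frame through the plane x = 0 and swap left/right labels."""
    swap = {"l": "r", "r": "l"}
    flipped = {}
    for name, (x, y, z) in frame.items():
        side, sep, rest = name.partition("_")
        # Only names of the form "l_..." or "r_..." are relabelled.
        new_name = swap[side] + sep + rest if (side in swap and sep) else name
        flipped[new_name] = (-x, y, z)
    return flipped


frame = {
    "l_wrist": (-0.4, 1.0, 2.0),
    "r_wrist": (0.5, 1.2, 2.0),
    "head": (0.0, 1.7, 2.0),
}
mirrored = mirror(frame)
```

Applying `mirror` twice returns the original frame, which is a quick sanity check that the operation is a reflection.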

	Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu, and further in view of NPL reference Phonsing et al. “Multi kinect cameras setup for skeleton based action recognition” herein referred to as Phonsing.

Regarding Claim 14
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 1
(see rejection of claim 1)

Grill in view of Xu, Adama, and Wu does not explicitly teach:
wherein the augmentation module is configured to retrieve the first and second unlabeled action sequences so as to correspond to different viewing angles onto the respective object performing the respective activity.

However, Phonsing teaches:
wherein the augmentation module is configured to retrieve the first and second action sequences so as to correspond to different viewing angles onto the respective object performing the respective activity.
[*Examiner notes: Adama above teaches unlabeled action sequences]; (page 3 column 2 first paragraph) “The images in the dataset are captured using 3 synced Kinect cameras. Fig. 4 illustrated the set-up of cameras.”; (Figure 4 caption) “Shows a setting of kinect cameras with a subject; left and right camera is set at 35 degree with a distance of 2 meters away from a subject.”



[Image: media_image15.png, greyscale (Phonsing Figure 4)]

	Grill, Xu, Adama, Wu, Phonsing, and the instant application are analogous because they are all directed to data processing and predictions.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the different viewing angles as taught by Phonsing because (Phonsing page 1 column 1) “From the aforementioned issues, using only one camera for image acquisition, or “one view”, is not sufficient, multi cameras setting has been proposed to capture images of objects in different perspectives.”

Claims 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu, Adama, Wu and further in view of NPL reference Lin et al. “A Framework for Fall Detection Based on OpenPose Skeleton and LSTM/GRU Models” herein referred to as Lin.

Regarding Claim 15
Grill in view of Xu, Adama, and Wu teaches:
The system of claim 1
(see rejection of claim 1)

And Xu further teaches:
which further comprises a training sub-system, 
(page 1047 column 1 paragraph 2) “For the CV protocol, videos of two view-points are used for training and the rest one for testing. For the CS protocol, 20 subjects are used for training and the rest for testing.”

which comprises: a third neural network, which is configured to operate on third input data to generate third representation data, the third neural network being initialized by use of the parameter definition, 
(page 1045 column 2 last two paragraphs) “The human body[*Examiner notes: third input data] can be divided into five parts naturally, including two arms, two legs, and a trunk [22] . […] The Body-part Net consists of five Base-Nets[*Examiner notes: third neural network] and SoftMax layers as shown in Fig. 3. The sequence data of five body parts are fed into five Base-Nets to model the motions of every body parts. And the scores produced by the SoftMax[*Examiner notes: third representation data] layers are fused”

[Image: media_image16.png, greyscale]


wherein the training sub-system is configured to, by the third updating module, train the third network to recognize one or more activities represented by the activity label data.
(page 1046 column 2 section F) “The subnets will be trained independently, with cross-entropy as the cost function”

It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill and Adama with Xu for the same reasons given in claim 1 above.

Grill in view of Xu, Adama, and Wu does not explicitly teach:
and a third updating module, which is configured to update parameters of the third network to minimize a difference between the third representation data and activity label data associated with the third input data, 

However, Lin teaches:
and a third updating module, which is configured to update parameters of the third network to minimize a difference between the third representation data and activity label data associated with the third input data
(page 3 last paragraph) “The UR Fall Detection Dataset is produced by the Interdisciplinary Center for Computational Modeling of Rzeszow University. The content contains 70 (30 falls and 40 activities of daily living) sequences at a rate of 30 frames per second. Both fall events and other daily living activities such as standing, squatting down to pick up objects, and lying down were recorded[*Examiner notes: activity label data].”; (page 4 section 2.2 paragraph 1) “OpenPose is a supervised convolutional neural network[*Examiner notes: mapped to minimize a difference] based on Caffe for real-time multi-person 2D pose estimation developed by Carnegie Mellon University (CMU) [25]. It can realize posture estimation of human body movements, facial expressions, and finger movements. It is suitable for single- and multiple-user settings with excellent recognition effect and fast recognition speed.”; (Lin page 13 paragraph 2) “Loss function, also known as the Object function, is used to calculate the gap between the predicted value of the neural network and the target value. The smaller the loss function[*Examiner notes: minimize the loss function], the better the accuracy of the neural network. Cross Entropy, which is a loss function that describes the size of the difference between the model’s predicted value and the real value[*Examiner notes: difference between third representation data and activity label data], is often used in classification problems.”

	Grill, Xu, Adama, Wu, Lin, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, and Wu with the updating of Lin because (Lin page 13 paragraph 2) “Loss function, also known as the Object function, is used to calculate the gap between the predicted value of the neural network and the target value. The smaller the loss function, the better the accuracy of the neural network.” That is, it is ideal to minimize the difference because it increases the accuracy of the model.
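For illustration only, the cross-entropy difference between a network's output and an activity label, as described in the Lin passage quoted above, can be sketched for a single sample. The functions `softmax` and `cross_entropy` are a generic textbook formulation, not code from Lin or Xu.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Negative log-probability that the network assigns to the labelled class."""
    return -math.log(softmax(logits)[label])


# A prediction that favours the labelled class yields the smaller loss.
good = cross_entropy([5.0, 0.0], label=0)
bad = cross_entropy([0.0, 5.0], label=0)
```

Updating parameters to reduce this value is what "minimize a difference" amounts to under the mapping above.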

Regarding Claim 16
Grill in view of Xu, Adama, Wu, and Lin teaches:
The system of claim 15
(see rejection of claim 15)

And Xu further teaches:
wherein the training sub-system further includes a further augmentation module, wherein the processor is further configured to execute the further augmentation module to retrieve third action sequences of one or more objects performing one or more activities, generate the third input data to include third augmented versions of the third action sequences, wherein the further augmentation module is configured in correspondence with the first sub-module
(page 1045 column 2 last paragraph) “The sequence data of five body parts are fed into five Base-Nets to model the motions of every body parts[*Examiner notes: augmented versions of the third action sequences]. And the scores produced by the SoftMax layers are fused[*Examiner notes: configured in correspondence with the first sub-module] based on (3)”

[Image: media_image17.png, greyscale]


It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill, Adama, Wu, and Lin with Xu for the same reasons given in claim 1 above.


Regarding Claim 17
Grill in view of Xu, Adama, Wu, and Lin teaches:
The system of claim 15
(see rejection of claim 15)

And Wu further teaches:
wherein the memory further stores computer-readable instruction for a fourth neural network and a fourth updating module.
(page 2203 column 1 paragraph 2) “In our proposed multi-teacher knowledge distillation framework, the logits vector produced by the student network[*Examiner notes: mapped to fourth neural network]”; (page 2203 column 1 last paragraph) “There are two objective functions when training the student model. The first objective function L1 minimizes the cross entropy with the soft labels (qTt )i and the soft probability (qTs)i produced by the student model[*Examiner notes: fourth updating module]”

wherein the processor is further configured to execute the fourth neural network on fourth input data to generate fourth representation data
(page 2203 column 1 paragraph 2) “In our proposed multi-teacher knowledge distillation framework, the logits vector produced by the student network[*Examiner notes: mapped to fourth neural network] for an input video vi[*Examiner notes: mapped to fourth input data], i = 1, ...,N is represented by (zs)i, where the dimension of vector (zs)i = [(zs)1i, ..., (zs)Ci] is the number of categories C. The softmax layer converts the logits vector (zs)i to a probability distribution[*Examiner notes: fourth representation data] (qs)i = [(qs)1i, ..., (qs)Ci], [Equation 2]”

[Equation image: media_image18.png (Wu, Equation 2)]


and execute the fourth updating module to update parameters of the fourth network to minimize a difference between the fourth representation data and fifth representation data,
wherein the fifth representation data is generated by the third neural network, when trained
(page 2203 column 1 last paragraph) “Distillation uses the class probabilities produced by the teacher model as “soft labels” for training the student model[*Examiner notes: fifth representation data]. There are two objective functions when training the student model. The first objective function L1 minimizes the cross entropy with the soft labels (qTt )i and the soft probability (qTs)i produced by the student model[*Examiner notes: minimize the difference between fourth and fifth representation data]. (qTs )i is computed by GSoftmax with the same temperature T as the teacher model”; [*Examiner notes: Teacher model corresponds to third neural network. Cross-entropy is one way to measure a difference.]

based on the fourth input data.
(page 2204 column 2 paragraph 3) “There are two restrictions on the input source while distilling. First, the extracted data for teachers and student must be from the same frame or from the same group of pictures (GOP)[*Examiner notes: third neural network operates on fourth input data]. Second, the extracted data for teachers and student must have the same data augmentation process which is following CoViAR [10], because different preprocessing processes may affect teachers’ observations.”

	It would have been further obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu, Adama, Wu, and Lin with the fourth neural network of Wu because (Wu page 2202 abstract) “Experiments show that we can reach a 2.4× compression rate in a number of parameters and a 1.2× computation reduction”.
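
The soft-label objective quoted from Wu above can be illustrated with a short sketch. This is a minimal illustration, not Wu's implementation: the function names, the example logits, and the temperature value are assumptions, and only the first of the two objective functions mentioned in the quoted passage is shown, because the second is not quoted in this action.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert a logits vector to a probability distribution softened by T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_cross_entropy(teacher_logits, student_logits, temperature):
    """Cross entropy between teacher soft labels and student soft probabilities.

    Both distributions use the same temperature, as the quoted passage requires.
    """
    q_t = softmax_with_temperature(teacher_logits, temperature)
    q_s = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(q_t, q_s))

# Hypothetical logits for one input video over C = 3 categories.
teacher = [4.0, 1.0, 0.5]
student_far = [0.5, 1.0, 4.0]    # disagrees with the teacher
student_near = [3.8, 1.1, 0.4]   # close to the teacher

loss_far = soft_label_cross_entropy(teacher, student_far, temperature=2.0)
loss_near = soft_label_cross_entropy(teacher, student_near, temperature=2.0)
```

In this sketch the loss is smaller for the student whose logits track the teacher's, which is the sense in which cross entropy can act as a measure of difference between the two sets of representation data.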

Regarding Claim 18
Grill in view of Xu, Adama, Wu, and Lin teaches:
The system of claim 17, 
(see rejection of claim 17)

And Wu further teaches:
wherein the fourth neural network has a smaller number of channels than the third neural network
(page 2203 column 2 section 2.2) “For this reason, we decided to compress the spatial network to a smaller model. According to the ResNet architecture for ImageNet [12], the number of parameters of ResNet-152 is approximately 58.2 million, and for ResNet-18 is approximately 11.2 million[*Examiner notes: smaller number of channels than the third neural network]”

It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to combine Grill, Xu, Adama, and Lin with Wu for the same reasons given in claim 17 above.

Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Grill in view of Xu and Adama.

Regarding Claim 19
Grill teaches:
operating a first untrained neural network on the first input data to generate first representation data; operating a second untrained neural network on the second input data to generate second representation data
(page 1 abstract) “BYOL relies on two neural networks, referred to as online[*Examiner notes: first neural network] and target networks[*Examiner notes: second neural network], that interact and learn from each other.”; (page 3 last sentence) “BYOL produces two augmented views v ≜ t(x)[*Examiner notes: first input data] and v’ ≜ t’(x)[*Examiner notes: second input data] from x by applying respectively image augmentations t ∼ T and t’ ∼ T’.”; (page 4 first paragraph) “From the first augmented view v, the online network outputs a representation yθ ≜ fθ(v) and a projection zθ ≜ gθ(y). The target network outputs y’ξ ≜ fξ(v’) and the target projection z’ξ ≜ gξ(y’) from the second augmented view v’. We then output a prediction qθ(zθ) of z’ξ and l2-normalize both qθ(zθ) and z’ξ to q̄θ(zθ) ≜ qθ(zθ)/||qθ(zθ)||2[*Examiner notes: mapped to first representation] and z̄’ξ ≜ z’ξ/||z’ξ||2[*Examiner notes: mapped to second representation].”; [*Examiner notes: See figure 2 annotated below. The neural networks undergo pre-training, and are untrained prior to pre-training.]

[Figure: media_image4.png (Grill, figure 2, annotated)]





updating parameters of the first untrained neural network to minimize a difference between the first representation data and the second representation data
(page 4 paragraph 1) “From the first augmented view v, the online network outputs a representation yθ ≜ fθ(v) and a projection zθ ≜ gθ(y). The target network outputs y’ξ ≜ fξ(v’) and the target projection z’ξ ≜ gξ(y’) from the second augmented view v’. We then output a prediction qθ(zθ) of z’ξ and l2-normalize both qθ(zθ) and z’ξ to q̄θ(zθ) ≜ qθ(zθ)/||qθ(zθ)||2 and z̄’ξ ≜ z’ξ/||z’ξ||2. Note that this predictor is only applied to the online branch, making the architecture asymmetric between the online and target pipeline. Finally we define the following mean squared error between the normalized predictions and target projections”; (page 4 above equation 3) “We symmetrize the loss Lθ,ξ in Eq. 2 by separately feeding v’ to the online network and v to the target network to compute L̃θ,ξ. At each training step, we perform a stochastic optimization step to minimize LBYOLθ,ξ = Lθ,ξ + L̃θ,ξ with respect to θ only[*Examiner notes: update parameters of first neural network to minimize difference], but not ξ, as depicted by the stop-gradient in Figure 2. BYOL’s dynamics are summarized as”; [*Examiner notes: See equations 2 and 3 annotated below.]

[Equation image: media_image6.png]

[Equation image: media_image7.png]
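
The loss described in the quoted Grill passage, a squared error between the l2-normalized prediction and the l2-normalized target projection that is then symmetrized over the two views, can be sketched as follows. This is an illustrative sketch only: the function names and toy vectors are assumptions, and plain Python lists stand in for network outputs. For unit vectors the squared error equals 2 minus twice their dot product, so it ranges from 0 (same direction) to 4 (opposite directions).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit l2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def byol_style_loss(prediction, target_projection):
    """Squared error between the normalized prediction and target projection."""
    q = l2_normalize(prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(q, z))

def symmetrized_loss(pred_v, target_v_prime, pred_v_prime, target_v):
    """Sum of the loss for (v -> online, v' -> target) and the swapped pairing."""
    return (byol_style_loss(pred_v, target_v_prime)
            + byol_style_loss(pred_v_prime, target_v))

# Toy two-dimensional outputs, chosen only to show the range of the loss.
same_direction = byol_style_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = byol_style_loss([1.0, 0.0], [0.0, 3.0])
opposite = byol_style_loss([1.0, 0.0], [-2.0, 0.0])
```

In the quoted method only the online parameters θ receive gradients from this loss; the sketch computes the loss value and does not model the stop-gradient or the optimizer step.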


updating parameters of the second untrained neural network as a function of the parameters of the first neural network
[*Examiner notes: See annotations of equation 1 below.  θ are the parameters of the first neural network and ξ are the parameters of the second neural network]

[Equation image: media_image3.png (Grill, equation 1, annotated)]
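
The annotated equation itself is not recoverable from the text of this action. Assuming it is the exponential-moving-average target update described in the BYOL paper, in which the target parameters ξ are moved toward the online parameters θ with a decay rate τ, the relationship "parameters of the second network as a function of the parameters of the first" can be sketched as below. The function name, the flat parameter lists, and the value of `tau` are assumptions made for illustration.

```python
def ema_update(target_params, online_params, tau):
    """Return new target parameters: xi <- tau * xi + (1 - tau) * theta."""
    return [tau * xi + (1.0 - tau) * theta
            for xi, theta in zip(target_params, online_params)]

# Toy parameters: the target drifts toward the online values at each step.
xi = [0.0, 1.0]
theta = [1.0, 1.0]
xi_next = ema_update(xi, theta, tau=0.9)
```

With `tau` close to 1 the target network changes slowly, and it is never updated by gradient descent, which matches the quoted statement that the loss is minimized with respect to θ only.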


and providing, after operating the first and second untrained neural networks on one or more instances of the first and second input data, at least a subset of the parameters of the first untrained neural network as a parameter definition of a pre-trained neural network, 
(page 4 below equation 1) “At the end of training, we only keep the encoder fθ[*Examiner notes: provide parameters of first neural network as parameter definition of pre-trained neural network]; as in [9]”; (page 6 paragraph 2) “We first evaluate BYOL’s representation by training a linear classifier on top of the frozen representation, following the procedure described in [48, 74, 41, 10, 8], and appendix C.1; we report top-1 and top-5 accuracies in % on the test set in Table 1.”; [*Examiner notes: The broadest reasonable interpretation of a pre-trained neural network includes a neural network which is already trained. Therefore, since fθ is already trained, it is a pre-trained neural network]

Grill does not explicitly teach:
retrieving first and second unlabeled action sequences of an object performing an activity, each including a time sequence of poses which is a time sequence of object representations, where each object representation provides locations of predefined features of the object;
generating the first and second unlabeled input data to include first and second augmented versions of the first and second action sequences
wherein said generating the first and second unlabeled input data comprises operating a first sub-module on the first action sequence to generate the first augmented version, operating a second sub-module, which differs from the first sub-module, on the second unlabeled action sequence to generate the second augmented version.

However, Xu teaches:
generating the first and second unlabeled input data to include first and second augmented versions of the first and second action sequences
[*Examiner notes: Adama below teaches unlabeled action sequences]; (page 1045 column 2 last two paragraphs) “We design a Body-part Net to extract the features of different body parts. The human body can be divided into five parts naturally[*Examiner notes: augmentation module having first and second sub-module], including two arms, two legs, and a trunk [22] . Some actions are performed by few body parts. For instance, only one or two arms participate in the action of waving hands, and the rest body parts are stationary.”; (Fig. 3)

[Figure: media_image8.png (Xu, Fig. 3)]



wherein said generating the first and second unlabeled input data comprises operating a first sub-module on the first action sequence to generate the first augmented version, 
operating a second sub-module, which differs from the first sub-module, on the second unlabeled action sequence to generate the second augmented version.
[*Examiner notes: Adama below teaches unlabeled action sequences]; (page 1045 column 2 paragraph 2) “As same as RGB videos, skeleton video sequences also have both spatial and temporal information[*Examiner notes: time sequence of poses]. The spatial information represents the interaction between joints, and the temporal information records the dynamic changes of motions.”; (page 1047 column 1 paragraph 2) “NTU RGB+D Dataset. This dataset is the largest skeleton-based human action dataset captured by three Kinect V2 cameras, including more than 56 000 sequences in 60 classes of actions performed by 40 subjects[*Examiner notes: depicting a respective activity].”; (page 1045 column 2 last paragraph) “The Body-part Net consists of five Base-Nets and SoftMax layers as shown in Fig. 3. The sequence data of five body parts are fed[*Examiner notes: generate input data including augmented versions of action sequences] into five Base-Nets to model the motions of every body parts. And the scores produced by the SoftMax layers are fused based on (3), where Q equals 5 in this subnet.”

and including both of the corresponding first and second augmented versions in both of the first and second input data such that the first and second networks operate concurrently on the corresponding first and second augmented versions
(page 1045 column 2 last paragraph) “The Body-part Net consists of five Base-Nets[*Examiner notes: neural networks] and SoftMax layers as shown in Fig. 3. The sequence data of five body parts are fed into five Base-Nets to model the motions of every body parts[*Examiner notes: operating concurrently]. And the scores produced by the SoftMax layers are fused based on (3), where Q equals 5 in this subnet.”; (page 1047) “For the CV protocol, videos of two view-points are used for training and the rest one for testing. For the CS protocol, 20 subjects are used for training and the rest for testing.”; [*Examiner notes: Before the neural network is trained, it is an untrained model. The broadest reasonable interpretation of “concurrently” includes neural networks operating side-by-side in an ensemble manner (see figure 3 annotated below)]; (figure 3)

[Figure: media_image9.png (Xu, figure 3, annotated)]
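
The body-part decomposition relied on in the Xu citations above (two arms, two legs, and a trunk, each fed to its own Base-Net) can be pictured with a small sketch. The joint indices, the 15-joint skeleton, and the function name are hypothetical; the action does not reproduce Xu's actual joint grouping or the fusion rule of Xu's equation (3), so neither is modeled here.

```python
# Hypothetical grouping of a 15-joint skeleton into five body parts.
BODY_PARTS = {
    "trunk": [0, 1, 2],
    "left_arm": [3, 4, 5],
    "right_arm": [6, 7, 8],
    "left_leg": [9, 10, 11],
    "right_leg": [12, 13, 14],
}

def split_into_body_parts(sequence, body_parts=BODY_PARTS):
    """Split a skeleton sequence into one sub-sequence per body part.

    sequence: list of frames, each frame a list of (x, y, z) joint locations.
    Returns a dict mapping part name to a list of frames holding only that
    part's joints, preserving the time order of the frames.
    """
    return {
        name: [[frame[j] for j in joints] for frame in sequence]
        for name, joints in body_parts.items()
    }

# Toy sequence of two identical frames; joint i sits at (i, i, i).
frame = [(float(i), float(i), float(i)) for i in range(15)]
sequence = [frame, frame]
part_sequences = split_into_body_parts(sequence)
```

Each of the five resulting sub-sequences is a different view of the same time sequence of poses, which is the sense in which the examiner maps the body-part inputs to augmented versions of an action sequence.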




	Grill, Xu, and the instant application are analogous because they are all directed to machine learning.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill with the keypoint action sequences, augmentation, and neural network input process taught by Xu because (Xu page 1044 abstract) “Experimental results show that the proposed Ensem-NN performs better than state-of-the-art methods on three widely used datasets”, (Xu page 1047 column 2 last paragraph) “We design four subnets based on a Base-Net (1-D CNN with residual structure) to extract diverse and complementary features”, and (Grill page 10 last paragraph) “To generalize BYOL to other modalities (e.g., audio, video, text, ...) it is necessary to obtain similarly suitable augmentations for each of them.” The body-part augmentations taught by Xu provide the types of augmentations required to generalize the BYOL method to action recognition, which is a stated goal of future work in Grill.

Adama teaches:
A computer-implemented method for use in pre-training of a neural network from only unlabeled data, said method comprising: 
(page 1 abstract) “This paper proposes a novel Adaptive Segmentation and Sequence Learning (ASSL) framework which aims at segmenting unlabelled observations of human activities from extracted 3D joint information. Learning from these obtained segments provides information about the underlying patterns of activity sequences needed in predicting subsequent actions.”

retrieving first and second unlabeled action sequences of an object performing an activity, each including a time sequence of poses which is a time sequence of object representations, where each object representation provides locations of predefined features of the object
(page 3 column 2 section 3.1) “An activity pose J[*Examiner notes: object performing activity] as represented by; J = [j1,j2,…jm,…,jM], for J∈R3xm is a feature space which represents 3D human skeleton joints with coordinates. M represents the total number of joints in J with each joint, jm, with coordinates corresponding to horizontal, vertical and depth positions respectively[*Examiner notes: locations of predefined features].”; (page 4 column 1 definition 3) “Definition 3 Activity action sequence, S, is defined as the temporal ordering of all B key actions[*Examiner notes: time sequence of poses] obtained from activity a-n.”; (page 1 abstract) “This paper proposes a novel Adaptive Segmentation and Sequence Learning (ASSL) framework which aims at segmenting unlabelled observations of human activities from extracted 3D joint information.”

	Grill, Xu, Adama, and the instant application are analogous because they are all directed to machine learning.
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the present invention to modify the neural networks of Grill in view of Xu with the unlabeled action sequences of Adama because (Adama page 1 abstract) “This ASSL technique has been evaluated using both an experimental human activity dataset and a public activity dataset, and achieved a better performance when compared with other techniques including an Auto-regressive Integrated Moving Average, Support Vector Regression and Gaussian Mixture Regression Models in learning to predict patterns of activity sequences.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Ezra J Baker whose telephone number is (703)756-1087. The examiner can normally be reached Monday - Friday 10:00 am - 8:00 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/E.J.B./Examiner, Art Unit 2126

/DAVID YI/Supervisory Patent Examiner, Art Unit 2126
Prosecution Timeline

May 24, 2022: Application Filed
May 14, 2025: Non-Final Rejection mailed (§103, §112)
Aug 14, 2025: Response Filed
Nov 06, 2025: Final Rejection mailed (§103, §112)
Jan 06, 2026: Response after Non-Final Action
Feb 02, 2026: Request for Continued Examination
Feb 09, 2026: Response after Non-Final Action
Apr 23, 2026: Non-Final Rejection mailed (§103, §112) (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/863,840
Patent 12619886
Frozen Model Adaptation Through Soft Prompt Transfer
3y 9m to grant Granted May 05, 2026
17/559,159
Patent 12608619
SUPERSEDED FEDERATED LEARNING
4y 4m to grant Granted Apr 21, 2026
17/455,252
Patent 12585964
EXHAUSTIVE LEARNING TECHNIQUES FOR MACHINE LEARNING ALGORITHMS
4y 4m to grant Granted Mar 24, 2026
17/475,901
Patent 12579477
FEATURE SELECTION USING FEEDBACK-ASSISTED OPTIMIZATION MODELS
4y 6m to grant Granted Mar 17, 2026
17/460,373
Patent 12505379
COMPUTER-READABLE RECORDING MEDIUM STORING MACHINE LEARNING PROGRAM, MACHINE LEARNING METHOD, AND INFORMATION PROCESSING DEVICE OF IMPROVING PERFORMANCE OF LEARNING SKIP IN TRAINING MACHINE LEARNING MODEL
4y 3m to grant Granted Dec 23, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

3-4
Expected OA Rounds
50%
Grant Probability
99%
With Interview (+53.3%)
4y 0m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 16 resolved cases by this examiner. Grant probability derived from career allowance rate.