DETAILED ACTION
Notice of AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
2. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
3. Claims 1, 3, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) in view of Shazeer et al (US Pub: 2020/0342316).
Regarding claim 1 (Currently Amended), Yi et al teaches: An apparatus for a sequence recognition in video, the apparatus comprising: a memory including instructions; and processing circuitry that, when in operation, is configured by the instructions to: obtain video that includes a sequence of activities [page 1: Introduction]; invoke a sequence-to-sequence transformer on the video to produce a set of labels that correspond to activities in the sequence of activities [page 1: Introduction, page 4: 3], invoking the sequence-to-sequence transformer including: encoding an input sequence from the video to an encoded input sequence [page 4: 3.1]; and applying a local-chunk attention mechanism to generate queries, keys, and values [page 3: fig. 1(a)], applying the local-chunk attention mechanism including restricting an attention mechanism for a neuron in the sequence-to-sequence transformer based on a predetermined chunk size [page 5: p02 (Constrain receptive fields of the self-attention layer within a local window with size w.)]; and communicate the set of labels [page 4: 3.1 (Output predictions are time-indexed distributions per frame.)].
For a redundant teaching in the same field of endeavor, Shazeer et al further teaches applying a local-chunk attention mechanism to generate queries, keys, and values [p0077-p0081]. Therefore, given Shazeer et al’s clearly defined block/chunk local attention with explicit transformer mechanics for queries, keys, and values, and Yi et al’s encoder structure, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of the two references to introduce a chunk/block-restricted attention mechanism for reducing computational cost and improving overall efficiency.
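As an illustrative sketch only (not code from either cited reference, and using hypothetical names), block/chunk-local self-attention of the kind mapped to the claim can be expressed as follows, assuming a predetermined chunk size that evenly divides the sequence:

```python
import numpy as np

def chunk_local_attention(x, w_q, w_k, w_v, chunk_size):
    """Self-attention restricted to non-overlapping chunks of length chunk_size.

    x: (seq_len, d_model) input sequence; seq_len must be divisible by chunk_size.
    w_q, w_k, w_v: (d_model, d_head) projections producing queries, keys, values.
    """
    seq_len, _ = x.shape
    out = np.zeros((seq_len, w_v.shape[1]))
    for start in range(0, seq_len, chunk_size):
        chunk = x[start:start + chunk_size]            # attention is confined to this chunk
        q, k, v = chunk @ w_q, chunk @ w_k, chunk @ w_v
        scores = q @ k.T / np.sqrt(w_k.shape[1])       # scaled dot-product within the chunk
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over chunk positions only
        out[start:start + chunk_size] = weights @ v
    return out
```

Because each position attends only to positions inside its own chunk, the cost is linear in sequence length for a fixed chunk size, which is the efficiency rationale stated above.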
Regarding claim 3 (Currently Amended), the rationale applied to the rejection of claim 1 has been incorporated herein. Shazeer et al further teaches: The apparatus of claim 1, wherein the sequence-to-sequence transformer is configured to divide the input sequence into chunks; and apply attention within a chunk [p0081].
Claim 12 (Currently Amended) has been analyzed and rejected with regard to claim 1 and in accordance with Yi et al and Shazeer et al’s further teaching: At least one non-transitory machine readable medium including instructions to implement a sequence-to-sequence transformer [Shazeer: abstract], the sequence-to-sequence transformer comprising: an input-embedding layer configured to encode an input sequence to an encoded input sequence; and an encoder neural network comprising one or more encoder subnetworks including a base encoder subnetwork that accepts the encoded input sequence as an input [Yi: fig. 1 (a)], an encoder subnetwork comprising: an encoder self-attention sub-layer that is configured to: receive subnetwork input; and apply a local-chunk attention mechanism over the subnetwork input to generate queries, keys, and values, the local-chunk attention mechanism restricting an attention mechanism for a neuron of the encoder subnetwork to a subset of the subnetwork input based on a predetermined chunk size [Yi: page 5: p02; Shazeer: p0079-p0081]; and a feedforward sub-layer that is configured to: apply a transformation to the subnetwork input based on the queries, keys, and values to produce encoder subnetwork output; and transmit the encoder subnetwork output to a recipient [Yi: page 4: 3.1; Shazeer: fig. 1: 134, p0053 (same feedforward layer used in a transformer w/o encoder/decoder)].
Regarding claim 19 (Original), the rationale applied to the rejection of claim 12 has been incorporated herein. Yi et al further teaches: The at least one non-transitory machine readable medium of claim 12, wherein the input sequence comprises video frames [page 1: introduction].
4. Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) and Shazeer et al (US Pub: 2020/0342316), and further in view of Liu et al (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 08/17/2021).
Regarding claim 4 (Currently Amended), the rationale applied to the rejection of claim 1 has been incorporated herein. Yi et al in view of Shazeer et al does not teach using local attention from neighboring chunks. In the same field of endeavor, Liu et al teaches: The apparatus of claim 1, wherein an input layer also uses local attention from neighboring chunks [abstract]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to apply local attention from neighboring chunks/windows for increased flexibility.
5. Claims 5, 7, 15, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) and Shazeer et al (US Pub: 2020/0342316), and further in view of Bertasius et al (Is Space-Time Attention All You Need for Video Understanding?, 06/09/2021).
Regarding claim 5 (Currently Amended), the rationale applied to the rejection of claim 1 has been incorporated herein. Yi et al in view of Shazeer et al does not explicitly specify attention that is bi-directional with respect to time. In the same field of endeavor, Bertasius et al teaches: The apparatus of claim 1, wherein attention in the sequence-to-sequence transformer is bi-directional with respect to time for input to the sequence-to-sequence transformer [page 3: p01, p02 (t=1…F including past and future frames)]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to apply bi-directional attention for clear boundary determination based on the content/context both before and after a frame.
Regarding claim 7 (Currently Amended), the rationale applied to the rejection of claim 1 has been incorporated herein. Yi et al suggests an encoder-only structure in Table 5. In the same field of endeavor, Bertasius et al also teaches: The apparatus of claim 1, wherein the sequence-to-sequence transformer does not have a decoder [page 3: Query-Key-Value computation]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to omit the decoder, yielding a simplified implementation for labeling.
Regarding claim 15 (Currently Amended), the rationale applied to the rejection of claim 12 has been incorporated herein. Claim 15 has been analyzed and rejected with regard to claim 5.
Regarding claim 16 (Currently Amended), the rationale applied to the rejection of claim 12 has been incorporated herein. Claim 16 has been analyzed and rejected with regard to claim 7.
6. Claims 6, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) and Shazeer et al (US Pub: 2020/0342316), and further in view of Shaw et al (Self-Attention with Relative Position Representations, 04/12/2018).
Regarding claim 6 (Currently Amended), the rationale applied to the rejection of claim 1 has been incorporated herein. Yi et al in view of Shazeer et al does not explicitly specify relative position encodings. In the same field of endeavor, Shaw et al teaches: The apparatus of claim 1, wherein position encodings of the sequence-to-sequence transformer are relative with respect to self-attention calculations [page 1: introduction: p05]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to provide relative position encoding, which represents temporal distance more directly and provides additional information.
Regarding claim 17 (Original), the rationale applied to the rejection of claim 12 has been incorporated herein. Yi et al in view of Shazeer et al does not explicitly specify relative position encodings. In the same field of endeavor, Shaw et al teaches: The at least one non-transitory machine readable medium of claim 12, wherein, to encode the input sequence, the input-embedding layer is configured to apply relative positional encoding to the input sequence [page 1: introduction: p05]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to provide relative position encoding, which represents temporal distance more directly and provides additional information.
Regarding claim 18 (Original), the rationale applied to the rejection of claim 17 has been incorporated herein. Shaw et al further teaches: The at least one non-transitory machine readable medium of claim 17, wherein the relative positional encodings are bi-directional [page 3: 3.2 (j-i can be positive or negative)].
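As an illustrative sketch of the bi-directional relative positions described in Shaw et al section 3.2 (the function name and clipping distance here are hypothetical), the offset j-i is signed, clipped to a maximum distance, and shifted to a non-negative index suitable for an embedding lookup:

```python
import numpy as np

def relative_position_ids(seq_len, max_dist):
    """Bi-directional relative positions: offset j - i is clipped to
    [-max_dist, max_dist], then shifted by max_dist so indices lie in
    [0, 2 * max_dist] for an embedding-table lookup."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    offsets = np.clip(j - i, -max_dist, max_dist)  # negative = past, positive = future
    return offsets + max_dist
```

The signed offset is what makes the encoding bi-directional: a key one step in the past and one step in the future receive distinct indices.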
7. Claims 8, 10, 11, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) and Shazeer et al (US Pub: 2020/0342316), and further in view of Camgoz et al (Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation, 03/30/2020).
Regarding claim 8 (Currently Amended), the rationale applied to the rejection of claim 1 has been incorporated herein. Yi et al in view of Shazeer et al does not exemplify a sign language scenario. In the same field of endeavor, Camgoz et al teaches: The apparatus of claim 1, wherein the activities are gestures by a human being [abstract]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to use a transformer for sign language recognition to better handle variable-length signs.
Regarding claim 10 (Original), the rationale applied to the rejection of claim 8 has been incorporated herein. Camgoz et al further teaches: The apparatus of claim 8, wherein the activities are signs in a sign language [abstract].
Regarding claim 11 (Currently Amended), the rationale applied to the rejection of claim 10 has been incorporated herein. Camgoz et al further teaches: The apparatus of claim 10, wherein members of the set of labels are glosses for the sign language [abstract].
Regarding claim 21 (New), the rationale applied to the rejection of claim 19 has been incorporated herein. Claim 21 has been analyzed and rejected with regard to claim 10.
8. Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021), Shazeer et al (US Pub: 2020/0342316), and Camgoz et al (Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation, 03/30/2020), and further in view of Plizzari et al (Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks, 06/22/2021).
Regarding claim 9 (Currently Amended), the rationale applied to the rejection of claim 8 has been incorporated herein. Yi et al in view of Shazeer et al and Camgoz et al does not teach extracting skeletal key points. In the same field of endeavor, Plizzari et al teaches: The apparatus of claim 8, wherein the processing circuitry is configured to: model a pose by the human being; extract skeletal key points from the pose; and provide the skeletal key points of the pose as input to the sequence-to-sequence transformer [Abstract, page 2: p02, page 3: 3.1]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to provide skeletal key points as input for sign language recognition as a matter of design choice.
9. Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) and Shazeer et al (US Pub: 2020/0342316), and further in view of Vaswani et al (Scaling Local Self-Attention for Parameter Efficient Visual Backbones, 06/07/2021).
Regarding claim 13 (Currently Amended), the rationale applied to the rejection of claim 12 has been incorporated herein. Yi et al in view of Shazeer et al does not disclose expanding the attention mechanism to include adjacent input. In the same field of endeavor, Vaswani et al teaches: The at least one non-transitory machine readable medium of claim 12, wherein the encoder self-attention sub-layer for the base encoder subnetwork is configured to expand the attention mechanism to include a portion of the subnetwork input that is adjacent to the subset of the subnetwork input [page 3: fig. 1]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to apply local attention to a neighboring portion of the input for increased flexibility.
Regarding claim 14 (Currently Amended), the rationale applied to the rejection of claim 13 has been incorporated herein. Yi et al and Shazeer et al further teach: The at least one non-transitory machine readable medium of claim 13, wherein the portion of the subnetwork input is a predetermined fixed number of elements of the subnetwork input [Shazeer: p0081; Yi: page 5: p02].
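As an illustrative sketch of the expansion addressed in claims 13 and 14 (hypothetical names; not code from any cited reference), a chunk's attention window can be widened by a predetermined fixed number of adjacent elements on each side, expressed here as an attention mask:

```python
import numpy as np

def expanded_chunk_attention_mask(seq_len, chunk_size, overlap):
    """Boolean mask: position i may attend to position j iff j lies within
    i's chunk expanded by a fixed number (overlap) of adjacent elements
    on each side, clipped at the sequence boundaries."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        start = (i // chunk_size) * chunk_size          # start of i's chunk
        lo = max(0, start - overlap)                    # expand left by `overlap`
        hi = min(seq_len, start + chunk_size + overlap)  # expand right by `overlap`
        mask[i, lo:hi] = True
    return mask
```

With overlap set to a predetermined fixed number of elements, this matches the claim 14 limitation that the adjacent portion has a fixed size.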
10. Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Yi et al (ASFORMER: TRANSFORMER FOR ACTION SEGMENTATION; 10/16/2021) and Shazeer et al (US Pub: 2020/0342316), and further in view of Dzabraev et al (MDMMT: Multidomain Multimodal Transformer for Video Retrieval, 03/19/2021).
Regarding claim 20 (Currently Amended), the rationale applied to the rejection of claim 19 has been incorporated herein. Yi et al in view of Shazeer et al does not explicitly equate the chunk size to frames per second. In the same field of endeavor, Dzabraev et al teaches: The at least one non-transitory machine readable medium of claim 19, wherein the predetermined chunk size is a number of the video frames that are equivalent to a second [page 11: A Pretrain experts usage]. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings to set the chunk size based on the frames per second of the video for a clear division, as a matter of design choice.
Contact
11. Any inquiry concerning this communication or earlier communications from the examiner should be directed to FAN ZHANG whose telephone number is (571)270-3751. The examiner can normally be reached on Mon-Fri 9:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benny Tieu can be reached on 571-272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fan Zhang/
Patent Examiner, Art Unit 2682