DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 19 and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 19 recites “the one or more processors” in line 9. There is insufficient antecedent basis for this limitation in the claim.
Claim 20 inherits the same deficiency as it depends on claim 19.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claim(s) 1, 4, 7, 11-14 and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Bain et al. (“Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval”).
Regarding claim 1, Bain et al. discloses a computer system for performing video processing tasks with improved computational efficiency, the computer system comprising:
one or more processors (processor of implied computer); and
one or more non-transitory computer-readable media (memory of implied computer) that collectively store:
a machine-learned model (“We build on this flexibility by designing a curriculum learning schedule that begins with images and then gradually learns to attend to increasing temporal context when trained on video datasets through temporal embedding interpolation” at section 1, last paragraph, line 9) comprising:
a video kernel configured to be applied to a plurality of data samples from a set of video data to respectively generate a plurality of video tokens (“The patches x ∈ R^{M×N×3×P×P} are fed through a 2D convolutional layer and the output is flattened, forming a sequence of embeddings z ∈ R^{MN×D} for input to the transformer, where D depends on the number of kernels in the convolutional layer” at section 3.1, Transformer input, line 1), wherein each data sample comprises at least a portion of multiple image frames included in the set of video data (“Given a video containing L frames, we subdivide it into M equal segments where M is the desired number of frames for the video encoder” at section 3.2, last paragraph, line 1); and
a visual transformer configured to process the plurality of video tokens to generate a model output (“Learned temporal and spatial positional embeddings, E^s ∈ R^{N×D}, E^t ∈ R^{M×D}, are added to each input token: z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m, (1) such that all patches within a given frame m (but different spatial locations) are given the same temporal positional embedding E^t_m, and all patches in the same spatial location (but different frames) are given the same spatial positional embedding E^s_p. Thus enabling the model to ascertain the temporal and spatial position of patches” at section 3.1, Transformer input, paragraph 2, line 1); and
instructions that, when executed by the one or more processors, cause the computer system to perform operations, the operations comprising:
processing the set of video data with the machine-learned model to generate the model output (“The video sequence is fed into a stack of space-time transformer blocks. We make a minor modification to the Divided Space-Time attention introduced by [6], by replacing the residual connection between the block input and the temporal attention output with a residual connection between the block input and the spatial attention output, see the appendix for details. Each block sequentially performs temporal self-attention and then spatial self-attention on the output of the previous block. The video clip embedding is obtained from the [CLS] token of the final block.” at section 3.1, Space-time self-attention blocks, line 1);
wherein processing the set of video data with the machine-learned model comprises sparsely applying the video kernel to the set of video data (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples a subset of the total number of image frames, thereby making it sparser than sampling every frame).
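For illustration only, the stride-based sampling cited above can be sketched as follows; this is a minimal hypothetical sketch of the quoted passage, not code from Bain et al., and all names are invented:

```python
# Hypothetical sketch of the cited test-time sampling: a video of L frames
# is divided into M equal segments, and the frame at offset i is taken from
# each segment; offsets advance by a stride S, so only a sparse subset of
# the frames is ever sampled.
def sample_frame_offsets(segment_length: int, stride: int) -> list[int]:
    """Offsets i = 0, S, 2S, ... that fit within one segment."""
    return list(range(0, segment_length, stride))

def sparse_frame_indices(num_frames: int, num_segments: int, offset: int) -> list[int]:
    """Index of the frame at position `offset` inside each of the segments."""
    segment_length = num_frames // num_segments
    return [m * segment_length + offset for m in range(num_segments)]

# A 32-frame video, M = 4 segments, offset i = 0:
print(sparse_frame_indices(32, 4, 0))  # [0, 8, 16, 24]
```

Sampling 4 of 32 frames in this way is the sparseness the rejection relies on: the kernel is applied only to the sampled frames rather than to every frame.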
Regarding claim 4, Bain et al. discloses a system wherein sparsely applying the video kernel to the set of video data comprises directly applying the video kernel to pixel values included in the set of video data (“(i) we propose a new end-to-end model for video retrieval that does not rely on ‘expert’ features, but instead, inspired by [6] employs a transformer architecture with a modified divided spacetime attention applied directly to pixels” at section 1, last paragraph, line 1).
Regarding claim 7, Bain et al. discloses a system wherein sparsely applying the video kernel to the set of video data comprises applying the video kernel starting at a predefined offset point that differs from an origin point of the set of video data (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples frames offset from the beginning frame).
Regarding claim 11, Bain et al. discloses a system wherein the machine-learned model comprises a pre-trained vision encoder (“We jointly pretrain our model on image and video data” at section 4.1, line 1) that has been fine-tuned using a set of video training data (“large motivation for using preextracted expert models for video retrieval is to save computational cost. Finetuning our 4-frame model for 50 epochs on MSR-VTT takes 10 hours on 2 Quadro RTX 6000k GPUs (with 24GB RAM each), which is similar to other works using pre-extracted expert features” at section 4.3, last paragraph).
Regarding claim 12, Bain et al. discloses a system wherein the model output comprises a video classification output (the output constitutes a characterization of the input video; see also the Table 4 description, which discusses classification uses).
Regarding claim 13, Bain et al. discloses a computer-implemented method, the method comprising:
obtaining, by a computing system comprising one or more computing devices, a set of video data and a video label (“DiDeMo [3] contains 10K Flickr videos annotated with 40K sentences” at section 4.2, line 10);
processing, by the computing system, the set of video data with a machine-learned model to generate the model output (“We build on this flexibility by designing a curriculum learning schedule that begins with images and then gradually learns to attend to increasing temporal context when trained on video datasets through temporal embedding interpolation” at section 1, last paragraph, line 9), wherein processing the set of video data with the machine-learned model comprises:
sparsely applying, by the computing system, a video kernel of the machine-learned model to the set of video data to generate a plurality of video tokens (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples a subset of the total number of image frames, thereby making it sparser than sampling every frame), the video kernel having a temporal dimension size of greater than one (“Given a video containing L frames, we subdivide it into M equal segments where M is the desired number of frames for the video encoder” at section 3.2, last paragraph, line 1); and
processing, by the computing system, the plurality of video tokens with a visual transformer of the machine-learned model to generate the model output (“Learned temporal and spatial positional embeddings, E^s ∈ R^{N×D}, E^t ∈ R^{M×D}, are added to each input token: z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m, (1) such that all patches within a given frame m (but different spatial locations) are given the same temporal positional embedding E^t_m, and all patches in the same spatial location (but different frames) are given the same spatial positional embedding E^s_p. Thus enabling the model to ascertain the temporal and spatial position of patches” at section 3.1, Transformer input, paragraph 2, line 1);
evaluating, by the computing system, a loss function that generates a loss value based on the model output and the video label (“We minimise the sum of two losses, video-to-text and text-to-video” at section 3.2, line 4); and
modifying, by the computing system, one or more values of one or more parameters of the machine-learned model based on the loss function (parameters of the learner are adjusted during minimization of the training loss).
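For illustration, Eq. (1) as quoted in the mappings above, which adds a per-frame temporal embedding and a per-location spatial embedding to every patch token, can be sketched as follows; the shapes and values are hypothetical, and this is not code from Bain et al.:

```python
import numpy as np

# Hypothetical sketch of Eq. (1): z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m.
# M frames, N patches per frame, embedding dimension D (illustrative sizes).
M, N, D = 4, 16, 8
rng = np.random.default_rng(0)
z = rng.standard_normal((M, N, D))    # patch tokens z_{p,m}
E_t = rng.standard_normal((M, D))     # temporal embeddings E^t, one per frame m
E_s = rng.standard_normal((N, D))     # spatial embeddings E^s, one per location p
# Broadcasting gives every patch of frame m the same E^t_m, and every patch
# at spatial location p (across all frames) the same E^s_p.
z0 = z + E_t[:, None, :] + E_s[None, :, :]
assert z0.shape == (M, N, D)
```

This matches the quoted description: patches sharing a frame share a temporal embedding, and patches sharing a spatial location share a spatial embedding.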
Regarding claim 14, Bain et al. discloses a method wherein modifying, by the computing system, the one or more values of the one or more parameters of the machine-learned model based on the loss function comprises updating parameter values of the video kernel based on the loss function (parameters of the learner are adjusted during minimization of the training loss).
Regarding claim 19, Bain et al. discloses one or more non-transitory computer-readable media (memory of implied computer) that collectively store:
a machine-learned model (“We build on this flexibility by designing a curriculum learning schedule that begins with images and then gradually learns to attend to increasing temporal context when trained on video datasets through temporal embedding interpolation” at section 1, last paragraph, line 9) comprising:
a video kernel configured to be applied to a plurality of data samples from a set of video data to respectively generate a plurality of video tokens (“The patches x ∈ R^{M×N×3×P×P} are fed through a 2D convolutional layer and the output is flattened, forming a sequence of embeddings z ∈ R^{MN×D} for input to the transformer, where D depends on the number of kernels in the convolutional layer” at section 3.1, Transformer input, line 1), wherein each data sample comprises at least a portion of multiple image frames included in the set of video data (“Given a video containing L frames, we subdivide it into M equal segments where M is the desired number of frames for the video encoder” at section 3.2, last paragraph, line 1); and
a visual transformer configured to process the plurality of video tokens to generate a model output (“Learned temporal and spatial positional embeddings, E^s ∈ R^{N×D}, E^t ∈ R^{M×D}, are added to each input token: z^{(0)}_{p,m} = z_{p,m} + E^s_p + E^t_m, (1) such that all patches within a given frame m (but different spatial locations) are given the same temporal positional embedding E^t_m, and all patches in the same spatial location (but different frames) are given the same spatial positional embedding E^s_p. Thus enabling the model to ascertain the temporal and spatial position of patches” at section 3.1, Transformer input, paragraph 2, line 1); and
instructions that, when executed by the one or more processors, cause the computer system to perform operations, the operations comprising:
processing the set of video data with the machine-learned model to generate the model output (“The video sequence is fed into a stack of space-time transformer blocks. We make a minor modification to the Divided Space-Time attention introduced by [6], by replacing the residual connection between the block input and the temporal attention output with a residual connection between the block input and the spatial attention output, see the appendix for details. Each block sequentially performs temporal self-attention and then spatial self-attention on the output of the previous block. The video clip embedding is obtained from the [CLS] token of the final block.” at section 3.1, Space-time self-attention blocks, line 1);
wherein processing the set of video data with the machine-learned model comprises sparsely applying the video kernel to the set of video data (“At test time, we sample the ith frame in every segment, to get a video embedding vi. The values for i are determined using a stride S, resulting in an array of video embeddings v = [v0, vS, v2S, vM]” at section 3.2, last paragraph, line 5; the stride therefore samples a subset of the total number of image frames, thereby making it sparser than sampling every frame).
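For illustration, the patch-to-token step quoted in the mappings above (a 2D convolution over M×N patches, flattened into an MN×D token sequence) can be sketched as follows; a plain linear map stands in for the convolutional layer, and all shapes are hypothetical, not taken from Bain et al.:

```python
import numpy as np

# Hypothetical sketch: M×N patches of size 3×P×P are projected to D
# dimensions and flattened into a sequence z of MN tokens (a linear map
# stands in for the cited 2D convolutional layer, whose kernel count sets D).
M, N, P, D = 2, 4, 16, 32
rng = np.random.default_rng(0)
patches = rng.standard_normal((M, N, 3, P, P))   # x ∈ R^{M×N×3×P×P}
W = rng.standard_normal((3 * P * P, D))          # projection to D dimensions
z = patches.reshape(M * N, 3 * P * P) @ W        # token sequence z ∈ R^{MN×D}
assert z.shape == (M * N, D)
```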
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bain et al. and Wang et al. (“Transformers Meet Visual Learning Understanding: A Comprehensive Review”).
Bain et al. discloses a system as described in claim 1 above.
Bain et al. does not explicitly disclose generating a plurality of fixed sine positional embeddings respectively for the plurality of video tokens, wherein the fixed sine positional embedding for each token indicates a center of the video kernel relative to the set of video data.
Wang et al. teaches a system in the same field of endeavor, wherein processing the set of video data with the machine-learned model further comprises generating a plurality of fixed sine positional embeddings respectively for the plurality of video tokens, wherein the fixed sine positional embedding for each token indicates a center of the video kernel relative to the set of video data (“In Transformer, sine and cosine functions are mainly used for position encoding. The specific coding method is formulated as Eq. 5. PE(pos, 2i) = sin(pos/10000^{2i/d_model}); PE(pos, 2i+1) = cos(pos/10000^{2i/d_model}); (5) where pos represents the position, and i means the dimension. Each dimension of the position code corresponds to a sine curve.” at section III.C).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to utilize the position encoding as taught by Wang et al. for the embedding of Bain et al. as a way to represent the patch locations.
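For illustration, the sine/cosine encoding of Eq. (5) as quoted above can be sketched as follows (a minimal hypothetical sketch, not code from Wang et al.):

```python
import math

# Hypothetical sketch of Eq. (5): PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
# and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
def positional_encoding(pos: int, d_model: int) -> list[float]:
    pe = [0.0] * d_model
    for dim in range(0, d_model, 2):          # dim plays the role of 2i
        angle = pos / (10000 ** (dim / d_model))
        pe[dim] = math.sin(angle)
        if dim + 1 < d_model:
            pe[dim + 1] = math.cos(angle)
    return pe

# At pos = 0 the sine entries are 0 and the cosine entries are 1:
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

These embeddings contain no learned parameters, which is the "fixed" property the rejection maps to this teaching.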
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bain et al. and Liu et al. (“TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval”).
Bain et al. discloses a system as described in claim 1 above.
Bain et al. does not explicitly disclose that each of the plurality of data samples comprises data for only a subset of a number of channels in a channel dimension of the set of video data, and wherein at least one of the plurality of tokens is generated by concatenation along a channel dimension for two temporally-displaced data samples.
Liu et al. teaches a system in the same field of endeavor of transformer-based classification, wherein each of the plurality of data samples comprises data for only a subset of a number of channels in a channel dimension of the set of video data, and wherein at least one of the plurality of tokens is generated by concatenation along a channel dimension for two temporally-displaced data samples (“In this work, we propose the token selection transformer by inserting a token selection module, which aims to select informative tokens per frame, especially those tokens containing salient semantics of objects, for video feature aggregation. As shown in Fig. 4, top-K informative tokens are selected via the trainable token selection module every frame. The input of the token selection module is a sequence of tokens of each frame I = {p_cls, p_0, p_1, . . . , p_{n−1}} ∈ R^{(N+1)×C}. We first apply an MLP over I for channel dimension reduction and output I′ = {p′_cls, p′_0, p′_1, . . . , p′_{n−1}} ∈ R^{(N+1)×C/2}. We then use p′_cls as a global frame feature and concatenate it with each local token p′_i, p̂_i = [p′_cls, p′_i], 0 ≤ i < N. We finally feed all the concatenated token features to another MLP followed by a Softmax layer to predict the importance scores” at section 3.2, paragraph 2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate token selection as taught by Liu et al. in the system of Bain et al. to avoid token redundancy and preserve the most relevant information (see Liu et al. at section 3.2).
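For illustration, the token selection module quoted above (channel reduction, concatenation of the global cls feature with each local token, and softmax scoring for top-K selection) can be sketched as follows; the weights are random and all names are hypothetical, not code from Liu et al.:

```python
import numpy as np

# Hypothetical sketch of the cited token selection: reduce channels, pair
# the global cls feature with each local token along the channel dimension,
# score the pairs with a softmax, and keep the top-K tokens.
rng = np.random.default_rng(0)
N, C, K = 6, 8, 3
tokens = rng.standard_normal((N + 1, C))         # [p_cls, p_0, ..., p_{N-1}]
W_reduce = rng.standard_normal((C, C // 2))      # stands in for the first MLP
reduced = tokens @ W_reduce                      # channel dimension reduction
cls_feat, local = reduced[0], reduced[1:]
# Concatenate p'_cls with each local token p'_i along the channel dimension.
paired = np.concatenate(
    [np.broadcast_to(cls_feat, local.shape), local], axis=1
)                                                # shape (N, C)
W_score = rng.standard_normal((C, 1))            # stands in for the scoring MLP
logits = (paired @ W_score).ravel()
scores = np.exp(logits) / np.exp(logits).sum()   # softmax importance scores
top_k = np.argsort(scores)[-K:]                  # K most informative tokens
assert top_k.shape == (K,)
```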
Claim(s) 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Bain et al. and Bao et al. (“BEIT: BERT Pre-Training of Image Transformers”).
Regarding claim 15, Bain et al. discloses a system as described in claim 1 above.
Bain et al. does not explicitly disclose importing the video kernel to a larger pre-trained image transformer.
Bao et al. teaches a method in the same field of endeavor of transformer-based image classification, comprising importing the video kernel to a larger pre-trained image transformer (“Overview of BEIT pre-training. Before pre-training, we learn an “image tokenizer” via autoencoding-style reconstruction, where an image is tokenized into discrete visual tokens according to the learned vocabulary. During pre-training, each image has two views, i.e., image patches, and visual tokens. We randomly mask some proportion of image patches (gray patches in the figure) and replace them with a special mask embedding [M]. Then the patches are fed to a backbone vision Transformer” at Figure 1 description).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to incorporate the kernels of Bain et al. into a pre-trained transformer as taught by Bao et al. to speed up the overall training of the system.
Regarding claim 16, the Bain et al. and Bao et al. combination discloses a method wherein modifying, by the computing system, the one or more values of the one or more parameters of the machine-learned model based on the loss function comprises finetuning one or more layers of a pre-trained image transformer while holding one or more other layers of the pre-trained image transformer fixed (“After pre-training BEIT, we append a task layer upon the Transformer, and fine-tune the parameters on downstream tasks, like BERT” Bao et al. at section 2.6, line 1).
Allowable Subject Matter
Claims 2, 3, 5, 6, 8, 17 and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim 20 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: the prior art does not disclose applying the video kernel with a spatial stride greater than the spatial dimension size of the video kernel to achieve spatial sparseness as required by claim 2; applying the video kernel with a temporal stride greater than the temporal dimension size of the video kernel to achieve temporal sparseness as required by claim 3; one or more image kernels configured to be applied to an individual image frame of the set of video data to generate a plurality of image tokens from the individual image frame as required by claim 5; a second kernel configured to be applied to a second set of data samples from the set of video data wherein at least one of the second set of data samples is overlapping with at least one of the plurality of data samples to which the video kernel is applied as required by claim 8; one or more image kernels configured to be applied to an individual image frame of the set of video data to generate a plurality of image tokens from the individual image frame, and wherein the machine-learned model comprises a single visual transformer configured to jointly process both the plurality of video tokens and the plurality of image tokens to generate the model output as required by claim 17; applying the video kernel with a spatial stride greater than the spatial dimension size of the video kernel to achieve spatial sparseness; or applying the video kernel with a temporal stride greater than the temporal dimension size of the video kernel to achieve temporal sparseness as required by claim 20.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATRINA R FUJITA whose telephone number is (571) 270-1574. The examiner can normally be reached Monday - Friday, 9:30 am - 5:30 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/KATRINA R FUJITA/Primary Examiner, Art Unit 2672