Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Prior to the issuance of this Office Action, a telephone interview was held with the applicants’ representative, Mark Hagler, to discuss proposed amendments toward compact prosecution of the application. However, no agreement was reached, and it was agreed that an Office Action would be issued.
Claim Objections
Claims 8 and 18 are objected to because of the following informalities:
In Claim 8, line 8, “a combined patch” appears to have been intended as “the combined patch,” in order to avoid any possible 35 U.S.C. 112 issues (see the recitation in the independent claim). The same objection applies to Claim 18.
Appropriate correction is required.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-2 and 11-12 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yao, “Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning”, 2022, cited in applicant’s IDS.
Regarding Claim 1, Yao teaches:
A method comprising: receiving digital data (pp. 329-330; “we propose Wavelets block to perform invertible down-sampling through wavelet transforms, aiming to preserve the original image details for self-attention learning while reducing computational cost. Wavelet transform is a fundamental time-frequency analysis method that decomposes input signals into different frequency subbands to address the aliasing problem”. The input signals/image are the digital data and performing wavelet transform on the data indicates the data is in a digital form);
generating a plurality of embedded patches using the digital data (p. 334: “Next, this down-sampled feature map…is linearly transformed into down-sampled keys…and values…where m…is the number of patches”. And, p. 335: “given the input image (size: 224 × 224), the entire architecture of Wave-ViT consists of four stages, and each stage is comprised of a patch embedding layer, and a stack of Wavelets blocks followed by feed-forward layers”);
generating a transformed patch by applying a transformer block to an embedded patch of the plurality of embedded patches (Fig. 2(c) and p. 333: “let X… be the input 2D feature map, where H/W/D denote the height/width/channel number, respectively. Here X can be reshaped as a patch sequence, consisting of n = H × W image patches and the dimension of each patch is D. We linearly transform the input patch sequence X into three groups in parallel”. And p. 334: “Such invertible down-sampling is seamlessly incorporated into the typical self-attention block, pursuing efficient multi-head self-attention learning with lossless down-sampling”. That is, each patch is linearly embedded and transformed through a self-attention block);
creating a plurality of filtered patches for the transformed patch by applying a plurality of wavelet filters to the transformed patch (Table 1, Fig. 2(c) and p. 335: “Specifically, given the input image (size: 224 × 224), the entire architecture of Wave-ViT consists of four stages, and each stage is comprised of a patch embedding layer, and a stack of Wavelets blocks followed by feed-forward layers”. The Wavelets blocks constituting the wavelet filters);
creating a combined patch by combining the plurality of filtered patches (p. 331: “Swin Transformer [35] upgrades ViT by constructing hierarchical feature maps via merging image patches in deeper layers”); generating a set of training data using the combined patch (p. 336: “We evaluate the effectiveness of Wave-ViT via various empirical evidence on several mainstream CV tasks. Concretely, we consider the following evaluations to compare the quality of learnt features obtained from various vision backbones: (a) Training from scratch for image recognition task on ImageNet1K”. That is Wave-ViT is used in generating training data);
and generating a trained prediction model by applying a prediction model to the set of training data (p. 338: “we follow the standard setups in [35] to train all models on the COCO train2017 (∼118K images)”. And, p. 339: “To further verify the generalizability of the pre-trained multi-scale features via Wave-ViT for object detection, we evaluate various pre-trained vision backbones on four state-of-the-arts detectors”).
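For the applicant’s convenience, the patch-dividing, wavelet-filtering, and combining steps as mapped above may be sketched numerically. This is a purely illustrative, hypothetical sketch assuming 2×2 Haar analysis filters as the plurality of wavelet filters; the function names, shapes, and toy data are the Examiner’s assumptions and are not drawn from Yao’s implementation or the claims.

```python
# Illustrative sketch (not Yao's code): dividing toy 8x8 "digital data" into
# patches, filtering one patch with four 2x2 Haar wavelet filters, and
# combining the filtered subbands into a single stacked "combined patch".
import numpy as np

def to_patches(img, p):
    """Divide the digital data into non-overlapping p x p patches."""
    h, w = img.shape
    return [img[i:i+p, j:j+p] for i in range(0, h, p) for j in range(0, w, p)]

# 2x2 Haar analysis filters: low-low, low-high, high-low, high-high subbands.
HAAR = {
    "LL": np.array([[1.0, 1.0], [1.0, 1.0]]) / 2.0,
    "LH": np.array([[1.0, 1.0], [-1.0, -1.0]]) / 2.0,
    "HL": np.array([[1.0, -1.0], [1.0, -1.0]]) / 2.0,
    "HH": np.array([[1.0, -1.0], [-1.0, 1.0]]) / 2.0,
}

def wavelet_filter(patch, filt):
    """Stride-2 filtering: down-samples the patch by 2 in each dimension."""
    h, w = patch.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = np.sum(patch[i:i+2, j:j+2] * filt)
    return out

img = np.arange(64, dtype=float).reshape(8, 8)        # toy digital data
patches = to_patches(img, 4)                          # plurality of patches
filtered = [wavelet_filter(patches[0], f) for f in HAAR.values()]
combined = np.stack(filtered)                         # combined patch: 4 subbands
```

Note that the four Haar subbands together are invertible: each original sample is recoverable from the four filtered values at the corresponding position, which mirrors the “invertible down-sampling” Yao attributes to wavelet transforms.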
Regarding Claim 2, Yao further teaches:
The method of claim 1, wherein generating the plurality of embedded patches comprises: dividing the digital data into a plurality of patches (p. 331: “Different from ViT that solely divides input image into patches, TNT [19] further decomposes patches into sub-patches as “visual words””);
and generating the plurality of embedded patches by combining the plurality of patches with a plurality of positional embeddings for the plurality of patches (p. 331: “Swin Transformer [35] upgrades ViT by constructing hierarchical feature maps via merging image patches in deeper layers”. See also the applicants’ provided NPL of Chen, for example subsection 3.1: “Patch embedding module divides images into patches with fixed size and positions, and embeds each of them with a linear layer”).
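For illustration only, the claimed combination of patches with positional embeddings (cf. Chen, subsection 3.1) may be sketched as follows. This is hypothetical code: the random linear projection and per-position embeddings stand in for learned parameters and are not drawn from Yao or Chen.

```python
# Illustrative sketch: flatten each p x p patch, linearly embed it to
# dimension d, and add a positional embedding per patch position.
import numpy as np

rng = np.random.default_rng(0)

def embed_patches(img, p, d):
    """Return the plurality of embedded patches for one 2D input."""
    h, w = img.shape
    patches = [img[i:i+p, j:j+p].ravel()
               for i in range(0, h, p) for j in range(0, w, p)]
    W = rng.standard_normal((p * p, d))           # linear embedding weights
    pos = rng.standard_normal((len(patches), d))  # one embedding per position
    return np.stack([x @ W for x in patches]) + pos

emb = embed_patches(np.ones((8, 8)), 4, 16)       # 4 patches, embed dim 16
```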
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 3-5, 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Yao, “Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning”, 2022, in applicant’s IDS, in view of Chen, “DPT: Deformable Patch-based Transformer for Visual Recognition”, 2021, in applicant’s IDS.
Regarding Claim 3, with Yao teaching those limitations of the claim as previously pointed out, to the extent Yao may not have taught the following, Chen shows:
The method of claim 2, wherein the digital data comprises metadata and dividing the digital data into the plurality of patches uses the metadata to determine a patch size of a patch of the plurality of patches (Abstract: “we propose a new Deformable Patch (DePatch) module which learns to adaptively split the images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches”. And, subsection 3.1: “Patch embedding module divides images into patches with fixed size and positions, and embeds each of them with a linear layer”. And, subsection 3.2: “we turn the location and the scale of each patch into predicted parameters based on input contents”, input contents being the metadata).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the teachings of Chen with that of Yao wherein the digital data comprises metadata and dividing the digital data into the plurality of patches uses the metadata to determine a patch size of a patch of the plurality of patches.
The ordinary artisan would have been motivated to modify Yao in the manner set forth above for the purposes of adaptively learning to split images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches [Chen: Abstract].
Regarding Claim 4, with Yao teaching those limitations of the claim as previously pointed out, Chen further teaches:
The method of claim 3, wherein the metadata comprises a data type comprising at least one of text, audio, image, or video and wherein dividing the digital data into the plurality of patches uses the data type to determine the patch size (Abstract: “Existing methods usually use a fixed-size patch embedding which might destroy the semantics of objects. To address this problem, we propose a new Deformable Patch (DePatch) module which learns to adaptively split the images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches”).
Regarding Claim 5, Yao further teaches:
The method of claim 1, further comprising: generating the plurality of wavelet filters, wherein a wavelet filter of the plurality of wavelet filters comprises a first filter dimension and a second filter dimension, the transformed patch comprises a first patch dimension and a second patch dimension (p. 333: “X can be reshaped as a patch sequence, consisting of n = H × W image patches and the dimension of each patch is D”),
And Chen further teaches:
and a size of the second filter dimension is less than a size of the first patch dimension (Figure 2; subsections 3.1 and 3.2 and 3.3: “DePatch is a self-adaptive module to change positions and scales of patches”. The deformability of DePatch allows variable patch and filter dimensions).
Claims 10, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yao, “Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning”, 2022, in applicant’s IDS, in view of Zouridakis, US 2014/0036054 A1.
Regarding Claim 10, Yao teaches:
The method of claim 1,
the method further comprising: applying the trained prediction model to a set of execution data; and determining, by the trained prediction model, an output based on the set of execution data (p. 338: “we examine the pre-trained Wave-ViT’s behavior on COCO for two tasks that localize objects ranging from bounding-box level to pixel level, i.e., object detection and instance segmentation. Two mainstream detectors, i.e., RetinaNet [31] and Mask R-CNN [20], are employed for each downstream task, and we replace the CNN backbones in each detector with our Wave-ViT for evaluation. Specifically, each vision backbone is first pre-trained over ImageNet1K, and the newly added layers are initialized with Xavier [17]. Next, we follow the standard setups in [35] to train all models on the COCO train2017 (∼118K images). Here the batch size is set as 16, and AdamW [39] is utilized for optimization (weight decay: 0.05, initial learning rate: 0.0001). All models are finally evaluated on the COCO val2017 (5K images). For the downstream task of object detection, we report the Average Precision (AP) at different IoU thresholds and for three different object sizes”),
While Yao may not have explicitly taught “classifiers” in the following, Zouridakis shows:
wherein the digital data is identified by one or more input classifiers and generating the set of training data further uses the one or more input classifiers,
wherein the output comprises one or more output classifiers identifying the set of execution data (paragraphs 15, 50, 103, 106-107, 114: “apply a classifier to the remaining pixels to classify them as object or foreground”. And, “the evaluation uses transductive inference (19), i.e., in this classifier learning method labeled training data and unlabeled testing data were employed”. And, “For each sampling strategy, square patches of size 16, 24, and 32 pixels are extracted, and the classifiers obtained from these three scales are grouped”).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to use the teachings of Zouridakis with that of Yao for using input and output classifiers.
The ordinary artisan would have been motivated to modify Yao in the manner set forth above for the purposes of using classifiers to classify objects [Zouridakis: paragraph 57].
Claims 6, 9, 16 and 19 (and subsequently their dependent claims) are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claims 11-15 are similar to Claims 1-5 and are rejected under the same rationale as stated above for those claims.
Claim 20 is a combination of Claims 1 and 10 and is rejected under the same rationale as stated above for those claims.
Examiner’s Note:
The Examiner cites particular pages, sections, columns, line numbers, and/or paragraphs in the references as applied to the claims above for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claims, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider each reference in its entirety as potentially teaching all or part of the claimed invention, as well as the context of the cited passages and the additional related prior art made of record, which is considered pertinent to applicant’s disclosure and further shows the general state of the art. The Examiner’s interpretations in parentheses are provided with the cited references to assist the applicant in understanding how the Examiner interprets the prior art to read on the claims. Such comments are entirely consistent with the intent and spirit of compact prosecution.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892 for the relevant prior art where for example Luo, US 2020/0349411 A1, teaches invertible wavelet transformation of layers of a neural network.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVE MISIR whose telephone number is (571)272-5243. The examiner can normally be reached M-R 8-5 pm, F some hours.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached at (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVE MISIR/Primary Examiner, Art Unit 2127