DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant's claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. 18419607, filed on 01/23/2024.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 01/23/2024, 02/26/2024, and 06/28/2024 are in compliance with the provisions of 37 CFR 1.97. Accordingly, they are being considered by the examiner.
Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, the claim recites “generate a feature amount”. The claim fails to sufficiently describe what a “feature amount” is, rendering the claim indefinite. In view of the application as a whole, the feature amount is being interpreted as a collection of tokens generated according to an input.
Further regarding claim 1, claim 1 recites “irregularly mix a plurality of tokens included in the feature amount in a spatial direction of the generated feature amount.” It appears that “the feature amount” and “the generated feature amount” are referring to the same element, and they should therefore be referred to consistently.
Further regarding claim 1, claim 1 recites “irregularly mix a plurality of tokens included in the feature amount in a spatial direction of the generated feature amount.” The limitation “in a spatial direction of the generated feature amount” is indefinite because it is unclear how the generated feature amount is related to spatial directions. Accordingly, this limitation is being interpreted such that tokens are mixed with respect to their spatial relationships with one another.
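The interpretation above can be illustrated with a short sketch: tokens of a feature map are mixed with respect to their spatial positions via an irregular (non-uniform) permutation. This is purely illustrative of the claim interpretation and is not drawn from the application or the prior art of record; all names and the seed are arbitrary.

```python
import numpy as np

# Purely illustrative sketch of the interpretation: a fixed but irregular
# permutation mixes tokens with respect to their spatial positions.
rng = np.random.default_rng(0)            # seed chosen arbitrarily
feature = np.arange(6 * 4).reshape(6, 4)  # 6 spatial tokens, 4 channels each
perm = rng.permutation(feature.shape[0])  # irregular spatial ordering
mixed = feature[perm]                     # token contents are unchanged;
                                          # only spatial positions are mixed
```

Under this reading, the contents of each token are preserved; only the spatial arrangement of the tokens changes.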
Claims 12 and 16 are rejected as analogous to claim 1.
Claims 2-11 and 13-15 are rejected as dependent on rejected claims 1 and 12.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-4 and 6-16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Guo (Hire-MLP: Vision MLP via Hierarchical Rearrangement).
Regarding claim 1, Guo teaches “An information processing apparatus comprising one or more memories storing instructions and one or more processors that execute the instructions to: obtain input data; generate a feature amount from the obtained input data;” (Guo, Figures 1-2, and Section 3.3 Paragraph 1, “An overview of the Hire-MLP-Tiny architecture is shown in Figure 1, more details and other variants of HireMLP are presented in Table A.1 in supplementary materials. We adopt a pyramid-like architecture for Hire-MLP following the commonly used design of CNNs [18, 41] and vision transformers [35,52]. It first splits the input image into patches (tokens) by a patch embedding layer [51]. Then two Hire-MLP blocks referred to as “Stage 1” are applied on the tokens above. As the network gets deeper, the number of tokens is reduced by another patch embedding layer and output channels are doubled at the same time. Especially, the whole architecture contains four stages, where the feature resolution reduces from H/4 × W/4 to H/32 × W/32 and the output dimension increases accordingly. The pyramid architecture aggregates the spatial feature for extracting semantic information, which can be applied to image classification, object detection, and semantic segmentation.” Note that the input image is mapped to the input data, and the generated feature amount is mapped to the patch creation. Additionally, the Hire-MLP architecture of Figure 1 inherently requires the use of a processing apparatus with memory and a processor.)
“and irregularly mix a plurality of tokens included in the feature amount in a spatial direction of the generated feature amount.” (Guo, Figures 1-2, Page 827, Column 1 Paragraph 1, “As for the second challenge, we build the blocks of Hire-MLP based on hierarchical rearrangements and channel-mixing MLPs. The hierarchical rearrangement operation consists of the inner-region rearrangement and the cross-region rearrangement, in which both the local and global information can be easily captured in both height and width directions. We first split the input tokens into multiple regions along the height/width directions, and leverage the inner-region rearrangement operation to shuffle all adjacent tokens belonging to the same region into a one-dimensional vector, followed by two fully connected layers to capture local information within these features. After that, this one-dimensional vector is restored back to the initial arrangement, as illustrated in Figure 1. For the communication between tokens from different regions, a cross-region rearrangement operation is implemented by shifting all the tokens along a specific direction, as shown in Figure 2(c)(d). Such hierarchical rearrangement operation enables our model to obtain both local and global information, and can easily handle the flexible input resolutions.”)
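The hierarchical rearrangement described in the passage quoted above can be sketched as follows. This is a hypothetical reconstruction of the operations as characterized in the quotation (inner-region rearrangement of adjacent tokens, followed by a cross-region shift), not the reference's actual implementation; `region_size`, `step`, and all other names are assumptions.

```python
import numpy as np

def inner_region_rearrange(x, region_size):
    """Group adjacent tokens along the height axis into regions and
    flatten each region's tokens into one vector (inner-region step)."""
    h, w, c = x.shape
    # split H into regions of `region_size` adjacent tokens
    x = x.reshape(h // region_size, region_size, w, c)
    # merge each region's tokens into the channel dimension
    return x.transpose(0, 2, 1, 3).reshape(h // region_size, w, region_size * c)

def cross_region_shift(x, step):
    """Shift all tokens along the height axis so that tokens from
    different regions can communicate (cross-region step)."""
    return np.roll(x, shift=step, axis=0)

feat = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)  # H=4, W=2, C=3
inner = inner_region_rearrange(feat, region_size=2)        # regions flattened
shifted = cross_region_shift(feat, step=1)                 # regions shifted
```

As the quotation notes, such rearrangements reduce to commonly used reshape and shift operations, after which channel-mixing fully connected layers operate on the flattened region vectors.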
Regarding claim 2, Guo teaches “The apparatus according to claim 1,”
“wherein the one or more processors execute the instructions to: irregularly divide the plurality of tokens into a plurality of groups concerning the spatial direction of the feature amount, and mix, for each of the plurality of groups, a plurality of tokens included in each group.” (Guo, Figure 1 and Figure 2, Page 827, Column 1 Paragraph 1, “As for the second challenge, we build the blocks of Hire-MLP based on hierarchical rearrangements and channel-mixing MLPs. The hierarchical rearrangement operation consists of the inner-region rearrangement and the cross-region rearrangement, in which both the local and global information can be easily captured in both height and width directions. We first split the input tokens into multiple regions along the height/width directions, and leverage the inner-region rearrangement operation to shuffle all adjacent tokens belonging to the same region into a one-dimensional vector, followed by two fully connected layers to capture local information within these features. After that, this one-dimensional vector is restored back to the initial arrangement, as illustrated in Figure 1. For the communication between tokens from different regions, a cross-region rearrangement operation is implemented by shifting all the tokens along a specific direction, as shown in Figure 2(c)(d). Such hierarchical rearrangement operation enables our model to obtain both local and global information, and can easily handle the flexible input resolutions.”)
Regarding claim 3, Guo teaches “The apparatus according to claim 2,”
“wherein the one or more processors execute the instructions to return positions of a plurality of tokens obtained by mixing to positions in the spatial direction before dividing.” (Guo, Figure 1 and Figure 2, Page 827 Column 1 Paragraph 2, “To be specific, our Hire-MLP has a hierarchical architecture similar to conventional CNNs [18] and recently proposed transformers [35,52] to generate pyramid feature representations for downstream vision tasks. The overall architecture is shown in Figure 1. After the first projection layer, the resulting feature X ∈ R H×W×C is then fed into a sequence of Hire-MLP blocks. Hire module is a key component in Hire-MLP block, which consists of three independent branches. The first two branches consist of a crossregion rearrangement layer, an inner-region rearrangement layer, two channel-mixing fully connected (FC) layers, an inner-region restore layer and a cross-region restore layer to capture local and global information along specific direction, i.e., the height and the width direction. The last branch is built upon a simple channel-mixing FC layer to capture channel information. Compared to existing MLPbased models that spatially shift features in different directions [31,57] or leverage a new cycle fully connected operator [5], our Hire-MLP needs only the channel-mixing MLPs and rearrangement operations. Furthermore, the rearrangement operations can be easily realized by commonly used reshape and padding operations in Pytorch/Tensorflow. And our Hire-MLP is completely capable to serve as a versatile backbone for various computer vision tasks.”)
Regarding claim 6, Guo teaches “The apparatus according to claim 2,”
“wherein the one or more processors execute the instructions to irregularly divide the plurality of tokens into a plurality of groups concerning both the spatial direction” (Guo, Figure 2.)
“and a channel direction of the feature amount.” (Guo, Figure 1 element “Channel MLP”, and Section 3.1, “The proposed Hire-MLP architecture is constructed by stacking multiple Hire-MLP blocks, as detailed in Figure 1. Similar to ViT [11] and MLP-Mixer [47], each Hire-MLP block consists of two sub-blocks, i.e., the proposed hire module and channel MLP in [47], aggregating spatial information and channel information, respectively. Given the input feature X ∈ R H×W×C with height H, width W, and channel number C, a Hire-MLP block can be formulated as:
Y = Hire-Module(BN(X)) + X,
Z = Channel-MLP(BN(Y )) + Y,
where Y and Z are intermediate feature and output feature of the block, respectively. BN denotes the batch normalization [25]. The whole Hire-MLP architecture is constructed by iteratively stacking the Hire-MLP block (Eq. 1). Compared with MLP-Mixer [47], the major difference is that we replace token-mixing MLP in MLP-Mixer with the proposed hire module and have successfully managed to capture the relationship between different tokens effectively.”)
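The two block equations quoted above, Y = Hire-Module(BN(X)) + X and Z = Channel-MLP(BN(Y)) + Y, can be sketched as follows. This is an illustration of the residual block structure only, using identity stand-ins for the sub-modules and a toy normalization in place of BN; all function names are assumptions, not the reference's code.

```python
import numpy as np

def batch_norm(x):
    # toy per-channel normalization standing in for BN
    mean = x.mean(axis=(0, 1), keepdims=True)
    std = x.std(axis=(0, 1), keepdims=True) + 1e-5
    return (x - mean) / std

def hire_mlp_block(x, hire_module, channel_mlp):
    y = hire_module(batch_norm(x)) + x   # Y = Hire-Module(BN(X)) + X
    z = channel_mlp(batch_norm(y)) + y   # Z = Channel-MLP(BN(Y)) + Y
    return z

x = np.random.rand(8, 8, 16)             # H x W x C input feature
identity = lambda t: t                   # stand-in sub-modules
z = hire_mlp_block(x, identity, identity)
```

Stacking such blocks, with the hire module aggregating spatial information and the channel MLP aggregating channel information, yields the overall architecture described in the quotation.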
Regarding claim 7, Guo teaches “The apparatus according to claim 2,”
“wherein the one or more processors execute the instructions to divide the plurality of tokens into the plurality of groups such that each group includes the same number of tokens.” (Guo, Figure 2 shows the same number of features in each token group. For example, see Figure 2(a)(left) and Figure 2(b)(left).)
Regarding claim 8, Guo teaches “The apparatus according to claim 2,”
“wherein the one or more processors execute the instructions to include at least one of Multi-head Self Attention (MSA), Multi-Layer Perceptron (MLP), and a fully connected layer.” (Guo, Figure 1 and Section 3.1, “The proposed Hire-MLP architecture is constructed by stacking multiple Hire-MLP blocks, as detailed in Figure 1. Similar to ViT [11] and MLP-Mixer [47], each Hire-MLP block consists of two sub-blocks, i.e., the proposed hire module and channel MLP in [47], aggregating spatial information and channel information, respectively. Given the input feature X ∈ R H×W×C with height H, width W, and channel number C, a Hire-MLP block can be formulated as:
Y = Hire-Module(BN(X)) + X,
Z = Channel-MLP(BN(Y )) + Y,
where Y and Z are intermediate feature and output feature of the block, respectively. BN denotes the batch normalization [25]. The whole Hire-MLP architecture is constructed by iteratively stacking the Hire-MLP block (Eq. 1). Compared with MLP-Mixer [47], the major difference is that we replace token-mixing MLP in MLP-Mixer with the proposed hire module and have successfully managed to capture the relationship between different tokens effectively.”)
Regarding claim 9, Guo teaches “The apparatus according to claim 1,”
“wherein the one or more processors execute the instructions to perform a predetermined task using a neural network (NN) based on an obtained feature amount.” (Guo, Figure 1, and Section 3.3 Paragraph 1, “An overview of the Hire-MLP-Tiny architecture is shown in Figure 1, more details and other variants of HireMLP are presented in Table A.1 in supplementary materials. We adopt a pyramid-like architecture for Hire-MLP following the commonly used design of CNNs [18, 41] and vision transformers [35,52]. It first splits the input image into patches (tokens) by a patch embedding layer [51]. Then two Hire-MLP blocks referred to as “Stage 1” are applied on the tokens above. As the network gets deeper, the number of tokens is reduced by another patch embedding layer and output channels are doubled at the same time. Especially, the whole architecture contains four stages, where the feature resolution reduces from H/4 × W/4 to H/32 × W/32 and the output dimension increases accordingly. The pyramid architecture aggregates the spatial feature for extracting semantic information, which can be applied to image classification, object detection, and semantic segmentation.”)
Regarding claim 10, Guo teaches “The apparatus according to claim 9,”
“wherein the input data is image data, and the predetermined task is one of an object detection task, a tracking task, and a class classification task.” (Guo, Section 1, last paragraph, “Experiments show that Hire-MLP can largely improve the performances of existing MLP-based models on various tasks, including image classification, object detection, instance segmentation, and semantic segmentation. For example, the Hire-MLP-Small attains an 82.1% top-1 accuracy on ImageNet, outperforming Swin-T [35] significantly with a higher throughput. Scaling up the model to larger sizes, we can further obtain 83.2% and 83.8% top1 accuracy. Using Hire-MLP-Small as backbone, Cascade Mask R-CNN achieves 50.7% box AP and 44.2% mask AP on COCO val2017. In addition, Hire-MLP-Small obtains 46.1% single-scale mIoU on ADE20K, which has an improvement of +1.6% mIoU over Swin-T, demonstrating that Hire-MLP can achieve a better accuracy-latency trade-off than prior MLP-based and transformer-based architectures.”)
Regarding claim 11, Guo teaches “The apparatus according to claim 1,”
“wherein the one or more processors execute the instructions to generate the feature amount using a convolutional neural network.” (Guo, Section 1, last paragraph, “Experiments show that Hire-MLP can largely improve the performances of existing MLP-based models on various tasks, including image classification, object detection, instance segmentation, and semantic segmentation. For example, the Hire-MLP-Small attains an 82.1% top-1 accuracy on ImageNet, outperforming Swin-T [35] significantly with a higher throughput. Scaling up the model to larger sizes, we can further obtain 83.2% and 83.8% top1 accuracy. Using Hire-MLP-Small as backbone, Cascade Mask R-CNN achieves 50.7% box AP and 44.2% mask AP on COCO val2017. In addition, Hire-MLP-Small obtains 46.1% single-scale mIoU on ADE20K, which has an improvement of +1.6% mIoU over Swin-T, demonstrating that Hire-MLP can achieve a better accuracy-latency trade-off than prior MLP-based and transformer-based architectures.”)
Regarding claims 12-15, these claims recite a method with steps corresponding to the elements of the system recited in Claims 1-3 and 11. Therefore, the recited steps of these claims are mapped to the analogous elements in the corresponding system claims.
Regarding claim 16, Claim 16 recites a non-transitory computer-readable recording medium storing a program with instructions corresponding to the steps recited in Claim 12. Therefore, the recited programming instructions of this claim are mapped to the analogous steps in the corresponding method claim. Lastly, the Guo reference necessarily incorporates a non-transitory computer-readable recording medium storing a program. Moreover, this amounts to a well-known element in the art that fails to distinguish the invention from the prior art.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Guo in view of Liu (TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers).
Regarding claim 4, Guo teaches “The apparatus according to claim 2,”
While Guo teaches the processor executing instructions to divide the tokens into groups (see the rejection of claim 2), Guo does not expressly disclose dividing the tokens into groups based on weights set for the tokens.
Liu teaches dividing tokens into groups based on weights set for the tokens (Liu, Figure 3, Figure 3 caption and Section 5.3 Subsection 1, “Fig. 3: Visualization of the attention maps of the class token in DeiT-S to attend to patch tokens at different layers. Using CutMix distracts the attention to background areas in the several middle layers. In contrast, the proposed TokenMix helps the class token focus more on foreground objects and leads to consistent performance gain”; “TokenMix helps transformers focus on the foreground area. As discussed in Section 3, CutMix assigns targets of the mixed images based on linear combinations of labels of the pairs of mixing images, which might be inaccurate if the foreground region is cut. We find that the inaccurate labels make transformers pay incorrect attention to the input image. As shown in Figure 3, using CutMix distracts the transformer’s attention to background areas in several middle layers (layers 5-10). In comparison, TokenMix helps transformers learn to pay more attention to the foreground areas and leads to consistent performance gain.”)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to incorporate the attention weight-based grouping of tokens taught by Liu into the grouping of tokens of Guo.
The motivation for doing so, as described by Liu, would have been to improve attention to foreground areas and lead to consistent performance gain. Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Guo and Liu to fully disclose, “wherein the one or more processors execute the instructions to divide the plurality of tokens into the plurality of groups in accordance with a weight set for each of the plurality of tokens.”
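The claimed concept as combined above, dividing a plurality of tokens into groups in accordance with a per-token weight, can be sketched as follows. This is a hypothetical illustration of the claim interpretation, not Liu's actual implementation; all names and values are assumptions.

```python
import numpy as np

def group_by_weight(tokens, weights, n_groups):
    """Order tokens by descending weight, then split into n_groups."""
    order = np.argsort(weights)[::-1]
    return [tokens[idx] for idx in np.array_split(order, n_groups)]

tokens = np.arange(8 * 2).reshape(8, 2)  # 8 tokens, 2 channels each
weights = np.array([0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6])
groups = group_by_weight(tokens, weights, n_groups=2)
# groups[0] holds the higher-weight tokens; groups[1] the lower-weight ones
```

In the combination, the weights would be supplied by an attention-based foreground measure as in Liu, rather than the arbitrary values used here for illustration.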
Allowable Subject Matter
Claim 5 is objected to as being dependent upon a rejected base claim, and is rejected under 35 U.S.C. 112(b), but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and amended to overcome the 35 U.S.C. 112(b) rejection.
The following is a statement of reasons for the indication of allowable subject matter: With respect to claim 5, in addition to other limitations in the claims, the Prior Art of Record fails to teach, disclose, or render obvious the applicant's invention as claimed, in particular:
Claim 5 recites, “The apparatus according to claim 4, wherein the one or more processors execute the instructions to set a weight for each of the plurality of tokens in accordance with a plurality of random seeds given in advance.”
Guo teaches a new multi-layer perceptron (MLP) strategy wherein tokens are rearranged according to inner-regions and cross-regions along spatial directions, for applications in various vision tasks including image classification, object detection, and semantic segmentation. Liu discloses the data augmentation technique, TokenMix, which mixes tokens from two images by partitioning the mixing region into multiple separated parts. Tolstikhin (US 20220375211 A1) teaches an MLP-based computer vision neural network which includes the generation of tokens from input images, and the use of mixer layers for mixing features and channels across the tokens. Tang (US 20230351163 A1) teaches an MLP-based architecture that determines a plurality of tokens for a piece of data, and mixes the plurality of tokens according to token amplitude and phase. However, none of these references expressly discloses the limitation of claim 5 quoted above. The Liu reference is the closest Prior Art, but it assigns weights based on foreground regions, rather than random seeds. It would not make sense to replace the foreground-based weighting with random seed-based weighting, because the rationale for performing the foreground-based weighting is expressly described as helping vision transformers focus on the foreground area for image classification.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON JOSEPH SORRIN whose telephone number is (703)756-1565. The examiner can normally be reached Monday - Friday 9am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AARON JOSEPH SORRIN/
Examiner, Art Unit 2672
/SUMATI LEFKOWITZ/Supervisory Patent Examiner, Art Unit 2672