Prosecution Insights
Last updated: April 19, 2026
Application No. 17/982,386

NEURAL NETWORK COMPUTATION TECHNIQUE

Final Rejection: §101, §102, §103, §112
Filed: Nov 07, 2022
Examiner: GALVIN-SIEBENALER, PAUL MICHAEL
Art Unit: 2147
Tech Center: 2100 — Computer Architecture & Software
Assignee: Nvidia Corporation
OA Round: 2 (Final)

Grant Probability: 25% (At Risk)
Expected OA Rounds: 3-4
Time to Grant: 3y 3m
With Interview: 0%

Examiner Intelligence

Career Allow Rate: 25% (grants only 25% of cases; 1 granted / 4 resolved; -30.0% vs TC avg)
Interview Lift: -25.0% among resolved cases with interview
Typical Timeline: 3y 3m average prosecution
Career History: 43 total applications across all art units; 39 currently pending

Statute-Specific Performance

§101: 29.8% (-10.2% vs TC avg)
§103: 36.8% (-3.2% vs TC avg)
§102: 19.0% (-21.0% vs TC avg)
§112: 14.5% (-25.5% vs TC avg)

Comparisons are against the Tech Center average estimate; based on career data from 4 resolved cases.

Office Action

Statutes addressed: §101, §102, §103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is in response to the amendment filed on Dec. 29, 2025. The amendments are linked to the original application filed on Nov. 7, 2022.

Response to Amendment

The Examiner thanks the applicant for the remarks, edits, and arguments.

Regarding Claim Rejections – 35 U.S.C. 112(b)

Applicant Remarks: The applicant has amended claims 4, 11, and 18 to no longer recite indefinite language. For this reason, the applicant believes the claims are compliant with 112(b) and requests that the rejection be withdrawn.

Examiner Response: The applicant argues that the claims have been amended to no longer recite indefinite subject matter, and the examiner has considered the current claims. The claims now recite a process for how the "spatially optimal combination" is achieved. Because of this, the examiner no longer believes that the claims recite indefinite subject matter and finds them in compliance with 35 U.S.C. 112(b); the rejection under 35 U.S.C. 112(b) is therefore withdrawn.

Regarding Claim Rejections – 35 U.S.C. 101

Applicant Remarks: The applicant states that the claims and specification disclose a system that improves computer functionality or other technology and/or improves a technical field, which would integrate the invention into a practical application. Because of this, even if the claims recite abstract ideas, the applicant believes the claims are patent eligible because they pass the Alice/Mayo test at Step 2A, prong two. For this reason, the applicant argues that the rejection under 35 U.S.C. 101 should be withdrawn.

Examiner Response: The applicant argues that the amended claims recite patent-eligible subject matter because, in particular, the claims as a whole recite an improvement to technology or a technical field. The applicant points to the specification to support this improvement and believes that the claims and specification integrate the claimed subject matter into a practical application. The examiner has considered the arguments proposed by the applicant and the amended claims. After consideration, the examiner believes that the amended claims better clarify the invention and, with support from the specification, recite an improvement to computing that integrates the invention into a practical application. After reviewing the claims under the Alice/Mayo test, the examiner believes that the claims recite patent-eligible subject matter and is therefore withdrawing the rejection under 35 U.S.C. 101.

Regarding Claim Rejections – 35 U.S.C. 102

Applicant Remarks: The applicant has made amendments to claims 1-4, 8-11, and 15-18 to further separate the claimed subject matter from the proposed prior art Zhao. The applicant believes that Zhao fails to anticipate each and every element of the amended claims. Therefore, the applicant believes the rejection under 35 U.S.C. 102 should be withdrawn.

Examiner Response: The applicant has amended the claims, and after review of Zhao, the examiner has also noted that Zhao fails to explicitly teach each and every element of the independent claims. Therefore, Zhao is no longer regarded as appropriate prior art for the current claims. After each amendment the examiner is required to perform a complete search of the amended claims.
This was completed, and the examiner has found new appropriate prior art which teaches each and every element of the amended independent claims. Therefore, with the new prior art proposed, the examiner believes that the claims remain rejected under 35 U.S.C. 102 and the rejection is upheld.

Regarding Claim Rejections – 35 U.S.C. 103

Applicant Remarks: The applicant has argued that Zhao fails to anticipate each and every element of the independent claims. Because Zhao is unable to anticipate the amended independent claims, the art Liu would also fail to disclose the missing elements of the independent claims, and therefore the combination of Zhao and Liu would fail to disclose the dependent claims. Therefore, the applicant believes that the rejection under 35 U.S.C. 103 should be withdrawn.

Examiner Response: The applicant has stated that Zhao fails to teach the independent claims, that the art used in combination with Zhao therefore also fails to teach the deficiencies of Zhao in the independent claims, and that the combination would likewise fail to teach the dependent claims as a whole. The examiner, as stated above, has found new prior art which teaches the elements of the independent claims. Further, the examiner believes that the combination of Wei and Liu is appropriate and that it would have been obvious to a person of ordinary skill in the art to combine them to disclose or teach the claimed invention. Therefore, the examiner believes that the rejection under 35 U.S.C. 103 is appropriate and is upheld.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-3, 8-10, and 15-17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wei et al., "Mask-CNN: Localizing Parts and Selecting Descriptors for Fine-Grained Image Recognition", May 2016 (hereinafter "Wei").

Regarding claim 1, Wei teaches, "A processor comprising: one or more circuits to cause one or more feature maps of one or more neural networks to:" (Dataset and Implementation Details, pp. 6; "The proposed Mask-CNN model and FCN used for generating masks are implemented using the open-source library MatConvNet [16]. In our experiments, after getting the learned part masks, we firstly generate the image patches of birds' head, torso and object as described in Sec. 3.2. Then, to facilitate the convergence of four stream CNNs, each single stream corresponding to the whole image, head, torso and object is fine-tuned on its input images separately. The CNNs used in each stream is initialized by the popular VGG-16 model [15] pre-trained on ImageNet. In addition, we double the training data by horizontal flipping for all the four streams. After fine-tuning on each stream, as shown in Fig. 2, the joint training of four-stream M-CNN is performed." The proposed system uses commonly used machine learning models to perform different actions. This teaches that the system is executed on a generic computing system containing processors connected to memory that contains instructions for the disclosed methods.);

"spatially concatenate a plurality of feature maps identified as input to different respective operations; and" (Training Mask-CNN, pp. 5; "After obtaining the object and part masks, we build the four-stream M-CNN for joint training. The overall architecture of the proposed model is presented in Fig. 2. We take the whole image stream as an example to illustrate the pipeline of each stream in M-CNN." The model in this article discloses concatenating multiple different feature maps and performing an operation on the concatenated map. The masks are initially identified from an original image to locate different features. After the features have been identified, they are input into the main model to perform multichannel joint training.); and (Training Mask-CNN, pp. 6; "In the classification step shown in Fig. 2 (f), the final 4,096-d image representation is the concatenation of the whole image, the head, the torso and the object features." As seen in Figure 2, the initial image and the masks are combined and processed by the model.);

"perform a single operation that substitutes for the different respective operations using the spatially concatenated plurality of feature maps." (Training Mask-CNN, pp. 6; "The last layer of M-CNN is a 200-way classification (fc+softmax) layer for classification on the CUB200-2011 dataset. The four stream M-CNN is learned end-to-end, with the parameters of four CNNs learned simultaneously. During training M-CNN, the parameters of the learned FCN segmentation network are fixed." After the masks have been combined, different CNN operations are performed on the combined map. This article discloses that the final combined image is input into another layer which is fully connected and performs a softmax operation.)

Regarding claim 2, Wei teaches, "wherein the single operation uses a same one or more parameters as the different respective operations." (Training Mask-CNN, pp. 6; "In the classification step shown in Fig. 2 (f), the final 4,096-d image representation is the concatenation of the whole image, the head, the torso and the object features. The last layer of M-CNN is a 200-way classification (fc+softmax) layer for classification on the CUB200-2011 dataset." The final combined feature map contains the features of the original image along with different masks of the original image.); and (Figure 2, pp. 3; The original image is seen in Fig. 2(f); this teaches that at least one of the parameters of the original image is used in the single operation shown in Fig. 2(f).)

Regarding claim 3, Wei teaches, "wherein the spatially concatenated plurality of feature maps are generated based, at least in part, on convolution of a matrix generated by spatial concatenation of a plurality of input matrices and a filter." (Training Mask-CNN, pp. 5; "The input images are fed into a traditional convolutional neural network, but the fully connect layers are discarded. That is to say, the CNN model used in our proposed M-CNN only contains convolutional, ReLU and pooling layers, which greatly brings down the M-CNN model size. Specifically, we use VGG-16 [15] as the baseline model, and the layers before pool5 are kept (including pool5). We obtain a 7 x 7 x 512 activation tensor in pool5 if the input image is 224 x 224. Therefore, we have 49 deep convolutional descriptors of 512-d, which also correspond to 7 x 7 spatial positions in the input images." The original image is evaluated for different important features; these features are identified and isolated. The isolated feature maps are used in combination with the main image to produce a result or classification.); and (Figure 3, pp. 4; As seen in Figure 2, the model has a four-stream CNN which uses different masks of an image. Figure 3 further discloses the process of how the different masks are generated using an FCN, which uses convolutional operations.)

Regarding claims 8-10 (directed to a system) and claims 15-17 (directed to a method), these claims recite substantially the same limitations as claims 1-3, differing only in their preambles, and Wei teaches them for the same reasons over the same passages (Dataset and Implementation Details, pp. 6; Training Mask-CNN, pp. 5-6; Figures 2-3).
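To make the limitation at the heart of the §102 dispute concrete, here is a minimal sketch, in PyTorch, of the kind of technique the independent claims describe: tiling several feature maps side by side ("spatial concatenation") so that one convolution substitutes for running the same convolution over each map separately. This illustrates the claim language only; it is not the applicant's actual implementation or Wei's model, and all tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Two feature maps that would otherwise be convolved by separate but
# identical operations ("different respective operations" in the claims).
fmap_a = torch.randn(1, 8, 16, 16)   # (batch, channels, H, W)
fmap_b = torch.randn(1, 8, 16, 16)
weight = torch.randn(4, 8, 3, 3)     # shared 3x3 filter bank

# Baseline: run the same convolution twice, once per map.
out_a = F.conv2d(fmap_a, weight, padding=1)
out_b = F.conv2d(fmap_b, weight, padding=1)

# Spatial concatenation along the width, with a zero gap just wide enough
# that the 3x3 kernel never mixes the two tiles. This gap is the "padding"
# the dependent claims are concerned with minimizing.
gap = torch.zeros(1, 8, 16, 2)       # kernel_size - 1 = 2 columns
tiled = torch.cat([fmap_a, gap, fmap_b], dim=3)

# The "single operation" that substitutes for both, then split back apart.
out = F.conv2d(tiled, weight, padding=1)
out_a2, out_b2 = out[..., :16], out[..., -16:]

print(torch.allclose(out_a, out_a2, atol=1e-5),
      torch.allclose(out_b, out_b2, atol=1e-5))  # True True
```

The point of the trick is that one large kernel launch replaces many small ones; the results match the per-map convolutions exactly because the zero gap prevents cross-talk at the seam.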
Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 4-7, 11-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wei in view of Liu et al., "Automatic Building Extraction on High-Resolution Remote Sensing Imagery Using Deep Convolutional Encoder-Decoder With Spatial Pyramid Pooling", Sept. 2019 (hereinafter "Liu").

Regarding claim 4, Wei fails to explicitly disclose the elements of this claim; however, Liu discloses, "the one or more circuits to spatially concatenate a plurality of input matrices by at least computing a combination of the input matrices in at least one spatial dimension of one or more input matrices of the plurality of input matrices, wherein the combination reduces an amount of padding added to the spatially concatenated plurality of input matrices." (Spatial Pyramid Pooling Module, pp. 128777; "The convoluted feature maps are further interpolated using bilinear filtering to match the size of the input feature map. The input feature map is finally concatenated with four up-sampled feature maps so that global context features can be maintained with multiscale features. For the pooling operation, we adopt adaptive average pooling as illustrated in Figure 2. Four levels with bin sizes of 1 x 1, 2 x 2, 3 x 3, and 6 x 6 are used in the spatial pyramid pooling module. Note that the number and size of pyramid levels can be modified. They are related to the size of the feature map fed into the pyramid pooling layer [33]." In the spatial pyramid pooling module used in this article, the different bins are designed to match the size of the input image. This teaches that the feature maps are combined in a way that reduces the amount of padding needed in the final concatenated output.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wei and Liu. Wei teaches a CNN model which evaluates an image using masks and can combine different masks to further evaluate the image. Liu teaches a CNN model which uses spatial pooling of feature maps to further evaluate input images. One of ordinary skill would have had motivation to combine the components of the CNN models proposed in Wei and Liu, concatenating input images by different methods, to produce more accurate classification and labeling results: "The quantitative comparison of the different networks on the whole testing dataset is presented in Table 3. It demonstrates that the proposed USPP delivers improvements on all performance indicators over the other models except for 'Precision'. In the testing case, the USPP model is the best among all models on Overall Accuracy score with a gain of 1.0% (0.913 vs. 0.904) over the next best model U-Net. As for Precision, the FRRN model holds the highest values and gains 2.2% over USPP (0.928 vs. 0.908). USPP still performs better than U-Net by 1.0% (0.908 vs. 0.899) over the entire testing dataset. For Recall, the U-Net, FCN, and USPP method shows significantly better performance over the other three methods while USPP achieves the best value being 2.6% ahead of the U-Net method (0.892 vs. 0.869)." (Liu, Results: Comparison on the Massachusetts Dataset, pp. 128780.)

Regarding claim 5, Wei fails to explicitly disclose the elements of this claim; however, Liu discloses, "the one or more circuits to store information indicative of an arrangement of input matrices in a matrix generated by spatial concatenation." (Spatial Pyramid Pooling Module, pp. 128777; "Through pyramid pooling, spatial features on four different spatial scales can be identified. In order to enhance the nonlinear learning ability of the multiscale features, 1 x 1 convolution is added to maintain the size of features and to reduce the number of each features channels by an N-th of the number of channels of the input feature map; N is the number of pyramid pooling scales, typically chosen to be four following the works by Zhao et al. [33] and Yu et al. [43]. The convoluted feature maps are further interpolated using bilinear filtering to match the size of the input feature map. The input feature map is finally concatenated with four up-sampled feature maps so that global context features can be maintained with multiscale features." This article discloses a method for image encoding and decoding using a spatial pyramid pooling module based on Zhao's article. It takes multiple feature maps and concatenates them together. The locations of the feature maps within the final feature map are stored and identifiable.)

Regarding claim 6, Wei fails to explicitly disclose the elements of this claim; however, Liu discloses, "the one or more circuits to separate the spatially concatenated plurality of feature maps into a plurality of feature maps." (Figure 3, pp. 128778; This figure teaches the structure of the proposed method; the sections which perform the feature map concatenation are outlined with a red box. After the pyramid pooling module is applied, the output feature map is further operated on and is split into multiple other feature maps as shown. [Liu, Figure 3 is reproduced in the original Office action.])

Regarding claim 7, Wei fails to explicitly disclose the elements of this claim; however, Liu discloses, "the one or more circuits to spatially concatenate an input matrix by determining a spatially efficient arrangement of two or more matrices." (Spatial Pyramid Pooling Module, pp. 128777; "Through pyramid pooling, spatial features on four different spatial scales can be identified. In order to enhance the nonlinear learning ability of the multiscale features, 1 x 1 convolution is added to maintain the size of features and to reduce the number of each features channels by an N-th of the number of channels of the input feature map; N is the number of pyramid pooling scales, typically chosen to be four following the works by Zhao et al. [33] and Yu et al. [43]. The convoluted feature maps are further interpolated using bilinear filtering to match the size of the input feature map." This article teaches a method which uses a spatial pyramid pooling module. The system takes in a feature map, subdivides that map, and then concatenates all of the resulting feature maps, along with the original feature map, into a final feature map in one spatial location. The system subdivides the original feature map in a particular way in order to maintain the final size of features in the final concatenated feature map.)

Regarding claims 11-13 (directed to a system) and claims 18-20 (directed to a method), these claims recite substantially the same limitations as claims 4-6, differing only in their preambles, and Liu discloses them for the same reasons over the same passages (Spatial Pyramid Pooling Module, pp. 128777; Figure 3, pp. 128778). Regarding claim 14, the claim recites the limitation of claim 7 with the further requirement that the two or more matrices have "dimensions of unequal size"; it is mapped to the same Spatial Pyramid Pooling Module passage of Liu (pp. 128777) with the same annotation.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL MICHAEL GALVIN-SIEBENALER, whose telephone number is (571) 272-1257. The examiner can normally be reached Monday - Friday, 8 AM to 5 PM. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Viker Lamardo, can be reached at (571) 270-5871. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PAUL M GALVIN-SIEBENALER/
Examiner, Art Unit 2147

/VIKER A LAMARDO/
Supervisory Patent Examiner, Art Unit 2147
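For readers unfamiliar with the Liu reference relied on in the §103 rejections above, the following is a minimal sketch of a spatial pyramid pooling module of the kind Liu's quoted passages describe: adaptive average pooling at bin sizes 1, 2, 3, and 6, a 1 x 1 convolution that reduces channels by the number of pyramid levels, bilinear upsampling back to the input resolution, and concatenation with the input map. This is a generic reconstruction from the quotations, not Liu's published code; the class name and layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Generic SPP block reconstructed from the passages of Liu quoted above.

    Each pyramid level: adaptive average pool to a bin size, a 1x1 conv that
    shrinks channels to in_channels // num_levels, bilinear upsampling back
    to the input resolution, then concatenation with the input feature map.
    """

    def __init__(self, in_channels: int, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.bin_sizes = bin_sizes
        # One 1x1 conv per pyramid level (channel reduction by N, per Liu).
        self.reducers = nn.ModuleList(
            nn.Conv2d(in_channels, in_channels // len(bin_sizes), kernel_size=1)
            for _ in bin_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        levels = [x]  # keep the input map so global context is preserved
        for bin_size, reduce in zip(self.bin_sizes, self.reducers):
            pooled = F.adaptive_avg_pool2d(x, bin_size)   # coarse summary
            reduced = reduce(pooled)                      # 1x1 conv
            upsampled = F.interpolate(reduced, size=(h, w),
                                      mode="bilinear", align_corners=False)
            levels.append(upsampled)
        return torch.cat(levels, dim=1)  # channel-wise concatenation

# Quick shape check: 512-channel input -> 512 + 4 * 128 = 1024 channels out.
spp = SpatialPyramidPooling(in_channels=512)
print(spp(torch.randn(1, 512, 32, 32)).shape)  # torch.Size([1, 1024, 32, 32])
```

Note the tension the claim mapping glosses over: the final concatenation here is channel-wise, while the claims recite concatenation in a spatial dimension. That gap is exactly the kind of point an applicant might press in response.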

Prosecution Timeline

Nov 07, 2022
Application Filed
Jul 24, 2025
Non-Final Rejection — §101, §102, §103, §112
Dec 29, 2025
Response Filed
Mar 11, 2026
Final Rejection — §102, §103 (current; §101 and §112 rejections withdrawn)


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 25%
With Interview: 0% (-25.0%)
Median Time to Grant: 3y 3m
PTA Risk: Moderate

Based on 4 resolved cases by this examiner. Grant probability is derived from the career allow rate.
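The note above implies a simple derivation for these headline figures. A minimal sketch of that arithmetic follows, assuming the projections are computed directly from the Examiner Intelligence panel; the tool's actual model is not documented and may differ.

```python
# Assumed derivation of the dashboard's headline numbers; figures are taken
# from the Examiner Intelligence panel above.
granted, resolved = 1, 4
career_allow_rate = granted / resolved        # 0.25 -> "25% Grant Probability"

tc_average = career_allow_rate + 0.30         # implied by "-30.0% vs TC avg"
interview_lift = -0.25                        # "-25.0% Interview Lift"

with_interview = max(0.0, career_allow_rate + interview_lift)
print(f"{career_allow_rate:.0%} base, {with_interview:.0%} with interview, "
      f"TC average ~{tc_average:.0%}")
# 25% base, 0% with interview, TC average ~55%
```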
