Last updated: May 29, 2026
Application No. 18/557,233
METHOD AND SYSTEM FOR IMAGE PROCESSING BASED ON CONVOLUTIONAL NEURAL NETWORK

Non-Final OA §103
Filed
Oct 25, 2023
Priority
Oct 14, 2021 — nonprovisional of PCTSG2021050623
Examiner
CHEN, XUEMEI G
Art Unit
2661
Tech Center
2600 — Communications
Assignee
Exo Imaging Inc.
OA Round
1 (Non-Final)
Interview Optional

— +25.4% interview lift. Examiner has a relatively high allowance rate (77%); +25.4% interview lift. A written response may suffice.
Based on 579 resolved cases, 2023–2026
Examiner Intelligence

CHEN, XUEMEI G View full profile →
Grants 77% — above average
Career Allowance Rate
446 granted / 579 resolved
+15.0% vs TC avg
Strong +25% interview lift
Without
With
+25.4%
Interview Lift
resolved cases with interview
Typical timeline
2y 7m
Avg Prosecution
12 currently pending
Career history
594
Total Applications
across all art units
Statute-Specific Performance

§101
2.5%
-37.5% vs TC avg
§103
88.4%
+48.4% vs TC avg
§102
3.3%
-36.7% vs TC avg
§112
3.8%
-36.2% vs TC avg
Black line = Tech Center average estimate • Based on career data from 579 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
Claims 1-31 are pending in the application. Claims 32-34 have been canceled.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA  to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-7, 25, 27 and 29-30 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (Wang J, Xiao H, Chen L, Xing J, Pan Z, Luo R, Cai X. Integrating weighted feature fusion and the spatial attention module with convolutional neural networks for automatic aircraft detection from SAR images. Remote Sensing. 2021 Feb 28;13(5):910. Hereafter Wang), in view of Liu et al. (Liu, Rosanne, et al. "An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution." arXiv preprint arXiv:1807.03247 (2018). Hereafter Liu).

As per claim 1, Wang teaches the invention substantially including a method of image processing based on a convolutional neural network (CNN) (Abstract; Title), using at least one processor (page 11-12 bridging para. section 4.2. “Hyperparameter Settings”: “each model is trained on two 2080Ti GPUs for 100 epochs”), the method comprising:
receiving an input image (Fig. 2 “Input”); 
performing a plurality of feature extraction operations using a plurality of convolution layers of the CNN to produce a plurality of output feature maps, wherein a respective feature extraction operation of the plurality of feature extraction operations is performed by a respective convolution layer of the plurality of convolution layers (Fig. 2 P3-P7 including 5 convolution layers for producing a plurality of output feature maps; page 7 section 3.3. “EWFAN for Aircraft Detection” second para.: “First, the image is input to the backbone network and down-sampled, then five feature maps of different sizes are obtained. The sizes are 64 × 64, 32 × 32, 16 × 16, 8 × 8, and 4 × 4 respectively”) and includes: 
receiving, by the respective convolution layer, a respective input feature map 
generating, by the respective convolution layer, a respective spatial attention map based on the respective input feature map (Fig. 4 “Conv” representing convolution operation; “sigmoid” being an activation function for generating a spatial attention map based on the respective input feature map;  page 7 first para.: “Furthermore, the sigmoid function is utilized to normalize the feature map to obtain the spatial attention feature”);
generating, by the respective convolution layer, 
outputting, by the respective convolution layer, a respective output feature map of the respective convolution layer based on the respective input feature map and the                         
                            ⊕
                        
                    ” operation)); and 
producing an output image corresponding to the input image based on the plurality of output feature maps of the plurality of convolution layers (Fig. 2 “output; Fig. 4).

	Wang teaches generating weighted feature map by multiplying the spatial attention map with the input feature map, rather than generating a plurality of weighted coordinate maps by multiplying the spatial attention map with a plurality of coordinate maps.
	Liu in an analogous field discloses an improved CNN “CoordConv” for object detection (Abstract). Specifically, the CNN is improved by concatenating input feature map with coordinate maps (Fig. 3 right column “CoordConv Layers” showing 2 extra channels “i coordinate” and “j coordinate” being added to the feature maps by combining 2 coordinate maps). 
	It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang’s teaching by incorporating Liu’s teaching to generate a plurality of weighted coordinate maps. Incorporating coordinate maps would allow networks to learn either complete translation invariance or varying degrees of translation dependence as suggested by Liu (Abstract).
	
As per claim 2, dependent upon claim 1, Wang in view of Liu teaches wherein generating, by the respective convolution layer, the respective spatial attention map based on the respective input feature map comprises: 
performing a first convolution operation based on the respective input feature map received by the respective convolution layer to produce a respective convolved feature map (Wang Fig. 2 layers P3-P7); and 
applying an activation function based on the respective convolved feature map to generate the respective spatial attention map (Wang Fig. 4 applying a sigmoid function (an activation function) to generate the respective spatial attention map).

As per claim 3, dependent upon claim 2, Wang in view of Liu teaches wherein the activation function is a sigmoid activation function (Wang Fig. 4 “sigmoid”).

As per claim 4, dependent upon claim 2, Wang in view of Liu teaches wherein generating, by the respective convolution layer, the plurality of weighted coordinate maps comprises multiplying each of the plurality of coordinate maps with the respective spatial attention map so as to modify coordinate information in each of the plurality of coordinate maps (Wang Fig. 4 “                        
                            ⊗
                        
                    ” function; page 7 first para.: “Furthermore, the sigmoid function is utilized to normalize the feature map to obtain the spatial attention feature, which is multiplied by the input”; Liu Fig. 3 “i coordinate” and “j coordinate”).

As per claim 5, dependent upon claim 2, Wang in view of Liu teaches wherein the plurality of coordinate maps comprises a first coordinate map comprising coordinate information with respect to a first dimension and a second coordinate map comprising coordinate information with respect to a second dimension (Liu Fig. 3 “i coordinate” and “j coordinate”), the first and second dimensions being two dimensions over which the first convolution operation is configured to perform (Wang Fig. 2 layers P3-P7 performing 2D convolution, i.e., i and j directions, for generating 2D feature maps).

As per claim 6, dependent upon claim 1, Wang in view of Liu teaches wherein outputting, by the respective convolution layer, the respective output feature map of the respective convolution layer comprises: 
concatenating the respective input feature map received by the respective convolution layer and the plurality of weighted coordinate maps channel-wise to form a respective concatenated feature map (Wang Fig. 4 “                        
                            ⊕
                        
                    ” operation; Liu Fig. 3 showing the coordinate maps are concatenated with the feature maps; Liu Fig. 3 caption “A CoordConv layer has the same functional signature, but accomplishes the mapping by first concatenating extra channels to the incoming representation”); and 
performing a second convolution operation based on the respective concatenated feature map to produce the respective output feature map of the respective convolution layer (Wang Fig. 2 “Class and box prediction net”; Fig. 6; Liu Fig. 3 right col “Conv”).

As per claim 7, dependent upon claim 1, Wang in view of Liu teaches wherein: 
the CNN comprises a prediction sub-network comprising at least one convolution layer of the plurality of convolution layers of the CNN (Wang Fig. 2 “Efficientnet backbone and Downsampling”); and the method further comprises: 
producing a set of predicted feature maps using the prediction sub-network based on the input image (Wang Fig. 2 P3-P7 including 5 convolution layers for producing a plurality of output feature maps; page 7 section 3.3. “EWFAN for Aircraft Detection” second para.: “First, the image is input to the backbone network and down-sampled, then five feature maps of different sizes are obtained. The sizes are 64 × 64, 32 × 32, 16 × 16, 8 × 8, and 4 × 4 respectively”), including: 
performing at least one feature extraction operation, of the plurality of feature extraction operations, using the at least one convolution layer of the prediction sub-network, wherein the set of predicted feature maps include a plurality of predicted feature maps having different spatial resolution levels (Wang page 7 section 3.3. “EWFAN for Aircraft Detection” second para.: “First, the image is input to the backbone network and down-sampled, then five feature maps of different sizes are obtained. The sizes are 64 × 64, 32 × 32, 16 × 16, 8 × 8, and 4 × 4 respectively”).

As per claim 25, dependent upon claim 1, Wang in view of Liu teaches wherein:
receiving the input image comprises receiving a plurality of input images, each of the plurality of input images being a labeled image so as to train the CNN to obtain a trained CNN (Wang page 11 section 4.1 “Data Usage”: “We use multiple large-scale SAR images including airports and aircrafts to perform training and testing. First, we use the RSlabel tool to label the aircrafts in the SAR images, which are confirmed by the SAR expert. Then, we use the generated tag files and the original SAR images to automatically generate a data set, which contains 5480 aircraft slices with a size of 500 × 500 and the corresponding label files”), and 
the method further includes, for each of the plurality of input images: 
performing the plurality of feature extraction operations using the plurality of convolution layers of the CNN to produce the plurality of output feature maps (Wang Fig. 2 P3-P7); and 
producing the output image corresponding to the input image based on the plurality of output feature maps of the plurality of convolution layers (Wang Fig. 2 “output”).

As per claim 27, dependent upon claim 1, Wang in view of Liu teaches wherein the output image is a result of an inference on the input image using the CNN (Wang Fig. 2 “Test stage”, “output”).

As per claim 29, Wang in view of Liu teaches a system for image processing based on a convolutional neural network (CNN) (Wang page 11-12 bridging para. section 4.2. “Hyperparameter Settings” “each model is trained on two 2080Ti GPUs for 100 epochs”), the system comprising: 
a memory (Wang teaches a computer-implemented method by using a processor. Therefore a memory communicatively coupled to the processor is inherently taught); and 
at least one processor (Wang page 11-12 bridging para. section 4.2. “Hyperparameter Settings” “each model is trained on two 2080Ti GPUs for 100 epochs”) communicatively coupled to the memory and configured to perform a set of operations, comprising: 
receiving an input image; 
performing a plurality of feature extraction operations using a plurality of convolution layers of the CNN to produce a plurality of output feature maps, wherein a respective feature extraction operation of the plurality of feature extraction operations is performed by a respective convolution layer of the plurality of convolution layers and includes: 
receiving, by the respective convolution layer, a respective input feature map and a plurality of coordinate maps; 
generating, by the respective convolution layer, a respective spatial attention map based on the respective input feature map; 
generating, by the respective convolution layer, a plurality of weighted coordinate maps based on the plurality of coordinate maps and the respective spatial attention map; and 
outputting, by the respective convolution layer, a respective output feature map of the respective convolution layer based on the respective input feature map and the plurality of weighted coordinate maps; and
 producing an output image corresponding to the input image based on the plurality of output feature maps of the plurality of convolution layers 
(Claim 29 recites a system with elements corresponding to the steps recited in claim 1. Therefore, the recited elements of this claim are mapped to Wang in view of Liu in the same manner as the corresponding steps in its corresponding method claim, claim 1. Additionally, the rationale and motivation to combine Wang and Liu presented in rejection of claim 1 apply to this claim). 

As per claim 30, Wang in view of Liu teaches a computer program product, embodied in one or more non-transitory computer-readable storage media, comprising instructions executable by at least one processor to perform a set of operations (Wang page 11-12 bridging para. section 4.2. “Hyperparameter Settings” “each model is trained on two 2080Ti GPUs for 100 epochs”; Since Wang teaches a computer-implemented method using at least a processor, memory and computer-readable storage media are inherently taught) using a convolutional neural network (CNN), the set of operations comprising: 
receiving an input image; 
performing a plurality of feature extraction operations using a plurality of convolution layers of the CNN to produce a plurality of output feature maps, wherein a respective feature extraction operation of the plurality of feature extraction operations is performed by a respective convolution layer of the plurality of convolution layers and includes: 
receiving, by the respective convolution layer, a respective input feature map and a plurality of coordinate maps; 
generating, by the respective convolution layer, a respective spatial attention map based on the respective input feature map; 
generating, by the respective convolution layer, a plurality of weighted coordinate maps based on the plurality of coordinate maps and the respective spatial attention map; and 
outputting, by the respective convolution layer, a respective output feature map of the respective convolution layer based on the respective input feature map and the plurality of weighted coordinate maps; and 
producing an output image corresponding to the input image based on the plurality of output feature maps of the plurality of convolution layers
(Claim 30 recites a computer program product with elements corresponding to the steps recited in claim 1. Therefore, the recited elements of this claim are mapped to Wang in view of Liu in the same manner as the corresponding steps in its corresponding method claim, claim 1. Additionally, the rationale and motivation to combine Wang and Liu presented in rejection of claim 1 apply to this claim). 
Claim(s) 8-9, 11-15 and 23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Liu, as applied above to claim 7, and further in view of Qin et al. (Qin X, Zhang Z, Huang C, Gao C, Dehghan M, Jagersand M. Basnet: Boundary-aware salient object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition 2019 (pp. 7479-7489). Hereafter Qin_1).
As per claim 8, Wang in view of Liu does not teach the recited limitations.
Qin_1 in an analogous field discloses a deep convolutional neural network, BASNet, for salient object detection (Abstract). Specifically, the architecture is composed of a densely supervised Encoder-Decoder network and a residual refinement module, which are respectively in charge of saliency prediction and saliency map refinement (Abstract). The Encoder-Decoder network corresponds to a prediction sub-network. Specifically Qin_1 teaches: 
the prediction sub-network has an encoder-decoder structure comprising a plurality of first encoder blocks and a plurality of first decoder blocks, each first encoder block of the plurality of first encoder blocks corresponding to one respective first decoder block of the plurality of first decoder blocks (Qin_1 Fig. 2 “Predict Module (En-De)” shows a deep neural network comprising an encoder-decoder structure, which has a plurality of first encoder blocks (Fig. 2 left panel leftmost 6 blue blocks) and a plurality of corresponding first decoder blocks (Fig. 2 left panel 6 blocks (5 green+ 1 pink) connecting to the blue blocks), each first encoder block of the plurality of first encoder blocks corresponding to one respective first decoder block of the plurality of first decoder blocks; Page 7481-7482 section 3.2. “Predict Module”), and 
the method further comprises: 
producing, by a respective first encoder block of the plurality of first encoder blocks, a respective downsampled feature map based on a respective input feature map received by the respective first encoder block (Fig. 2 left panel 6 blue encoder blocks sequentially producing downsampled feature maps at respective resolutions 224X224X84, 112X112X128 etc.); and 
producing, by a respective first decoder block, of the plurality of first decoder blocks, corresponding to the respective first encoder block, a respective upsampled feature map based on the respective input feature map and the respective downsampled feature map produced by the respective first encoder block corresponding to the respective first decoder block (Fig. 2 left panel 6 corresponding blocks (5 green+ 1 pink); page 7481 right col. last para. “Our decoder is almost symmetrical to the encoder. Each stage consists of three convolution layers followed by a batch normalization and a ReLU activation function. The input of each stage is the concatenated feature maps of the upsampled output from its previous stage and its corresponding stage in the encoder”).

It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang and Liu’s teaching by incorporating Qin_1’s teaching to include a prediction sub-network having an encoder-decoder structure as specified in claim 8. Incorporating such a prediction sub-network with the specific structure would effectively segment the salient object regions and accurately predict the fine structures with clear boundaries as suggested by Qin_1 (Abstract).

As per claim 9, dependent upon claim 8, Wang in view of Liu and Qin_1 teaches
wherein producing the set of predicted feature maps using the prediction sub-network comprises producing the plurality of predicted feature maps based on a plurality of upsampled feature maps produced by the plurality of first decoder blocks (Qin_1 Fig. 2 left panel “Predict Module (En-De)” showing an encoder-decoder structure. In the decoder side, a plurality of predicted feature maps is produced based on a plurality of upsampled feature maps (Qin_1 page 7481 right col. last para. “Our decoder is almost symmetrical to the encoder. Each stage consists of three convolution layers followed by a batch normalization and a ReLU activation function. The input of each stage is the concatenated feature maps of the upsampled output from its previous stage and its corresponding stage in the encoder”)).

As per claim 11, dependent upon claim 8, Wang in view of Liu and Qin_1 teaches wherein: 
each of the plurality of first encoder blocks of the prediction sub-network comprises at least one convolution layer of the plurality of convolution layers of the CNN (Qin_1 Fig. 2 left panel each blue block comprises at least one convolutional layer. See below description from page 7481, especially “ResNet-34” and “res-blocks”, both comprising convolutional layers); and 

    PNG
    media_image1.png
    665
    590
    media_image1.png
    Greyscale

producing, by the respective first encoder block of the plurality of first encoder blocks, the respective downsampled feature map includes: 
performing at least one feature extraction operation of the plurality of feature extraction operations using the at least one convolution layer of the respective first encoder block (Qin_1 Fig. 2 left panel blue blocks perform feature extraction and produces feature maps. Page 7481-7482 section 3.2. “Predict Module); and 
each of the plurality of first decoder blocks of the prediction sub-network comprises at least one convolution layer of the plurality of convolution layers of the CNN (Qin_1 Fig. 2 left panel pink layers “Conv+BN+ReLU”); and 
producing, by the respective first decoder block of the plurality of first decoder blocks, the respective upsampled feature map includes: 
performing at least one feature extraction operation of the plurality of feature extraction operations using the at least one convolution layer of the respective first decoder block (Qin_1 Fig. 2 left panel pink layers “Conv+BN+ReLU”).

As per claim 12, dependent upon claim 11, Wang in view of Liu and Qin_1 teaches wherein: 
each convolution layer of each of the plurality of first encoder blocks of the prediction sub-network is one of the plurality of convolution layers of the CNN (Qin_1 Fig. 2 left panel blue blocks “ResNet-34” and “res-blocks” comprising convolutional layers); and 
each convolution layer of each of the plurality of first decoder blocks of the prediction sub-network is one of the plurality of convolution layers of the CNN (Qin_1 Fig. 2 left panel pink layers “Conv+BN+ReLU”).

As per claim 13, dependent upon claim 8, Wang in view of Liu and Qin_1 teaches wherein: 
each of the plurality of first encoder blocks of the prediction sub-network is configured as a residual block (Qin_1 Fig. 2 left panel blue blocks comprising “ResNet-34” and “res-blocks”), and each of the plurality of first decoder blocks of the prediction sub-network is configured as a residual block (Qin_1 Fig. 2 left panel pink layers “Conv+BN+ReLU” with skip connections constitute residual blocks).
 
As per claim 14, dependent upon claim 7, Wang in view of Liu and Qin_1 teaches wherein: 
the CNN further comprises a refinement sub-network comprising at least one convolution layer of the plurality of convolution layers of the CNN (Qin_1 Fig. 2 right panel “Residual Refinement Module (RRM)”), 
the method further comprises producing a set of refined feature maps using the refinement sub-network based on a fused feature map (Qin_1 Fig. 2 “refined map” (black and white) being a fused feature map; See description in page 7482 section 3.3. “Refine Module”), the producing including: 
performing at least one feature extraction operation of the plurality of feature extraction operations using the at least one convolution layer of refinement sub-network, wherein the set of refined feature maps includes a plurality of refined feature maps having different spatial resolution levels (Qin_1 Fig. 2 showing feature maps with different spatial resolution levels, such as 28X28X64, 56X56X64 etc.; Fig. 4 “(c) RRM Ours”).

As per claim 15, dependent upon claim 14, Wang in view of Liu and Qin_1 further teaches concatenating the set of predicted feature maps to produce the fused feature map (Qin_1 Fig. 2 “refined map” is produced by concatenating (the “                        
                            ⊕
                        
                    ” operation) the set of predicted feature maps).

As per claim 23, dependent upon claim 14, Wang in view of Liu and Qin_1 teaches wherein the output image is produced based on the set of refined feature maps (Qin_1 Fig. 2 “refined map” is produced by using the set of predicted refined feature maps).

Claim(s) 10 and 16-22 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Liu and Qin_1, as applied above respectively to claims 8 and 14 respectively, and further in view of Qin et al. (Qin, Xuebin, et al. "U2-Net: Going deeper with nested U-structure for salient object detection." Pattern recognition 106 (2020): 107404. Hereafter Qin_2).
As per claim 10, dependent upon claim 8, Wang in view of Liu and Qin_1 does not teach the recited limitations.
Qin_2 in an analogous field discloses a deep network architecture,                         
                            
                                
                                    U
                                
                                
                                    2
                                
                            
                        
                    -Net, for salient object detection (SOD) (Abstract). Qin_2’s deep network includes an encoder-decoder structure comprising a plurality of encoder blocks and corresponding decoding blocks. Specifically, Qin_2 teaches: 
for a respective first encoder block of the plurality of first encoder blocks, producing the respective downsampled feature map comprises: 
extracting first multi-scale features based on the respective input feature map received by the respective first encoder block (Qin_2 Fig. 5 showing 5 encoder blocks and corresponding decoder blocks. Each of the encoder blocks further comprises an encoder-decoder structure for extracting multi-scale features); and 
producing the respective downsampled feature map based on the extracted first multi-scale features (Qin_2 the 5 encoder blocks sequentially producing downsampled feature maps based on the extracted first multi-scale features), and 
for a respective first decoder block of the plurality of first decoder blocks, producing the respective upsampled feature map comprises: 
extracting second multi-scale features based on the respective input feature map and the respective downsampled feature map produced by the respective first encoder block corresponding to the respective first decoder block received by the decoder block (Qin_2 Fig. 5 showing 5 encoder blocks and corresponding decoder blocks. Each of the decoder blocks further comprises an encoder-decoder structure for extracting multi-scale features); and 
producing the respective upsampled feature map based on the extracted second multi-scale features extracted by the respective decoder block (Qin_2 the 5 decoder blocks sequentially producing upsampled feature maps based on the extracted second multi-scale features).
It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang, Liu and Qin_1’s teaching by incorporating Qin_2’s teaching to configure each encoder block and each decoder block in such a manner that each encoder block and each decoder block can extract respective multi-scale features. Incorporating such a two-level nested U-structure would allow training a deep network from scratch without using backbones from image classification tasks, and achieving comparable or better performance than those based on existing pre-trained backbones as recognized by Qin_2 (Abstract; section 1 “Introduction”).

As per claim 16, dependent upon claim 14, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein the refinement sub-network comprises a plurality of refinement blocks configured to produce the plurality of refined feature maps (Qin_2 Fig. 5), each of the plurality of refinement blocks having an encoder-decoder structure comprising a plurality of second encoder blocks a plurality of second decoder blocks (Qin_2 Fig. 5 En_1, …, En_5, De_1, …De_5), wherein a respective second encoder block in the plurality of second encoder blocks corresponds to one respective second decoder block in the plurality of second decoder blocks (Qin_2 Fig. 5 En_1, …, En_5 corresponding to De_1, …De_5 respectively), and 
the method further comprises, for each refinement block of the plurality of refinement blocks: 
producing, by each second encoder block of the plurality of second encoder blocks, a respective downsampled feature map using the respective second encoder block based on an input feature map received by the respective second encoder block (Qin_2 Fig. 5 En_1, …, En_5 producing respective downsampled feature maps); and 
producing, by each second decoder block of the plurality of second decoder blocks, a respective upsampled feature map using the respective second decoder block based on the respective input feature map and the respective downsampled feature map produced by the respective second encoder block corresponding to the respective second decoder block and received by the respective second decoder block (Qin_2 Fig. 5 De_5, …, En_1 producing respective upsampled feature maps, the skip connections between respective encoder block and decoder block transmitting respective downsampled feature map produced by the respective encoder block to corresponding respective decoder block ).
It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang, Liu and Qin_1’s teaching by incorporating Qin_2’s teaching as noted above. Incorporating such a feature would allow training a deep network from scratch without using backbones from image classification tasks, and achieving comparable or better performance than those based on existing pre-trained backbones as recognized by Qin_2 (Abstract; section 1 “Introduction”).


As per claim 17, dependent upon claim 16, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein the plurality of refinement blocks comprises a plurality of encoder-decoder structures having different heights (Qin_2 Fig. 5 the plurality of refinement blocks comprises a plurality of encoder-decoder structures having different heights, i.e., different scale).

As per claim 18, dependent upon claim 16, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein the plurality of refinement blocks is configured to produce the plurality of refined feature maps by: 
producing, for each refinement block of the plurality of refinement blocks, a respective refined feature map of the plurality of refined feature maps based on the fused feature map received by the respective refinement block and a respective upsampled feature map produced by a respective second decoder block, of the plurality of second decoder blocks, corresponding to the respective refinement block (Qin_1 Fig. 5 the skip connection and operation “                        
                            ⊕
                        
                    ” (concatenation) shows a respective refined feature map is produced based on the fused feature map received by the respective refinement block and a respective upsampled feature map produced by a respective second decoder block .

As per claim 19, dependent upon claim 16, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein: 
producing, for each second encoder block of the plurality of second encoder blocks, the respective downsampled feature map comprises: 
extracting first multi-scale features based on the respective input feature map received by the respective second encoder block (Qin_2 Fig. 5 showing 5 encoder blocks and corresponding decoder blocks. Each of the encoder blocks further comprises an encoder-decoder structure for extracting multi-scale features); and 
producing the respective downsampled feature map based on the extracted first multi-scale features extracted by the respective second encoder block (Qin_2 the 5 encoder blocks sequentially producing downsampled feature maps based on the extracted first multi-scale features), and 
producing, for each second decoder block of the plurality of second decoder blocks, the respective upsampled feature map comprises: 
extracting second multi-scale features based on the respective input feature map and the respective downsampled feature map produced by the respective second encoder block corresponding to the respective second decoder block and received by the respective second decoder block (Qin_2 Fig. 5 showing 5 encoder blocks and corresponding decoder blocks. Each of the decoder blocks further comprises an encoder-decoder structure for extracting multi-scale features); and 
producing the respective upsampled feature map based on the extracted second multi-scale features extracted by the respective second decoder block (Qin_2 the 5 decoder blocks sequentially producing upsampled feature maps based on the extracted second multi-scale features).

As per claim 20, dependent upon claim 16, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein, for a respective refinement block of the plurality of refinement blocks: 
each of the plurality of second encoder blocks corresponding to the respective refinement block comprises at least one convolution layer of the plurality of convolution layers of the CNN (Qin_1 Fig. 5 right panel blue blocks being second encoder blocks (4 stages); page 7482 left col. last 6 lines and right col. first 2 lines: “Our RRM employs the residual encoder-decoder architecture, RRM Ours (see Figs. 2 and 4(c)). Its main architecture is similar but simpler to our predict module. It contains an input layer, an encoder, a bridge, a decoder and an output layer. Different from the predict module, both encoder and decoder have four stages. Each stage only has one convolution layer. Each layer has 64 filters of size 3 × 3 followed by a batch normalization and a ReLU activation function”); and 
producing, by each second encoder block of the plurality of second encoder blocks, the respective downsampled feature map using the respective second encoder block of the respective refinement block (Qin_1 Fig. 5 right panel 4 blue blocks) comprises: 
performing at least one feature extraction operation of the plurality of feature extraction operations using the at least one convolution layer of the respective second encoder block (Qin_1 Fig. 5 right panel 4 blue blocks “Conv+BN+MaxPool”); and 
each of the plurality of second decoder blocks corresponding to the respective refinement block comprises at least one convolution layer of the plurality of convolution layers of the CNN (Qin_1 Fig. 5 right panel right side 3 rightmost green blocks and 1 pink block; page 7482 left col. last 6 lines and right col. first 2 lines: “Our RRM employs the residual encoder-decoder architecture, RRM Ours (see Figs. 2 and 4(c)). Its main architecture is similar but simpler to our predict module. It contains an input layer, an encoder, a bridge, a decoder and an output layer. Different from the predict module, both encoder and decoder have four stages. Each stage only has one convolution layer. Each layer has 64 filters of size 3 × 3 followed by a batch normalization and a ReLU activation function”); and 
producing, by each second decoder block of the plurality of second decoder blocks, the respective upsampled feature map using the respective second decoder block of the respective refinement block (Qin_1 Fig. 5 right panel right side blocks) comprises: 
performing at least one feature extraction operation of the plurality of feature extraction operations using the at least one convolution layer of the respective second decoder block (Qin_1 Fig. 5 right panel right side 4 blocks extracting respective feature maps).

As per claim 21, dependent upon claim 20, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein: 
each convolution layer of each of the plurality of second encoder blocks of the refinement block is one of the plurality of convolution layers of the CNN, and each convolution layer of each of the plurality of second decoder blocks of the refinement block is one of the plurality of convolution layers of the CNN (Qin_1 page 7482 left col. last 6 lines and right col. first 2 lines: “Our RRM employs the residual encoder-decoder architecture, RRM Ours (see Figs. 2 and 4(c)). Its main architecture is similar but simpler to our predict module. It contains an input layer, an encoder, a bridge, a decoder and an output layer. Different from the predict module, both encoder and decoder have four stages. Each stage only has one convolution layer. Each layer has 64 filters of size 3 × 3 followed by a batch normalization and a ReLU activation function”).

As per claim 22, dependent upon claim 16, Wang in view of Liu, Qin_1 and Qin_2  teaches wherein: 
for each of the plurality of refinement blocks: 
each of the plurality of second encoder blocks of the refinement block is configured as a residual block, and each of the plurality of second decoder blocks of the refinement block is configured as a residual block (Qin_1 Fig. 2 right panel; page 7482 left col. last para. “To refine both region and boundary drawbacks in coarse saliency maps, we develop a novel residual refinement module. Our RRM employs the residual encoder-decoder architecture, RRM Ours (see Figs. 2 and 4(c))”).

Claim(s) 24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Liu and Qin_1, as applied above to claim 23, and further in view of DINERSTEIN et al. (US 20200356827 A1, hereafter DINERSTEIN).
As per claim 24, Wang in view of Liu and Qin_1 teaches the output image is produced based on the set of refined feature maps (Qin_1 Fig. 2 “refined map”), but does not teach the output image is produced based on an average of the set of refined feature maps. 
	DINERSTEIN discloses a convolutional neural networks (CNNs) that synthesize middle non-existing frames from pairs of input frames (Abstract). Specifically, a coarse CNN includes a feature extraction sub-network that generates a pair of feature maps that correspond to each image of the pair of images at each level of resolution, an encoder-decoder sub-network that concatenates the pair of feature maps at each level of resolution into a single feature map and processes the single feature map to produce a new feature map with downscaled spatial resolution, a fusion sub-network that merges the new single feature maps at each level of resolution into a single merged feature map by performing a weighted average of the feature maps for each level of resolution (Abstract; Fig. 3; para, [0008], [0056]).
It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang, Liu and Qin_1’s teaching by incorporating DINERSTEIN’s teaching to produce the output image based on an average of the set of refined feature maps. Doing so would allow outputs of all decoders are merged into a coarse IMVF in a locally adaptive fashion as recognized by DINERSTEIN (para. [0056]).  

Claim(s) 26, 28 and 31 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Liu, and further in view of Xie et al. (US 20210177373 A1, hereafter Xie)
	As per claim 26, Wang in view of Liu teaches label images, but does not teach the label image is a labeled ultrasound image including a tissue structure.
	Xie discloses a method for training a neural network for ultrasound image classification and/or segmentation (Abstract). Specially, Xie teaches a fully-convolutional neural network (FIG. 9) that is trained using training data that includes labeled images (e.g., images of the liver and/or kidney overlaid with the boundaries, delineated either manually or with computer assistance, of the different types of tissue). The network may thus be trained to segment an input ultrasound image to produce a segmentation map, in which groups of adjacent pixels determined by the machine-learning algorithm to be associated with a same category (e.g., a same tissue type) are assigned to the same semantic category. The segmentation map may be constructed from a plurality of segmentation masks, each of which corresponds to one of the plurality of categories (para. [0060]). 	
It would have been obvious to a person with ordinary skill in the art before the effective filing date of the claimed invention to have modified Wang and Liu’s teaching by incorporating Xie’s teaching to consider including a labeled ultrasound image to train the CNN.  Doing so would allow tissue structure in an ultrasound image to be classified and segmented for measurement purpose as recognized by Xie (para. [0002]).  
	
As per claim 28, dependent upon claim 27, Wang in view of Liu and Xie teaches wherein the input image is an ultrasound image including a tissue structure (Xie para. [0060]; Fig. 11).

As per claim 31, dependent upon claim 1, Wang in view of Liu and Xie further teaches: segmenting a tissue structure in an ultrasound image using the CNN (CNN) (Xie FIG. 9; FIG. 11-12; para. [0060]), using at least one processor (Xie para. [0064]), the method comprising: 
wherein: 
the input image is the ultrasound image including the tissue structure (Xie Abstract; para. [0060]); and
the output image has the tissue structure segmented and is a result of an inference on the input image using the CNN (Xie para. [0060]-[0062]; FIG. 11-12).

Contact

Any inquiry concerning this communication or earlier communications from the examiner should be directed to XUEMEI G CHEN whose telephone number is (571)270-3480. The examiner can normally be reached Monday-Friday 9am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John M Villecco can be reached at (571) 272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XUEMEI G CHEN/Primary Examiner, Art Unit 2661
Read full office action
Prosecution Timeline

Oct 25, 2023
Application Filed
Feb 06, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/130,165
Patent 12632945
IMAGE PROCESSING METHOD AND APPARATUS, SYSTEM, AND STORAGE MEDIUM
3y 1m to grant Granted May 19, 2026
18/325,976
Patent 12633095
AUTOMATIC EFFICIENT SMALL MODEL SELECTION FOR MONOCULAR DEPTH ESTIMATION
2y 11m to grant Granted May 19, 2026
18/536,424
Patent 12626339
IMAGE PROCESSING METHOD AND IMAGE PROCESSING DEVICE
2y 5m to grant Granted May 12, 2026
18/561,478
Patent 12626386
BENDING ANGLE DETERMINING METHOD AND DETERMINING DEVICE
2y 5m to grant Granted May 12, 2026
18/600,128
Patent 12620251
SYSTEMS AND METHODS FOR AUTOMATIC DETECTION OF FEATURES ON A SHEET
2y 1m to grant Granted May 05, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Strategy Recommendation AI-generated — please review before filing

Get a prosecution strategy drawn from examiner precedents, rejection analysis, and claim mapping.
Typically takes 5-10 seconds — AI-generated, attorney review required before filing
Prosecution Projections

1-2
Expected OA Rounds
77%
Grant Probability
99%
With Interview (+25.4%)
2y 7m (~0m remaining)
Median Time to Grant
Low
PTA Risk
Based on 579 resolved cases by this examiner. Grant probability derived from career allowance rate.