DETAILED ACTION
Contents
Notice of Pre-AIA or AIA Status
Claim Rejections - 35 USC § 102
Claim Rejections - 35 USC § 103
Allowable Subject Matter
Conclusion
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is responsive to applicant’s claim set received on 3/4/24. Claims 1-20 are currently pending.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless -
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim 1 is rejected under 35 U.S.C. 102(a)(1) as being anticipated by Dai (CV: “CoAtNet: Marrying Convolution and Attention for All Data Sizes”). Regarding claim 1, Dai discloses an image processing method, performed by a computer device, the method comprising: performing vectorization on a to-be-processed image, to obtain an image representation vector (see 2.2, fig. 4, appendix a.1; As we have discuss above, the global context has a quadratic complexity w.r.t. the spatial size. Hence, if we directly apply the relative attention in Eqn. (3) to the raw image input, the computation will be excessively slow due to the large number of pixels in any image of common sizes….. The first stage S0 is a simple 2-layer convolutional Stem and S1 always employs MBConv blocks with squeeze-excitation (SE), as the spatial size is too large for global attention. Starting from S2 through S4, we consider either the MBConv or the Transformer block, with a constraint that convolution stages must appear before Transformer stages. The constraint is based on the prior that convolution is better at processing local patterns that are more common in early stages. This leads to 4 variants with increasingly more Transformer stages, C-C-C-C, C-C-C-T, C-C-T-T and C-T-T-T, where C and T denote Convolution and Transformer respectively);
performing feature mapping on the image representation vector by using a network block comprised in a feature mapping module in an image classification model (see 2.2; When the ViT Stem is used, we directly stack L Transformer blocks with relative attention, which we denote as VITREL. • When the multi-stage layout is used, we mimic ConvNets to construct a network of 5 stages (S0, S1, S2, S3 & S4), with spatial resolution gradually decreased from S0 to S4. At the beginning of each stage, we always reduce the spatial size by 2x and increase the number of channels (see Appendix A.1 for the detailed down-sampling implementation). The first stage S0 is a simple 2-layer convolutional Stem and S1 always employs MBConv blocks with squeeze-excitation (SE), as the spatial size is too large for global attention. Starting from S2 through S4, we consider either the MBConv or the Transformer block, with a constraint that convolution stages must appear before Transformer stages. The constraint is based on the prior that convolution is better at processing local patterns that are more common in early stages. 
This leads to 4 variants with increasingly more Transformer stages, C-C-C-C, C-C-C-T, C-C-T-T and C-T-T-T, where C and T denote Convolution and Transformer respectively), to obtain an image feature, including: performing, at a same network layer of a network block, global feature mapping on input content by using the network layer to obtain a global feature (see 2.1; In comparison, self-attention allows the receptive field to be the entire spatial locations and computes the weights based on the re-normalized pairwise similarity between the pair (xi , xj ): 2 yi = X j∈G exp), and performing local feature mapping on the input content by using the network layer to obtain a local feature, the input content being obtained based on the image representation vector (see 2.1; For convolution, we mainly focus on the MBConv block [27] which employs depthwise convolution [28] to capture the spatial interaction. A key reason of this choice is that both the FFN module in Transformer and MBConv employ the design of “inverted bottleneck”, which first expands the channel size of the input by 4x and later project the the 4x-wide hidden state back to the original channel size to enable residual connection. Besides the similarity of inverted bottleneck, we also notice that both depthwise convolution and self-attention can be expressed as a per-dimension weighted sum of values in a pre-defined receptive field. Specifically, convolution relies on a fixed kernel to gather information from a local receptive field yi = X j∈L(i) wi−j xj (depthwise convolution), (1) where xi , yi ∈ R D are the input and output at position i respectively, and L(i) denotes a local neighborhood of i, e.g., a 3x3 grid centered at i in image processing);
performing feature fusion on the global feature and the local feature by the network layer, to obtain a fused feature corresponding to the network layer (see 2.1; Given the comparison above, an ideal model should be able to combine the 3 desirable properties in Table 1. With the similar form of depthwise convolution in Eqn. (1) and self-attention in Eqn. (2), a straightforward idea that could achieve this is simply to sum a global static convolution kernel with the adaptive attention matrix, either after or before the Softmax normalization, i.e., y post i = X j∈G exp …Besides the similarity of inverted bottleneck, we also notice that both depthwise convolution and self-attention can be expressed as a per-dimension weighted sum of values in a pre-defined receptive field. Specifically, convolution relies on a fixed kernel to gather information from a local receptive field yi = X j∈L(i) wi−j xj (depthwise convolution), (1) where xi , yi ∈ R D are the input and output at position i respectively, and L(i) denotes a local neighborhood of i, e.g., a 3x3 grid centered at i in image processing. 2 In comparison, self-attention allows the receptive field to be the entire spatial locations and computes the weights based on the re-normalized pairwise similarity between the pair (xi , xj ): 2 yi = X j∈G exp);
obtaining an image feature based on the fused feature corresponding to the network layer (see 2.1, 2.2; 2) a multi-stage network with gradual pooling as in ConvNets. With these choices, we derive a search space of 5 variants and compare them in controlled experiments. • When the ViT Stem is used, we directly stack L Transformer blocks with relative attention, which we denote as VITREL. • When the multi-stage layout is used, we mimic ConvNets to construct a network of 5 stages (S0, S1, S2, S3 & S4), with spatial resolution gradually decreased from S0 to S4. At the beginning of each stage, we always reduce the spatial size by 2x and increase the number of channels (see Appendix A.1 for the detailed down-sampling implementation); and performing category prediction by using a classification module in the image classification model based on the image feature, to obtain a classification result of the to-be-processed image (see pg. 14-15; we apply global average pooling to the last-stage output to get the representation for simplicity.).
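For illustrative purposes only, the mapping applied in the rejection above can be sketched in code: a single layer that computes a global (self-attention) feature and a local (depthwise-convolution-style) feature from the same input, fuses them by summation (as in Dai's pre/post-Softmax summation idea), and performs category prediction after global average pooling. This is a minimal approximation, not code from Dai or from the application; all function and variable names are hypothetical.

```python
import numpy as np

def global_feature(x):
    # x: (n, d) token sequence; single-head dot-product self-attention,
    # so each output position attends over all spatial locations
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def local_feature(x, kernel):
    # per-dimension weighted sum over a local 1-D neighborhood,
    # analogous to depthwise convolution with a fixed kernel
    n = x.shape[0]
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([sum(kernel[k] * xp[i + k] for k in range(len(kernel)))
                     for i in range(n)])

def hybrid_layer(x, kernel):
    # feature fusion of the global and local branches by summation
    return global_feature(x) + local_feature(x, kernel)

def classify(x, w_cls, kernel):
    feat = hybrid_layer(x, kernel).mean(axis=0)  # global average pooling
    return int(np.argmax(feat @ w_cls))          # category prediction
```

The sketch corresponds only loosely to the cited passages (Eqns. (1)-(2) and the last-stage pooling in Dai) and omits multi-stage structure, normalization, and training.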
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Dai (CV: “CoAtNet: Marrying Convolution and Attention for All Data Sizes”) in view of Xin et al. (US 2024/0037911 A1).
Regarding claim 10, Dai teaches a method comprising all of the limitations set forth in the rejection of claim 1 above, which is incorporated herein (see the citations to Dai at 2.1, 2.2, fig. 4, appendix A.1, and pgs. 14-15 in the rejection of claim 1). Dai does not expressly teach a computer device, comprising a processor and a memory, the memory being configured to store program code and transmit the program code to the processor; and the processor being configured to perform a method according to instructions in the program code.
Xin, in the same field of endeavor, teaches a computer device, comprising a processor and a memory, the memory being configured to store program code and transmit the program code to the processor; and the processor being configured to perform a method according to instructions in the program code, comprising (see 0067, 0007; second aspect of the disclosure, provided is an image classification apparatus, including: a first obtaining module, configured to extract a first image feature of a target image by using a first network model, where the first network model includes a convolutional neural network module; a second obtaining module, configured to extract a second image feature of the target image by using a second network model, where the second network model includes a deep self-attention transformer network (Transformer) module; a feature fusion module, configured to fuse the first image feature and the second image feature to obtain a target feature to be recognized; and a classification module, configured to classify the target image based on the target feature to be recognized.[0007] According to a third aspect of the disclosure, provided is provided is an electronic device, including: at least one processor; and a memory connected in communication with the at least one processor. The memory stores an instruction executable by the at least one processor, and the instruction, when executed by the at least one processor, enables the at least one processor to execute the method of the first aspect of the present disclosure.).
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Dai to utilize the cited limitations as suggested by Xin. The suggestion/motivation for doing so would have been to enhance image classification accuracy (see Xin, 0010). Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a fundamental operating principle of Dai, while the teaching of Xin continues to perform the same function as originally taught prior to being combined, in order to produce a repeatable and predictable result. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claim in question.
Regarding claim 19, Dai teaches a method comprising all of the limitations set forth in the rejection of claim 1 above, which is incorporated herein (see the citations to Dai at 2.1, 2.2, fig. 4, appendix A.1, and pgs. 14-15 in the rejection of claim 1). Dai does not expressly teach a non-transitory computer-readable storage medium, configured to store program code, the program code, when executed by a processor, causing the processor to perform the method.
Xin, in the same field of endeavor, teaches a non-transitory computer-readable storage medium, configured to store program code, the program code, when executed by a processor, causing the processor to perform the method comprising (see 0067, 0007; second aspect of the disclosure, provided is an image classification apparatus, including: a first obtaining module, configured to extract a first image feature of a target image by using a first network model, where the first network model includes a convolutional neural network module; a second obtaining module, configured to extract a second image feature of the target image by using a second network model, where the second network model includes a deep self-attention transformer network (Transformer) module; a feature fusion module, configured to fuse the first image feature and the second image feature to obtain a target feature to be recognized; and a classification module, configured to classify the target image based on the target feature to be recognized.[0007] According to a third aspect of the disclosure, provided is provided is an electronic device, including: at least one processor; and a memory connected in communication with the at least one processor. The memory stores an instruction executable by the at least one processor, and the instruction, when executed by the at least one processor, enables the at least one processor to execute the method of the first aspect of the present disclosure.).
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Dai to utilize the cited limitations as suggested by Xin. The suggestion/motivation for doing so would have been to enhance image classification accuracy (see Xin, 0010). Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a fundamental operating principle of Dai, while the teaching of Xin continues to perform the same function as originally taught prior to being combined, in order to produce a repeatable and predictable result. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claim in question.
Allowable Subject Matter
Claims 2-9, 11-18, 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding claims 2-4, 11-13, 20, none of the references of record alone or in combination suggest or fairly teach wherein the performing feature fusion on the global feature and the local feature by using the network layer, to obtain a fused feature corresponding to the network layer comprises: determining weight values respectively corresponding to the global feature and the local feature, wherein a sum of a weight value of the global feature and a weight value of the local feature is 1; and performing weighted summation on the global feature and the local feature based on the weight values, to obtain the fused feature.
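For illustrative purposes only, the weighted fusion recited above can be sketched as a scalar gate that assigns weight a to the global feature and 1 - a to the local feature, so that the two weights sum to 1. This sketch is not code from the application or the cited references; the names gated_fusion and gate_param are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(global_feat, local_feat, gate_param):
    a = sigmoid(gate_param)   # weight value of the global feature
    b = 1.0 - a               # weight value of the local feature; a + b == 1
    # weighted summation of the two features based on the weight values
    return a * global_feat + b * local_feat
```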
Regarding claims 5, 14, none of the references of record alone or in combination suggest or fairly teach wherein the method further comprises: performing convolutional feature mapping on the image representation vector by using the network layer, to obtain a convolutional feature; and the performing feature fusion on the global feature and the local feature by using the network layer, to obtain a fused feature corresponding to the network layer comprises: performing feature fusion on the global feature, the local feature, and the convolutional feature by using the network layer, to obtain the fused feature corresponding to the network layer.
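For illustrative purposes only, the three-feature fusion recited above can be sketched with a softmax over three learned scores, which keeps the weights of the global, local, and convolutional features non-negative and summing to 1. This is a hypothetical sketch, not code from the application or the cited references.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def three_way_fusion(global_feat, local_feat, conv_feat, logits):
    # one weight per feature; softmax makes the three weights sum to 1
    w = softmax(np.asarray(logits, dtype=float))
    return w[0] * global_feat + w[1] * local_feat + w[2] * conv_feat
```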
Regarding claims 6, 15, none of the references of record alone or in combination suggest or fairly teach wherein the feature mapping module comprises a plurality of network blocks, each network block is connected to a down-sampling module, the plurality of network blocks comprises a first network block and a second network block, and the performing feature mapping on the image representation vector by using a network block comprised in a feature mapping module in an image classification model, to obtain an image feature of the to-be-processed image comprises: performing feature mapping on the image representation vector by using the first network block, to obtain a first feature map of the to-be-processed image; down-sampling the first feature map by using a down-sampling module connected to the first network block, to obtain a first-scale feature map; performing feature mapping on the first-scale feature map by using the second network block, to obtain a second feature map of the to-be-processed image; down-sampling the second feature map by using a down-sampling module connected to the second network block, to obtain a second-scale feature map; and obtaining the image feature of the to-be-processed image based on the second-scale feature map.
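For illustrative purposes only, the two-block pipeline recited above can be sketched as a pair of network blocks, each followed by a down-sampling module that halves the spatial size. The blocks here are placeholder identity maps and every name is hypothetical; this is not code from the application or the cited references.

```python
import numpy as np

def network_block(feature_map):
    # stand-in for feature mapping performed by a network block
    return feature_map

def downsample(feature_map):
    # 2x2 average pooling halves height and width
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def two_stage(image_repr):
    first = downsample(network_block(image_repr))   # first-scale feature map
    second = downsample(network_block(first))       # second-scale feature map
    return second
```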
Regarding claims 7-8, 16-17, none of the references of record alone or in combination suggest or fairly teach wherein the performing vectorization based on the to-be-processed image, to obtain an image representation vector of the to-be-processed image comprises: cropping and dividing the to-be-processed image based on a patch size, to obtain a plurality of image patches; performing data structure mapping on the plurality of image patches, to obtain one-dimensional structured data of the to-be-processed image; and performing vectorization on the one-dimensional structured data, to obtain the image representation vector.
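For illustrative purposes only, the vectorization recited above can be sketched as cropping the image to a multiple of the patch size, dividing it into patches, flattening each patch to one-dimensional structured data, and projecting to representation vectors. This is a hypothetical sketch, not code from the application or the cited references.

```python
import numpy as np

def to_patches(image, patch):
    # crop to a multiple of the patch size, then divide into image patches
    h = (image.shape[0] // patch) * patch
    w = (image.shape[1] // patch) * patch
    img = image[:h, :w]
    return [img[i:i + patch, j:j + patch]
            for i in range(0, h, patch) for j in range(0, w, patch)]

def vectorize(image, patch, proj):
    # flatten each patch to 1-D structured data, then project to vectors
    flat = np.stack([p.reshape(-1) for p in to_patches(image, patch)])
    return flat @ proj   # image representation vectors, one per patch
```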
Regarding claims 9, 18, none of the references of record alone or in combination suggest or fairly teach wherein the image classification model further comprises a fully connected layer, and the performing category prediction by using a classification module in the image classification model based on the image feature, to obtain a classification result of the to-be-processed image comprises: performing fully connected calculation on the image feature by using the fully connected layer, and mapping the image feature into a classification quantity length; and performing category prediction by using the classification module based on the image feature of the classification quantity length, to obtain the classification result.
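For illustrative purposes only, the fully connected calculation recited above can be sketched as a linear layer mapping the image feature to a vector whose length equals the number of classes (the "classification quantity length"), followed by category prediction. This is a hypothetical sketch, not code from the application or the cited references.

```python
import numpy as np

def predict(image_feature, fc_weight, fc_bias):
    # fully connected layer maps the feature to one logit per class
    logits = image_feature @ fc_weight + fc_bias
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # class probabilities sum to 1
    return int(np.argmax(probs)), probs  # predicted category, probabilities
```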
Conclusion
Claims 1, 10, and 19 are rejected. Claims 2-9, 11-18, and 20 are objected to as being dependent upon a rejected base claim.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EDWARD PARK. The examiner’s contact information is as follows:
Telephone: (571) 270-1576 | Fax: (571) 270-2576 | Edward.Park@uspto.gov
For email communications, please note MPEP 502.03, which outlines procedures pertaining to communications via the internet and authorization. A sample authorization form is cited within MPEP 502.03, section II.
The examiner can normally be reached on M-F 9-6 CST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Moyer, can be reached on (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/EDWARD PARK/
Primary Examiner, Art Unit 2666