Prosecution Insights
Last updated: April 19, 2026
Application No. 17/342,486

MULTI-RESOLUTION NEURAL NETWORK ARCHITECTURE SEARCH SPACE FOR DENSE PREDICTION TASKS

Non-Final Office Action — §103, §112

Filed: Jun 08, 2021
Examiner: TRAN, AMY NMN
Art Unit: 2126
Tech Center: 2100 — Computer Architecture & Software
Assignee: Lemon Inc.
OA Round: 3 (Non-Final)
Grant Probability: 36% (At Risk)
OA Rounds: 3-4
To Grant: 5y 2m
With Interview: 84%

Examiner Intelligence

Career Allow Rate: 36% (10 granted / 28 resolved; -19.3% vs TC avg) — grants only 36% of cases
Interview Lift: +47.9% (based on resolved cases with interview) — strong interview lift
Typical Timeline: 5y 2m avg prosecution; 24 applications currently pending
Career History: 52 total applications across all art units

Statute-Specific Performance

§101: 32.5% (-7.5% vs TC avg)
§103: 44.2% (+4.2% vs TC avg)
§102: 6.0% (-34.0% vs TC avg)
§112: 15.6% (-24.4% vs TC avg)

Tech Center averages are estimates. Based on career data from 28 resolved cases.

Office Action

§103 §112
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 08/05/2025 has been entered.

Response to Amendment

The amendments filed on 08/05/2025 have been entered. The status of the claims is as follows: Claims 1-20 remain pending in the application. Claims 1, 3, 9 and 15 are amended.

Response to Arguments

In reference to the rejections under 35 U.S.C. 112(a) – New Subject Matter: Applicant's amendments reciting that "the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another" do not overcome the 35 U.S.C. 112 rejections because the limitation lacks written description support in the originally filed Specification. Paragraph [0038] of the Instant Specification discloses that the searching blocks may be the same or different, which means the searching blocks are not required to be structurally or functionally separate or distinct. Accordingly, the amendments introduce subject matter that was not originally disclosed, and the rejection under 35 U.S.C. 112(a) is maintained.

In reference to the rejections under 35 U.S.C. 112(b): Applicant's amendments reciting that "the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another" do not overcome the rejection under 35 U.S.C. 112(b), because the claims additionally recite, at claim 1 (line 14), claim 9 (lines 20-21), and claim 15 (line 17), that "wherein the plurality of searching blocks are interconnected with one another". The recitation that the searching blocks are both "separate and distinct" while simultaneously being "interconnected" renders the scope of the claims unclear because these limitations are logically inconsistent without further clarification as to the nature and extent of the separation and the interconnection. It is therefore unclear whether the searching blocks are required to be structurally independent, functionally independent, or merely logically partitioned while still operating as a single integrated component. Accordingly, one of ordinary skill in the art would not be reasonably apprised of the metes and bounds of the claimed invention, and the rejection under 35 U.S.C. 112(b) is therefore maintained.

In reference to the rejections under 35 U.S.C. 103: Applicant asserts at pages 10-11 of the Remarks that Zhang, Zoph and Mao, whether taken individually or in combination, fail to disclose or render obvious at least "a first parallel module including a first plurality of stacked searching blocks and a second plurality of stacked searching blocks, wherein the first plurality of stacked searching blocks is configured to output first feature maps of a first resolution and the second plurality of stacked searching blocks is configured to output second feature maps of a second resolution, wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another," as recited in claim 1. Examiner respectfully notes that Applicant's remarks do not constitute a substantive traversal of the rejection. The response merely summarizes what Zhang, Zoph and Mao disclose, but fails to address the Office Action's specific findings as to how the cited references teach the disputed limitations, and provides no explanation as to why the articulated rationale for relying on Zhang, Zoph and Mao is in error. Specifically, Applicant merely characterizes Mao as a conventional DS-FPN implementation and notes example feature-map constructions (e.g., C2-C4 to F2-F4), but does not explain how this purported conventionality negates the Office Action's mappings. The rejection does not rely on Mao for novelty of FPN per se, but for its disclosure of dual-stream, parallel DS-Blocks at feature pyramid scales, i.e., blocks that are separate and distinct from one another and that generate feature maps of different resolutions. Mao expressly teaches DS-Blocks that generate two feature maps with different resolutions, which reasonably corresponds, under the broadest reasonable interpretation, to a plurality of blocks that are separate and distinct from one another. Therefore, the rejection is maintained as set forth in the previous Office action. Applicant's arguments filed on 08/05/2025 have been fully considered but they are not persuasive.

Claim Rejections - 35 USC § 112(a) – New Matter

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):

(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:

The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
The recitation of "wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another" in claims 1, 9 and 15 is not supported by the specification. Examiner respectfully notes that the originally filed specification and drawings fail to support this newly amended limitation. In particular, paragraph [0038] of the Instant Specification discloses that the searching blocks may be the same or different, which means the searching blocks are not required to be structurally or functionally separate or distinct. Without support in the specification or a structural distinction in the drawings, the assertion of independence constitutes new matter, and the claims are therefore rejected under 35 U.S.C. 112(a). Dependent claims 2-8, 10-14 and 16-20 inherit the deficiencies of the independent claims and are therefore also rejected under the same rationale.

Claim Rejections - 35 USC § 112(b)

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Regarding claim 1, the claim recites "wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another" and "wherein the plurality of searching blocks are interconnected with one another" (line 14). It is unclear how the searching blocks can be both separate and distinct from one another and interconnected with one another, as these terms are logically inconsistent without further clarification. The claims fail to provide sufficient context to distinguish whether these limitations refer to the same searching blocks or to different searching blocks. As a result, a person of ordinary skill in the art cannot reasonably ascertain the scope of the claimed invention, and it remains unclear how the searching blocks can simultaneously be separate, distinct and interconnected. For the purpose of examination, the "searching blocks" are being interpreted as interconnected because FIG. 3 shows arrows between the searching blocks, which suggests a data flow or sequential operation. Applicant is advised to amend the claim to clarify the distinction and avoid internal inconsistency.

Regarding claim 9, the claim recites "wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another" and "wherein the plurality of searching blocks are interconnected with one another" (lines 20-21). It is unclear how the searching blocks can be both separate and distinct from one another and interconnected with one another, as these terms are logically inconsistent without further clarification. The claims fail to provide sufficient context to distinguish whether these limitations refer to the same searching blocks or to different searching blocks. As a result, a person of ordinary skill in the art cannot reasonably ascertain the scope of the claimed invention, and it remains unclear how the searching blocks can simultaneously be separate, distinct and interconnected. For the purpose of examination, the "searching blocks" are being interpreted as interconnected because FIG. 3 shows arrows between the searching blocks, which suggests a data flow or sequential operation. Applicant is advised to amend the claim to clarify the distinction and avoid internal inconsistency.

Regarding claim 15, the claim recites "wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another" and "wherein the plurality of searching blocks are interconnected with one another" (line 17). It is unclear how the searching blocks can be both separate and distinct from one another and interconnected with one another, as these terms are logically inconsistent without further clarification. The claims fail to provide sufficient context to distinguish whether these limitations refer to the same searching blocks or to different searching blocks. As a result, a person of ordinary skill in the art cannot reasonably ascertain the scope of the claimed invention, and it remains unclear how the searching blocks can simultaneously be separate, distinct and interconnected. For the purpose of examination, the "searching blocks" are being interpreted as interconnected because FIG. 3 shows arrows between the searching blocks, which suggests a data flow or sequential operation. Applicant is advised to amend the claim to clarify the distinction and avoid internal inconsistency.

Dependent claims 2-8, 10-14 and 16-20 inherit the deficiencies of the independent claims and are therefore also rejected under the same rationale.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2 and 5-8 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al ("DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation") in view of Mao et al ("Dual-stream Network for Visual Recognition") and further in view of Zoph et al (US 2021/0081796 A1).

Regarding claim 1, Zhang explicitly discloses:

a first parallel module including a first plurality of stacked searching blocks and a second plurality of stacked searching blocks, (Zhang, Page 2, Figure 1: "Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Upsample block unchanged. Dashed lines represent candidate connections in DCSS, to keep clarity, we only demonstrate several connections among all the candidates."; Page 4, Col. 1, ¶[2]: "As illustrated by the bottom left part in Figure 1, the fusion module consists of the shape-alignment layer besides the mixture layer. The shape alignment layer is in a multi-branch parallel form.") [Examiner's note: Figure 1 discloses a search space with a multi-branch parallel module including multiple mixture layers (i.e., the stacked searching blocks).]

wherein the first plurality of stacked searching blocks is configured to output first feature maps of a first resolution and the second plurality of stacked searching blocks is configured to output second feature maps of a second resolution; (Zhang, Page 2, Figure 1: "Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones."; Page 4, Col. 2, ¶[2]: "The shape-alignment layer is in a multi-branch parallel form. Given an array of feature-maps with different shapes (e.g., spatial resolutions and channel widths), the shape-alignment layer dispatches each feature-map to the corresponding branches to align them to the target shape."; Page 6, Col. 1, Section 4.1, ¶[1]: "In our experiments, for the shape of feature maps, we set the spatial resolution space S to be {1/4; 1/8; 1/16; 1/32}, and the corresponding widths are set to F, 2F, 4F, 8F, where we set F to be 64 for our best model".) [Examiner's note: "output first feature maps of a first resolution" and "output second feature maps of a second resolution" i.e., dispatching each feature-map to the corresponding branches, which have respective resolutions {1/4; 1/8; 1/16; 1/32}.]

a fusion module including a plurality of searching blocks, wherein the fusion module is configured to generate multiscale feature maps by fusing one or more feature maps of the first resolution received from the first parallel module with one or more feature maps of the second resolution received from the first parallel module, and (Zhang, Page 2, Figure 1: "Framework of DCNAS... Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones."; Page 4, Col. 1, ¶[2]: "Fusion Module. To explore various paths in DCSS, we introduce the fusion module with the ability of aggregating semantic features from preceding fusion modules besides the feature pyramids and attaching transformed semantic features to succeeding ones… Given an array of feature-maps with different shapes (e.g., spatial resolutions and channel widths), the shape-alignment layer dispatches each feature-map to the corresponding branches to align them to the target shape. Semantic features are well-aligned and fully aggregated, then feed into the mixture layer to perform efficient multi-scale features fusion.") [Examiner's note: "a fusion module including a plurality of searching blocks" i.e., Figure 1 discloses a fusion module which includes multiple mixture layers (i.e., the stacked searching blocks). The cited passages show that the fusion module performs multi-scale feature fusion by aggregating (i.e., fusing) feature maps of corresponding resolutions (i.e., dispatches each feature-map to the corresponding branches).]

wherein the fusion module is configured to output the multiscale feature maps and output third feature maps of a third resolution; and (Zhang, Page 4, Col. 1, ¶[2]: "Fusion Module. To explore various paths in DCSS, we introduce the fusion module with the ability of aggregating semantic features from preceding fusion modules besides the feature pyramids and attaching transformed semantic features to succeeding ones… Given an array of feature-maps with different shapes (e.g., spatial resolutions and channel widths), the shape-alignment layer dispatches each feature-map to the corresponding branches to align them to the target shape. Semantic features are well-aligned and fully aggregated, then feed into the mixture layer to perform efficient multi-scale features fusion."; Page 2, Figure 1: "Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones.") [Examiner's note: The fusion module (bottom left in Fig. 1) aggregates feature maps from multiple scales (e.g., resolutions 1/4, 1/8, 1/16, 1/32) using element-wise addition, creating a unified multiscale representation. The third feature map of the third resolution (e.g., 1/16) is produced by selecting and operating on the fused multiscale features at the desired resolution.]

a second parallel module configured to receive the multiscale feature maps and the third feature maps of the third resolution from the fusion module, and output fourth feature maps of the first resolution, fifth feature maps of the second resolution, and sixth feature maps of the third resolution. (Zhang, Page 2, Figure 1: "Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Up sample block unchanged. Dashed lines represent candidate connections in DCSS, to keep clarity, we only demonstrate several connections among all the candidates. Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones. Bottom Right: Mixture layer may further save GPU memory consumption and accelerating the searching process by sampling and operating on a portion of features while bypassing the others.") [Examiner's note: The fusion module (bottom left of Fig. 1) aggregates feature maps from various scales, combining multiscale information into a unified feature map (ft/st). The densely connected pathways in the search space enable the fused feature map of the third resolution (e.g., 1/16) to be passed to a parallel mixture layer. In the mixture layer (bottom right of Fig. 1), the received multiscale feature maps are split and processed through separable convolutions of varying kernel sizes. These processed features are then concatenated to form new output feature maps. Through this process, the mixture layer dynamically generates the fourth feature maps at the first resolution (e.g., 1/4), the fifth feature maps at the second resolution (e.g., 1/8) and the sixth feature maps at the third resolution by operating on both the fused multiscale maps and resolution-specific inputs from the fusion module.]

Zhang fails to disclose:

A system for neural architecture search, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the system to implement a search space comprising: wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; a fusion module including a plurality of searching blocks, wherein the plurality of searching blocks are interconnected with one another, and a second parallel module including a third plurality of stacked searching blocks and a fourth plurality of stacked searching blocks, the second parallel module configured to receive the multiscale feature maps and the third feature maps of the third resolution from the fusion module, and wherein each of the searching blocks of the first plurality of stacked searching blocks, the second plurality of stacked searching blocks, the third plurality of stacked searching blocks, the fourth plurality of stacked searching blocks, and the plurality of searching blocks includes a plurality of neural network layers and a transformer.

However, Zoph explicitly discloses:

A system for neural architecture search, the system comprising: a processor; and (Zoph, ¶[0088]: "Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both.")

memory storing instructions that, when executed by the processor, cause the system to implement a search space comprising: (Zoph, ¶[0089]: "Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.")

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Zoph.
Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Zoph teaches a method of neural architecture search for dense image prediction tasks. One of ordinary skill would have had motivation to combine Zhang and Zoph because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" – choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

However, Mao explicitly discloses:

wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; (Mao, Page 2, ¶[2]: "In this paper, we address the these issues by introducing a Dual-stream Network (DS-Net). Instead of single stream architecture as previous works, our DS-Net adopts Dual-stream Blocks (DS-Blocks), which generates two feature maps with different resolutions, and retains both local and global information of the image via two parallel branches."; Page 6, Section 3.4, Figure 3; Section 3.4, ¶[1]: "we apply our Dual-stream design into FPN, named Dual-stream Feature Pyramid Networks(DS-FPN), by simply adding DS-Blocks to every feature pyramid scale. In this way, DS-FPN is able to better attain non-local patterns and local details with marginal increased costs at all scales, which further enhance the performance of subsequent object detection and segmentation heads.") [Examiner's note: "dual-stream blocks" are being interpreted as the "stacked searching blocks" because they have a similar function of generating feature maps with different resolutions. The term "simply adding DS-Blocks to every feature pyramid scale" indicates parallel and independent blocks.]

a fusion module including a plurality of searching blocks, wherein the plurality of searching blocks are interconnected with one another, (Mao, Page 6, ¶[1]: "Such a bidirectional information flow is able to identify cross-scale relations between local and global tokens, by which dual-scale features are highly aligned and coupled with each other. After this, we could safely up sample low-resolution representation hG, concatenate it with high-resolution hL and perform 1 by 1 convolution for channel-wise dual-scale information fusion.") [Examiner's note: the cited passage discloses the concept of a fusion module that includes a plurality of components which are interconnected through bidirectional information flow. It explains that local and global features are aligned and combined via concatenation and 1 x 1 convolution, indicating that the blocks are interconnected when performing channel-wise dual-scale fusion.]

and a second parallel module including a third plurality of stacked searching blocks and a fourth plurality of stacked searching blocks, the second parallel module configured to receive the multiscale feature maps and the third feature maps of the third resolution from the fusion module, and (Mao, Page 2, ¶[2]: "In this paper, we address the these issues by introducing a Dual-stream Network (DS-Net). Instead of single stream architecture as previous works, our DS-Net adopts Dual-stream Blocks (DS-Blocks), which generates two feature maps with different resolutions, and retains both local and global information of the image via two parallel branches. We propose a Intra-scale Propagation module here to process two feature maps."; Page 4, Figure 2.)

wherein each of the searching blocks of the first plurality of stacked searching blocks, the second plurality of stacked searching blocks, the third plurality of stacked searching blocks, the fourth plurality of stacked searching blocks, and the plurality of searching blocks includes a plurality of neural network layers and a transformer. (Mao, Page 6, Figure 3; Page 6, Section 3.4, ¶[2-3]: "Fig 3(b) shows our DS-FPN pipeline in RetinaNet. Similar to FPN in Fig 3(a), we take image features from various scales from the backbone as input, and output corresponding refined feature maps of fixed channels number by a top-down aggregation methods. Our structure is composed of bottom-up pathways, Dual-stream lateral connections, and top-down pathways… Specifically, we denote feature maps of bottom-up pathways as {C2, C3, C4}, where C1 is ignored due to its low semantics and high resolutions. After processed by lateral DS-Blocks and aggregated with the upsampled features from top-down pathway, we obtain final {F2, F3, F4}. When extra feature outputs are needed, FPN utilizes 3 x 3 convolution with stride 2 to obtain F5 from C4, while DS-FPN adopts 2 x 2 patches embedding to down sample the image and follows a DS-Block to generate dual-scale features."; Page 8, Section 4.4, ¶[1]: "Settings. Experiments of DS-FPN are implemented on DS-Net-T and Swin Transformer [22] for object detection and instance segmentation on MSCOCO 2017 [20], by replacing the original FPN with DS-FPN.") [Examiner's note: The cited passage indicates that features are extracted from at least three different layers in the backbone (C2, C3, C4), which are typical outputs of intermediate layers in CNNs.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Mao. Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Mao teaches a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. One of ordinary skill would have had motivation to combine Zhang and Mao to improve the model's ability to identify optimal sub-architectures for high-resolution tasks by combining local feature refinement with long-range contextual awareness.
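For orientation, the following is a minimal, hypothetical PyTorch-style sketch of the kind of structure recited in claim 1 and mapped above: two separate stacks of searching blocks producing feature maps at two resolutions, a fusion module that exchanges information between them, and a second parallel module operating on the fused outputs. All class, layer and variable names are illustrative assumptions; the sketch is not taken from the application or from Zhang, Mao or Zoph, and it omits the architecture-search machinery entirely.

```python
import torch
import torch.nn as nn

class SearchBlock(nn.Module):
    """Illustrative stand-in for a 'searching block': a small stack of conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x) + x  # residual connection, purely illustrative

class ParallelModule(nn.Module):
    """Two separate, distinct stacks of blocks, one per resolution branch."""
    def __init__(self, ch_hi, ch_lo, depth=2):
        super().__init__()
        self.branch_hi = nn.Sequential(*[SearchBlock(ch_hi) for _ in range(depth)])
        self.branch_lo = nn.Sequential(*[SearchBlock(ch_lo) for _ in range(depth)])
    def forward(self, x_hi, x_lo):
        # first feature maps (first resolution), second feature maps (second resolution)
        return self.branch_hi(x_hi), self.branch_lo(x_lo)

class FusionModule(nn.Module):
    """Aligns the two resolutions and fuses them into multiscale feature maps."""
    def __init__(self, ch_hi, ch_lo):
        super().__init__()
        self.down = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)   # hi-res -> lo-res grid
        self.up = nn.Sequential(
            nn.Conv2d(ch_lo, ch_hi, 1),
            nn.Upsample(scale_factor=2, mode="nearest"),               # lo-res -> hi-res grid
        )
    def forward(self, f_hi, f_lo):
        fused_hi = f_hi + self.up(f_lo)    # multiscale map at the first resolution
        fused_lo = f_lo + self.down(f_hi)  # multiscale map at the second resolution
        return fused_hi, fused_lo

if __name__ == "__main__":
    x_hi = torch.randn(1, 32, 64, 64)   # e.g. 1/4-resolution features
    x_lo = torch.randn(1, 64, 32, 32)   # e.g. 1/8-resolution features
    p1, fuse, p2 = ParallelModule(32, 64), FusionModule(32, 64), ParallelModule(32, 64)
    out_hi, out_lo = p2(*fuse(*p1(x_hi, x_lo)))
    print(out_hi.shape, out_lo.shape)   # torch.Size([1, 32, 64, 64]) torch.Size([1, 64, 32, 32])
```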
Regarding claim 2, Zhang in view of Mao and Zoph further discloses:

wherein the memory stores instructions that, when executed by the processor, cause the system to (Zoph, ¶[0089]: "Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.")

wherein at least one searching block of the plurality of searching blocks of the fusion module is configured to down-sample feature maps, and (Zhang, Page 8, Col. 1, ¶[3]: "our DCNAS inherently and repeatedly applying top-down and bottom-up multi-scale features fusion, hence results in leading performance."; Page 2, Figure 1: "Figure 1. Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Up sample block unchanged.") [Examiner's note: Figure 1 shows that at least one searching block (i.e., the parallel mixture layer) is configured to down-sample and up-sample feature maps, where the dashed green lines represent down-sampling connections and the dashed purple lines represent up-sampling connections.]

wherein at least one searching block of the first plurality of searching blocks of the fusion module is configured to up-sample feature maps. (Zhang, Page 8, Col. 1, ¶[3]: "our DCNAS inherently and repeatedly applying top-down and bottom-up multi-scale features fusion, hence results in leading performance."; Page 2, Figure 1: "Figure 1. Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Up sample block unchanged.") [Examiner's note: Figure 1 shows that at least one searching block (i.e., the parallel mixture layer) is configured to down-sample and up-sample feature maps, where the dashed green lines represent down-sampling connections and the dashed purple lines represent up-sampling connections.]

Regarding claim 5, Zhang in view of Mao and Zoph further discloses:

wherein the first resolution is greater than the second resolution. (Zhang, Page 2, Figure 1; Page 6, Col. 1, Section 4.1: "In our experiments, for the shape of feature maps, we set the spatial resolution space S to be {1/4; 1/8; 1/16; 1/32}, and the corresponding widths are set to F; 2F; 4F; 8F, where we set F to be 64 for our best model.") [Examiner's note: "first resolution" is being interpreted as 1/4, "second resolution" is being interpreted as 1/8.]

Regarding claim 6, Zhang in view of Mao and Zoph further discloses:

wherein the memory stores instructions that, when executed by the processor, cause the search space to further comprise (Zoph, ¶[0089]: "Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.")

a second fusion module including a second plurality of searching blocks, wherein the second fusion module is configured to generate multiscale feature maps of the second resolution by combining a down-sampled feature map received from the second parallel module with an up-sampled feature map received from the second parallel module, and wherein each of the searching blocks of the second plurality of searching blocks includes a plurality of neural network layers and a transformer. (Mao, Page 6, Figure 3; Page 6, Section 3.4, ¶[2-3]: "Fig 3(b) shows our DS-FPN pipeline in RetinaNet. Similar to FPN in Fig 3(a), we take image features from various scales from the backbone as input, and output corresponding refined feature maps of fixed channels number by a top-down aggregation methods. Our structure is composed of bottom-up pathways, Dual-stream lateral connections, and top-down pathways… Specifically, we denote feature maps of bottom-up pathways as {C2, C3, C4}, where C1 is ignored due to its low semantics and high resolutions. After processed by lateral DS-Blocks and aggregated with the upsampled features from top-down pathway, we obtain final {F2, F3, F4}. When extra feature outputs are needed, FPN utilizes 3 x 3 convolution with stride 2 to obtain F5 from C4, while DS-FPN adopts 2 x 2 patches embedding to down sample the image and follows a DS-Block to generate dual-scale features."; Page 8, Section 4.4, ¶[1]: "Settings. Experiments of DS-FPN are implemented on DS-Net-T and Swin Transformer [22] for object detection and instance segmentation on MSCOCO 2017 [20], by replacing the original FPN with DS-FPN.") [Examiner's note: The cited passage indicates that features are extracted from at least three different layers in the backbone (C2, C3, C4), which are typical outputs of intermediate layers in CNNs.]

Regarding claim 7, Zhang in view of Mao and Zoph further discloses:

wherein the fusion module is configured to fuse feature maps from searching blocks of three different resolutions. (Zhang, Page 4, Col. 1, ¶[2]: "Fusion Module. To explore various paths in DCSS, we introduce the fusion module with the ability of aggregating semantic features from preceding fusion modules besides the feature pyramids and attaching transformed semantic features to succeeding ones… Given an array of feature-maps with different shapes (e.g., spatial resolutions and channel widths), the shape-alignment layer dispatches each feature-map to the corresponding branches to align them to the target shape. Semantic features are well-aligned and fully aggregated, then feed into the mixture layer to perform efficient multi-scale features fusion."; Page 2, Figure 1: "Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones.") [Examiner's note: The fusion module (bottom left in Fig. 1) aggregates feature maps from multiple scales (e.g., resolutions 1/4, 1/8, 1/16, 1/32) using element-wise addition, creating a unified multiscale representation. The third feature map of the third resolution (e.g., 1/16) is produced by selecting and operating on the fused multiscale features at the desired resolution.]

Regarding claim 8, Zhang explicitly discloses:

further comprising another fusion module configured to receive a convolution stream and output feature maps of the first resolution to the first parallel module and output feature maps of the second resolution to the first parallel module. (Zhang, Page 2, Figure 1.) [Examiner's note: "another fusion module configured to receive a convolution stream and output feature maps" is being interpreted as the concat module of the mixture layer. The mixture layer receives input feature maps, which are processed through multiple parallel convolution streams with different kernel sizes (3 x 3, 5 x 5, 7 x 7). After processing through the convolution streams, the resulting feature maps are concatenated. This fusion combines the information from all streams into a unified representation. The concatenated feature maps are then further processed or adjusted to produce output feature maps at the first resolution (e.g., 1/4) and the second resolution (e.g., 1/8).]

Zhang fails to disclose:

wherein the memory stores instructions that, when executed by the processor, cause the search space to further comprise wherein the other fusion module comprises a second plurality of searching blocks that are interconnected with each other, and wherein each of the searching blocks of the second plurality of searching blocks includes a plurality of neural network layers and a transformer.

However, Zoph explicitly discloses:

wherein the memory stores instructions that, when executed by the processor, cause the search space to further comprise (Zoph, ¶[0089]: "Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.")

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Zoph. Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Zoph teaches a method of neural architecture search for dense image prediction tasks. One of ordinary skill would have had motivation to combine Zhang and Zoph because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" – choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

However, Mao explicitly discloses:

wherein the other fusion module comprises a second plurality of searching blocks that are interconnected with each other, and wherein each of the searching blocks of the second plurality of searching blocks includes a plurality of neural network layers and a transformer. (Mao, Page 6, Figure 3; Page 6, Section 3.4, ¶[2-3]: "Fig 3(b) shows our DS-FPN pipeline in RetinaNet. Similar to FPN in Fig 3(a), we take image features from various scales from the backbone as input, and output corresponding refined feature maps of fixed channels number by a top-down aggregation methods. Our structure is composed of bottom-up pathways, Dual-stream lateral connections, and top-down pathways… Specifically, we denote feature maps of bottom-up pathways as {C2, C3, C4}, where C1 is ignored due to its low semantics and high resolutions. After processed by lateral DS-Blocks and aggregated with the upsampled features from top-down pathway, we obtain final {F2, F3, F4}. When extra feature outputs are needed, FPN utilizes 3 x 3 convolution with stride 2 to obtain F5 from C4, while DS-FPN adopts 2 x 2 patches embedding to down sample the image and follows a DS-Block to generate dual-scale features."; Page 8, Section 4.4, ¶[1]: "Settings. Experiments of DS-FPN are implemented on DS-Net-T and Swin Transformer [22] for object detection and instance segmentation on MSCOCO 2017 [20], by replacing the original FPN with DS-FPN.") [Examiner's note: The cited passage indicates that features are extracted from at least three different layers in the backbone (C2, C3, C4), which are typical outputs of intermediate layers in CNNs.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Mao. Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Mao teaches a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. One of ordinary skill would have had motivation to combine Zhang and Mao to improve the model's ability to identify optimal sub-architectures for high-resolution tasks by combining local feature refinement with long-range contextual awareness.
Claims 3-4 and 9-14 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al ("DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation") (hereafter referred to as "Zhang") in view of Mao et al ("Dual-stream Network for Visual Recognition") (hereafter referred to as "Mao"), Zoph et al (US 2021/0081796 A1) (hereafter referred to as "Zoph"), and further in view of Dosovitskiy et al ("AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE") (hereafter referred to as "Dosovitskiy").

Regarding claim 3, Zhang in view of Mao and Zoph discloses all the limitations of claim 1 (as shown in the rejections above). Zhang in view of Mao and Zoph further discloses:

wherein the transformer of one or more searching blocks of the first plurality of stacked searching blocks is configured to (Mao, Page 6, Figure 3; Page 6, Section 3.4, ¶[2-3]: "Fig 3(b) shows our DS-FPN pipeline in RetinaNet. Similar to FPN in Fig 3(a), we take image features from various scales from the backbone as input, and output corresponding refined feature maps of fixed channels number by a top-down aggregation methods. Our structure is composed of bottom-up pathways, Dual-stream lateral connections, and top-down pathways… Specifically, we denote feature maps of bottom-up pathways as {C2, C3, C4}, where C1 is ignored due to its low semantics and high resolutions. After processed by lateral DS-Blocks and aggregated with the upsampled features from top-down pathway, we obtain final {F2, F3, F4}. When extra feature outputs are needed, FPN utilizes 3 x 3 convolution with stride 2 to obtain F5 from C4, while DS-FPN adopts 2 x 2 patches embedding to down sample the image and follows a DS-Block to generate dual-scale features."; Page 8, Section 4.4, ¶[1]: "Settings. Experiments of DS-FPN are implemented on DS-Net-T and Swin Transformer [22] for object detection and instance segmentation on MSCOCO 2017 [20], by replacing the original FPN with DS-FPN.") [Examiner's note: The cited passage indicates that features are extracted from at least three different layers in the backbone (C2, C3, C4), which are typical outputs of intermediate layers in CNNs.]

Zhang in view of Mao and Zoph fails to disclose:

provide an attention map based on feature maps received from another searching block of the first plurality of stacked searching blocks.

However, Dosovitskiy explicitly discloses:

provide an attention map based on feature maps received from another searching block of the first plurality of stacked searching blocks. (Dosovitskiy, Page 3, Figure 1: "Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable "classification token" to the sequence."; Page 3, Section 3.1, ¶[4]: "The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self attention (MSA, see Appendix A) and MLP blocks"; Page 4, ¶[2]: "As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN"; Page 8, Figure 6: "Representative examples of attention from the output token to the input space.") [Examiner's note: feature map, i.e., "the input sequence can be formed from feature maps"; Fig. 6 shows that attention maps are provided based on feature maps (i.e., the input sequence).]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang, Mao, Zoph and Dosovitskiy. Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Zoph teaches a method of neural architecture search for dense image prediction tasks. Mao teaches a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. Dosovitskiy teaches experiments with applying a standard Transformer directly to images, with the fewest possible modifications, inspired by the Transformer scaling successes in NLP. One of ordinary skill would have had motivation to combine Zhang, Mao, Zoph and Dosovitskiy to enhance the network's ability to capture long-range dependencies and contextual relationships within the feature maps received from other searching blocks, thereby improving feature representation and enabling more effective multiscale learning.

Regarding claim 4, the combination of Zhang, Mao, Zoph and Dosovitskiy discloses all the limitations of claim 3 (as shown in the rejections above). Zhang in view of Mao, Zoph and Dosovitskiy further discloses:

wherein one or more searching blocks of the first plurality of stacked searching blocks includes a plurality of convolution layers arranged in a depth-wise manner, each convolution layer of the plurality of convolution layers having a different kernel size. (Zhang, Page 3, Col. 2, Section 3.1, ¶[2]: "The mixture layer is the elementary structure in search space represents a collection of available operations. Similar to [2], we construct the operator space O with various configurations of mobileNet-v3 [23], i.e., kernel sizes k ϵ {3; 5; 7}, expansion ratios r ϵ {3; 6}."; Page 2, Figure 1.) [Examiner's note: Figure 1 discloses the Mixture Layer (bottom right) dividing the input feature maps into multiple parallel paths, where each path uses a different kernel size, such as 3 x 3, 5 x 5, and 7 x 7, to perform depth-wise separable convolutions.]

Regarding claim 9, Zhang explicitly discloses:

a first branch including a first plurality of stacked searching blocks for image features of a first resolution, (Zhang, Page 2, Figure 1: "Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Upsample block unchanged. Dashed lines represent candidate connections in DCSS, to keep clarity, we only demonstrate several connections among all the candidates."; Page 4, Col. 1, ¶[2]: "As illustrated by the bottom left part in Figure 1, the fusion module consists of the shape-alignment layer besides the mixture layer. The shape alignment layer is in a multi-branch parallel form.") [Examiner's note: Figure 1 discloses a search space with a multi-branch parallel module including multiple mixture layers (i.e., the stacked searching blocks).]

one or more searching blocks of the first plurality of stacked searching blocks including a plurality of convolution layers (Zhang, Page 2, Figure 1.) [Examiner's note: plurality of stacked searching blocks, i.e., the multiple mixture layers; plurality of convolution layers, i.e., convolution layers with various kernel sizes (3 x 3, 5 x 5, 7 x 7).]

a second branch including a second plurality of stacked searching blocks for image features of a second resolution, (Zhang, Page 2, Figure 1: "Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Upsample block unchanged. Dashed lines represent candidate connections in DCSS, to keep clarity, we only demonstrate several connections among all the candidates."; Page 4, Col. 1, ¶[2]: "As illustrated by the bottom left part in Figure 1, the fusion module consists of the shape-alignment layer besides the mixture layer. The shape alignment layer is in a multi-branch parallel form.") [Examiner's note: Figure 1 discloses a search space with a multi-branch parallel module including multiple mixture layers (i.e., the stacked searching blocks).]

one or more searching blocks of the second plurality of stacked searching blocks including a plurality of convolution and (Zhang, Page 2, Figure 1.) [Examiner's note: plurality of stacked searching blocks, i.e., the multiple mixture layers; plurality of convolution layers, i.e., convolution layers with various kernel sizes (3 x 3, 5 x 5, 7 x 7).]

a fusion module configured to fuse image features output by the one or more searching blocks of the first plurality of stacked searching blocks and image features output by the one or more searching blocks of the second plurality of stacked searching blocks, wherein the fusion module is configured to output image features of the first resolution and image features of the second resolution. (Zhang, Page 2, Figure 1.) [Examiner's note: The claimed fusion module is being interpreted as the concat module of the mixture layer. The mixture layer receives input feature maps, which are processed through multiple parallel convolution streams with different kernel sizes (3 x 3, 5 x 5, 7 x 7). After processing through the convolution streams, the resulting feature maps are concatenated. This fusion combines the information from all streams into a unified representation.
The concatenated feature maps are then further processed or adjusted to produce output feature maps at the first resolution (e.g., 1/4) and the second resolution (e.g., 1/8)] Zhang fails to disclose: A system for neural architecture search, the system comprising: a processor; and memory storing instructions that, when executed by the processor, cause the system to implement a search space comprising: at least one transformer configured to provide an attention map based on image features from another searching block of the first branch; at least one transformer configured to provide an attention map based on image features from another searching block of the second branch; and wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; a fusion module including a plurality of searching blocks, wherein each searching block of the plurality of searching blocks includes a plurality of neural network layers and a transformer, wherein the plurality of searching blocks are interconnected with one another However, Zoph explicitly discloses: A system for neural architecture search, the system comprising: a processor; and (Zoph, ¶[0088]: “Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both.”) memory storing instructions that, when executed by the processor, cause the system to implement a search space comprising: (Zoph, ¶0089]: “Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.”) It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Zoph. Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Zoph teaches a method of neural architecture search for dense image prediction tasks. One of ordinary skill would have motivation to combine Zhang and Zoph because MPEP 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E): “Obvious to try” choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of the ordinary skill in the art. However, Dosovitskyi explicitly discloses: at least one transformer configured to provide an attention map based on image features from another searching block of the first branch; (Dosovitshiy, Page 3, Figure 1: “Model overview. 
We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence.” PNG media_image7.png 407 773 media_image7.png Greyscale , Page 3, Section 3.1, ¶[4]: “The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self attention (MSA, see Appendix A) and MLP blocks”, Page 4, ¶[2]: “As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN”, Page 8, Figure 6: “Representative examples of attention from the output token to the input space.” PNG media_image8.png 403 232 media_image8.png Greyscale ) [Examiner’s note: feature map i.e., “the input sequence can be formed from feature maps”, Fig. 6 shows attention maps is provided based on feature maps (i.e., input sequence)] at least one transformer configured to provide an attention map based on image features from another searching block of the second branch; and (Dosovitshiy, Page 3, Figure 1: “Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence.” PNG media_image7.png 407 773 media_image7.png Greyscale , Page 3, Section 3.1, ¶[4]: “The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self attention (MSA, see Appendix A) and MLP blocks”, Page 4, ¶[2]: “As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN”, Page 8, Figure 6: “Representative examples of attention from the output token to the input space.” PNG media_image8.png 403 232 media_image8.png Greyscale ) [Examiner’s note: feature map i.e., “the input sequence can be formed from feature maps”, Fig. 6 shows attention maps is provided based on feature maps (i.e., input sequence)] It would have obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Dosovitskiy. Zhang teaches a novel Densely Connected NAS (DCNAS) framework, which directly searches the optimal network structures for the multi-scale representations of visual information, over a large-scale target dataset without proxy. Dosovitskiy teaches experiment with applying a standard Transformer directly to images, with the fewest possible modifications, which is inspired by the Transformer scaling successes in NPL. One of ordinary skill would have motivation to combine Zhang and Dosovitskiy to enhance the network’s ability to capture long-range dependencies and contextual relationships within the feature maps received from other searching blocks, thereby improving feature representation and enabling more effective multiscale learning. However, Mao explicitly discloses: wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; (Mao, Page 2, ¶[2]: “In this paper, we address the these issues by introducing a Dual-stream Network (DS-Net). 
However, Mao explicitly discloses: wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; (Mao, Page 2, ¶[2]: “In this paper, we address the these issues by introducing a Dual-stream Network (DS-Net). Instead of single stream architecture as previous works, our DS-Net adopts Dual-stream Blocks (DS-Blocks), which generates two feature maps with different resolutions, and retains both local and global information of the image via two parallel branches.”; Page 6, Section 3.4, Figure 3; Section 3.4, ¶[1]: “we apply our Dual-stream design into FPN, named Dual-stream Feature Pyramid Networks(DS-FPN), by simply adding DS-Blocks to every feature pyramid scale. In this way, DS-FPN is able to better attain non-local patterns and local details with marginal increased costs at all scales, which further enhance the performance of subsequent object detection and segmentation heads.”) [Examiner’s note: the “dual-stream blocks” are being interpreted as the “stacked searching blocks” because they perform the similar function of generating feature maps with different resolutions. The phrase “simply adding DS-Blocks to every feature pyramid scale” indicates parallel and independent blocks.]

a fusion module including a plurality of searching blocks, wherein each searching block of the plurality of searching blocks includes a plurality of neural network layers and a transformer, (Mao, Page 6, Figure 3; Page 6, Section 3.4, ¶[2-3]: “Fig 3(b) shows our DS-FPN pipeline in RetinaNet. Similar to FPN in Fig 3(a), we take image features from various scales from the backbone as input, and output corresponding refined feature maps of fixed channels number by a top-down aggregation methods. Our structure is composed of bottom-up pathways, Dual-stream lateral connections, and top-down pathways… Specifically, we denote feature maps of bottom-up pathways as {C2, C3, C4}, where C1 is ignored due to its low semantics and high resolutions. After processed by lateral DS-Blocks and aggregated with the upsampled features from top-down pathway, we obtain final {F2, F3, F4}. When extra feature outputs are needed, FPN utilizes 3 x 3 convolution with stride 2 to obtain F5 from C4, while DS-FPN adopts 2 x 2 patches embedding to down sample the image and follows a DS-Block to generate dual-scale features.”; Page 8, Section 4.4, ¶[1]: “Settings. Experiments of DS-FPN are implemented on DS-Net-T and Swin Transformer [22] for object detection and instance segmentation on MSCOCO 2017 [20], by replacing the original FPN with DS-FPN.”) [Examiner’s note: the cited passage indicates that features are extracted from at least three different layers in the backbone (C2, C3, C4), which are typical outputs of intermediate layers in CNNs.]

wherein the plurality of searching blocks are interconnected with one another (Mao, Page 6, ¶[1]: “Such a bidirectional information flow is able to identify cross-scale relations between local and global tokens, by which dual-scale features are highly aligned and coupled with each other. After this, we could safely up sample low-resolution representation hG, concatenate it with high-resolution hL and perform 1 by 1 convolution for channel-wise dual-scale information fusion.”) [Examiner’s note: the cited passage discloses a fusion module that includes a plurality of components interconnected through bidirectional information flow. It explains that local and global features are aligned and combined via concatenation and 1 x 1 convolution, indicating that the blocks are interconnected when performing channel-wise dual-scale fusion.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Mao. Zhang teaches a Densely Connected NAS (DCNAS) framework that directly searches the optimal network structures for multi-scale representations of visual information over a large-scale target dataset without a proxy. Mao teaches a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. One of ordinary skill in the art would have been motivated to combine Zhang and Mao to improve the model’s ability to identify optimal sub-architectures for high-resolution tasks by combining local feature refinement with long-range contextual awareness.
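[Editor’s illustration, not part of the Office Action record: the Mao passage cited for the “interconnected” limitation (up-sample hG, concatenate with hL, 1 x 1 convolution) corresponds to a simple dual-scale fusion step. A minimal sketch, assuming PyTorch and hypothetical names, follows.]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualScaleFusion(nn.Module):
    """Up-sample the low-resolution (global) stream, concatenate it with the
    high-resolution (local) stream, then fuse channels with a 1x1 convolution."""
    def __init__(self, local_channels: int, global_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(local_channels + global_channels, out_channels, kernel_size=1)

    def forward(self, h_local: torch.Tensor, h_global: torch.Tensor) -> torch.Tensor:
        # h_local: (B, C_l, H, W) high-resolution stream
        # h_global: (B, C_g, H', W') low-resolution stream with H' <= H, W' <= W
        h_global_up = F.interpolate(h_global, size=h_local.shape[-2:],
                                    mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([h_local, h_global_up], dim=1))
```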
Regarding claim 10, the combination of Zhang, Mao, Zoph and Dosovitskiy discloses all the limitations of claim 9 (as shown in the rejections above). Zhang in view of Mao, Zoph and Dosovitskiy further discloses: wherein the fusion module is configured to initiate a third branch and output image features of a third resolution. (Zhang, Page 4, Col. 1, ¶[2]: “Fusion Module. To explore various paths in DCSS, we introduce the fusion module with the ability of aggregating semantic features from preceding fusion modules besides the feature pyramids and attaching transformed semantic features to succeeding ones… Given an array of feature-maps with different shapes (e.g., spatial resolutions and channel widths), the shape-alignment layer dispatches each feature-map to the corresponding branches to align them to the target shape. Semantic features are well-aligned and fully aggregated, then feed into the mixture layer to perform efficient multi-scale features fusion.”; Page 2, Figure 1: “Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones.”) [Examiner’s note: The fusion module (bottom left in Fig. 1) aggregates feature maps from multiple scales (e.g., resolutions 1/4, 1/8, 1/16, 1/32) using element-wise addition, creating a unified multiscale representation. The third feature map of the third resolution (e.g., 1/16) is produced by selecting and operating on the fused multiscale features at the desired resolution.]

Regarding claim 11, the combination of Zhang, Mao, Zoph and Dosovitskiy discloses all the limitations of claim 10 (as shown in the rejections above). Zhang in view of Mao, Zoph and Dosovitskiy further discloses: wherein the first resolution is greater than the second resolution. (Zhang, Page 2, Figure 1; Page 6, Col. 1, Section 4.1: “In our experiments, for the shape of feature maps, we set the spatial resolution space S to be {1/4; 1/8; 1/16; 1/32}, and the corresponding widths are set to F; 2F; 4F; 8F, where we set F to be 64 for our best model.”) [Examiner’s note: the “first resolution” is being interpreted as 1/4 and the “second resolution” as 1/8.]
Regarding claim 12, the combination of Zhang, Mao, Zoph and Dosovitskiy discloses all the limitations of claim 10 (as shown in the rejections above). Zhang in view of Mao, Zoph and Dosovitskiy further discloses: wherein the plurality of searching blocks of the fusion module includes a searching block configured to down-sample image features of the first branch and up-sample image features of the third branch, (Zhang, Page 8, Col. 1, ¶[3]: “our DCNAS inherently and repeatedly applying top-down and bottom-up multi-scale features fusion, hence results in leading performance.”; Page 2, Figure 1: “Figure 1. Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Up sample block unchanged.”) [Examiner’s note: Figure 1 shows that at least one searching block (i.e., the parallel mixture layer) is configured to down-sample and up-sample feature maps, where the dashed green lines represent down-sampling connections and the dashed purple lines represent up-sampling connections.]

the fusion module configured to generate multiscale image features by fusing the down-sampled image features and the up-sampled image features to output multiscale image features of the second resolution. (Zhang, Page 2, Figure 1: “Figure 1. Framework of DCNAS... Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones.”; Page 4, Col. 1, ¶[2] (“Fusion Module…”), as quoted above) [Examiner’s note: Figure 1 discloses a fusion module that includes multiple mixture layers (i.e., the stacked searching blocks). The quoted passage shows that the fusion module performs multi-scale feature fusion by aggregating (i.e., fusing) down-sampled feature maps and up-sampled feature maps (i.e., it dispatches each feature-map to the corresponding branches).]
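[Editor’s illustration, not part of the Office Action record: the claim 12 mapping describes down-sampling first-branch features and up-sampling third-branch features, then fusing them at the second resolution. A minimal sketch, assuming PyTorch, bilinear resampling and 1 x 1 channel alignment (all hypothetical choices), is below.]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondResolutionFusion(nn.Module):
    """Resample a higher-resolution branch (e.g. 1/4) and a lower-resolution
    branch (e.g. 1/16) to the middle (second) resolution (e.g. 1/8), then fuse
    them by element-wise addition after aligning channel widths."""
    def __init__(self, high_channels: int, low_channels: int, mid_channels: int):
        super().__init__()
        self.align_high = nn.Conv2d(high_channels, mid_channels, kernel_size=1)
        self.align_low = nn.Conv2d(low_channels, mid_channels, kernel_size=1)

    def forward(self, x_high: torch.Tensor, x_low: torch.Tensor, target_hw) -> torch.Tensor:
        down = F.interpolate(self.align_high(x_high), size=target_hw,
                             mode="bilinear", align_corners=False)  # down-sample 1/4 -> 1/8
        up = F.interpolate(self.align_low(x_low), size=target_hw,
                           mode="bilinear", align_corners=False)    # up-sample 1/16 -> 1/8
        return down + up  # multiscale features at the second resolution
```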
Regarding claim 13, the combination of Zhang, Mao, Zoph and Dosovitskiy discloses all the limitations of claim 9 (as shown in the rejections above). Zhang in view of Mao, Zoph and Dosovitskiy further discloses: wherein one or more searching blocks of the first plurality of stacked searching blocks includes a plurality of convolution layers arranged in a depth-wise manner, each convolution layer of the plurality of convolution layers having a different kernel size. (Zhang, Page 3, Col. 2, Section 3.1, ¶[2]: “The mixture layer is the elementary structure in search space represents a collection of available operations. Similar to [2], we construct the operator space O with various configurations of mobileNet-v3 [23], i.e., kernel sizes k ϵ {3; 5; 7}, expansion ratios r ϵ {3; 6}.”; Page 2, Figure 1) [Examiner’s note: Figure 1 discloses the mixture layer (bottom right) dividing the input feature maps into multiple parallel paths, where each path uses a different kernel size, such as 3 x 3, 5 x 5, and 7 x 7, to perform depth-wise separable convolutions.]
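[Editor’s illustration, not part of the Office Action record: the mixture-layer mapping for claim 13 relies on parallel depth-wise convolutions with kernel sizes 3, 5 and 7. A sketch of one such block, assuming PyTorch and softmax-weighted architecture parameters (a common one-shot NAS convention, hypothetical here), follows.]

```python
import torch
import torch.nn as nn

class DepthwiseMixtureLayer(nn.Module):
    """Parallel depth-wise convolutions with different kernel sizes, combined
    using softmax-normalized architecture weights."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.paths = nn.ModuleList([
            # groups=channels makes each convolution depth-wise
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        self.arch_weights = nn.Parameter(torch.zeros(len(kernel_sizes)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.arch_weights, dim=0)
        return sum(a * path(x) for a, path in zip(alpha, self.paths))
```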
Regarding claim 14, the combination of Zhang, Mao, Zoph and Dosovitskiy discloses all the limitations of claim 9 (as shown in the rejections above). Zhang in view of Mao, Zoph and Dosovitskiy further discloses: a third branch including a third plurality of stacked searching blocks for image features of a third resolution, (Zhang, Page 2, Figure 1: “Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32. To stable the searching procedure, we keep the beginning STEM and final Upsample block unchanged. Dashed lines represent candidate connections in DCSS, to keep clarity, we only demonstrate several connections among all the candidates.”; Page 4, Col. 1, ¶[2]: “As illustrated by the bottom left part in Figure 1, the fusion module consists of the shape-alignment layer besides the mixture layer. The shape alignment layer is in a multi-branch parallel form.”) [Examiner’s note: Figure 1 discloses a search space with a multi-branch parallel module including multiple mixture layers (i.e., the stacked searching blocks).]

wherein one or more searching blocks of the third plurality of stacked searching blocks includes a transformer. (Dosovitskiy, Page 3, Figure 1; Page 3, Section 3.1, ¶[4]; Page 4, ¶[2]; Page 8, Figure 6, as quoted above) [Examiner’s note: “the input sequence can be formed from feature maps”, and Fig. 6 shows that attention maps are provided based on those feature maps (i.e., the input sequence).]

Claims 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (“DCNAS: Densely Connected Neural Architecture Search for Semantic Image Segmentation”) (hereafter referred to as “Zhang”) in view of Mao et al. (“Dual-stream Network for Visual Recognition”) (hereafter referred to as “Mao”), and further in view of Dosovitskiy et al. (“AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE”) (hereafter referred to as “Dosovitskiy”).

Regarding claim 15, Zhang explicitly discloses: A method of searching a search space, the method comprising: generating image features of a first resolution using a first parallel module including a first plurality of stacked searching blocks, (Zhang, Page 2, Figure 1: “Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones.”; Page 4, Col. 2, ¶[2]: “The shape-alignment layer is in a multi-branch parallel form. Given an array of feature-maps with different shapes (e.g., spatial resolutions and channel widths), the shape-alignment layer dispatches each feature-map to the corresponding branches to align them to the target shape.”; Page 6, Col. 1, Section 4.1, ¶[1]: “In our experiments, for the shape of feature maps, we set the spatial resolution space S to be {1/4; 1/8; 1/16; 1/32}, and the corresponding widths are set to F, 2F, 4F, 8F, where we set F to be 64 for our best model”.) [Examiner’s note: the first and second feature maps of the first and second resolutions correspond to dispatching each feature-map to the corresponding branch, the branches having respective resolutions {1/4, 1/8, 1/16, 1/32}.]

wherein one or more searching blocks of the first plurality of stacked searching blocks includes a plurality of convolution layers and (Zhang, Page 2, Figure 1) [Examiner’s note: the plurality of stacked searching blocks corresponds to the multiple mixture layers; the plurality of convolution layers corresponds to the convolution layers with various kernel sizes (3 x 3, 5 x 5, 7 x 7).]

generating image features of a second resolution using the first parallel module, (Zhang, Page 2, Figure 1; Page 4, Col. 2, ¶[2]; Page 6, Col. 1, Section 4.1, ¶[1], as quoted above)
[Examiner’s note: as above, the feature maps output at the first and second resolutions correspond to dispatching each feature-map to the corresponding branch, the branches having respective resolutions {1/4, 1/8, 1/16, 1/32}.]

wherein the first parallel module includes a second plurality of stacked searching blocks and one or more searching blocks of the second plurality of stacked searching blocks includes a plurality of convolution layers and (Zhang, Page 2, Figure 1) [Examiner’s note: the plurality of stacked searching blocks corresponds to the multiple mixture layers; the plurality of convolution layers corresponds to the convolution layers with various kernel sizes (3 x 3, 5 x 5, 7 x 7).]

fusing one or more image features received from the first plurality of stacked searching blocks with one or more image features received from the second plurality of stacked searching blocks to output multiscale image features of the first resolution and multiscale image features of the second resolution. (Zhang, Page 2, Figure 1: “Figure 1. Framework of DCNAS... Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers, and solving the intensive GPU memory requirement problem by sampling a portion of connections from all possible ones.”; Page 4, Col. 1, ¶[2] (“Fusion Module…”), as quoted above) [Examiner’s note: Figure 1 discloses a fusion module that includes multiple mixture layers (i.e., the stacked searching blocks). The quoted passage shows that the fusion module performs multi-scale feature fusion by aggregating (i.e., fusing) feature maps of the corresponding resolutions (i.e., it dispatches each feature-map to the corresponding branches).]
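[Editor’s illustration, not part of the Office Action record: the shape-alignment layer quoted for claim 15 dispatches feature maps to per-resolution branches and aligns spatial size and channel width before fusion. A minimal sketch, assuming PyTorch, with hypothetical branch widths standing in for F, 2F, 4F, 8F, is shown below.]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeAlignment(nn.Module):
    """Align one incoming feature map to each branch's target channel width and
    spatial resolution (e.g. 1/4, 1/8, 1/16, 1/32) before multi-scale fusion."""
    def __init__(self, in_channels: int, branch_channels=(64, 128, 256, 512)):
        super().__init__()
        self.proj = nn.ModuleList([
            nn.Conv2d(in_channels, c, kernel_size=1) for c in branch_channels
        ])

    def forward(self, x: torch.Tensor, branch_sizes):
        # branch_sizes: list of (H, W) targets, one per branch
        return [
            F.interpolate(proj(x), size=size, mode="bilinear", align_corners=False)
            for proj, size in zip(self.proj, branch_sizes)
        ]
```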
Zhang fails to disclose: at least one transformer configured to provide an attention map based on image features from another searching block; at least one transformer configured to provide an attention map based on image features from a different searching block; and wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; and via a plurality of searching blocks that are interconnected with one another.

However, Dosovitskiy explicitly discloses: at least one transformer configured to provide an attention map based on image features from another searching block; (Dosovitskiy, Page 3, Figure 1; Page 3, Section 3.1, ¶[4]; Page 4, ¶[2]; Page 8, Figure 6, as quoted above) [Examiner’s note: “the input sequence can be formed from feature maps”, and Fig. 6 shows that attention maps are provided based on those feature maps (i.e., the input sequence).]

at least one transformer configured to provide an attention map based on image features from a different searching block; and (Dosovitskiy, Page 3, Figure 1; Page 3, Section 3.1, ¶[4]; Page 4, ¶[2]; Page 8, Figure 6, as quoted above) [Examiner’s note: the same mapping applies.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Dosovitskiy. Zhang teaches a Densely Connected NAS (DCNAS) framework that directly searches the optimal network structures for multi-scale representations of visual information over a large-scale target dataset without a proxy. Dosovitskiy teaches applying a standard Transformer directly to images, with the fewest possible modifications, inspired by the Transformer scaling successes in NLP. One of ordinary skill in the art would have been motivated to combine Zhang and Dosovitskiy to enhance the network’s ability to capture long-range dependencies and contextual relationships within the feature maps received from other searching blocks, thereby improving feature representation and enabling more effective multiscale learning.
However, Mao explicitly discloses: wherein the first plurality of stacked searching blocks and the second plurality of stacked searching blocks are separate and distinct from one another; and (Mao, Page 2, ¶[2]; Page 6, Section 3.4, Figure 3; Section 3.4, ¶[1], as quoted above) [Examiner’s note: the “dual-stream blocks” are being interpreted as the “stacked searching blocks” because they perform the similar function of generating feature maps with different resolutions. The phrase “simply adding DS-Blocks to every feature pyramid scale” indicates parallel and independent blocks.]

via a plurality of searching blocks that are interconnected with one another, (Mao, Page 6, ¶[1]: “Such a bidirectional information flow is able to identify cross-scale relations between local and global tokens, by which dual-scale features are highly aligned and coupled with each other. After this, we could safely up sample low-resolution representation hG, concatenate it with high-resolution hL and perform 1 by 1 convolution for channel-wise dual-scale information fusion.”) [Examiner’s note: the cited passage discloses a fusion module that includes a plurality of components interconnected through bidirectional information flow. It explains that local and global features are aligned and combined via concatenation and 1 x 1 convolution, indicating that the blocks are interconnected when performing channel-wise dual-scale fusion.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Zhang and Mao. Zhang teaches a Densely Connected NAS (DCNAS) framework that directly searches the optimal network structures for multi-scale representations of visual information over a large-scale target dataset without a proxy. Mao teaches a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. One of ordinary skill in the art would have been motivated to combine Zhang and Mao to improve the model’s ability to identify optimal sub-architectures for high-resolution tasks by combining local feature refinement with long-range contextual awareness.

Regarding claim 16, the combination of Zhang, Mao and Dosovitskiy discloses all the limitations of claim 15 (as shown in the rejections above). Zhang in view of Mao and Dosovitskiy further discloses: generating down-sampled image features of the second resolution using a searching block that receives image features from a searching block of the first plurality of stacked searching blocks. (Zhang, Page 8, Col. 1, ¶[3]: “our DCNAS inherently and repeatedly applying top-down and bottom-up multi-scale features fusion, hence results in leading performance.”; Page 2, Figure 1: “Figure 1. Framework of DCNAS. Top: The densely connected search space (DCSS) with layer L and max down sampling rate 32.
To stable the searching procedure, we keep the beginning STEM and final Up sample block unchanged.”) [Examiner’s note: Figure 1 shows that at least one searching block (i.e., the parallel mixture layer) is configured to down-sample and up-sample feature maps, where the dashed green lines represent down-sampling connections and the dashed purple lines represent up-sampling connections.]

Regarding claim 17, the combination of Zhang, Mao and Dosovitskiy discloses all the limitations of claim 16 (as shown in the rejections above). Zhang in view of Mao and Dosovitskiy further discloses: further comprising generating up-sampled image features of the second resolution using a searching block that receives image features from a searching block of a third plurality of stacked searching blocks. (Zhang, Page 8, Col. 1, ¶[3]; Page 2, Figure 1, as quoted above) [Examiner’s note: Figure 1 shows that at least one searching block (i.e., the parallel mixture layer) is configured to down-sample and up-sample feature maps, where the dashed green lines represent down-sampling connections and the dashed purple lines represent up-sampling connections.]

Regarding claim 18, the combination of Zhang, Mao and Dosovitskiy discloses all the limitations of claim 15 (as shown in the rejections above). Zhang in view of Mao and Dosovitskiy further discloses: generating, by a fusion module, multiscale image features of a third resolution. (Zhang, Page 4, Col. 1, ¶[2] (“Fusion Module…”); Page 2, Figure 1 (“Bottom Left: Fusion module targets at aggregating feature-maps derived by previous layers…”), as quoted above) [Examiner’s note: The fusion module (bottom left in Fig. 1) aggregates feature maps from multiple scales (e.g., resolutions 1/4, 1/8, 1/16, 1/32) using element-wise addition, creating a unified multiscale representation. The third feature map of the third resolution (e.g., 1/16) is produced by selecting and operating on the fused multiscale features at the desired resolution.]

the fusion module comprising the plurality of searching blocks. (Mao, Page 6, ¶[1], as quoted above) [Examiner’s note: the cited passage discloses a fusion module that includes a plurality of components interconnected through bidirectional information flow; local and global features are aligned and combined via concatenation and 1 x 1 convolution when performing channel-wise dual-scale fusion.]
Regarding claim 19, the combination of Zhang, Mao and Dosovitskiy discloses all the limitations of claim 15 (as shown in the rejections above). Zhang in view of Mao and Dosovitskiy further discloses: wherein at least one searching block of the first parallel module includes a plurality of depth-wise convolution layers, each convolution layer of the plurality of depth-wise convolution layers generating an output using a different kernel size. (Zhang, Page 3, Col. 2, Section 3.1, ¶[2]: “The mixture layer is the elementary structure in search space represents a collection of available operations. Similar to [2], we construct the operator space O with various configurations of mobileNet-v3 [23], i.e., kernel sizes k ϵ {3; 5; 7}, expansion ratios r ϵ {3; 6}.”; Page 2, Figure 1) [Examiner’s note: Figure 1 discloses the mixture layer (bottom right) dividing the input feature maps into multiple parallel paths, where each path uses a different kernel size, such as 3 x 3, 5 x 5, and 7 x 7, to perform depth-wise separable convolutions.]

Regarding claim 20, the combination of Zhang, Mao and Dosovitskiy discloses all the limitations of claim 15 (as shown in the rejections above). Zhang in view of Mao and Dosovitskiy further discloses: wherein the first resolution is greater than the second resolution. (Zhang, Page 2, Figure 1; Page 6, Col. 1, Section 4.1: “In our experiments, for the shape of feature maps, we set the spatial resolution space S to be {1/4; 1/8; 1/16; 1/32}, and the corresponding widths are set to F; 2F; 4F; 8F, where we set F to be 64 for our best model.”) [Examiner’s note: the “first resolution” is being interpreted as 1/4 and the “second resolution” as 1/8.]

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY TRAN, whose telephone number is (571) 270-0693. The examiner can normally be reached Monday - Friday, 7:30 am - 5:00 pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /AMY TRAN/Examiner, Art Unit 2126 /LUIS A SITIRICHE/Primary Examiner, Art Unit 2126

Prosecution Timeline

Jun 08, 2021
Application Filed
Nov 27, 2024
Non-Final Rejection — §103, §112
Feb 25, 2025
Response Filed
Jun 04, 2025
Final Rejection — §103, §112
Aug 05, 2025
Response after Non-Final Action
Aug 29, 2025
Request for Continued Examination
Sep 08, 2025
Response after Non-Final Action
Feb 09, 2026
Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602582
DYNAMIC DISTRIBUTED TRAINING OF MACHINE LEARNING MODELS
2y 5m to grant Granted Apr 14, 2026
Patent 12468932
IDENTIFYING RELATED MESSAGES IN A NATURAL LANGUAGE INTERACTION
2y 5m to grant Granted Nov 11, 2025
Patent 12462185
SCENE GRAMMAR BASED REINFORCEMENT LEARNING IN AGENT TRAINING
2y 5m to grant Granted Nov 04, 2025
Patent 12423589
TRAINING DECISION TREE-BASED PREDICTIVE MODELS
2y 5m to grant Granted Sep 23, 2025
Patent 12288074
GENERATING AND PROVIDING PROPOSED DIGITAL ACTIONS IN HIGH-DIMENSIONAL ACTION SPACES USING REINFORCEMENT LEARNING MODELS
2y 5m to grant Granted Apr 29, 2025
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
36%
Grant Probability
84%
With Interview (+47.9%)
5y 2m
Median Time to Grant
High
PTA Risk
Based on 28 resolved cases by this examiner. Grant probability derived from career allow rate.
