Prosecution Insights
Last updated: April 19, 2026
Application No. 18/881,216

METHOD, APPARATUS AND SYSTEM FOR ENCODING AND DECODING A TENSOR

Non-Final OA: §103, §112
Filed: Jan 03, 2025
Examiner: RETALLICK, KAITLIN A
Art Unit: 2482
Tech Center: 2400 — Computer Networks
Assignee: Canon Kabushiki Kaisha
OA Round: 1 (Non-Final)
Grant Probability: 75% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 7m
With Interview: 86%

Examiner Intelligence

Career Allow Rate: 75% (388 granted / 515 resolved); above average, +17.3% vs TC avg
Interview Lift: +10.7% (moderate; measured over resolved cases with interview)
Typical Timeline: 2y 7m average prosecution; 27 applications currently pending
Career History: 542 total applications across all art units

Statute-Specific Performance

§101: 5.8% (-34.2% vs TC avg)
§103: 58.4% (+18.4% vs TC avg)
§102: 7.0% (-33.0% vs TC avg)
§112: 8.6% (-31.4% vs TC avg)
Comparison baseline: Tech Center average estimate • Based on career data from 515 resolved cases
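Read literally, the "vs TC avg" figures behave as additive percentage-point deltas. The sketch below works out the implied baseline under that assumption (the dashboard does not define the comparison); each statute implies the same roughly 40% baseline, consistent with a single Tech Center average estimate being used.

```python
# Hedged sketch: derive the implied Tech Center baseline from the table above,
# assuming each "vs TC avg" figure is an additive percentage-point delta
# (an assumption; the dashboard does not state its formula).

examiner_rate = {"101": 5.8, "103": 58.4, "102": 7.0, "112": 8.6}      # percent
delta_vs_tc   = {"101": -34.2, "103": 18.4, "102": -33.0, "112": -31.4}

for statute, rate in examiner_rate.items():
    implied_tc_avg = rate - delta_vs_tc[statute]
    print(f"§{statute}: implied TC average ≈ {implied_tc_avg:.1f}%")
# Every statute implies the same ~40.0% baseline estimate.
```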

Office Action

§103, §112
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Application

Claims 1-14 are currently pending in this application.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 01/03/2025 was filed. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner. The information disclosure statement (IDS) submitted on 04/08/2025 was filed. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b): (b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention. The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph: The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1 and 8-14 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. In regards to claims 1 and 8-14, the claim limitations “a first unit of information” and “a second unit of information” lack clarity and one of ordinary skill in the art would not understand what is being claimed. The specification states, “Each of the compressed tensors 557 and 537 provides a unit of information of feature maps of the frame data 113 as obtained using convolutional operations of either (i) the MSFF 510 and the SSFC encoder 550 on the tensors 505 and 504, or (ii) the MSFF 510 and the SSFC encoder 530 on the tensors 503 and 502.” [0103]. Further, the specification states, “Steps 1110, 1120 and 1130 operate to decode a frame of the bitstream to obtain two units of information for the frame. The tensor 1011 provides a first unit of information and the tensor 1021 the second unit of information. Each unit of information corresponds to feature maps of the frame encoded in the bitstream.” [0160]. Thus, one of ordinary skill in the art would not understand what is being claimed by the “unit of information” in the claim language in view of the specification.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claim(s) 1-4 and 6-14 is/are rejected under 35 U.S.C. 103 as being unpatentable over CRICRI FRANCESCO et al. (Hereafter, “Cricri”) [EP 3934254 A1] in view of IKONIN SERGEY YURIEVICH et al. (Hereafter, “Ikonin”) [WO 2022/139617 A1]. In regards to claim 1, Cricri discloses a method of decoding at least a plurality of tensors forming a hierarchical representation of feature maps for a single frame from a bitstream ([0010] According to an aspect, a method comprises: obtaining an encoded set of features of input data; decoding the encoded set of features to obtain a plurality of reconstructed feature maps; and causing execution of at least one task neural network based on the plurality of reconstructed feature maps. [0048] A convolutional layer performs convolutional operations to extract information from input data, for example image 502, to form a plurality of feature maps 506. A feature map may be generated by applying a convolutional filter or a convolutional kernel to a subset of input data, for example block 504 in image 502, and sliding the filter through the input data to obtain a value for each element of the feature map. The filter may comprise a matrix or a tensor, which may be for example multiplied with the input data to extract features corresponding to that filter. A plurality of feature maps may be generated based on applying a plurality of filters.), the method comprising: decoding a first unit of information from the bitstream ([0119] video decoder may receive or decode an indication of at least one first coding parameter); decoding a second unit of information from the bitstream ([0119] an indication of at least one second coding parameter); determining a first plurality of tensors from the first unit of information, feature maps of at least one tensor of the first plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the first plurality of tensors ([0119] According to an example embodiment, the video decoder 1021 may receive or decode an indication of at least one first coding parameter associated with a first subset of the plurality of feature maps. This enables different subsets of feature maps to be decoded at different qualities, for example at different resolutions. [0120] The first subset of feature maps or tiles may be decoded based on the at least one first coding parameter. 
[0122] When the feature adapter 1012 of the VCM encoder 1010 includes a projection neural network, the feature adapter 1022 of the VCM decoder 1020 may comprise a corresponding re-projection network to compensate for the adaptation of the extracted features. Therefore, the re-projection neural network may be configured to re-project the plurality of reconstructed feature maps to obtain at least one re-projected feature tensor. For example, the re-projection neural network may re-project a projected feature tensor of shape (3, H_feat_proj, W_feat_proj), which was suitable for decoding by video decoder 1021, into a re-projected feature tensor having the original shape (num_channels, H_feat, W_feat). The task-NN(s) may be then executed based on the re-projected feature tensor. This enables the task-NN(s) to be executed without any modification due to the projection at VCM encoder 1010. [0141] According to an example embodiment, the adaptation of the plurality of feature maps may comprise projecting, for example with a projection neural network, the plurality of feature maps to obtain a plurality of projected feature tensors. The set of adapted features may comprise the plurality of projected feature tensors.); and determining a second plurality of tensors from the second unit of information, feature maps of at least one tensor of the second plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the second plurality of tensors ([0119] The feature adapter 1022 may further receive an indication of at least one second coding parameter associated with at least one second subset of the feature maps, or tiles. The at least one first coding parameter may be associated with a first quality level and the at least one second coding parameter may be associated with at least one second quality level. The first quality level may be higher than the at least one second quality level. [0120] The at least one second subset of feature maps or tiles may be decoded based on the at least one second coding parameter. [0122] When the feature adapter 1012 of the VCM encoder 1010 includes a projection neural network, the feature adapter 1022 of the VCM decoder 1020 may comprise a corresponding re-projection network to compensate for the adaptation of the extracted features. Therefore, the re-projection neural network may be configured to re-project the plurality of reconstructed feature maps to obtain at least one re-projected feature tensor. For example, the re-projection neural network may re-project a projected feature tensor of shape (3, H_feat_proj, W_feat_proj), which was suitable for decoding by video decoder 1021, into a re-projected feature tensor having the original shape (num_channels, H_feat, W_feat). The task-NN(s) may be then executed based on the re-projected feature tensor. This enables the task-NN(s) to be executed without any modification due to the projection at VCM encoder 1010. [0141] According to an example embodiment, the adaptation of the plurality of feature maps may comprise projecting, for example with a projection neural network, the plurality of feature maps to obtain a plurality of projected feature tensors. 
The set of adapted features may comprise the plurality of projected feature tensors.), wherein feature maps of each tensor of the first plurality of tensors have different spatial resolution from feature maps of each tensor of the second plurality of tensors ([0119] Different subsets of feature maps to be decoded at different qualities, for example at different resolutions. The at least one first coding parameter may be associated with a first quality level and the at least one second coding parameter may be associated with at least one second quality level. The first quality level may be higher than the at least one second quality level.), and the tensors of the first plurality of tensors and the second plurality of tensors correspond to the hierarchical representation of feature maps for the single frame ([0086] Video codecs may be targeted for encoding and decoding predetermined formats of video data. The input video may be for example provided in the form of YUV frames, for example with 4:2:0 or 4:4:4 channel-sampling format. Thus, each video frame may include three channels. However, features extracted by an FX-NN may comprise a tensor, i.e. a multidimensional array, with more than three channels.). Ikonin discloses a method of decoding at least a plurality of tensors forming a hierarchical representation of feature maps for a single frame from a bitstream, the method comprising: decoding a first unit of information from the bitstream ([Page 75] In Fig. 28, the signal feeding logic 2800 of the decoder uses the segmentation information (LayerFlag) to obtain and utilize selected information (LayerMv) transmitted in the bitstream. In particular, at each layer, bitstream is parsed to obtain the segmentation information (LayerFlag) and possibly also the selected information (LayerMv) in the respective syntax interpretation units 2823, 2822, and 2821 (in this order).); decoding a second unit of information from the bitstream ([Page 75] In Fig. 28, the signal feeding logic 2800 of the decoder uses the segmentation information (LayerFlag) to obtain and utilize selected information (LayerMv) transmitted in the bitstream. In particular, at each layer, bitstream is parsed to obtain the segmentation information (LayerFlag) and possibly also the selected information (LayerMv) in the respective syntax interpretation units 2823, 2822, and 2821 (in this order).); determining a first plurality of tensors from the first unit of information ([Page 13] When programming a CNN for processing images, as shown in Fig. 1 , the input is a tensor with shape (number of images) x (image width) x (image height) x (image depth). [Page 75] During interpreting the segmentation information (LayerFlag) at each resolution layer, the tensor TakeFromCurrent is obtained (generated). This tensor TakeFromCurrent contains flags indicating whether or not feature map information (LayerMv) is present in the bitstream for each particular position of the current resolution layer. The decoder reads the values of the feature map LayerMv from the bitstream, and places them at the positions where the flags of the TakeFromCurrent tensor are equal to 1.), feature maps of at least one tensor of the first plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the first plurality of tensors ([Page 13] Then after passing through a convolutional layer, the image becomes abstracted to a feature map, with shape (number of images) x (feature map width) x (feature map height) x (feature map channels). 
[Page 73] In Fig. 27, there is a starting layer (layer 0) for decoding the segmentation information, e.g. layer of the lowest resolution, i.e. latent representation layer.); and determining a second plurality of tensors from the second unit of information, feature maps of at least one tensor of the second plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the second plurality of tensors ([Page 75-76] Fig. 29 shows another possible and exemplary implementation of the signal feeding logic 2900. This implementation generates a Layerldx tensor (called LayerldxUp in Fig. 29), containing indices of layers of different resolutions indicating which layer should be used to take motion information transferred (included at the encoder, parsed at the decoder) in the bitstream. At each syntax interpretation block (2923, 2922, 2923), Layerldx tensor is updated by adding TakeFromCurrent tensor multiplied by the upsampling layer index numbered from the highest resolution to the lowest resolution. Then the Layerldx tensor is upsampled and transferred (passed) to the next layer in the processing order, e.g. from 2923 to2922, and from 2922 to 2921. In order to make the processing in all layers similar, tensor Layerldx is zero initialized in 2920 and passed to the syntax interpretation 2923 of the first layer.), wherein feature maps of each tensor of the first plurality of tensors have different spatial resolution from feature maps of each tensor of the second plurality of tensors ([Page 75] During interpreting the segmentation information (LayerFlag) at each resolution layer, the tensor TakeFromCurrent is obtained (generated).), and the tensors of the first plurality of tensors and the second plurality of tensors correspond to the hierarchical representation of feature maps for the single frame ([Page 13] When programming a CNN for processing images, as shown in Fig. 1 , the input is a tensor with shape (number of images) x (image width) x (image height) x (image depth). Then after passing through a convolutional layer, the image becomes abstracted to a feature map, with shape (number of images) x (feature map width) x (feature map height) x (feature map channels). A convolutional layer within a neural network should have the following attributes. Convolutional kernels defined by a width and height (hyper-parameters). The number of input channels and output channels (hyper-parameter). The depth of the convolution filter (the input channels) should be equal to the number channels (depth) of the input feature map.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cricri with the teachings of Ikonin in order to improve the encoding and decoding using trained network architectures [See Ikonin]. In regards to claim 2, the limitations of claim 1 have been addressed. Cricri fails to explicitly disclose wherein respective tensors of the first and second pluralities of tensors have resolutions forming an exponential sequence with a doubling in width and height between successive tensors. Ikonin discloses wherein respective tensors of the first and second pluralities of tensors have resolutions forming an exponential sequence with a doubling in width and height between successive tensors ([Pages 37-38] Figure 11 illustrates an exemplary implementation, in which the feature map 1110 is a dense optical flow of motion vectors with a width W and a height H. 
In this example, the output (L1-L3) of each layer is a feature map with a gradually lower resolution. The input to L1 is the dense optical flow 1110. In this example, one element of a feature map output from L1 is determined from sixteen (4x4) elements of the dense optical flow 1110. Each square in the L1 output (bottom right of Fig. 11) corresponds to a motion vector obtained by downsampling (downspl4) from the sixteen motion vectors of the dense optical flow. Such downsampling may be for instance an average pooling or another operation, as discussed above. In this exemplary implementation, only a part of the feature map L1 of that layer is included in the information 1120. Layer L1 is selected and the part, corresponding to four motion vectors (feature map elements) related to the selected layer, is signaled within the selected information 1120. Then the output L1 of the first layer is input to the second layer (downspl2). An output L2 feature map element of the second layer is determined from four elements of L1 . However, in other examples, each element of feature map with a lower resolution may also be determined by a group consisting of any other number of elements of the feature map with the next higher resolution. For instance the number of elements in a group that determine one element in the next layer may also be any power of 2.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cricri with the teachings of Ikonin in order to improve the encoding and decoding using trained network architectures [See Ikonin]. In regards to claim 3, the limitations of claim 1 have been addressed. Cricri discloses wherein the first and second pluralities of tensors have a different number of channels ([0086] Video codecs may be targeted for encoding and decoding predetermined formats of video data. The input video may be for example provided in the form of YUV frames, for example with 4:2:0 or 4:4:4 channel-sampling format. Thus, each video frame may include three channels. However, features extracted by an FX-NN may comprise a tensor, i.e. a multidimensional array, with more than three channels. The number of channels may be in the order of hundreds, for example 128. Also, due to downsampling operations possibly performed in the FX-NN, the spatial size (number of rows and columns) of feature maps may be small compared to the original image or video data.). In regards to claim 4, the limitations of claim 1 have been addressed. Cricri fails to explicitly disclose wherein the plurality of tensors of the first plurality of tensors and the second plurality of tensors with higher spatial resolutions has a smaller number of channels than the other plurality of tensors. Ikonin discloses wherein the plurality of tensors of the first plurality of tensors and the second plurality of tensors with higher spatial resolutions has a smaller number of channels than the other plurality of tensors ([Page 33] In other words, the resolutions of two or more of the cascaded layers may mutually differ. Here, when referring to a resolution of a layer, what is meant is a resolution of the feature map processed by the layer. In an exemplary implementation it is the resolution of the feature map output by the layer. A feature map comprising a resolution means that at least a part of the feature map has said resolution. In some implementation, the entire feature map may have the same resolution. 
Resolution of a feature map may be given, for example, by a number of feature map elements in the feature map. However, it may also be more specifically defined by number of feature map elements in one or more dimensions (such as x, y; alternatively or in addition, number of channels may be considered). [Page 34] Lower resolution of a feature map may mean e.g. less feature elements per feature map. Higher resolution of a feature map may mean e.g. more feature elements per feature map.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cricri with the teachings of Ikonin in order to improve the encoding and decoding using trained network architectures [See Ikonin]. In regards to claim 6, the limitations of claim 1 have been addressed. Cricri discloses wherein determination of the first plurality of tensors and determination of the second plurality of tensors are independent from each other ([0099] For example, VCM encoder 1010 may determine a first subset of feature maps based on importance scores associated with the feature maps. The VCM encoder 1010 may further determine a second subset of the feature maps based on the importance scores. The importance scores of the first subset may be higher than the importance scores of the second subset. Even though in the present example there is one second subset, it is appreciated that the VCM encoder 1010 may alternatively determine more than one second subsets, where each second subset may be for example associated with a dedicated range of importance scores. The VCM encoder 1010 may determine an importance level of a feature map for example based on an L1 norm of a feature tensor. A feature map may be determined to belong to the first subset of feature maps if the L1 norm of the feature map exceeds a threshold.). In regards to claim 7, the limitations of claim 1 have been addressed. Cricri discloses wherein the first plurality of tensors and the second plurality of tensors are determined using neural network layers ([0048] A convolutional layer performs convolutional operations to extract information from input data, for example image 502, to form a plurality of feature maps 506. A feature map may be generated by applying a convolutional filter or a convolutional kernel to a subset of input data, for example block 504 in image 502, and sliding the filter through the input data to obtain a value for each element of the feature map. The filter may comprise a matrix or a tensor, which may be for example multiplied with the input data to extract features corresponding to that filter. [0105] According to an example embodiment, the feature adapter 1012 may comprise a projection neural network. The projection neural network may be pre-trained and configured to project the extracted features to obtain a plurality of projected feature tensors, which are more suitable for encoding by video encoder 1013.). In regards to claim 8, the limitations of claim 1 have been addressed. Cricri discloses wherein the first unit of information is used to determine the smallest tensor in the first plurality of tensors ([0097] Therefore, according to an example embodiment, the VCM encoder 1010 may encode and/or transmit signaling data. The signaling data may be provided by the feature adapter 1012, which may generate the signaling data during packing of the feature maps. The signaling data may for example comprise an indication of the size of the at least one target tile. 
The signaling data may further comprise an indication of a packing order of the feature maps in the image or the at least one video frame. The packing order may be for example provided as list of feature map indexes in the target order, for example by assuming a certain reading order (e.g., a raster scan). For example, packing order [3,1,2,0] may indicate that the first read feature map in the reconstructed frame has index 3 in the target order, the second has index 1, the third has index 2, and the fourth has index 0.). In regards to claim 9, the limitations of claim 1 have been addressed. Cricri discloses wherein the second unit of information is used to determine the smallest tensor in the second plurality of tensors ([0097] Therefore, according to an example embodiment, the VCM encoder 1010 may encode and/or transmit signaling data. The signaling data may be provided by the feature adapter 1012, which may generate the signaling data during packing of the feature maps. The signaling data may for example comprise an indication of the size of the at least one target tile. The signaling data may further comprise an indication of a packing order of the feature maps in the image or the at least one video frame. The packing order may be for example provided as list of feature map indexes in the target order, for example by assuming a certain reading order (e.g., a raster scan). For example, packing order [3,1,2,0] may indicate that the first read feature map in the reconstructed frame has index 3 in the target order, the second has index 1, the third has index 2, and the fourth has index 0.). In regards to claim 10, Cricri discloses a method of encoding at least a plurality of tensors to a bitstream, the plurality of tensors forming a hierarchical representation of feature maps for a single frame ([0006] According to an aspect, a method comprises obtaining a plurality of feature maps extracted from input data; performing adaptation of the plurality of feature maps to obtain a set of adapted features; and encoding the set of adapted features with an encoder.), the method comprising: using a convolutional operation to determine a first unit of information from a first plurality of tensors, feature maps of at least one tensor of the first plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the first plurality of tensors ([0048] A convolutional layer performs convolutional operations to extract information from input data, for example image 502, to form a plurality of feature maps 506. A feature map may be generated by applying a convolutional filter or a convolutional kernel to a subset of input data, for example block 504 in image 502, and sliding the filter through the input data to obtain a value for each element of the feature map. The filter may comprise a matrix or a tensor, which may be for example multiplied with the input data to extract features corresponding to that filter. A plurality of feature maps may be generated based on applying a plurality of filters. [0100] The VCM encoder 1010 may encode the first subset of feature maps with at least one first coding parameter, which may be associated with a first quality level, and encode the at least one second subset of feature maps with at least one second coding parameter, which may be associated with at least one second quality level. 
The video encoder 1013 may for example encode tiles associated with the first subset of feature maps with the first coding parameter and/or encode tiles associated with the second subset of feature maps with the second coding parameter. The first quality level may be higher than the at least one second quality level.); using a convolutional operation to determine a second unit of information from a second plurality of tensors, feature maps of at least one tensor of the second plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the second plurality of tensors, and wherein feature maps of each tensor of the first plurality of tensors have different spatial resolution from feature maps of each tensor of the second plurality of tensors, and the tensors of the first plurality of tensors and the second plurality of tensors correspond to the hierarchical representation of feature maps for the single frame ([0048] A convolutional layer performs convolutional operations to extract information from input data, for example image 502, to form a plurality of feature maps 506. A feature map may be generated by applying a convolutional filter or a convolutional kernel to a subset of input data, for example block 504 in image 502, and sliding the filter through the input data to obtain a value for each element of the feature map. The filter may comprise a matrix or a tensor, which may be for example multiplied with the input data to extract features corresponding to that filter. A plurality of feature maps may be generated based on applying a plurality of filters. [0100] The VCM encoder 1010 may encode the first subset of feature maps with at least one first coding parameter, which may be associated with a first quality level, and encode the at least one second subset of feature maps with at least one second coding parameter, which may be associated with at least one second quality level. The video encoder 1013 may for example encode tiles associated with the first subset of feature maps with the first coding parameter and/or encode tiles associated with the second subset of feature maps with the second coding parameter. The first quality level may be higher than the at least one second quality level.); encoding the first unit of information to the bitstream; and encoding the second unit of information to the bitstream ([0100] The VCM encoder 1010 may encode the first subset of feature maps with at least one first coding parameter, which may be associated with a first quality level, and encode the at least one second subset of feature maps with at least one second coding parameter, which may be associated with at least one second quality level.). Ikonin discloses a method of encoding at least a plurality of tensors to a bitstream ([Page 1] Embodiments of the present disclosure generally relate to the field of encoding data for image or video processing into a bitstream using a plurality of processing layers. In particular some embodiments relate to methods and apparatuses for such encoding.), the plurality of tensors forming a hierarchical representation of feature maps for a single frame ([Pages 1-2] A neural network usually comprises two or more layers. A feature map is an output of a layer. In a neural network that is split between devices, e.g. between encoder and decoder, a device and a cloud or between different devices, a feature map at the output of the place of splitting (e.g. 
a first device) is compressed and transmitted to the remaining layers of the neural network (e.g. to a second device). [Page 13] When programming a CNN for processing images, as shown in Fig. 1 , the input is a tensor with shape (number of images) x (image width) x (image height) x (image depth). Then after passing through a convolutional layer, the image becomes abstracted to a feature map, with shape (number of images) x (feature map width) x (feature map height) x (feature map channels). A convolutional layer within a neural network should have the following attributes. Convolutional kernels defined by a width and height (hyper-parameters). The number of input channels and output channels (hyper-parameter). The depth of the convolution filter (the input channels) should be equal to the number channels (depth) of the input feature map.), the method comprising: using a convolutional operation to determine a first unit of information from a first plurality of tensors ([Page 13] Then after passing through a convolutional layer, the image becomes abstracted to a feature map, with shape (number of images) x (feature map width) x (feature map height) x (feature map channels). [Pages 36-37] In another embodiment, convolutional operations are used for the downsampling in some or all of the layers. In convolutions, a filter kernel is applied to a group or block of elements in the input feature map. The kernel may itself be an array of elements with the same size as the block of input elements wherein each element of the kernel stores a weight for the filter operation.), feature maps of at least one tensor of the first plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the first plurality of tensors ([Page 33] processing of the data comprises, in a plurality of cascaded layers, generating feature maps, each feature map comprising a respective resolution, wherein the resolutions of at least two of the generated feature maps differ from each other In particular, the LayerMv tensor 611 is the subsampled motion vector field (feature map) which enters the cost calculation unit 613. The LayerMv tensor 611 also enters a layer information selection unit 614 of the first layer. The layer information selection unit 614 provides to the bitstream selected motion vectors in case there are selected motion vectors on this (first) layer. Its function will be further described below.); using a convolutional operation to determine a second unit of information from a second plurality of tensors, feature maps of at least one tensor of the second plurality of tensors having a different spatial resolution from feature maps of other tensor(s) of the second plurality of tensors ([Page 6] In any of the above examples, for instance, the processing comprises additional convolutional layers between the cascaded layers with different resolutions. [Page 33] processing of the data comprises, in a plurality of cascaded layers, generating feature maps, each feature map comprising a respective resolution, wherein the resolutions of at least two of the generated feature maps differ from each other [Pages 36-37] In another embodiment, convolutional operations are used for the downsampling in some or all of the layers. In convolutions, a filter kernel is applied to a group or block of elements in the input feature map. 
The kernel may itself be an array of elements with the same size as the block of input elements wherein each element of the kernel stores a weight for the filter operation.), and wherein feature maps of each tensor of the first plurality of tensors have different spatial resolution from feature maps of each tensor of the second plurality of tensors ([Page 33] In other words, the resolutions of two or more of the cascaded layers may mutually differ. Here, when referring to a resolution of a layer, what is meant is a resolution of the feature map processed by the layer. In an exemplary implementation it is the resolution of the feature map output by the layer. A feature map comprising a resolution means that at least a part of the feature map has said resolution. [Page 47-48] In particular, the LayerMv tensor 611 is the subsampled motion vector field (feature map) which enters the cost calculation unit 613. The LayerMv tensor 611 also enters a layer information selection unit 614 of the first layer. The layer information selection unit 614 provides to the bitstream selected motion vectors in case there are selected motion vectors on this (first) layer. Its function will be further described below.), and the tensors of the first plurality of tensors and the second plurality of tensors correspond to the hierarchical representation of feature maps for the single frame; encoding the first unit of information to the bitstream; and encoding the second unit of information to the bitstream ([Page 34] The method further comprises a step of selecting, among the plurality of layers, a layer different from the layer generating the feature map of the lowest resolution and generating the bitstream including inserting into the bitstream information related to the selected layer. In other words, in addition (or alternatively) to outputting into the bitstream the result of processing by all layers in the cascade, information to another (selected) layer is provided. There may be one or more selected layers. The information related to the selected layer may be any kind of information such as the output of the layer or some segmentation information of the layer (as will be discussed later) or other information also related to the feature map processed by the layer and/or to the processing performed by the layer. In other words, in some examples, the information can be elements of feature map and/or positions of the elements within the feature map (within the layer).). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Cricri with the teachings of Ikonin in order to improve the encoding and decoding using trained network architectures [See Ikonin]. Claim 11 lists all the same elements of claim 1, but in decoder form rather than method form. Therefore, the supporting rationale of the rejection to claim 1 applies equally as well to claim 11. Claim 12 lists all the same elements of claim 10, but in encoder form rather than method form. Therefore, the supporting rationale of the rejection to claim 1 applies equally as well to claim 10. Claim 13 lists all the same elements of claim 1, but in non-transitory computer-readable storage medium form rather than method form. Therefore, the supporting rationale of the rejection to claim 1 applies equally as well to claim 13. Claim 14 lists all the same elements of claim 1, but in system form rather than method form. 
Therefore, the supporting rationale of the rejection to claim 1 applies equally as well to claim 14.

Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cricri in view of Ikonin in further view of Song et al. (Hereafter, “Song”) [US 12,175,764 B1]. In regards to claim 5, the limitations of claim 1 have been addressed. Cricri discloses wherein ([0103] In case of downsampling a subset of feature maps, e.g. based on their importance scores, the VCM decoder 1020 may resample the feature maps into their original/target resolution, for example using the signaling data, which may comprise an indication of the resolution.). Song discloses wherein largest tensors of each of the first and second plurality of tensors are determined based on an upsampling operation applied to feature maps of the corresponding one of the first and second units of information ([Col. 18] upsampling occurs on a large tensor). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Song with the teachings of Ikonin in order to improve processing operations [See Song].

Contact Information

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Kaitlin A Retallick whose telephone number is (571)270-3841. The examiner can normally be reached Monday-Friday 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chris Kelley can be reached at (571) 272-7331. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KAITLIN A RETALLICK/
Primary Examiner, Art Unit 2482
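To help parse the §103 claim mapping above, the hierarchy recited in claims 1-4 can be pictured as two multi-resolution sets of tensor shapes for a single frame. The sketch below is illustrative only: the shapes, channel counts, and helper function are hypothetical and are not taken from the application or the cited references; it simply instantiates the doubling sequence of claim 2 and the fewer-channels-at-higher-resolution arrangement of claim 4.

```python
# Hedged illustration of the claim structure at issue; all shapes and channel
# counts are hypothetical, not taken from the application or the prior art.

def tensor_pyramid(base_hw, channels):
    """Return (channels, height, width) shapes, doubling H and W per level,
    as in claim 2's exponential resolution sequence."""
    h, w = base_hw
    return [(c, h * 2**i, w * 2**i) for i, c in enumerate(channels)]

# First plurality of tensors: coarser resolutions, more channels.
first_plurality = tensor_pyramid(base_hw=(17, 30), channels=[256, 256])
# Second plurality of tensors: finer resolutions, fewer channels (claim 4),
# and no resolution shared with the first plurality (claim 1's wherein clause).
second_plurality = tensor_pyramid(base_hw=(68, 120), channels=[128, 64])

print(first_plurality)   # [(256, 17, 30), (256, 34, 60)]
print(second_plurality)  # [(128, 68, 120), (64, 136, 240)]
```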

Prosecution Timeline

Jan 03, 2025: Application Filed
Jan 08, 2026: Non-Final Rejection — §103, §112 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602757
SYSTEM AND COMPUTER-IMPLEMENTED METHOD FOR IMAGE DATA QUALITY ASSURANCE IN AN INSTALLATION ARRANGED TO PERFORM ANIMAL-RELATED ACTIONS, COMPUTER PROGRAM AND NON-VOLATILE DATA CARRIER
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12604045
Encoding Control Method and Apparatus, and Decoding Control Method and Apparatus
Granted Apr 14, 2026 (2y 5m to grant)
Patent 12593058
BITSTREAM MERGING
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12587669
MOTION FLOW CODING FOR DEEP LEARNING BASED YUV VIDEO COMPRESSION
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12587678
INFORMATION PROCESSING APPARATUS AND METHOD THEREOF
Granted Mar 24, 2026 (2y 5m to grant)
Study what changed in these cases to get past this examiner. Based on the examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 75%
With Interview: 86% (+10.7%)
Median Time to Grant: 2y 7m
PTA Risk: Low
Based on 515 resolved cases by this examiner. Grant probability derived from career allow rate.
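The projection figures can be reproduced from the examiner statistics above under a simple additive-lift reading. This is a minimal sketch, assuming the grant probability equals the career allow rate and the interview lift is an additive percentage-point adjustment; the page only states that the probability is derived from the career allow rate.

```python
# Hedged sketch: reproduce the headline projections from the examiner stats,
# under the assumptions described above (not a published formula).

granted, resolved = 388, 515
career_allow_rate = granted / resolved               # ~0.753, shown as 75%
interview_lift = 0.107                               # +10.7% interview lift

grant_probability = career_allow_rate
with_interview = career_allow_rate + interview_lift  # ~0.860, shown as 86%

print(f"Grant probability: {grant_probability:.0%}")  # 75%
print(f"With interview:    {with_interview:.0%}")     # 86%
```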
