DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Foreign priority is not claimed.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 11/24/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 9 and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 8 recites the limitation “generating a main feature by a second set of neural network layers with the current feature and the reference feature as inputs”, and claim 9 recites the limitation “wherein generating the main feature comprises: down sampling the current feature”. It appears that “the main feature” in claim 9 refers to the main feature previously recited in claim 8. However, claim 9 does not depend on claim 8; therefore, there is a lack of antecedent basis for “the main feature” in claim 9.
Claim 16 recites the limitation “entropy decoding a main feature bitstream”, and claim 1 recites the limitation “packing the multi-resolution motion feature into a bitstream”. Claim 16 thus appears to decode a main feature bitstream, whereas in claim 1 a multi-resolution motion feature is packed into a bitstream. Therefore, the bitstream in claim 1 appears to correspond to the multi-resolution motion feature, and the bitstream in claim 16 appears to correspond to the main feature.
Therefore, Examiner suggests replacing “entropy decoding a main feature bitstream” with “entropy decoding a main feature bitstream different than the multi-resolution feature bitstream” in order to explicitly define and distinguish the bitstreams in claim 1 and claim 16 and thereby render the claims definite.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8, 10-15 and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Xia et al. ("Learning dynamic point cloud compression via hierarchical inter-frame block matching." Proceedings of the 31st ACM International Conference on Multimedia. 2023, as provided) in view of Sugio et al. (US Pub No. 20150229930 A1).
Regarding Claim 1,
Xia discloses A method comprising: obtaining a first motion feature generated by a first set of neural network layers with a current feature and a reference feature as inputs; (Xia, pg. 7997, KNN-attention Block-matching, Fig. 3, Frameworks of (a) Motion Estimation, (b) Motion Compression, (c) Motion Compensation, discloses that the previous work D-DPCC about DPC compression designs a simple 2-layer SparseCNN for motion estimation, which fails to calculate the explicit correspondence of blocks between two frames. Therefore, we design a KNN-attention block-matching (KABM) network for more accurate motion estimation. As Figure 4 depicts, the KABM consists of one ball-KNN search and two inter-frame attention networks. At first, the network concatenates the reference frame and the current frame, so that blocks in both frames can aggregate information simultaneously. The concatenation operation of sparse tensors is defined as follows: y_{cat,u} = y_{1,u} ⊕ y_{2,u} if u ∈ C(y_1) ∩ C(y_2); y_{cat,u} = y_{1,u} ⊕ 0 if u ∈ C(y_1), u ∉ C(y_2); y_{cat,u} = 0 ⊕ y_{2,u} if u ∉ C(y_1), u ∈ C(y_2) (3), where y_1 and y_2 are latent features of the reference and current frames, C(y_1) and C(y_2) are the corresponding integer coordinates, and ⊕ means the concatenation of feature channels. Then, the network uses ball-KNN to find the neighbors of y_cat limited in a spherical region (the maximum number of neighbors is K) and generates the neighbors’ attributes, including relative coordinates, distances, and features. After that, the network uses two inter-frame attention networks to aggregate features between two frames and output an original flow embedding e_{o,t} that encapsulates the matching information. Each inter-frame attention network generates an attention weight matrix for the neighbors based on their attribute vectors and uses softmax to normalize the weight matrix; current frame and reference frame motion features are aggregated to generate concatenated motion features using neural network model 1)
obtaining a second motion feature generated by a second set of neural network layers with a downsampled current feature and a downsampled reference feature as inputs; (Xia, pg. 7997, discloses that Figure 1 illustrates the structure of the hierarchical ME/MC module (Hie-ME/MC). The framework is divided into two parts. First, the low-resolution inter-prediction takes the 3× down-sampled latent features of the previous and current frames, i.e., ŷ_{t-1}^{3↓} and y_t^{3↓}, as input to generate the low-resolution optical flow e_{t,l}, which describes coarse and large movements between blocks. In other words, the lower-scale motion estimation captures the "base" movements of a large block of points, corresponding to larger macroblock or coding unit partitions in AVC, HEVC, and VVC. The reconstructed low-resolution optical flow ê_{t,l} is up-sampled to match the scale of ŷ_{t-1}^{2↓} and y_t^{2↓} for motion compensation. Finally, the network performs a rough compensation from ŷ_{t-1}^{2↓} to y_t^{2↓} with the low-resolution flow embedding ê_{t,l} and produces an initial prediction ȳ_{t,ini} for y_t^{2↓}. The high-resolution inter-prediction inherits the architecture of the low-resolution inter-prediction. However, ȳ_{t,ini} instead of ŷ_{t-1}^{2↓} serves as the reference frame to estimate the high-resolution flow embedding e_{t,h}. Since the low-resolution ME/MC has already captured coarse movements at the low resolution (3× downsampled), this stage further extracts finer-grained motion at a higher resolution (2× downsampled) and outputs the final predicted latent representation ȳ_{t,final} for y_t^{2↓}; motion features are downsampled and combined between previous (reference frame) and current frame to capture base movements of a large block of points as large movements between blocks using neural layer model 2)
generating a third motion feature by upsampling the second motion feature; (Xia, Fig. 1, Overview of proposed framework, Fig. 2, Frameworks of (a) Feature Extraction, (b) Point Cloud Reconstruction, (c) Residual Compression, discloses 3.4 Residual Compression: As Figure 2(c) shows, the Residual Compression module encodes the residual r_t between y_t^{2↓} and the final prediction ȳ_{t,final} with an Auto-Encoder (AE) style network. On the encoder side, a downsample block and a convolution layer serve as the parametric analysis transform that transforms r_t into a more compact latent representation l_{r_t} = {C(y_t^{3↓}), F(l_{r_t})}. The coordinate C(y_t^{3↓}) is losslessly compressed by G-PCC octree v14, and the feature F(l_{r_t}) is quantized and coded using a non-parametric, fully factorized density model [3]. On the decoder side, an upsample block serves as the parametric synthesis transform to recover the reconstructed residual r̂_t. The compression of the fused flow embedding e_t follows a similar process, as Figure 3(b) shows, where a convolution/transpose convolution layer serves as the encoder/decoder. 3.5 Point Cloud Reconstruction: The point cloud reconstruction module contains upsample blocks [9, 32] symmetric to the feature extraction, which hierarchically recover the current frame x̂_t from the reconstructed latent feature y'_t, as depicted in Figure 2(b). In addition, a sparse convolution layer with one output channel is utilized to generate the occupancy possibilities of each point. We adopt an adaptive pruning strategy to remove false points based on the occupancy possibility; motion features are up sampled to recover the reconstructed residual)
generating a multi-resolution motion feature by a third set of neural network layers by merging the first and the third motion features; (Xia, pg. 7996, 3.2, Feature Extraction, Figure 2(a), discloses that the feature extraction module uses serially connected down-sampling blocks to hierarchically reduce spatial redundancies and produce latent features of x_t and x̂_{t-1}, i.e., y_t and ŷ_{t-1}. Different from the approach with only 2× down-sampling that generates single-scale latent features y_t^{2↓}/ŷ_{t-1}^{2↓}, this paper’s feature extraction module contains one more down-sampling block to produce the 3× down-sampled latent feature y_t^{3↓}/ŷ_{t-1}^{3↓}. The multi-scale latent features will be used for the subsequent hierarchical block-matching-based inter-prediction. 3.3 Hierarchical Block-matching-based Inter-prediction: As shown in Figure 3, we adopt the ME/MC framework in the one-shot inter-prediction module of D-DPCC. Specifically, the motion estimation module analyses the temporal correlation between the latent features y_t and ŷ_{t-1} with a KNN-attention block-matching (KABM) network and generates an original flow embedding e_{o,t}. e_{o,t} passes through a Multi-scale Motion Fusion (MMF) module to generate the multi-scale fused flow embedding e_t. The motion compression module is an Auto-Encoder-style network that compresses and decompresses e_t with a non-parametric, fully factorized density model, which will be introduced in 3.4 in detail. For motion compensation, the decompressed flow embedding ê_t is fed into a Multi-scale Motion Reconstruction (MMR) module. MMR restores and fuses the coarse- and fine-grained optical flow to generate the final optical flow m_t. Channel-wise Motion Compensation: The overall optical flow m_t ∈ R^{N×64×3} consists of 64 separate optical flows for every channel of the latent feature y_t^{2↓} ∈ R^{N×64}, so that each channel can find its own match in the reference frame to provide richer temporal context. Specifically, the network first warps the coordinates in y_t^{2↓} for each channel separately: u_w^{(i)} = u + m_{t,u}^{(i)}, ∀u ∈ C(y_t^{2↓}) (1), where C(y_t^{2↓}) is the coordinates of y_t^{2↓}, u denotes an arbitrary coordinate in C(y_t^{2↓}), i is the channel index, u_w^{(i)} is the warped coordinate of u of the i-th channel, and m_{t,u}^{(i)} = (Δx, Δy, Δz) is the optical flow located at u of the i-th channel. Subsequently, considering the sparse nature of point clouds and the derivability of prediction, we adopt the 3D Adaptively Weighted Interpolation (3DAWI) algorithm for motion compensation: ȳ_{t,u}^{(i)} = [Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1} · y_{ref,v}^{(i)}] / max(Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1}, α), ∀u ∈ C(y_t^{2↓}) (2), where ϑ(u_w^{(i)}) is the 3-nearest-neighbor set of u_w^{(i)}, d(u_w^{(i)}, v)^{-1} is the inverse Euclidean distance between u_w^{(i)} and the neighbor v, and y_{ref,v}^{(i)} is the feature value of the reference frame y_ref defined at position v and channel i. In the low-resolution inter-prediction, y_ref is ŷ_{t-1}^{2↓}, and in the high-resolution inter-prediction, y_ref is ȳ_{t,ini}. α is a hyperparameter named penalty coefficient that adaptively decreases the weight of isolated warped points. Note that C(y_t^{2↓}), which is losslessly coded, is required in Equations 1 and 2; interframe motion features are concatenated to generate combined motion features) and
Xia does not explicitly disclose packing the multi-resolution motion feature into a bitstream.
Sugio discloses packing the multi-resolution motion feature into a bitstream. (Sugio, [0131-0132], discloses the inter prediction control unit 114 selects the predicted motion vector in the second prediction direction from the candidate predicted motion vector list for the second prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S112); when the prediction direction is not the second prediction direction (No at S108), the inter prediction control unit 114 calculates a candidate predicted motion vector in the first prediction direction (S113). Next, the inter prediction control unit 114 selects the predicted motion vector in the first prediction direction from the candidate predicted motion vector list for the first prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S114); concatenated motion feature vectors are packed into bitstream)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Xia, having a method of concatenating up sampled and down sampled motion features from current and reference frames, with the teachings of Sugio, having a module of concatenating motion feature vectors from inter frames into a bitstream, in order to compress the features of relevance and reduce computational resources.
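For illustration only, the sparse-tensor concatenation that Xia defines in Equation 3 (union of the two coordinate sets, with zero-filled channels where a coordinate is present in only one frame) can be sketched as follows. This is a minimal sketch, not Xia's implementation: the function name sparse_concat, the dictionary representation of a sparse tensor, and the explicit channel-count arguments are assumptions introduced here.

```python
def sparse_concat(y1, y2, c1, c2):
    """Concatenate two sparse feature maps over the union of their coordinates.

    y1, y2: dicts mapping an integer coordinate tuple to a list of features
            (a hypothetical stand-in for a sparse tensor).
    c1, c2: channel counts of y1 and y2, used for zero padding.
    """
    out = {}
    for u in sorted(set(y1) | set(y2)):
        f1 = y1.get(u, [0.0] * c1)  # zero-fill when u is not in C(y1)
        f2 = y2.get(u, [0.0] * c2)  # zero-fill when u is not in C(y2)
        out[u] = list(f1) + list(f2)  # channel-wise concatenation
    return out


# Reference and current frames that share only the coordinate (1, 0, 0).
ref = {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [3.0, 4.0]}
cur = {(1, 0, 0): [5.0, 6.0], (0, 1, 0): [7.0, 8.0]}
cat = sparse_concat(ref, cur, 2, 2)
```

Every coordinate occupied in either frame survives the concatenation, which is what allows blocks in both frames to aggregate information in the subsequent attention layers.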
Regarding Claim 2,
The combination of Xia and Sugio further discloses obtaining a fourth motion feature generated by a fourth set of neural network layers with inputs of a two or more time downsampled current feature and a two or more time downsampled reference feature; and generating a fifth motion feature by upsampling two or more times the fourth motion feature, wherein generating the multi-resolution motion feature further comprises merging the fifth motion feature with the first and third motion features. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A lossless compressor of C(y_t^{2↓}), discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2× down-sampled latent feature of the current frame in Motion Compensation, i.e., C(y_t^{2↓}). As shown in Figure 10, we losslessly code C(y_t^{2↓}) with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3× down-sampled coordinates C(y_t^{3↓}) are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of the 2× down-sampled coordinates and can obtain C(y_t^{2↓}) by up-sampling C(y_t^{3↓}) and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in C(y_t^{2↓}), denoted as F(y_t^{2↓}) = 1, 1 ∈ R^{N×1}. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of F(y_t^{2↓}), i.e., F̂(y_t^{2↓}). The Binary Cross Entropy (BCE) between F(y_t^{2↓}) and F̂(y_t^{2↓}) is also calculated for lossless compression; deep-learning-based video compression methods have recently emerged. DVC is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., by proposing feature space motion compensation and by further introducing multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being up sampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 3,
The combination of Xia and Sugio further discloses wherein obtaining the first motion feature comprises: concatenating the current feature and the reference feature; performing a feature enhancement process on the concatenated current and reference features; and pruning the feature enhanced features to generate the first motion feature. (Xia, pg. 7996, 3.2, Feature Extraction, Figure 2(a), discloses that the feature extraction module uses serially connected down-sampling blocks to hierarchically reduce spatial redundancies and produce latent features of x_t and x̂_{t-1}, i.e., y_t and ŷ_{t-1}. Different from the approach with only 2× down-sampling that generates single-scale latent features y_t^{2↓}/ŷ_{t-1}^{2↓}, this paper’s feature extraction module contains one more down-sampling block to produce the 3× down-sampled latent feature y_t^{3↓}/ŷ_{t-1}^{3↓}. The multi-scale latent features will be used for the subsequent hierarchical block-matching-based inter-prediction. 3.3 Hierarchical Block-matching-based Inter-prediction: As shown in Figure 3, we adopt the ME/MC framework in the one-shot inter-prediction module of D-DPCC. Specifically, the motion estimation module analyses the temporal correlation between the latent features y_t and ŷ_{t-1} with a KNN-attention block-matching (KABM) network and generates an original flow embedding e_{o,t}. e_{o,t} passes through a Multi-scale Motion Fusion (MMF) module to generate the multi-scale fused flow embedding e_t. The motion compression module is an Auto-Encoder-style network that compresses and decompresses e_t with a non-parametric, fully factorized density model, which will be introduced in 3.4 in detail. For motion compensation, the decompressed flow embedding ê_t is fed into a Multi-scale Motion Reconstruction (MMR) module. MMR restores and fuses the coarse- and fine-grained optical flow to generate the final optical flow m_t. Channel-wise Motion Compensation: The overall optical flow m_t ∈ R^{N×64×3} consists of 64 separate optical flows for every channel of the latent feature y_t^{2↓} ∈ R^{N×64}, so that each channel can find its own match in the reference frame to provide richer temporal context. Specifically, the network first warps the coordinates in y_t^{2↓} for each channel separately: u_w^{(i)} = u + m_{t,u}^{(i)}, ∀u ∈ C(y_t^{2↓}) (1), where C(y_t^{2↓}) is the coordinates of y_t^{2↓}, u denotes an arbitrary coordinate in C(y_t^{2↓}), i is the channel index, u_w^{(i)} is the warped coordinate of u of the i-th channel, and m_{t,u}^{(i)} = (Δx, Δy, Δz) is the optical flow located at u of the i-th channel. Subsequently, considering the sparse nature of point clouds and the derivability of prediction, we adopt the 3D Adaptively Weighted Interpolation (3DAWI) algorithm [9] for motion compensation: ȳ_{t,u}^{(i)} = [Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1} · y_{ref,v}^{(i)}] / max(Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1}, α), ∀u ∈ C(y_t^{2↓}) (2), where ϑ(u_w^{(i)}) is the 3-nearest-neighbor set of u_w^{(i)}, d(u_w^{(i)}, v)^{-1} is the inverse Euclidean distance between u_w^{(i)} and the neighbor v, and y_{ref,v}^{(i)} is the feature value of the reference frame y_ref defined at position v and channel i. In the low-resolution inter-prediction, y_ref is ŷ_{t-1}^{2↓}, and in the high-resolution inter-prediction, y_ref is ȳ_{t,ini}. α is a hyperparameter named penalty coefficient that adaptively decreases the weight of isolated warped points. Note that C(y_t^{2↓}), which is losslessly coded, is required in Equations 1 and 2; interframe motion features are concatenated to generate combined motion features). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
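For illustration only, the 3D Adaptively Weighted Interpolation cited above (an inverse-distance-weighted sum over the 3 nearest reference points, divided by the larger of the weight sum and the penalty coefficient α) can be sketched for a single channel as follows. This is a minimal sketch under stated assumptions: the function name awi_3d, the brute-force neighbor search, the default value of alpha, and the small eps guard against division by zero are all introduced here and are not taken from Xia.

```python
import math


def awi_3d(warped, ref_coords, ref_feats, alpha=3.0, k=3):
    """Interpolate one feature channel at a warped coordinate u_w.

    warped:     (x, y, z) warped coordinate.
    ref_coords: list of (x, y, z) coordinates of the reference frame.
    ref_feats:  list of scalar feature values at ref_coords (one channel).
    alpha:      penalty coefficient; the default is an arbitrary placeholder.
    """
    eps = 1e-8  # assumption: avoids 1/0 when u_w coincides with a reference point
    # Brute-force k-nearest-neighbor search (k = 3 in the cited passage).
    nearest = sorted(
        (math.dist(warped, c), f) for c, f in zip(ref_coords, ref_feats)
    )[:k]
    weights = [(1.0 / (d + eps), f) for d, f in nearest]
    numerator = sum(w * f for w, f in weights)
    # max(..., alpha) shrinks the output when all neighbors are far away.
    denominator = max(sum(w for w, _ in weights), alpha)
    return numerator / denominator


coords = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
feats = [2.0, 4.0, 6.0]
plain = awi_3d((0, 0, 0), coords, feats, alpha=1.0)     # weight sum exceeds alpha
penalized = awi_3d((0, 0, 0), coords, feats, alpha=6.0)  # alpha dominates
```

With the weight sum above α the result is an ordinary inverse-distance average; once α dominates, the same neighbors yield a smaller value, which is how isolated warped points are down-weighted.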
Regarding Claim 4,
The combination of Xia and Sugio further discloses wherein obtaining the second motion feature comprises: downsampling the current and reference features; concatenating the downsampled current feature and the downsampled reference feature; performing a feature enhancement process on the concatenated features; and pruning the feature enhanced features to generate the second motion feature. (Xia, pg. 7996, 3.2, Feature Extraction, Figure 2(a), discloses that the feature extraction module uses serially connected down-sampling blocks to hierarchically reduce spatial redundancies and produce latent features of x_t and x̂_{t-1}, i.e., y_t and ŷ_{t-1}. Different from the approach with only 2× down-sampling that generates single-scale latent features y_t^{2↓}/ŷ_{t-1}^{2↓}, this paper’s feature extraction module contains one more down-sampling block to produce the 3× down-sampled latent feature y_t^{3↓}/ŷ_{t-1}^{3↓}. The multi-scale latent features will be used for the subsequent hierarchical block-matching-based inter-prediction. 3.3 Hierarchical Block-matching-based Inter-prediction: As shown in Figure 3, we adopt the ME/MC framework in the one-shot inter-prediction module of D-DPCC. Specifically, the motion estimation module analyses the temporal correlation between the latent features y_t and ŷ_{t-1} with a KNN-attention block-matching (KABM) network and generates an original flow embedding e_{o,t}. e_{o,t} passes through a Multi-scale Motion Fusion (MMF) module to generate the multi-scale fused flow embedding e_t. The motion compression module is an Auto-Encoder-style network that compresses and decompresses e_t with a non-parametric, fully factorized density model, which will be introduced in 3.4 in detail. For motion compensation, the decompressed flow embedding ê_t is fed into a Multi-scale Motion Reconstruction (MMR) module. MMR restores and fuses the coarse- and fine-grained optical flow to generate the final optical flow m_t. Channel-wise Motion Compensation: The overall optical flow m_t ∈ R^{N×64×3} consists of 64 separate optical flows for every channel of the latent feature y_t^{2↓} ∈ R^{N×64}, so that each channel can find its own match in the reference frame to provide richer temporal context. Specifically, the network first warps the coordinates in y_t^{2↓} for each channel separately: u_w^{(i)} = u + m_{t,u}^{(i)}, ∀u ∈ C(y_t^{2↓}) (1), where C(y_t^{2↓}) is the coordinates of y_t^{2↓}, u denotes an arbitrary coordinate in C(y_t^{2↓}), i is the channel index, u_w^{(i)} is the warped coordinate of u of the i-th channel, and m_{t,u}^{(i)} = (Δx, Δy, Δz) is the optical flow located at u of the i-th channel. Subsequently, considering the sparse nature of point clouds and the derivability of prediction, we adopt the 3D Adaptively Weighted Interpolation (3DAWI) algorithm [9] for motion compensation: ȳ_{t,u}^{(i)} = [Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1} · y_{ref,v}^{(i)}] / max(Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1}, α), ∀u ∈ C(y_t^{2↓}) (2), where ϑ(u_w^{(i)}) is the 3-nearest-neighbor set of u_w^{(i)}, d(u_w^{(i)}, v)^{-1} is the inverse Euclidean distance between u_w^{(i)} and the neighbor v, and y_{ref,v}^{(i)} is the feature value of the reference frame y_ref defined at position v and channel i. In the low-resolution inter-prediction, y_ref is ŷ_{t-1}^{2↓}, and in the high-resolution inter-prediction, y_ref is ȳ_{t,ini}. α is a hyperparameter named penalty coefficient that adaptively decreases the weight of isolated warped points. Note that C(y_t^{2↓}), which is losslessly coded, is required in Equations 1 and 2; interframe motion features are concatenated to generate combined motion features). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 5,
The combination of Xia and Sugio further discloses wherein generating the third motion feature comprises: upsampling the second motion feature; pruning the upsampled second motion feature to generate the third motion feature; and passing at least one of the second or the third motion feature through a fifth set of neural network layers at least one of before or after upsampling the second motion feature. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A lossless compressor of C(y_t^{2↓}), discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2× down-sampled latent feature of the current frame in Motion Compensation, i.e., C(y_t^{2↓}). As shown in Figure 10, we losslessly code C(y_t^{2↓}) with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3× down-sampled coordinates C(y_t^{3↓}) are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of the 2× down-sampled coordinates and can obtain C(y_t^{2↓}) by up-sampling C(y_t^{3↓}) and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in C(y_t^{2↓}), denoted as F(y_t^{2↓}) = 1, 1 ∈ R^{N×1}. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of F(y_t^{2↓}), i.e., F̂(y_t^{2↓}). The Binary Cross Entropy (BCE) between F(y_t^{2↓}) and F̂(y_t^{2↓}) is also calculated for lossless compression; deep-learning-based video compression methods have recently emerged. DVC [25] is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., by proposing feature space motion compensation and by further introducing multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being up sampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
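For illustration only, the "up-sampling and pruning based on occupancy" step described in the cited passage can be sketched as follows. This is a minimal sketch under stated assumptions: each coarse voxel is expanded into its child voxels with a stride of 2, and a fixed probability threshold stands in for the adaptive pruning strategy that Xia describes. The function names, the stride, and the threshold value are hypothetical.

```python
def upsample_coords(coords, stride=2):
    """Expand every coarse voxel into its stride**3 candidate child voxels."""
    children = set()
    for (x, y, z) in coords:
        for dx in range(stride):
            for dy in range(stride):
                for dz in range(stride):
                    children.add(
                        (x * stride + dx, y * stride + dy, z * stride + dz)
                    )
    return children


def prune(candidates, occupancy_prob, threshold=0.5):
    """Keep candidates whose predicted occupancy exceeds the threshold.

    A fixed threshold is a simplification used here in place of the
    adaptive pruning mentioned in the cited passage.
    """
    return {c for c in candidates if occupancy_prob.get(c, 0.0) > threshold}


coarse = {(0, 0, 0)}
candidates = upsample_coords(coarse)
# Hypothetical occupancy probabilities, e.g. produced by a decoder network.
probs = {(0, 0, 0): 0.9, (1, 1, 1): 0.7, (1, 0, 0): 0.2}
kept = prune(candidates, probs)
```

Upsampling alone over-generates coordinates (eight candidates per coarse voxel in this sketch), so the pruning step is what restores a sparse coordinate set at the finer scale.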
Regarding Claim 6,
The combination of Xia and Sugio further discloses concatenating the first and third motion features; and performing a feature enhancement neural network layer process on the concatenated motion features to generate the multi-resolution motion feature. (Xia, pg. 7996, 3.2, Feature Extraction, Figure 2(a), discloses that the feature extraction module uses serially connected down-sampling blocks to hierarchically reduce spatial redundancies and produce latent features of x_t and x̂_{t-1}, i.e., y_t and ŷ_{t-1}. Different from the approach with only 2× down-sampling that generates single-scale latent features y_t^{2↓}/ŷ_{t-1}^{2↓}, this paper’s feature extraction module contains one more down-sampling block to produce the 3× down-sampled latent feature y_t^{3↓}/ŷ_{t-1}^{3↓}. The multi-scale latent features will be used for the subsequent hierarchical block-matching-based inter-prediction. 3.3 Hierarchical Block-matching-based Inter-prediction: As shown in Figure 3, we adopt the ME/MC framework in the one-shot inter-prediction module of D-DPCC. Specifically, the motion estimation module analyses the temporal correlation between the latent features y_t and ŷ_{t-1} with a KNN-attention block-matching (KABM) network and generates an original flow embedding e_{o,t}. e_{o,t} passes through a Multi-scale Motion Fusion (MMF) module to generate the multi-scale fused flow embedding e_t. The motion compression module is an Auto-Encoder-style network that compresses and decompresses e_t with a non-parametric, fully factorized density model, which will be introduced in 3.4 in detail. For motion compensation, the decompressed flow embedding ê_t is fed into a Multi-scale Motion Reconstruction (MMR) module. MMR restores and fuses the coarse- and fine-grained optical flow to generate the final optical flow m_t. Channel-wise Motion Compensation: The overall optical flow m_t ∈ R^{N×64×3} consists of 64 separate optical flows for every channel of the latent feature y_t^{2↓} ∈ R^{N×64}, so that each channel can find its own match in the reference frame to provide richer temporal context. Specifically, the network first warps the coordinates in y_t^{2↓} for each channel separately: u_w^{(i)} = u + m_{t,u}^{(i)}, ∀u ∈ C(y_t^{2↓}) (1), where C(y_t^{2↓}) is the coordinates of y_t^{2↓}, u denotes an arbitrary coordinate in C(y_t^{2↓}), i is the channel index, u_w^{(i)} is the warped coordinate of u of the i-th channel, and m_{t,u}^{(i)} = (Δx, Δy, Δz) is the optical flow located at u of the i-th channel. Subsequently, considering the sparse nature of point clouds and the derivability of prediction, we adopt the 3D Adaptively Weighted Interpolation (3DAWI) algorithm [9] for motion compensation: ȳ_{t,u}^{(i)} = [Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1} · y_{ref,v}^{(i)}] / max(Σ_{v∈ϑ(u_w^{(i)})} d(u_w^{(i)}, v)^{-1}, α), ∀u ∈ C(y_t^{2↓}) (2), where ϑ(u_w^{(i)}) is the 3-nearest-neighbor set of u_w^{(i)}, d(u_w^{(i)}, v)^{-1} is the inverse Euclidean distance between u_w^{(i)} and the neighbor v, and y_{ref,v}^{(i)} is the feature value of the reference frame y_ref defined at position v and channel i. In the low-resolution inter-prediction, y_ref is ŷ_{t-1}^{2↓}, and in the high-resolution inter-prediction, y_ref is ȳ_{t,ini}. α is a hyperparameter named penalty coefficient that adaptively decreases the weight of isolated warped points. Note that C(y_t^{2↓}), which is losslessly coded, is required in Equations 1 and 2; interframe motion features are concatenated to generate combined motion features). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 7,
The combination of Xia and Sugio further discloses wherein packing the multi-resolution motion feature into the bitstream comprises: quantizing the multi-resolution motion feature; entropy encoding the quantized multi-resolution motion feature; and arranging the entropy encoded multi-resolution motion feature into the bitstream. (Xia, pg. 7997, Fig. 2c, 3.4, Residual Compression, discloses that the Residual Compression module encodes the residual r_t between y_t^{2↓} and the final prediction ȳ_{t,final} with an Auto-Encoder (AE) style network. On the encoder side, a downsample block and a convolution layer serve as the parametric analysis transform that transforms r_t into a more compact latent representation l_{r_t} = {C(y_t^{3↓}), F(l_{r_t})}. The coordinate C(y_t^{3↓}) is losslessly compressed by G-PCC octree v14 [31], and the feature F(l_{r_t}) is quantized and coded using a non-parametric, fully factorized density model. On the decoder side, an upsample block serves as the parametric synthesis transform to recover the reconstructed residual r̂_t. The compression of the fused flow embedding e_t follows a similar process, as Figure 3(b) shows, where a convolution/transpose convolution layer serves as the encoder/decoder; entropy encoding and quantized motion feature is disclosed). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
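For illustration only, the three recited steps (quantizing, entropy encoding, and arranging the result into a bitstream) can be sketched as follows. This is a minimal sketch and not the method of the application or of either reference: zlib's DEFLATE is used as a generic stand-in for a learned, fully factorized entropy model with an arithmetic codec, and the function names, the quantization step, and the 8-byte header layout are assumptions introduced here.

```python
import struct
import zlib


def pack_features(features, step=0.5):
    """Quantize, entropy code, and pack a list of floats into bytes."""
    # 1) Uniform scalar quantization to integer symbols.
    symbols = [int(round(f / step)) for f in features]
    # 2) Entropy coding stand-in: serialize as int16 and DEFLATE-compress.
    #    Symbols must fit in the int16 range for this sketch.
    raw = struct.pack(f"<{len(symbols)}h", *symbols)
    payload = zlib.compress(raw)
    # 3) Arrange into a bitstream: header (symbol count, step) then payload.
    header = struct.pack("<If", len(symbols), step)
    return header + payload


def unpack_features(bitstream):
    """Inverse of pack_features, returning the dequantized values."""
    count, step = struct.unpack("<If", bitstream[:8])
    raw = zlib.decompress(bitstream[8:])
    symbols = struct.unpack(f"<{count}h", raw)
    return [s * step for s in symbols]


stream = pack_features([0.0, 1.2, -0.7, 3.1])
restored = unpack_features(stream)
```

The round trip returns the quantized values rather than the inputs, which shows that the loss is introduced by the quantization step while the entropy coding and packing steps are lossless.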
Regarding Claim 8,
The combination of Xia and Sugio further discloses generating a main feature by a second set of neural network layers with the current feature and the reference feature as inputs. (Xia, pg. 7997, KNN-attention Block-matching, Fig. 3, Frameworks of (a) Motion Estimation, (b) Motion Compression, (c) Motion Compensation, discloses that the previous work D-DPCC about DPC compression designs a simple 2-layer SparseCNN for motion estimation, which fails to calculate the explicit correspondence of blocks between two frames. Therefore, we design a KNN-attention block-matching (KABM) network for more accurate motion estimation. As Figure 4 depicts, the KABM consists of one ball-KNN search and two inter-frame attention networks. At first, the network concatenates the reference frame and the current frame, so that blocks in both frames can aggregate information simultaneously. The concatenation operation of sparse tensors is defined as follows: y_{cat,u} = y_{1,u} ⊕ y_{2,u} if u ∈ C(y_1) ∩ C(y_2); y_{cat,u} = y_{1,u} ⊕ 0 if u ∈ C(y_1), u ∉ C(y_2); y_{cat,u} = 0 ⊕ y_{2,u} if u ∉ C(y_1), u ∈ C(y_2) (3), where y_1 and y_2 are latent features of the reference and current frames, C(y_1) and C(y_2) are the corresponding integer coordinates, and ⊕ means the concatenation of feature channels. Then, the network uses ball-KNN to find the neighbors of y_cat limited in a spherical region (the maximum number of neighbors is K) and generates the neighbors’ attributes, including relative coordinates, distances, and features. After that, the network uses two inter-frame attention networks to aggregate features between two frames and output an original flow embedding e_{o,t} that encapsulates the matching information. Each inter-frame attention network generates an attention weight matrix for the neighbors based on their attribute vectors and uses softmax to normalize the weight matrix; current frame and reference frame motion features are aggregated to generate concatenated motion features using neural network model 1).
(Sugio, [0131-0132], discloses that the inter prediction control unit 114 selects the predicted motion vector in the second prediction direction from the candidate predicted motion vector list for the second prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S112); when the prediction direction is not the second prediction direction (No at S108), the inter prediction control unit 114 calculates a candidate predicted motion vector in the first prediction direction (S113). Next, the inter prediction control unit 114 selects the predicted motion vector in the first prediction direction from the candidate predicted motion vector list for the first prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S114); concatenated motion feature vectors are packed into a bitstream). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
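For illustration only, the sparse-tensor concatenation quoted from Xia (equation (3)) can be sketched as follows. This is a minimal sketch, not Xia's implementation: the dictionary-based `Sparse` type, the function name `sparse_concat`, and the explicit channel counts `c1` and `c2` are assumptions introduced here.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]
# Hypothetical stand-in for a sparse tensor: occupied coordinate -> feature channels.
Sparse = Dict[Coord, List[float]]


def sparse_concat(y1: Sparse, y2: Sparse, c1: int, c2: int) -> Sparse:
    """Concatenate feature channels over the union of occupied coordinates.

    c1 and c2 are the channel counts of y1 and y2. A coordinate missing from
    one tensor contributes a zero vector of that tensor's channel width.
    """
    out: Sparse = {}
    for u in sorted(set(y1) | set(y2)):
        left = y1.get(u, [0.0] * c1)    # y1 features, or zeros if u is not in C(y1)
        right = y2.get(u, [0.0] * c2)   # y2 features, or zeros if u is not in C(y2)
        out[u] = left + right           # list concatenation = channel concatenation
    return out


# Reference-frame and current-frame latent features with two channels each.
ref = {(0, 0, 0): [1.0, 2.0], (1, 0, 0): [3.0, 4.0]}
cur = {(1, 0, 0): [5.0, 6.0], (2, 0, 0): [7.0, 8.0]}
cat = sparse_concat(ref, cur, 2, 2)
```

In this sketch, a coordinate occupied in only one frame is zero-padded on the other side, which corresponds to the second and third cases of the quoted definition.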
Regarding Claim 10,
The combination of Xia and Sugio further discloses reconstructing a point cloud by a separate set of neural network layers with the bitstream as an input. (Xia, pg. 7995-7996, Fig. 1-2, 3.1 Overview, pg. 8002, Model Complexity, discloses our network consists of five modules: feature extraction, low-resolution inter-prediction, high-resolution inter-prediction, residual compression, and point cloud reconstruction. Specifically, let $x_{t-1} = \{C(x_{t-1}), F(x_{t-1})\}$ and $x_t = \{C(x_t), F(x_t)\}$ be two consecutive point cloud frames, where $C(x_{t-1})$ and $C(x_t)$ are coordinates of occupied points, $F(x_{t-1})$ and $F(x_t)$ are associated features with all-one values that indicate point occupancy. The network's input is the current frame $x_t$ and the previously reconstructed frame (reference frame) $\hat{x}_{t-1}$. The feature extraction module encodes $x_t$ and $\hat{x}_{t-1}$ into latent features $y_t$ and $\hat{y}_{t-1}$ of different scales, where $y_t^{2\downarrow}$/$\hat{y}_{t-1}^{2\downarrow}$ and $y_t^{3\downarrow}$/$\hat{y}_{t-1}^{3\downarrow}$ separately indicate the 2X and 3X downsampled latent features. $y_t^{3\downarrow}$ and $\hat{y}_{t-1}^{3\downarrow}$ are the input of the low-resolution inter-prediction module, which generates the low-resolution flow embedding $e_{t,l}$ and the initial prediction of $y_t^{2\downarrow}$, i.e., $y_{t,ini}$. $y_{t,ini}$ serves as the reference frame of the high-resolution inter-prediction that generates the high-resolution flow embedding $e_{t,h}$ and the final prediction $y_{t,final}$. The residual compression module compresses and decompresses the residual $r_t$ between $y_t^{2\downarrow}$ and $y_{t,final}$. On the decoder side, $y_{t,final}$ is added with the decompressed residual $\hat{r}_t$ to get the reconstructed latent feature $\hat{y}_t^{2\downarrow}$. $\hat{y}_t^{2\downarrow}$ enters the point cloud reconstruction module to generate the reconstructed current frame $\hat{x}_t$. In the following sections, we will introduce the architecture of each module in detail; As shown in Table 3, we calculate and compare the runtime of the methods mentioned above. The encoding time ("Enc") refers to the interval between the time points when the point cloud data is input into the model, and when the bitstreams are generated by the entropy encoder and written into bin documents. The decoding time ("Dec") refers to the interval between the time points when the bin documents are input into the entropy decoder, and when the point cloud frame is reconstructed. The unit of encoding/decoding time is seconds per frame (s/frame). The time cost is found by averaging the encoding and decoding time of the first 10 frames of the sequence basketball; reconstruction of motion features is performed and a bitstream is generated of the reconstructed features). (Sugio, [0131-0132], discloses that the inter prediction control unit 114 selects the predicted motion vector in the second prediction direction from the candidate predicted motion vector list for the second prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S112); when the prediction direction is not the second prediction direction (No at S108), the inter prediction control unit 114 calculates a candidate predicted motion vector in the first prediction direction (S113). Next, the inter prediction control unit 114 selects the predicted motion vector in the first prediction direction from the candidate predicted motion vector list for the first prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S114); concatenated motion feature vectors are packed into a bitstream).
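As a rough illustration of the residual path described in the Xia overview quoted above (a residual between the current latent feature and its final prediction is compressed, and the decoder adds the decompressed residual back onto the prediction), a minimal sketch follows. The uniform rounding quantizer, the step size, and the names `encode_residual` and `decode_feature` are assumptions made here for illustration; Xia's residual compression module is a learned network, which this sketch does not reproduce.

```python
from typing import List


def encode_residual(y: List[float], pred: List[float], step: float) -> List[int]:
    """Quantize the residual y - pred to integer symbols (uniform rounding, assumed)."""
    return [round((a - b) / step) for a, b in zip(y, pred)]


def decode_feature(pred: List[float], symbols: List[int], step: float) -> List[float]:
    """Add the dequantized residual back onto the prediction."""
    return [b + s * step for b, s in zip(pred, symbols)]


step = 0.5                      # hypothetical quantization step
y_cur = [1.0, 2.3, -0.7]        # current latent feature (toy values)
y_pred = [0.8, 2.0, -1.0]       # final prediction from inter-prediction (toy values)

symbols = encode_residual(y_cur, y_pred, step)   # what an entropy coder would receive
y_rec = decode_feature(y_pred, symbols, step)    # decoder-side reconstruction
```

With a uniform quantizer of this kind, each reconstructed value differs from the original by at most half the step size.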
Claim 11 recites a method with steps corresponding to the method steps recited in Claims 1 and 10. Therefore, the recited steps of method claim 11 are mapped to the proposed combination in the same manner as the corresponding steps of Claims 1 and 10. Additionally, the rationale and motivation to combine the Xia and Sugio references presented in the rejection of Claim 1 apply to this claim.
Furthermore, the combination of Xia and Sugio discloses A method comprising: decoding a multi-resolution motion feature from a bitstream; (Xia, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; bitstream is decoded for motion compensation)
reconstructing a point cloud by a separate set of neural network layers with the first and the second motion compensated features as inputs. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; Deep-learning-based video compression methods have recently emerged. DVC [25] is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., proposes feature space motion compensation, and further introduces multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being upsampled, pruned and passed through neural network layers).
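The up-sample-and-prune step in the lossless coordinate compressor quoted from Xia above can be sketched as follows, purely for illustration. The sketch assumes a factor-2 up-sampling in which each coarse voxel expands into its 8 candidate children, and a 0.5 probability threshold for pruning; these details, and the names `upsample_coords`, `prune`, and `bce`, are assumptions introduced here and are not taken from the reference.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def upsample_coords(coarse: List[Coord]) -> List[Coord]:
    """Expand each coarse voxel into its 8 candidate children (factor 2, assumed)."""
    offsets = list(product((0, 1), repeat=3))
    return [(2 * x + dx, 2 * y + dy, 2 * z + dz)
            for (x, y, z) in coarse
            for (dx, dy, dz) in offsets]


def prune(candidates: List[Coord], occupancy: Dict[Coord, float],
          threshold: float = 0.5) -> List[Coord]:
    """Keep only candidates whose decoded occupancy probability exceeds the threshold."""
    return [c for c in candidates if occupancy.get(c, 0.0) > threshold]


def bce(target: List[float], pred: List[float]) -> float:
    """Mean binary cross entropy between true occupancy and predicted probability."""
    terms = [t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
             for t, p in zip(target, pred)]
    return -sum(terms) / len(terms)


coarse = [(0, 0, 0)]                                   # one occupied coarse voxel
candidates = upsample_coords(coarse)                   # 8 finer candidate positions
decoded_occ = {(0, 0, 0): 0.9, (1, 1, 0): 0.7, (0, 1, 1): 0.2}  # toy probabilities
fine = prune(candidates, decoded_occ)                  # occupied finer coordinates
loss = bce([1.0, 1.0, 0.0], [0.9, 0.7, 0.2])
```

Only the occupancy flags need to be transmitted in a scheme of this kind, because the decoder can regenerate the candidate positions from the coarser coordinates it already holds.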
Regarding Claim 12,
The combination of Xia and Sugio further discloses wherein generating the second motion feature comprises: performing a neural network layer process on the second motion feature; and downsampling an output of the neural network layer process. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; Deep-learning-based video compression methods have recently emerged. DVC is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., proposes feature space motion compensation, and further introduces multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being upsampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 13,
The combination of Xia and Sugio further discloses wherein obtaining the reference feature comprises: obtaining the reconstructed reference frame; and downsampling the reconstructed reference frame to generate the reference feature. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; Deep-learning-based video compression methods have recently emerged. DVC [25] is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., proposes feature space motion compensation, and further introduces multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being upsampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 14,
The combination of Xia and Sugio further discloses wherein the first motion compensated feature corresponds to a first level. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; Deep-learning-based video compression methods have recently emerged. DVC [25] is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., proposes feature space motion compensation, and further introduces multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being upsampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 15,
The combination of Xia and Sugio further discloses wherein the second motion compensated feature corresponds to a second level, and wherein the second level is different from the first level. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; Deep-learning-based video compression methods have recently emerged. DVC [25] is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., proposes feature space motion compensation, and further introduces multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being upsampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Claim 17 recites a method with steps corresponding to the method steps recited in Claims 1 and 7. Therefore, the recited steps of method claim 17 are mapped to the proposed combination in the same manner as the corresponding steps of Claims 1 and 7. Additionally, the rationale and motivation to combine the Xia and Sugio references presented in the rejection of Claim 1 apply to this claim.
Regarding Claim 18,
The combination of Xia and Sugio further discloses arranging the entropy encoded quantized output in a motion feature bitstream. (Sugio, [0131-0132], discloses that the inter prediction control unit 114 selects the predicted motion vector in the second prediction direction from the candidate predicted motion vector list for the second prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S112); when the prediction direction is not the second prediction direction (No at S108), the inter prediction control unit 114 calculates a candidate predicted motion vector in the first prediction direction (S113). Next, the inter prediction control unit 114 selects the predicted motion vector in the first prediction direction from the candidate predicted motion vector list for the first prediction direction. Then, the variable length coding unit 104 codes a predicted motion vector index corresponding to the selected predicted motion vector, and adds the coded index to a bitstream (S114); concatenated motion feature vectors are packed into a bitstream).
Regarding Claim 19,
The combination of Xia and Sugio further discloses concatenating the downsampled reference frame and the downsampled current frame to generate a concatenated feature; feature enhancing the concatenated feature; and pruning the enhanced feature to generate the output of the motion estimation. (Xia, pg. 7994, 2.2 Learning-based video compression, Fig. 1-4, pg. 8002, A Lossless Compressor for $C(y_t^{2\downarrow})$, discloses that, as described in Section 3.3, the decoder side needs the coordinates of the 2X down-sampled latent feature of the current frame in Motion Compensation, i.e., $C(y_t^{2\downarrow})$. As shown in Figure 10, we losslessly code $C(y_t^{2\downarrow})$ with an Auto-Encoder-style network similar to the Residual Compression module (Figure 2(c)). Note that the 3X down-sampled coordinates $C(y_t^{3\downarrow})$ are losslessly coded in the Residual Compression module using G-PCC octree v14. So we only need to compress the occupancy of 2X down-sampled coordinates and can obtain $C(y_t^{2\downarrow})$ by up-sampling $C(y_t^{3\downarrow})$ and pruning based on occupancy. Specifically, we input an all-one vector representing the occupancy of points in $C(y_t^{2\downarrow})$, denoted as $F(y_t^{2\downarrow}) = \mathbf{1}$, $\mathbf{1} \in \mathbb{R}^{N \times 1}$. It passes through a parametric analysis transform (a Downsample Block and a sparse convolution layer) to generate a more compact latent representation, which is quantized and coded by the arithmetic codec with a fully factorized density model. Then an Upsample Block serves as the parametric synthesis transform that recovers a lossy reconstruction of $F(y_t^{2\downarrow})$, i.e., $\hat{F}(y_t^{2\downarrow})$. The Binary Cross Entropy (BCE) between $F(y_t^{2\downarrow})$ and $\hat{F}(y_t^{2\downarrow})$ is also calculated for lossless compression; Deep-learning-based video compression methods have recently emerged. DVC is the pioneering method of learning-based end-to-end video compression, which follows the residual-coding framework in H.264/AVC and H.265/HEVC but designs the modules using neural networks. Subsequent explorations refine DVC's inter-prediction module or entropy models, e.g., proposes feature space motion compensation, and further introduces multi-scale motion estimation and quadtree-based block partition modes for entropy coding; motion features are concatenated after being upsampled, pruned and passed through neural network layers). Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
Regarding Claim 20,
The combination of Xia and Sugio further discloses generating a main feature by a set of neural network layers with the current feature and the reference feature as inputs. (Xia, pg. 7997, KNN-attention Block-matching, Fig. 3, Frameworks of (a) Motion Estimation, (b) Motion Compression, (c) Motion Compensation, discloses that the previous work D-DPCC about DPC compression designs a simple 2-layer SparseCNN for motion estimation, which fails to calculate the explicit correspondence of blocks between two frames. Therefore, we design a KNN-attention block-matching (KABM) network for more accurate motion estimation. As Figure 4 depicts, the KABM consists of one ball-KNN search and two inter-frame attention networks. At first, the network concatenates the reference frame and the current frame, so that blocks in both frames can aggregate information simultaneously. The concatenation operation of sparse tensors is defined as follows:
$$y_{cat,u} = \begin{cases} y_{1,u} \oplus y_{2,u}, & u \in C(y_1) \cap C(y_2) \\ y_{1,u} \oplus 0, & u \in C(y_1),\ u \notin C(y_2) \\ 0 \oplus y_{2,u}, & u \notin C(y_1),\ u \in C(y_2) \end{cases} \quad (3)$$
where $y_1$ and $y_2$ are latent features of the reference and current frames, $C(y_1)$ and $C(y_2)$ are corresponding integer coordinates, and $\oplus$ means the concatenation of feature channels. Then, the network uses ball-KNN to find the neighbors of $y_{cat}$ limited in a spherical region (the maximum number of neighbors is K) and generates the neighbors’ attributes, including relative coordinates, distances, and features. After that, the network uses two inter-frame attention networks to aggregate features between two frames and output an original flow embedding $e_{o,t}$ that encapsulates the matching information. Each inter-frame attention network generates an attention weight matrix for the neighbors based on their attribute vectors and uses softmax to normalize the weight matrix; current frame and reference frame motion features are aggregated to generate concatenated motion features using neural network model 1).
Additionally, the rationale and motivation to combine the references Xia and Sugio as applied in rejection of claim 1 apply to this claim.
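For illustration only, the ball-KNN search and softmax-normalized aggregation described in the Xia passage quoted above can be sketched as follows. This is a simplified sketch under stated assumptions: the attention scores are supplied directly rather than produced by a learned network, and the names `ball_knn`, `softmax`, and `aggregate` are hypothetical and do not come from the reference.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def ball_knn(query: Point, points: Sequence[Point], radius: float, k: int) -> List[int]:
    """Indices of at most k nearest points lying within `radius` of `query`."""
    dists = [(math.dist(query, p), i) for i, p in enumerate(points)]
    inside = sorted(d for d in dists if d[0] <= radius)   # nearest first
    return [i for _, i in inside[:k]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into weights that sum to 1 (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features: Sequence[Sequence[float]], scores: Sequence[float]) -> List[float]:
    """Weighted sum of neighbor features using softmax-normalized scores."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
neighbors = ball_knn((0.0, 0.0, 0.0), points, radius=2.5, k=2)

# Toy neighbor features and equal raw scores (a learned network would supply these).
feats = [[1.0, 0.0], [0.0, 1.0]]
flow = aggregate(feats, [0.0, 0.0])
```

The radius bounds the spatial extent of the search while `k` caps the number of neighbors, matching the quoted description of a spherical region with a maximum of K neighbors.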
Allowable Subject Matter
Claims 9 and 16 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Possos et al. (US-10284852-B2, content adaptive prediction distance analysis and hierarchical motion estimation for video coding may address the general problem of designing a new, advanced video codec that maximizes the achievable compression efficiency while remaining sufficiently practical for implementation on various platforms including limited devices. More specifically, certain motion estimation techniques may be adaptive to properties of the content and results in improved motion compensation, lower computational complexity, and lower cost of motion vector coding as compared to existing solutions, Abstract)
Aminlau et al. (US-20140254681-A1, methods, apparatuses and computer program products for video coding. In some embodiments motion parameters are obtained for a block of first layer samples and a first layer reference picture for the block of first layer samples is identified. A second layer reference picture corresponding to the first layer reference picture is identified, intermediate reference picture samples are derived by using sample values of the first layer reference picture and information based on sample values of the second layer reference picture, and inter-layer reference picture samples are derived by using intermediate reference picture samples and first layer samples. In some embodiments motion compensated sample values are derived from the second layer reference picture on the basis of the motion parameters; and an inter-layer reference block is derived by using residual sample values of first layer samples and motion compensated sample values from the second layer reference picture, Abstract)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PINALBEN V PATEL whose telephone number is (571)270-5872. The examiner can normally be reached M-F: 10am - 8pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chineyere Wills-Burns can be reached at 571-272-9752. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Pinalben Patel/Examiner, Art Unit 2673