DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement filed August 19, 2024 fails to comply with 37 CFR 1.98(a)(2), which requires a legible copy of each cited foreign patent document; each non-patent literature publication or that portion which caused it to be listed; and all other information or that portion which caused it to be listed. It has been placed in the application file, but the information referred to therein has not been considered. Copies of the foreign references 110728219, 112017198, 112132844, 112184577, and 112287940 were missing.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: 100 in Fig. 1. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The disclosure is objected to because of the following informalities:
In paragraph 0007 line 1, “additionally, Sampling” should read “additionally, sampling”.
In paragraph 0030 line 3, “(e.g., 104a, 104b, 104c, 104d as shown” should read “(e.g., 104a, 104b, 104c, 104d) as shown”.
In paragraph 0034 line 7, “View 160 of Fig. 1C” should read “View 140 of Fig. 1C”.
In paragraph 0055 line 11, “separately separately” should read “separately”.
The use of the terms “Bluetooth” and “Bluetooth LE” in paragraph 0090 line 4, which are trade names or marks used in commerce, has been noted in this application. The terms should be accompanied by the generic terminology; furthermore, each term should be capitalized wherever it appears or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the term.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 9, 12-15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over You (Long Short-term Convolutional Transformer for No-Reference Video Quality Assessment) in view of Liu et al. (Swin Transformer: Hierarchical Vision Transformer using Shifted Windows) and He et al. (End-to-End Video Object Detection with Spatial-Temporal Transformers), hereinafter You, Liu, and He, respectively.
Regarding claim 1, You teaches a method for processing videos (Paragraph 2 in 1st Col. of Page 6 – “Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet. As videos have varying lengths, zero padding to the end of short sequences is performed to serve multiple video sequences in a batch”; Note: there is a method for video processing. It is implied that there are one or more processors for performing the method because it could not be performed otherwise), the method comprising:
grouping and rescaling, by one or more processors, neighboring input video frames of a single video into a pyramid of multi-resolution frames including both lower resolution frames and higher resolution frames (Fig. 1 and 3, Paragraph 3 in 1st Col. of Page 4, Paragraph 1-2 in 2nd Col. of Page 4, Paragraph 1-2 in 1st Col. of Page 6 – “PHIQnet, as depicted in Fig. 1, is inspired by pyramidal hierarchy of deep convolutional networks used in feature pyramids for object detection [23] and also earlier studies that multi-scale quality evaluation can be more accurate than single scale in IQA [28][14]. Pyramidic feature maps can represent image semantics for target tasks at different spatial scales. Furthermore, it has been found that fusion between high-scale and low-scale feature maps can achieve higher performance in other computer vision tasks… intermediate maps are obtained by first upsampling each of the result maps in Eq. (1) to match its next low-scale (i.e., high resolution) and then fuse… pooled features of the five perceived quality maps {QF2, QF3, QF4, QF5, QF6} derived on all video frames, each has shape 1×256, are concatenated. They are severed as the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA…Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet… time distributed 1D CNN on temporal clips in a padded sequence is performed, and each clip contains 16 frames”; Note: the frames are grouped and rescaled into a feature map pyramid hierarchy, shown in Fig. 1 below. The modified screenshot of Fig. 3 below shows the frame groups);
encoding, by the one or more processors, the set of frames as a set of multi-resolution input tokens (Paragraph 2 in 2nd Col. of Page 5, Paragraph 1 in 1st Col. of Page 6, Paragraph 1-2 in 2nd Col. of Page 6 – “Five perceived quality maps (QF) at different scales are derived… the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA…The 1D CNN model produces output vector with shape of 1×M in each clip, which is then first masked based on whether all the elements in the vector equal to 0. Subsequently, a learnable projection layer using a dense layer with D filters to project the clip outputs (Cj) to features (Fj) with D dimension, as Transformer uses constant latent vector size D. This projection layer actually serves similarly to the embedding layer in the original Transformer model… The embedding feature vectors from both the added embedding F0 and the video clips are served as input (Z) to the Transformer encoder”; Note: the clips (set of frames) are projected/encoded as features, which is equivalent to the multi-resolution input tokens since the features are embedded and are used as input for the transformer encoder. The features derive from maps that are multi-scale/multi-resolution);
generating, by a transformer encoder implemented by the one or more processors, a representation of the frame groups for a plurality of time steps (Paragraph 2 in 1st Col. of Page 6, Paragraph 2-3 in 2nd Col. of Page 6 – “time distributed 1D CNN on temporal clips in a padded sequence is performed, and each clip contains 16 frames…The embedding feature vectors from both the added embedding F0 and the video clips are served as input (Z) to the Transformer encoder. Assuming the number of clips in a video is N…The Transformer encoder, same as the original Transformer model in [27], consists of alternative number of encoder layers. As shown in Fig. 3, each encoder layer contains two sublayers: the first is for multi-head self-attention (MHA) with mask, and the second contains position-wise fully connected feed-forward (FF) layer. Layer normalization (LN) is performed on the residual connection around the two sublayers. The Transformer encoder, as briefly explained in Eq. (12), produces an output of shape (1+N)×D”; Note: the transformer encoder generates an output (representation) for all clips (frame groups), which implies that it occurs for a plurality of time steps since each clip corresponds to a different time step);
aggregating, by a transformer encoder implemented by the one or more processors, across the plurality of time steps (Paragraph 2 in 1st Col. of Page 6, Paragraph 1 in 1st Col. of Page 7 – “time distributed 1D CNN on temporal clips in a padded sequence is performed…the first vector of output (ZL[0]), that is supposed to contain aggregated information from the Transformer encoder for video quality perception”; Note: the transformer encoder aggregates the information for all of the clips (frame groups), which implies that it aggregates across the time steps since the clips correspond to different time steps);
and generating, by the one or more processors based on the aggregating, a quality score associated with a parameter of the video (Paragraph 1 in 1st Col. of Page 7 – “the first vector of output (ZL[0]), that is supposed to contain aggregated information from the Transformer encoder for video quality perception, is fed into a multi-layer perceptron (MLP) head. The MLP head consisting of two dense layers and a dropout layer in between predicts the perceived video quality. The first dense layer uses dff filters with GELU activation, as suggested in BERT [37] and Vision Transformer [36], and the second dense layer has only one filter as a single value for video quality should be produced. Consequently, MSE is chosen as the loss function to measure the distance between the predicted video quality and ground-truth MOS”; Note: video quality prediction is generated based on aggregated information from transformer encoder. The video quality is associated with mean opinion score (MOS), which is a parameter of the video).
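The mapped steps can be illustrated with a minimal Python sketch. Only the 16-frame clip length and the concatenation of five 1×256 pooled quality-feature maps into a 1×1280 vector come from You; the frame count, the placeholder feature values, and the `clip_and_pool` helper are hypothetical.

```python
import numpy as np

def clip_and_pool(num_frames, clip_len=16, scales=5, feat_dim=256):
    """Sketch of You's front end: frames are grouped into temporal clips
    (zero-padded to a whole number of clips), and per frame the pooled
    features of five multi-scale quality maps (each 1x256) are
    concatenated into a 1x1280 quality-feature vector."""
    num_clips = -(-num_frames // clip_len)      # ceiling division (zero padding)
    per_frame = scales * feat_dim               # 5 * 256 = 1280
    feats = np.zeros((num_clips * clip_len, per_frame))
    feats[:num_frames] = 1.0                    # stand-in for real pooled features
    return feats.reshape(num_clips, clip_len, per_frame)

clips = clip_and_pool(40)
print(clips.shape)   # (3, 16, 1280): 40 frames padded into 3 clips of 16 frames
```

The padded frames remain all-zero, which corresponds to the masking of zero vectors described in the quoted passage.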
Screenshot of Fig. 1 (taken from You)
Modified screenshot of Fig. 3 (taken from You)
You does not teach sampling, by the one or more processors, the pyramid of multi-resolution frames to obtain a set of patches; nor the “patches” in the limitation: “encoding, by the one or more processors, the set of patches as a set of multi-resolution input tokens”. However, Liu teaches sampling, by the one or more processors, the pyramid of multi-resolution frames to obtain a set of patches (Paragraph 4 in 2nd Col. of Page 4 – “For efficient modeling, we propose to compute self-attention within local windows. The windows are arranged to evenly partition the image in a non-overlapping manner. Supposing each window contains M × M patches…”; Note: patches in the frames are sampled as windows. The pyramid of multi-resolution frames was previously taught by You). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Liu to sample frames to obtain patches for the benefit of decreasing computational cost, since not every patch would have to be processed/analyzed. When the frames are sampled, attention can be placed on the important or relevant parts of the frames. Additionally, a person of ordinary skill in the art before the effective filing date of the claimed invention would have recognized that the patches of Liu could have been substituted for the frames of You for encoding as a set of multi-resolution input tokens, because both the frames and the patches serve the purpose of representing a part of a video and containing features to be processed. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Finally, the substitution achieves the predictable result of being encoded for input into a transformer.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the patches of Liu for the frames of You, to be encoded as a set of multi-resolution input tokens according to known methods, yielding the predictable result of being encoded for input into a transformer.
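Liu's non-overlapping window partition can be sketched as follows. The M × M patch windows evenly partitioning the grid come from the quoted passage; the grid size, the 4-dimensional patch features, and the `window_partition` helper are illustrative assumptions.

```python
import numpy as np

def window_partition(patch_grid, M):
    """Split an H x W grid of patch features into non-overlapping windows
    of M x M patches; per Liu's quoted scheme, self-attention would then
    be computed within each window independently."""
    H, W, C = patch_grid.shape
    assert H % M == 0 and W % M == 0, "windows must tile the grid evenly"
    x = patch_grid.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

grid = np.arange(8 * 8 * 4, dtype=float).reshape(8, 8, 4)  # 8x8 patches, 4-dim features
windows = window_partition(grid, M=4)
print(windows.shape)   # (4, 16, 4): four windows of 4x4 = 16 patches each
```

Because the windows tile the grid exactly, every patch appears in exactly one window and no features are lost by the partition.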
You modified by Liu also does not teach the “spatial transformer encoder” in the limitation: “generating, by a spatial transformer encoder implemented by the one or more processors, a representation per frame group for a plurality of time steps”, nor the “temporal transformer encoder” in the limitation: “aggregating, by a temporal transformer encoder implemented by the one or more processors, across the plurality of time steps”. However, He teaches generating, by a spatial transformer encoder implemented by the one or more processors, a representation per frame group for a plurality of time steps (Paragraph 1 in 1st Col. of Page 5, Paragraph 1 in 2nd Col. of Page 5 – “The modified detector includes Spatial Transformer Encoder and Spatial Transformer Decoder which encodes each frame 𝐹𝑡 (including Reference Frame and Current Frame) into two compact representations: Spatial Object Query 𝑄𝑡 and Memory Encoding E𝑡”; Note: the Spatial Transformer Encoder generates a representation of a frame group. It is implied that this occurs for a plurality of time steps because each frame corresponds to a different time step), and aggregating, by a temporal transformer encoder implemented by the one or more processors, across the plurality of time steps (Paragraph 2 in 2nd Col. of Page 5 – “The goal of the Temporal Deformable Transformer Encoder is to encode the spatial temporal feature representations and provide the location cues”). Since You already teaches a transformer encoder (see Fig. 3 above), it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of He to have a spatial transformer encoder and a temporal transformer encoder for the benefit of processing spatial and temporal features separately, which would help refine the video analysis and decrease the complexity of the architecture and computations.
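The factorization relied on in the combination, a spatial encoder producing one representation per frame group followed by a temporal encoder aggregating across time steps, can be sketched with a toy single-head attention. None of this code comes from the cited references; the identity Q/K/V projections, the mean pooling, and all shapes are illustrative assumptions.

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(tokens):
    """Toy single-head self-attention with identity Q/K/V projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

T, P, D = 4, 9, 8                       # time steps, patches per frame group, feature dim
video_tokens = np.random.rand(T, P, D)
# Spatial stage: one representation per frame group (time step).
per_step = np.stack([self_attn(group).mean(axis=0) for group in video_tokens])   # (T, D)
# Temporal stage: aggregate the per-step representations across time steps.
aggregated = self_attn(per_step).mean(axis=0)                                    # (D,)
print(aggregated.shape)   # (8,)
```

The separation keeps the spatial attention cost at T independent P×P problems plus one T×T temporal problem, rather than a single (T·P)×(T·P) problem, consistent with the stated rationale of decreasing complexity.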
Regarding claim 2, You in view of Liu and He teaches the method of claim 1. You further teaches prepending a set of classification tokens to the set of multi-resolution input tokens (Paragraph 2 in 2nd Col. of Page 6 – “Similar to the special classification token [CLS] in BERT [37], a learnable embedding F0 is added to the beginning of the projected features. This embedding is expected to represent the aggregated information for quality assessment on the whole set of video clips without being biased to any particular clips”; Note: the learnable embedding is equivalent to the classification token).
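The prepending mapped above, which yields the (1+N)×D encoder input described in You, can be sketched as follows; N, D, and the zero initialization are hypothetical (in You the F0 embedding is learnable).

```python
import numpy as np

N, D = 6, 32                               # hypothetical clip count and latent size
clip_features = np.random.rand(N, D)       # stand-in for projected clip features Fj
F0 = np.zeros((1, D))                      # learnable embedding in You; zeros here
Z = np.concatenate([F0, clip_features])    # encoder input of shape (1 + N) x D
print(Z.shape)   # (7, 32)
```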
Regarding claim 3, You in view of Liu and He teaches the method of claim 1. You further teaches wherein generating the quality score includes applying aggregated output from the temporal transformer encoder to a multi-layer perceptron model (Paragraph 1 in 1st Col. of Page 7 – “the first vector of output (ZL[0]), that is supposed to contain aggregated information from the Transformer encoder for video quality perception, is fed into a multi-layer perceptron (MLP) head. The MLP head consisting of two dense layers and a dropout layer in between predicts the perceived video quality”; Note: the aggregated output from the transformer encoder is applied to the MLP. The temporal transformer encoder was previously taught by He in the rejection of claim 1).
Regarding claim 4, You in view of Liu and He teaches the method of claim 1. You further teaches wherein the quality score is a mean opinion score (Paragraph 4 in 2nd Col. of Page 5, Paragraph 1 in 1st Col. of Page 7 – “In principle, subjective IQA is to assign a quality rating scale to a target image by many participants and their mean option score (MOS) is used as image quality indicator… MSE is chosen as the loss function to measure the distance between the predicted video quality and ground-truth MOS”; Note: the video quality is represented by the mean opinion score (MOS)).
Regarding claim 5, You in view of Liu and He teaches the method of claim 1. You further teaches wherein the encoding includes capturing both global video composition from the lower resolution frame and local details from the higher resolution frames (Paragraph 3 in 1st Col. of Page 4, Paragraph in 2nd Col. of Page 4, Paragraph 2 in 2nd Col. of Page 5, Paragraph 2 in 1st Col. of Page 6, Paragraph 1 in 2nd Col. of Page 6 – “Pyramidic feature maps can represent image semantics for target tasks at different spatial scales…Five perceived quality maps (QF) at different scales are derived… Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet… a learnable projection layer using a dense layer with D filters to project the clip outputs (Cj) to features (Fj) with D dimension, as Transformer uses constant latent vector size D”; Note: frames are encoded as features of different resolutions/scales. It is obvious that low-scale feature maps capture global concepts because the frame would not be clear enough to view details, while high-scale feature maps capture local concepts because the frame would be clear enough to view the details).
Regarding claim 9, You in view of Liu and He teaches the method of claim 1. You does not teach wherein a patch size P is the same for all of the multi-resolution frames in the pyramid. However, Liu teaches wherein a patch size P is the same for all of the multi-resolution frames in the pyramid (Fig. 2, Paragraph 2 in 2nd Col. of Page 3 – “It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT. Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. In our implementation, we use a patch size of 4×4 and thus the feature dimension of each patch is 4 × 4 × 3 = 48”; Note: each patch is the same size. Fig. 2 shows the size of the patches. The multi-resolution frames in the pyramid were previously taught by You in the rejection of claim 1). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Liu to have the patches be the same size for the benefit of uniform input for encoding and ease of managing the data in memory.
Screenshot of Fig. 2 (taken from Liu)
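Liu's patch-token arithmetic can be checked directly: a fixed patch size P at every pyramid level gives each token a feature dimension of P × P × 3 from the raw RGB values. The 4×4 patch size and the 48-dimensional result come from the quoted passage; the example frame resolutions are hypothetical.

```python
P = 4                              # fixed patch size at every pyramid level (Liu)
channels = 3                       # raw RGB values per pixel
feature_dim = P * P * channels     # concatenated raw-pixel feature per patch
assert feature_dim == 48           # matches the quoted 4 x 4 x 3 = 48

# Token counts per level of a hypothetical frame pyramid:
for side in (224, 112, 56):        # illustrative level resolutions
    tokens = (side // P) ** 2
    print(side, tokens)            # 224 -> 3136, 112 -> 784, 56 -> 196 tokens
```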
Regarding claim 12, You teaches a video processing system (Paragraph 2 in 1st Col. of Page 6 – “Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet. As videos have varying lengths, zero padding to the end of short sequences is performed to serve multiple video sequences in a batch”; Note: there is an architecture/system for video processing), comprising:
memory configured to store imagery (Paragraph 2 in 2nd Col. of Page 9 – “Transformer was directly performed on the features of all frames derived by PHIQNet. We chose a subset of the hyper-parameters of Transformer in Table 3 that can fit in the GPU memory”; Note: it is implied that there is a memory to store imagery because the architecture would not be able to reasonably process the video otherwise);
and one or more processors operatively coupled to the memory (Paragraph 2 in 1st Col. of Page 6, Paragraph 2 in 2nd Col. of Page 9 – “Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet… Transformer was directly performed on the features of all frames derived by PHIQNet. We chose a subset of the hyper-parameters of Transformer in Table 3 that can fit in the GPU memory”; Note: it is implied that there is a processor coupled to memory because the transformer would not be able to work otherwise), the one or more processors being configured to:
group and rescale neighboring input video frames of a single video into a pyramid of multi-resolution frames including both lower resolution frames and higher resolution frames (Fig. 1 and 3, Paragraph 3 in 1st Col. of Page 4, Paragraph 1-2 in 2nd Col. of Page 4, Paragraph 1-2 in 1st Col. of Page 6 – “PHIQnet, as depicted in Fig. 1, is inspired by pyramidal hierarchy of deep convolutional networks used in feature pyramids for object detection [23] and also earlier studies that multi-scale quality evaluation can be more accurate than single scale in IQA [28][14]. Pyramidic feature maps can represent image semantics for target tasks at different spatial scales. Furthermore, it has been found that fusion between high-scale and low-scale feature maps can achieve higher performance in other computer vision tasks… intermediate maps are obtained by first upsampling each of the result maps in Eq. (1) to match its next low-scale (i.e., high resolution) and then fuse… pooled features of the five perceived quality maps {QF2, QF3, QF4, QF5, QF6} derived on all video frames, each has shape 1×256, are concatenated. They are severed as the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA…Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet… time distributed 1D CNN on temporal clips in a padded sequence is performed, and each clip contains 16 frames”; Note: the frames are grouped and rescaled into a feature map pyramid hierarchy, shown in Fig. 1 above. The modified screenshot of Fig. 3 above shows the frame groups);
encode the set of frames as a set of multi-resolution input tokens (Paragraph 2 in 2nd Col. of Page 5, Paragraph 1 in 1st Col. of Page 6, Paragraph 1-2 in 2nd Col. of Page 6 – “Five perceived quality maps (QF) at different scales are derived… the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA…The 1D CNN model produces output vector with shape of 1×M in each clip, which is then first masked based on whether all the elements in the vector equal to 0. Subsequently, a learnable projection layer using a dense layer with D filters to project the clip outputs (Cj) to features (Fj) with D dimension, as Transformer uses constant latent vector size D. This projection layer actually serves similarly to the embedding layer in the original Transformer model… The embedding feature vectors from both the added embedding F0 and the video clips are served as input (Z) to the Transformer encoder”; Note: the clips (set of frames) are projected/encoded as features, which is equivalent to the multi-resolution input tokens since the features are embedded and are used as input for the transformer encoder. The features derive from maps that are multi-scale/multi-resolution);
generate, by a transformer encoder implemented by the one or more processors, a representation of the frame groups for a plurality of time steps (Paragraph 2 in 1st Col. of Page 6, Paragraph 2-3 in 2nd Col. of Page 6 – “time distributed 1D CNN on temporal clips in a padded sequence is performed, and each clip contains 16 frames…The embedding feature vectors from both the added embedding F0 and the video clips are served as input (Z) to the Transformer encoder. Assuming the number of clips in a video is N…The Transformer encoder, same as the original Transformer model in [27], consists of alternative number of encoder layers. As shown in Fig. 3, each encoder layer contains two sublayers: the first is for multi-head self-attention (MHA) with mask, and the second contains position-wise fully connected feed-forward (FF) layer. Layer normalization (LN) is performed on the residual connection around the two sublayers. The Transformer encoder, as briefly explained in Eq. (12), produces an output of shape (1+N)×D”; Note: the transformer encoder generates an output (representation) for all clips (frame groups), which implies that it occurs for a plurality of time steps since each clip corresponds to a different time step);
aggregate, by a transformer encoder implemented by the one or more processors, across the plurality of time steps (Paragraph 2 in 1st Col. of Page 6, Paragraph 1 in 1st Col. of Page 7 – “time distributed 1D CNN on temporal clips in a padded sequence is performed…the first vector of output (ZL[0]), that is supposed to contain aggregated information from the Transformer encoder for video quality perception”; Note: the transformer encoder aggregates the information for all of the clips (frame groups), which implies that it aggregates across the time steps since the clips correspond to different time steps);
and generate, based on the aggregating, a quality score associated with a parameter of the video (Paragraph 1 in 1st Col. of Page 7 – “the first vector of output (ZL[0]), that is supposed to contain aggregated information from the Transformer encoder for video quality perception, is fed into a multi-layer perceptron (MLP) head. The MLP head consisting of two dense layers and a dropout layer in between predicts the perceived video quality. The first dense layer uses dff filters with GELU activation, as suggested in BERT [37] and Vision Transformer [36], and the second dense layer has only one filter as a single value for video quality should be produced. Consequently, MSE is chosen as the loss function to measure the distance between the predicted video quality and ground-truth MOS”; Note: video quality prediction is generated based on aggregated information from transformer encoder. The video quality is associated with mean opinion score (MOS), which is a parameter of the video).
You does not teach sampling the pyramid of multi-resolution frames to obtain a set of patches; nor the “patches” in the limitation: “encode the set of patches as a set of multi-resolution input tokens”. However, Liu teaches sampling the pyramid of multi-resolution frames to obtain a set of patches (Paragraph 4 in 2nd Col. of Page 4 – “For efficient modeling, we propose to compute self-attention within local windows. The windows are arranged to evenly partition the image in a non-overlapping manner. Supposing each window contains M × M patches…”; Note: patches in the frames are sampled as windows. The pyramid of multi-resolution frames was previously taught by You). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Liu to sample frames to obtain patches for the benefit of decreasing computational cost, since not every patch would have to be processed/analyzed. When the frames are sampled, attention can be placed on the important or relevant parts of the frames. Additionally, a person of ordinary skill in the art before the effective filing date of the claimed invention would have recognized that the patches of Liu could have been substituted for the frames of You for encoding as a set of multi-resolution input tokens, because both the frames and the patches serve the purpose of representing a part of a video and containing features to be processed. Furthermore, a person of ordinary skill in the art would have been able to carry out the substitution. Finally, the substitution achieves the predictable result of being encoded for input into a transformer.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to substitute the patches of Liu for the frames of You, to be encoded as a set of multi-resolution input tokens according to known methods, yielding the predictable result of being encoded for input into a transformer.
You modified by Liu also does not teach the “spatial transformer encoder” in the limitation: “generate, by a spatial transformer encoder implemented by the one or more processors, a representation per frame group for a plurality of time steps”, nor the “temporal transformer encoder” in the limitation: “aggregate, by a temporal transformer encoder implemented by the one or more processors, across the plurality of time steps”. However, He teaches generating, by a spatial transformer encoder implemented by the one or more processors, a representation per frame group for a plurality of time steps (Paragraph 1 in 1st Col. of Page 5, Paragraph 1 in 2nd Col. of Page 5 – “The modified detector includes Spatial Transformer Encoder and Spatial Transformer Decoder which encodes each frame 𝐹𝑡 (including Reference Frame and Current Frame) into two compact representations: Spatial Object Query 𝑄𝑡 and Memory Encoding E𝑡”; Note: the Spatial Transformer Encoder generates a representation of a frame group. It is implied that this occurs for a plurality of time steps because each frame corresponds to a different time step), and aggregating, by a temporal transformer encoder implemented by the one or more processors, across the plurality of time steps (Paragraph 2 in 2nd Col. of Page 5 – “The goal of the Temporal Deformable Transformer Encoder is to encode the spatial temporal feature representations and provide the location cues”). Since You already teaches a transformer encoder (see Fig. 3 above), it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of He to have a spatial transformer encoder and a temporal transformer encoder for the benefit of processing spatial and temporal features separately, which would help refine the video analysis and decrease the complexity of the architecture and computations.
Regarding claim 13, You in view of Liu and He teaches the video processing system of claim 12. You further teaches wherein the one or more processors are further configured to prepend a set of classification tokens to the set of multi-resolution input tokens (Paragraph 2 in 2nd Col. of Page 6 – “Similar to the special classification token [CLS] in BERT [37], a learnable embedding F0 is added to the beginning of the projected features. This embedding is expected to represent the aggregated information for quality assessment on the whole set of video clips without being biased to any particular clips”; Note: the learnable embedding is equivalent to the classification token).
Regarding claim 14, You in view of Liu and He teaches the video processing system of claim 12. You further teaches wherein generation of the quality score includes applying aggregated output from the temporal transformer encoder to a multi-layer perceptron model (Paragraph 1 in 1st Col. of Page 7 – “the first vector of output (ZL[0]), that is supposed to contain aggregated information from the Transformer encoder for video quality perception, is fed into a multi-layer perceptron (MLP) head. The MLP head consisting of two dense layers and a dropout layer in between predicts the perceived video quality”; Note: the aggregated output from the transformer encoder is applied to the MLP. The temporal transformer encoder was previously taught by He in the rejection of claim 12).
Regarding claim 15, You in view of Liu and He teaches the video processing system of claim 12. You further teaches wherein encoding the set of patches includes capturing both global video composition from the lower resolution frame and local details from the higher resolution frames (Paragraph 3 in 1st Col. of Page 4, Paragraph in 2nd Col. of Page 4, Paragraph 2 in 2nd Col. of Page 5, Paragraph 2 in 1st Col. of Page 6, Paragraph 1 in 2nd Col. of Page 6 – “Pyramidic feature maps can represent image semantics for target tasks at different spatial scales…Five perceived quality maps (QF) at different scales are derived… Fig. 3 depicts the architecture of LSCT based on quality features derived from PHIQNet… a learnable projection layer using a dense layer with D filters to project the clip outputs (Cj) to features (Fj) with D dimension, as Transformer uses constant latent vector size D”; Note: frames are encoded as features of different resolutions/scales. It is obvious that low-scale feature maps capture global concepts because the frame would not be clear enough to view details, while high-scale feature maps capture local concepts because the frame would be clear enough to view the details).
Regarding claim 20, You in view of Liu and He teaches the video processing system of claim 12. You does not teach wherein a patch size P is the same for all of the multi-resolution frames in the pyramid. However, Liu teaches wherein a patch size P is the same for all of the multi-resolution frames in the pyramid (Fig. 2, Paragraph 2 in 2nd Col. of Page 3 – “It first splits an input RGB image into non-overlapping patches by a patch splitting module, like ViT. Each patch is treated as a “token” and its feature is set as a concatenation of the raw pixel RGB values. In our implementation, we use a patch size of 4×4 and thus the feature dimension of each patch is 4 × 4 × 3 = 48”; Note: each patch is the same size. Fig. 2 shows the size of the patches. The multi-resolution frames in the pyramid were previously taught by You in the rejection of claim 12). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Liu to have the patches be the same size for the benefit of uniform input for encoding and ease of managing the data in memory.
Claims 11 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over You in view of Liu, He, and Karyodisa et al. (US 20190213420 A1), hereinafter Karyodisa.
Regarding claim 11, You in view of Liu and He teaches the method of claim 1. You does not teach wherein sampling the pyramid of multi-resolution frames to obtain a set of patches includes forming a tube of multi-resolution patches, the tube having the same center in through the pyramid of multi-resolution frames. However, Karyodisa teaches forming a tube of multi-resolution patches, the tube having the same center in through the pyramid of multi-resolution frames (Fig. 8B, Paragraph 0079, 0114 – “the features can be from local patches around selected locations on detected faces (or other objects)… FIG. 8B is a diagram illustrating an example of conversion of the bounding box and landmark information for the face 812 from the low resolution video frame 851 to the high resolution video frame 855”; Note: the figure shows a tube connecting bounding boxes (comprised of patches) of multi-resolution video frames. It is obvious that the tube has the same center since it extends from one bounding box to another and thus retains the box shape throughout. See the tube in Fig. 8B below. The pyramid of multi-resolution frames was previously taught by You in the rejection of claim 1). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Karyodisa to form a tube of multi-resolution patches with the same center because it would be beneficial for detecting objects that the viewer would likely see across the video frames and putting focus on those objects for analysis. Connecting the patches for object detection could be performed using or with the help of the attention modules in You and Liu.
Screenshot of Fig. 4A and 4B (taken from Karyodisa)
Screenshot of Fig. 8B (taken from Karyodisa)
Regarding claim 19, You in view of Liu and He teaches the video processing system of claim 12. You does not teach wherein the one or more processors are configured to sample the pyramid of multi-resolution frames to obtain a set of patches by formation of a tube of multi-resolution patches, the tube having the same center in through the pyramid of multi-resolution frames. However, Karyodisa teaches forming a tube of multi-resolution patches, the tube having the same center in through the pyramid of multi-resolution frames (Fig. 8B, Paragraph 0079, 0114 – “the features can be from local patches around selected locations on detected faces (or other objects)… FIG. 8B is a diagram illustrating an example of conversion of the bounding box and landmark information for the face 812 from the low resolution video frame 851 to the high resolution video frame 855”; Note: the figure shows a tube connecting bounding boxes (comprised of patches) of multi-resolution video frames. It is obvious that the tube has the same center since it extends from one bounding box to another and thus retains the box shape throughout. See the tube in Fig. 8B above. The pyramid of multi-resolution frames was previously taught by You in the rejection of claim 12). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Karyodisa to form a tube of multi-resolution patches with the same center because it would be beneficial for detecting objects that the viewer would likely see across the video frames and putting focus on those objects for analysis. Connecting the patches for object detection could be performed using or with the help of the attention modules in You and Liu.
Claim 21 is rejected under 35 U.S.C. 103 as being unpatentable over You in view of Liu, He, and Regunathan et al. (US 20220248074 A1), hereinafter Regunathan.
Regarding claim 21, You in view of Liu and He teaches the video processing system of claim 12. You does not teach wherein the one or more processors are further configured to assign quality scores to different videos in order to prioritize the different videos for serving. However, Regunathan teaches assigning quality scores to different videos in order to prioritize the different videos for serving (Paragraph 0050 – “some video products and/or platforms may impose specific requirements for identifying and/or classifying high- or low-quality uploads. For example, some video products and/or platforms may require high confidence scores for identifying and/or classifying high quality uploads (for ranking and/or promotion)”; Note: video uploads are given quality scores for ranking, which is equivalent to prioritizing). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified You to incorporate the teachings of Regunathan to assign quality scores to different videos and prioritize the videos for the benefit of preventing low-quality videos from being shown to users, which would help increase their viewing experience.
Allowable Subject Matter
Claims 6-8, 10, and 16-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim 6 would be allowable for disclosing wherein the grouping and rescaling includes: dividing the neighboring input video frames by a group of N; and proportionally resizing the group of N to N different resolutions preserving a same aspect ratio; wherein an i-th frame is resized to shorter-side length i x l, where l is a smallest length.
Regarding claim 6, You in view of Liu and He teaches the method of claim 1. You further teaches dividing the neighboring input video frames by a group of N (Paragraph 2-3 in 2nd Col. of Page 4, Paragraph 2 in 2nd Col. of Page 5, Paragraph 1-2 in 1st Col. of Page 6 – “the outputs from the later four stages in ResNet 50 are chosen as pyramidic feature maps, named C2, C3, C4, and C5… another convolution layer using K kernels with kernel size (3×3) is performed on the fused feature maps {M2, M3, M4, M5} to generate perceptual maps {P2, P3, P4, P5}. In addition, the feature map C5 in the last residual block in ResNet-50 is expected to carry the most abstract information for the target task. Therefore, another perceptual map P6 is obtained by applying a convolution layer with K kernels, kernel size (1×1) and stride=2 on C5…Five perceived quality maps (QF) at different scales are derived as follows, with the same shape as the perceptual maps… pooled features of the five perceived quality maps {QF2, QF3, QF4, QF5, QF6} derived on all video frames, each has shape 1×256, are concatenated. They are severed as the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA…time distributed 1D CNN on temporal clips in a padded sequence is performed, and each clip contains 16 frames”; Note: the video frames are divided into temporal clips based on the quality feature maps. In this case, there are 5 clips so N = 5); and proportionally resizing the group of N to N different resolutions preserving a same aspect ratio (Paragraph 2 in 2nd Col. of Page 5, Paragraph 1 in 1st Col. of Page 6 – “Five perceived quality maps (QF) at different scales are derived as follows, with the same shape as the perceptual maps {P2, P3, P4, P5, P6}… pooled features of the five perceived quality maps {QF2, QF3, QF4, QF5, QF6} derived on all video frames, each has shape 1×256, are concatenated. 
They are severed as the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA”; Note: the 5 clips each have different scales/resolutions. They all have the same shape, and thus have the same aspect ratio), wherein an i-th frame is resized (Paragraph 2 in 2nd Col. of Page 4 – “intermediate maps are obtained by first upsampling each of the result maps in Eq. (1) to match its next low-scale (i.e., high resolution) and then fuse, as shown in Eq. (2):
[Image of Eq. (2) (taken from You)]
”). Brattoli et al. (Rethinking Zero-Shot Video Classification: End-to-End Training for Realistic Applications), hereinafter Brattoli, also teaches resizing each frame to a fixed size (Paragraph 7 in 1st Col. of Page 6, Paragraph 1 in 2nd Col. of Page 6 – “we reshaped each frame’s shortest side to 128 pixels”), but it does not teach resizing an i-th frame to shorter-side length i x l. However, none of the prior art teaches the claim limitation wherein an i-th frame is resized to shorter-side length i x l, where l is a smallest length. Based on the configuration, it would be improper hindsight to modify You to resize an i-th frame to shorter-side length i x l, where l is a smallest length. Therefore, the combination of features is considered allowable.
Claim 7 would be allowable for disclosing wherein: sampling the pyramid of multi-resolution frames to obtain a set of patches includes aligning patch grid centers for each frame; during model training, randomly choosing a center for each frame along a middle line for a longer-length side; and for inference, using the center of the video frames.
Regarding claim 7, You in view of Liu and He teaches the method of claim 1. Wang et al. (AllFocus: Patch-Based Video Out-of-Focus Blur Reconstruction), hereinafter Wang, teaches aligning patch grid centers for each frame (Paragraph 4-5 in 2nd Col. of Page 4 – “We cut the image into a group of small grids and apply Random sample consensus (RANSAC) within each grid cell to estimate a translation model. Only the feature point matches that agree with the estimated model up to a threshold distance (<2 pixels) are retained as inliers. Subsequently, we fit a homography matrix to all the inliers in the whole image. Let I0 and {It}t=1,2,...,T denote the frame to be restored and the T assisting frames, respectively. We use It to represent the warped image generated by aligning It to I0. p0,x denotes a patch in I0, centered at x, and its corresponding patch in the aligned assisting frame It is represented by pt,x”; Note: patches of different frames are aligned and centered). However, none of the prior art teaches the claim limitations: during model training, randomly choosing a center for each frame along a middle line for a longer-length side; and for inference, using the center of the video frames. Based on the configuration, it would be improper hindsight to modify You to randomly choose a center for each frame along a middle line for a longer-length side during model training, and for inference, use the center of the video frames. Therefore, the combination of features is considered allowable.
Claim 8 would be allowable for disclosing wherein sampling the pyramid of multi-resolution frames to obtain a set of patches includes: from a first one of the neighboring input video frames, uniformly sampling grid patches to capture a complete global view; and for following ones of the neighboring input video frames, linearly sampling spaced-out patches to provide local details.
Regarding claim 8, You in view of Liu and He teaches the method of claim 1. Liu teaches from a first one of the neighboring input video frames, linearly sampling grid patches (Fig. 2 Caption on Page 2, Paragraph 3 in 2nd Col. of Page 4 – “In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window…As illustrated in Figure 2, the first module uses a regular window partitioning strategy which starts from the top-left pixel, and the 8 × 8 feature map is evenly partitioned into 2×2 windows of size 4×4 (M = 4)”; Note: for a first layer/frame, the patches are linearly sampled as evenly partitioned windows; see screenshot of Fig. 2 above). However, none of the prior art teaches the claim limitations: from a first one of the neighboring input video frames, uniformly sampling grid patches to capture a complete global view; and for following ones of the neighboring input video frames, linearly sampling spaced-out patches to provide local details. Based on the configuration, it would be improper hindsight to modify You to uniformly sample grid patches from a first video frame and to linearly sample spaced-out patches for following ones of the video frames. Therefore, the combination of features is considered allowable.
Claim 10 would be allowable for disclosing wherein, for the i-th frame in the pyramid, the distance between patches is set to (i - 1) x P.
Regarding claim 10, You in view of Liu and He teaches the method of claim 9. Besley et al. (US 20200160560 A1), hereinafter Besley, teaches calculating a distance between patches (Paragraph 0185 – “a distance between each pair of oriented patches in the set of oriented patches, is determined under execution of the co-processor 390 and stored, for example, in the RAM 170. The distance determined at step 610 is a function of pixel distance between the patches and the difference in reference direction”). However, none of the prior art teaches the claim limitation wherein, for the i-th frame in the pyramid, the distance between patches is set to (i - 1) x P. Based on the configuration, it would be improper hindsight to modify You to set the distance between patches to (i - 1) x P for the i-th frame in the pyramid. Therefore, the combination of features is considered allowable.
Claim 16 would be allowable for disclosing wherein grouping and rescaling neighboring input video frames includes: division of the neighboring input video frames by a group of N; and proportionally resizing the group of N to N different resolutions preserving a same aspect ratio; wherein an i-th frame is resized to shorter-side length i x l, where l is a smallest length.
Regarding claim 16, You in view of Liu and He teaches the video processing system of claim 12. You further teaches division of the neighboring input video frames by a group of N (Paragraph 2-3 in 2nd Col. of Page 4, Paragraph 2 in 2nd Col. of Page 5, Paragraph 1-2 in 1st Col. of Page 6 – “the outputs from the later four stages in ResNet 50 are chosen as pyramidic feature maps, named C2, C3, C4, and C5… another convolution layer using K kernels with kernel size (3×3) is performed on the fused feature maps {M2, M3, M4, M5} to generate perceptual maps {P2, P3, P4, P5}. In addition, the feature map C5 in the last residual block in ResNet-50 is expected to carry the most abstract information for the target task. Therefore, another perceptual map P6 is obtained by applying a convolution layer with K kernels, kernel size (1×1) and stride=2 on C5…Five perceived quality maps (QF) at different scales are derived as follows, with the same shape as the perceptual maps… pooled features of the five perceived quality maps {QF2, QF3, QF4, QF5, QF6} derived on all video frames, each has shape 1×256, are concatenated. They are severed as the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA…time distributed 1D CNN on temporal clips in a padded sequence is performed, and each clip contains 16 frames”; Note: the video frames are divided into temporal clips based on the quality feature maps. In this case, there are 5 clips so N = 5); and proportionally resizing the group of N to N different resolutions preserving a same aspect ratio (Paragraph 2 in 2nd Col. of Page 5, Paragraph 1 in 1st Col. of Page 6 – “Five perceived quality maps (QF) at different scales are derived as follows, with the same shape as the perceptual maps {P2, P3, P4, P5, P6}… pooled features of the five perceived quality maps {QF2, QF3, QF4, QF5, QF6} derived on all video frames, each has shape 1×256, are concatenated. 
They are severed as the quality features with a shape of 1×1280 to be fed into LSCT for NR-VQA”; Note: the 5 clips each have different scales/resolutions. They all have the same shape, and thus have the same aspect ratio), wherein an i-th frame is resized (Paragraph 2 in 2nd Col. of Page 4 – “intermediate maps are obtained by first upsampling each of the result maps in Eq. (1) to match its next low-scale (i.e., high resolution) and then fuse, as shown in Eq. (2):
[Image of Eq. (2) (taken from You)]
”). Brattoli also teaches resizing each frame to a fixed size (Paragraph 7 in 1st Col. of Page 6, Paragraph 1 in 2nd Col. of Page 6 – “we reshaped each frame’s shortest side to 128 pixels”), but it does not teach resizing an i-th frame to shorter-side length i x l. However, none of the prior art teaches the claim limitation wherein an i-th frame is resized to shorter-side length i x l, where l is a smallest length. Based on the configuration, it would be improper hindsight to modify You to resize an i-th frame to shorter-side length i x l, where l is a smallest length. Therefore, the combination of features is considered allowable.
Claim 17 would be allowable for disclosing wherein: the one or more processors are configured to sample the pyramid of multi-resolution frames to obtain a set of patches by alignment of patch grid centers for each frame; during model training, the one or more processors are configured to randomly choose a center for each frame along a middle line for a longer-length side; and for inference, the one or more processors are configured to use the center of the video frames.
Regarding claim 17, You in view of Liu and He teaches the video processing system of claim 12. Wang teaches alignment of patch grid centers for each frame (Paragraph 4-5 in 2nd Col. of Page 4 – “We cut the image into a group of small grids and apply Random sample consensus (RANSAC) within each grid cell to estimate a translation model. Only the feature point matches that agree with the estimated model up to a threshold distance (<2 pixels) are retained as inliers. Subsequently, we fit a homography matrix to all the inliers in the whole image. Let I0 and {It}t=1,2,...,T denote the frame to be restored and the T assisting frames, respectively. We use It to represent the warped image generated by aligning It to I0. p0,x denotes a patch in I0, centered at x, and its corresponding patch in the aligned assisting frame It is represented by pt,x”; Note: patches of different frames are aligned and centered). However, none of the prior art teaches the claim limitations: during model training, randomly choosing a center for each frame along a middle line for a longer-length side; and for inference, using the center of the video frames. Based on the configuration, it would be improper hindsight to modify You to randomly choose a center for each frame along a middle line for a longer-length side during model training, and for inference, use the center of the video frames. Therefore, the combination of features is considered allowable.
Claim 18 would be allowable for disclosing wherein the one or more processors are configured to sample the pyramid of multi-resolution frames to obtain a set of patches as follows: from a first one of the neighboring input video frames, uniformly sample grid patches to capture a complete global view; and for following ones of the neighboring input video frames, linearly sample spaced-out patches to provide local details.
Regarding claim 18, You in view of Liu and He teaches the video processing system of claim 12. Liu teaches from a first one of the neighboring input video frames, linearly sampling grid patches (Fig. 2 Caption on Page 2, Paragraph 3 in 2nd Col. of Page 4 – “In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window…As illustrated in Figure 2, the first module uses a regular window partitioning strategy which starts from the top-left pixel, and the 8 × 8 feature map is evenly partitioned into 2×2 windows of size 4×4 (M = 4)”; Note: for a first layer/frame, the patches are linearly sampled as evenly partitioned windows; see screenshot of Fig. 2 above). However, none of the prior art teaches the claim limitations: from a first one of the neighboring input video frames, uniformly sampling grid patches to capture a complete global view; and for following ones of the neighboring input video frames, linearly sampling spaced-out patches to provide local details. Based on the configuration, it would be improper hindsight to modify You to uniformly sample grid patches from a first video frame and to linearly sample spaced-out patches for following ones of the video frames. Therefore, the combination of features is considered allowable.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Kolarov et al. (US 20140002670 A1) teaches a method of image and video quality assessment by detecting spatial and temporal errors in reconstructed video. Wang et al. (US 20220201317 A1) teaches a method of video assessment based on a target quality and how encoding can be configured to attain the target quality.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE HAU MA whose telephone number is (571)272-2187. The examiner can normally be reached M-Th 7-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, King Poon can be reached at (571) 270-0728. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHELLE HAU MA/ Examiner, Art Unit 2617
/KING Y POON/Supervisory Patent Examiner, Art Unit 2617