DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-10 and 12-17 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al., US 2021/0326638 A1 (Lee).
Regarding claim 1, Lee teaches a processor-implemented (including a processor) ([0043]) method (video processing, and more specifically to video panoptic segmentation) ([0001]), the method comprising:
obtaining a video feature of a video comprising a plurality of video frames (obtaining video features, from an encoder, of a target frame and one or more reference frames from a video) ([0017] and [0025-0026]);
determining a target object representation of the video based on the video feature (wherein the combined features from the target frame and the reference frames may be used to generate both instance segmentation information and semantic segmentation information for the object) ([0016-0017], [0025-0027], and [0029]) using a neural network (using an artificial neural network (ANN)) ([0031]); and
generating a panorama segmentation result of the video based on the target object representation (generating video panoptic segmentation based on the target object representation from the class labels and instance IDs of the object) ([0029-0031]).
Although Lee does not explicitly teach a “target object”, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that, since Lee teaches a “target frame” ([0017]) and teaches that the panoptic segmentation network can be used to depict objects such as people and vehicles ([0016]), these objects are the “target objects”.
Regarding claim 2, Lee teaches wherein the determining of the target object representation of the video based on the video feature (wherein the combined features from the target frame and the reference frames may be used to generate both instance segmentation information and semantic segmentation information for the object) ([0016-0017], [0025-0027], and [0029]) using the neural network (using an artificial neural network (ANN)) ([0031]) comprises determining the target object representation of the video by performing multiple iteration processing (in some cases, signals may traverse certain layers multiple times) ([0032-0033]) on the video feature (wherein the combined features from the target frame and the reference frames may be used to generate both instance segmentation information and semantic segmentation information for the object) ([0016-0017], [0025-0027], and [0029]) using the neural network (using an artificial neural network (ANN)) ([0031]).
Regarding claim 3, Lee teaches wherein the determining of the target object representation of the video by performing the multiple iteration processing (in some cases, signals may traverse certain layers multiple times) ([0032-0033]) on the video feature (wherein the combined features from the target frame and the reference frames may be used to generate both instance segmentation information and semantic segmentation information for the object) ([0016-0017], [0025-0027], and [0029]) using the neural network (using an artificial neural network (ANN)) ([0031]) comprises determining an object representation by current iteration processing of the video by performing iteration processing based on the video feature (based on the generated features) (Fig. 7; [0101]) and an object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]) by previous iteration processing of the video, using the neural network (using an artificial neural network (ANN)) ([0031]) (wherein the object representation is generated by the ANN and updated based on the comparison) (Fig. 7; [0104-0105]).
Regarding claim 4, Lee teaches wherein the object representation by the previous iteration processing is a pre-configured initial object representation in a case of first iteration processing of the multiple iteration processing (wherein the object representation in a first iteration is a predicted panoptic segmentation information; and the ANN is updated for use again in the iterative processing) (Fig. 7; [0032-0033] and [0103-0105]).
Regarding claim 5, Lee teaches wherein the determining of the object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]) by the current iteration processing of the video by performing the iteration processing based on the video feature (based on the generated features) (Fig. 7; [0101]) and the object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]) by the previous iteration processing of the video (wherein the object representation in a first iteration is a predicted panoptic segmentation information; and the ANN is updated for use again in the iterative processing) (Fig. 7; [0032-0033] and [0103-0105]) comprises:
generating a mask (generating an object mask that identifies the limits of the object) ([0020], [0031], [0037], [0082-0084], [0104] and [0107]) by performing transformation processing (different layers of the neural network can perform different transformations on their inputs) ([0033]) on the object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]) by the previous iteration processing of the video (wherein the object representation in a first iteration is a predicted panoptic segmentation information; and the ANN is updated for use again in the iterative processing) (Fig. 7; [0032-0033] and [0103-0105]) (video clips) ([0096]);
generating a first object representation by processing the video feature, the object representation by the previous iteration processing, and the mask (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]); and
determining the object representation by the current iteration processing of the video based on the first object representation (wherein the object representation is made more accurate based on the updated ANN) ([0032-0033] and [0105]).
Regarding claim 6, Lee teaches wherein the generating of the first object representation by processing the video feature, the object representation by the previous iteration processing, and the mask (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]) comprises:
generating an object representation related to a mask by performing attention processing on the video feature (performing a spatial-temporal attention module to the target features and the reference features) ([0133]) (as well as using a Mask R-CNN and a Spatial Attention Network) ([0037-0038]), the object representation by the previous iteration processing, and the mask (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]); and
generating the first object representation by performing self-attention processing (performing a spatial-temporal attention module to the target features and the reference features) ([0133]) (as well as using a Spatial Attention Network) ([0038]) and classification processing (classification processing) ([0030-0031], [0104], and [0107]) based on the object representation related to the mask and the object representation by the previous iteration processing (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]).
Regarding claim 7, Lee teaches wherein the generating of the object representation related to the mask (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]) by performing the attention processing on the video feature (performing a spatial-temporal attention module to the target features and the reference features) ([0133]) (as well as using a Spatial Attention Network) ([0038]), the object representation by the previous iteration processing, and the mask (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]) comprises:
generating a second object representation based on a key feature (wherein the second representation can be from the next iteration while using key frames) ([0070]) corresponding to the video feature, the object representation by the previous iteration processing, and the mask (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, and the mask) (Fig. 7; [0101-0107]);
determining a first probability indicating an object category (class probability) ([0104]) in the video based on the second object representation (determining a predicted class probability based on the next iteration of the object representation within the ANN) (Fig. 7; [0104-0105]); and
generating the object representation related to the mask based on the first probability, a value feature corresponding to the video feature, and the video feature (wherein generating a unique category label and an instance label by processing the generated features, the previous object representation, the mask, and the class probability) (Fig. 7; [0101-0107]).
Regarding claim 8, Lee teaches wherein the determining of the object representation by the current iteration processing of the video based on the first object representation (wherein the object representation is made more accurate based on the updated ANN) ([0032-0033] and [0105]) comprises:
determining an object representation corresponding to each video frame of one or more video frames of the plurality of video frames (determining an object representation for each frame of the plurality of frames) ([0100] and [0104]), based on the video feature (based on the generated features) (Fig. 7; [0101]) and the first object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]); and
determining the object representation by the current iteration processing of the video based on the first object representation (wherein the object representation is made more accurate based on the updated ANN) ([0032-0033] and [0105]) and the determined object representation corresponding to the each video frame (and the determined object representation for each frame) ([0104]).
Regarding claim 9, Lee teaches wherein the determining of the object representation corresponding to each video frame of the one or more video frames (determining an object representation for each frame of the plurality of frames) ([0100] and [0104]) based on the video feature (based on the generated features) (Fig. 7; [0101]) and the first object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]) comprises:
determining a fourth object representation based on a key feature (wherein the fourth object representation can be from the next iteration while using key frames) ([0070]) corresponding to the video feature (based on the generated features) (Fig. 7; [0101]) and the first object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]);
determining a second probability indicating an object category in the video based on the fourth object representation (wherein the class probabilities are continuously made as the ANN is updated, such as for a fourth object representation, until it’s above a certain probability) ([0104-0105]); and
determining the object representation corresponding to each video frame of the one or more video frames (determining an object representation for each frame of the plurality of frames) ([0100] and [0104]) based on the second probability (wherein the class probabilities are continuously made as the ANN is updated; second probability) ([0104-0105]) and a value feature corresponding to the video feature (and a value corresponding to the features) ([0104]).
Regarding claim 10, Lee teaches wherein the determining of the object representation by the current iteration processing of the video based on the first object representation (wherein the object representation is made more accurate based on the updated ANN) ([0032-0033] and [0105]) and the determined object representation corresponding to the each video frame (determining an object representation for each frame of the plurality of frames) ([0100] and [0104]) comprises:
generating a third object representation (third iteration of updating the ANN) ([0100-0105]) corresponding to the video by performing classification processing (classification processing) ([0030-0031], [0104], and [0107]) and self-attention processing (performing a spatial-temporal attention module to the target features and the reference features) ([0133]) (as well as using a Spatial Attention Network) ([0038]) on the determined object representation corresponding to the each video frame (determining an object representation for each frame of the plurality of frames) ([0100] and [0104]); and
determining the object representation by the current iteration processing of the video based on the first object representation and the third object representation (wherein the object representation is based on the iterations of updating the ANN) (Fig. 7; [0100-0105]).
Regarding claim 12, Lee teaches wherein the generating of the panorama segmentation result comprises generating the panorama segmentation result using a trained panorama segmentation model (generating the video panoptic segmentation based on training an ANN for the video panoptic segmentation) (Fig. 7; [0095]), and
the panorama segmentation model is trained using a target loss function based on a sample panorama segmentation result corresponding to a training video (during the training process, weights may be adjusted to improve the accuracy of the result; i.e., by minimizing a loss function which corresponds in some way to the difference between the current result and the target result) ([0033] and [0107]),
one or more prediction object representations of the training video (generating predicted panoptic segmentation information) (Fig. 7; [0103]) determined through a first module (different heads, such as track head, semantic head, bounding box head, mask head, and segmentation component) ([0071-0072]) configured to implement one or more portions of a panorama segmentation model (for panoptic video segmentation of different portions) ([0071-0072]), and
one or more prediction results of the training video (prediction results) ([0105]) determined through a second module (different heads, such as track head, semantic head, bounding box head, mask head, and segmentation component) ([0071-0072]) configured to implement one or more other portions of the panorama segmentation model (for panoptic video segmentation of different portions) ([0071-0072]).
Regarding claim 13, Lee teaches a non-transitory computer-readable storage medium (non-transitory computer readable medium) ([0112]) storing instructions (storing code) ([0112]) that, when executed by one or more processors (the code comprises instructions executable by a processor) ([0112]), configure the one or more processors to perform the method of claim 1 (to perform the video panoptic segmentation method of claim 1) (see claim 1 above and [0112]).
Regarding claim 14, see the rejection made to claim 1, as well as Lee for an electronic apparatus (an apparatus for video panoptic segmentation) ([0111]) comprising: one or more processors (the apparatus comprising a processor) ([0111]), as Lee teaches all the limitations within this claim.
Regarding claim 15, Lee teaches a processor-implemented (including a processor) ([0043]) method (video processing, and more specifically to video panoptic segmentation) ([0001]), the method comprising:
obtaining training data (utilizing training datasets) ([0041]), wherein the training data comprises a training video (training videos) ([0065]), a first video feature of the training video (wherein the training set comprising a plurality of video clips includes target and reference features) ([0118]), and a sample panorama segmentation result corresponding to the training video (a training set including a set of video clips and original panoptic segmentation information) (Fig. 7; [0096]);
generating a second video feature (wherein the training set comprising a plurality of video clips includes target and reference features) ([0118]) by changing a frame sequence of the first video feature (changing the sequence by selecting a frame and then selecting another frame that may come before, rather than after, that frame) ([0089]);
determining, through a first module configured to implement one or more portions of a panorama segmentation model (determining through a part of the video panoptic segmentation) (Fig. 7; [0096] and [0100]), a first prediction object representation and a second prediction object representation (wherein a first prediction and a second prediction are generated based on updating the ANN of the panoptic segmentation) (Fig. 7; [0104]) of the training video based on the first video feature and the second video feature (based on the generated features; target features and reference features) (Fig. 7; [0101]), respectively;
determining, through a second module configured to implement one or more other portions of the panorama segmentation model (based on portions of the panoptic segmentation) (Fig. 7; [0095]), a first prediction result and a second prediction result of the training video (wherein a first prediction is made, and then the ANN is updated and during the second iteration a second prediction is made for comparison) (Fig. 7; [0103-0105]) based on the first prediction object representation (based on the prediction object representation; predicted class probabilities) ([0103-0104]) and the second prediction object representation (based on the prediction object representation; predicted class probabilities) ([0103-0104]), respectively; and
training the panorama segmentation model (training an ANN for video panoptic segmentation) (Fig. 7; [0095]) using a target loss function based on the sample panorama segmentation result (loss function in the video panoptic segmentation) ([0033] and [0107]), the first prediction object representation (based on the prediction object representation; predicted class probabilities) ([0103-0104]), the second prediction object representation (based on the prediction object representation; predicted class probabilities) ([0103-0104]), the first prediction result, and the second prediction result (wherein a first prediction is made, and then the ANN is updated and during the second iteration a second prediction is made for comparison) (Fig. 7; [0103-0105]).
Although Lee does not explicitly teach “changing a frame sequence of the first video feature”, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention that, since Lee teaches that the subsequent frames can be determined before and after the anchor frame, this is obviously changing a frame sequence (such as going backwards) ([0089]).
Regarding claim 16, Lee teaches wherein the training of the panorama segmentation model (training an ANN for video panoptic segmentation) (Fig. 7; [0095]) using the target loss function based on the sample panorama segmentation result (loss function in the video panoptic segmentation) ([0033] and [0107]), the first prediction object representation (based on the prediction object representation; predicted class probabilities) ([0103-0104]), the second prediction object representation (based on the prediction object representation; predicted class probabilities) ([0103-0104]), the first prediction result, and the second prediction result (wherein a first prediction is made, and then the ANN is updated and during the second iteration a second prediction is made for comparison) (Fig. 7; [0103-0105]) comprises:
determining a first similarity matrix based on the first prediction object representation and the second prediction object representation (determining a feature matrix which represents a correspondence between regions of interest for objects; within the reference frame and target frame; affinity matrix) ([0027] and [0080-0081]);
determining a second similarity matrix based on the sample panorama segmentation result (wherein another similarity matrix is made after updating the ANN on the second iteration) ([0102-0105]), the first prediction result, and the second prediction result (wherein a first prediction is made, and then the ANN is updated and during the second iteration a second prediction is made for comparison) (Fig. 7; [0103-0105]); and
outputting a trained panorama segmentation model (outputting the trained ANN of the panoptic segmentation) ([0105]) in response to the target loss function being determined to be minimum based on the first similarity matrix and the second similarity matrix (by minimizing a loss function which corresponds in some way to the difference between the current result and the target result; such as the first similarity matrix and the second similarity matrix) ([0033] and [0105-0107]).
Regarding claim 17, Lee teaches further comprising, using the trained panorama segmentation model (using the trained ANN of the panoptic segmentation) ([0105]):
obtaining a video feature of a video comprising a plurality of video frames (obtaining video features, from an encoder, of a target frame and one or more reference frames from a video) ([0017] and [0025-0026]);
determining a target object representation of the video based on the video feature (wherein the combined features from the target frame and the reference frames may be used to generate both instance segmentation information and semantic segmentation information for the object) ([0016-0017], [0025-0027], and [0029]) using a neural network of the trained panorama segmentation model (using an artificial neural network (ANN) of the panoptic segmentation) ([0031]); and
generating a panorama segmentation result of the video based on the target object representation (generating video panoptic segmentation based on the target object representation from the class labels and instance IDs of the object) ([0029-0031]).
Claim(s) 11 is rejected under 35 U.S.C. 103 as being unpatentable over Lee et al., US 2021/0326638 A1 (Lee), and further in view of Zhang et al., US 2022/0237414 A1 (Zhang).
Regarding claim 11, Lee teaches wherein the generating of the panorama segmentation result of the video based on the target object representation (generating video panoptic segmentation based on the target object representation from the class labels and instance IDs of the object) ([0029-0031]) comprises:
performing transformation processing (different layers of the neural network can perform different transformations on their inputs) ([0033]) on the target object representation (wherein the combined features from the target frame and the reference frames may be used to generate both instance segmentation information and semantic segmentation information for the object) ([0016-0017], [0025-0027], and [0029]); and
determining mask information of the video (generating an object mask that identifies the limits of the object) ([0020], [0031], [0037], [0082-0084], [0104] and [0107]) based on the transformation-processed (different layers of the neural network can perform different transformations on their inputs) ([0033]) target object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]) and the video feature (based on the generated features) (Fig. 7; [0101]) and determining category information (class probability) ([0104]) of the video based on the transformation-processed (different layers of the neural network can perform different transformations on their inputs) ([0033]) target object representation (wherein each pixel can contain a unique category label and an instance label) (Fig. 7; [0104]).
Lee teaches different layers of the neural network can perform different transformations on their inputs ([0033]). However, Lee does not explicitly teach “linear” transformation.
Zhang teaches apparatuses, systems, and techniques to generate one or more confidence values associated with one or more objects identified by one or more neural networks (Abstract); wherein a neural network is implemented with a set of segmentation classes of objects to detect within images ([0071]); wherein a hidden layer of the neural network can perform linear transformations ([0071]); and wherein the neural network can be used in panoptic segmentation ([0086]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee's neural network transformations to include linear transformations for continuous learning/improvement (Zhang; [0590]) and to achieve a desired accuracy (Zhang; [0145] and [0554]).
Contact
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL J VANCHY JR whose telephone number is (571)270-1193. The examiner can normally be reached Monday - Friday 9am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MICHAEL J VANCHY JR/Primary Examiner, Art Unit 2666 Michael.Vanchy@uspto.gov