Prosecution Insights
Last updated: April 19, 2026
Application No. 18/232,131

VIDEO OBJECT SEGMENTATION USING ESTIMATED MOTION INFORMATION AND IMAGE FEATURES

Status: Non-Final OA (§103)
Filed: Aug 09, 2023
Examiner: BITOR, RENAE ALLYN
Art Unit: 2663
Tech Center: 2600 — Communications
Assignee: Adobe Inc.
OA Round: 1 (Non-Final)
Grant Probability: 86% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 10m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 86% (30 granted / 35 resolved; +23.7% vs TC avg) — above average
Interview Lift: strong, +25.0% among resolved cases with interview
Typical Timeline: 2y 10m avg prosecution; 9 applications currently pending
Career History: 44 total applications across all art units

Statute-Specific Performance

§101: 7.0% (-33.0% vs TC avg)
§103: 51.9% (+11.9% vs TC avg)
§102: 26.0% (-14.0% vs TC avg)
§112: 15.1% (-24.9% vs TC avg)
Tech Center averages are estimates. Based on career data from 35 resolved cases.

Office Action

§103
DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 08/09/2023 was considered by the examiner.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (NPL: TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut, hereafter referred to as Wang) in view of Lin et al. (U.S. Patent App. Pub. No. 2022/0101539 A1, hereafter referred to as Lin).

Regarding Claim 1: Wang teaches a system comprising: at least one computer processor; and one or more computer storage media storing computer-useable instructions that, when used by the at least one computer processor, cause the at least one computer processor to perform operations comprising (Wang: p. 2, bottom right; see implementation of TokenCut algorithmic code using storage and a processor): receiving a video frame, of a plurality of video frames, the plurality of video frames corresponding to a video (Wang: p. 3, Approach: TokenCut; either an image or a sequence of frames); generating a data structure that includes a plurality of nodes and one or more edges, each node, of the plurality of nodes, represents a respective section of the video frame, each edge, of the one or more edges, being associated with a weight that at least partially indicates a measure of similarity between respective sections of the video frame (Wang: p. 3, Approach: TokenCut and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features; object segmentation is then solved using the Ncut algorithm).

Wang fails to further teach accessing estimated motion information for each pixel, of a plurality of pixels, of the video frame. Lin, like Wang, is directed to video segmentation. Lin does teach accessing estimated motion information for each pixel, of a plurality of pixels, of the video frame (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Wang to utilize the estimated motion, as taught by Lin, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. As taught by Lin, the proposed modification would help determine the overall optical flow estimation of the frame (Lin: Par. [0060]).

In regards to Claim 2, Wang as modified by Lin further teaches the system of claim 1, wherein the at least one computer processor performs further operations comprising: in response to the receiving of the video frame and the accessing of the estimated motion information, deriving, via a pre-trained vision transformer, one or more embeddings that are one or more encoded representations of the video frame and the estimated motion information (Wang: p. 3, Vision Transformers; a positional encoding is added to the CLS token and the set of patch tokens, then they are fed to a standard transformer network with self-attention and layer normalization), and wherein the one or more embeddings are used as input for the generating of the data structure (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features).

In regards to Claim 3, Wang as modified by Lin further teaches the system of claim 1, wherein the estimated motion information includes optical flow information that indicates a predicted displacement of each pixel from the video frame to a second video frame, of the plurality of video frames (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics; the temporal characteristics can include a confidence associated with the significance and/or relevance of the motion of the pixel to overall optical flow estimation).

In regards to Claim 4, Wang as modified by Lin further teaches the system of claim 1, wherein the weight further indicates a second measure of feature similarity between respective sections of the video frame according to image feature similarity between the respective sections, and wherein performing video object segmentation of the video frame is based at least in part on the second measure of similarity (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 5, Wang as modified by Lin further teaches the system of claim 1, wherein the data structure is a graph, and wherein the operations further comprising grouping, by performing a graph cut on the graph, one or more nodes, of the plurality of nodes, together based on the measure of similarity exceeding a threshold, and wherein the performing of the video object segmentation is based at least in part on the performing of the graph cut (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 6, Wang as modified by Lin further teaches the system of claim 1, wherein the performing of the video object segmentation of the video frame includes partitioning, via one or more indicators, one or more foreground objects from a background in the video frame (Wang: p. 3, Approach: TokenCut, and Fig. 2; bi-partition of the graph using the second smallest eigenvector allows detection of the foreground object).

In regards to Claim 7, Wang as modified by Lin further teaches the system of claim 1, wherein the receiving of the video frame, the accessing of the estimated motion information, the generating of the data structure, and the performing of the video object segmentation is a part of an unsupervised end-to-end pipeline that excludes learning or training a machine learning model for the video object segmentation (Wang: p. 2; as a training-free method, TokenCut achieves competitive performance on unsupervised video segmentation).

In regards to Claim 8, Wang as modified by Lin further teaches the system of claim 1, wherein the operations further comprising: receiving a second video frame, of the plurality of video frames, the second video frame being a next succeeding video frame relative to the video frame in the video (Wang: p. 3, Approach: TokenCut; either an image or a sequence of frames); accessing second estimated motion information for a second plurality of pixels of the second video frame (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics); generating a second data structure that includes a second plurality of nodes and a second set of one or more edges, each node, of the second plurality of nodes, represents a respective section of the second video frame, each edge, of the second set of one or more edges, being associated with a second weight that at least partially indicates a measure of similarity between respective sections of the second video frame according to the second estimated motion information (Wang: p. 3, Approach: TokenCut and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features), the second estimated motion information being based at least in part on the estimated motion information for the video frame; and based at least in part on the generating of the second data structure, performing video object segmentation of the second video frame (Wang: p. 3, Approach: TokenCut and Fig. 2; object segmentation is then solved using the Ncut algorithm) such that an object is tracked between the video frame and the second video frame (Wang: p. 5, Video Graph; similarity includes a score based on both RGB appearance and a RGB representation of optical flow computed between consecutive frames).

Regarding Claim 9: Wang as modified by Lin further teaches a computer-implemented method comprising (Wang: p. 2, bottom right; see implementation of TokenCut algorithmic code obviously using storage and a processor): receiving a video frame, of a plurality of video frames, the plurality of video frames corresponding to a video (Wang: p. 3, Approach: TokenCut; either an image or a sequence of frames); parsing the video frame into a plurality of sections; generating a first similarity score that indicates a measure of feature similarity between a first section and a second section, of the plurality of sections (Wang: p. 3, Approach: TokenCut and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features); generating a second similarity score that indicates a measure of estimated motion information similarity between the first section and the second section (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics); and based at least in part on the first similarity score and the second similarity score, performing video object segmentation of the video frame (Wang: p. 3, Approach: TokenCut and Fig. 2; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 10, Wang as modified by Lin further teaches the computer-implemented method of claim 9, further comprising: in response to the receiving of the video frame and determining estimated motion information, deriving, via a pre-trained vision transformer, one or more embeddings that are one or more encoded representations of the video frame and the estimated motion information (Wang: p. 3, Vision Transformers; a positional encoding is added to the CLS token and the set of patch tokens, then they are fed to a standard transformer network with self-attention and layer normalization), and wherein the one or more embeddings are used as input for the generating of the first similarity score and the second similarity score (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features).

In regards to Claim 11, Wang as modified by Lin further teaches the computer-implemented method of claim 9, wherein the second similarity score is generated based on estimated motion information, the estimated motion information includes optical flow information that indicates a predicted displacement of each pixel from the video frame to a second video frame, of the plurality of video frames (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics; the temporal characteristics can include a confidence associated with the significance and/or relevance of the motion of the pixel to overall optical flow estimation).

In regards to Claim 12, Wang as modified by Lin further teaches the computer-implemented method of claim 9, further comprising generating a graph that includes a plurality of nodes and one or more edges, each node, of the plurality of nodes, represents a respective section of the video frame, each edge, of the one or more edges, being associated with a weight that at least partially indicates the first similarity score and the second similarity score (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 13, Wang as modified by Lin further teaches the computer-implemented method of claim 12, further comprising grouping, by performing a graph cut on the graph, one or more nodes, of the plurality of nodes, together based on the weight exceeding a threshold, and wherein the performing of the video object segmentation is based at least in part on the performing of the graph cut (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 14, Wang as modified by Lin further teaches the computer-implemented method of claim 9, wherein the performing of the video object segmentation of the video frame includes partitioning, via one or more indicators, one or more foreground objects from a background in the video frame (Wang: p. 3, Approach: TokenCut, and Fig. 2; bi-partition of the graph using the second smallest eigenvector allows detection of the foreground object).

In regards to Claim 15, Wang as modified by Lin further teaches the computer-implemented method of claim 9, wherein the receiving of the video frame, the parsing, the generating of the first similarity score, and the generating of the second similarity score is a part of an unsupervised end-to-end pipeline that excludes learning or training a machine learning model for the video object segmentation (Wang: p. 2; as a training-free method, TokenCut achieves competitive performance on unsupervised video segmentation).

In regards to Claim 16, Wang as modified by Lin further teaches the computer-implemented method of claim 9, further comprising: receiving a second video frame, of the plurality of video frames, the second video frame being a next succeeding video frame relative to the video frame in the video (Wang: p. 3, Approach: TokenCut; either an image or a sequence of frames); parsing the second video frame into a second plurality of sections; generating a third similarity score that indicates a measure of feature similarity between a first section and a second section of the second plurality of sections (Wang: p. 3, Approach: TokenCut and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features); generating a fourth similarity score that indicates a second measure of estimated motion information similarity between the first section and the second section of the second plurality of sections, the second measure of estimated motion being based at least in part on the measure of estimated motion between the first section and the second section of the plurality of sections of the video frame (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics); based at least in part on the third similarity score and the fourth similarity score, performing video object segmentation of the second video frame (Wang: p. 3, Approach: TokenCut and Fig. 2; object segmentation is then solved using the Ncut algorithm) such that an object is tracked between the video frame and the second video frame (Wang: p. 5, Video Graph; similarity includes a score based on both RGB appearance and a RGB representation of optical flow computed between consecutive frames).

Regarding Claim 17: Wang as modified by Lin further teaches a computerized system, the system comprising (Wang: p. 2, bottom right; see implementation of TokenCut algorithmic code obviously using storage and a processor): an embedding means for receiving a video frame, of a plurality of video frames, the plurality of video frames corresponding to a video (Wang: p. 3, Approach: TokenCut; either an image or a sequence of frames); a video frame patch means for parsing the video frame into a plurality of sections (Wang: p. 3, Vision Transformers, Approach: TokenCut, and Fig. 2; each patch is used as a token, described by a vector of numerical features that provide an embedding; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features); an estimated motion means for receiving or determining motion information for a plurality of pixels of the video frame (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics); wherein the embedding means is further for deriving one or more embeddings that are one or more encoded representations of the video frame and the estimated motion information (Wang: p. 3, Vision Transformers; a positional encoding is added to the CLS token and the set of patch tokens, then they are fed to a standard transformer network with self-attention and layer normalization); a similarity score means for generating one or more similarity scores that indicates at least one of a measure of feature similarity between a first section and a second section (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features), of the plurality of sections, and a measure of motion information similarity between the first section and the second section (Lin: Par. [0060]; the feature extraction engine 304 can determine contextual features associated with the pixels of the source frame; the temporal characteristics of the pixel can include one or more characteristics associated with the motion of the pixel, including the velocity of the motion, the acceleration of the motion, among other characteristics); and a VOS means for performing video object segmentation of the video frame based at least in part on the generating of the one or more similarity scores (Wang: p. 3, Approach: TokenCut and Fig. 2; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 18, Wang as modified by Lin further teaches the computerized system of claim 17, further comprising a graph generator means for generating a graph that includes a plurality of nodes and one or more edges, each node, of the plurality of nodes, represents a respective section of the video frame, each edge, of the one or more edges, being associated with a weight that at least partially indicates the one or more similarity scores.

In regards to Claim 19, Wang as modified by Lin further teaches the computerized system of claim 18, further comprising a graph cut means 118 for grouping, by performing a graph cut on the graph, one or more nodes, of the plurality of nodes, together based on the weight exceeding a threshold, and wherein the performing of the video object segmentation is based at least in part on the performing of the graph cut (Wang: p. 3, Approach: TokenCut, and Fig. 2; the algorithm constructs a fully connected graph in which the nodes are image patches and the edges are similarities between the image patches using transformer features; object segmentation is then solved using the Ncut algorithm).

In regards to Claim 20, Wang as modified by Lin further teaches the computerized system of claim 17, wherein the performing of the video object segmentation of the video frame includes partitioning, via one or more indicators, one or more foreground objects from a background in the video frame (Wang: p. 3, Approach: TokenCut, and Fig. 2; bi-partition of the graph using the second smallest eigenvector allows detection of the foreground object).

Pertinent Art

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Loui et al. (U.S. Patent App. Pub. No. 2016/0379055 A1) teaches graph-based spatiotemporal video segmentation and automatic target object extraction in high-dimensional feature space.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to RENAE BITOR, whose telephone number is (703) 756-5563. The examiner can normally be reached Monday to Friday, 8:00-5:30, but is off the 1st Friday of the biweek. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, GREG MORSE, can be reached at (571) 272-3838. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RENAE A BITOR/
Examiner, Art Unit 2663

/GREGORY A MORSE/
Supervisory Patent Examiner, Art Unit 2698
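The rejection's claim mapping leans repeatedly on TokenCut's core mechanism: image patches become graph nodes, pairwise similarities become edge weights, and the foreground/background split is read from the second smallest eigenvector of a normalized-cut formulation. The following is a minimal NumPy sketch of that bipartition, blending appearance and motion similarity as the claims describe; the features here are synthetic stand-ins, not TokenCut's actual transformer embeddings or real optical flow, and the helper names are hypothetical.

```python
import numpy as np

def rbf_similarity(x, sigma=1.0):
    """Gaussian similarity between all pairs of rows of x."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def ncut_bipartition(feat, flow, alpha=0.5):
    """Toy normalized-cut bipartition in the style of TokenCut.

    feat: (N, D) per-patch appearance features.
    flow: (N, 2) per-patch motion vectors (stand-in for optical flow).
    Edge weights blend appearance and motion similarity; the cut is
    read from the second smallest eigenvector (the Fiedler vector)
    of the symmetric normalized Laplacian.
    """
    w = alpha * rbf_similarity(feat) + (1 - alpha) * rbf_similarity(flow)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(w.sum(axis=1)))
    lap = np.eye(len(w)) - d_inv_sqrt @ w @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # second smallest eigenvector
    return fiedler > fiedler.mean()          # boolean partition mask

# Two synthetic groups of 8 "patches" with distinct features and motion.
rng = np.random.default_rng(0)
feat = np.vstack([rng.normal(0, 0.1, (8, 4)), rng.normal(3, 0.1, (8, 4))])
flow = np.vstack([rng.normal(0, 0.1, (8, 2)), rng.normal(2, 0.1, (8, 2))])
mask = ncut_bipartition(feat, flow)
```

With cleanly separated synthetic groups, the mask recovers the two underlying clusters. TokenCut additionally selects which side of the cut is foreground (using the most salient patch), a step omitted in this sketch.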

Prosecution Timeline

Aug 09, 2023 — Application Filed
Jan 10, 2026 — Non-Final Rejection (§103)
Feb 13, 2026 — Interview Requested
Feb 19, 2026 — Applicant Interview (Telephonic)
Feb 19, 2026 — Examiner Interview Summary

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602809 — MEASURING METHOD AND MEASURING APPARATUS OF BLOOD VESSEL DIAMETER OF FUNDUS IMAGE (granted Apr 14, 2026; 2y 5m to grant)
Patent 12602826 — NEURAL NETWORK-BASED POSE ESTIMATION AND REGISTRATION METHOD AND DEVICE FOR HETEROGENEOUS IMAGES, AND MEDIUM (granted Apr 14, 2026; 2y 5m to grant)
Patent 12546882 — CAMERA-RADAR SENSOR FUSION USING LOCAL ATTENTION MECHANISM (granted Feb 10, 2026; 2y 5m to grant)
Patent 12524909 — Plane Detection and Identification for City-Scale Localization (granted Jan 13, 2026; 2y 5m to grant)
Patent 12518373 — METHODS OF CLASSIFYING DEFECTS OF A PATTERN (granted Jan 06, 2026; 2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 86%
With Interview: 99% (+25.0%)
Median Time to Grant: 2y 10m
PTA Risk: Low
Based on 35 resolved cases by this examiner. Grant probability derived from career allow rate.
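The headline figures above are simple derivations from the examiner's career record. A quick sketch of the arithmetic, assuming the dashboard applies the observed +25-point interview lift additively and caps the result at 99% (the dashboard's actual model is not disclosed):

```python
granted, resolved = 30, 35          # examiner's career outcomes
allow_rate = granted / resolved     # 0.857..., displayed as "86%"

interview_lift = 0.25               # observed lift among interviewed cases
with_interview = min(allow_rate + interview_lift, 0.99)

print(f"career allow rate: {allow_rate:.0%}")      # 86%
print(f"with interview:    {with_interview:.0%}")  # 99%
```

The cap explains why "With Interview" shows 99% rather than the out-of-range 86% + 25% = 111%.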
