Prosecution Insights
Last updated: April 19, 2026
Application No. 18/237,563

ITERATIVE BACKGROUND GENERATION FOR VIDEO STREAMS

Status: Final Rejection (§103)
Filed: Aug 24, 2023
Examiner: JONES, CARISSA ANNE
Art Unit: 2691
Tech Center: 2600 (Communications)
Assignee: Google LLC
OA Round: 2 (Final)
Grant Probability: 83% (Favorable)
Projected OA Rounds: 3-4
Projected Time to Grant: 2y 10m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 83% (20 granted of 24 resolved), above average (+21.3% vs TC avg)
Interview Lift: +25.0% among resolved cases with an interview (a strong lift)
Typical Timeline: 2y 10m average prosecution; 30 applications currently pending
Career History: 54 total applications across all art units
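
To make the arithmetic behind these cards explicit, here is a minimal sketch assuming the dashboard's apparent definitions (allow rate as grants divided by resolved cases; interview lift as the allow rate with an interview minus the rate without). Only the 20-of-24 count comes from the report; the Tech Center baseline is inferred from the stated delta, and the interview split is a made-up placeholder.

```python
# Examiner-metric arithmetic under assumed definitions.
granted, resolved = 20, 24
allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")              # ~83.3%, displayed as 83%

tc_average = 0.62                                          # implied by the +21.3% delta
print(f"Delta vs TC average: {allow_rate - tc_average:+.1%}")

# Placeholder interview split (granted, resolved); the report's actual split
# is not published and is what yields the displayed +25.0% lift.
with_iv, without_iv = (13, 14), (7, 10)
lift = with_iv[0] / with_iv[1] - without_iv[0] / without_iv[1]
print(f"Interview lift (placeholder split): {lift:+.1%}")
```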

Statute-Specific Performance

§101: 3.1% (-36.9% vs TC avg)
§103: 76.0% (+36.0% vs TC avg)
§102: 11.6% (-28.4% vs TC avg)
§112: 4.9% (-35.1% vs TC avg)
Tech Center averages are estimates; based on career data from 24 resolved cases.
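
The "vs TC avg" deltas imply a Tech Center baseline that can be backed out by simple subtraction; a quick sketch using the values above (the report does not state how these shares are defined, so this is only a consistency check):

```python
# Back out the implied Tech Center baseline for each statute:
# baseline = examiner share - reported delta (all values in percent).
examiner_share = {"§101": 3.1, "§103": 76.0, "§102": 11.6, "§112": 4.9}
delta_vs_tc = {"§101": -36.9, "§103": 36.0, "§102": -28.4, "§112": -35.1}

for statute, share in examiner_share.items():
    baseline = share - delta_vs_tc[statute]
    print(f"{statute}: examiner {share:.1f}% vs implied TC average {baseline:.1f}%")
```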

Office Action

§103
DETAILED ACTION

This action is in response to the remarks filed 10/10/2025. Claims 1 – 20 are pending and have been examined.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 10/10/2025 have been fully considered but they are not persuasive. As discussed in the Remarks dated 10/10/2025, a telephonic interview took place on October 7, 2025 in which proposed amendments were discussed (see Office Action Appendix dated 10/10/2025). The Examiner stated that the proposed amendments appeared to overcome the prior art of record; however, further search and consideration was required. The entirety of the proposed amendments was not reflected in the amended claims dated 10/10/2025. Because the only amendment to the claimed limitations is that each step in the claimed method is performed during a video conference, and the prior art of record does teach this limitation, the Non-Final Rejection dated 07/10/2025 has been maintained. Therefore, independent claims 1, 9, and 17 are rejected, and the dependent claims are rejected as well, except for claims 6 and 14. Claims 6 and 14 remain objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Response to Amendment

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 – 3, 9 – 11, and 17 – 19 are rejected under 35 U.S.C. 103 as being unpatentable over Sommerlade et al. (U.S. Pub. No. 2022/0383034, hereinafter "Sommerlade") in view of Cower (U.S. Patent No. 11,869,274), Ho (KR Pub. No. 20160057867), and Jung et al. (EP Pub. No. 4105878, hereinafter "Jung").

Regarding Claim 1, Sommerlade teaches A method (see Sommerlade Abstract, method) comprising: determining a first background layer and a first foreground layer of a first frame of a video stream provided by a client device associated with a first participant of a plurality of participants of a video conference (see Sommerlade Paragraph [0025], The stream processor 112 is configured to segment an input image from a stream of input images into a "foreground" portion that contains a target object of the input image, and a "background" portion that contains a remainder of the input image.
The target object may be a person in a video conference feed, an object of interest (e.g., a toy or coffee mug that may be held up to a camera), or other suitable target and Paragraph [0004], The method comprises: receiving the stream of input images, including receiving a current input image; identifying one or more target objects, including a first target object, spatio-temporally within the stream of input images; tracking the one or more target objects, including the first target object, spatio-temporally within the stream of input images; segmenting the current input image into i) a foreground including the first target object, and ii) a background); modifying, during the video conference and using the image of the obscured region and the combined background layer, background layers of subsequent frames of the video stream (see Sommerlade Paragraph [0005], processing the background of the current input image differently from the foreground of the current input image; and generating an output image by merging the foreground and the first target object with the background Paragraph [0049], For each subsequent input image, a region is either updated via corresponding detection or predicted from the previous frame location, for example using a Kalman filter (not shown) and subsequent facial landmark detection, Paragraph [0025], The stream processor 112 is configured to segment an input image from a stream of input images into a “foreground” portion that contains a target object of the input image, and a “background” portion that contains a remainder of the input image. The target object may be a person in a video conference feed and Paragraph [0047], the foreground processor 220 uses the unique identifier and associated metadata to use a same processing technique during the stream of input images. For example, the foreground processor 220 performs a color reconstruction process using the color processor 224 with consistent parameters or level of detail for the target object, allowing the target object to be consistently displayed in the output images (i.e., without significant changes in color that might otherwise appear due to noise or other small variations in the input images). Tracking the target object ensures that processing of the target object uses the same processing technique (e.g., color reconstruction, superresolution, etc.) during the stream of input images); and providing, during the video conference, the video stream with modified background layers for presentation on one or more client devices of one or more of the plurality of participants of the video conference (see Sommerlade Paragraph [0026], After segmentation, the stream processor 112 is configured to process the foreground layers and background layers separately using different image processing techniques. The stream processor 112 then merges the foreground and background to obtain an output image. The output image may be displayed on a local display device, transmitted to another display device, encoded, etc., which may be a video conference (Paragraph [0025])). 
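
To ground the claim-1 segmentation language above, here is a minimal sketch of splitting a conference frame into foreground and background layers. It uses MediaPipe's selfie-segmentation model purely as a stand-in for the segmentation network the references describe; the library choice, the webcam source, and the 0.5 threshold are illustrative assumptions, not details from the application or the cited art.

```python
import cv2
import mediapipe as mp
import numpy as np

# Stand-in segmenter; the references describe a neural network but do not
# prescribe MediaPipe. model_selection=1 picks the landscape-oriented model.
segmenter = mp.solutions.selfie_segmentation.SelfieSegmentation(model_selection=1)

def split_layers(frame_bgr: np.ndarray, threshold: float = 0.5):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = segmenter.process(rgb)
    person = result.segmentation_mask > threshold      # True where the participant is
    foreground = np.where(person[..., None], frame_bgr, 0)
    background = np.where(person[..., None], 0, frame_bgr)
    return foreground, background, person

cap = cv2.VideoCapture(0)            # local webcam standing in for the conference feed
ok, frame = cap.read()
if ok:
    fg, bg, person_mask = split_layers(frame)
    cv2.imwrite("foreground.png", fg)
    cv2.imwrite("background.png", bg)
cap.release()
```
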
Sommerlade does not expressively teach determining, during the video conference, a second background layer and a second foreground layer of a second frame of the video stream; combining, during the video conference, the first background layer and the second background layer to obtain a combined background layer, wherein the combined background layer comprises a region obscured by both the first foreground layer and the second foreground layer; performing, during the video conference and using a generative machine learning model, an inpainting of the obscured region to obtain an image of the obscured region; However, Cower teaches determining, during the video conference, a second background layer and a second foreground layer of a second frame of the video stream (see Cower Abstract, A method includes receiving a set of video frames that correspond to a video, including a first video frame and a second video frame that each include a face, wherein the second video frame is subsequent to the first video frame and Column 10, lines 11 – 16, The video analyzer 204 receives the set of decoded video frames and, for each frame, identifies a background and a face in the decoded video frame. For example, the video analyzer 204 identifies a first background and a face (or foreground) in the first video frame and a second background and the face (or foreground) in the second video frame, and Column 9, lines 34 – 38, In some embodiments, the decoder 202 includes a set of instructions executable by the processor 235 to decode encoded video frames, e.g., received from a sender device that participates in a video call with the computing device 200); combining, during the video conference, the first background layer and the second background layer to obtain a combined background layer (see Cower Column 2, lines 20-22, In some embodiments, the method further comprises blending the first background with the second background to obtain a blended background and Column 9, lines 34 – 38, In some embodiments, the decoder 202 includes a set of instructions executable by the processor 235 to decode encoded video frames, e.g., received from a sender device that participates in a video call with the computing device 200), It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining a background and foreground layer of a video stream in a video conference and subsequently modifying the background for display (as taught in Sommerlade), with determining a second background and foreground layer of a second frame of the video and combining the background layers (as taught in Cower), the motivation being to collect more than one set of data (video layers) in order to create more realistic and dynamic background replacements in videos, and to minimize inconsistencies (see Cower Column 15, lines 34 - 39). 
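
The Cower-style combination of background layers can be sketched the same way: each frame's visible background pixels are folded into a running composite, and whatever has never been visible stays flagged as obscured. The overwrite-with-newest update rule below is an assumption made for illustration, not Cower's disclosed blending.

```python
import numpy as np

class BackgroundAccumulator:
    """Running composite of background pixels observed across frames."""

    def __init__(self, height: int, width: int):
        self.composite = np.zeros((height, width, 3), dtype=np.uint8)
        self.seen = np.zeros((height, width), dtype=bool)   # ever observed?

    def update(self, bg_frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
        visible = ~person_mask
        self.composite[visible] = bg_frame[visible]          # newest pixels win
        self.seen |= visible
        return ~self.seen   # still obscured in every frame folded in so far
```

After the first two frames are folded in, the returned mask is the claim's region obscured by both the first and the second foreground layers.
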
Sommerlade in view of Cower does not expressively teach wherein the combined background layer comprises a region obscured by both the first foreground layer and the second foreground layer; performing, during the video conference and using a generative machine learning model, an inpainting of the obscured region to obtain an image of the obscured region; However, Ho teaches wherein the background layer comprises a region obscured by both the first foreground layer and the second foreground layer (see Ho Paragraph [0034], Specifically, the local image region of the second image may be a background region that was obscured by the subject from the entire background of the first image and Figure 3, 3 is a view for explaining an embodiment in which an area obscured by a subject occurs in an image being shot); It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining a background and foreground layer of a video stream in a video conference and subsequently modifying the background for display (as taught in Sommerlade), with determining a second background and foreground layer of a second frame of the video and combining the background layers (as taught in Cower), the motivation being to collect more than one set of data (video layers) in order to create more realistic and dynamic background replacements in videos, and to minimize inconsistencies (see Cower Column 15, lines 34 - 39). It would have been further obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining multiple background and foreground layers of a video stream in a video conference to combine and subsequently modifying the background for display (as taught in Sommerlade in view of Cower), with an obscured area in a background layer due to segmentation of video layers (as taught in Ho), the motivation being to determine and address the area that is absent after segmentation, in order to replace that absent area in case a user moves from the original spot in the frame during the video conference (see Ho, Paragraph [0002] and [0003]). Sommerlade in view of Cower and Ho does not expressively teach performing, during the video conference and using a generative machine learning model, an inpainting of the obscured region to obtain an image of the obscured region; However, Jung teaches performing, during the video conference and using a generative machine learning model, an inpainting of the obscured region to obtain an image of the obscured region (see Jung Paragraph [0043], The image acquisition device 1000 may detect at least one of the main objects 1a and 1b and the sub-objects 2a, 2b, and 2c from the acquired image by using a trained model 3000. The image acquisition device 1000 may restore at least a portion of the main object 1b hidden by the sub-objects 2a, 2b, and 2c, by using the trained model 3000. After the restoration of the hidden part of the object 1b, a cumulative effect to a viewer of the image is that of a complete and continuous representation of the object 1b including the portion that was previously hidden by one or more of the sub-objects, Figure 31, in which image acquisition device includes A/V input interface 1600, and Paragraph [0174], The A/V input interface 1600 inputs an audio signal or a video signal, and may include a camera 1610 and the microphone 1620. 
The camera 1610 may acquire an image frame, such as a still image or a moving picture, via an image sensor in a video call mode or a photography mode. An image captured via the image sensor may be processed by the processor 1300 or a separate image processor (not shown)); It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining a background and foreground layer of a video stream in a video conference and subsequently modifying the background for display (as taught in Sommerlade), with determining a second background and foreground layer of a second frame of the video and combining the background layers (as taught in Cower), the motivation being to collect more than one set of data (video layers) in order to create more realistic and dynamic background replacements in videos, and to minimize inconsistencies (see Cower Column 15, lines 34 - 39). It would have been further obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining multiple background and foreground layers of a video stream in a video conference to combine and subsequently modifying the background for display (as taught in Sommerlade in view of Cower), with an obscured area in a background layer due to segmentation of video layers (as taught in Ho), the motivation being to determine and address the area that is absent after segmentation, in order to replace that absent area in case a user moves from the original spot in the frame during the video conference (see Ho, Paragraph [0002] and [0003]). It would have been even further obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining multiple background and foreground layers of a video stream in a video conference to combine, that has an obscured area in a background layer due to segmentation of video layers, and subsequently modifying the background for display (as taught in Sommerlade in view of Cower and Ho), with using a generative machine learning model, restoring an obscured region to obtain an image of the obscured region (as taught in Jung), the motivation being to ensure the obscured area isn’t an absent space displayed during a video conference, and filling or restoring that area in a quick and accurate process (see Jung Paragraph [0002]). 
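
For the inpainting step attributed to Jung, here is a sketch using a Stable Diffusion inpainting pipeline from Hugging Face diffusers as one example of a generative model; neither the claim nor the references names a particular model, and a classical filler such as cv2.inpaint could stand in where a diffusion model is too heavy for real-time use. The checkpoint name, prompt, and file names are illustrative.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Example generative inpainter; any model that fills masked pixels would do.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

background = Image.open("background.png").convert("RGB").resize((512, 512))
obscured = Image.open("obscured_mask.png").convert("L").resize((512, 512))  # white = fill

filled = pipe(
    prompt="empty room, consistent lighting, no people",
    image=background,
    mask_image=obscured,
).images[0]
filled.save("background_filled.png")  # background with the obscured region generated
```
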
Regarding Claim 2, Sommerlade in view of Cower, Ho, and Jung teach The method of claim 1, wherein determining, during the video conference, the first background layer and the first foreground layer of the video stream comprises: providing the first frame of the video stream as input to a machine learning model, wherein the machine learning model is trained to predict, based on a given frame, segmentation labels for the given frame that represent foreground and background regions of the given frame (see Sommerlade Paragraph [0023], In one embodiment, an instance of the neural network model 114 is configured to receive an input image, or a portion thereof, perform an image processing technique, and provide an output image and Paragraph [0024], In another embodiment, the neural network model 114 is a recurrent neural network model, convolutional neural network model, or other suitable neural network model that is configured to estimate a mask for segmenting an input image, as described herein); obtaining a plurality of outputs from the machine learning model, wherein the plurality of outputs comprises one or more background regions and one or more foreground regions; combining the one or more background regions to obtain the first background layer; and combining the one or more foreground regions to obtain the first foreground layer (see Sommerlade Paragraph [0049], The object tracker 310 in some embodiments is a face tracker. For each face, a region is memorized for the subsequent input image. If a current location of a detected face is related to a previously detected face, the current location is taken as the update to the previous location. This way, a temporally consistent labelling is possible, in other words, a target object will have a same label even as the target object moves around within an image (i.e., within a scene shown in the image). In some embodiments, the object tracker 310 uses a neural network model 114 to recursively update the estimate and take into account previous frames. The neural network model 114 may be a recurrent neural network model, a convolutional neural network model, or other suitable neural network model, in various embodiments. In an embodiment, a relationship between locations is established via overlap of the output regions. For each subsequent input image, a region is either updated via corresponding detection or predicted from the previous frame location, for example using a Kalman filter (not shown) and subsequent facial landmark detection. The output of the face tracker is turned into a binary mask by setting the pixels inside of the face regions to "foreground", the other pixels to "background.").

Regarding Claim 3, Sommerlade in view of Cower, Ho, and Jung teach The method of claim 1, further comprising performing iterative modifications on the image for subsequent frames of the video stream as portions of the obscured region are revealed (see Sommerlade Paragraph [0046], The object tracker 310 is configured to identify and/or classify target objects within an input image, such as input image 340. The object tracker 310 may select unique identifiers for target objects within a stream of input images. For example, when a target object is identified, the object tracker 310 assigns a unique identifier to the target object that persists for a duration of the stream of input images. In some scenarios, the target object may not be identifiable in a subsequent input image. For example, the target object may be partially or totally obscured within one or more input images of a stream (e.g., obscured by another object such as a hat or book that passes in front of a user's face, obscured by a feature in the background such as a screen that the user walks behind, or hidden by moving out of frame) for a period of time, but return to the stream of input images at a later time (e.g., resume being identifiable). The object tracker 310 stores and maintains the unique identifier and associated metadata for the target object in a memory (e.g., stream data store 116), allowing the object tracker 310 to continue tracking the target object once it is no longer obscured in subsequent input images of the stream of input images, then tracking the target object (using the same unique identifier) and processing the target object (using the same processing technique) when it is no longer obscured).

Regarding Claims 9 – 11, they are rejected similarly as Claims 1 – 3, respectively. The system can be found in Sommerlade (Paragraph [0005], system).

Regarding Claims 17 – 19, they are rejected similarly as Claims 1 – 3, respectively. The non-transitory computer-readable storage medium can be found in Cower (Column 2, line 58, non-transitory computer readable medium).

Claims 4, 5, 7, 8, 12, 13, 15, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Sommerlade et al. (U.S. Pub. No. 2022/0383034, hereinafter "Sommerlade") in view of Cower (U.S. Patent No. 11,869,274), Ho (KR Pub. No. 20160057867), Jung et al. (EP Pub. No. 4105878, hereinafter "Jung"), and Dal Zotto (U.S. Pub. No. 2023/0005159).

Regarding Claim 4, Sommerlade in view of Cower, Ho, and Jung teach all the limitations of claim 3, but do not expressively teach The method of claim 3, wherein performing iterative modifications on the image for subsequent frames of the video stream as portions of the obscured region are revealed comprises: determining a third background layer of a third frame of the video stream; determining a shared region of the image that shares a common area with the third background layer; and modifying the image to replace a portion of the image corresponding to the shared region with a portion of the third background layer corresponding to the shared region. However, Dal Zotto teaches The method of claim 3, wherein performing iterative modifications on the image for subsequent frames of the video stream as portions of the obscured region are revealed comprises: determining a third background layer of a third frame of the video stream (see Dal Zotto Paragraph [0029], Processor 102 also performs segmentation on the next frame (e.g., a third frame). After segmentation is performed on the third frame, the non-person pixels (background) of the third frame are compared to the non-person pixels of the original (non-altered) second frame); determining a shared region of the image that shares a common area with the third background layer (see Dal Zotto Paragraph [0029], The pixels in the second frame used for the comparison are the pixels as they appeared before the mitigation action was applied to the second frame); and modifying the image to replace a portion of the image corresponding to the shared region with a portion of the third background layer corresponding to the shared region (see Dal Zotto Paragraph [0029], If a difference is found between non-person pixels in the third frame and non-person pixels in the second frame, a mitigation action may be applied to the third frame.
This process continues for each frame of the video, with each subsequent frame compared to the previous frame). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of a method of determining multiple background and foreground layers of a video stream in a video conference to combine, that has an obscured area in a background layer due to segmentation of video layers that is restored using a machine learning model, and subsequently modifying the background for display (as taught in Sommerlade in view of Cower, Ho, and Jung), with iterative modifications on an image for subsequent frames of a video stream (as taught in Dal Zotto), the motivation being to perform mitigation actions on a user's video stream automatically, in response to detecting certain conditions, to ensure the process continues throughout a video without user input (see Dal Zotto, Paragraph [0010]).

Regarding Claim 5, Sommerlade in view of Cower, Ho, Jung, and Dal Zotto teach The method of claim 4, further comprising ceasing the iterative modifications on the image in response to satisfying one or more criteria (see Dal Zotto Paragraph [0022], Applying blur or another mitigation action may be stopped when the motion stops. If two sequential frames match, or if the difference between the two frames is below a threshold for motion detection, system 100 may stop applying the mitigation action to the frames. The two frames would be displayed normally, without any mitigation action performed on the background. Normal display of the background would continue, without any mitigation action, until motion is detected again and Paragraph [0031], If the motion stops after a number of frames that is lower than the predetermined number of frames, the mitigation action may not be applied).

Regarding Claim 7, Sommerlade in view of Cower, Ho, Jung, and Dal Zotto teach The method of claim 5, wherein the one or more criteria comprise at least one of exceeding a threshold amount of time or a threshold number of frames of the video stream (see Dal Zotto Paragraph [0022], Applying blur or another mitigation action may be stopped when the motion stops. If two sequential frames match, or if the difference between the two frames is below a threshold for motion detection, system 100 may stop applying the mitigation action to the frames. The two frames would be displayed normally, without any mitigation action performed on the background. Normal display of the background would continue, without any mitigation action, until motion is detected again and Paragraph [0031], If the motion stops after a number of frames that is lower than the predetermined number of frames, the mitigation action may not be applied).

Regarding Claim 8, Sommerlade in view of Cower, Ho, Jung, and Dal Zotto teach The method of claim 5, further comprising resuming the iterative modifications on the image in response to detecting movement within the video stream (see Dal Zotto Figure 2, if motion is detected, then mitigation action is taken and Paragraph [0039], Process 200 continues at block 260, where processor 102 determines if motion is detected. Motion may be detected by finding a difference between the previous background frame pixels and the current background frame pixels. If the difference between the background pixels in the frames is above a threshold, motion is detected and Paragraph [0042], Process 200 is repeated for each frame of the video source in one example. Segmentation and motion detection may be performed continuously on the frames of a video as those frames are received from the video source).

Regarding Claims 12, 13, 15, and 16, they are rejected similarly as Claims 4, 5, 7, and 8, respectively. The system can be found in Sommerlade (Paragraph [0005], system).

Regarding Claim 20, it is rejected similarly as Claim 4. The non-transitory computer-readable storage medium can be found in Cower (Column 2, line 58, non-transitory computer readable medium).

Allowable Subject Matter

Claims 6 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Refer to PTO-892, Notice of References Cited, for a listing of analogous art.

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CARISSA A JONES, whose telephone number is (703) 756-1677. The examiner can normally be reached via telework, M-F 6:30 AM - 4:00 PM CT. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Duc Nguyen, can be reached at 571-272-7503. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CARISSA A JONES/
Examiner, Art Unit 2691

/DUC NGUYEN/
Supervisory Patent Examiner, Art Unit 2691
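
Rounding out the picture for claims 3 through 8 as mapped above, here is a sketch of the iterative refinement loop: previously inpainted pixels are replaced with real background pixels once the participant moves and reveals them, and refinement stops after a frame or time budget (or when the scene is static) and resumes on motion. The update rule, thresholds, and mean-absolute-difference motion test are illustrative assumptions, not taken from the application or the cited references.

```python
import time
import numpy as np

class BackgroundRefiner:
    """Iteratively swap inpainted (synthetic) pixels for real ones as they are revealed."""

    def __init__(self, inpainted_bg: np.ndarray, synthetic_mask: np.ndarray,
                 max_frames: int = 300, max_seconds: float = 10.0,
                 motion_threshold: float = 2.0):
        self.background = inpainted_bg.copy()    # current best background image
        self.synthetic = synthetic_mask.copy()   # True where pixels are still generated
        self.max_frames = max_frames
        self.max_seconds = max_seconds
        self.motion_threshold = motion_threshold
        self.start = time.monotonic()
        self.frames = 0
        self.prev_bg = None
        self.active = True

    def _motion(self, bg_frame: np.ndarray) -> float:
        # Mean absolute pixel difference between consecutive background frames.
        if self.prev_bg is None:
            return 0.0
        return float(np.mean(np.abs(bg_frame.astype(np.int16)
                                    - self.prev_bg.astype(np.int16))))

    def step(self, bg_frame: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
        self.frames += 1
        motion = self._motion(bg_frame)
        self.prev_bg = bg_frame.copy()

        # Claim 8: resume refinement when movement is detected.
        if motion > self.motion_threshold:
            self.active = True
        # Claims 5 and 7: cease refinement once a time or frame budget is exceeded.
        elif (self.frames > self.max_frames
              or time.monotonic() - self.start > self.max_seconds):
            self.active = False

        if self.active:
            # Claims 3 and 4: where the real background is now visible, replace
            # the corresponding (shared) region of the generated image.
            revealed = self.synthetic & ~person_mask
            self.background[revealed] = bg_frame[revealed]
            self.synthetic &= person_mask
        return self.background
```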

Prosecution Timeline

Aug 24, 2023
Application Filed
Jul 08, 2025
Non-Final Rejection — §103
Sep 12, 2025
Interview Requested
Oct 07, 2025
Applicant Interview (Telephonic)
Oct 07, 2025
Examiner Interview Summary
Oct 10, 2025
Response Filed
Jan 20, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12598267: IMAGE CAPTURE APPARATUS AND CONTROL METHOD
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12598354: INFORMATION PROCESSING SERVER, RECORD CREATION SYSTEM, DISPLAY CONTROL METHOD, AND NON-TRANSITORY RECORDING MEDIUM
Granted Apr 07, 2026 (2y 5m to grant)
Patent 12593004: DISPLAY METHOD, DISPLAY SYSTEM, AND NON-TRANSITORY COMPUTER-READABLE STORAGE MEDIUM STORING PROGRAM
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12556468: QUALITY TESTING OF COMMUNICATIONS FOR CONFERENCE CALL ENDPOINTS
Granted Feb 17, 2026 (2y 5m to grant)
Patent 12556655: Efficient Detection of Co-Located Participant Devices in Teleconferencing Sessions
Granted Feb 17, 2026 (2y 5m to grant)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 83%
With Interview: 99% (+25.0%)
Median Time to Grant: 2y 10m
PTA Risk: Moderate
Based on 24 resolved cases by this examiner. Grant probability is derived from the career allow rate.
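
For transparency, here is one plausible reading of how the projection numbers follow from the examiner statistics above; the 99% cap and adding the interview lift in percentage points are assumptions, since the report does not state its model.

```python
# Projection arithmetic under assumed rules: add the interview lift in
# percentage points and cap the result at 99%.
allow_rate = 20 / 24                     # career allow rate, ~83%
interview_lift = 0.25                    # +25.0 percentage points
with_interview = min(allow_rate + interview_lift, 0.99)

print(f"Grant probability: {allow_rate:.0%}")                      # 83%
print(f"Grant probability with interview: {with_interview:.0%}")   # 99% (capped)
```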
