Prosecution Insights
Last updated: April 19, 2026
Application No. 18/530,189

Virtual Occlusion Mask Prediction Through Implicit Depth Estimation

Status: Final Rejection (§103)
Filed: Dec 05, 2023
Examiner: HE, WEIMING
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: Niantic, Inc.
OA Round: 2 (Final)
Grant Probability: 46% (moderate)
OA Rounds: 3-4
To Grant: 3y 4m
With Interview: 60%

Examiner Intelligence

Career Allow Rate: 46% (190 granted / 410 resolved; -15.7% vs TC average)
Interview Lift: +13.8% among resolved cases with an interview (moderate lift)
Typical Timeline: 3y 4m average prosecution; 40 applications currently pending
Career History: 450 total applications across all art units
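
The headline figures above are simple ratios over the examiner's career counts. A minimal sketch of the arithmetic, assuming the dashboard computes them exactly this way (the variable names are ours, not the tool's):

```python
# Arithmetic behind the headline figures, reconstructed from the counts
# shown above. Assumes simple ratios; names are hypothetical.

granted = 190                     # "190 granted / 410 resolved"
resolved = 410

allow_rate = granted / resolved   # 0.4634 -> rendered as 46%
print(f"Career allow rate: {allow_rate:.1%}")                  # 46.3%

# The interview lift is quoted as +13.8 percentage points; adding it to
# the base rate reproduces the ~60% "With Interview" figure.
interview_lift = 0.138
print(f"With interview:   {allow_rate + interview_lift:.1%}")  # 60.1%
```

Reading the lift as additive percentage points, rather than a relative multiplier, is what matches the displayed 60%.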

Statute-Specific Performance

§101: 7.4% (-32.6% vs TC average)
§103: 59.2% (+19.2% vs TC average)
§102: 12.4% (-27.6% vs TC average)
§112: 15.0% (-25.0% vs TC average)

Tech Center averages are estimates; based on career data from 410 resolved cases.
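
The per-statute deltas imply the Tech Center baseline used for comparison. A quick check, assuming each delta is a simple difference (examiner rate minus TC average):

```python
# Back out the implied Tech Center average per statute from the deltas
# shown above, assuming delta = examiner_rate - tc_average.

stats = {               # statute: (examiner rate, delta vs TC average)
    "§101": (0.074, -0.326),
    "§103": (0.592, +0.192),
    "§102": (0.124, -0.276),
    "§112": (0.150, -0.250),
}

for statute, (rate, delta) in stats.items():
    tc_avg = rate - delta            # e.g. §103: 0.592 - 0.192 = 0.400
    print(f"{statute}: examiner {rate:.1%}, implied TC average {tc_avg:.1%}")
```

Every implied baseline comes out to 40.0%, which suggests the Tech Center average here is a single flat estimate rather than a per-statute figure.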

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The amendment filed on 3/3/26 has been entered and made of record. Claims 1, 12 and 21 are amended. Claims 1-21 are pending.

Response to Arguments

Applicant’s arguments with respect to claims 1, 12 and 21 have been fully considered but they are moot because the arguments do not apply to the references being used in the current rejection.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-21 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (US 2024/0169541 A1) in view of Daly et al. (US 2022/0414973 A1) and Veges et al. (“Temporal Smoothing for 3D Human Pose Estimation and Localization for Occluded People,” https://arxiv.org/abs/2011.00250).

As to Claim 1, Zhang teaches A computer-implemented method for generating a composite image including a virtual object placed in an image of a real-world environment, the method comprising:

receiving one or more input images captured by a camera assembly of a client device of the real-world environment (Zhang discloses “According to some aspects, an image generator 200 receives an original image including original content having an occluded portion and an occluding portion” in [0039]; “The image may be provided by a user and/or previously generated by an image generation model…” in [0109]. It is obvious that the input image can be an image captured by a camera of a user device; for example, Daly discloses “The AR HMD 110 may include one or more cameras which can capture images and videos of environments” in [0021]);

generating a feature map from the one or more input images, wherein the feature map comprises abstract features representing depth of one or more objects in the real-world environment (Zhang discloses “At operation 630, the image encoder can generate feature maps for the image 510, where the feature maps represent each object instance 512, 515” in [0070], see also [0110]; “The image encoder 720 can generate features that include information indicating which object is the occluding object and which object is the occluded object that would be completed” in [0076], see also [0091]. Here, the detection of the objects with their positions can carry abstract depth information used to determine which object is occluded.)

Zhang does not directly teach a depth map. The combination with Daly further teaches the following limitations:

accessing instructions for rendering augmented reality content inclusive of a virtual object to be placed into images captured of the real-world environment, wherein the instructions include a depth map for placement of the virtual object (Zhang discloses “In some aspects, the output image combines additional content in a manner consistent with the original content” in [0049].
Daly further discloses “Based on the depth map, the computing system may generate a two-dimensional occlusion surface representing at least a visible portion of the one or more physical objects that are located within a predetermined depth range defined relative to the viewpoint… The computing system may determine a visibility of a virtual object relative to the one or more physical objects by comparing a model of the virtual object with the two-dimensional occlusion surface and generate an output image based on the determined visibility of the virtual object” in [0004]);

generating an occlusion mask from the feature map and the depth map for the virtual object, wherein the depth map for the virtual object indicates a depth of each pixel of the virtual object, and wherein the occlusion mask indicates one or more pixels of the virtual object that are occluded by an object in the real-world environment (Zhang discloses “At operation 1120, the image can be encoded to obtain image features and feature maps” in [0110]; “At operation 1130, the image features can be decoded to obtain an occlusion mask for the occluding and occluded objects” in [0111]. Daly further discloses “A method includes generating a depth map of a real environment as seen from a viewpoint that comprises pixels having corresponding depth values of one or more physical objects. Based on the depth map a two-dimensional occlusion surface is generated representing at least a visible portion of the one or more physical objects that are located within a predetermined depth range defined relative to the viewpoint” in the Abstract.);

generating the composite image based on a first input image at a current timestamp, the virtual object, and the occlusion mask; storing the composite image for subsequent display on an electronic display of the client device (Zhang discloses “and generating an output image including the original content from the original image and the additional content in the region using a diffusion model 1360 that takes the embedding vector as input” in [0125]. Daly further discloses “The displays 114 may be transparent or translucent allowing a user wearing the AR HMD 110 to look through the displays 114 to see the real world and displaying visual artificial reality content to the user at the same time” in [0021]; “As an example and not by way of limitation, this initial output image of a view may be a view of an artificial reality environment including a set of virtual objects, for example a virtual bear, and one or more two-dimensional occlusion surfaces that represent real objects within the real environment” in [0042]; “The artificial reality content may include video, audio, haptic feedback, or some combination thereof, and any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer)” in [0002]; see also storage in Fig. 9.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Zhang with the teaching of Daly so as to determine the visibility of a virtual object relative to the one or more physical objects and generate a composite image based on the visibility of the virtual object (Daly, Abstract).

Zhang and Daly are silent on temporal smoothing. The combination with Veges further teaches the following limitation: wherein generating the occlusion mask employs temporal smoothing across timestamps of the one or more input images (Daly discloses “In particular embodiments the two-dimensional occlusion mask may be generated to represent the user's viewing frustum from the particular viewpoint or pose. The two-dimensional occlusion surfaces may represent the one or more physical objects as they should appear from a particular viewpoint, and as such, may account for the user's perspective of the object from the view at a particular time” in [0040], see also [0036, 0043]. Veges further discloses “While temporal methods can still predict a reasonable estimation for a temporarily disappeared pose using past and future frames… We present an energy minimization approach to generate smooth, valid trajectories in time, bridging gaps in visibility” in the Abstract; “Our second contribution is an energy minimization based smoothing function, targeting specifically those frames where a person became temporarily invisible. It adaptively smoothes the prediction stronger at frames where the pose is occluded and weaker when the pose is visible” at p. 2.)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the invention of Zhang and Daly with the teaching of Veges so as to use temporal methods to predict a reasonable estimation for a temporarily disappeared pose using past and future frames, generating smooth, valid trajectories in time and bridging gaps in visibility (Veges, Abstract).

As to Claim 2, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein the one or more input images are frames from video data captured by the camera assembly (Daly discloses “The AR HMD 110 may include one or more cameras which can capture images and videos of environments” in [0021], see also [0023].)

As to Claim 3, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein a dimensionality of the feature map is the same as a dimensionality of the one or more input images (Zhang, [0061, 0080]).

As to Claim 4, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 3, wherein the feature map is a matrix comprising features across a plurality of input images (Zhang discloses “In various embodiments, an input RGB image can be split into non-overlapping image patches by a patch splitting module. Each image patch is treated as a ‘token’, where a feature is set as a concatenation of the raw pixel RGB values. For example, with a patch size of 4x4, the feature dimension of each patch would be 4x4x3=48 for the RGB image” in [0080].)
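
A note on the technique underlying the claim 1 mapping above: the recited mask generation reduces to a per-pixel depth test. Purely as an illustration (this sketch is not from the application, Zhang, or Daly; the array names and shapes are assumptions), the occlusion mask and compositing steps might look like:

```python
# Illustrative sketch of the depth-test occlusion mask and compositing
# steps recited in claim 1. Not taken from the application or the cited
# references; names and shapes are assumptions.

import numpy as np

def occlusion_mask(scene_depth: np.ndarray,
                   virtual_depth: np.ndarray,
                   virtual_alpha: np.ndarray) -> np.ndarray:
    """Mark virtual-object pixels hidden by closer real-world geometry.

    scene_depth   (H, W): per-pixel depth estimated from the input image(s)
    virtual_depth (H, W): per-pixel depth of the rendered virtual object
    virtual_alpha (H, W): coverage of the virtual object (0 outside it)
    """
    return (scene_depth < virtual_depth) & (virtual_alpha > 0)

def composite(frame: np.ndarray,
              virtual_rgb: np.ndarray,
              virtual_alpha: np.ndarray,
              occluded: np.ndarray) -> np.ndarray:
    """Place only the visible portion of the virtual object into the frame."""
    visible = (virtual_alpha > 0) & ~occluded
    out = frame.copy()
    out[visible] = virtual_rgb[visible]   # frame is (H, W, 3); masks are (H, W)
    return out
```

In the claim language the scene depth is implicit: the feature map “comprises abstract features representing depth” rather than an explicit depth image, but the mask semantics are the same.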
As to Claim 5, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein generating the feature map from the one or more input images comprises applying a trained feature network to the one or more input features to generate the feature map (Zhang discloses “image features generated by an encoder (i.e., latent diffusion)” in [0052]; “At operation 630, the image encoder can generate feature maps for the image 510, where the feature maps represent each object instance 512, 515” in [0070]; “The image encoder 720 can include a plurality of convolutional neural network (CNN) layers that forms a backbone of the mask network of the image processing system 130…The image encoder 720 can generate one or more feature maps 740” in [0077].)

As to Claim 6, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 5, wherein the trained feature network is a neural network (Zhang discloses “The image encoder 720 can include a plurality of convolutional neural network (CNN) layers that forms a backbone of the mask network of the image processing system 130…The image encoder 720 can generate one or more feature maps 740” in [0077].)

As to Claim 7, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein generating the occlusion mask from the feature map and the depth map for the virtual object comprises applying a mask predictor to the feature map and the depth map for the virtual object to generate the occlusion mask (Zhang discloses “In various embodiments, a mask network is used to provide an instance mask prediction and a rough occlusion prediction based on the features generated by an image encoder in the first step. The outputs of the first step are then provided to the diffusion model to perform amodal mask completion” in [0069]; see also Figs. 6 & 9.)

As to Claim 8, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 7, wherein the mask predictor is a multi-layer perceptron (Zhang discloses multilayer perceptron 755 in Fig. 7.)

As to Claim 9, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein generating the occlusion mask comprises performing temporal smoothing with a previous occlusion mask generated for a second input image at a prior timestamp before the current timestamp (Zhang discloses “The instance mask 820 can be based on the object detection and feature masks previously generated by the mask network. An occluded region 785 can be identified in the image 510” in [0105]. Daly discloses “there is less need for re-rendering as the computing system can reuse the mask surfaces as the user moves around the environment (e.g., the user's perspective changes less for occlusion masks located at greater distances from the user)” in [0028]. Veges further discloses “While temporal methods can still predict a reasonable estimation for a temporarily disappeared pose using past and future frames… We present an energy minimization approach to generate smooth, valid trajectories in time, bridging gaps in visibility” in the Abstract; “Our second contribution is an energy minimization based smoothing function, targeting specifically those frames where a person became temporarily invisible. It adaptively smoothes the prediction stronger at frames where the pose is occluded and weaker when the pose is visible” at p. 2.)
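
Claim 9 smooths the current mask against the mask from a prior timestamp. Veges et al. do this with energy minimization; a deliberately simpler stand-in that only illustrates the idea (our assumption, not the examiner's mapping) is an exponential moving average over soft occlusion scores:

```python
# Illustrative temporal smoothing of occlusion masks across frames.
# Veges et al. use energy minimization; this EMA is only a simple
# stand-in for the concept recited in claims 9 and 18.

import numpy as np

def smooth_mask(current_scores: np.ndarray,
                previous_scores: np.ndarray,
                alpha: float = 0.7,
                threshold: float = 0.5) -> tuple[np.ndarray, np.ndarray]:
    """Blend per-pixel occlusion scores with the previous frame's.

    alpha weights the current frame; a lower alpha smooths harder,
    echoing Veges' idea of smoothing more where occlusion is likely.
    Returns the binary mask plus the blended scores to carry forward.
    """
    blended = alpha * current_scores + (1.0 - alpha) * previous_scores
    return blended > threshold, blended
```

Each frame's blended scores become previous_scores at the next timestamp, so brief occlusion flicker is damped without re-estimating depth every frame.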
As to Claim 10, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein generating the composite image comprises: applying the occlusion mask to the virtual object to determine a portion of the virtual object that is in view; and placing the portion of the virtual object into the first input image to generate the composite image (Zhang discloses applying an occlusion mask to two objects in Figs. 7-9. Here, the occluding object can be a virtual object. For example, Daly discloses “The computing system may use the posed surfaces to determine a visibility of a virtual object relative to the one or more physical objects by comparing a model or surface representing the virtual object with the two-dimensional occlusion surfaces and generating an output image based on the determined visibility of the virtual object” in [0029], see also Fig. 5.)

As to Claim 11, Zhang in view of Daly and Veges teaches The computer-implemented method of claim 1, wherein the occlusion mask is generated further based on a depth map for a second virtual object, and wherein the composite image further includes the second virtual object (Daly discloses “For example, the predetermined depths of the one or more two-dimensional occlusion surfaces permits the computing system to determine and render the proper occlusion of the virtual objects relative to the one or more physical objects in the real environment, for example by occluding a portion of the surface representing a virtual object in the scene based on the pose of the one or more two-dimensional occlusion surfaces in the three-dimensional coordinate system” in [0043]. Here, the second virtual object can be a second virtual object in the same field of view of the HMD, or a second virtual object in a different field of view of the HMD.)

Claim 12 recites similar limitations as claim 1 but in a computer-readable storage medium form. Therefore, the same rationale used for claim 1 is applied. Claims 13-20 are rejected based upon similar rationale as claims 2, 3, 4, 5, 7, 9, 10 and 11, respectively. Claim 21 recites similar limitations as claim 1 but in a system form. Therefore, the same rationale used for claim 1 is applied.

Conclusion

THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEIMING HE, whose telephone number is (571) 270-1221. The examiner can normally be reached Monday-Friday, 8:30am-5:00pm. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Tammy Goddard, can be reached at 571-272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Weiming He/
Primary Examiner, Art Unit 2611

Prosecution Timeline

Dec 05, 2023
Application Filed
Sep 01, 2025
Non-Final Rejection — §103
Mar 02, 2026
Applicant Interview (Telephonic)
Mar 02, 2026
Examiner Interview Summary
Mar 03, 2026
Response Filed
Mar 24, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12567135: MULTIMEDIA PLAYBACK MONITORING SYSTEM AND METHOD, AND ELECTRONIC APPARATUS
Granted Mar 03, 2026 (2y 5m to grant)

Patent 12561876: System and method for an audio-visual avatar creation
Granted Feb 24, 2026 (2y 5m to grant)

Patent 12514672: System, Method And Software Program For Aiding In Positioning Of Objects In A Surgical Environment
Granted Jan 06, 2026 (2y 5m to grant)

Patent 12494003: AUTOMATIC LAYER FLATTENING WITH REAL-TIME VISUAL DEPICTION
Granted Dec 09, 2025 (2y 5m to grant)

Patent 12468949: SYSTEMS AND METHODS FOR FEW-SHOT TRANSFER LEARNING
Granted Nov 11, 2025 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 46%
With Interview: 60% (+13.8%)
Median Time to Grant: 3y 4m
PTA Risk: Moderate

Based on 410 resolved cases by this examiner; grant probability is derived from the career allow rate.
