Prosecution Insights
Last updated: April 19, 2026
Application No. 18/435,458

MULTI-SOURCE POSE MERGING FOR DEPTH ESTIMATION

Status: Non-Final OA (§103)
Filed: Feb 07, 2024
Examiner: LIU, XIAO
Art Unit: 2664
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 89% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 9m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 89% (257 granted / 290 resolved), above average (+26.6% vs TC avg)
Interview Lift: +11.5% on resolved cases with interview (moderate, roughly +12%)
Typical Timeline: 2y 9m average prosecution; 44 applications currently pending
Career History: 334 total applications across all art units

Statute-Specific Performance

§101: 8.8% (-31.2% vs TC avg)
§103: 50.9% (+10.9% vs TC avg)
§102: 17.0% (-23.0% vs TC avg)
§112: 17.4% (-22.6% vs TC avg)

Tech Center averages are estimates. Based on career data from 290 resolved cases.

Office Action

§103

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification

The disclosure is objected to because of the following informalities:
- In paragraph [0077], lines 5-6, symbols "I_T", "602" and "604" are not shown in Figure 6.
- In paragraph [0079], lines 2-3, symbols "606" and "608" are not shown in Figure 6.
- In paragraph [0080], lines 1-3, symbols "612" and "614" are not shown in Figure 6.
- In paragraphs [0081]-[0083], symbols "616" and "630" are not shown in Figure 6.

Appropriate correction is required.

Applicant is reminded of the proper language and format for an abstract of the disclosure. The abstract should be in narrative form and generally limited to a single paragraph, on a separate sheet, within the range of 50 to 150 words in length. The abstract should describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details. The language should be clear and concise and should not repeat information given in the title. It should avoid phrases which can be implied, such as "The disclosure concerns," "The disclosure defined by this invention," "The disclosure describes," etc. In addition, the form and legal phraseology often used in patent claims, such as "means" and "said," should be avoided.

The abstract of the disclosure is objected to because it contains the phrase "This disclosure provides". A corrected abstract of the disclosure is required and must be presented on a separate sheet, apart from any other text. See MPEP § 608.01(b).

Claim Rejections - 35 U.S.C. § 103

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

"A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made."

Claims 1-8 and 10-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (2017 CVPR), hereinafter Zhou, in view of Zhou et al. (arXiv:1605.03557v3, 11 Feb 2017), hereinafter Zhou1.

Regarding claim 1, Zhou discloses a method for depth estimation, comprising (Abstract; FIGS. 1-10):

generating, in accordance with first image data of a first image frame and second image data of a second image frame (FIGS. 2-3; equation (1); p. 1853, 1st Col., Sec. 3.1, 2nd paragraph [image omitted]), a first mask indicating one or more pixels determined not to change position between the first image frame and the second image frame (p. 1853, 2nd Col., Sec. 3.3, 1st paragraph - p. 1854, 1st Col., 1st paragraph, "outputs a per-pixel soft mask Ê_s for each target-source pair, indicating the [remainder of quotation reproduced only as an image; omitted]"; p. 1855, 1st Col., 2nd paragraph, "Explainability mask"; p. 1858, Sec. 4.3);

generating, in accordance with the first image data and the second image data, a second mask indicating one or more pixels determined not to change position between the first image frame and the second image frame, wherein the first mask and the second mask are generated using at least some different input data (p. 1853, equations (1)-(2); p. 1854, equations (3)-(4), 2nd Col., 1st paragraph; Note: one or more pixels p in the target view image frame I_t, or a scale of I_t, are considered the first image data, and the corresponding pixels or scales of the source view frame I_s are considered the second image data; other pixels p in I_t, or a different scale of I_t, and the corresponding other pixels or a different scale of I_s, are considered different input data).

Zhou does not disclose combining the first mask with the second mask to generate a third mask. However, Zhou does teach combining the output of each per-pixel soft mask Ê_s for each target-source pair (p. 1854, equation (3)). In the same field of endeavor, Zhou1 teaches a view synthesis method for multiple input views by learning how to optimally combine single-view predictions (Zhou1: Abstract; FIGS. 1-9). Zhou1 further teaches combining the first mask with the second mask to generate a third mask (Zhou1: FIG. 3; p. 7, Sec. 3.2, 2nd paragraph [image omitted]).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhou with the teaching of Zhou1 by combining the first mask with the second mask to generate a third mask, in order to leverage the individual strengths of different input views to synthesize target views that might not be feasible with any input view alone (Zhou1: p. 7, Sec. 3.2, 1st paragraph).
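The limitation driving this §103 combination, merging two per-pixel soft masks into a single third mask, can be illustrated with a short sketch. This is a hypothetical illustration, not code from Zhou or Zhou1: the normalized self-weighting rule in `combine_masks` below is an assumed stand-in for the learned combination Zhou1 describes.

```python
import numpy as np

def combine_masks(mask_a: np.ndarray, mask_b: np.ndarray) -> np.ndarray:
    """Merge two per-pixel soft masks (values in [0, 1]) into a third.

    Hypothetical rule: each output pixel is a confidence-weighted blend
    of the two inputs, so pixels that either source marks as unreliable
    are down-weighted. Output stays in [0, 1] because
    (a^2 + b^2) / (a + b) <= max(a, b).
    """
    assert mask_a.shape == mask_b.shape
    weights = np.stack([mask_a, mask_b])       # shape (2, H, W)
    total = weights.sum(axis=0) + 1e-8         # avoid divide-by-zero
    combined = (weights * weights).sum(axis=0) / total
    return np.clip(combined, 0.0, 1.0)

a = np.array([[1.0, 0.2], [0.5, 0.0]])
b = np.array([[0.8, 0.9], [0.5, 0.0]])
c = combine_masks(a, b)                        # third mask derived from both
```

Any pixelwise merge with the same range-preserving property would serve the illustration; the point at issue is only that the third mask is generated from the first two.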
Regarding claim 10, Zhou discloses an apparatus, comprising (Abstract; FIGS. 1-10): a memory storing processor-readable code; and at least one processor coupled to the memory, the at least one processor configured to execute the processor-readable code to cause the at least one processor to perform operations (one or more processors or memories must be used in order to implement Zhou's FIGS. 1-4) including:

generating, in accordance with first image data of a first image frame and second image data of a second image frame (FIGS. 2-3; equation (1); p. 1853, 1st Col., Sec. 3.1, 2nd paragraph [image omitted]), a first mask indicating one or more pixels determined not to change position between the first image frame and the second image frame (p. 1853, 2nd Col., Sec. 3.3, 1st paragraph - p. 1854, 1st Col., 1st paragraph, "outputs a per-pixel soft mask Ê_s for each target-source pair, indicating the [remainder of quotation reproduced only as an image; omitted]"; p. 1855, 1st Col., 2nd paragraph, "Explainability mask"; p. 1858, Sec. 4.3);

generating, in accordance with the first image data and the second image data, a second mask indicating one or more pixels determined not to change position between the first image frame and the second image frame, wherein the first mask and the second mask are generated using at least some different input data (p. 1853, equations (1)-(2); p. 1854, equations (3)-(4), 2nd Col., 1st paragraph; Note: one or more pixels p in the target view image frame I_t, or a scale of I_t, are considered the first image data, and the corresponding pixels or scales of the source view frame I_s are considered the second image data; other pixels p in I_t, or a different scale of I_t, and the corresponding other pixels or a different scale of I_s, are considered different input data).

Zhou does not disclose combining the first mask with the second mask to generate a third mask. However, Zhou does teach combining the output of each per-pixel soft mask Ê_s for each target-source pair (p. 1854, equation (3)). In the same field of endeavor, Zhou1 teaches a view synthesis method for multiple input views by learning how to optimally combine single-view predictions (Zhou1: Abstract; FIGS. 1-9). Zhou1 further teaches combining the first mask with the second mask to generate a third mask (Zhou1: FIG. 3; p. 7, Sec. 3.2, 2nd paragraph [image omitted]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhou with the teaching of Zhou1 by combining the first mask with the second mask to generate a third mask, in order to leverage the individual strengths of different input views to synthesize target views that might not be feasible with any input view alone (Zhou1: p. 7, Sec. 3.2, 1st paragraph).

Regarding claim 16, Zhou discloses an apparatus, comprising (Abstract; FIGS. 1-10): at least one image sensor configured to capture first image data and second image data; a positioning engine; a memory storing processor-readable code; and at least one processor coupled to the memory, to the at least one image sensor (p. 1852, 2nd Col., Sec. 3, 1st paragraph, "a moving camera"), and to the positioning engine (FIGS. 1-2, pose CNN), the at least one processor configured to execute the processor-readable code to cause the at least one processor to perform operations (one or more processors or memories must be used in order to implement Zhou's FIGS. 1-4) including:

generating, in accordance with first image data of a first image frame and second image data of a second image frame (FIGS. 2-3; equation (1); p. 1853, 1st Col., Sec. 3.1, 2nd paragraph [image omitted]), a first mask indicating one or more pixels determined not to change position between the first image frame and the second image frame (p. 1853, 2nd Col., Sec. 3.3, 1st paragraph - p. 1854, 1st Col., 1st paragraph, "outputs a per-pixel soft mask Ê_s for each target-source pair, indicating the [remainder of quotation reproduced only as an image; omitted]"; p. 1855, 1st Col., 2nd paragraph, "Explainability mask"; p. 1858, Sec. 4.3);

generating, in accordance with the first image data and the second image data, a second mask indicating one or more pixels determined not to change position between the first image frame and the second image frame, wherein the first mask and the second mask are generated using at least some different input data (p. 1853, equations (1)-(2); p. 1854, equations (3)-(4), 2nd Col., 1st paragraph; Note: one or more pixels p in the target view image frame I_t, or a scale of I_t, are considered the first image data, and the corresponding pixels or scales of the source view frame I_s are considered the second image data; other pixels p in I_t, or a different scale of I_t, and the corresponding other pixels or a different scale of I_s, are considered different input data);

wherein: generating the first mask comprises generating the first mask in accordance with first positioning information, indicating a position of the at least one image sensor that captured the first image frame and the second image frame, from the positioning engine (FIGS. 1-3; equations (1)-(4)), and generating the second mask comprises generating the second mask in accordance with second positioning information, indicating the position of the at least one image sensor that captured the first image frame and the second image frame, from a pose estimation network (FIGS. 1-3; equations (1)-(4); it is known that algorithms based on Structure from Motion (SFM) often assume precise per-frame camera poses (e.g., camera position and orientation) as auxiliary inputs, which are typically estimated with SFM; see Kopf et al. (US 12243251 B1): Col. 1, lines 50-55).

Zhou does not disclose combining the first mask with the second mask to generate a third mask. However, Zhou does teach combining the output of each per-pixel soft mask Ê_s for each target-source pair (p. 1854, equation (3)). In the same field of endeavor, Zhou1 teaches a view synthesis method for multiple input views by learning how to optimally combine single-view predictions (Zhou1: Abstract; FIGS. 1-9). Zhou1 further teaches combining the first mask with the second mask to generate a third mask (Zhou1: FIG. 3; p. 7, Sec. 3.2, 2nd paragraph [image omitted]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhou with the teaching of Zhou1 by combining the first mask with the second mask to generate a third mask, in order to leverage the individual strengths of different input views to synthesize target views that might not be feasible with any input view alone (Zhou1: p. 7, Sec. 3.2, 1st paragraph).

Regarding claims 2, 11 and 17, Zhou in view of Zhou1 teaches the method of claim 1, the apparatus of claim 10, and the apparatus of claim 16. Zhou discloses wherein the first mask comprises a first explainability mask and the second mask comprises a second explainability mask (p. 1853, 2nd Col., Sec. 3.3, 1st paragraph - p. 1854, 1st Col., 1st paragraph, "outputs a per-pixel soft mask Ê_s for each target-source pair, indicating the [remainder of quotation reproduced only as an image; omitted]"; p. 1855, 1st Col., 2nd paragraph, "Explainability mask"; p. 1858, Sec. 4.3). Zhou does not disclose wherein the third mask comprises a third explainability mask. In the same field of endeavor, Zhou1 teaches a view synthesis method for multiple input views by learning how to optimally combine single-view predictions (Zhou1: Abstract; FIGS. 1-9). Zhou1 further teaches wherein the third mask comprises a third explainability mask (Zhou1: FIG. 3; p. 7, Sec. 3.2, 2nd paragraph [image omitted]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhou with the teaching of Zhou1 by combining the first mask with the second mask to generate a third mask, in order to leverage the individual strengths of different input views to synthesize target views that might not be feasible with any input view alone (Zhou1: p. 7, Sec. 3.2, 1st paragraph).

Regarding claims 3 and 12, Zhou in view of Zhou1 teaches the method of claim 1 and the apparatus of claim 10. The combination further teaches wherein generating the first mask comprises generating the first mask in accordance with first positioning information, indicating a position of the at least one image sensor that captured the first image frame and the second image frame, from the positioning engine (Zhou: FIGS. 1-3; equations (1)-(4)), and generating the second mask comprises generating the second mask in accordance with second positioning information, indicating the position of the at least one image sensor that captured the first image frame and the second image frame, from a pose estimation network (Zhou: FIGS. 1-3; equations (1)-(4); it is known that algorithms based on Structure from Motion (SFM) often assume precise per-frame camera poses (e.g., camera position and orientation) as auxiliary inputs, which are typically estimated with SFM; see Kopf et al. (US 12243251 B1): Col. 1, lines 50-55).

Regarding claims 4, 13 and 18, Zhou in view of Zhou1 teaches the method of claim 3, the apparatus of claim 12, and the apparatus of claim 17. The combination further teaches receiving, from the positioning engine, the first positioning information; generating a first reconstructed version of the first image frame in accordance with the first positioning information, the second image data, and depth information for the first image frame (Zhou: FIGS. 1-3; equation (1); p. 1853, 2nd Col., Sec. 3.2, "As indicated in Eq. 1, a key component of our learning framework is a differentiable depth image-based renderer that reconstructs the target view I_t by sampling pixels from a source view I_s based on the predicted depth map D̂_t and the relative pose T̂_t→s."); and generating the first mask in accordance with the first reconstructed version of the first image frame and the first image frame (Zhou: equation (3); p. 1853 - p. 1854, Sec. 3.3).

Regarding claims 5, 14 and 19, Zhou in view of Zhou1 teaches the method of claim 4, the apparatus of claim 13, and the apparatus of claim 18. The combination further teaches receiving, from the pose estimation network, the second positioning information; generating a second reconstructed version of the first image frame in accordance with the second positioning information, the second image data, and depth information for the first image frame (Zhou: FIGS. 1-3; equation (1); p. 1853, 2nd Col., Sec. 3.2, "As indicated in Eq. 1, a key component of our learning framework is a differentiable depth image-based renderer that reconstructs the target view I_t by sampling pixels from a source view I_s based on the predicted depth map D̂_t and the relative pose T̂_t→s."); and generating the second mask in accordance with the second reconstructed version of the first image frame and the first image frame (Zhou: equation (3); p. 1853 - p. 1854, Sec. 3.3).

Regarding claims 6, 15 and 20, Zhou in view of Zhou1 teaches the method of claim 5, the apparatus of claim 14, and the apparatus of claim 19. The combination further teaches generating a third reconstructed version of the first image frame in accordance with the first reconstructed version of the first image frame and the second reconstructed version of the first image frame (Zhou: FIGS. 1-4; equations (1)-(4)).

Regarding claim 7, Zhou in view of Zhou1 teaches the method of claim 1. The combination further teaches determining a photometric loss in accordance with the third mask; and training a depth estimation network based on the photometric loss (Zhou: FIG. 2 (caption), "the photometric reconstruction loss is used for training the CNN"; equation (3)).

Regarding claim 8, Zhou in view of Zhou1 teaches the method of claim 1. Zhou discloses determining a first mask value of a first pixel of the first mask, and determining a second mask value of a second pixel of the second mask, wherein the second pixel corresponds to the first pixel (equation (3); p. 1853 - p. 1854, Sec. 3.3). Zhou does not disclose determining a combined mask value for a third pixel of the third mask in accordance with the first mask value and the second mask value, wherein the third pixel corresponds to the first and second pixels. In the same field of endeavor, Zhou1 teaches a view synthesis method for multiple input views by learning how to optimally combine single-view predictions (Zhou1: Abstract; FIGS. 1-9).
Zhou1 further teaches determining a combined mask value for a third pixel of the third mask in accordance with the first mask value and the second mask value, wherein the third pixel corresponds to the first and second pixels (Zhou1: FIG. 3; p. 7, Sec. 3.2, 2nd paragraph [image omitted]). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Zhou with the teaching of Zhou1 by combining the first mask with the second mask to generate a third mask, in order to leverage the individual strengths of different input views to synthesize target views that might not be feasible with any input view alone (Zhou1: p. 7, Sec. 3.2, 1st paragraph).

Allowable Subject Matter

Claim 9 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAO LIU, whose telephone number is (571) 272-4539. The examiner can normally be reached Monday-Thursday and alternate Fridays, 8:30-4:30.

Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jennifer Mehmood, can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/XIAO LIU/
Primary Examiner, Art Unit 2664
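The mask-weighted photometric reconstruction loss the rejection repeatedly cites (Zhou, equation (3)) can be sketched in a few lines: residuals between the target frame and its depth-and-pose-based reconstruction are down-weighted wherever the soft mask says view synthesis is unreliable. This is an illustrative reimplementation, not the reference's code; the reconstruction below is a stand-in array rather than an actual differentiable warp.

```python
import numpy as np

def photometric_loss(target: np.ndarray,
                     reconstructed: np.ndarray,
                     mask: np.ndarray) -> float:
    """Mask-weighted L1 photometric loss, in the spirit of Zhou's Eq. (3):
    per-pixel |I_t - Î_s| residuals are scaled by the soft mask before
    averaging, so unreliable pixels contribute less to training.
    """
    residual = np.abs(target - reconstructed)
    return float((mask * residual).mean())

rng = np.random.default_rng(0)
target = rng.random((4, 4))
recon = target + 0.1                 # reconstruction uniformly off by 0.1
full_mask = np.ones((4, 4))
half_mask = np.full((4, 4), 0.5)

l_full = photometric_loss(target, recon, full_mask)
l_half = photometric_loss(target, recon, half_mask)
# Halving the mask halves the loss for the same residual.
```

In a training loop this scalar would be backpropagated through the depth and pose networks; here it only demonstrates the weighting behavior the examiner relies on.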

Prosecution Timeline

Feb 07, 2024: Application Filed
Dec 17, 2025: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603972: WIRELESS TRANSMITTER IDENTIFICATION IN VISUAL SCENES (2y 5m to grant; granted Apr 14, 2026)
Patent 12592069: OBJECT RECOGNITION METHOD AND APPARATUS, AND DEVICE AND MEDIUM (2y 5m to grant; granted Mar 31, 2026)
Patent 12579834: Information Extraction Method and Apparatus for Text With Layout (2y 5m to grant; granted Mar 17, 2026)
Patent 12576873: SYSTEM AND METHOD OF CAPTIONS FOR TRIGGERS (2y 5m to grant; granted Mar 17, 2026)
Patent 12573175: TARGET TRACKING METHOD, TARGET TRACKING SYSTEM AND ELECTRONIC DEVICE (2y 5m to grant; granted Mar 10, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 89%
With Interview (+11.5%): 99%
Median Time to Grant: 2y 9m
PTA Risk: Low

Based on 290 resolved cases by this examiner. Grant probability derived from career allow rate.
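The with-interview figure above can be sanity-checked with one line of arithmetic. The multiplicative-lift model and the 99% cap are assumptions about how the tool combines its own numbers, not anything documented on this page.

```python
base_rate = 0.89        # career allow rate shown above
interview_lift = 0.115  # +11.5% interview lift shown above

# Assumption: the lift is applied multiplicatively and the result is
# capped at 99%: 0.89 * 1.115 = 0.99235, which the cap brings to 0.99.
with_interview = min(base_rate * (1 + interview_lift), 0.99)
```

An additive model (89% + 11.5 points) would give roughly 100%, so the capped multiplicative reading is merely one consistent interpretation of the displayed 99%.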
