Application No. 18/539,841 (Non-Final Office Action)

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statements (IDS) submitted on 12/14/2023 and 11/16/2024 have been made of record and considered by the examiner.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-10 are rejected under 35 U.S.C. 103 as being unpatentable over Guler et al. (US 20220358770 A1) in view of Ye et al. (CN 113409340 A; translation provided by Examiner).

Regarding Claims 1 and 9: Guler et al.
teaches: a method (claim 9: system) for providing an AR (Abstract; ¶ [0028] “The methods and systems described herein may be applied to real-time augmented reality applications”) object tracking service (¶ [0038] “a given object 104 can be identified and/or tracked”) based on deep learning (Abstract; ¶ [0038] “The neural network 106 processes the input data”; ¶ [0041] “The neural network 106 is a deep machine-learning model”), by which a tracking application executed by at least one processor of a terminal (¶ [0012] “at least one processor, wherein at least one processor is configured to perform a method”) provides an AR object tracking service based on deep learning (Abstract; ¶ [0038]; ¶ [0041]), the method comprising: obtaining first image data (¶ [0015] “the method comprising: receiving a single two-dimensional image”; ¶ [0041] “The neural network 106 takes as input the 2D image 102 and processes it through a plurality of neural network layers to generate the output data 108.”); inputting the first image data to a first deep learning neural network (¶ [0038] “In some embodiments, only a single input image 102 is used. The neural network 106 processes the input data to identify objects 104 in the image”); obtaining 3D depth data (3D objects may be provided by neural networks in the form of meshes, and mesh coordinate data corresponding to depth may be provided (i.e., depth data) (¶ [0065])) including each descriptor of a target object within the first image data and a distance value corresponding to the descriptor from the first deep learning neural network (¶ [0038] “generate output data 108 comprising an estimated 3D representation 110 of each (potential) identified object 104 in the input 2D image 102.
The output data 108 may, in some embodiments, further comprise estimated coordinates 112 of bounding boxes of the potential objects 104 within the input 2D image 102 and/or probabilities of the presence of objects 104 at locations within the input 2D image 102. The output data 108 is further processed (i.e. post-processed) to generate a three-dimensional reconstruction 116 of the scene in the input 2D image 102. In some embodiments, the output data 108 may further comprise an embedding vector (not shown) associated with each object 104 in the image. The embedding vector provides a REID embedding of the object, so that a given object 104 can be identified and/or tracked across multiple input images 102.”); and performing AR object tracking based on the 3D depth data (¶ [0041] “The neural network may predict three-dimensional landmark positions of multiple objects by concatenating feature data from one or more intermediate layers of the neural network.”), wherein the first deep learning neural network is a deep learning algorithm (¶ [0003] “Human pose estimation algorithms typically aim at localizing certain sparse points in images”; ¶ [0032] “In some embodiments, an efficient encoder-only neural network architecture is used for monocular three-dimensional human pose estimation… deep neural networks used for image classification can be used”) performing monocular depth estimation. Guler et al. fails to explicitly teach obtaining 3D depth data and a deep learning algorithm performing monocular depth estimation. In a related art, Ye et al. teaches: obtaining 3D depth data from a depth estimation network using a deep learning algorithm performing monocular depth estimation (p. 20 ¶ [n0043] “The depth map and two-dimensional segmentation result map of the sample image are determined by the depth estimation network”; p.
21 ¶ [n0044] “Depth estimation, in particular, uses RGB images to estimate the distance of each pixel in the image relative to the shooting source. When taking a picture, a three-dimensional graphic is projected onto a two-dimensional plane to form a two-dimensional image. The purpose of depth estimation is to estimate three-dimensional information from a two-dimensional image, which is the reverse process. Depth estimation (estimating depth from 2D images) is a key step in scene reconstruction and understanding tasks, and is part of 3D reconstruction in the field of computer vision. Monocular estimation based on deep learning relies on the relationship between pixel values to reflect the relationship between depths. The method is to fit a function to map the image into a depth map. Obtaining the specific depth from a single image is equivalent to inferring three-dimensional space from a two-dimensional image. Depth estimation can be divided into monocular depth estimation and multi-view depth estimation.”). It would be understood by a person of ordinary skill in the art that deep learning is performed using an algorithm(s). Ye et al. does teach “computing units running machine learning model algorithms” (p. 61 ¶ [n0124]). Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified the teachings of Guler et al. to incorporate the teachings of Ye et al. (pp. 20-22 ¶¶ [n0043]-[n0044]) and obtain 3D depth data by performing monocular depth estimation with the first deep learning neural network, and to use the 3D depth data for AR object tracking. Doing so would provide the AR object tracking method and system with more data found within an image, thereby improving the consistency, accuracy, and robustness of the method and system (Ye et al., p. 14 ¶ [n0029]). Regarding Claim 2: Guler et al. in view of Ye et al. teaches the limitations of claim 1. Ye et al.
further teaches: inputting the first image data to a second deep learning neural network (input of first image data: p. 7 ¶ [n0013] “The target image to be segmented is input into the target semantic segmentation model trained in the first aspect to obtain the semantic segmentation result of the target image to be segmented.”; the semantic segmentation network being the second deep learning neural network: pp. 2-3 ¶ [n0003] “In the field of computer vision, the main applications of neural networks are currently image recognition, target localization and detection, and semantic segmentation.”; p. 5 ¶ [n0007] “a depth estimation network and a semantic segmentation network are used”; p. 23 ¶ [n0047] “Specifically, the semantic segmentation network can be based on fully convolutional networks (FCNs), SegNet, E-Net, Link-Net, Mask R-CNN, or other networks that can achieve the functions of this application. This application does not impose any restrictions here.”); obtaining object area information representing the area occupied by each object within the first image data from the second deep learning neural network (p. 3 ¶ [n0003] “Image recognition tells you what an image is, object localization and detection tells you where the object is in the image, and semantic segmentation answers the above two questions at the pixel level.”; p. 3 ¶ [n0006] “Segmentation means dividing the image into different objects from the perspective of pixels and labeling each pixel in the original image.”), wherein the second deep learning neural network is a deep learning algorithm performing semantic segmentation (p. 3 ¶ [n0003]; p. 3 ¶ [n0006]; p. 23 ¶ [n0047]). Regarding Claim 3: Guler et al. in view of Ye et al. teaches the limitations of claim 2. Guler et al. further teaches: determining the area of the target object based on the object area information.
Object area information is processed (¶ [0038] “One or more 2D images 102, each comprising a plurality of objects 104 of a given type (e.g. people) is input into a neural network…The neural network 106 processes the input data to identify objects 104 in the image and generate output data…The output data 108 is further processed (i.e. post-processed) to generate a three-dimensional reconstruction 116 of the scene”). The area of the target object(s) is determined based on the object area (¶ [0041] “The neural network may predict three-dimensional landmark positions of multiple objects by concatenating feature data from one or more intermediate layers of the neural network.”). Regarding Claim 4: Guler et al. in view of Ye et al. teaches the limitations of claim 3. Guler et al. further teaches: wherein the determining of the target object area includes: detecting at least one target object candidate area based on the object area information (see the Guler et al. teachings in the claim 3 rejection above; ¶ [0038]; ¶ [0041]) and determining one of the target object candidate areas as a target object area (¶ [0048] “In some embodiments, the predictions of the neural network 106 may be specific to an object type. For example, a neural network 106 may predict 3D positions of facial parts, fingers, and limbs for humans, while a different neural network 106 may be dedicated to wheel, window, door and light positions for cars.”). Regarding Claim 5: Guler et al. in view of Ye et al. teaches the limitations of claim 3. Ye et al. further teaches: obtaining the 3D depth data based on the area (pp. 24-25 ¶¶ [n0049]-[n0050] “Determine the 3D point cloud based on the depth map of the sample image, and perform semantic segmentation on the 3D point cloud to obtain the three-dimensional segmentation result map of the sample image. Point cloud data refers to a set of vectors in a three-dimensional coordinate system.
3D data can typically be represented in different formats, including depth images, point clouds, meshes, and volumetric meshes.”) of the target object. Ye et al. fails to teach obtaining data based on the area of the target object. In a related art, Guler et al. further teaches: obtaining data based on the area of the target object by predicting 3D “landmark positions” (i.e., the target objects’ area) from feature data using a neural network (¶ [0041]) and predicting output data using a neural network (e.g., a fully convolutional neural network) (¶ [0045]). Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified the teachings of Ye et al. to incorporate the teaching of Guler et al. and obtain 3D depth data based on the area of a target object within the area of the entire image map. Doing so would provide the AR object tracking method and system with more information relating to identified pixel value groups (i.e., target objects) within the image, making for a more robust and efficient method and system. Regarding Claim 6: Guler et al. in view of Ye et al. teaches the limitations of claim 2. Ye et al. further teaches: operating the first deep learning neural network and the second deep learning neural network in parallel (Ye et al. teaches that the first deep learning neural network performs monocular depth estimation and the second deep learning neural network performs semantic segmentation (pp. 20-21 ¶ [n0043]; p. 4 ¶ [n0006]); p. 70 ¶ [n0131] “the steps described in this disclosure can be executed in parallel…”). Regarding Claim 7: Guler et al. in view of Ye et al. teaches the limitations of claim 1. Guler et al. further teaches: obtaining the 3D depth data based on a primitive model with a preconfigured 2D or 3D shape (¶ [0108] “A 2D training image 502 is obtained from a training dataset comprising a plurality of 2D images.
The 2D images comprises one or more objects 504a, 504b (in the example shown, two objects 504a, 504b, though in general any number of objects may be present) of a given type (e.g. humans). A 3D deformable model is fitted (e.g. iteratively fitted) to objects in the input image 502 to extract one or more supervision signals 506 for the monocular 3D reconstruction.”). Regarding Claim 8: Guler et al. in view of Ye et al. teaches the limitations of claim 7. Guler et al. further teaches generating 3D integrated depth data combining the 3D depth data obtained based on the primitive model and the 3D depth data obtained based on the first deep learning neural network, and performing AR object tracking based on the 3D integrated depth data. Guler et al. teaches generating 3D integrated depth data by using the depth data and supervision signals found from a primitive model of a training image (¶ [0108]), then inputting the training image into a neural network to output data, including candidate 3D representations, used for comparing supervision signals and updating the neural network parameters based on the integrated 3D depth data (¶ [0109]). Guler et al. also teaches AR object tracking based on image data from an image being input into a neural network (¶ [0038]). Guler et al. fails to disclose generating 3D integrated depth data by combining the 3D depth data obtained based on the primitive model and the 3D depth data obtained based on the first deep learning neural network, which, according to claim 1, is used for monocular depth estimation of the input image. Guler et al. also fails to teach performing AR object tracking based on 3D integrated depth data. In a related art, Ye et al. teaches: generating 3D integrated depth data by using an initial model (a semantic segmentation model that provides shapes and labels of objects (p. 4 ¶ [n0006])), then using the initial model data to determine the depth map (3D data (p.
24 ¶ [n0049])) by using a depth estimation neural network (the first deep learning neural network) and a two-dimensional segmentation result map by using a semantic segmentation neural network (p. 5 ¶ [n0007]). Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified the teachings of Guler et al. to incorporate Ye et al.’s method of obtaining 3D depth data from the first neural network. Generating 3D integrated depth data by combining the 3D depth data obtained based on the primitive model, as taught by Guler et al., with the 3D depth data obtained based on the first deep learning neural network, as taught by Ye et al., would allow the AR object tracking to be performed based on the 3D integrated depth data being claimed. Doing so would provide the AR object tracker with an alternative method of extracting data from an image, thereby increasing the capabilities and robustness of the system. Regarding Claim 10: Guler et al. in view of Ye et al. teaches the limitations of claim 9. Guler et al. further teaches wherein the commands of the tracking application further comprise commands for performing: generating a 3D definition model based on the 3D depth data (¶ [0018] “A computer readable medium that stores a set of instructions that is executable by at least one processor of a computation unit for creating a three-dimensional reconstruction of a scene with multiple objects from a single two-dimensional image, to cause computation unit to perform a method according to the additional aspect.”), determining a target virtual object to be augmented and displayed based on the target object, generating an AR environment model by anchoring the target virtual object and the 3D definition model, and performing the AR object tracking based on the AR environment model (¶ [0028] “The methods and systems described herein may be applied to real-time augmented reality applications that involve, for example, the whole human body, allowing users to control objects positioned on them, e.g. graphics assets attached to their hands, while also allowing objects to interact with humans, e.g. balls bouncing back from humans once they contact their body.”). Guler et al. also teaches that all of the commands may be implemented using computer-related embodiments known to a person of ordinary skill in the art (e.g., a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments) (¶ [0121]) and teaches that adaptations and modifications of the described embodiments can be made (¶ [0122]). Guler et al. fails to teach displaying the targeted object. In a related art, Ye et al. discloses a display device for user interaction and operating the system and techniques disclosed, including tracking target objects in an image (p. 66 ¶ [n0128] “To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer.”; object tracking: ¶ [n0003]). Therefore, it would have been obvious to a person of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified the teachings of Guler et al.
to incorporate the teachings of Ye et al. and provide the user with a display for interacting with the AR tracking application. Doing so would increase the users’ interaction speed and accuracy, thereby increasing the usability of, and the knowledge gained from, the AR object tracking system.

Relevant art not relied upon

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Luo et al. (US 20210398353 A1) teaches a method for displaying AR information comprising depth estimation and semantic segmentation obtained by a deep learning network.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAMUEL DAVID BAYNES whose telephone number is (571) 272-0607. The examiner can normally be reached Monday - Friday, 8:00 am - 5:00 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen R. Koziol, can be reached at (408) 918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/S.D.B./
Samuel D. Baynes
Examiner, Art Unit 2665

/BOBBAK SAFAIPOUR/
Primary Examiner, Art Unit 2665

Application/Control Number: 18/539,841
Art Unit: 2665