Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 07/02/2025 is being considered by the examiner.
Claim Objections
Claim 13 is objected to because of the following informalities: the term “SE(2)” should be defined in the claim. It is suggested that the final limitation be amended to recite “wherein the temporal filter operates in the special Euclidean group SE(2),” or otherwise be revised in a manner consistent with paragraph [0046] of the Specification. Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-4, 7, 9, 11-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Claim 1
Step 1 – YES
Claim 1 recites a process, and thus falls within one of the statutory categories.
Step 2A, Prong One – YES
Claim 1 recites an abstract idea. Claim 1 recites “inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan; computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays.” The “trained neural network” does not provide detail on the neural network and can merely be a mathematical model to compute relationships and calculations. That is, the model is recited at a high level of generality. Additionally, the step of “computing a two dimensional (2D) position and orientation of the camera” is also a mathematical process and, at a high level, a mental process.
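For illustration of this characterization only, the recited prediction-and-matching can be expressed entirely as arithmetic over an array of ray distances. The following sketch is hypothetical (a toy unit-square floorplan and brute-force search; it does not represent Applicant's disclosed implementation):

```python
import numpy as np

def raycast(pose, n_rays=32):
    """Distances from `pose` to the walls of a toy unit-square room,
    along n_rays evenly spaced directions (stand-in for a floorplan)."""
    x, y, theta = pose
    dists = []
    for a in theta + np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False):
        dx, dy = np.cos(a), np.sin(a)
        ts = []
        if dx > 1e-12: ts.append((1.0 - x) / dx)   # right wall
        if dx < -1e-12: ts.append(-x / dx)          # left wall
        if dy > 1e-12: ts.append((1.0 - y) / dy)   # top wall
        if dy < -1e-12: ts.append(-y / dy)          # bottom wall
        dists.append(min(ts))  # nearest wall intersection along the ray
    return np.array(dists)

def localize(predicted_rays, grid=15, n_thetas=12):
    """Brute-force search for the 2D pose whose floorplan rays best
    match the predicted array of rays (a pure arithmetic comparison)."""
    best, best_err = None, np.inf
    for x in np.linspace(0.05, 0.95, grid):
        for y in np.linspace(0.05, 0.95, grid):
            for theta in np.linspace(0.0, 2.0 * np.pi, n_thetas, endpoint=False):
                err = np.linalg.norm(raycast((x, y, theta)) - predicted_rays)
                if err < best_err:
                    best, best_err = (x, y, theta), err
    return best
```

As the sketch shows, both the prediction target (an array of distances) and the pose computation (minimizing an arithmetic error) can be carried out as mathematical calculations.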
Step 2A, Prong Two – NO
Claim 1 does not recite additional elements that integrate the judicial exception into a practical application. Claim 1 recites “accessing a floorplan of an environment” and “receiving an image of the environment, the image having a viewpoint of a camera in the environment.” The step of accessing a floorplan and receiving an image are both mere data gathering and amount to extrasolution activity. The floorplan and “image having a viewpoint of a camera in the environment” merely generally link the abstract idea to a particular technological environment and do not recite a practical application of the computed 2D position and orientation of the camera.
Step 2B – NO
Claim 1 does not recite additional elements that amount to significantly more than the judicial exception. Claim 1 recites “accessing a floorplan of an environment” and “receiving an image of the environment, the image having a viewpoint of a camera in the environment.” The step of accessing a floorplan and receiving an image are both mere data gathering and amount to extrasolution activity. Thus, the elements do not amount to significantly more than the abstract idea.
Thus, Claim 1 is not eligible subject matter.
Claim 2 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 2 recites “wherein the array of rays is independent of parameters intrinsic to the camera,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations.
Claim 3 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 3 recites “wherein computing the position and orientation of the camera with respect to the floorplan comprises searching for a pose in the floorplan that has the most similar rays as the predicted array of rays,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. Searching for the pose having the “most similar rays” may be performed by judgment and observation in the mind of one of ordinary skill in the art.
Claim 4 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 4 recites “aligning the image with a gravity direction, the gravity direction being received from a sensor associated with the camera,” but the step of aligning an image and receiving a gravity direction is mere data gathering and amounts to extrasolution activity that does not integrate the abstract idea into a practical application.
Claim 7 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 7 recites “wherein the trained neural network comprises a first neural network which computes the predicted array of rays from only one image,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. That is, “a first neural network” is also recited at a high level of generality and may also be merely a mathematical model.
Claim 9 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 9 recites “wherein the trained neural network comprises a second neural network which computes the predicted array of rays from a sequence of images captured by the camera moving in the environment as well as, for each image in the sequence, a known relative pose of the camera,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. That is, “a second neural network” is also recited at a high level of generality and may also be merely a mathematical model.
Claim 11 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 11 recites “wherein the trained neural network comprises a first neural network and a second neural network and a selector neural network, the selector neural network arranged to predict weights used to compute a weighted combination of the predictions of the first neural network and the second neural network,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. That is, “a first neural network,” “a second neural network,” and “a selector network” are also recited at a high level of generality and may also be merely mathematical models. No detail is given for the selector network and the weights may be determined by mathematical calculation or judgment by one of ordinary skill in the art.
Claim 12 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 12 recites “wherein computing the position and orientation of the camera with respect to the floorplan is done using the array of rays and a prior belief of the position and orientation of the camera with respect to the floorplan,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. The “prior belief” does not integrate the abstract idea into a practical application or amount to significantly more than the judicial exception.
Claim 13 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 13 recites “computing the prior belief using a temporal filter, using data from a previous time step, wherein the temporal filter operates in SE(2),” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. A temporal filter operating in SE(2) may be a mere mathematical principle and thus is also an abstract idea.
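For context on the characterization of SE(2) as a mathematical concept: an SE(2) element is simply a 2D rotation together with a translation, and a temporal filter's motion update reduces to arithmetic on (x, y, θ). A minimal sketch (names are hypothetical and illustrative only):

```python
import math

def se2_compose(a, b):
    """Compose two SE(2) poses (x, y, theta): apply pose b in the frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + bx * math.cos(at) - by * math.sin(at),
            ay + bx * math.sin(at) + by * math.cos(at),
            (at + bt) % (2 * math.pi))

def predict_prior(previous_pose, odometry_delta):
    """Temporal-filter motion update: propagate the belief from the
    previous time step by the relative motion measured since then."""
    return se2_compose(previous_pose, odometry_delta)
```

The update is nothing more than trigonometric arithmetic on three numbers, consistent with treating SE(2) operation as a mathematical principle.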
Claim 14 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 14 recites “computing the prior belief using a temporal filter, using data from a previous time step, wherein the temporal filter applies different 2D translation filters for different orientations,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. 2D translation filters may be mere mathematical principles and thus are also abstract ideas.
Claim 15 recites the additional elements “An apparatus comprising: a processor; a memory storing instructions that, when executed by the processor, perform a method.” However, these are merely generic computer components and do not integrate the abstract ideas into a practical application or amount to significantly more.
Claim 16 recites the additional elements “a camera phone or a head mounted display device.” However, these elements merely generally link the abstract idea to a particular technological environment and do not recite a practical application of the computed 2D position and orientation of the camera.
Claim 17 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 17 recites “computing the position and orientation of the camera with respect to the floorplan from the array of rays comprises using a prior belief of the position and orientation of the camera with respect to the floorplan,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. The “prior belief” does not integrate the abstract idea into a practical application or amount to significantly more than the judicial exception.
Claim 18 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 18 recites “wherein the prior belief is obtained from a temporal filter,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. A temporal filter may be a mere mathematical principle and thus is also an abstract idea.
Claim 19 recites the additional elements “A mobile or portable, or wearable computing device comprising a processor; a memory storing instructions that, when executed by the processor, perform a method.” However, these are merely generic computer components and do not integrate the abstract ideas into a practical application or amount to significantly more. These elements merely generally link the abstract idea to a particular technological environment and do not recite a practical application of the computed 2D position and orientation of the camera.
Claim 20 does not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the judicial exception. Claim 20 recites “The device of claim 19 operable in a previously unvisited environment at 30 frames per second or higher,” but the claimed neural network predicting and position and orientation computation may still merely be a mathematical model to compute relationships and calculations. The additional limitation amounts to mere data gathering and is extrasolution activity and thus does not integrate the abstract idea into a practical application or amount to significantly more than the judicial exception.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 6 recites the limitation "the images" in line 1. There is insufficient antecedent basis for this limitation in the claim. Claim 6 depends indirectly from Claim 1, which recites “an image of the environment,” and depends directly from Claim 5, which recites “the trained neural network has been trained using other floorplans and wherein the training omits information about the floorplan and omits images captured in the environment.” However, it is not clear what “the images” of Claim 6 refer to. That is, it is not clear if the images of Claim 6 are “other floorplan” images of Claim 5, an image of the environment of Claim 1, or some other image.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 4-7, 9, 12, 15-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Howard-Jenkins et al. (LaLaLoc++: Global Floor Plan Comprehension for Layout Localisation in Unvisited Environments, published 2022), in view of Sarlin et al. (OrienterNet: Visual Localization in 2D Public Maps with Neural Matching, published 2023).
Regarding Claim 1, Howard-Jenkins teaches “A method comprising:
accessing a floorplan of an environment” (Howard-Jenkins, Section 3, paragraph 1 discloses “The floor plan comprehension module, φplan, infers the layout structure that is visible across a 2D floor plan”; see Fig. 1);
“receiving an image of the environment, the image having a viewpoint of a camera in the environment” (Howard-Jenkins, Section 3, paragraph 1 discloses “Therefore, LaLaLoc++ performs localisation of a query panorama by finding the location in the floor plan that most accurately matches the visible room layout”; where a query panorama is an image of the environment; see Fig. 1);
“computing a two dimensional (2D) position and orientation of the camera” (Howard-Jenkins, Section 3.1, paragraph 1 discloses “Specifically, we predict the 2 DoF camera pose to an x – y location in the 2D floor plan assuming a known orientation, an assumption we explore in Sect. 5.6”).
Fig. 1 of Howard-Jenkins
Although Howard-Jenkins discloses determining a reference grid (Howard-Jenkins, Section 3.4 discloses “At inference time, a dense reference grid of latent layout vectors are computed by passing the 2D floor plan”), Howard-Jenkins does not explicitly teach “inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan; computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays” (emphasis added).
However, in an analogous field of endeavor, Sarlin teaches “inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan” (Sarlin, Section 3.1, paragraph 3 recites “A CNN Φimage first extracts a U×V feature map X ∈ RU×V×N from the image. We consider D depth planes sampled in front of the camera with a regular interval ∆, i.e. with values {i · ∆|i ∈ {1...D}}. Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space. We thus map each column to a ray in the U×D polar representation ¯X ∈ RU×D×N.” Sarlin, Section 3, Overview recites “OrienterNet consists of three modules: 1) The image-CNN extracts semantic features from the image and lifts them to an orthographic Bird’s-Eye View (BEV) representation T by inferring the 3D structure of the scene. 2) The OSM map is encoded by the map-CNN into a neural map F that embeds semantic and geometric information. 3) We estimate a probability distribution over camera poses ξ by exhaustively matching the BEV against the map.”; where OrienterNet is a trained neural network; where rays in polar representation ¯X is an array of rays; where BEV is a floorplan);
“computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays” (Sarlin, Section 3.3, paragraph 3 recites “Exhaustively matching the neural map F and the BEV T yields a score volume M. Each element is computed by correlating F with T transformed by the corresponding pose as
[Equation image reproduced from Sarlin, Section 3.3]
where ξ(p) transforms a 2D point p from BEV to map coordinate frame”; where a corresponding pose is a 2D position and orientation; where matching a neural map to BEV T is computing a pose with respect to the floorplan from the array of rays).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Howard-Jenkins to incorporate the teachings of Sarlin by using a CNN to map feature maps to a ray polar representation. The prior art of Howard-Jenkins contained a ‘base’ method upon which the claimed invention can be seen as an ‘improvement.’ Howard-Jenkins teaches a method of performing floor plan localization to determine a 2D pose. The prior art contained a known technique that is applicable to the base method. The prior art Sarlin teaches using a trained CNN on a bird's-eye view to determine pose. One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. That is, one of ordinary skill in the art would have recognized that the known technique of using a trained CNN on a bird's-eye view to determine pose would be applicable to the 2D pose determination of Howard-Jenkins and would have resulted in a system that is well-localized (improved). One of ordinary skill in the art would have been motivated to combine the Howard-Jenkins and Sarlin references in order to allow visual features to be identified when scale is ambiguous: Sarlin, Section 3.1, paragraph 5 discloses “When the scale is ambiguous and difficult to infer, visual features are spread over multiple depths along the ray but still provide geometric constraint for well-localized map points.” Accordingly, the combination of Howard-Jenkins and Sarlin discloses the invention of Claim 1.
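For further context on the mapped “exhaustive matching” step: scoring every candidate pose by correlating a transformed template against a map is a conventional mathematical search. The sketch below is a simplified illustration only (right-angle rotations, dot-product scoring) and is not Sarlin's actual implementation:

```python
import numpy as np

def match_poses(neural_map, bev):
    """Score every (row, col, rotation) candidate pose by correlating the
    rotated BEV template against the map; return the best-scoring pose."""
    H, W = neural_map.shape
    best, best_score = None, -np.inf
    for k in range(4):  # right-angle rotations only, for illustration
        templ = np.rot90(bev, k)
        th, tw = templ.shape
        for r in range(H - th + 1):
            for c in range(W - tw + 1):
                # correlation score: elementwise product summed over the window
                score = float(np.sum(neural_map[r:r + th, c:c + tw] * templ))
                if score > best_score:
                    best, best_score = (r, c, 90.0 * k), score
    return best
```

The search exhaustively evaluates translations and rotations and keeps the pose with the highest correlation, analogous in spirit to matching the BEV T against the neural map F over a score volume M.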
Regarding Claim 2, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 1 wherein the array of rays is independent of parameters intrinsic to the camera” (Sarlin, Section 3.1, paragraph 4 discloses “Instead of directly regressing the distribution α over depths, we regress a distribution S over scales that are independent from the camera calibration parameters”; where camera calibration parameters are parameters intrinsic to the camera). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 2 and are incorporated herein by reference. Thus, the method recited in Claim 2 is met by Howard-Jenkins and Sarlin.
Regarding Claim 4, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 1 comprising, prior to inputting the image to the trained neural network, aligning the image with a gravity direction, the gravity direction being received from a sensor associated with the camera” (Sarlin, Section 3.1, paragraph 3 recites “Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space.”). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 4 and are incorporated herein by reference. Thus, the method recited in Claim 4 is met by Howard-Jenkins and Sarlin.
Regarding Claim 5, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 1 wherein the trained neural network has been trained using other floorplans and wherein the training omits information about the floorplan and omits images captured in the environment” (Sarlin, Section 5, paragraph 2 discloses “The resulting Mapillary Geo-Localization (MGL) dataset includes 760k images from 12 cities in Europe and the US, captured by cameras that are handheld or mounted on cars or bikes, with GT poses and OSM data. Models trained on MGL generalize well to other datasets thanks to the diversity of cameras, locations, motions, and maps”). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 5 and are incorporated herein by reference. Thus, the method recited in Claim 5 is met by Howard-Jenkins and Sarlin.
Regarding Claim 6, as best understood in light of the 112(b) rejection above, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 5 wherein the images comprise at least one synthetic image computed by updating one of the images according to an adjusted roll or an adjusted pitch of the camera” (Sarlin, Section 3, paragraph 3 discloses “The image is rectified via a homography computed from the known gravity such that its roll and tilt are zero – its principal axis is then horizontal.”). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 6 and are incorporated herein by reference. Thus, the method recited in Claim 6 is met by Howard-Jenkins and Sarlin.
Regarding Claim 7, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 1 wherein the trained neural network comprises a first neural network which computes the predicted array of rays from only one image” (Sarlin, Section 3.1, paragraph 3 recites “A CNN Φimage first extracts a U×V feature map X ∈ RU×V×N from the image.” and “Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space. We thus map each column to a ray in the U×D polar representation ¯X ∈ RU×D×N”; where “the image” is only one image; where Φimage is a first neural network). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 7 and are incorporated herein by reference. Thus, the method recited in Claim 7 is met by Howard-Jenkins and Sarlin.
Regarding Claim 9, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 1 wherein the trained neural network comprises a second neural network which computes the predicted array of rays from a sequence of images captured by the camera moving in the environment as well as, for each image in the sequence, a known relative pose of the camera” (Sarlin, Section 6.3, paragraph 4 discloses “We now fuse multiple GPS signals or predictions of OrienterNet over the same temporal interval of 10 consecutive keyframes, using imperfect relative poses from VI SLAM”; see also Fig. 12; Sarlin, Section 3.1, paragraph 3 recites “A CNN Φimage first extracts a U×V feature map X ∈ RU×V×N from the image.” and “Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space. We thus map each column to a ray in the U×D polar representation ¯X ∈ RU×D×N”; where Φimage is a second neural network; where consecutive keyframes and Fig. 12 indicate a moving camera in the environment; where Sarlin teaches predicting location by OrienterNet for consecutive images, and thus teaches a first and second neural network which compute an array of rays from a sequence of images; as claimed, the first and second neural network are not required to be different networks). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 9 and are incorporated herein by reference. Thus, the method recited in Claim 9 is met by Howard-Jenkins and Sarlin.
Regarding Claim 12, the combination of Howard-Jenkins and Sarlin teaches “The method of claim 1 wherein computing the position and orientation of the camera with respect to the floorplan is done using the array of rays and a prior belief of the position and orientation of the camera with respect to the floorplan” (Sarlin, Section 3, paragraph 3 recites “We are also given a coarse location prior ξprior. This can be a noisy GPS position or a previous localization estimate and can be off by over 20 meters”). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 1, apply to Claim 12 and are incorporated herein by reference. Thus, the method recited in Claim 12 is met by Howard-Jenkins and Sarlin.
Figure 12 of Sarlin
Regarding Claim 15, Howard-Jenkins teaches “An apparatus comprising:
accessing a floorplan of an environment” (Howard-Jenkins, Section 3, paragraph 1 discloses “The floor plan comprehension module, φplan, infers the layout structure that is visible across a 2D floor plan”; see Fig. 1);
“receiving an image of the environment, the image depicting a viewpoint of a camera in the environment” (Howard-Jenkins, Section 3, paragraph 1 discloses “Therefore, LaLaLoc++ performs localisation of a query panorama by finding the location in the floor plan that most accurately matches the visible room layout”; where a query panorama is an image of the environment; see Fig. 1);
“computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan” (Howard-Jenkins, Section 3.1, paragraph 1 discloses “Specifically, we predict the 2 DoF camera pose to an x – y location in the 2D floor plan assuming a known orientation, an assumption we explore in Sect. 5.6”).
Howard-Jenkins does not explicitly teach “a processor; a memory storing instructions that, when executed by the processor, perform a method” and “inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan, wherein the trained neural network has been trained omitting information about the environment; computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays” (emphasis added).
However, in an analogous field of endeavor, Sarlin teaches “a processor; a memory storing instructions that, when executed by the processor, perform a method” (Sarlin, Appendix C, paragraph 2 discloses “We train with a batch size of 9 over 3 V100 GPUs with 16GB VRAM each”)
“inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan” (Sarlin, Section 3.1, paragraph 3 recites “A CNN Φimage first extracts a U×V feature map X ∈ RU×V×N from the image. We consider D depth planes sampled in front of the camera with a regular interval ∆, i.e. with values {i · ∆|i ∈ {1...D}}. Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space. We thus map each column to a ray in the U×D polar representation ¯X ∈ RU×D×N .” Sarlin, Section 3.1, paragraph 6 recites “We map the polar features to a Cartesian grid of size L×D via linear interpolation along the lateral direction from U polar rays to L columns spaced by the same interval ∆. The resulting feature grid is then processed by a small CNN ΦBEV that outputs the neural BEV T and confidence C.”; where a CNN Φimage is a trained neural network; where rays in polar representation ¯X is an array of rays; where BEV is birds eye view; where BEV is a floorplan), “wherein the trained neural network has been trained omitting information about the environment” (Sarlin, Section 5, paragraph 2 discloses “We train a single model that generalizes to unseen locations with arbitrary kinds of images. We collect images from the Mapillary platform, which exposes the camera calibration, noisy GPS measurement, and the 6-DoF pose in a global reference frame, obtained with a fusion of SfM and GPS”; where noisy measurement is omitting information);
“computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays” (Sarlin, Section 3.3, paragraph 3 recites “Exhaustively matching the neural map F and the BEV T yields a score volume M. Each element is computed by correlating F with T transformed by the corresponding pose as
[Equation image reproduced from Sarlin, Section 3.3]
where ξ(p) transforms a 2D point p from BEV to map coordinate frame”; where a corresponding pose is a 2D position and orientation; where matching a neural map to BEV T is computing a pose with respect to the floorplan from the array of rays).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Howard-Jenkins to incorporate the teachings of Sarlin by using a CNN to map feature maps to a ray polar representation. The prior art of Howard-Jenkins contained a ‘base’ method upon which the claimed invention can be seen as an ‘improvement.’ Howard-Jenkins teaches a method of performing floor plan localization to determine a 2D pose. The prior art contained a known technique that is applicable to the base method. The prior art Sarlin teaches using a trained CNN on a bird's-eye view to determine pose. One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. That is, one of ordinary skill in the art would have recognized that the known technique of using a trained CNN on a bird's-eye view to determine pose would be applicable to the 2D pose determination of Howard-Jenkins and would have resulted in a system that is well-localized (improved). One of ordinary skill in the art would have been motivated to combine the Howard-Jenkins and Sarlin references in order to allow visual features to be identified when scale is ambiguous: Sarlin, Section 3.1, paragraph 5 discloses “When the scale is ambiguous and difficult to infer, visual features are spread over multiple depths along the ray but still provide geometric constraint for well-localized map points.” Accordingly, the combination of Howard-Jenkins and Sarlin discloses the invention of Claim 15.
Regarding Claim 16, the combination of Howard-Jenkins and Sarlin teaches “The apparatus of claim 15 being a camera phone or a head mounted display device” (Sarlin, Table 4 caption discloses “Localization of head-mounted devices for AR. With data from Aria glasses, OrienterNet outperforms the map retrieval baseline and the embedded GPS sensor in both single- and multi frame settings, in both cities”). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 15, apply to Claim 16 and are incorporated herein by reference. Thus, the apparatus recited in Claim 16 is met by Howard-Jenkins and Sarlin.
Regarding Claim 17, the combination of Howard-Jenkins and Sarlin discloses “The apparatus of claim 15 wherein computing the position and orientation of the camera with respect to the floorplan from the array of rays comprises using a prior belief of the position and orientation of the camera with respect to the floorplan” (Sarlin, Section 3, paragraph 3 recites “We are also given a coarse location prior ξprior. This can be a noisy GPS position or a previous localization estimate and can be off by over 20 meters”). The proposed combination as well as the motivation for combining the Howard-Jenkins and Sarlin references presented in the rejection of Claim 15, apply to Claim 17 and are incorporated herein by reference. Thus, the apparatus recited in Claim 17 is met by Howard-Jenkins and Sarlin.
Regarding Claim 19, Howard-Jenkins teaches
“accessing a floorplan of an environment” (Howard-Jenkins, Section 3, paragraph 1 discloses “The floor plan comprehension module, Φplan, infers the layout structure that is visible across a 2D floor plan”; see Fig. 1);
“receiving an image of the environment captured by a camera in the environment” (Howard-Jenkins, Section 3, paragraph 1 discloses “Therefore, LaLaLoc++ performs localisation of a query panorama by finding the location in the floor plan that most accurately matches the visible room layout”; where a query panorama is an image of the environment; see Fig. 1);
“computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan” (Howard-Jenkins, Section 3.1, paragraph 1 discloses “Specifically, we predict the 2 DoF camera pose to an x – y location in the 2D floor plan assuming a known orientation, an assumption we explore in Sect. 5.6.” Howard-Jenkins, Section 3.4, paragraph 3 discloses “The refined pose is iteratively updated to minimise the matching cost”).
Howard-Jenkins does not explicitly teach “A mobile or portable, or wearable computing device comprising a processor; a memory storing instructions that, when executed by the processor, perform a method” and “inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan; computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays” (emphasis added).
However, in an analogous field of endeavor, Sarlin teaches “A mobile or portable, or wearable computing device comprising a processor; a memory storing instructions that, when executed by the processor, perform a method” (Sarlin, Table 4 caption discloses “Localization of head-mounted devices for AR. With data from Aria glasses, OrienterNet outperforms the map retrieval baseline and the embedded GPS sensor in both single- and multi frame settings, in both cities.” Sarlin, Appendix C, paragraph 2 discloses “We train with a batch size of 9 over 3 V100 GPUs with 16GB VRAM each”);
“inputting the image to a trained neural network to predict an array of rays from the camera to surfaces in the environment indicated in the floorplan” (Sarlin, Section 3.1, paragraph 3 recites “A CNN Φimage first extracts a U×V feature map X ∈ RU×V×N from the image. We consider D depth planes sampled in front of the camera with a regular interval ∆, i.e. with values {i · ∆|i ∈ {1...D}}. Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space. We thus map each column to a ray in the U×D polar representation ¯X ∈ RU×D×N .” Sarlin, Section 3.1, paragraph 6 recites “We map the polar features to a Cartesian grid of size L×D via linear interpolation along the lateral direction from U polar rays to L columns spaced by the same interval ∆. The resulting feature grid is then processed by a small CNN ΦBEV that outputs the neural BEV T and confidence C.”; where a CNN Φimage is a trained neural network; where rays in polar representation ¯X is an array of rays; where BEV is birds eye view; where BEV is a floorplan), “wherein the trained neural network has been trained omitting information about the environment” (Sarlin, Section 5, paragraph 2 discloses “We train a single model that generalizes to unseen locations with arbitrary kinds of images. We collect images from the Mapillary platform, which exposes the camera calibration, noisy GPS measurement, and the 6-DoF pose in a global reference frame, obtained with a fusion of SfM and GPS”; where noisy measurement is omitting information);
“computing a two dimensional (2D) position and orientation of the camera with respect to the floorplan from the array of rays” (Sarlin, Section 3.3, paragraph 3 recites “Exhaustively matching the neural map F and the BEV T yields a score volume M. Each element is computed by correlating F with T transformed by the corresponding pose as
[equation reproduced as image: media_image2.png]
where ξ(p) transforms a 2D point p from BEV to map coordinate frame”; where a corresponding pose is a 2D position and orientation; where matching a neural map to BEV T is computing a pose with respect to the floorplan from the array of rays).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Howard-Jenkins to incorporate the teachings of Sarlin by using a CNN to map feature maps to a ray polar representation. The prior art of Howard-Jenkins contained a ‘base’ method upon which the claimed invention can be seen as an ‘improvement.’ Howard-Jenkins teaches a method of performing floor plan localization to determine a 2D pose. The prior art contained a known technique that is applicable to the base method. The prior art Sarlin teaches using a trained CNN on a bird’s-eye view to determine pose. That is, one of ordinary skill in the art would have recognized that the known technique of using a trained CNN on a bird’s-eye view to determine pose would be applicable to the 2D pose determination of Howard-Jenkins and would have resulted in a system that is well-localized (improved). One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. One of ordinary skill in the art would be motivated to combine the Howard-Jenkins and Sarlin references in order to allow visual features to be identified when scale is ambiguous: Sarlin, Section 3.1, paragraph 5 discloses “When the scale is ambiguous and difficult to infer, visual features are spread over multiple depths along the ray but still provide geometric constraint for well-localized map points.” Accordingly, the combination of Howard-Jenkins and Sarlin discloses the invention of Claim 19.
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Howard-Jenkins et al. (LaLaLoc++: Global Floor Plan Comprehension for Layout Localisation in Unvisited Environments, published 2022), in view of Sarlin et al. (OrienterNet: Visual Localization in 2D Public Maps with Neural Matching, published 2023), further in view of Narayana et al. (US 20240312136 A1).
Regarding Claim 11, the combination of Howard-Jenkins and Sarlin does not explicitly teach the method of Claim 11.
However, in an analogous field of endeavor, Narayana teaches “The method of claim 1 wherein the trained neural network comprises a first neural network and a second neural network and a selector neural network, the selector neural network arranged to predict weights used to compute a weighted combination of the predictions of the first neural network and the second neural network” (Narayana, [0088] discloses “The steps may include one or more of the following: performing an alignment using corner inliers, in which a distance threshold is used to filter all the matching pairs from the previous corner matching operations within a certain re-projection image distance (with the resulting corner pairs called corner inliers), and weighted least squares is used to find the best camera position xyz, with confidence scores from the predicted corners of the target image's estimated room shape (e.g., as generated by a neural network model) used as weights in the weighted least square regression to generate a more accurate camera position than the previous camera pose”; where neural network model that generates confidence scores is a selector neural network).
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to have modified the combination of Howard-Jenkins and Sarlin to incorporate the teachings of Narayana by combining previous and current camera positions using weights. Narayana is directed to determining a “best camera position.” Sarlin is also directed to visual localization using sequential camera images. Thus, it would have been obvious to one of ordinary skill in the art to apply the teachings of Narayana to the method of Sarlin such that determined camera positions are weighted using a neural network model. Additionally, one of ordinary skill in the art would be motivated to combine the Howard-Jenkins, Sarlin, and Narayana references in order to optimize camera pose accuracy: Narayana, [0088] discloses “and performing a differentiable rendering optimization method using image normal predictions, where camera pose is optimized for a lower cost function value.” Accordingly, the combination of Howard-Jenkins, Sarlin, and Narayana discloses the invention of Claim 11.
Claims 13, 14, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Howard-Jenkins et al. (LaLaLoc++: Global Floor Plan Comprehension for Layout Localisation in Unvisited Environments, published 2022), in view of Sarlin et al. (OrienterNet: Visual Localization in 2D Public Maps with Neural Matching, published 2023), further in view of Min et al. (LASER: LAtent SpacE Rendering for 2D Visual Localization, published 2023).
Regarding Claim 13, the combination of Howard-Jenkins and Sarlin does not explicitly teach the method of Claim 13.
However, in an analogous field of endeavor, Min teaches “The method of claim 12 comprising computing the prior belief using a temporal filter, using data from a previous time step, wherein the temporal filter operates in SE(2)” (Min, Section 3.1, paragraph 6 recites “As MCL needs a large number of samples to approximate the camera pose posterior in SE(2), we systematically reduce the rotation dimension from the MCL sampling step”; where MCL is Monte Carlo Localization; where MCL is a temporal filter).
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to have modified the combination of Howard-Jenkins and Sarlin to incorporate the teachings of Min by approximating the camera pose posterior in SE(2). One of ordinary skill in the art would be motivated to combine the Howard-Jenkins, Sarlin, and Min references in order to improve accuracy and speed of 2D visual localization: Min, Section 1, paragraph 3 discloses “Extensive experiments on Structured3D [38] and ZInD datasets [5] show that our proposed framework significantly outperforms state-of-the-art frameworks both in accuracy and speed.” Accordingly, the combination of Howard-Jenkins, Sarlin, and Min discloses the invention of Claim 13.
Regarding Claim 14, the combination of Howard-Jenkins, Sarlin, and Min teaches “The method of claim 12 comprising computing the prior belief using a temporal filter, using data from a previous time step, wherein the temporal filter applies different 2D translation filters for different orientations” (Min, Section 3.1, paragraph 6 recites “For solving Eq.6, we rotate Ft with uniformly sampled θt in [0, 2π), and keep the best. This discretized search initializes rotation to a rough value, which will later be refined as in §3.4”; where rotating and refining rotation is applying 2D translation filters). The proposed combination as well as the motivation for combining the Howard-Jenkins, Sarlin, and Min references presented in the rejection of Claim 13, apply to Claim 14 and are incorporated herein by reference. Thus, the method recited in Claim 14 is met by Howard-Jenkins, Sarlin, and Min.
Regarding Claim 18, the combination of Howard-Jenkins and Sarlin does not explicitly teach the apparatus of Claim 18.
However, in an analogous field of endeavor, Min teaches “The apparatus of claim 17 wherein the prior belief is obtained from a temporal filter” (Min, Section 3.1, paragraph 6 recites “As MCL needs a large number of samples to approximate the camera pose posterior in SE(2), we systematically reduce the rotation dimension from the MCL sampling step”; where MCL is Monte Carlo Localization; where MCL is a temporal filter). The proposed combination as well as the motivation for combining the Howard-Jenkins, Sarlin, and Min references presented in the rejection of Claim 13, apply to Claim 18 and are incorporated herein by reference. Thus, the apparatus recited in Claim 18 is met by Howard-Jenkins, Sarlin, and Min.
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to have modified the combination of Howard-Jenkins and Sarlin to incorporate the teachings of Min by applying an MCL sampling step in SE(2). One of ordinary skill in the art would be motivated to combine the Howard-Jenkins, Sarlin, and Min references in order to improve accuracy and speed of 2D visual localization: Min, Section 1, paragraph 3 discloses “Extensive experiments on Structured3D [38] and ZInD datasets [5] show that our proposed framework significantly outperforms state-of-the-art frameworks both in accuracy and speed.” Accordingly, the combination of Howard-Jenkins, Sarlin, and Min discloses the invention of Claim 18.
Regarding Claim 20, the combination of Howard-Jenkins and Sarlin does not explicitly teach “The device of claim 19 operable in a previously unvisited environment at 30 frames per second or higher.”
However, in an analogous field of endeavor, Min teaches “The device of claim 19 operable in a previously unvisited environment at 30 frames per second or higher.” (Min, Section 1, paragraph 3 discloses “This representation, namely circular feature, along with the view-dependent feature encoding from the rendering codebook, provides fine-grain structured descriptors for geometry and semantics at a high sampling FPS of 10KHz”; where 10KHz is higher than 30 frames per second).
It would have been obvious to one of ordinary skill in the art before the effective filing
date of the claimed invention to have modified the combination of Howard-Jenkins and Sarlin to incorporate the teachings of Min by sampling images at a particular frame rate. One of ordinary skill in the art would be motivated to combine the Howard-Jenkins, Sarlin, and Min references in order to improve accuracy and speed of 2D visual localization: Min, Section 1, paragraph 3 discloses “Extensive experiments on Structured3D [38] and ZInD datasets [5] show that our proposed framework significantly outperforms state-of-the-art frameworks both in accuracy and speed.” Accordingly, the combination of Howard-Jenkins, Sarlin, and Min discloses the invention of Claim 20.
Allowable Subject Matter
Claim 3 has been rejected above under 35 U.S.C. 101, is not rejected over prior art references, and is objected to as being dependent upon a rejected base claim, but would be allowable if: (a) rewritten in independent form including all of the limitations of the base claim and any intervening claims; and (b) the above-described rejection of this claim under 35 U.S.C. 101 is overcome.
Claims 8 and 10 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Regarding Claim 3, the combination of Howard-Jenkins and Sarlin does not explicitly teach the method of Claim 3. Although Sarlin discloses a matching step (Sarlin, Section 3.3, paragraph 3 discloses “Exhaustively matching the neural map F and the BEV T yields a score volume M”), Sarlin does not explicitly teach matching or determining a similarity of rays. Thus, none of the previously cited prior art references, alone or in combination, provide a motivation to teach the ordered combination of “The method of claim 1 wherein computing the position and orientation of the camera with respect to the floorplan comprises searching for a pose in the floorplan that has the most similar rays as the predicted array of rays.”
Regarding Claim 8, the combination of Howard-Jenkins and Sarlin does not explicitly teach the method of Claim 8. Although Sarlin teaches aligning an image with a gravity direction (Sarlin, Section 3.1, paragraph 3 recites “Since the image is gravity-aligned, each of the U columns in X corresponds to a vertical plane in the 3D space”), Sarlin and Howard-Jenkins do not explicitly teach an attention mechanism masking out pixels that become unobservable due to gravity alignment. Thus, none of the previously cited references, alone or in combination, provide a motivation to teach the ordered combination of “The method of claim 7 comprising, prior to inputting the image to the trained neural network, aligning the image with a gravity direction, the gravity direction being received from a sensor associated with the camera, and using an attention mechanism of the first neural network to mask out pixels that become unobservable by the alignment of the image with the gravity direction.”
Regarding Claim 10, the combination of Howard-Jenkins and Sarlin does not explicitly teach the method of Claim 10. Wu et al. (CN 112270701 A) teaches a cost filter module in the field of visual distance measurement (Wu, page 11, paragraph 2 discloses “The cost filter module” and “Firstly, using a 3 D convolution residual block to pre-filter the 4 D cost body CGD”). However, Wu does not explicitly teach a 2D convolution cost filter and does not teach any motivation to teach the cost filter as claimed in Claim 10. Thus, none of the previously cited references, alone or in combination, provide a motivation to teach the ordered combination of “The method of claim 9 wherein the second neural network uses a learned cost filter which is a two dimensional (2D) convolution.”
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Fleischman et al. (US 2019/0005719 A1) discloses a spatial indexing system using floor images to automatically identify spatial locations.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CAROLINE TABANCAY DUFFY whose telephone number is (703)756-1859. The examiner can normally be reached Monday - Friday 8:00 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amandeep Saini, can be reached at 571-272-3382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CAROLINE TABANCAY DUFFY/Examiner, Art Unit 2662
/AMANDEEP SAINI/Supervisory Patent Examiner, Art Unit 2662