Prosecution Insights
Last updated: April 19, 2026
Application No. 18/600,424

CROSS-VIEW VISUAL GEO-LOCALIZATION FOR ACCURATE GLOBAL ORIENTATION AND LOCATION

Non-Final OA (§103)

Filed: Mar 08, 2024
Examiner: XIAO, DI
Art Unit: 2178
Tech Center: 2100 (Computer Architecture & Software)
Assignee: SRI International
OA Round: 1 (Non-Final)

Predictions
Grant Probability: 77% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 4m
Grant Probability with Interview: 99%

Examiner Intelligence

Career Allow Rate: 77% (463 granted / 600 resolved), above average (+22.2% vs. TC avg)
Interview Lift: +21.7% among resolved cases with an interview (a strong lift)
Typical Timeline: 3y 4m average prosecution; 24 applications currently pending
Career History: 624 total applications across all art units
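The headline figures above reduce to simple ratios. A minimal sketch of the arithmetic, assuming the counts shown in this report; the interview-lift function takes hypothetical with/without-interview counts, since the report states only the resulting +21.7-point delta, not the underlying numbers:

```python
# Career allow rate, from the counts shown above.
granted, resolved = 463, 600
allow_rate = granted / resolved        # 0.7717 -> displayed as 77%

def interview_lift(granted_w, resolved_w, granted_wo, resolved_wo):
    """Percentage-point difference in allow rate between cases resolved
    with an interview and cases resolved without one. The inputs here are
    hypothetical placeholders, not figures from this report."""
    return 100 * (granted_w / resolved_w - granted_wo / resolved_wo)
```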

Statute-Specific Performance

§101: 8.2% (-31.8% vs. TC avg)
§103: 57.6% (+17.6% vs. TC avg)
§102: 17.1% (-22.9% vs. TC avg)
§112: 14.2% (-25.8% vs. TC avg)

Tech Center averages are estimates. Based on career data from 600 resolved cases.

Office Action

§103
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

1. This action is responsive to the following communications: the Application filed on March 8, 2024, and the Drawings filed on March 8, 2024.

2. Claims 1-20 are pending in this case. Claims 1, 6, 12, and 18 are independent claims.

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Allowable Subject Matter

Claims 5, 11, 17, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

With regard to claim 5, the prior art does not disclose: "The method of claim 1, wherein training the neural network comprises: determining a vector representation of the features of the matching image pairs of the ground images and the geo-referenced, downward-looking reference images; and jointly embedding the feature vector representation of each of the matching image pairs in a common embedding space such that the feature embeddings of matching image pairs of the ground images and the geo-referenced, downward-looking reference images are closer together in the embedding space while the feature embeddings of not matching pairs are further apart."
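For context, the joint-embedding limitation indicated as allowable is, in machine-learning terms, a cross-view metric-learning objective. Below is a minimal sketch of that construction, assuming PyTorch, batched pre-computed feature vectors, and an illustrative InfoNCE-style symmetric formulation; the function name and temperature are assumptions, not the applicant's disclosed implementation:

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(ground_feats: torch.Tensor,
                         aerial_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Pull matching (ground, aerial) pairs together in a common embedding
    space and push non-matching pairs apart. Row i of each (B, D) tensor
    is a matching cross-view pair."""
    g = F.normalize(ground_feats, dim=1)   # unit-norm feature embeddings
    a = F.normalize(aerial_feats, dim=1)
    logits = g @ a.t() / temperature       # (B, B) cosine similarities
    target = torch.arange(g.size(0), device=g.device)  # diagonal = matches
    # Symmetric cross-entropy: ground-to-aerial and aerial-to-ground retrieval.
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.t(), target))
```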
With regard to claims 11, 17, and 20, the prior art does not disclose the aspect wherein at least one of the determined orientation and location for the query ground image and/or the updated orientation and location for the query ground image is used to insert an augmented reality object into the query ground image and/or to provide navigation information to a real-time navigation system.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

"A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made."

Claims 1, 6, 7, 8, 12, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Keski-Valkama (Pub. No. US 2024/0013554 A1) in view of Gautam et al. (Pub. No. US 2021/0402991 A1).

With regard to claim 1: Keski-Valkama discloses a computer-implemented method of training a neural network for providing orientation and location estimates for ground images (see Fig. 7, which identifies the orientation and location of ground image 703; paragraph 65: "FIG. 7 is a diagram illustrating an example of using texture projection as a loss function for training a machine learning model/system for registering imagery with different perspectives ... The image texture of the correspondence mask 705 of the overhead image 707 is projected into the ground plane of the correspondence mask 701 of the street level image 703 ... This generated image 709 is then compared to the original street level image 713 ... Higher similarities will result in higher reinforcement scores or lower loss, and lower similarities will result in lower reinforcement scores or higher loss for purposes of training the machine learning system 117."), comprising:

- collecting a set of ground images (the images are collected from a camera that can be mounted on a vehicle; paragraph 35: "the camera can be equipped, mounted, or otherwise associated with any platform including but not limited to a vehicle 103 (e.g., providing imagery 105 from a street level viewpoint or perspective), a user equipment (UE) device 107 ..., and/or a satellite 111 or other aerial vehicle");
- determining spatial-aware features for each of the collected ground images (see Fig. 6, which identifies spatial-aware features such as the road in correspondence mask 607 and the corresponding satellite-image region 605; paragraph 62: "FIG. 6 is a diagram illustrating example image data depicting the same locality/geographic area from different perspectives ... ground truth correspondence mask 605 can be labeled in overhead images 601 and correspondence mask 607 in street level images 603 to identify the image pixels or regions that depict the same locality.");
- collecting a set of geo-referenced, downward-looking reference images (see Figs. 6 and 7 for satellite/aerial images; paragraph 38: "overhead or top down sources, like satellite, aerial, and drone images, which are accurate photographs of the Earth's surface from an overhead perspective, can be used to precisely determine the location of roads, and other features on the Earth");
- determining spatial-aware features for each of the collected geo-referenced, downward-looking reference images (Fig. 6; paragraph 62, quoted above);
- determining a similarity of the spatial-aware features of the ground images with the spatial-aware features of the geo-referenced, downward-looking reference images (paragraph 61: "the training module initiates a labeling of one or more ground truth correspondence masks between the first training image and the second training image, wherein the one or more ground truth correspondence masks denote an image region of the first training image that matches a corresponding image region of the second training image or vice versa. The training module 401 can then use the one or more ground truth correspondence masks to train a machine learning model to determine one or more predicted correspondence masks between a first input image and a second input image (step 505).");
- pairing ground images and geo-referenced, downward-looking reference images based on the determined similarity (the images are paired and registered together; paragraph 46: "The images 301 and 303 are then processed using the machine learning system 117 (e.g., a neural network or equivalent machine learning model) to predict correspondence masks 307 and 309 ... the machine learning system 117 uses visual similarities and/or related features between the matching regions ... to predict the correspondence masks 307 and 309");
- determining a loss function (paragraph 65, quoted above);
- creating a training set including the paired ground images and geo-referenced, downward-looking reference images and the loss function (the image pairs are used to train the model to predict correspondence masks between a first input image and a second input image; paragraph 61, quoted above); and
- training, using the training set, the neural network to determine orientation and location estimates of ground images without the use of three-dimensional (3D) data (paragraph 72: "the training module selects a first training image and second training image for training the machine learning system 117 to perform image space coordinate transformation ... The training module 401 then uses the first set of ground truth image space coordinates and the second set of ground truth image space coordinates to further train the machine learning model to transform a first image space coordinate of the first input image to a second image space coordinate of the second input image or vice versa.").

Keski-Valkama does not disclose the aspect wherein the loss function jointly evaluates both orientation and location information. However, Gautam discloses determining a loss function that jointly evaluates both orientation and location information: the system collects state data, including position and orientation, for evaluation (paragraph 23: "the first input data can include: ... state data (e.g., position, orientation, velocity, etc. data)"), and the loss function is determined based on a position parameter and an orientation parameter (paragraph 119: "the first loss function can include an association loss parameter and/or one or more trajectory loss parameters, such as a length loss parameter, a width loss parameter, a position loss parameter, and/or an orientation loss parameter").

It would have been obvious to one of ordinary skill in the art, before the effective filing date, to apply Gautam to Keski-Valkama so that the system can use orientation and location information in the loss function, more accurately measure the difference between the downward-looking image and the ground image, make more accurate pairings between the images, and more accurately determine the locations of the ground images.
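To make the combined teaching concrete: a loss that "jointly evaluates both orientation and location information" is typically a weighted sum of a positional error term and a wrapped angular error term. A minimal sketch under those assumptions; the weights, squared-error form, and yaw-only orientation are illustrative choices, not taken from either reference:

```python
import torch

def joint_pose_loss(pred_xy, true_xy, pred_yaw, true_yaw,
                    w_loc: float = 1.0, w_ori: float = 1.0):
    """pred_xy, true_xy: (B, 2) planar coordinates;
    pred_yaw, true_yaw: (B,) headings in radians."""
    loc_term = ((pred_xy - true_xy) ** 2).sum(dim=1).mean()  # squared distance
    # Wrap the angular error into (-pi, pi] so 359 deg vs. 1 deg scores as 2 deg.
    diff = torch.atan2(torch.sin(pred_yaw - true_yaw),
                       torch.cos(pred_yaw - true_yaw))
    ori_term = (diff ** 2).mean()
    return w_loc * loc_term + w_ori * ori_term               # one joint objective
```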
With regard to claim 6: Keski-Valkama discloses a method for providing orientation and location estimates for a query ground image (Fig. 7; paragraph 65, quoted above), comprising: receiving a query ground image (paragraph 35, quoted above); determining spatial-aware features of the received query ground image (Fig. 6; paragraph 62, quoted above); and applying a model to the determined spatial-aware features of the received query ground image to determine the orientation and location of the query ground image (paragraph 72, quoted above). The model is trained by the same steps recited in claim 1: collecting ground images (paragraph 35), determining their spatial-aware features (paragraph 62), collecting geo-referenced, downward-looking reference images (paragraph 38), determining their spatial-aware features (paragraph 62), determining similarity and pairing (paragraphs 61 and 46), determining a loss function (paragraph 65), creating a training set (paragraph 61), and training the neural network without the use of three-dimensional (3D) data (paragraph 72); each step is supported by the passages quoted above for claim 1. Keski-Valkama does not disclose a loss function that jointly evaluates both orientation and location information; Gautam discloses that aspect (paragraphs 23 and 119, quoted above), and the combination is obvious for the same reasons given for claim 1.

With regard to claims 7 and 13: Keski-Valkama and Gautam disclose the method of claim 6, wherein applying a machine learning model to the determined spatial-aware features of the received ground image to determine the orientation and location of the ground image comprises: projecting the spatial-aware features of the query ground image into an embedding space, trained by embedding features of matching image pairs of the ground images and the geo-referenced, downward-looking reference images, to identify a geo-referenced, downward-looking reference image having features matching the projected features of the query ground image (Keski-Valkama, paragraph 72, quoted above); and determining the orientation and location of the query ground image using at least one of information contained in the embedded, matching geo-referenced, downward-looking reference image and/or information captured with the query ground image (Keski-Valkama, Fig. 7; paragraph 65, quoted above).
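The query-time step recited in claims 7 and 13 amounts to nearest-neighbor retrieval in the trained embedding space. A minimal sketch, assuming normalized embeddings and a hypothetical (N, 2) table of geo-references for the reference tiles; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def locate_query(query_feat, ref_embeddings, ref_latlon):
    """query_feat: (D,) embedding of the query ground image;
    ref_embeddings: (N, D) embeddings of geo-referenced, downward-looking
    reference images; ref_latlon: (N, 2) geo-reference for each image."""
    q = F.normalize(query_feat, dim=0)
    r = F.normalize(ref_embeddings, dim=1)
    scores = r @ q                     # cosine similarity to every reference
    best = int(torch.argmax(scores))
    return ref_latlon[best], float(scores[best])  # location + match confidence
```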
With regard to claim 8: Keski-Valkama and Gautam disclose the method of claim 7, wherein an orientation of the query ground image is determined by aligning spatial-aware features of the query image with spatial-aware features of the matching geo-referenced, downward-looking reference image (Keski-Valkama, Fig. 7; paragraph 65, quoted above: the image texture of overhead correspondence mask 705 is projected into the ground plane of street-level correspondence mask 701, and the result is compared to the original street level image).
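Orientation-by-alignment as claim 8 recites it can be sketched as circular correlation of azimuthal feature profiles: slide the query's per-azimuth feature energy against the reference's and take the best-matching shift. The polar-profile representation here is an assumption for illustration, not Keski-Valkama's texture-projection mechanism:

```python
import numpy as np

def estimate_yaw(query_profile: np.ndarray, ref_profile: np.ndarray) -> float:
    """query_profile, ref_profile: (K,) feature energy per azimuth bin
    (each bin spans 360/K degrees). Returns a yaw estimate in degrees."""
    k = len(ref_profile)
    # Circular cross-correlation via FFT: corr[s] = sum_i query[i] * ref[(i + s) % k]
    corr = np.fft.ifft(np.fft.fft(ref_profile)
                       * np.conj(np.fft.fft(query_profile))).real
    return int(np.argmax(corr)) * (360.0 / k)  # shift with maximum correlation
```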
Street level imagery refers to images or image data captured while the camera is located at street level or mounted on a vehicle 103, UE 107, or system that is located at street level with the camera pointed towards the feature or object being captured. As shown, ground truth correspondence mask 605 can be labeled in overhead images 601 and correspondence mask 607 in street level images 603 to identify the image pixels or regions that depict the same locality.” ); an image capture device; a database configured to store geo-referenced, downward-looking reference images (see fig. 6, 7 for satellite/aerial images, paragraph 38: “Different sources of raw data (e.g., imagery 105) can be exploited for positioning and/or other location-based services (e.g., mapping to generate digital map data of a geographic database 113). For example, overhead or top down sources, like satellite, aerial, and drone images, which are accurate photographs of the Earth's surface from an overhead perspective, can be used to precisely determine the location of roads, and other features on the Earth. These images help support positioning or create maps at a much larger scale but may be more limited in resolution because of the large fields of view in each overhead image relative to street-level or ground-based images.”); and an apparatus comprising a processor and a memory accessible to the processor, the memory having stored therein at least one of programs or instructions executable by the processor to configure the apparatus to: determine spatial-aware features of a received query ground image, captured by the capture device, using the neural network module(see fig. 6 for identifying spatial-aware features such as the road in part 607 and the corresponding satellite image in part 605, paragraph 62: “FIG. 6 is a diagram illustrating example image data depicting the same locality/geographic area from different perspectives, according to one embodiment. In this example, the training module 401 retrieves or generates a set of overhead images 601 from a first source (e.g., an aerial source such as a satellite 111), and a set of street level images 603 from a second source (e.g., a street level source such as a camera mounted on a vehicle 103 or a UE 107). The overhead images 601 and street level images 603 depict at least a part of the same locality, geographic area, or map from different respective perspectives. By way of example, overhead imagery refers to images or image data that are captured from a top-down or aerial perspective so that the camera is pointed down towards the area of interest from an overhead height. Street level imagery refers to images or image data captured while the camera is located at street level or mounted on a vehicle 103, UE 107, or system that is located at street level with the camera pointed towards the feature or object being captured. As shown, ground truth correspondence mask 605 can be labeled in overhead images 601 and correspondence mask 607 in street level images 603 to identify the image pixels or regions that depict the same locality.” ); and apply the model to the determined spatial-aware features of the received query ground image to determine the orientation and location of the query ground image (see fig. 7 for identifying orientation and location of the ground image 703: “FIG. 
7 is a diagram illustrating an example of using texture projection as a loss function for training a machine learning model/system for registering imagery with different perspectives, according to one embodiment. In this example, a correspondence mask 701 is predicted for street level image 703 and a correspondence mask 705 is predicted for the paired overhead image 707. The image texture of the correspondence mask 705 of the overhead image 707 is projected into the ground plane of the correspondence mask 701 of the street level image 703 to generate image 709 with the overhead image texture 711 replacing the portion of the image delineated by the correspondence mask 701. This generated image 709 is then compared to the original street level image 713 (e.g., with the original street level image texture in the correspondence mask 701 to determine the similarity of the two images (e.g., in terms of optical pixel values and/or equivalent image characteristics). This similarity can then be used as the image texture similarity metric. Higher similarities will result in higher reinforcement scores or lower loss, and lower similarities will result in lower reinforcement scores or higher loss for purposes of training the machine learning system 117.”), the model having been trained by: collecting a set of ground images using the image capture device; determining spatial-aware features for each of the collected ground images using the neural network module(see fig. 6 for identifying spatial-aware features such as the road in part 607 and the corresponding satellite image in part 605, paragraph 62: “FIG. 6 is a diagram illustrating example image data depicting the same locality/geographic area from different perspectives, according to one embodiment. In this example, the training module 401 retrieves or generates a set of overhead images 601 from a first source (e.g., an aerial source such as a satellite 111), and a set of street level images 603 from a second source (e.g., a street level source such as a camera mounted on a vehicle 103 or a UE 107). The overhead images 601 and street level images 603 depict at least a part of the same locality, geographic area, or map from different respective perspectives. By way of example, overhead imagery refers to images or image data that are captured from a top-down or aerial perspective so that the camera is pointed down towards the area of interest from an overhead height. Street level imagery refers to images or image data captured while the camera is located at street level or mounted on a vehicle 103, UE 107, or system that is located at street level with the camera pointed towards the feature or object being captured. As shown, ground truth correspondence mask 605 can be labeled in overhead images 601 and correspondence mask 607 in street level images 603 to identify the image pixels or regions that depict the same locality.” ); collecting a set of geo-referenced, downward-looking reference images from the database (see fig. 6, 7 for satellite/aerial images, paragraph 38: “Different sources of raw data (e.g., imagery 105) can be exploited for positioning and/or other location-based services (e.g., mapping to generate digital map data of a geographic database 113). For example, overhead or top down sources, like satellite, aerial, and drone images, which are accurate photographs of the Earth's surface from an overhead perspective, can be used to precisely determine the location of roads, and other features on the Earth. 
The claim further recites that the model has been trained by: collecting a set of ground images using the image capture device; determining spatial-aware features for each of the collected ground images using the neural network module (see FIG. 6; paragraph 62, quoted above); collecting a set of geo-referenced, downward-looking reference images from the database (see FIGS. 6 and 7; paragraph 38, quoted above); determining spatial-aware features for each of the collected geo-referenced, downward-looking reference images using the neural network module (see FIG. 6; paragraph 62, quoted above); and determining a similarity of the spatial-aware features of the ground images with the spatial-aware features of the geo-referenced, downward-looking reference images using the cross-view geo-registration module (paragraph 61: "It is contemplated that the training module 401 can use any process to generate training data from the retrieved or generated image pairs. The image pairs can be imagery 105 that has been captured by cameras (e.g., equipped on vehicles 103, UEs 107, satellites 111, other aerial vehicles, etc.) from different viewpoints. For example, in step 503, the training module 401 can initiate the labeling or annotation of the ground truth correspondence masks between two images used as a supervised ground truth. In other words, the training module initiates a labeling of one or more ground truth correspondence masks between the first training image and the second training image, wherein the one or more ground truth correspondence masks denote an image region of the first training image that matches a corresponding image region of the second training image or vice versa. The training module 401 can then use the one or more ground truth correspondence masks to train a machine learning model to determine one or more predicted correspondence masks between a first input image and a second input image (step 505). This labeling or annotation can be performed manually by presenting human annotators with training images to identify and delineate the ground truth correspondence masks in each training image pair (e.g., as illustrated in FIG. 3). The labeled ground truth correspondence masks can then be provided as ground truth training data.").
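The claimed similarity determination is commonly realized as a cosine-similarity matrix between the two feature sets. The sketch below is a generic illustration under assumed names (ground_feats, ref_feats), not the cited geo-registration module, whose training data is in fact built from annotated correspondence masks:

```python
import numpy as np

def cross_view_similarity(ground_feats, ref_feats, eps=1e-8):
    """Cosine similarity between every ground-image feature vector and every
    geo-referenced overhead-image feature vector.
    Rows: ground images; columns: reference images."""
    g = ground_feats / (np.linalg.norm(ground_feats, axis=1, keepdims=True) + eps)
    r = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)
    return g @ r.T   # shape (num_ground, num_ref), values in [-1, 1]
```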
The training further includes pairing ground images and geo-referenced, downward-looking reference images based on the determined similarity using the cross-view geo-registration module (paragraph 46: "FIG. 3 is a diagram illustrating an example of registering two images based on predicted correspondence masks (e.g., correspondence data 121), according to one embodiment. In this example, a street level image 301 is captured and depicts a roadway in a geographic area of interest. An overhead image 303 that depicts at least part of the same roadway depicted in street level image 301 is retrieved. The images 301 and 303 are then processed using the machine learning system 117 (e.g., a neural network or equivalent machine learning model) to predict correspondence masks 307 and 309 according to the various embodiments described herein. The correspondence mask 307 comprises the pixels of the street level image 301 that match (e.g., depict the same portion of the roadway) the pixels of the correspondence mask 309 of the overhead image 303. In one embodiment, the machine learning system 117 uses visual similarities and/or related features between the matching regions of the images 301 and 303 to predict the correspondence masks 307 and 309."); determining a loss function (see paragraph 65, quoted above); and creating a training set including the paired ground images and geo-referenced, downward-looking reference images and the loss function using the cross-view geo-registration module (the image pairs are used to train the model to predict correspondence masks between a first input image and a second input image; see paragraph 61, quoted above).
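The pairing and training-set steps can then be pictured as selecting mutual nearest neighbors in that similarity matrix and collecting the matched pairs. Again a hedged sketch under assumed names, not the cited training module's actual logic:

```python
import numpy as np

def pair_images(sim):
    """Pair ground image i with reference image j only when each is the
    other's best match (mutual nearest neighbors) in the similarity matrix."""
    best_ref = sim.argmax(axis=1)   # best reference for each ground image
    best_gnd = sim.argmax(axis=0)   # best ground image for each reference
    return [(i, j) for i, j in enumerate(best_ref) if best_gnd[j] == i]

def build_training_set(ground_imgs, ref_imgs, pairs):
    """Assemble matched (ground, reference) pairs as positive training examples."""
    return [(ground_imgs[i], ref_imgs[j]) for i, j in pairs]
```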
Keski-Valkama also discloses training, using the training set, the neural network to determine orientation and location estimates of ground images without the use of three-dimensional (3D) data (the training module 401 uses the two sets of ground truth image space coordinates to further train the model to transform coordinates between the two images; paragraph 72: "In other words, in optional step 507, the training module selects a first training image and second training image for training the machine learning system 117 to perform image space coordinate transformation. For example, the first training image is associated with a first set of ground truth image space coordinates and the second training image is associated with a second set of ground truth image space coordinates corresponding to the one or more ground truth correspondence masks. The training module 401 then uses the first set of ground truth image space coordinates and the second set of ground truth image space coordinates to further train the machine learning model to transform a first image space coordinate of the first input image to a second image space coordinate of the second input image or vice versa. The trained machine learning model that is trained to determine one or more predicted correspondence masks is augmented with at least one extra convolutional layer, at least one extra attentional layer, at least one extra dense layer, or a combination thereof, which attend to the image coordinate locality of the first input image, the second input image, or a combination thereof.").
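Paragraph 72's image-space coordinate transformation can be illustrated as a small regression network mapping coordinates from one view to the other, supervised by ground-truth correspondences. The two-layer MLP below is an assumed stand-in for illustration, not the reference's augmented convolutional/attentional architecture:

```python
import torch
import torch.nn as nn

# Tiny regressor: (u, v) in the street image -> (u', v') in the overhead image.
coord_net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(coord_net.parameters(), lr=1e-3)

street_uv = torch.rand(256, 2)     # ground-truth coordinates, street view
overhead_uv = torch.rand(256, 2)   # matching coordinates, overhead view

for step in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(coord_net(street_uv), overhead_uv)
    loss.backward()
    opt.step()
```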
Keski-Valkama does not disclose a loss function that jointly evaluates both orientation and location information. However, Gautam discloses determining a loss function that jointly evaluates both orientation (the system collects data such as orientation and location data for evaluation; paragraph 23: "The systems and methods of the present disclosure can allow for generating temporally-consistent trajectories for detected objects using predicted trajectory data for a plurality of time steps. For example, a computing system can obtain input data indicative of a respective trajectory for each of one or more first objects for a first time step. In some implementations, the one or more first objects can be a first set of objects. In some implementations, the first time step can be a prior time step (e.g., t=T−1). In some implementations, the first input data can be obtained from a first stage model, such as a first stage model configured to generate predicted trajectories for detected objects using sensor data from sensors of an autonomous vehicle. In various implementations, the first input data can include: data indicative of a respective trajectory for each detected object for the first time step (e.g., t=T−1); object detection data (e.g., bounding box data); feature data (e.g., feature map data from the first stage model); probability data (e.g., classification probability data); state data (e.g., position, orientation, velocity, etc. data); and/or identifier data (e.g., identifier data associated with each detected object).") and location information (the loss function is determined based on a position loss parameter and an orientation loss parameter; paragraph 119: "At 506, the method 500 can include determining a first loss function based at least in part on a comparison of the data descriptive of the first tracked trajectory to a ground truth tracked trajectory. As an example, in some implementations, the first loss function can include an association loss parameter and/or one or more trajectory loss parameters, such as a length loss parameter, a width loss parameter, a position loss parameter, and/or an orientation loss parameter, as described herein. For example, an association loss parameter can describe a difference between an association matrix received as an output of the machine-learned parameter and a comparison to a ground truth association matrix.").

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply Gautam to Keski-Valkama so that the system can use orientation and location information to determine a loss function, more accurately determine the difference between the downward-looking image and the ground image, make more accurate pairings between the images, and more accurately determine the locations of the ground images.
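A loss that jointly evaluates location and orientation, in the spirit of Gautam's position and orientation loss parameters, might combine a position term with a wrapped angular term, as in this PyTorch sketch; the weighting scheme and angle handling are illustrative assumptions, not Gautam's formulation:

```python
import torch
import torch.nn.functional as F

def joint_pose_loss(pred_xy, gt_xy, pred_theta, gt_theta, w_pos=1.0, w_ori=1.0):
    """Jointly penalize position error (smooth L1) and orientation error
    (squared angular difference, wrapped to (-pi, pi])."""
    pos_loss = F.smooth_l1_loss(pred_xy, gt_xy)
    # Wrap so that, e.g., 359 deg vs 1 deg counts as 2 deg of error, not 358.
    d = torch.atan2(torch.sin(pred_theta - gt_theta),
                    torch.cos(pred_theta - gt_theta))
    ori_loss = (d ** 2).mean()
    return w_pos * pos_loss + w_ori * ori_loss

pred_xy, gt_xy = torch.rand(8, 2), torch.rand(8, 2)
pred_th, gt_th = torch.rand(8) * 6.28, torch.rand(8) * 6.28
print(joint_pose_loss(pred_xy, gt_xy, pred_th, gt_th))
```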
Claims 2, 9, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Keski-Valkama, Pub. No.: US 20240013554 A1, in view of Gautam, and further in view of Xiong, Pub. No.: CN 114842343 A.

With regard to claims 2, 9, and 14: Keski-Valkama and Gautam do not disclose the aspect wherein the spatial-aware features for the ground images and the spatial-aware features for the geo-referenced, downward-looking reference images are determined using at least one neural network including a vision transformer. However, Xiong discloses this aspect (ViT is the acronym for vision transformer: "step S3: constructing a ViT-based aerial image recognition model, model structure shown in FIG. 3. The model is composed of an encoder F(.), a classification head G(.), and a projection head P(.) used only in the training stage. The encoder F(.) is a ViT pre-trained on the ImageNet data set, for learning and encoding global image features. Specifically, the encoder F(.) comprises two parts, a linear layer and an encoder: the linear layer is used for embedding the image; the encoder is composed of multi-head self-attention layers and multi-layer perceptron blocks for learning global features of the image, applying LayerNorm normalization before each block and a residual connection after each block. The training image is input to the feature encoder F(.), and the first token of the last layer of the encoder is used as the global feature representation hi. The hi is then input into the classifier and the projector to calculate the total loss value.").

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply Xiong to Keski-Valkama and Gautam so that vision transformers can be used to accurately identify spatial-aware features with superior global-context understanding.
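Xiong's use of the first (class) token of the encoder's last layer as the global feature hi can be illustrated with a stock ViT. This sketch assumes timm's API, where forward_features on recent versions returns the full token sequence with the class token first; it is not Xiong's model:

```python
import torch
import timm  # assumption: timm's pretrained ViT stands in for Xiong's encoder F(.)

# pretrained=False keeps the sketch runnable offline; set pretrained=True
# to mirror Xiong's ImageNet pre-training.
model = timm.create_model("vit_base_patch16_224", pretrained=False)
model.eval()

@torch.no_grad()
def global_feature(img_batch):
    """Return hi: the first (class) token of the ViT's last layer, mirroring
    Xiong's use of it as the global feature representation."""
    tokens = model.forward_features(img_batch)  # (B, 1 + num_patches, dim)
    return tokens[:, 0]                         # (B, dim) class token

x = torch.randn(2, 3, 224, 224)                 # dummy aerial image batch
print(global_feature(x).shape)                  # torch.Size([2, 768])
```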
Claims 3 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Keski-Valkama, Pub. No.: US 20240013554 A1, in view of Gautam, and further in view of Rodrigues, Pub. No.: WO 2022215236 A1.

With regard to claims 3 and 19: Keski-Valkama and Gautam do not disclose the aspect of further comprising applying a polar transformation to at least one of the geo-referenced, downward-looking reference images prior to determining the spatial-aware features for the geo-referenced, downward-looking reference images. However, Rodrigues discloses this aspect ("Specifically, one of the CNNs acquires a ground-view image and extracts features therefrom. The other one acquires a polar-transformed aerial-view image, and extracts features therefrom. Then, the system estimates the orientation alignment of the ground-view image with respect to the aerial-view image by computing the correlation between the features extracted from the ground-view image and those extracted from the aerial-view image. Based on the orientation alignment, the system shifts and crops out a section of the features of the aerial-view image that is estimated to correspond to the ground-view image. Then, the features of the ground-view image are compared with the cropped features of the aerial-view image to determine whether the ground-view image matches the aerial-view image.").

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply Rodrigues to Keski-Valkama and Gautam, using a polar transformation to improve the visualization and analysis of the cross-view data.
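Rodrigues's polar-transformed aerial view can be sketched as a resampling that maps radius and azimuth about the image centre to rows and columns of a panorama-like strip, roughly aligning the overhead view with a 360-degree ground panorama. Conventions (row direction, angle origin) vary across implementations; this numpy version is one illustrative choice, not the reference's exact mapping:

```python
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """Resample a square overhead image so that rows index radius from the
    centre and columns index azimuth (nearest-neighbour sampling)."""
    S = aerial.shape[0]                      # assume a square H = W image
    cy = cx = (S - 1) / 2.0
    rows = np.arange(out_h).reshape(-1, 1)   # radius index (0 = centre)
    cols = np.arange(out_w).reshape(1, -1)   # azimuth index over 360 degrees
    r = rows / (out_h - 1) * cy
    theta = 2.0 * np.pi * cols / out_w
    src_y = np.rint(cy - r * np.cos(theta)).astype(int).clip(0, S - 1)
    src_x = np.rint(cx + r * np.sin(theta)).astype(int).clip(0, S - 1)
    return aerial[src_y, src_x]

aerial = np.random.rand(256, 256, 3)
print(polar_transform(aerial).shape)         # (128, 512, 3)
```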
Claims 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Keski-Valkama, Pub. No.: US 20240013554 A1, in view of Gautam, and further in view of Zhou, Pub. No.: WO 2021213157 A1.

With regard to claims 4 and 15: Keski-Valkama and Gautam do not disclose the aspect comprising applying an orientation-weighted triplet ranking loss function to train the neural network. However, Zhou discloses this aspect ("Exemplarily, a preset neural network is determined, and the loss function of the preset neural network includes a fully connected loss function, a distance loss function, a region loss function, and an area loss function. The preset neural network is stored in the blockchain, and the preset neural network is determined from the blockchain. The fully connected loss function includes the softmax loss function and the softmax rank loss function. The softmax loss function is used to learn the accuracy of vehicle classification; the softmax rank loss function is used to learn the incremental relationship of vehicle classification accuracy. The distance loss function includes a triplet loss function and a triplet rank loss function. The triplet loss function is used to learn the metric of the vehicle feature recognizer; the triplet rank loss function is used to learn the increasing relationship of that metric. The region loss function is used to learn vehicle feature relationships, including the symmetric relationship between the car lights and the license plate/car logo axis, the left-right relationship between the two lights, the up-down relationship of the car logo and license plate, the angle relationships between key points, etc. The area loss function is used to learn the hierarchical relationship of the cutting area.").

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply Zhou to Keski-Valkama and Gautam, using a triplet loss function to provide high-precision similarity comparisons for determining whether a pairing is accurate, in order to accurately determine the location of the ground image.
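An orientation-weighted triplet ranking loss, as claimed, can be sketched by scaling a standard triplet ranking loss with the orientation error of each positive pair. This is a hedged illustration of the claimed concept, not Zhou's formulation (which Zhou describes in a vehicle re-identification setting):

```python
import torch

def orientation_weighted_triplet_loss(anchor, pos, neg, ori_err,
                                      margin=0.5, alpha=1.0):
    """Triplet ranking loss whose weight grows with the orientation error of
    the positive pair, emphasizing misaligned cross-view matches."""
    d_pos = torch.norm(anchor - pos, dim=1)   # anchor-positive distance
    d_neg = torch.norm(anchor - neg, dim=1)   # anchor-negative distance
    w = 1.0 + alpha * torch.abs(ori_err)      # weight harder positives more
    return (w * torch.relu(d_pos - d_neg + margin)).mean()

a, p, n = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)
ori_err = torch.rand(16)                      # per-pair orientation error (rad)
print(orientation_weighted_triplet_loss(a, p, n, ori_err))
```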
Claims 10 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Keski-Valkama, Pub. No.: US 20240013554 A1, in view of Gautam, and further in view of Himane, Pub. No.: US 2020/0089313 A1.

With regard to claims 10 and 16: Keski-Valkama and Gautam do not disclose the method of claim 6, wherein the determined orientation and location for the query ground image is used to update at least one of an orientation or a location of the query ground image. However, Himane discloses this aspect (paragraph 19: "Once relative position and orientation have been determined based on the marker or otherwise using image data, the relative position and orientation of the devices can be updated via inertial measurements. In some implementations, one or both of the first device and the second device include a relative inertial measurement system that determines relative inertial motion based on inertial measurements. For example, the inertial measurements from the first device and the second device may be synchronized in order to determine relative motion of the first device and the second device. In some implementations, most recently received inertial measurements may be used to determine relative movement, e.g., inertial measurements may be measured at a time interval and a previous time interval to determine relative movement. The relative inertial motion may be used to determine the relative position and orientation of the second device to the first device. However, based on the frequency of the inertial measurements, inaccuracies such as a drift effect may result.").

It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to apply Himane to Keski-Valkama and Gautam so that the orientation and location information can be used to update at least one of an orientation or a location of the query ground image, keeping the information accurate and cross-checking the earlier information to confirm it is still accurate.

Pertinent Arts

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Aggarwal, Patent No.: US 12145800 B1: The position sensor 140 represents one or more sensors, detectors, or other components suitable for determining the location of the unit 20 in any appropriate manner. For example, in particular embodiments, the workspace 70 associated with the inventory system 10 includes a number of fiducial markers distributed to cover all or a portion of the workspace 70. In such embodiments, the position sensor 140 may include a camera and suitable image-processing and/or video-processing components, such as an appropriately-programmed digital signal processor, to allow the position sensor 140 to detect fiducial markers within the camera's field of view. The control module 170 may store location information that is updated repeatedly to reflect the position and orientation of the unit 20 as determined via processing an image captured by the position sensor 140 encompassing one or more of the fiducial markers. As a result, the control module 170 can maintain an accurate indication of the location and orientation of the unit 20 to aid in navigation when moving the unit 20 within the workspace 70.

Handa, Pub. No.: US 20230191596 A1: A technique for training a neural network, including generating a plurality of input vectors based on a first plurality of task demonstrations associated with a first robot performing a first task in a simulated environment, wherein each input vector included in the plurality of input vectors specifies a sequence of poses of an end-effector of the first robot, and training the neural network to generate a plurality of output vectors based on the plurality of input vectors. Another technique for generating a task demonstration, including generating a simulated environment that includes a robot and at least one object, causing the robot to at least partially perform a task associated with the at least one object within the simulated environment based on a first output vector generated by a trained neural network, and recording demonstration data of the robot at least partially performing the task within the simulated environment.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DI XIAO, whose telephone number is (571) 270-1758. The examiner can normally be reached 9 AM-5 PM EST, M-F. Examiner interviews are available via telephone, in person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Stephen Hong, can be reached at (571) 272-4124. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in USA or Canada) or 571-272-1000.

/DI XIAO/ Primary Examiner, Art Unit 2178

Prosecution Timeline

Mar 08, 2024: Application Filed
Feb 13, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12599341: AUTONOMOUS, CONSENT DRIVEN AND GENERATIVE DEVICE, SYSTEM AND METHOD THAT PROMOTES USER PRIVACY, SELF-KNOWLEDGE AND WELL-BEING (2y 5m to grant; granted Apr 14, 2026)
Patent 12597519: METHODS FOR CHARACTERIZING AND TREATING A CANCER TYPE USING CANCER IMAGES (2y 5m to grant; granted Apr 07, 2026)
Patent 12588967: PRESENTATION OF PATIENT INFORMATION FOR CARDIAC SHUNTING PROCEDURES (2y 5m to grant; granted Mar 31, 2026)
Patent 12586456: SYSTEMS AND METHODS FOR PROVIDING SECURITY SYSTEM INFORMATION USING AUGMENTED REALITY EFFECTS (2y 5m to grant; granted Mar 24, 2026)
Patent 12579773: DISPLAY APPARATUS AND DISPLAY METHOD (2y 5m to grant; granted Mar 17, 2026)
Based on this examiner's 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 77%
With Interview: 99% (+21.7%)
Median Time to Grant: 3y 4m
PTA Risk: Low
Based on 600 resolved cases by this examiner. Grant probability derived from career allow rate.
