Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. Under a broadest reasonable interpretation (BRI), words of the claim must be given their plain meaning, unless such meaning is inconsistent with the specification. The plain meaning of a term means the ordinary and customary meaning given to the term by those of ordinary skill in the art at the relevant time. The ordinary and customary meaning of a term may be evidenced by a variety of sources, including the words of the claims themselves, the specification, drawings, and prior art. However, the best source for determining the meaning of a claim term is the specification: the greatest clarity is obtained when the specification serves as a glossary for the claim terms. See MPEP § 2111.01(I). See also In re Marosi, 710 F.2d 799, 802, 218 USPQ 289, 292 (Fed. Cir. 1983) ("'[C]laims are not to be read in a vacuum, and limitations therein are to be interpreted in light of the specification in giving them their broadest reasonable interpretation.'"); MPEP § 2111.01(II).
Specification
The abstract of the disclosure is objected to because the abstract fails to properly introduce the claimed invention. The abstract is too short and fails to summarize all aspects of the claimed invention. A corrected abstract of the disclosure is required and must be presented on a separate sheet, apart from any other text. See MPEP § 608.01(b).
Applicant is reminded of the proper content of an abstract of the disclosure.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 5, 6, 12, 15, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over HU (Z. Hu, J. Pan, T. Fan, R. Yang and D. Manocha, "Safe Navigation With Human Instructions in Complex Scenes," in IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 753-760, April 2019) in view of SAMARASEKERA (US 20150269438 A1).
Regarding claim 1:
HU discloses:
at a mobile device in communication with one or more input devices: (See at least HU, Pg 1, Section III, "Our approach contains four steps: pre-processing, phrase classification, goal and constraint grounding, and motion planning. First, the pre-processing step takes the command sentence as the input and divides it into phrases according to conjunctions and commas. Next, we assume that each phrase has one of three labels: goal, constraint, or uninformative phrase. For example, the phrase “go to the restaurant” is a goal phrase, “keep away from people” is a constraint phrase, and the phrase “you know” is uninformative. We train a Long Short-Term Memory (LSTM) network [19] to classify phrases into those three types. The LSTM is suitable for our task because the classification output does not depend only on individual words but also on the meaning of the entire phrase. After recognizing the constraint and goal phrases, we need to ground them with the physical world. In particular, we ground the goal phrase by computing the similarity between the noun extracted from the goal phrase and the location name in the predefined semantic map, and use the most similar location as the goal configuration for the robot navigation.")
receiving, via the one or more input devices, an input including a description associated with an object in a mapped environment; (See at least HU, Pg 1, Section III, "Our approach contains four steps: pre-processing, phrase classification, goal and constraint grounding, and motion planning. First, the pre-processing step takes the command sentence as the input and divides it into phrases according to conjunctions and commas. Next, we assume that each phrase has one of three labels: goal, constraint, or uninformative phrase. For example, the phrase “go to the restaurant” is a goal phrase, “keep away from people” is a constraint phrase, and the phrase “you know” is uninformative. We train a Long Short-Term Memory (LSTM) network [19] to classify phrases into those three types. The LSTM is suitable for our task because the classification output does not depend only on individual words but also on the meaning of the entire phrase. After recognizing the constraint and goal phrases, we need to ground them with the physical world. In particular, we ground the goal phrase by computing the similarity between the noun extracted from the goal phrase and the location name in the predefined semantic map, and use the most similar location as the goal configuration for the robot navigation."; Pg 4, Section V, "For goal grounding, we compute the goal location by extracting the noun from the goal phrase and then computing the similarity between this noun and the location names in the predefined semantic map. The similarity is computed as the cosine similarity between the embedded vectors of two given word items. The embedding is accomplished using the Word2Vec embedding network [22], [23] that can convert a word into a vector based on the Wiki corpus. We use the 2D coordinate of the location that has the highest similarity with the noun in the goal phrase as the goal location.")
generating, using a text encoder of a multi-modal model, an embedding corresponding to the input; (See at least HU, Pg 3, Section III, "We train a Long Short-Term Memory (LSTM) network [19] to classify phrases into those three types. The LSTM is suitable for our task because the classification output does not depend only on individual words but also on the meaning of the entire phrase."; Pg 4, Section IV, "More specifically, as shown in Figure 3, our classification subsystem contains four layers: the embedding layer, the bidirectional LSTM layer, the attention layer, and the output layer. The embedding layer transforms a phrase S with length T: S = {x_1, x_2, …, x_T} into a list of real-valued vectors E = {e_1, e_2, …, e_T}. We transform a word x_i into the embedding vector through an embedding matrix W_e: e_i = W_e v_i, where v_i is a vector of value 1 at index e_i and 0 otherwise. Next, we extract the representation for the sequential data E using LSTM, which provides an elegant way of modeling sequential data. But in standard LSTM, the information encoded in the inputs can only flow in one direction and the future hidden unit cannot affect the current unit. To overcome this drawback, we employ a bi-directional LSTM layer [21], which can be trained using all the available inputs from two directions to improve the prediction performance. A bi-directional LSTM consists of a forward and backward LSTM. The forward LSTM reads the input embedded vector sequence from e_1 to e_T and computes a sequence of forward hidden states (h⃗_1, …, h⃗_T), where h⃗_i ∈ R^p and p is the dimensionality of hidden states. The backward LSTM reads the embedded vector sequence in the reverse order, i.e., from e_T to e_1, resulting in a sequence of backward hidden states (h⃖_T, …, h⃖_1), where h⃖_T ∈ R^p. By adding the forward and backward states h⃗_i and h⃖_i, we can obtain the final latent vector representation as h_i = h⃗_i + h⃖_i, where h_i ∈ R^p.")
of the mobile device in the mapped environment corresponding to the embedding, wherein determining the target pose of the mobile device in the mapped environment corresponding to the embedding comprises: (See at least HU, Pgs. 1-2, Section I, "In this paper, we present an algorithm to enable a mobile robot to understand the navigation goal and trajectory constraints from natural language human instructions and sensor measurements, in order to generate a high quality navigation trajectory following user’s requirements. We first use a Long Short-Term Memory (LSTM) recursive neural network to parse and interpret the commands and generate the navigation goal and a set of constraint phrases. The navigation goal is in form of a location name in the semantic map. The constraint phrase describes the spatial relationship between the robot and a target object, e.g., in “keep away from the desk”, “desk” is the target object and “keep away from” is the spatial relation."; Pg 4, Section V, "For goal grounding, we compute the goal location by extracting the noun from the goal phrase and then computing the similarity between this noun and the location names in the predefined semantic map. The similarity is computed as the cosine similarity between the embedded vectors of two given word items. The embedding is accomplished using the Word2Vec embedding network [22], [23] that can convert a word into a vector based on the Wiki corpus. We use the 2D coordinate of the location that has the highest similarity with the noun in the goal phrase as the goal location.")
identifying a threshold similarity of the embedding to one of a plurality of embeddings of a multi-modal node graph; (See at least HU, Pg 4, Section V, "For goal grounding, we compute the goal location by extracting the noun from the goal phrase and then computing the similarity between this noun and the location names in the predefined semantic map. The similarity is computed as the cosine similarity between the embedded vectors of two given word items. The embedding is accomplished using the Word2Vec embedding network [22], [23] that can convert a word into a vector based on the Wiki corpus. We use the 2D coordinate of the location that has the highest similarity with the noun in the goal phrase as the goal location."; Pg 5, Section V, "Once we label the objects in the current scene, we compute the similarity between the nouns that occur in the constraint phrase and those object labels discovered by the scene understanding. Again, we use the Word2Vec embedding network [22], [23] to embed all words in the vector space and compute the cosine similarity between word vectors of the constraint object and objects detected in the instance segmentation. If the similarity is larger than a predefined threshold, we add this constraint to the motion planner.")
wherein the multi-modal node graph includes
the plurality of embeddings correspond to the plurality of poses of the mobile device in the mapped environment; and (See at least HU, Pg 5, Section V, "The robot then starts to follow the globally planned trajectory, but it needs to perform local planning to adapt to the dynamically added constraints. In particular, the local planning algorithm will maintain a costmap in the robot’s local coordinate. The costmap contains both static obstacles and constraints. The static obstacles like walls, rooms, and buildings can be directly added into the costmap by transforming a subset of the global map. The constraints can also be conveniently modeled in the costmap. Given the location (x_0, y_0) of a constraint object computed via constraint grounding, we can mark the cells inside the region {(x, y) : (x − x_0)² + (y − y_0)² ≤ a²} as the obstacle cells, where a is a parameter indicating the influencing radius of the constraint. Finally, we perform the smooth inflation operation over all the obstacle cells in the grid map, in order to enable the robot to keep a safe distance from the obstacle. Samples of the costmap computed are shown in Figure 6.")
moving the mobile device to a location and an orientation of the target pose in the mapped environment. (See at least HU, Pg 1, Abstract, "In this paper, we present a robotic navigation algorithm with natural language interfaces, which enables a robot to safely walk through a changing environment with moving persons by following human instructions such as “go to the restaurant and keep away from people”."; Pg 3, Section III, "To dynamically add the target object in a constraint into the planner’s costmap, we use an object detection module to dynamically look for the object mentioned in the constraint phrase and then output the object’s 3D location in the robot’s local coordinate system. To deal with objects that are moving (like “people”) or objects that are occluded, we will perform the grounding in a dynamic manner along with the navigation. That is, whenever a constraint object occurs in the camera frame, we add it to the local planner’s costmap. Finally, our costmap motion planner will compute a suitable trajectory based on the costmap which is updated in real-time according to the grounded constraint result. To improve the navigation performance in crowd scenarios, we further use the planning result to guide a reinforcement learning based local collision avoidance approach developed in our previous work [20]. An overview of our proposed navigation system is shown in Figure 2. ")
HU does not disclose, but SAMARASEKERA teaches:
determining a target pose (See at least SAMARASEKERA, ¶ 0037, “Referring now to FIG. 2, components of an embodiment of the multi-modal navigation and geospatial mapping subsystem 126 of FIG. 1 are shown in more detail. Each of the components shown in FIG. 2 may be embodied as software, hardware, firmware, or a combination thereof. In FIG. 2, the components of the navigation and geospatial mapping subsystem 126 are shown in the context of an environment 201 that may be created during the operation of the computing system 100 (e.g., a physical and/or virtual execution or "runtime" environment). The illustrative navigation and mapping subsystem 126 includes a temporal association module 210, a spatial association module 212, a geospatial association module 214, and a multi-modal localization and mapping module 216 (which includes a 6DOF pose estimation module 218, described in more detail below with reference to FIG. 6). The components of the navigation and mapping subsystem 126 operate to maintain accurate position and orientation information for each of the sensors 102, 104, 106, and make temporal, spatial and geospatial associations across the data 112, 114, 116 obtained from the various different sensors 102, 104, 106. The operations performed by the components of the navigation and mapping subsystem 126 allow the map representation 128 to be created over time. The navigation and mapping subsystem 126 is capable of exploiting various combinations of multi-modal observations derived from the data 124 to generate a navigation path and map 128 for the platform 120. For example, the navigation and mapping subsystem 126 can utilize a combination of data obtained from an IMU and video data to estimate an initial navigation path for the platform 120, and then use the relative 6 degrees of freedom (6DOF) poses that are estimated by the 6DOF pose estimation module 218 to generate a 3D map based on data obtained from, e.g., a scanning LIDAR sensor. 
The 3D LIDAR features can then be further exploited to improve the navigation path and the map previously generated based on, e.g., the IMU and video data.”)
representations of a plurality of poses of the mobile device in the mapped environment and (See at least SAMARASEKERA, ¶ 0037)
poses (See at least SAMARASEKERA, ¶ 0037; ¶ 0042,” FIGS. 13A and 13B show examples of visualization output resulting from the operations of an embodiment of the live analytics subsystem 134, including operations of the change detection module 310. FIG. 13A illustrates a visualization 1301 of a map of 3D and image data of a particular geographic location, which is stored in a reference database. FIG. 13B illustrates a visualization 1302 of a map of 3D and image data of the same geographic location as in FIG. 13A, taken at a later point in time than the map of FIG. 13A. In the visualization of FIG. 13B, annotations 1314, 1316 indicate areas of the map that have changed in comparison to the prior map 1301. Notably, the changes detected by the live analytics subsystem 134 and highlighted on the visualization 1302 do not simply reflect changes due to occlusion or frame of reference. Rather, the annotations indicate actual changes in physical features of the real world environment depicted by the map 132.”)
an orientation of the target pose (See at least SAMARASEKERA, ¶ 0037; ¶ 0042)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, HU's motion planning with natural language recognition (determining the goal, and local planning that examines current positions for collision avoidance) with SAMARASEKERA's tracking of poses in 3D space for generating an accurate map, to yield a more effective robot navigation system that accounts for the orientation of the robot rather than merely its position.
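For illustration only (not part of the record): the threshold-based embedding matching that HU describes — Word2Vec-style vectors compared by cosine similarity against location names in a predefined semantic map, with the best match above a predefined threshold taken as the goal — can be sketched as below. The vectors, names, and coordinates here are hypothetical toy values, not any reference's actual data; returning None when nothing clears the threshold also mirrors the "forgoing" behavior addressed at claim 12.

```python
import math

# Hypothetical word vectors standing in for Word2Vec embeddings (illustrative values only).
EMBEDDINGS = {
    "restaurant": [0.9, 0.1, 0.3],
    "cafeteria":  [0.8, 0.2, 0.4],
    "office":     [0.1, 0.9, 0.2],
}

# Hypothetical semantic map in the style of HU's goal grounding: location name -> 2D coordinate.
SEMANTIC_MAP = {
    "cafeteria": (4.0, 7.5),
    "office":    (1.0, 2.0),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ground_goal(noun, threshold=0.8):
    """Return the map coordinate of the location whose name is most similar
    to the extracted noun, or None if no location clears the threshold."""
    query = EMBEDDINGS[noun]
    best_name, best_sim = None, -1.0
    for name in SEMANTIC_MAP:
        sim = cosine_similarity(query, EMBEDDINGS[name])
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return SEMANTIC_MAP[best_name]
    return None  # below threshold: no goal is grounded
```

With these toy vectors, "restaurant" grounds to the cafeteria's coordinate because their vectors are nearly parallel, while raising the threshold close to 1.0 causes grounding to be forgone.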
Regarding claim 5:
HU in view of SAMARASEKERA discloses the limitations within claim 1 and HU does not disclose, but SAMARASEKERA further teaches:
a pose of the plurality of poses is represented using at least three values representing a location and an orientation of the mobile device in the mapped environment. (see at least SAMARASEKERA, ¶ 0003, “In robot navigation technology, cameras and other sensors are used to determine the robot's location and orientation with respect to its surrounding real world environment (i.e., the robot's frame of reference). Computer vision techniques and mathematical computations are performed to interpret digital images of the environment within the robot's frame of reference, generate a mathematical representation of the environment, and generate a mapping of objects in the real world to the mathematical representation of the environment (e.g., a "map"). The robot uses the map to navigate about its environment. In order to navigate, the robot performs mathematical computations to develop a navigational path to a goal location.”; ¶ 0037, “Referring now to FIG. 2, components of an embodiment of the multi-modal navigation and geospatial mapping subsystem 126 of FIG. 1 are shown in more detail. Each of the components shown in FIG. 2 may be embodied as software, hardware, firmware, or a combination thereof. In FIG. 2, the components of the navigation and geospatial mapping subsystem 126 are shown in the context of an environment 201 that may be created during the operation of the computing system 100 (e.g., a physical and/or virtual execution or "runtime" environment). The illustrative navigation and mapping subsystem 126 includes a temporal association module 210, a spatial association module 212, a geospatial association module 214, and a multi-modal localization and mapping module 216 (which includes a 6DOF pose estimation module 218, described in more detail below with reference to FIG. 6). 
The components of the navigation and mapping subsystem 126 operate to maintain accurate position and orientation information for each of the sensors 102, 104, 106, and make temporal, spatial and geospatial associations across the data 112, 114, 116 obtained from the various different sensors 102, 104, 106. The operations performed by the components of the navigation and mapping subsystem 126 allow the map representation 128 to be created over time. The navigation and mapping subsystem 126 is capable of exploiting various combinations of multi-modal observations derived from the data 124 to generate a navigation path and map 128 for the platform 120. For example, the navigation and mapping subsystem 126 can utilize a combination of data obtained from an IMU and video data to estimate an initial navigation path for the platform 120, and then use the relative 6 degrees of freedom (6DOF) poses that are estimated by the 6DOF pose estimation module 218 to generate a 3D map based on data obtained from, e.g., a scanning LIDAR sensor. The 3D LIDAR features can then be further exploited to improve the navigation path and the map previously generated based on, e.g., the IMU and video data.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, HU's motion planning with natural language recognition (determining the goal, and local planning that examines current positions for collision avoidance) with SAMARASEKERA's tracking of poses in 3D space for generating an accurate map, to yield a more effective robot navigation system that accounts for the orientation of the robot rather than merely its position.
Regarding claim 6:
HU in view of SAMARASEKERA discloses the limitations within claim 5 and HU further discloses:
the location is represented using at least two coordinates of a two dimensional coordinate system of the mobile device in the mapped environment and (see at least HU, Pg 5, Section VI, "For both global and local planning, we assume that the robot can accurately localize itself in a predefined global semantic map. We use a state-of-the-art 2D localization technique [24] for the robot localization. The global semantic map is constructed using SLAM (simultaneous localization and mapping) algorithm. The resulting map consists of grid cells which may be one of three types: free space, obstacle, and no information.")
HU does not disclose, but SAMARASEKERA teaches:
the orientation is represented by yaw of the mobile device in the mapped environment. (see at least SAMARASEKERA, ¶ 0039, “Referring now in more detail to the multi-modal geo-spatial data integration module 130 of FIG. 1, components of the illustrative multi-modal geo-spatial data integration module 130 integrate and fuse the multi-modal navigation path and mapping data 128 in a 3D geospatial map. The accurate position estimation of the different sensors 102, 104, 106 with respect to real world coordinates and to the other sensors 102, 104, 106, performed by the navigation and mapping subsystem 126, combined with the accurate synchronization of the sensor data 112, 114, 116 provided by the data capture and synchronization module 122, allows the data integration module 130 to integrate and fuse the data 128 in a 3D geospatial map. For example, a 3D map generated from LIDAR can be used to identify hyper-spectral or RGB video features associated with the 3D surface generated from the map. Illustratively, the data integration module 130 fuses the sensor data 112, 114, 116 by combining the output of the different data sources (e.g., sensors 102, 104, 106) into a single channel or layer of the integrated map. For example, the output of different cameras with overlapping fields of view may be combined into a single RGB overlay, or the LIDAR points may be given color attributes from the overlay of the RGB or Hyper-Spectral imagery. The output of the data integration module 130 is geospatially aligned and integrated multi-modal data 132, which can be accessed for immediate compression and visualization by the compression subsystem 142 and visualization module 146, and/or analyzed by the analytics subsystem 134 and data correlation module 136.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, HU's motion planning with natural language recognition (determining the goal, and local planning that examines current positions for collision avoidance) with SAMARASEKERA's tracking of poses in 3D space for generating an accurate map, to yield a more effective robot navigation system that accounts for the orientation of the robot rather than merely its position.
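For illustration only (not part of the record): the three-value pose representation at issue in claims 5 and 6 — a location given by two coordinates of a 2D map coordinate system plus an orientation given by yaw — can be sketched as follows. The class name and helper method are hypothetical, chosen only to make the representation concrete.

```python
from dataclasses import dataclass
import math

@dataclass
class Pose2D:
    """A pose in a planar mapped environment: two location coordinates plus yaw.

    This is the minimal three-value representation: (x, y) in the map's
    2D coordinate system, and orientation as a single yaw angle.
    """
    x: float
    y: float
    yaw: float  # radians, counterclockwise from the map's +x axis

    def forward(self, distance):
        """Return the pose reached by driving `distance` straight ahead,
        showing why yaw (not just position) determines where motion leads."""
        return Pose2D(
            x=self.x + distance * math.cos(self.yaw),
            y=self.y + distance * math.sin(self.yaw),
            yaw=self.yaw,
        )
```

Two robots at the same (x, y) but different yaw values reach different cells after the same forward motion, which is the practical point of tracking orientation in addition to position.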
Regarding claim 12:
HU in view of SAMARASEKERA discloses the limitations within claim 1 and HU further discloses:
in accordance with identifying less than the threshold similarity of the embedding to the plurality of embeddings of the multi-modal node graph, forgoing moving the mobile device to the location and the orientation of the target pose in the environment. (see at least HU, Pg 5, Section V, "Once we label the objects in the current scene, we compute the similarity between the nouns that occur in the constraint phrase and those object labels discovered by the scene understanding. Again, we use the Word2Vec embedding network [22], [23] to embed all words in the vector space and compute the cosine similarity between word vectors of the constraint object and objects detected in the instance segmentation. If the similarity is larger than a predefined threshold, we add this constraint to the motion planner. ")
Regarding claim 15:
Claim 15 is the non-transitory computer-readable storage medium claim corresponding to method claim 1. It is substantially similar to claim 1 and is therefore rejected using the same references and rationale.
Regarding claim 16:
Claim 16 is the mobile device claim corresponding to method claim 1. It is substantially similar to claim 1 and is therefore rejected using the same references and rationale.
Regarding claim 20:
Claim 20 is substantially similar to claim 5 and is therefore rejected using the same references and rationale.
Claims 2, 3, 7-9, 11, 14, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over HU (Z. Hu, J. Pan, T. Fan, R. Yang and D. Manocha, "Safe Navigation With Human Instructions in Complex Scenes," in IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 753-760, April 2019) in view of SAMARASEKERA (US 20150269438 A1) in further view of ZHANG C. (US 20230118864 A1).
Regarding claim 2:
HU in view of SAMARASEKERA discloses the limitations within claim 1 and HU further discloses:
the multi-modal model generates embeddings in a shared dimensionality space for one (see at least HU, Pg 4, Section IV, "Next, we extract the representation for the sequential data E using LSTM, which provides an elegant way of modeling sequential data. But in standard LSTM, the information encoded in the inputs can only flow in one direction and the future hidden unit cannot affect the current unit. To overcome this drawback, we employ a bi-directional LSTM layer [21], which can be trained using all the available inputs from two directions to improve the prediction performance. A bi-directional LSTM consists of a forward and backward LSTM. The forward LSTM reads the input embedded vector sequence from e_1 to e_T and computes a sequence of forward hidden states (h⃗_1, …, h⃗_T), where h⃗_i ∈ R^p and p is the dimensionality of hidden states. The backward LSTM reads the embedded vector sequence in the reverse order, i.e., from e_T to e_1, resulting in a sequence of backward hidden states (h⃖_T, …, h⃖_1), where h⃖_T ∈ R^p. By adding the forward and backward states h⃗_i and h⃖_i, we can obtain the final latent vector representation as h_i = h⃗_i + h⃖_i, where h_i ∈ R^p."; Pg 4, Section V, "For goal grounding, we compute the goal location by extracting the noun from the goal phrase and then computing the similarity between this noun and the location names in the predefined semantic map. The similarity is computed as the cosine similarity between the embedded vectors of two given word items. The embedding is accomplished using the Word2Vec embedding network [22], [23] that can convert a word into a vector based on the Wiki corpus. We use the 2D coordinate of the location that has the highest similarity with the noun in the goal phrase as the goal location.")
HU does not disclose, but ZHANG C. teaches:
or more language inputs or one or more image inputs. (see at least ZHANG C., ¶ 0190, “Consequently, given N D-dimension local pixel descriptors as input, and K automatically found cluster centres, the output of ‘NetVLAD’ at each pixel is a K×D matrix based on the descriptor distance to each cluster center in the feature space. In ‘NetVLAD’ all pixel matrices are simply aggregated into a single matrix. This matrix is then flattened and used as the image embedding for image retrieval.”; ¶ 0207, “In step 207 a graph embedding is generated by translating the generated scene graph into a vector that can be used for subsequent comparisons (e.g. for image recognition). Optionally, the graph embedding is generated by a graph embedding network.”; ¶ 0244, “In an example a reference set of images is maintained. The reference set of images comprises a plurality of images. Optionally each image is associated with location information. A graph embedding is obtained for each image in the reference set once the system has been trained (e.g. once the graph embedding network has been trained). This is achieved by providing the image as an input in step 201, and obtaining the output from step 207 in FIG. 2. The graph embedding associated with each image is stored as part of the reference set.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, HU in view of SAMARASEKERA's motion planning with natural language recognition (determining the goal, and local planning that examines current positions for collision avoidance) and tracking of 3D poses, with ZHANG C.'s use of images to create graph embeddings for mapping, to effectively yield a robot capable of mapping the given environment utilizing onboard images.
Regarding claim 3:
HU in view of SAMARASEKERA in further view of ZHANG C. discloses the limitations within claim 2 and HU further discloses:
a second respective embedding of the embeddings output by the multi-modal model for a second respective input including a text description of the respective object (see at least HU, Pg 3, Section III, "We train a Long Short-Term Memory (LSTM) network [19] to classify phrases into those three types. The LSTM is suitable for our task because the classification output does not depend only on individual words but also on the meaning of the entire phrase."; Pg 4, Section IV, "More specifically, as shown in Figure 3, our classification subsystem contains four layers: the embedding layer, the bidirectional LSTM layer, the attention layer, and the output layer. The embedding layer transforms a phrase S with length T: S = {x1, x2, ..., xT} into a list of real-valued vectors E = {e1, e2, ..., eT}. We transform a word xi into the embedding vector through an embedding matrix We: ei = We vi, where vi is a vector of value 1 at index ei and 0 otherwise. Next, we extract the representation for the sequential data E using LSTM, which provides an elegant way of modeling sequential data. But in standard LSTM, the information encoded in the inputs can only flow in one direction and the future hidden unit cannot affect the current unit. To overcome this drawback, we employ a bi-directional LSTM layer [21], which can be trained using all the available inputs from two directions to improve the prediction performance. A bi-directional LSTM consists of a forward and backward LSTM. The forward LSTM reads the input embedded vector sequence from e1 to eT and computes a sequence of forward hidden states (→h1, ..., →hT), where →hi ∈ R^p and p is the dimensionality of hidden states. The backward LSTM reads the embedded vector sequence in the reverse order, i.e., from eT to e1, resulting in a sequence of backward hidden states (←hT, ..., ←h1), where ←hi ∈ R^p.
By adding the forward and backward states →hi and ←hi, we can obtain the final latent vector representation as hi = →hi + ←hi, where hi ∈ R^p.")
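For clarity of the record, the summation of forward and backward hidden states described in the quoted passage can be illustrated with the following minimal sketch. The function name and the placeholder integer vectors are the examiner's illustrative assumptions, not trained LSTM outputs from HU:

```python
def combine_bidirectional(forward_states, backward_states):
    # Sum the forward and backward hidden states elementwise, giving one
    # latent vector per input position: hi = (forward hi) + (backward hi).
    return [[f + b for f, b in zip(fh, bh)]
            for fh, bh in zip(forward_states, backward_states)]

# Placeholder hidden states for a 2-word phrase with p = 3
# (illustrative integers only, not real LSTM states).
forward = [[1, 2, 3], [4, 5, 6]]
backward = [[6, 5, 4], [3, 2, 1]]
print(combine_bidirectional(forward, backward))  # -> [[7, 7, 7], [7, 7, 7]]
```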
are generated in the shared dimensionality space and have a similarity above the threshold similarity. (see at least HU, Pg 4, Section IV, "Next, we extract the representation for the sequential data E using LSTM, which provides an elegant way of modeling sequential data. But in standard LSTM, the information encoded in the inputs can only flow in one direction and the future hidden unit cannot affect the current unit. To overcome this drawback, we employ a bi-directional LSTM layer [21], which can be trained using all the available inputs from two directions to improve the prediction performance. A bi-directional LSTM consists of a forward and backward LSTM. The forward LSTM reads the input embedded vector sequence from e1 to eT and computes a sequence of forward hidden states (→h1, ..., →hT), where →hi ∈ R^p and p is the dimensionality of hidden states. The backward LSTM reads the embedded vector sequence in the reverse order, i.e., from eT to e1, resulting in a sequence of backward hidden states (←hT, ..., ←h1), where ←hi ∈ R^p. By adding the forward and backward states →hi and ←hi, we can obtain the final latent vector representation as hi = →hi + ←hi, where hi ∈ R^p."; Pg 5, Section V, "Once we label the objects in the current scene, we compute the similarity between the nouns that occur in the constraint phrase and those object labels discovered by the scene understanding. Again, we use the Word2Vec embedding network [22], [23] to embed all words in the vector space and compute the cosine similarity between word vectors of the constraint object and objects detected in the instance segmentation. If the similarity is larger than a predefined threshold, we add this constraint to the motion planner.")
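The constraint-grounding step quoted from HU (embed words, compute cosine similarity, compare against a predefined threshold) can be sketched as follows. The toy 3-dimension vectors and the 0.7 threshold are illustrative assumptions standing in for Word2Vec output; HU does not specify these values:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ground_constraint(constraint_vec, detected_objects, threshold=0.7):
    # Keep only the detected objects whose word vector is similar enough
    # to the constraint noun; these constraints go to the motion planner.
    return [label for label, vec in detected_objects.items()
            if cosine_similarity(constraint_vec, vec) > threshold]

# Toy embeddings standing in for Word2Vec vectors (illustrative only).
detected = {"person": [0.9, 0.1, 0.0], "chair": [0.0, 0.8, 0.6]}
print(ground_constraint([1.0, 0.0, 0.0], detected))  # -> ['person']
```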
HU does not disclose, but ZHANG C. teaches:
a first respective embedding of the embeddings output by the multi-modal model for a first respective input including an image of a respective object and (see at least ZHANG C., ¶ 0190, “Consequently, given N D-dimension local pixel descriptors as input, and K automatically found cluster centres, the output of ‘NetVLAD’ at each pixel is a K×D matrix based on the descriptor distance to each cluster center in the feature space. In ‘NetVLAD’ all pixel matrices are simply aggregated into a single matrix. This matrix is then flattened and used as the image embedding for image retrieval.”; ¶ 0207, “In step 207 a graph embedding is generated by translating the generated scene graph into a vector that can be used for subsequent comparisons (e.g. for image recognition). Optionally, the graph embedding is generated by a graph embedding network.”; ¶ 0244, “In an example a reference set of images is maintained. The reference set of images comprises a plurality of images. Optionally each image is associated with location information. A graph embedding is obtained for each image in the reference set once the system has been trained (e.g. once the graph embedding network has been trained). This is achieved by providing the image as an input in step 201, and obtaining the output from step 207 in FIG. 2. The graph embedding associated with each image is stored as part of the reference set.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3d poses within HU in view of SAMARASEKERA to include using images to create vector graph embeddings of vector nodes for mapping within ZHANG C. to effectively yield a robot capable of mapping the given environment utilizing onboard images.
Regarding claim 7:
HU in view of SAMARASEKERA discloses the limitations within claim 1 and HU does not disclose, but ZHANG C. teaches:
the one or more input devices include one or more image sensors, the method further comprising: receiving, via the one or more image sensors, an image input; and (see at least ZHANG C., ¶ 0035, “In an embodiment the image is a visual representation of the scene, the image of the scene is divided into a plurality of pixels and wherein the information identifying the image comprises a value associated with each pixel (e.g. an RGB value), where the value for each pixel is generated by taking a picture of the scene using a camera and obtaining the value by reading a sensor within the camera.”)
generating, using an image encoder of the multi-modal model, an embedding corresponding to the image input. (see at least ZHANG C., ¶ 0190, “Consequently, given N D-dimension local pixel descriptors as input, and K automatically found cluster centres, the output of ‘NetVLAD’ at each pixel is a K×D matrix based on the descriptor distance to each cluster center in the feature space. In ‘NetVLAD’ all pixel matrices are simply aggregated into a single matrix. This matrix is then flattened and used as the image embedding for image retrieval.”; ¶ 0207, “In step 207 a graph embedding is generated by translating the generated scene graph into a vector that can be used for subsequent comparisons (e.g. for image recognition). Optionally, the graph embedding is generated by a graph embedding network.”; ¶ 0244, “In an example a reference set of images is maintained. The reference set of images comprises a plurality of images. Optionally each image is associated with location information. A graph embedding is obtained for each image in the reference set once the system has been trained (e.g. once the graph embedding network has been trained). This is achieved by providing the image as an input in step 201, and obtaining the output from step 207 in FIG. 2. The graph embedding associated with each image is stored as part of the reference set.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3d poses within HU in view of SAMARASEKERA to include using images to create vector graph embeddings of vector nodes for mapping within ZHANG C. to effectively yield a robot capable of mapping the given environment utilizing onboard images.
Regarding claim 8:
HU in view of SAMARASEKERA discloses the limitations within claim 1 and HU further discloses:
the mapped environment is mapped using the multi-modal model; (see at least HU, Pg 5, Section VI, "For both global and local planning, we assume that the robot can accurately localize itself in a predefined global semantic map. We use a state-of-the-art 2D localization technique [24] for the robot localization. The global semantic map is constructed using SLAM (simultaneous localization and mapping) algorithm. The resulting map consists of grid cells which may be one of three types: free space, obstacle, and no information."; Pg 5, Section VI, 3) Dynamic Obstacle Avoidance, "After global planning and local planning, the robot can avoid static obstacles (e.g., walls, heavy desks) that appear in the global semantic map and constraint obstacles that appear in the human instructions. However, there exist some other obstacles in the navigation environment, including static obstacles like chairs and non-static obstacles like carts and dogs, which appear neither in the human instruction nor in the semantic map. To guarantee the robot to avoid these obstacles reliably, our navigation algorithm combines the output of the local planning with a reinforcement learning-based local collision avoidance policy developed in our previous work [20]."; Pg 7, Section VII, 2) Real-world Environment, "For the real-world experiment, we first run the SLAM algorithm [24] around our lab to construct a global map, and then manually annotate this map with semantic information. The resulting semantic maps are shown in Figure 11, including semantic locations such as restaurants, information desks, laboratory, lifts, hall, rest regions, and workstations. Our navigation algorithm can enable a robot to follow human's navigation instructions successfully in these scenarios. Please refer to the video for more details.")
HU does not disclose, but ZHANG C. teaches:
mapping the environment includes generating a plurality of nodes of the multi-modal node graph; and (see at least ZHANG C., ¶ 0022, “According to a first aspect there is provided a computer-implemented method for place recognition. The method comprising: obtaining information identifying an image of a first scene; identifying a plurality of pixel clusters in the information identifying the image, the plurality of pixel clusters comprising: a first pixel cluster; and a second pixel cluster; generating a set of feature vectors from the information identifying the image, the set of feature vectors comprising: a first feature vector associated with the first pixel cluster; and a second feature vector associated with the second pixel cluster; generating a graph of the scene, the graph comprising: a first node representing the first pixel cluster, the first node associated with the first feature vector; and a second node representing the second pixel cluster, the second node associated with the second feature vector; adding a first edge between the first node and the second node in response to determining that a first property associated with the first pixel cluster is similar to a second property associated with the second pixel cluster; generating a vector representation of the graph; calculating a measure of similarity between the vector representation of the graph and a reference vector representation, wherein the reference vector representation is associated with a second scene; and determining that the first scene and the second scene are associated with a same place in response to determining that the measure of similarity is less than a threshold.”; ¶ 0209, “The encoder 702 is configured to map node and edge features from the scene graph generated in step 206 to initial node and edge vectors h.sub.i.sup.o and e.sub.ij.sup.o respectively, where h.sub.i.sup.o is the initial (i.e.
time=0) node vector associated with the i-th node, where i ∈ V, V being the set of nodes that form the graph. The initial node vector h.sub.i.sup.o is set equal to the feature vector x.sub.i associated with the i-th node (i.e. h.sub.i.sup.o=x.sub.i), while the edge vector e.sub.ij.sup.o (representing the edge between nodes i and j) is unused.”)
generating the plurality of nodes of the multi-modal node graph includes generating, using an image encoder of the multi-modal model, a respective embedding corresponding to a respective pose corresponding to a respective image. (see at least ZHANG C., ¶ 0002, “Visual place recognition comprises determining a camera's location given its current view. Place recognition is an important problem in computer vision and robotics and is applicable to a wide range of applications including, but not limited to, autonomous driving and augmented reality.”; ¶ 0022; ¶ 0207, “In step 207 a graph embedding is generated by translating the generated scene graph into a vector that can be used for subsequent comparisons (e.g. for image recognition). Optionally, the graph embedding is generated by a graph embedding network.”; ¶ 0209)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3d poses within HU in view of SAMARASEKERA to include using images to create vector graph embeddings of vector nodes for mapping within ZHANG C. to effectively yield a robot capable of mapping the given environment utilizing onboard images.
Regarding claim 9:
HU in view of SAMARASEKERA in further view of ZHANG C. discloses the limitations within claim 8 and HU does not disclose, but ZHANG C. teaches:
generating the plurality of nodes of the multi-modal node graph includes: (see at least ZHANG C., ¶ 0022, “According to a first aspect there is provided a computer-implemented method for place recognition. The method comprising: obtaining information identifying an image of a first scene; identifying a plurality of pixel clusters in the information identifying the image, the plurality of pixel clusters comprising: a first pixel cluster; and a second pixel cluster; generating a set of feature vectors from the information identifying the image, the set of feature vectors comprising: a first feature vector associated with the first pixel cluster; and a second feature vector associated with the second pixel cluster; generating a graph of the scene, the graph comprising: a first node representing the first pixel cluster, the first node associated with the first feature vector; and a second node representing the second pixel cluster, the second node associated with the second feature vector; adding a first edge between the first node and the second node in response to determining that a first property associated with the first pixel cluster is similar to a second property associated with the second pixel cluster; generating a vector representation of the graph; calculating a measure of similarity between the vector representation of the graph and a reference vector representation, wherein the reference vector representation is associated with a second scene; and determining that the first scene and the second scene are associated with a same place in response to determining that the measure of similarity is less than a threshold.”; ¶ 0209, “The encoder 702 is configured to map node and edge features from the scene graph generated in step 206 to initial node and edge vectors h.sub.i.sup.o and e.sub.ij.sup.o respectively, where h.sub.i.sup.o is the initial (i.e. time=0) node vector associated with the i-th node, where i ∈ V, V being the set of nodes that form the graph. 
The initial node vector h.sub.i.sup.o is set equal to the feature vector x.sub.i associated with the i-th node (i.e. h.sub.i.sup.o=x.sub.i), while the edge vector e.sub.ij.sup.o (representing the edge between nodes i and j) is unused.”)
receiving a first embedding output by the image encoder of the multi-modal model at a first pose; (see at least ZHANG C., ¶ 0002, “Visual place recognition comprises determining a camera's location given its current view. Place recognition is an important problem in computer vision and robotics and is applicable to a wide range of applications including, but not limited to, autonomous driving and augmented reality.”; ¶ 0207, “In step 207 a graph embedding is generated by translating the generated scene graph into a vector that can be used for subsequent comparisons (e.g. for image recognition). Optionally, the graph embedding is generated by a graph embedding network.”; ¶ 0209)
in accordance with a determination that the first pose has less than a threshold similarity with the plurality of poses at the plurality of nodes of the multi-modal node graph, adding a new node to the multi-modal node graph with the first embedding and the first pose; and (see at least ZHANG C., ¶ 0097, “According to a seventh aspect there is provided an apparatus for place recognition, the apparatus configured to: obtain information identifying an image of a first scene; identify a plurality of pixel clusters in the information identifying the image, the plurality of pixel clusters comprising: a first pixel cluster; and a second pixel cluster; generate a set of feature vectors from the information identifying the image, the set of feature vectors comprising: a first feature vector associated with the first pixel cluster; and a second feature vector associated with the second pixel cluster; generate a graph of the scene, the graph comprising: a first node representing the first pixel cluster, the first node associated with the first feature vector; and a second node representing the second pixel cluster, the second node associated with the second feature vector; add a first edge between the first node and the second node in response to determining that a first property associated with the first pixel cluster is similar to a second property associated with the second pixel cluster; generate a vector representation of the graph; calculate a measure of similarity between the vector representation of the graph and a reference vector representation, wherein the reference vector representation is associated with a second scene; and determine that the first scene and the second scene are associated with a same place in response to determining that the measure of similarity is less than a threshold.”; ¶ 0289, “Representing the scene as graph has further advantages. For example, representing the scene as a graph allows easier manipulation such that nodes (i.e. 
objects) can be added or removed from the scene graph representation of the input image.”; ¶ 0290, “In an example there is a method comprising modifying a scene graph (e.g. generated by step 206) to add additional nodes (associated with an additional object) and edges connecting the node to other nodes of the graph. In this way, the scene graph can be modified/augmented to include an object that was not present in the image obtained in step 201. Adding edges between nodes controls the position of the additional object within the scene. Additionally or alternatively the scene graph generated by step 206 is modified before generating a graph embedding (e.g. in step 207) to remove nodes from the graph representation. Removing nodes may be advantageous where, for example, the image data obtained in step 201 is known to include a temporary object that would likely not be present in the reference set. Consequently, by removing this node the associated object will not be taken into account when performing similarity comparisons with graph embeddings in the reference set. Optionally the modified/augmented is generated based on the output of step 206 (generating a scene graph) and provided to the input of step 207 (generating a graph embedding).”)
in accordance with a determination that the first embedding has a threshold similarity with a second embedding at a node of the multi-modal node graph corresponding to the first pose, forgoing adding the new node to the multi-modal node graph with the first embedding and the first pose. (see at least ZHANG C., ¶ 0097; ¶ 0289; ¶ 0290)
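The add-or-forgo logic recited in claim 9 can be illustrated with the following hypothetical sketch: a node is added only when the new pose is far from every existing node, and adding is forgone when a nearby node already holds a similar embedding. The function names, the 2D pose representation, and both threshold values are the examiner's illustrative assumptions, not taught values:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def maybe_add_node(graph, pose, embedding, pose_thresh=1.0, embed_thresh=0.9):
    # Forgo adding when a node with a similar pose already holds a
    # similar embedding; otherwise add a new node with this pose.
    for node in graph:
        if math.dist(node["pose"], pose) < pose_thresh and \
           cosine_similarity(node["embedding"], embedding) >= embed_thresh:
            return False  # forgo: nothing meaningful has changed
    graph.append({"pose": pose, "embedding": embedding})
    return True

graph = []
print(maybe_add_node(graph, (0.0, 0.0), [1.0, 0.0]))  # -> True (new place)
print(maybe_add_node(graph, (0.1, 0.0), [1.0, 0.0]))  # -> False (duplicate)
```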
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3d poses within HU in view of SAMARASEKERA to include using images to create vector graph embeddings of vector nodes with threshold comparisons for determining meaningful changes for mapping within ZHANG C. to effectively yield a robot capable of mapping the given environment utilizing collected images.
Regarding claim 11:
HU in view of SAMARASEKERA in further view of ZHANG C. discloses the limitations within claim 8 and HU further discloses:
while moving the mobile device to the location and the orientation of the target pose in the mapped environment: (see at least HU, Pg 1, Abstract, "In this paper, we present a robotic navigation algorithm with natural language interfaces, which enables a robot to safely walk through a changing environment with moving persons by following human instructions such as “go to the restaurant and keep away from people”."; Pg 3, Section III, "To dynamically add the target object in a constraint into the planner’s costmap, we use an object detection module to dynamically look for the object mentioned in the constraint phrase and then output the object’s 3D location in the robot’s local coordinate system. To deal with objects that are moving (like “people”) or objects that are occluded, we will perform the grounding in a dynamic manner along with the navigation. That is, whenever a constraint object occurs in the camera frame, we add it to the local planner’s costmap. Finally, our costmap motion planner will compute a suitable trajectory based on the costmap which is updated in real-time according to the grounded constraint result. To improve the navigation performance in crowd scenarios, we further use the planning result to guide a reinforcement learning based local collision avoidance approach developed in our previous work [20]. An overview of our proposed navigation system is shown in Figure 2.")
HU does not disclose, but SAMARASEKERA teaches:
orientation of the target pose (see at least SAMARASEKERA, ¶ 0037, “Referring now to FIG. 2, components of an embodiment of the multi-modal navigation and geospatial mapping subsystem 126 of FIG. 1 are shown in more detail. Each of the components shown in FIG. 2 may be embodied as software, hardware, firmware, or a combination thereof. In FIG. 2, the components of the navigation and geospatial mapping subsystem 126 are shown in the context of an environment 201 that may be created during the operation of the computing system 100 (e.g., a physical and/or virtual execution or "runtime" environment). The illustrative navigation and mapping subsystem 126 includes a temporal association module 210, a spatial association module 212, a geospatial association module 214, and a multi-modal localization and mapping module 216 (which includes a 6DOF pose estimation module 218, described in more detail below with reference to FIG. 6). The components of the navigation and mapping subsystem 126 operate to maintain accurate position and orientation information for each of the sensors 102, 104, 106, and make temporal, spatial and geospatial associations across the data 112, 114, 116 obtained from the various different sensors 102, 104, 106. The operations performed by the components of the navigation and mapping subsystem 126 allow the map representation 128 to be created over time. The navigation and mapping subsystem 126 is capable of exploiting various combinations of multi-modal observations derived from the data 124 to generate a navigation path and map 128 for the platform 120. For example, the navigation and mapping subsystem 126 can utilize a combination of data obtained from an IMU and video data to estimate an initial navigation path for the platform 120, and then use the relative 6 degrees of freedom (6DOF) poses that are estimated by the 6DOF pose estimation module 218 to generate a 3D map based on data obtained from, e.g., a scanning LIDAR sensor. 
The 3D LIDAR features can then be further exploited to improve the navigation path and the map previously generated based on, e.g., the IMU and video data.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance within HU to include the tracking of poses (3d space) for generating an accurate map within SAMARASEKERA to yield a more effective robot navigation system that can account for the orientation of the robot rather than merely the position.
HU in view of SAMARASEKERA does not disclose, but ZHANG C. teaches:
receiving a first embedding output by the image encoder of the multi-modal model at a first pose; (see at least ZHANG C., ¶ 0190, “Consequently, given N D-dimension local pixel descriptors as input, and K automatically found cluster centres, the output of ‘NetVLAD’ at each pixel is a K×D matrix based on the descriptor distance to each cluster center in the feature space. In ‘NetVLAD’ all pixel matrices are simply aggregated into a single matrix. This matrix is then flattened and used as the image embedding for image retrieval.”; ¶ 0207, “In step 207 a graph embedding is generated by translating the generated scene graph into a vector that can be used for subsequent comparisons (e.g. for image recognition). Optionally, the graph embedding is generated by a graph embedding network.”; ¶ 0244, “In an example a reference set of images is maintained. The reference set of images comprises a plurality of images. Optionally each image is associated with location information. A graph embedding is obtained for each image in the reference set once the system has been trained (e.g. once the graph embedding network has been trained). This is achieved by providing the image as an input in step 201, and obtaining the output from step 207 in FIG. 2. The graph embedding associated with each image is stored as part of the reference set.”)
in accordance with a determination that the first pose has a threshold similarity with a second pose at a respective node of the plurality of nodes of the multi-modal node graph and the first embedding has less than a threshold similarity with a second embedding at the respective node, updating the respective node to include the first embedding at the first pose; and (see at least ZHANG C., ¶ 0097, “According to a seventh aspect there is provided an apparatus for place recognition, the apparatus configured to: obtain information identifying an image of a first scene; identify a plurality of pixel clusters in the information identifying the image, the plurality of pixel clusters comprising: a first pixel cluster; and a second pixel cluster; generate a set of feature vectors from the information identifying the image, the set of feature vectors comprising: a first feature vector associated with the first pixel cluster; and a second feature vector associated with the second pixel cluster; generate a graph of the scene, the graph comprising: a first node representing the first pixel cluster, the first node associated with the first feature vector; and a second node representing the second pixel cluster, the second node associated with the second feature vector; add a first edge between the first node and the second node in response to determining that a first property associated with the first pixel cluster is similar to a second property associated with the second pixel cluster; generate a vector representation of the graph; calculate a measure of similarity between the vector representation of the graph and a reference vector representation, wherein the reference vector representation is associated with a second scene; and determine that the first scene and the second scene are associated with a same place in response to determining that the measure of similarity is less than a threshold.”; ¶ 0289, “Representing the scene as graph has further advantages. 
For example, representing the scene as a graph allows easier manipulation such that nodes (i.e. objects) can be added or removed from the scene graph representation of the input image.”; ¶ 0290, “In an example there is a method comprising modifying a scene graph (e.g. generated by step 206) to add additional nodes (associated with an additional object) and edges connecting the node to other nodes of the graph. In this way, the scene graph can be modified/augmented to include an object that was not present in the image obtained in step 201. Adding edges between nodes controls the position of the additional object within the scene. Additionally or alternatively the scene graph generated by step 206 is modified before generating a graph embedding (e.g. in step 207) to remove nodes from the graph representation. Removing nodes may be advantageous where, for example, the image data obtained in step 201 is known to include a temporary object that would likely not be present in the reference set. Consequently, by removing this node the associated object will not be taken into account when performing similarity comparisons with graph embeddings in the reference set. Optionally the modified/augmented is generated based on the output of step 206 (generating a scene graph) and provided to the input of step 207 (generating a graph embedding).”)
in accordance with a determination that the first embedding has a threshold similarity with a second embedding at the respective node of the plurality of nodes of the multi-modal node graph corresponding to the first pose, forgoing updating the respective node. (see at least ZHANG C., ¶ 0097; ¶ 0289; ¶ 0290)
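The update-or-forgo logic recited in claim 11 can be illustrated with the following hypothetical sketch: when the new pose matches an existing node but the stored embedding has drifted below the similarity threshold, the node's embedding is replaced; when the stored embedding is still similar, updating is forgone. The function name, the 2D pose representation, and both thresholds are illustrative assumptions:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def maybe_update_node(node, pose, embedding, pose_thresh=1.0, embed_thresh=0.9):
    # Replace the stored embedding only when the pose matches this node
    # and the new embedding differs meaningfully from the stored one.
    if (math.dist(node["pose"], pose) < pose_thresh
            and cosine_similarity(node["embedding"], embedding) < embed_thresh):
        node["embedding"] = embedding
        return True
    return False  # forgo updating

node = {"pose": (0.0, 0.0), "embedding": [1.0, 0.0]}
print(maybe_update_node(node, (0.1, 0.0), [0.0, 1.0]))  # -> True (scene changed)
print(maybe_update_node(node, (0.1, 0.0), [0.0, 1.0]))  # -> False (already current)
```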
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3d poses within HU in view of SAMARASEKERA to include using images to create vector graph embeddings of vector nodes with threshold comparisons for determining meaningful changes for mapping within ZHANG C. to effectively yield a robot capable of mapping the given environment utilizing onboard images.
Regarding claim 14:
HU in view of SAMARASEKERA in further view of ZHANG C. discloses the limitations within claim 8 and HU further discloses:
the one or more input devices (see at least HU, Pg 1, Section III, "Our approach contains four steps: pre-processing, phrase classification, goal and constraint grounding, and motion planning. First, the pre-processing step takes the command sentence as the input and divides it into phrases according to conjunctions and commas. Next, we assume that each phrase has one of three labels: goal, constraint, or uninformative phrase. For example, the phrase “go to the restaurant” is a goal phrase, “keep away from people” is a constraint phrase, and the phrase “you know” is uninformative. We train a Long Short-Term Memory (LSTM) network [19] to classify phrases into those three types. The LSTM is suitable for our task because the classification output does not depend only on individual words but also on the meaning of the entire phrase. After recognizing the constraint and goal phrases, we need to ground them with the physical world. In particular, we ground the goal phrase by computing the similarity between the noun extracted from the goal phrase and the location name in the predefined semantic map, and use the most similar location as the goal configuration for the robot navigation."; Pg 4, Section V, "For goal grounding, we compute the goal location by extracting the noun from the goal phrase and then computing the similarity between this noun and the location names in the predefined semantic map. The similarity is computed as the cosine similarity between the embedded vectors of two given word items. The embedding is accomplished using the Word2Vec embedding network [22], [23] that can convert a word into a vector based on the Wiki corpus. We use the 2D coordinate of the location that has the highest similarity with the noun in the goal phrase as the goal location.")
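The goal-grounding step quoted from HU (take the location name with the highest cosine similarity to the goal noun) reduces to an argmax, sketched below. The toy embeddings stand in for Word2Vec vectors and are illustrative assumptions, not values from HU:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ground_goal(goal_noun_vec, semantic_map):
    # Choose the map location whose name embedding is most similar to
    # the goal noun (argmax over cosine similarity), per HU's grounding.
    return max(semantic_map,
               key=lambda name: cosine_similarity(goal_noun_vec,
                                                  semantic_map[name]))

# Toy embeddings standing in for Word2Vec vectors of location names.
semantic_map = {"restaurant": [0.9, 0.1], "laboratory": [0.1, 0.9]}
print(ground_goal([1.0, 0.0], semantic_map))  # -> restaurant
```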
HU does not disclose, but SAMARASEKERA teaches:
includes an image capture device, (see at least SAMARASEKERA, ¶ 0030, “In operation, the platform 120 receives sensor data streams 112, 114, 116 from a number of sensors 102, 104, 106. The sensors 102, 104, 106 are of "N" different sensor types (where "N" is a positive integer). For example, the sensor 102 may be embodied as a two-dimensional (2D) image sensor (e.g., a 2D still or video camera); the sensor 104 may be embodied as a 3D image sensor (e.g., LIDAR), and the sensor 106 may be embodied as an inertial sensor such as an inertial measurement unit (IMU) (including, e.g., an accelerometer and a gyroscope) or another type of sensor capable of producing motion, location, and/or orientation data. In the illustrative embodiments, the sensor 106 is a "low-end" IMU, such as a Micro-Electro-Mechanical Systems (MEMS) IMU that is readily commercially available and thus suitable for use in connection with a wide range of consumer-grade applications. In other embodiments, the sensors 102, 104, 106 may include a "higher-end" IMU and GPS-based integrated navigation system.”)
a motion sensor, or an odometry sensor. (see at least SAMARASEKERA, ¶ 0030)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance within HU to include the tracking of poses (3D space) using an image sensor and IMU for generating an accurate map within SAMARASEKERA to yield a more effective robot navigation system that can account for the orientation of the robot rather than merely its position.
Regarding claim 17:
Claim 17 is substantially similar to claim 2 and is therefore rejected using the same references and rationale.
Regarding claim 18:
Claim 18 is substantially similar to claim 3 and is therefore rejected using the same references and rationale.
Claims 4 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over HU (Z. Hu, J. Pan, T. Fan, R. Yang and D. Manocha, "Safe Navigation With Human Instructions in Complex Scenes," in IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 753-760, April 2019) in view of SAMARASEKERA (US 20150269438 A1) in further view of RADFORD (A. Radford, J. Kim, C. Hallacy, A. Ramesh, et al., “Learning Transferable Visual Models From Natural Language Supervision.”, International Conference on Machine Learning, 2021).
Regarding claim 4:
HU in view of SAMARASEKERA discloses the limitations within claim 1. HU does not disclose, but RADFORD teaches:
the multi-modal model includes a contrastive language image pre-training model. (see at least RADFORD, Pg 3, Figure 2; Pg 2 Col 2 - Pg 3 Col 1, "A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. 
We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.")
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify, with a reasonable expectation of success, the model for motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3D poses within HU in view of SAMARASEKERA to utilize a contrastive language image pre-training (CLIP) model within RADFORD to yield a more efficient planning model that scales better when working on a training set of millions of images, as taught by RADFORD.
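As a rough sketch of the contrastive objective RADFORD names (CLIP pairs each image with its caption and trains with a symmetric cross-entropy over an image-text similarity matrix), the toy version below operates on a precomputed similarity matrix; it is a simplified assumption-level illustration, not the paper's full implementation with encoders and a learned temperature:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clip_contrastive_loss(sim):
    """Symmetric cross-entropy over an N x N image-text similarity matrix:
    row i scores image i against all N texts, and matched (image, text)
    pairs sit on the diagonal."""
    n = len(sim)
    # Image -> text direction: each row should peak on its diagonal entry.
    loss_i = -sum(math.log(softmax(sim[i])[i]) for i in range(n)) / n
    # Text -> image direction: each column should peak on its diagonal entry.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    loss_t = -sum(math.log(softmax(cols[j])[j]) for j in range(n)) / n
    return (loss_i + loss_t) / 2
```

A similarity matrix concentrated on the diagonal (correct pairings) yields a lower loss than a uniform one, which is the signal that drives the pre-training.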
Regarding claim 19:
Claim 19 is substantially similar to claim 4 and is therefore rejected using the same references and rationale.
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over HU (Z. Hu, J. Pan, T. Fan, R. Yang and D. Manocha, "Safe Navigation With Human Instructions in Complex Scenes," in IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 753-760, April 2019) in view of SAMARASEKERA (US 20150269438 A1) in further view of ZHANG C. (US 20230118864 A1) in further view of XIE (US 20210097739 A1).
Regarding claim 10:
HU in view of SAMARASEKERA in further view of ZHANG C. (US 20230118864 A1) discloses the limitations within claim 8. HU does not disclose, but ZHANG C. teaches:
the generating the plurality of nodes of the multi-modal node graph includes: (see at least ZHANG C., ¶ 0022, “According to a first aspect there is provided a computer-implemented method for place recognition. The method comprising: obtaining information identifying an image of a first scene; identifying a plurality of pixel clusters in the information identifying the image, the plurality of pixel clusters comprising: a first pixel cluster; and a second pixel cluster; generating a set of feature vectors from the information identifying the image, the set of feature vectors comprising: a first feature vector associated with the first pixel cluster; and a second feature vector associated with the second pixel cluster; generating a graph of the scene, the graph comprising: a first node representing the first pixel cluster, the first node associated with the first feature vector; and a second node representing the second pixel cluster, the second node associated with the second feature vector; adding a first edge between the first node and the second node in response to determining that a first property associated with the first pixel cluster is similar to a second property associated with the second pixel cluster; generating a vector representation of the graph; calculating a measure of similarity between the vector representation of the graph and a reference vector representation, wherein the reference vector representation is associated with a second scene; and determining that the first scene and the second scene are associated with a same place in response to determining that the measure of similarity is less than a threshold.”; ¶ 0209, “The encoder 702 is configured to map node and edge features from the scene graph generated in step 206 to initial node and edge vectors h.sub.i.sup.o and e.sub.ij.sup.o respectively, where h.sub.i.sup.o is the initial (i.e. 
time=0) node vector associated with the i-th node, where i ∈ V, V being the set of nodes that form the graph. The initial node vector h.sub.i.sup.o is set equal to the feature vector x.sub.i associated with the i-th node (i.e. h.sub.i.sup.o=x.sub.i), while the edge vector e.sub.ij.sup.o (representing the edge between nodes i and j) is unused.”)
updating the multi-modal node graph (see at least ZHANG C., ¶ 0097, “According to a seventh aspect there is provided an apparatus for place recognition, the apparatus configured to: obtain information identifying an image of a first scene; identify a plurality of pixel clusters in the information identifying the image, the plurality of pixel clusters comprising: a first pixel cluster; and a second pixel cluster; generate a set of feature vectors from the information identifying the image, the set of feature vectors comprising: a first feature vector associated with the first pixel cluster; and a second feature vector associated with the second pixel cluster; generate a graph of the scene, the graph comprising: a first node representing the first pixel cluster, the first node associated with the first feature vector; and a second node representing the second pixel cluster, the second node associated with the second feature vector; add a first edge between the first node and the second node in response to determining that a first property associated with the first pixel cluster is similar to a second property associated with the second pixel cluster; generate a vector representation of the graph; calculate a measure of similarity between the vector representation of the graph and a reference vector representation, wherein the reference vector representation is associated with a second scene; and determine that the first scene and the second scene are associated with a same place in response to determining that the measure of similarity is less than a threshold.”; ¶ 0290, “In an example there is a method comprising modifying a scene graph (e.g. generated by step 206) to add additional nodes (associated with an additional object) and edges connecting the node to other nodes of the graph. In this way, the scene graph can be modified/augmented to include an object that was not present in the image obtained in step 201. 
Adding edges between nodes controls the position of the additional object within the scene. Additionally or alternatively the scene graph generated by step 206 is modified before generating a graph embedding (e.g. in step 207) to remove nodes from the graph representation. Removing nodes may be advantageous where, for example, the image data obtained in step 201 is known to include a temporary object that would likely not be present in the reference set. Consequently, by removing this node the associated object will not be taken into account when performing similarity comparisons with graph embeddings in the reference set. Optionally the modified/augmented is generated based on the output of step 206 (generating a scene graph) and provided to the input of step 207 (generating a graph embedding)”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with natural language recognition for determining the goal and local planning to examine current positions for collision avoidance with the tracking of 3D poses within HU in view of SAMARASEKERA to include using images to create graph embeddings of feature-vector nodes for mapping within ZHANG C. to effectively yield a robot capable of mapping the given environment utilizing onboard cameras.
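The graph construction ZHANG C. describes (nodes for pixel clusters carrying feature vectors, edges between clusters with similar properties, and a graph-level vector compared against a reference under a threshold) might be sketched as below. The mean-pooled "embedding" and the scalar cluster property are simplifying assumptions standing in for the reference's learned encoder:

```python
import math

def build_graph(clusters):
    """clusters: list of (feature_vector, property) pairs, one per pixel
    cluster. Nodes carry the feature vectors; an edge is added between
    two clusters whose scalar properties are close."""
    nodes = [features for features, _ in clusters]
    edges = [
        (i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
        if abs(clusters[i][1] - clusters[j][1]) < 0.5
    ]
    return nodes, edges

def graph_vector(nodes):
    """Toy graph embedding: mean-pool the node feature vectors."""
    n = len(nodes)
    return [sum(v[k] for v in nodes) / n for k in range(len(nodes[0]))]

def same_place(vec_a, vec_b, threshold=0.3):
    """Declare two scenes the same place when the distance between their
    graph vectors falls below the threshold, as in the cited comparison."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return dist < threshold
```

Nodes can likewise be added or removed before pooling, which corresponds to the graph modification ZHANG C. describes in ¶ 0290.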
HU in view of SAMARASEKERA in further view of ZHANG C. does not disclose, but XIE teaches:
identifying one or more loop closures for the multi-modal node graph based on embeddings output by the image encoder of the multi-modal node graph, and (see at least XIE, ¶ 0003, “Loop closures occur when revisiting an area e.g. when it is identified by the front-end that a robot (or person) has returned to a location visited previously. A transformation between the pose of the robot on the first visit and the pose of the robot on the subsequent visit can therefore be calculated this transformation links nodes on two different trajectories and is therefore described as a loop closure. Loop closures are represented by inserting additional edges (or constraints) into the graph.”; ¶ 0102, “In the embodiment being described, this is normally assessed in two steps: [0103] (i) First, a loop closure is detected between two poses (graph nodes) by checking sensor data similarity, e.g., looking at image appearance and magnetic magnitude; and [0104] (ii) Then, if the similarity is high, the transformation between the two poses (graph nodes) is calculated. This transformation can be determined in the front-ends by using different techniques, which depend on the sensor type, system requirement, and even operating environment etc.”; ¶ 0145, “The intuition is that the relative poses encapsulated in the loop closure pair should be consistent with the odometry information of the outbound and inbound trajectory segments (i.e. the portions of the trajectory between pose nodes of the loop closures). Here a segment, as will be understood by the person skilled in the art, is a section (i.e. segment) of a trajectory, such as for example each T1 and T2. Two loop closures and the trajectories between them form a closed chain, or circle in the graph, as shown in FIG. 4. An assumption made by systems 300 of the embodiment being described is that odometry edges only suffer from limited cumulative drift, and large, abrupt odometry errors are not experienced. 
Under this assumption, the consistency of the loop closure pair can be tested based on the combination of a probability propagation by dead-reckoning and a χ.sup.2 test.”)
based on the one or more loop closures. (see at least XIE, ¶ 0034, “According to a first aspect of the invention, there is provided a computer-implemented method of constructing a model of the motion of a mobile device. The method may comprise using a sensor of the device to obtain data providing an estimated pose of the mobile device; the data may be referred to as positional data. An initial graph is then generated which may be based upon the positional data from the sensor, nodes of which graph provide a series of possible poses of the device, and edges of which graph represent odometry and/or loop closure constraints. The method may then process the graph to estimate confidence scores for each loop closure. Typically, the confidence score is generated by performing pairwise consistency tests between each loop closure and a set of other loop closures. Conveniently, the method then generates an augmented graph from the initial graph.”; ¶ 0035, “Generating the augmented graph may comprise retaining and/or deleting each loop closure based upon the confidence scores.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine, with a reasonable expectation of success, the motion planning with images to create vector nodes within HU in view of SAMARASEKERA in further view of ZHANG C. to include loop closure detection and modification of the pose graph based on loop closures as within XIE to effectively yield an improved pose estimation system that corrects for cumulative odometry error as disclosed in XIE ¶ 0003-0004.
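XIE's two-step loop-closure handling (first hypothesize a closure between two pose nodes from sensor-data similarity, then test the closure for consistency against odometry) might be sketched as follows. The one-dimensional poses, cosine descriptor similarity, and fixed thresholds are illustrative assumptions, not XIE's χ² formulation:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sensor descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def detect_loop_closures(descriptors, threshold=0.95):
    """Step (i): hypothesize a loop closure between two pose nodes when
    their sensor descriptors (e.g. image appearance) are highly similar.
    Adjacent nodes are skipped, since they trivially look alike."""
    closures = []
    for i in range(len(descriptors)):
        for j in range(i + 2, len(descriptors)):
            if cosine(descriptors[i], descriptors[j]) > threshold:
                closures.append((i, j))
    return closures

def consistent(closure, poses, tolerance=0.5):
    """Step (ii), greatly simplified: the relative pose implied by the
    closure should agree with odometry. Here poses are 1-D positions and
    a closure implies near-zero displacement between its two nodes."""
    i, j = closure
    return abs(poses[j] - poses[i]) < tolerance
```

Closures that fail the consistency test would be deleted from the pose graph, matching XIE's retain/delete decision based on confidence scores.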
Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over HU (Z. Hu, J. Pan, T. Fan, R. Yang and D. Manocha, "Safe Navigation With Human Instructions in Complex Scenes," in IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 753-760, April 2019) in view of SAMARASEKERA (US 20150269438 A1) in further view of GRZESIAK (KR20210109722A).
Regarding claim 13:
HU in view of SAMARASEKERA discloses the limitations within claim 1. HU further discloses:
text representation of the description associated with the object in the mapped environment. (see at least HU, Pg 1, Section III, "Our approach contains four steps: pre-processing, phrase classification, goal and constraint grounding, and motion planning. First, the pre-processing step takes the command sentence as the input and divides it into phrases according to conjunctions and commas. Next, we assume that each phrase has one of three labels: goal, constraint, or uninformative phrase. For example, the phrase “go to the restaurant” is a goal phrase, “keep away from people” is a constraint phrase, and the phrase “you know” is uninformative. We train a Long Short-Term Memory (LSTM) network [19] to classify phrases into those three types. The LSTM is suitable for our task because the classification output does not depend only on individual words but also on the meaning of the entire phrase. After recognizing the constraint and goal phrases, we need to ground them with the physical world. In particular, we ground the goal phrase by computing the similarity between the noun extracted from the goal phrase and the location name in the predefined semantic map, and use the most similar location as the goal configuration for the robot navigation."; Pg 4, Section V, "For goal grounding, we compute the goal location by extracting the noun from the goal phrase and then computing the similarity between this noun and the location names in the predefined semantic map. The similarity is computed as the cosine similarity between the embedded vectors of two given word items. The embedding is accomplished using the Word2Vec embedding network [22], [23] that can convert a word into a vector based on the Wiki corpus. We use the 2D coordinate of the location that has the highest similarity with the noun in the goal phrase as the goal location.")
HU does not disclose, but GRZESIAK teaches:
wherein the one or more input devices include one or more audio sensors, the input includes a voice command, and receiving the input includes: capturing, via the one or more audio sensors, the voice command; and converting the voice command into a text representation (see at least GRZESIAK, ¶ 0001, "The present disclosure relates to a device and a method for controlling the device that performs voice recognition based on a user's utterance status and generates a control command based on the voice recognition, and more specifically, to a device that recognizes a voice command based on a user's actual utterance and generates a control command for controlling the device based on the recognized voice command, or to a device that generates a control command using movement information based on movement of a part of the user's body and the recognized voice command."; ¶ 0006, "As a technical means for achieving the above-described technical task, a device according to an embodiment of the present disclosure includes a memory in which at least one program is stored; a microphone for receiving an audio signal; a sensor module for obtaining vibration information according to a user's speech state; and at least one processor for generating control information corresponding to a user's voice input identified from the audio signal by executing the at least one program, wherein the at least one program may include instructions for executing: a step of identifying the user's voice input from an audio signal received through the microphone based on the vibration information; and a step of generating control information corresponding to the user's voice input based on the identified user's voice input.")
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify, with a reasonable expectation of success, the phrase collection and processing within HU to implement voice commands within GRZESIAK to effectively yield a more versatile command system that allows for user voice to be inputted for goal setting.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
GLESINGER (US 20230140125 A1)
¶ 0268, “FIG. 14C summarizes the process flow of recursive streams of attention (or “consciousness”) and/or autonomous behaviors of computer-based system 925 in accordance with some embodiments. The first step 665 comprises automatically prioritizing potential focuses of attention and selecting a focus of attention based on the prioritization. The potential focuses of attention can be derived from external sources via, for example, information attained via sensors, from accessible content, or from internally stored information such as saved communications 250c, chains, or images. The focus of attention can be with respect to specific natural language elements during an automatic scan of the elements as can be performed by transformer-type deep learning models. Prioritization of the potential focuses of attention may be through application of precedence rules or scoring algorithms, such as, but not limited to, and everything else being equal, assigning a higher priority for attending to user 200 requests or requirements, assigning a higher priority based on recency considerations, and/or assigning a higher priority based on probabilistic evaluations and/or value of information considerations. Language-based sources of potential focuses of attention (such as processing speech from a user 200 or processing digitized content) can be directly converted to one or more syntactical elements such as behavioral chains, semantic chains, or composite chains as described herein. In the case of non-language-based sources of information such as images, an associated language-based description comprising syntactical elements is first determined for each image (such as, but not limited to, by means of the application of a trained neural network to the images, for example), and then this language-based description can be converted to derivative syntactical elements such as behavioral chains, semantic chains, or composite chains. 
A particular focus of attention as represented by one or more behavioral chains, semantic chains, or composite chains is then automatically identified based on the application of the prioritization rules or algorithms, including but not limited to considerations such as recency, uncertainty, value of information, and prioritization of required response, to the derived behavioral chains, semantic chains, or composite chains and associated uncertainties associated with each of the potential focuses of attention.”
ABRAMSON (US 20230178076 A1)
¶ 0082, “In some implementations, the environment is a real-world environment and the agent is a mechanical agent interacting with the real-world environment, e.g., to perform one or more selected actions in the real-world environment. For example, the agent may be a robot interacting with the environment to accomplish a goal, e.g., to locate an object of interest in the environment, to move an object of interest to a specified location in the environment, to physically manipulate an object of interest in the environment in a specified way, or to navigate to a specified destination in the environment, that is specified by the natural language inputs received from other agent(s); or the agent may be an autonomous or semi-autonomous land, air, or sea vehicle navigating through the environment to a specified destination in the environment.”
ZHANG J. (Zhang, Jinbao & Liu, Xiaojuan & Liao, Weilin & Li, Xia. (2022). Deep-learning generation of POI data with scene images. ISPRS Journal of Photogrammetry and Remote Sensing. 188. 201-219.)
Pg 205, Section 2.3, "Identifying the POI-related texts lines from scene images is a complex task, involving two-step solutions. First, the algorithm, for POI generation, must distinguish the POI names with other attributes, such as addresses, telephones and so forth, in the POI-related ROI. This solution involves the text semantic information and some visual features like the location and font size. Second, since the instance segmentation model may identify some interference regions (i.e., billboards), the algorithm here needs to classify the ROIs into real POI-related regions and false POI-related regions. However, only using the visual features for ROI classification is absolutely unconvincing because some billboards have high similarities with signboards in vision. Therefore, we propose an innovative Visual-Linguistic Multi-Classification (VLMC) model to classify ROI and text by fusing language and visual features.
YANG (US 20150185002 A1)
¶ 0081, “In some demonstrative embodiments, orientation estimator 132 may classify the yaw angular rotation of device 102 as an angular rotation caused by the turn of the user of device 102 ("an intentional yaw change"); or as an angular rotation, which is not caused by a turn of the user of device 012 ("an unintentional yaw change"). For example, the unintentional yaw change may result from small changes in a posture and/or in stride phases of the user of device 102, from the gyroscope bias drift and/or from the measurement noise of gyroscope 125”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAFAEL VELASQUEZ VANEGAS whose telephone number is (571)272-6999. The examiner can normally be reached M-F 8 - 4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VIVEK KOPPIKAR can be reached at (571) 272-5109. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RAFAEL VELASQUEZ VANEGAS/Patent Examiner, Art Unit 3667
/JOAN T GOODBODY/Examiner, Art Unit 3667