DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 07/08/2025 has been considered by the examiner.
Specification
The disclosure is objected to because it contains an embedded hyperlink and/or other form of browser-executable code (see paragraph [0003] of the specification). Applicant is required to delete the embedded hyperlink and/or other form of browser-executable code; references to websites should be limited to the top-level domain name without any prefix such as http:// or other browser-executable code. See MPEP § 608.01.
Applicant is reminded of the proper language and format for an abstract of the disclosure.
The abstract should be in narrative form and generally limited to a single paragraph on a separate sheet within the range of 50 to 150 words in length. The abstract should describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details.
The language should be clear and concise and should not repeat information given in the title. It should avoid using phrases which can be implied, such as, “The disclosure concerns,” “The disclosure defined by this invention,” “The disclosure describes,” etc. In addition, the form and legal phraseology often used in patent claims, such as “means” and “said,” should be avoided.
The abstract of the disclosure is objected to because it contains the phrase “… in the present disclosure performs”. A corrected abstract of the disclosure is required and must be presented on a separate sheet, apart from any other text. See MPEP § 608.01(b).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 6, and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Dogan et al. (arXiv:2107.04658v1, 9 July 2021), hereinafter Dogan, in view of Yang et al. (2022 CVPR), hereinafter Yang.
-Regarding claim 1, Dogan discloses a moving object control system comprising (Page 1, Sec. 1., 1st paragraph, “a robot … helping a user to pick up a described object”; FIG. 1; Page 7, 1st Col., 2nd paragraph, “system … deployed to a robot”): a memory; and one or more processors, wherein when instructions stored in the memory is executed by the one or more processors (one or more processors and memories have to be used in order to implement the method shown in Dogan’s FIG. 2 and Algorithm 1), the instructions cause the one or more processors to (Abstract; FIGS. 1-6; Algorithm 1): acquire an image (FIG. 2, RGB-D scene); acquire a user instruction in a natural language including a relative positional relationship (FIG. 2, referring expression, “Can you bring the bowl next to the glass”); and predict a region in the image corresponding to a position in a scene indicated by the user instruction (FIG. 2, most probable target object regions; FIGS. 5-6) based on a fused feature obtained by fusing an image feature indicating a feature of the scene captured in the image, a depth of the scene captured in the image (FIG. 2, RGB, Depth, combined activations) by using one or more machine learning models (FIG. 2, Grad-CAM; Algorithm 1; Page 1, 2nd Col., 2nd paragraph).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction. However, Dogan does disclose a referring expression generation model taking the depth dimension as an input in addition to RGB features (Page 2, 1st Col., 1st paragraph; FIG. 2).
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
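For context, the type of image/depth/language feature fusion recited in claim 1 can be sketched as follows. This is an illustrative sketch only; the module names, feature dimensions, and fusion-by-concatenation design are assumptions made for the example and are not taken from Dogan or Yang.

```python
# Illustrative sketch only: hypothetical fusion of image, depth, and language
# features for per-pixel region prediction. Not the implementation of Dogan
# (arXiv:2107.04658) or Yang (CVPR 2022); names and dimensions are assumed.
import torch
import torch.nn as nn

class FusedRegionPredictor(nn.Module):
    """Predicts per-pixel region logits from fused image, depth, and language features."""

    def __init__(self, img_channels=256, depth_channels=64, lang_dim=768):
        super().__init__()
        # Project the sentence-level language feature so it can be broadcast
        # over the spatial grid and concatenated with the visual features.
        self.lang_proj = nn.Linear(lang_dim, 64)
        fused_channels = img_channels + depth_channels + 64
        self.head = nn.Sequential(
            nn.Conv2d(fused_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),  # one logit per pixel
        )

    def forward(self, img_feat, depth_feat, lang_feat):
        # img_feat:   (B, C_img, H, W) image feature map
        # depth_feat: (B, C_dep, H, W) depth feature map
        # lang_feat:  (B, L) sentence embedding of the user instruction
        b, _, h, w = img_feat.shape
        lang = self.lang_proj(lang_feat).view(b, -1, 1, 1).expand(-1, -1, h, w)
        fused = torch.cat([img_feat, depth_feat, lang], dim=1)  # channel-wise fusion
        return self.head(fused)  # (B, 1, H, W) region logits

# Example with random tensors standing in for encoder outputs:
model = FusedRegionPredictor()
logits = model(torch.randn(2, 256, 32, 32), torch.randn(2, 64, 32, 32), torch.randn(2, 768))
region_mask = torch.sigmoid(logits) > 0.5  # predicted region per pixel
```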
-Regarding claim 2, Dogan in view of Yang teaches the moving object control system of claim 1. The combination further teaches predicting a region in the image corresponding to a position in the scene indicated by the user instruction based on a fused feature obtained by fusing the image feature, the depth, and the language feature for each of predetermined unit regions of the image by using the one or more machine learning models (Dogan: FIG. 2; Page 3, 1st Col., 4th paragraph, “each pixels”; Algorithm 1, number of unconnected areas, each pixel; See also Yang: p. 18136, Sec. 3.1, pixel-wise).
-Regarding claim 3, Dogan in view of Yang teaches the moving object control system of claim 1.
Dogan discloses using the one or more machine learning models to: extract, from the image, an image feature indicating a feature of a scene captured in the image; predict, from the image, a depth of the scene captured in the image; and predict a region in the image corresponding to a position in the scene indicated by the user instruction based on a fused feature obtained by fusing the image feature and the depth feature (Dogan: FIG. 2).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction, nor does Dogan disclose extracting a language feature indicating a linguistic feature related to the user instruction. However, Dogan does disclose a referring expression generation model taking the depth dimension as an input in addition to RGB features (Page 2, 1st Col., 1st paragraph; FIG. 2).
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph) and extracting a language feature indicating a linguistic feature related to the user instruction (Yang: FIG. 2).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
-Regarding claim 6, Dogan discloses a moving object control system comprising one or more processors (one or more processors and memories have to be used in order to implement the method shown in Dogan’s FIG. 2 and Algorithm 1) configured to execute processing of one or more machine learning models (Page 1, Sec. 1., 1st paragraph, “a robot … helping a user to pick up a described object”; FIG. 1; Page 7, 1st Col., 2nd paragraph, “system … deployed to a robot”; Abstract; FIGS. 1-6; Algorithm 1), wherein the one or more machine learning models include a first machine learning model that extracts, from an acquired image, an image feature indicating a feature of a scene captured in the image (FIG. 2; Grad-CAM; FIG. 3; Page 3, 1st Col., 2nd paragraph, “NeuralTalk2 … RGB activations”), a second machine learning model that predicts, from the image, a depth of the scene captured in the image (FIG. 2; Grad-CAM; FIG. 3; Page 3, 1st Col., 2nd paragraph, “depth dimension of the scenes”), and a fourth machine learning model that predicts a region in the image corresponding to a position in the scene indicated by the user instruction (FIG. 2, K-means clustering, Most probable target object regions; Algorithm 1) based on a fused feature obtained by fusing the image feature and the depth (FIG. 2, RGB, Depth, combined activations).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction. Dogan does not disclose a third machine learning model that extracts a language feature indicating a linguistic feature for a user instruction in a natural language including a relative positional relationship. However, Dogan does disclose a referring expression generation model taking the depth dimension as an input in addition to RGB features (Page 2, 1st Col., 1st paragraph; FIG. 2).
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph). Yang further teaches a third machine learning model that extracts a language feature indicating a linguistic feature for a user instruction in a natural language including a relative positional relationship (Yang: FIG. 2, BERT; FIGS. 1, 3). Yang also teaches a first machine learning model that extracts, from an acquired image, an image feature indicating a feature of a scene captured in the image (Yang: FIGS. 1-3), a fourth machine learning model that predicts a region in the image corresponding to a position in the scene indicated by the user instruction based on a fused feature obtained by fusing the image feature, and the language feature (FIGS. 1-3).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
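For context, the first/second/third/fourth model arrangement recited in claim 6 can be sketched as follows. The backbones shown (plain convolutions and a mean-of-embeddings text encoder) are hypothetical placeholders chosen only to make the sketch runnable; they are not asserted to correspond to the models of Dogan or Yang.

```python
# Illustrative sketch only: a first model (image feature), a second model (depth),
# a third model (language feature), and a fourth model (region prediction from
# the fused feature). All backbones and sizes are hypothetical placeholders.
import torch
import torch.nn as nn

class FourModelPipeline(nn.Module):
    def __init__(self, vocab_size=30522, lang_dim=256):
        super().__init__()
        # First model: extracts an image feature map from the acquired image.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, padding=1), nn.ReLU())
        # Second model: predicts a depth map of the scene from the same image.
        self.depth_predictor = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))
        # Third model: maps the tokenized user instruction to a language feature
        # (here simply the mean of learned token embeddings).
        self.language_encoder = nn.EmbeddingBag(vocab_size, lang_dim)
        # Fourth model: predicts the region from the fused image/depth/language feature.
        self.region_predictor = nn.Sequential(
            nn.Conv2d(256 + 1 + lang_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1))

    def forward(self, image, token_ids):
        img_feat = self.image_encoder(image)        # (B, 256, H, W)
        depth = self.depth_predictor(image)         # (B, 1, H, W)
        lang = self.language_encoder(token_ids)     # (B, lang_dim)
        b, _, h, w = img_feat.shape
        lang_map = lang.view(b, -1, 1, 1).expand(-1, -1, h, w)
        fused = torch.cat([img_feat, depth, lang_map], dim=1)
        return self.region_predictor(fused)         # (B, 1, H, W) region logits

# Example with a random image and a random tokenized instruction:
pipeline = FourModelPipeline()
logits = pipeline(torch.randn(1, 3, 64, 64), torch.randint(0, 30522, (1, 10)))
```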
-Regarding claim 11, Dogan discloses a method executed in a moving object control system, the method comprising (Page 1, Sec. 1., 1st paragraph, “a robot … helping a user to pick up a described object”; FIG. 1; Page 7, 1st Col., 2nd paragraph, “system … deployed to a robot”; Abstract; FIGS. 1-6; Algorithm 1): acquiring an image (FIG. 2, RGB-D scene); acquiring a user instruction in a natural language including a relative positional relationship (FIG. 2, referring expression, “Can you bring the bowl next to the glass”); and predicting a region in the image corresponding to a position in a scene indicated by the user instruction (FIG. 2, most probable target object regions; FIGS. 5-6) based on a fused feature obtained by fusing an image feature indicating a feature of the scene captured in the image, a depth of the scene captured in the image (FIG. 2, RGB, Depth, combined activations) by using one or more machine learning models (FIG. 2. Grad-CAM; Algorithm 1; Page 1, 2nd Col., 2nd paragraph).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction. However, Dogan does disclose a referring expression generation model taking the depth dimension as an input in addition to RGB features (Page 2, 1st Col., 1st paragraph; FIG. 2).
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
Claims 4 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over Dogan et al. (arXiv:2107.04658v1, 9 July 2021), hereinafter Dogan, in view of Yang et al. (2022 CVPR), and further in view of Nicoleta et al. (WO 2025056893 A1), hereinafter Nicoleta.
-Regarding claim 4, Dogan in view of Yang teaches the moving object control system of claim 2.
Dogan in view of Yang teaches combining the image feature and the depth feature for each of the predetermined unit regions (Dogan: FIG. 2; Page 1, 2nd Col., 2nd paragraph; Page 3, 1st Col., 4th paragraph, “each pixels”; Algorithm 1, number of unconnected areas, each pixel).
Dogan in view of Yang does not teach that the two features are combined by concatenation.
However, Nicoleta is an analogous art pertinent to the problem to be solved in this application and teaches a method for scene segmentation. Nicoleta further teaches concatenating the image feature and the depth feature (Nicoleta: FIG. 3; [0082]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Dogan in view of Yang with the teaching of Nicoleta by concatenating the image feature and the depth feature in order to generate fused features for image segmentation.
-Regarding claim 5, Dogan in view of Yang, and further in view of Nicoleta, teaches the moving object control system of claim 4. The modification further teaches wherein the one or more machine learning models further include a pixel-wise attention mechanism (PWAM) that fuses the language feature with the concatenated feature for each of the predetermined unit regions (Yang: FIG. 2; p. 18136, Sec. 3.1, pixel-wise).
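For context, one way a pixel-wise attention module could fuse per-word language features with a concatenated image/depth feature map is sketched below. This is a simplified stand-in, not Yang's PWAM as published; the projection layers, gating, and dimensions are assumptions made for the example.

```python
# Illustrative sketch only: a simplified pixel-wise attention in the spirit of a
# PWAM; layer names and dimensions are hypothetical, not Yang's implementation.
import torch
import torch.nn as nn

class PixelWordAttention(nn.Module):
    def __init__(self, vis_channels=320, word_dim=768, dim=128):
        super().__init__()
        self.q = nn.Conv2d(vis_channels, dim, kernel_size=1)    # pixel queries
        self.k = nn.Linear(word_dim, dim)                       # word keys
        self.v = nn.Linear(word_dim, dim)                       # word values
        self.out = nn.Conv2d(dim, vis_channels, kernel_size=1)  # back to visual width
        self.dim = dim

    def forward(self, vis_feat, word_feats):
        # vis_feat:   (B, C, H, W) concatenated image+depth feature map
        # word_feats: (B, N, D) per-word language features of the instruction
        b, c, h, w = vis_feat.shape
        q = self.q(vis_feat).flatten(2).transpose(1, 2)          # (B, H*W, dim)
        k = self.k(word_feats)                                   # (B, N, dim)
        v = self.v(word_feats)                                   # (B, N, dim)
        # Attention of each pixel over the words of the instruction.
        attn = torch.softmax(q @ k.transpose(1, 2) / self.dim ** 0.5, dim=-1)
        lang_per_pixel = (attn @ v).transpose(1, 2).view(b, self.dim, h, w)
        # Gate the visual features with the per-pixel attended language feature.
        return vis_feat * torch.sigmoid(self.out(lang_per_pixel))

# Concatenate image and depth feature maps channel-wise, then fuse with language:
img_feat, depth_feat = torch.randn(1, 256, 32, 32), torch.randn(1, 64, 32, 32)
concat_feat = torch.cat([img_feat, depth_feat], dim=1)           # (1, 320, 32, 32)
fused = PixelWordAttention()(concat_feat, torch.randn(1, 12, 768))
```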
Claims 7-10 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Dogan et al. (arXiv:2107.04658v1, 9 July 2021), hereinafter Dogan, in view of Yang et al. (2022 CVPR), and further in view of Piergiovanni et al. (US 20240029413 A1), hereinafter Piergiovanni.
-Regarding claim 7, Dogan discloses an information processing apparatus configured to cause one or more machine learning models to be trained (Page 3, 1st Col., 1st paragraph, “during training”, 2nd paragraph, “was trained”; Page 7, 1st Col., 2nd paragraph, “pre-trained”), the information processing apparatus comprising (Page 1, Sec. 1., 1st paragraph, “a robot … helping a user to pick up a described object”; FIG. 1; Page 7, 1st Col., 2nd paragraph, “system … deployed to a robot”): a memory; and one or more processors, wherein when instructions stored in the memory is executed by the one or more processors (one or more processors and memories have to be used in order to implement the method shown in Dogan’s FIG. 2 and Algorithm 1), the instructions cause the one or more processors to (Abstract; FIGS. 1-6; Algorithm 1): acquire an image (FIG. 2, RGB-D scene); acquire a user instruction in a natural language including a relative positional relationship (FIG. 2, referring expression, “Can you bring the bowl next to the glass”); predict a region, indicated by the user instruction, in the image corresponding to a position in a scene captured in the image based on the image and the user instruction by using the one or more machine learning models (FIG. 2); and the one or more machine learning models (FIG. 2. Grad-CAM; Algorithm 1) predict the region in the image corresponding to the position in the scene indicated by the user instruction (FIG. 2, most probable target object regions; FIGS. 5-6) based on a fused feature obtained by fusing an image feature indicating a feature of the scene captured in the image (FIG. 2, RGB, Depth, combined activations), and a depth of the scene captured in the image (FIG. 2. Grad-CAM; Algorithm 1; Page 1, 2nd Col., 2nd paragraph).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction.
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph). Yang also teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”) and correct answer data indicating a region in the image indicated by the user instruction (Yang: FIG. 6, ground truth).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
Dogan in view of Yang does not teach causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data. However, a person of ordinary skill in the art would understand that ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data have to be provided if supervised training is conducted for these machine learning models. Yang teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”).
Piergiovanni is an analogous art pertinent to the problem to be solved in this application and teaches a method that involves training a model by dynamically adjusting the number of examples within each training batch for cross-modal vision-language tasks (Piergiovanni: Abstract; FIGS. 1-10). Piergiovanni further teaches correct answer data indicating a region in the image indicated by the user instruction (Piergiovanni: FIG. 5, steps 510-520; [0002]; [0025], “trained … using supervised”) and causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted results and the ground truth data (Piergiovanni: FIG. 5, steps 530-540; [0002], “generate a loss … respect to the training … determining a difference between the untrained model output and a known ‘ground truth’ model output …”; [0057]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Dogan in view of Yang with the teaching of Piergiovanni by providing ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the ground truth region in the image in order to perform supervised training for the machine learning models.
-Regarding claim 8, Dogan in view of Yang, and further in view of Piergiovanni, teaches the information processing apparatus of claim 7. The modification further teaches wherein the loss function includes a function that calculates a binary cross-entropy loss (Piergiovanni: [0080]; FIG. 6).
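For context, a supervised training step using a binary cross-entropy loss between the predicted region and the ground-truth ("correct answer") region can be sketched as follows; the function and tensor names are hypothetical and are not drawn from the cited references.

```python
# Illustrative sketch only: supervised training with a pixel-wise binary
# cross-entropy loss between the predicted region and the ground-truth mask.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, img_feat, depth_feat, lang_feat, gt_mask):
    # gt_mask: (B, 1, H, W) binary "correct answer" region indicated by the instruction
    optimizer.zero_grad()
    logits = model(img_feat, depth_feat, lang_feat)  # (B, 1, H, W) predicted region logits
    # Loss is based on the difference between the predicted region and the
    # region indicated by the correct answer data (here, per-pixel BCE).
    loss = F.binary_cross_entropy_with_logits(logits, gt_mask.float())
    loss.backward()
    optimizer.step()
    return loss.item()

# Example using the FusedRegionPredictor sketched above (claim 1 discussion):
model = FusedRegionPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_value = training_step(model, optimizer,
                           torch.randn(2, 256, 32, 32), torch.randn(2, 64, 32, 32),
                           torch.randn(2, 768), (torch.rand(2, 1, 32, 32) > 0.5))
```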
-Regarding claim 9, Dogan in view of Yang, and further in view of Piergiovanni, teaches the information processing apparatus of claim 7. The modification further teaches causing the one or more machine learning models to be trained with the correct answer data in a lower half region of the region of the image (Dogan: FIGS. 1-2; Yang: FIG. 6). Please note that Yang places no restriction on the ground truth regions.
Dogan in view of Yang does not teach causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data. However, a person of ordinary skill in the art would understand that ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data have to be provided if supervised training is conducted for these machine learning models. Yang teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”).
Piergiovanni is an analogous art pertinent to the problem to be solved in this application and teaches a method that involves training a model by dynamically adjusting the number of examples within each training batch for cross-modal vision-language tasks (Piergiovanni: Abstract; FIGS. 1-10). Piergiovanni further teaches correct answer data indicating a region in the image indicated by the user instruction (Piergiovanni: FIG. 5, steps 510-520; [0002]; [0025], “trained … using supervised”) and causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted results and the ground truth data (Piergiovanni: FIG. 5, steps 530-540; [0002], “generate a loss … respect to the training … determining a difference between the untrained model output and a known ‘ground truth’ model output …”; [0057]).
-Regarding claim 10, Dogan discloses an information processing apparatus configured to cause one or more machine learning models to be trained (Page 3, 1st Col., 1st paragraph, “during training”, 2nd paragraph, “was trained”; Page 7, 1st Col., 2nd paragraph, “pre-trained”), the information processing apparatus comprising (Page 1, Sec. 1., 1st paragraph, “a robot … helping a user to pick up a described object”; FIG. 1; Page 7, 1st Col., 2nd paragraph, “system … deployed to a robot”): a memory; and one or more processors, wherein when instructions stored in the memory is executed by the one or more processors (one or more processors and memories have to be used in order to implement the method shown in Dogan’s FIG. 2 and Algorithm 1), the instructions cause the one or more processors to (Abstract; FIGS. 1-6; Algorithm 1): acquire an image (FIG. 2, RGB-D scene); acquire a user instruction in a natural language including a relative positional relationship (FIG. 2, referring expression, “Can you bring the bowl next to the glass”); predict a region, indicated by the user instruction, in the image corresponding to a position in a scene captured in the image based on the image and the user instruction by using the one or more machine learning models (FIG. 2); and the one or more machine learning models (FIG. 2. Grad-CAM; Algorithm 1) predict the region in the image corresponding to the position in the scene indicated by the user instruction (FIG. 2, most probable target object regions; FIGS. 5-6) based on a fused feature obtained by fusing an image feature indicating a feature of the scene captured in the image (FIG. 2, RGB, Depth, combined activations), and a depth of the scene captured in the image (FIG. 2. Grad-CAM; Algorithm 1; Page 1, 2nd Col., 2nd paragraph); wherein the one or more machine learning models include a first machine learning model that extracts, from an acquired image, an image feature indicating a feature of a scene captured in the image (FIG. 2; Grad-CAM; FIG. 3; Page 3, 1st Col., 2nd paragraph, “NeuralTalk2 … RGB activations”), a second machine learning model that predicts, from the image, a depth of the scene captured in the image (FIG. 2; Grad-CAM; FIG. 3; Page 3, 1st Col., 2nd paragraph, “depth dimension of the scenes”), and a fourth machine learning model that predicts a region in the image corresponding to a position in the scene indicated by the user instruction (FIG. 2, K-means clustering, Most probable target object regions; Algorithm 1) based on a fused feature obtained by fusing the image feature and the depth (FIG. 2, RGB, Depth, combined activations).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction. Dogan does not disclose a third machine learning model that extracts a language feature indicating a linguistic feature for a user instruction in a natural language including a relative positional relationship. However, Dogan does disclose a referring expression generation model taking the depth dimension as an input in addition to RGB features (Page 2, 1st Col., 1st paragraph; FIG. 2).
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph). Yang further teaches a third machine learning model that extracts a language feature indicating a linguistic feature for a user instruction in a natural language including a relative positional relationship (Yang: FIG. 2, BERT; FIGS. 1, 3). Yang also teaches a first machine learning model that extracts, from an acquired image, an image feature indicating a feature of a scene captured in the image (Yang: FIGS. 1-3), and a fourth machine learning model that predicts a region in the image corresponding to a position in the scene indicated by the user instruction based on a fused feature obtained by fusing the image feature and the language feature (FIGS. 1-3). Yang also teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”) and correct answer data indicating a region in the image indicated by the user instruction (Yang: FIG. 6, ground truth).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
Dogan in view of Yang does not teach causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data. However, a person of ordinary skill in the art would understand that ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data have to be provided if supervised training is conducted for these machine learning models. Yang teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”).
Piergiovanni is an analogous art pertinent to the problem to be solved in this application and teaches a method that involves training a model by dynamically adjusting the number of examples within each training batch for cross-modal vision-language tasks (Piergiovanni: Abstract; FIGS. 1-10). Piergiovanni further teaches correct answer data indicating a region in the image indicated by the user instruction (Piergiovanni: FIG. 5, steps 510-520; [0002]; [0025], “trained … using supervised”) and causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted results and the ground truth data (Piergiovanni: FIG. 5, steps 530-540; [0002], “generate a loss … respect to the training … determining a difference between the untrained model output and a known ‘ground truth’ model output …”; [0057]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Dogan in view of Yang with the teaching of Piergiovanni by providing ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the ground truth region in the image in order to perform supervised training for the machine learning models.
-Regarding claim 12, Dogan discloses a method for generating one or more machine learning models, the method being executed in an information processing apparatus, the method comprising (Abstract; FIGS. 1-6; Algorithm 1; Page 1, Sec. 1., 1st paragraph, “a robot … helping a user to pick up a described object”; FIG. 1; Page 7, 1st Col., 2nd paragraph, “system … deployed to a robot”): acquiring an image (FIG. 2, RGB-D scene), a user instruction in a natural language including a relative positional relationship (FIG. 2, referring expression, “Can you bring the bowl next to the glass”); predicting a region, indicated by the user instruction, in the image corresponding to a position in a scene captured in the image based on the image and the user instruction by using the one or more machine learning models (FIG. 2); and the one or more machine learning models (FIG. 2. Grad-CAM; Algorithm 1) predict the region in the image corresponding to the position in the scene indicated by the user instruction (FIG. 2, most probable target object regions; FIGS. 5-6) based on a fused feature obtained by fusing an image feature indicating a feature of the scene captured in the image (FIG. 2, RGB, Depth, combined activations), and a depth of the scene captured in the image (FIG. 2. Grad-CAM; Algorithm 1; Page 1, 2nd Col., 2nd paragraph).
Dogan does not disclose fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction.
In the same field of endeavor, Yang teaches a method for referring image segmentation (Yang: Abstract; FIGS. 1-5; Tables 1-4). Yang further teaches fusion of the image feature and a language feature indicating a linguistic feature related to the user instruction (Yang: FIGS. 1-3; p. 18134, 2nd Col, last paragraph; p. 18135, 1st Col., 3rd paragraph, last paragraph; p. 18136, 1st Col., last paragraph). Yang also teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”) and correct answer data indicating a region in the image indicated by the user instruction (Yang: FIG. 6, ground truth).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teaching of Dogan with the teaching of Yang by further fusing a language feature indicating a linguistic feature related to the user instruction in order to leverage the referring expression for highlighting relevant positions in the image and provide accurate segmentation with a light-weight mask predictor (Yang: Abstract).
Dogan in view of Yang does not teach causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data. However, a person of ordinary skill in the art would understand that ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the region in the image indicated by the correct answer data have to be provided if supervised training is conducted for these machine learning models. Yang teaches supervised training for the machine learning models (Yang: p. 18138, 1st Col., last paragraph, “pre-trained on ImageNet-22K”; 2nd Col., 1st paragraph, “We train our model for 40 epochs with batch size 32”, 2nd Col., Sec. 4., 1st paragraph, “we train our model on the training set of that dataset”).
Piergiovanni is an analogous art pertinent to the problem to be solved in this application and teaches a method that involves training a model by dynamically adjusting the number of examples within each training batch for cross-modal vision-language tasks (Piergiovanni: Abstract; FIGS. 1-10). Piergiovanni further teaches correct answer data indicating a region in the image indicated by the user instruction (Piergiovanni: FIG. 5, steps 510-520; [0002]; [0025], “trained … using supervised”) and causing the one or more machine learning models to be trained by using a loss function based on a difference between the predicted results and the ground truth data (Piergiovanni: FIG. 5, steps 530-540; [0002], “generate a loss … respect to the training … determining a difference between the untrained model output and a known ‘ground truth’ model output …”; [0057]).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Dogan in view of Yang with the teaching of Piergiovanni by providing ground truth data (i.e., correct answer data indicating a region in the image indicated by the user instruction) and a loss function based on a difference between the predicted region in the image and the ground truth region in the image in order to perform supervised training for the machine learning models.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to XIAO LIU whose telephone number is (571)272-4539. The examiner can normally be reached Monday-Thursday and Alternate Fridays 8:30-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/XIAO LIU/ Primary Examiner, Art Unit 2664