Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 20230154188 A1) in view of Nguyen (US 20240233440 A1).
Regarding claims 1, 14, 15 and 16, Li discloses at least one processor ([0046] As shown in FIG. 6, computing device 600 includes a processor 610 coupled to memory 620. Operation of computing device 600 is controlled by processor 610.);
at least one memory including instructions executable by the at least one processor ([0049] For example, as shown, memory 620 includes instructions for a video-and-language alignment module 630 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein.); and
obtaining an image and an input text including a subject from the image and a location of the subject in the image ([0024] A video encoder 220 and a text encoder 222 encode video crops 202 of a video frame 102, and the text input 204 of text descriptions, respectively. The pre-training module 225 further includes an additional multimodal encoder 230 to further capture the interaction between the two modalities output from the video encoder 220 and text encoder 222.);
encoding the image to obtain an image embedding ([0056] At step 704, a video encoder (e.g., 220 in FIG. 3) may encode the plurality of video frames into video feature representations. (This limitation is the same as claim 15));
encoding the input text to obtain a text embedding ([0031] In one embodiment, the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N.sub.t tokens, the text encoder 222 outputs an embedding sequence {t.sub.cls, t.sub.1, . . . , t.sub.N.sub.t}, with t.sub.iϵcustom-character.sup.d and t.sub.cls the embedding 316 of the text [CLS] token. (This limitation is the same as claim 16));
combining the image embedding and the text embedding to obtain a combined embedding ("claim 7: transforming an embedding of a video start token from the video encoder into a normalized video embedding; transforming an embedding of a text start token from the text encoder into a normalized text embedding; and computing a dot product of the normalized video embedding and the normalized text embedding.
claim 8: encoding, by a multi-modal video-text encoder, the video feature representations and the text feature representations into a set of multimodal embeddings; and generating, by a classifier, an entity prediction from the set of multimodal embeddings."); and generating, using a vision-language transformer ([0019] Sampled frames and texts are independently encoded using a transformer-based video encoder and a text encoder, respectively.),
an output text by transforming the combined embedding ([0019] A video-text contrastive loss is computed by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss)
to obtain a sequence of tokens ([0029] The video encoder 220, the TimeSformer, may first partitions each frame into K non-overlapping patches, which are flattened and fed to a linear projection layer 305 to produce a sequence of patch tokens. Learnable positional embeddings are also added to the patch tokens from the linear projection layer 305. Then the TimeSformer applies self-attention along the temporal and spatial dimensions separately in order, leading to per-frame features {tilde over (v)}ϵcustom-character.sup.N.sup.p.sup.×K×d with d the feature dimension.).
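For illustration, the normalized-embedding dot product recited in Li's claim 7 amounts to a cosine-style similarity between the video and text start-token embeddings. A minimal sketch (illustrative only; the function name and array shapes are assumptions, not part of the cited record):

```python
import numpy as np

def video_text_similarity(video_cls: np.ndarray, text_cls: np.ndarray) -> float:
    """Dot product of L2-normalized start-token embeddings, per Li's claim 7."""
    v = video_cls / np.linalg.norm(video_cls)  # normalized video embedding
    t = text_cls / np.linalg.norm(text_cls)    # normalized text embedding
    return float(np.dot(v, t))                 # similarity score in [-1, 1]
```

A high score indicates a paired video-text instance, which is what the video-text contrastive loss of Li [0019] encourages.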
Li does not explicitly disclose but in a similar field of endeavor of human-object interaction, Nguyen teaches ([0029] The HOI classifier 204 can use the HOI tuples to generate scores for HOI classes. For example, the HOI classifier receives an image and human-object tuples from the HOI tuple generation module 202. In some examples (e.g., in an end-to-end arrangement), the HOI classifier extracts an object feature, a human feature, a pose feature, and a relationship feature (e.g., human-object position relationship as represented by overlap of bounding boxes). In some examples, each of the object feature, the human feature, the human pose, and the human-object relationship is provided as an embedding (a multi-dimensional vector). The HOI classifier 204 processes the object feature, the human feature, the pose feature, and the relationship feature as respective streams to determine one or more HOIs depicted in the image.), wherein the output text includes the relation of the subject to an object from the image and the location of the object in the image ([0045] For example, for the image 300, the HOI detection system can identify an object “phone,” a human, a pose of the human, and a relationship between the phone and the human (e.g., relationship between bounding boxes). The HOI detection system can generate HOI tuples for the image 300 that include two potential HOIs: “a person holding a phone” and “a person talking on a phone.” The HOI tuple can include a set of features, such as object feature, human feature, pose feature, relationship feature.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the known methods of tokenization and vision-language transformers, as taught by Li, with the known teaching of human-object interaction with an output text, as taught by Nguyen, in order to yield the predictable results of acquiring categories of the human, the object, the relationship, and positional information of the human and object, and adding tokens for each category. This allows the machine learning model to focus on individual objects (e.g., "cup," "phone") rather than on uniform pixels alone, mirroring human cognitive processes for better scene understanding and localization.
Regarding claims 2, 12, and 17, Li discloses the output text is generated based on the combined embedding ([0024] A video encoder 220 and a text encoder 222 encode video crops 202 of a video frame 102, and the text input 204 of text descriptions, respectively. The pre-training module 225 further includes an additional multimodal encoder 230 to further capture the interaction between the two modalities output from the video encoder 220 and text encoder 222.).
Regarding claims 3, 13, and 18, Li does not explicitly disclose but Nguyen teaches generating a first portion of the output text indicating the relation ([0045] For example, for the image 300, the HOI detection system can identify an object “phone,” a human, a pose of the human, and a relationship between the phone and the human (e.g., relationship between bounding boxes).); and
generating a second portion of the output text indicating the location of the object based on the relation ("[0029] The HOI classifier 204 can use the HOI tuples to generate scores for HOI classes. For example, the HOI classifier receives an image and human-object tuples from the HOI tuple generation module 202. In some examples (e.g., in an end-to-end arrangement), the HOI classifier extracts an object feature, a human feature, a pose feature, and a relationship feature (e.g., human-object position relationship as represented by overlap of bounding boxes).
[0045] For example, for the image 300, the HOI detection system can identify an object “phone,” a human, a pose of the human, and a relationship between the phone and the human (e.g., relationship between bounding boxes). The HOI detection system can generate HOI tuples for the image 300 that include two potential HOIs: “a person holding a phone” and “a person talking on a phone.” ").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Li’s disclosure of video and text embeddings with Nguyen’s teaching of object location, in order to achieve improved accuracy using Nguyen’s weighted factorization, which enables the ML models to be smaller and less complicated because accuracy in HOI detection is not dependent on the ML models alone (Nguyen [0061]).
Regarding claim 4, Li does not explicitly disclose but Nguyen teaches the location of the subject comprises coordinates for a bounding box surrounding the subject ([0028] In some examples, the HOI tuple generation module 202 determines features for each object and each human, as well as respective bounding boxes.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Li’s disclosure of video and text embeddings with Nguyen’s teaching of object location, in order to achieve improved accuracy using Nguyen’s weighted factorization, which enables the ML models to be smaller and less complicated because accuracy in HOI detection is not dependent on the ML models alone (Nguyen [0061]).
Regarding claim 5, Li does not explicitly disclose but Nguyen teaches the input text comprises a symbol between the subject and the location of the subject ([0028] In some examples, the bounding boxes overlap, the overlap indicative of a relationship between the human and the object. In some examples, object, human, interaction detection is performed by a pre-trained ML model prior to training of the HOI classifier 204 and the weighted factorization model 208. The HOI classifier 204 can obtain HOI tuples.),
and wherein the output text comprises the symbol between the relation and the location of the object (fig. 3, ref 304, cup bounded with the subject box).
Regarding claims 6 and 20, Li discloses obtaining the subject in the image ("fig. 7, ref 704
[0056] At step 704, a video encoder (e.g., 220 in FIG. 3) may encode the plurality of video frames into video feature representations."); and
generating the input text based on the obtaining ("fig. 7, ref 706
[0057] At step 706, a text encoder (e.g., 222 in FIG. 3) may encode the plurality of text descriptions into text feature representations.").
Regarding claim 7, Li does not explicitly disclose but Nguyen teaches modifying the image to obtain a modified image based on the subject, the object, and the relation of the subject to the object ("fig. 4, ref 402 (receive image) to ref 410, provide texts for image
[0028] In some examples, the HOI tuple generation module 202 determines features for each object and each human, as well as respective bounding boxes. (image is modified with bounding boxes)").
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Li’s disclosure of video and text embeddings with Nguyen’s teaching of object location, in order to achieve improved accuracy using Nguyen’s weighted factorization, which enables the ML models to be smaller and less complicated because accuracy in HOI detection is not dependent on the ML models alone (Nguyen [0061]).
Regarding claim 8, Li discloses receiving training data including a training image, a training input text including a subject from the training image, a ground-truth relation of the subject to an object from the training image, and a ground-truth location of the object in the training image ([0078] For finetuning on retrieval, the video-text matching head is used during pre-training and optimize the sum of both VTC and VTM losses. Similarity scores are computed from the output of VTM head during inference. For QA task, a simple MLP is added on the multimodal [CLS] token for classification and optimize the conventional cross-entropy loss between predictions and ground-truth answer labels. During inference, predictions are obtained as the answer with the highest probability. All the finetuning experiments are performed on 8 NVIDIA A100 GPUs, taking one to five hours to complete depending on the datasets.); encoding the image to obtain an image embedding ([0056] At step 704, a video encoder (e.g., 220 in FIG. 3) may encode the plurality of video frames into video feature representations. (This limitation is the same as claim 15));
encoding the input text to obtain a text embedding ([0031] In one embodiment, the text encoder 222 may be a 6-layer transformer model to represent text tokens in the text input 304. Given an input text description 304 of N.sub.t tokens, the text encoder 222 outputs an embedding sequence {t.sub.cls, t.sub.1, . . . , t.sub.N.sub.t}, with t.sub.iϵcustom-character.sup.d and t.sub.cls the embedding 316 of the text [CLS] token. (This limitation is the same as claim 16));
combining the image embedding and the text embedding to obtain a combined embedding ("claim 7: transforming an embedding of a video start token from the video encoder into a normalized video embedding; transforming an embedding of a text start token from the text encoder into a normalized text embedding; and computing a dot product of the normalized video embedding and the normalized text embedding.
claim 8: encoding, by a multi-modal video-text encoder, the video feature representations and the text feature representations into a set of multimodal embeddings; and generating, by a classifier, an entity prediction from the set of multimodal embeddings.");
an output text by transforming the combined embedding ([0019] A video-text contrastive loss is computed by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss)
to obtain a sequence of tokens ([0029] The video encoder 220, the TimeSformer, may first partitions each frame into K non-overlapping patches, which are flattened and fed to a linear projection layer 305 to produce a sequence of patch tokens. Learnable positional embeddings are also added to the patch tokens from the linear projection layer 305. Then the TimeSformer applies self-attention along the temporal and spatial dimensions separately in order, leading to per-frame features {tilde over (v)}ϵcustom-character.sup.N.sup.p.sup.×K×d with d the feature dimension.).
Li does not explicitly disclose but Nguyen teaches training, using the training data, a machine learning model ([0029] The HOI classifier 204 can use the HOI tuples to generate scores for HOI classes. For example, the HOI classifier receives an image and human-object tuples from the HOI tuple generation module 202. In some examples (e.g., in an end-to-end arrangement), the HOI classifier extracts an object feature, a human feature, a pose feature, and a relationship feature (e.g., human-object position relationship as represented by overlap of bounding boxes). In some examples, each of the object feature, the human feature, the human pose, and the human-object relationship is provided as an embedding (a multi-dimensional vector). The HOI classifier 204 processes the object feature, the human feature, the pose feature, and the relationship feature as respective streams to determine one or more HOIs depicted in the image.), wherein the output text includes the relation of the subject to the object and a location of the object in the training image ([0025] In further detail, during training, the HOI detection system 200 can receive the set of training images 220 as input. Each training image in the set of training images 220 is associated with at least one label in the set of labels. Each label includes an HOI that is represented in a training image and serves as a ground truth for determining loss during training, as described in further detail herein. The HOI detection system 200 can process each image in the set of training images 220 using the HOI tuple generation module 202 for object detection, human detection, and one or more potential interactions between the human(s) and object(s) detected in the image.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the known methods of tokenization and vision-language transformers, as taught by Li, with the known teaching of human-object interaction with an output text, as taught by Nguyen, in order to yield the predictable results of acquiring categories of the human, the object, the relationship, and positional information of the human and object, and adding tokens for each category. This allows the machine learning model to focus on individual objects (e.g., "cup," "phone") rather than on uniform pixels alone, mirroring human cognitive processes for better scene understanding and localization.
Regarding claim 9, Li does not explicitly disclose but Nguyen teaches obtaining additional training data including an additional training image, an additional training input including an additional subject from the additional training image, an additional ground-truth relation of the additional subject to an additional object from the additional training image, wherein the machine learning model is trained based on the additional training data ([0025] In further detail, during training, the HOI detection system 200 can receive the set of training images 220 as input. Each training image in the set of training images 220 is associated with at least one label in the set of labels. Each label includes an HOI that is represented in a training image and serves as a ground truth for determining loss during training, as described in further detail herein. The HOI detection system 200 can process each image in the set of training images 220 using the HOI tuple generation module 202 for object detection, human detection, and one or more potential interactions between the human(s) and object(s) detected in the image.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Li’s disclosure of video and text embeddings with Nguyen’s teaching of object location, in order to achieve improved accuracy using Nguyen’s weighted factorization, which enables the ML models to be smaller and less complicated because accuracy in HOI detection is not dependent on the ML models alone (Nguyen [0061]).
Regarding claim 10, Li implicitly discloses obtaining a plurality of captions (fig. 4, ref 203; image media_image1.png omitted); and
parsing a caption of the plurality of captions, wherein the additional training input is obtained based on the parsing (fig. 4, ref 216).
Li does not explicitly disclose but Nguyen teaches obtaining a plurality of captions ([0053] Text embeddings are provided (410). For example, and as described herein, the HOI tuples can be used to determine texts of potential HOIs. For each HOI, a language embedder can convert the respective human, object, interaction text into a text embedding (multi-dimensional vector). In some examples, the text embedding includes word embedding or sentence embedding.); and
parsing a caption of the plurality of captions, wherein the additional training input is obtained based on the parsing ([0043] In accordance with implementations of the present disclosure, text of each HOI is provided to the language embedder 206, which converts the texts into respective word embeddings (or sentence embeddings), as described herein to provide a set of text embeddings for each HOI. The weighted factorization model 208′ processes the sets of text embeddings to generate sets of weights for the image 252.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Li’s disclosure of video and text embeddings with Nguyen’s teaching of object location, in order to achieve improved accuracy using Nguyen’s weighted factorization, which enables the ML models to be smaller and less complicated because accuracy in HOI detection is not dependent on the ML models alone (Nguyen [0061]).
Regarding claim 11, Li discloses generating a predicted output text based on the image embedding and the text embedding (claim 8. The method of claim 1, further comprising: encoding, by a multi-modal video-text encoder, the video feature representations and the text feature representations into a set of multimodal embeddings; and generating, by a classifier, an entity prediction from the set of multimodal embeddings.); and
computing a loss function based on the predicted output text, the ground-truth relation, and the ground-truth location, wherein the machine learning model is trained based on the loss function ("[0019] A video-text contrastive loss is computed by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss. In this way, instance-level alignment is learned by applying the video-text contrastive loss on the unimodal features, which encourages paired video-text instances to have similar representations
[0042] The resulting embeddings from the video encoder 220, text encoder 222 are then passed to the multi-modal encoder 230 to generate embeddings for the MLM loss module 514. The MLM loss module 514 may then predict the masked text tokens and compare the predicted masked tokens with the actual masked tokens to compute a MLM loss.
[0069] At step 812, a first loss may be computed based on a cross-entropy between the entity prediction and the entity pseudo label, e.g., according to Eq. (5).
[0078] For QA task, a simple MLP is added on the multimodal [CLS] token for classification and optimize the conventional cross-entropy loss between predictions and ground-truth answer labels. During inference, predictions are obtained as the answer with the highest probability. All the finetuning experiments are performed on 8 NVIDIA A100 GPUs, taking one to five hours to complete depending on the datasets.").
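For illustration, the cross-entropy loss that Li computes between predictions and ground-truth labels ([0069], [0078]) can be sketched as follows (a minimal sketch only; the function name and logit layout are assumptions, not part of the cited record):

```python
import numpy as np

def cross_entropy_loss(logits: np.ndarray, label: int) -> float:
    """Cross-entropy between a predicted distribution and a ground-truth label,
    as in Li's QA head ([0078]); inference takes the highest-probability answer."""
    shifted = logits - logits.max()                  # stabilize the softmax numerically
    probs = np.exp(shifted) / np.exp(shifted).sum()  # predicted probability distribution
    return float(-np.log(probs[label]))              # negative log-likelihood of the label
```

The loss is near zero when the model concentrates probability on the ground-truth answer and grows as probability shifts to other answers.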
Regarding claim 19, Li discloses a training component configured to compute a loss function based on training data and to train the vision-language transformer based on the loss function ([0019] A video-text contrastive loss is computed by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss. In this way, instance-level alignment is learned by applying the video-text contrastive loss on the unimodal features, which encourages paired video-text instances to have similar representations).
Response to Arguments
Applicant's arguments filed 12/17/2025 have been fully considered but they are not persuasive.
Regarding argument 1 (page 3), that the cited references do not disclose or teach the new limitations of generating, using a vision-language transformer, an output text by transforming the combined embedding to obtain a sequence of tokens including a first token corresponding to an object from the image, a second token corresponding to a relation of the subject to the object, and a third token corresponding to a location of the object in the image, the examiner most respectfully disagrees.
Prior art Li is relied upon to disclose a vision-language transformer in: [0019] Sampled frames and texts are independently encoded using a transformer-based video encoder and a text encoder, respectively.
Tokens in: [0029] The video encoder 220, the TimeSformer, may first partitions each frame into K non-overlapping patches, which are flattened and fed to a linear projection layer 305 to produce a sequence of patch tokens
an output text by transforming the combined embedding ([0019] A video-text contrastive loss is computed by comparing the outputs from the video encoder and the text encoder. The video encoder and the text encoder may then be jointly updated by at least the video-text contrastive loss)
to obtain a sequence of tokens ([0029] The video encoder 220, the TimeSformer, may first partitions each frame into K non-overlapping patches, which are flattened and fed to a linear projection layer 305 to produce a sequence of patch tokens. Learnable positional embeddings are also added to the patch tokens from the linear projection layer 305. Then the TimeSformer applies self-attention along the temporal and spatial dimensions separately in order, leading to per-frame features {tilde over (v)}ϵcustom-character.sup.N.sup.p.sup.×K×d with d the feature dimension.).
And combining image and text embeddings: claim 7: transforming an embedding of a video start token from the video encoder into a normalized video embedding; transforming an embedding of a text start token from the text encoder into a normalized text embedding; and computing a dot product of the normalized video embedding and the normalized text embedding.
claim 8: encoding, by a multi-modal video-text encoder, the video feature representations and the text feature representations into a set of multimodal embeddings; and generating, by a classifier, an entity prediction from the set of multimodal embeddings.
Prior art Nguyen is relied upon to teach ([0029] The HOI classifier 204 can use the HOI tuples to generate scores for HOI classes. For example, the HOI classifier receives an image and human-object tuples from the HOI tuple generation module 202. In some examples (e.g., in an end-to-end arrangement), the HOI classifier extracts an object feature, a human feature, a pose feature, and a relationship feature (e.g., human-object position relationship as represented by overlap of bounding boxes). In some examples, each of the object feature, the human feature, the human pose, and the human-object relationship is provided as an embedding (a multi-dimensional vector). The HOI classifier 204 processes the object feature, the human feature, the pose feature, and the relationship feature as respective streams to determine one or more HOIs depicted in the image.), wherein the output text includes the relation of the subject to an object from the image and the location of the object in the image ([0045] For example, for the image 300, the HOI detection system can identify an object “phone,” a human, a pose of the human, and a relationship between the phone and the human (e.g., relationship between bounding boxes). The HOI detection system can generate HOI tuples for the image 300 that include two potential HOIs: “a person holding a phone” and “a person talking on a phone.” The HOI tuple can include a set of features, such as object feature, human feature, pose feature, relationship feature.).
(Image media_image2.png omitted.)
The instant application teaches:
[0047] Image processing apparatus 110 generates, via a decoder, an output text based on the image embedding and the text embedding. The output text includes a relation of the subject to an object from the image and a location of the object in the image. In the above example, image processing apparatus 110 generates at least three instances of output text, e.g., “person wearing hoodie”, “person wearing shorts”, and “person wearing tennis shoes”. An identified object is “hoodie” and the relation of the subject to the object is “wear” or “wearing”. The location of object “hoodie” is indicated by an additional bounding box surrounding the object “hoodie”. In some cases, the output text includes location coordinates of the object.
Fig. 3 (Nguyen) has a bounding box around the object as well as a different bounding box around the human. Under the broadest reasonable interpretation (BRI), one of ordinary skill in the art can determine the location of the object (cup, phone) from the image sentence embedding (S1 and S2) and the bounding box. Since only a person and an object are detected, each in its own bounding box, the text shows the location of the object as well as the relationship between the object and the human. Accordingly, “person wearing hoodie,” with a bounding box around the hoodie, as taught in the instant application, would be the same as “person drinking with cup,” with a bounding box around a cup, as taught by prior art Nguyen. A person drinking with a cup would mean, under BRI, to one of ordinary skill in the art, that the cup (object) is positioned near the person’s face.
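For illustration, the bounding-box overlap that Nguyen uses as a relationship cue ([0029]) reduces to an intersection-over-union computation. A minimal sketch (illustrative only; the (xmin, ymin, xmax, ymax) box format and function name are assumptions, not part of the cited record):

```python
def box_overlap(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes; the
    human-object overlap serves as Nguyen's relationship feature ([0029])."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    union = area(a) + area(b) - inter
    return inter / union if union else 0.0
```

A high overlap between the person's box and the cup's box is the positional cue that supports an interaction such as "a person drinking with a cup."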
Regarding argument 2 (page 13), that the cited references do not provide a motivation to combine the limitations of generating, using a vision-language transformer, an output text by transforming the combined embedding to obtain a sequence of tokens including a first token corresponding to an object from the image, a second token corresponding to a relation of the subject to the object, and a third token corresponding to a location of the object in the image, wherein the output text includes the relation of the subject to an object from the image and the location of the object in the image, the examiner most respectfully disagrees.
These two references are relied upon to teach two different features. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the known methods of tokenization and vision-language transformers, as taught by Li, with the known teaching of human-object interaction with an output text, as taught by Nguyen, in order to yield the predictable results of acquiring categories of the human, the object, the relationship, and positional information of the human and object, and adding tokens for each category. This allows the machine learning model to focus on individual objects (e.g., "cup," "phone") rather than on uniform pixels alone, mirroring human cognitive processes for better scene understanding and localization.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
US 12346806 B1 with regard to claim 5 in Col 7, lines 32-44: As shown in FIG. 2, each labeled item of a scene image 204 has three pieces of information: the bounding box coordinates 206 (xmin, ymin, xmax, ymax) annotated on the original image, a class label 208, and a plain-background image 211 associated to the relevant item from scene image 204. For example, the sofa in scene image 204 is associated with bounding box coordinates 206, class label 208 (e.g., “sofa”), and a plain-background image 211 of the particular couch. During training, the objects in a scene image 204 may be divided (e.g., randomly or pseudo-randomly) into two sets (or sequences)—input set 218 and output set 220. The training goal is to predict the tokens in the output set (or sequence) 220 using the input set 218.
US 20240153239 A1 with respect to localizing texts with objects in claim 1 and coordinates of bounding box in claim 4: [0057] The position (or index) embeddings 452 include the bounding box identifiers 416, each representing an index of the respective object (or bounding box for the object) and the text identifiers 426, each representing an index or position for each word in the text. The bounding box identifiers 416 in the position embeddings 452 are illustrated in FIG. 4C as Q0, Q1, . . . Q_n (each relating to an index of BOX1, BOX2, . . . BOX_n, respectively). The text identifiers 426 are illustrated in FIG. 4C as P2, P3, P4, P5, P6 (relating to TEXT1, TEXT2, TEXT3, TEXT4 and TEXT5, respectively). As illustrated in FIG. 4C, the special tokens are allocated position embeddings P0 (for the <CLS>token), P1 (for the <s>token), and P7 (for the </s>token).
[0060] The NN 480 is trained to classify the CLS embeddings 434 and provide as output the object of interest 440 as an index of the bounding box (or object) providing the best match to the text 104 associated with the image 102.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AHMED A NASHER whose telephone number is (571)272-1885. The examiner can normally be reached Mon - Fri 0800 - 1700.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Moyer can be reached at (571) 272-9523. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AHMED A NASHER/Examiner, Art Unit 2675
/ANDREW M MOYER/Supervisory Patent Examiner, Art Unit 2675