Last updated: May 29, 2026

Application No. 18/690,550

DETECTING OBJECTS IN IMAGES BY GENERATING SEQUENCES OF TOKENS

Non-Final OA §102 §103

Filed

Mar 08, 2024

Priority

Sep 17, 2021 — provisional 63/245,783 +2 more

Examiner

SHERRILLO, DYLAN JOSEPH

Art Unit

2665

Tech Center

2600 — Communications

Assignee

Google LLC

OA Round

1 (Non-Final)

Interview Optional

Interview lift (+11.4%) is below the 15.0% threshold. A written response is recommended.

Based on 44 resolved cases, 2023–2026

Examiner Intelligence

SHERRILLO, DYLAN JOSEPH

Grants 91% — above average

Career Allowance Rate

40 granted / 44 resolved

+28.9% vs TC avg

+11.4% (moderate)

Interview Lift

comparing resolved cases with and without an interview

Typical timeline

2y 10m

Avg Prosecution

8 currently pending

Career history

Total Applications

across all art units

Statute-Specific Performance

§101: 1.5% (-38.5% vs TC avg)

§103: 72.1% (+32.1% vs TC avg)

§102: 19.1% (-20.9% vs TC avg)

§112: 2.9% (-37.1% vs TC avg)

Based on career data from 44 resolved cases

Office Action

§102 §103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 4/17/2025, 12/01/2025, and 1/30/2026 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Status of Claim(s)
Claim(s) 8-12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim(s) 1-6 and 13-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jason Beal (NPL: Toward Transformer-Based Object Detection).
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Jason Beal (NPL: Toward Transformer-Based Object Detection) in view of Wikipedia Contributors (NPL: Attention (machine learning)).

Claim Objections
Claims 8-12 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claim(s) 1-6 and 13-20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jason Beal (NPL: Toward Transformer-Based Object Detection).

Regarding Claim 1:
	Beal teaches: A method performed by one or more computers, the method comprising (Page 2, “We believe ViT-FRCNN shows that the commonly applied paradigm of large scale pretraining on massive datasets followed by rapid fine-tuning to specific tasks can be scaled up even further in the field of computer vision, owing to the model capacity observed in transformer-based architectures and the flexible features learned in such backbones”):
obtaining an input image (Abstract, “The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input,”); 
processing the input image using an object detection neural network to generate an output sequence that comprises respective token at each of a plurality of time steps, wherein each token is selected from a vocabulary of tokens that comprises (i) a first set of tokens that each represent a respective discrete number from a set of discretized numbers and (ii) a second set of tokens that each represent a respective object category from a set of object categories (Page 3, “The RPN identifies regions of interest likely to contain objects by producing multiple predictions per location on the feature map: each prediction corresponds to a different anchor of varying size and aspect ratio, centered at the location of the feature. Each prediction consists of a binary classification (object vs. no object) and a regression to box coordinates. The bounding boxes are predicted as offsets from anchor boxes, using the parameterization…”); and 
generating, from the tokens in the output sequence, data identifying one or more bounding boxes in the input image and, for each bounding box, a respective object category from the set of object categories to which an object depicted in the bounding box belongs (Page 7, “We construct a novel subset of ObjectNet, referred to as OBJECTNET-D, by selecting the COCO 2017 overlapping categories with bounding box annotations that were created in the reanalysis of the dataset. The OBJECTNET-D data set consists of 4,971 test images and 23 object categories. Nearly all of the objects (99.9%) in this dataset are “large” per the object size definitions in COCO 2017. Therefore, we do not focus on the “small” and “medium” object detection performance in our analysis. Figure 5 visualizes a few indicative samples from the OBJECTNET-D dataset. This dataset enables a robust test of the detection models with respect to domain shift. The models are evaluated on the corresponding categories without any kind of fine-tuning.”).
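
As a rough illustration of the vocabulary recited in claim 1, the following Python sketch builds a token space from a set of discretized numbers followed by a set of object categories. It is not taken from the application or from Beal: the bin count `NUM_BINS`, the category list, and the helper names `quantize`, `dequantize`, and `category_token` are all assumptions made for the example.

```python
# Illustrative sketch only; every name and size below is hypothetical.
NUM_BINS = 1000                          # assumed size of the discretized-number set
CATEGORIES = ["person", "dog", "car"]    # assumed object-category set


def quantize(coord: float, extent: float) -> int:
    """Map a pixel coordinate in [0, extent] to a number token in [0, NUM_BINS - 1]."""
    frac = min(max(coord / extent, 0.0), 1.0)
    return min(int(frac * NUM_BINS), NUM_BINS - 1)


def dequantize(token: int, extent: float) -> float:
    """Map a number token back to the pixel coordinate at the center of its bin."""
    return (token + 0.5) / NUM_BINS * extent


def category_token(name: str) -> int:
    """Category tokens are assigned the ids that follow the number tokens."""
    return NUM_BINS + CATEGORIES.index(name)


VOCAB_SIZE = NUM_BINS + len(CATEGORIES)
```

Under these assumptions a single output head over `VOCAB_SIZE` ids can emit either a coordinate or a category, which is the property the claim language turns on.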

	Regarding Claim 2:
	Beal teaches: The method of claim 1, and further teaches wherein the output sequence comprises a respective subsequence corresponding to each of the one or more bounding boxes, and wherein generating the data identifying the one or more bounding boxes comprises, for each bounding box (Fig. 1 and page 2, “We now describe our model, ViT-FRCNN, which augments a Vision Transformer backbone with a detection network so that it can produce bounding box classifications and coordinates. In doing so, we demonstrate that the ViT is capable of transferring representations learned for classifications to other tasks such as object detection, paving the path for a general class of transformer-based vision models.”):
identifying, from tokens in the corresponding subsequence that belong to the first set of tokens, coordinates of the bounding box in the input image (Page 2, Section 3. Method, Production of coordinates for bounding boxes); and 
identifying, as the respective object category to which the object depicted in the bounding box belongs, the object category represented by a token in the corresponding subsequence that belongs to the second set of tokens (Page 2, Section 3. Method, Production of classifications for bounding boxes).

	Regarding Claim 3:
	Beal teaches: The method of claim 2, wherein the respective subsequence includes four tokens from the first set of tokens and wherein the four discrete numbers that are represented by the four tokens specify coordinates in the input image of two corners of the bounding box (Fig. 1, “Detection Network”, figure shows coordinates of corners for box).

Regarding Claim 4:
	Beal teaches: The method of claim 2, wherein the respective subsequence includes four tokens from the first set of tokens and wherein the four discrete numbers that are represented by the four tokens specify coordinates in the input image of a center of the bounding box and a height and width of the bounding box (Page 3, Column 1, Bounding boxes can determine box center, width, and height for coordinates of bounding box).
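
To make the subsequence language of claims 2-4 concrete, the sketch below reads an output sequence in groups of five tokens (four number tokens, then one category token) and converts between the two-corner form of claim 3 and the center/width/height form of claim 4. The five-token layout, the coordinate order, `NUM_BINS`, and the category list are assumptions for illustration, not details confirmed by the application or by Beal.

```python
from typing import List, Tuple

NUM_BINS = 1000                          # hypothetical number of coordinate bins
CATEGORIES = ["person", "dog", "car"]    # hypothetical category set

Box = Tuple[float, float, float, float]


def decode_boxes(tokens: List[int]) -> List[Tuple[Box, str]]:
    """Read consecutive 5-token subsequences as (x1, y1, x2, y2, category)."""
    out = []
    for i in range(0, len(tokens) - 4, 5):
        *coords, cls = tokens[i:i + 5]
        is_category = NUM_BINS <= cls < NUM_BINS + len(CATEGORIES)
        if max(coords) >= NUM_BINS or not is_category:
            continue  # skip subsequences that do not fit the assumed layout
        # Normalize each number token to [0, 1] relative to the image extent.
        x1, y1, x2, y2 = (c / (NUM_BINS - 1) for c in coords)
        out.append(((x1, y1, x2, y2), CATEGORIES[cls - NUM_BINS]))
    return out


def corners_to_center(box: Box) -> Box:
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
```

For example, `decode_boxes([0, 0, 999, 999, 1001])` yields one full-image box labeled `"dog"` under these assumptions.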

	Regarding Claim 5:
	Beal teaches: The method of claim 1, wherein processing the input image using the object detection neural network comprises:
processing the input image using an encoder neural network to generate an encoded representation of the input image (Fig. 1, Encoded representation of image split into parts); and 
processing the encoded representation of the input image using a decoder neural network to generate the output sequence (Page 2, ViT and Faster R-CNN process the encoded representation into an output).

Regarding Claim 6:
	Beal teaches: The method of claim 1, wherein the object detection neural network is configured to generate a respective score distribution over the tokens in the vocabulary for each time step conditioned on (i) the input image and (ii) the tokens at any earlier time steps in the output sequence, and wherein processing the input image using the object detection neural network to generate an output sequence comprises, for each time step (Page 3, “The RPN identifies regions of interest likely to contain objects by producing multiple predictions per location on the feature map: each prediction corresponds to a different anchor of varying size and aspect ratio, centered at the location of the feature. Each prediction consists of a binary classification (object vs. no object) and a regression to box coordinates. The bounding boxes are predicted as offsets from anchor boxes, using the parameterization…”):
selecting the respective token at the time step in the output sequence using the respective score distribution generated by the object detection neural network for the time step (Page 3, 3.1 Implementation details, 2000 discrete steps are identified in detection and object detection).
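
The time-step conditioning recited in claim 6 can be pictured as a generation loop in which each token is drawn from a distribution that depends on the image and on the tokens already produced. The sketch below is a minimal illustration under stated assumptions: `generate`, `toy_score_fn`, and the `EOS` id are hypothetical, and the toy scorer merely stands in for a trained network.

```python
import random
from typing import Callable, List, Sequence

EOS = 0  # hypothetical end-of-sequence token id


def generate(score_fn: Callable[[object, List[int]], Sequence[float]],
             image: object, max_steps: int, rng: random.Random) -> List[int]:
    """At each time step, score the vocabulary given the image and earlier tokens, then sample."""
    tokens: List[int] = []
    for _ in range(max_steps):
        probs = score_fn(image, tokens)  # distribution over the vocabulary
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def toy_score_fn(image: object, prefix: List[int]) -> List[float]:
    """Stand-in for a trained network: puts all mass on 3, then 2, then 1, then EOS."""
    target = [3, 2, 1, EOS][min(len(prefix), 3)]
    return [1.0 if t == target else 0.0 for t in range(4)]
```

The point of the sketch is the data flow: the distribution at step t is a function of the prefix, so later tokens cannot be computed independently of earlier ones.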

	Regarding Claim 13:
	Beal teaches: The method of claim 1, further comprising:
outputting the data identifying the one or more bounding boxes in the input image and, for each bounding box, the respective object category from the set of object categories to which the object depicted in the bounding box belongs (Fig. 1 and Page 2, Right column, final paragraph discusses categorization of bounding boxes from input images).

Regarding Claim 14:
	Beal teaches: A method of training an object detection neural network, the method comprising (Page 2, “We believe ViT-FRCNN shows that the commonly applied paradigm of large scale pretraining on massive datasets followed by rapid fine-tuning to specific tasks can be scaled up even further in the field of computer vision, owing to the model capacity observed in transformer-based architectures and the flexible features learned in such backbones”):
obtaining a batch of training images and, for each training image, a target output that identifies one or more ground truth bounding boxes in the image and a respective ground truth object category for each bounding box (Page 7, Section 4.3 Curriculum pretraining, Paragraph 2, “For this investigation, we consider OpenImages V6 [20], a supervised data set consisting of 1.7 million images, with 15.8 million bounding boxes and 600 categories. Relative to COCO 2017 train, which consists of 118k images and 860k bounding boxes, this data set is an order of magnitude larger in terms of image and bounding box count.”); 
for each training image, generating a target output sequence that includes, for each ground truth bounding box, a respective subsequence that includes (i) a set of first tokens that define a location of the bounding box in the image and (ii) a second token that represents the ground truth object category for the bounding box (Page 3, “The RPN identifies regions of interest likely to contain objects by producing multiple predictions per location on the feature map: each prediction corresponds to a different anchor of varying size and aspect ratio, centered at the location of the feature. Each prediction consists of a binary classification (object vs. no object) and a regression to box coordinates. The bounding boxes are predicted as offsets from anchor boxes, using the parameterization…”); and 
training the object detection neural network to maximize, for each training image and for each token in at least a subset of the tokens in the target output sequence for the training image, a log likelihood of the token conditioned on any preceding tokens in the target output sequence and the training image (Page 7, Section 4.3 Curriculum pretraining, “The simplified Transformer model is pretrained for 100 epochs on the Open Images V6 dataset, using the AdamW [24] optimizer with a base learning rate of 3e-4, weight decay of 0.1, and a total batch size of 4,096. The ViT-B/32 backbone is first pretrained on ImageNet-21k for these curriculum pretraining experiments. As seen in Table 6, the addition of the pretraining phase on OpenImages V6 yields a +1.1 AP improvement for the ViT-B/32-FRCNN model, and a +0.4 AP improvement for the ViT-B/32-FRCNN model with overlapping patches. This phase of pretraining is shown to be most beneficial for improving the performance on small and medium objects.”).
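
The training objective recited in claim 14 can be sketched as a weighted sum of per-token log likelihoods over a flattened target sequence. The code below is an assumption-laden illustration, not the application's training procedure: `build_target`, `log_likelihood`, the five-token box layout, and the per-token `weights` (used here to model "at least a subset of the tokens") are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def build_target(boxes: List[Tuple[int, int, int, int, int]]) -> List[int]:
    """Flatten (x1, y1, x2, y2, category) token tuples into one target sequence."""
    return [tok for box in boxes for tok in box]


def log_likelihood(step_probs: Sequence[Sequence[float]], target: List[int],
                   weights: Sequence[float]) -> float:
    """Weighted sum of log p(target[j] | target[:j], image).

    step_probs[j] is assumed to be the model's distribution at step j when the
    ground truth prefix target[:j] is fed in (teacher forcing). Tokens with a
    zero weight are excluded from the objective.
    """
    return sum(w * math.log(p[t])
               for p, t, w in zip(step_probs, target, weights) if w > 0)
```

Maximizing this quantity is equivalent to minimizing a weighted cross-entropy loss over the same tokens.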

	Regarding Claim 15:
	Beal teaches: The method of claim 14, wherein obtaining a batch of training images and, for each training image, a target output that identifies one or more ground truth bounding boxes in the image and a respective ground truth object category for each bounding box comprises:
generating one or more of the training images in the batch by applying one or more image augmentation policies to a corresponding initial training image (Pages 6-7, Final paragraph of page 6 to first 3 lines of page 7).

	Regarding Claim 16:
	Beal teaches: The method of claim 14, wherein obtaining a batch of training images and, for each training image, a target output that identifies one or more ground truth bounding boxes in the image and a respective ground truth object category for each bounding box comprises:
for a particular bounding box in a particular training image, generating the bounding box by applying noise to an initial ground truth bounding box in the particular training image (Page 4, Alterations to resolution for identification of bounding boxes to training images).

	Regarding Claim 17:
	Beal teaches: The method of claim 14, wherein for each training image, generating a target output sequence comprises:
generating one or more random bounding boxes in the training image (Page 3, Predicted bounding boxes on training images); and
for each random bounding box, including, in the target output sequence, (i) a set of first tokens that define a location of the random bounding box in the training image and (ii) a second token that represents a noise object category that is not in the set of object categories (Page 3, Location of bounding boxes defined by coordinates and resolution is altered to account for noise).

	Regarding Claim 18:
	Beal teaches: The method of claim 17, wherein the object detection neural network is not trained to maximize the log likelihood of the tokens in the sets of first tokens for the random bounding boxes (Page 5, Figure 3, use of non-maximum suppression for detection of objects using bounding boxes).
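
Claims 16-18 describe three related training-time operations: perturbing a ground truth box, adding random boxes labeled with a noise category, and excluding the random boxes' coordinate tokens from the likelihood objective. The sketch below illustrates one way these could fit together. All names (`jitter_box`, `random_box`, `append_noise_boxes`, `NOISE_TOKEN`) are hypothetical, and the choice to keep a nonzero weight on the noise category token is an assumption the claims do not state.

```python
import random
from typing import List, Tuple

NUM_BINS = 1000               # hypothetical number of coordinate bins
NOISE_TOKEN = NUM_BINS + 3    # hypothetical id outside the real category ids

BoxTokens = Tuple[int, int, int, int]


def jitter_box(box: BoxTokens, scale: int, rng: random.Random) -> BoxTokens:
    """Perturb each coordinate token of a ground truth box, clamped to valid bins."""
    return tuple(min(max(c + rng.randint(-scale, scale), 0), NUM_BINS - 1)
                 for c in box)


def random_box(rng: random.Random) -> BoxTokens:
    """Draw a random box with ordered corners (x1 <= x2, y1 <= y2)."""
    xs = sorted(rng.randrange(NUM_BINS) for _ in range(2))
    ys = sorted(rng.randrange(NUM_BINS) for _ in range(2))
    return (xs[0], ys[0], xs[1], ys[1])


def append_noise_boxes(target: List[int], weights: List[float], n: int,
                       rng: random.Random) -> Tuple[List[int], List[float]]:
    """Append n random boxes with the noise category; zero loss weight on their coordinates."""
    for _ in range(n):
        target = target + list(random_box(rng)) + [NOISE_TOKEN]
        weights = weights + [0.0] * 4 + [1.0]  # assumed: only the noise label is trained
    return target, weights
```

The returned `weights` list is the kind of per-token mask a weighted log-likelihood objective would consume.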

Regarding Claim 19:
	Beal teaches: The method of claim 14, wherein for each training image, generating a target output sequence comprises:
ordering the respective subsequences in a random order within the target output sequence (Page 3, Figure 1 and Paragraph 2, Identified objects can be ordered in any order to create an output).
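
The random ordering recited in claim 19 amounts to shuffling whole per-box subsequences before flattening them, so each box's tokens stay contiguous. A minimal sketch follows; the function name `shuffled_target` and the five-token subsequences in the usage note are hypothetical.

```python
import random
from typing import List, Tuple


def shuffled_target(subsequences: List[Tuple[int, ...]],
                    rng: random.Random) -> List[int]:
    """Shuffle per-box subsequences as units, then flatten into one sequence."""
    order = list(subsequences)   # copy so the caller's list is left unchanged
    rng.shuffle(order)
    return [tok for sub in order for tok in sub]
```

For instance, with subsequences `(1, 2, 3, 4, 1000)` and `(5, 6, 7, 8, 1001)`, the result is one of the two concatenations, never an interleaving.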

	Regarding Claim 20:
	Beal teaches: A system comprising (Page 2, “We believe ViT-FRCNN shows that the commonly applied paradigm of large scale pretraining on massive datasets followed by rapid fine-tuning to specific tasks can be scaled up even further in the field of computer vision, owing to the model capacity observed in transformer-based architectures and the flexible features learned in such backbones”):
one or more computers (Page 2, “We believe ViT-FRCNN shows that the commonly applied paradigm of large scale pretraining on massive datasets followed by rapid fine-tuning to specific tasks can be scaled up even further in the field of computer vision, owing to the model capacity observed in transformer-based architectures and the flexible features learned in such backbones”); and
one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising (Page 2, “We believe ViT-FRCNN shows that the commonly applied paradigm of large scale pretraining on massive datasets followed by rapid fine-tuning to specific tasks can be scaled up even further in the field of computer vision, owing to the model capacity observed in transformer-based architectures and the flexible features learned in such backbones”): [One of ordinary skill in the art can identify that a computer will include a storage device included for image processing as shown by Beal]
obtaining an input image (Abstract, “The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input,”);
	processing the input image using an object detection neural network to generate an output sequence that comprises respective token at each of a plurality of time steps, wherein each token is selected from a vocabulary of tokens that comprises (i) a first set of tokens that each represent a respective discrete number from a set of discretized numbers and (ii) a second set of tokens that each represent a respective object category from a set of object categories (Page 3, “The RPN identifies regions of interest likely to contain objects by producing multiple predictions per location on the feature map: each prediction corresponds to a different anchor of varying size and aspect ratio, centered at the location of the feature. Each prediction consists of a binary classification (object vs. no object) and a regression to box coordinates. The bounding boxes are predicted as offsets from anchor boxes, using the parameterization…”); and
	generating, from the tokens in the output sequence, data identifying one or more bounding boxes in the input image and, for each bounding box, a respective object category from the set of object categories to which an object depicted in the bounding box belongs (Page 7, “We construct a novel subset of ObjectNet, referred to as OBJECTNET-D, by selecting the COCO 2017 overlapping categories with bounding box annotations that were created in the reanalysis of the dataset. The OBJECTNET-D data set consists of 4,971 test images and 23 object categories. Nearly all of the objects (99.9%) in this dataset are “large” per the object size definitions in COCO 2017. Therefore, we do not focus on the “small” and “medium” object detection performance in our analysis. Figure 5 visualizes a few indicative samples from the OBJECTNET-D dataset. This dataset enables a robust test of the detection models with respect to domain shift. The models are evaluated on the corresponding categories without any kind of fine-tuning.”).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Jason Beal (NPL: Toward Transformer-Based Object Detection) in view of Wikipedia Contributors (NPL: Attention (machine learning)).

Regarding Claim 7:
Beal teaches the methods of claim 6 as applied above.
	Beal does not explicitly teach the following; however, in related art, Wikipedia teaches: wherein selecting the respective token comprises selecting the token with the highest score in the respective score distribution (Wikipedia, Page 3 and Figure 2, Interpreting attention weights, Highest scores are demonstrated on a diagonal matrix and selection of score is based on user preference.).
	Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of attention machine learning functions such as selection of scores from a distribution with that of Beal’s transformer vision object detection system.
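
For reference, the selection step recited in claim 7 (choosing the token with the highest score in a per-step distribution over the vocabulary) reduces to an argmax. The helper below is a hypothetical one-line illustration and says nothing about how either cited reference operates.

```python
from typing import Sequence


def select_highest(scores: Sequence[float]) -> int:
    """Return the token id with the highest score (the first such id on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Note that the input here is a score per vocabulary token at one time step, which is the quantity the claim refers to.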

Relevant Prior Art Directed to State of Art
MISRA (US 20240380923 A1)
Burke (US 20240045928 A1)
FINLAY (US 20240354553 A1)

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DYLAN J SHERRILLO whose telephone number is (703)756-5605. The examiner can normally be reached 1st week of bi-week: Mon-Wed 7am-5:30pm PST, Thurs: 7am-4:30pm PST, Fri off / 2nd week of bi-week: Mon-Wed 7am-5:30pm PST, Thurs-Fri: 7am-4:30pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen R Koziol, can be reached at (408) 918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/D.J.S./Examiner, Art Unit 2665
/Stephen R Koziol/Supervisory Patent Examiner, Art Unit 2665


Prosecution Timeline

Mar 08, 2024

Application Filed

Apr 07, 2026

Non-Final Rejection mailed — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

17/933,756

Patent 12608830

DISTANCE ESTIMATION USING A GEOMETRICAL DISTANCE AWARE MACHINE LEARNING MODEL

3y 7m to grant Granted Apr 21, 2026

18/453,389

Patent 12591907

SYSTEM AND METHOD TO DETECT A GAZE AT AN OBJECT BY UTILIZING AN IMAGE SENSOR

2y 7m to grant Granted Mar 31, 2026

18/064,132

Patent 12579798

IMAGE PROCESSING METHOD AND APPARATUS

3y 3m to grant Granted Mar 17, 2026

18/081,195

Patent 12567166

DEVICE FOR PROCESSING IMAGE AND OPERATING METHOD THEREOF

3y 2m to grant Granted Mar 03, 2026

17/928,087

Patent 12541825

MODEL TRAINING METHOD, IMAGE PROCESSING METHOD, COMPUTING AND PROCESSING DEVICE AND NON-TRANSIENT COMPUTER-READABLE MEDIUM

3y 2m to grant Granted Feb 03, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2

Expected OA Rounds

91%

Grant Probability

99%

With Interview (+11.4%)

2y 10m (~7m remaining)

Median Time to Grant

Low

PTA Risk

Based on 44 resolved cases by this examiner. Grant probability derived from career allowance rate.
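
The headline figures above can be reproduced with simple arithmetic from the numbers reported on this page. The sketch below is a hedged reconstruction: the 99% ceiling (`CAP`) and the use of `>=` for the 15.0% threshold are assumptions, since the page does not state how those values are applied.

```python
# Figures reported on this page.
granted, resolved = 40, 44
INTERVIEW_LIFT = 11.4      # percentage points
LIFT_THRESHOLD = 15.0      # percentage points

# Assumption: the "With Interview" figure is capped; the page does not say how 99% is derived.
CAP = 99.0

allowance_rate = 100 * granted / resolved                    # about 90.9, shown as 91%
with_interview = min(allowance_rate + INTERVIEW_LIFT, CAP)   # capped at 99.0
recommend_interview = INTERVIEW_LIFT >= LIFT_THRESHOLD       # False: written response suggested
```

Under these assumptions the uncapped sum would exceed 100%, which is consistent with the page showing 99% rather than 91% + 11.4%.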