Last updated: May 29, 2026
Application No. 18/476,101
VIDEO FACE CLUSTERING

Final Rejection §103
Filed: Sep 27, 2023
Examiner: RODRIGUEZ, ANTHONY JASON
Art Unit: 2672
Tech Center: 2600 (Communications)
Assignee: Flawless Holdings Limited
OA Round: 2 (Final)
Interview Optional

Interview lift (-23.5%) is below the 15.0% threshold. A written response is recommended.
Based on 21 resolved cases, 2023–2026
Examiner Intelligence

RODRIGUEZ, ANTHONY JASON
Career Allowance Rate: 19% (4 granted / 21 resolved), -43.0% vs TC avg
Interview Lift: -23.5% (minimal)
Avg Prosecution: 3y 0m (typical timeline)
Currently pending: 21
Statute-Specific Performance

§103: 85.2% (+45.2% vs TC avg)
§102: 2.1% (-37.9% vs TC avg)
§112: 12.7% (-27.3% vs TC avg)
Based on career data from 21 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments, see Remarks pages 12-15, filed 02/06/2026, with respect to the rejections of amended claims 1, 11, and 16 under 35 U.S.C. 103 have been fully considered but they are not persuasive.
On pages 13-14 of Remarks, Applicant argues:

[Image media_image1.png: excerpt of Applicant's Remarks]

Examiner respectfully disagrees.
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
	Figure 2 of Wang discloses “Fig. 2. An illustration of the general training and evaluation framework of the proposed video-centralised transformer. Top: The training framework. For each face track, a ConvNet (CNN) is first employed to extract embedding, followed by a temporal augmentation τ. The sampled clip is fed into a transformer encoder with an MLP head to a new latent space with reduced dimensionality. The video-centralised learning is performed in this new space. Bottom: The evaluation pipeline. The embedding of each face track is also extracted first, while the whole track is input into the transformer encoder without augmentation τ. The output from the encoder is used as the video representation, while the MLP head is also discarded. The final clustering results are obtained by a HAC clustering algorithm. (Best seen in colour).” Wherein a transformer encoder is trained based on a plurality of face tracks and a loss function, as is further described in Section 3.4 of Wang, in order to encode each face track into latent space based on similarities to other face tracks. Upon having trained the model, each face track is grouped with other face tracks based on the clustering of track embeddings generated by processing each face track with the trained transformer encoder.
Thus, Wang discloses the limitations “grouping the plurality of face tracks into common identity clusters based at least in part on similarities, as measured by the same loss function as used to train a face identification model, between respective embeddings generated using the finetuned face identification model for image frame crops within different face tracks.”
However, Wang fails to disclose expressly “fine-tuning, using the determined plurality of face tracks and a loss function, a pretrained face identification model to generate, for image frame crops of a common face track, respective embeddings that have a mutually high degree of similarity as measured by the loss function; grouping the plurality of face tracks into common identity clusters based at least in part on similarities, as measured by the same loss function as used to fine-tune the pretrained face identification model.” Thus, Wang fails to disclose the fine tuning of its face identification model, wherein its loss function will influence the face track similarity determination.
	Section 1 Introduction of Zhou discloses “The target network is fed with a masked image while the online tokenizer with the original image. The goal is to let the target network recover each masked patch token to its corresponding tokenizer output…our tokenizer captures high-level visual semantics progressively learned by enforcing the similarity of cross-view images on class tokens… When pre-trained with ImageNet-22K, iBOT with ViT-L/16 achieves a linear probing accuracy of 82.3% and a fine-tuning accuracy of 87.8%, which is 1.0% and 1.8% higher than previous best results.” Wherein the disclosed model framework consists of a student and teacher network, as further disclosed in Figure 3 of Zhou, that is pre-trained and fine-tuned for the extraction of similar embeddings for images of the same class, which can be implemented for tasks related to classification and object detection as disclosed by Section 6 Conclusion of Zhou.
	It would have been obvious for one of ordinary skill in the art, prior to the effective filing date of the claimed invention, to substitute the CNN and transformer encoder disclosed by Wang with the pretrained transformer framework disclosed by Zhou. This results in the face tracks being grouped based on similarities determined by the fine-tuned transformer framework, and thus based on the face tracks and loss function used for fine-tuning.
	Therefore, Wang in view of Zhou discloses “fine-tuning, using the determined plurality of face tracks and a loss function, a pretrained face identification model to generate, for image frame crops of a common face track, respective embeddings that have a mutually high degree of similarity as measured by the loss function; grouping the plurality of face tracks into common identity clusters based at least in part on similarities, as measured by the same loss function as used to fine-tune the pretrained face identification model, between respective embeddings generated using the fine-tuned face identification model for image frame crops within different face tracks.”
As per claim(s) 11 and 16, arguments made in rejecting claim(s) 1 are analogous.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-2, 4-9, 11-12, 14-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (Self-supervised Video-centralised Transformer for Video Face Clustering) hereinafter referenced as Wang, in view of Zhou (IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER) hereinafter referenced as Zhou.
Regarding claim 1, Wang discloses: A system comprising: at least one processor; and at least one memory storing machine-readable instructions which, when executed by the at least one processor, cause the at least one processor to carry out operations (Wang: Section: 4.2.4 Implementations: “We implement our method in the PyTorch framework [82], and all experiments are run on an Amazon AWS server with eight A100 GPUs, each training session deployed on a single GPU”) comprising:
determining, using a motion tracker, a plurality of face tracks from one or more sequences of image frames, each face track corresponding to a respective instance of a respective face and comprising a respective sequence of image frame crops (Wang: Figure 2: Face Tracks A & B; 
Section: 1 Introduction: “the temporally consecutive facial images with overlapped detection boxes are deemed as from the same person, [9], [10], [11] and they can be grouped together to formulate a face track [12].”; 
Section: 4.1 Dataset: “We also construct the EasyCom-Clustering dataset as another dataset for evaluation. This dataset contains 22 sessions of egocentric recordings collected in a simulated restaurant setting, and each session lasts for around 30 minutes…we make use of all the 22 sessions, and we have detected, embedded and annotated a total of 94 047 face tracks with 1 623 633 facial images from 53 participants, using RetinaFace [74] as the face detector and an Arcface [39] model for facial embedding.”); 
grouping the plurality of face tracks into common identity clusters based at least in part on similarities (Wang: Figure 2: Bottom; Section: 4.2.1 Evaluation Metrics: “In this work, we examine the quality of the video representation under two different clustering settings…This is achieved by applying different stopping criteria to HAC…while for the case of unknown cluster numbers, a pre-defined distance threshold is required as the stopping criterion.”; Wherein the output track embeddings, output by the trained transformer, are clustered based on embedding distances which constitutes a similarity measure), as measured by a loss function, between respective embeddings generated using a trained face identification model for image frame crops within different face tracks (Wang: Figure 1: Right; Section: 3.2 System Framework: “The idea of video-centralised learning is based on the simple intuition that for each track Ta, a distinct video centre ca is maintained in the latent space. The transformer-based representation of a sampled clip ETa should be attracted to ca, which is an implicit way of utilising the must-link constraints. As for the cannot-link constraints, if two face tracks Ta and Tb co-occur and are known to have exclusive IDs, i.e. Nab = 1, we can push the representation of clip ETa to be far away from Tb’s video centre cb, and similarly for ETb’s representation and ca. The purpose of video-centralised learning is to enforce the video-level representation from the transformer to be more discriminative, more compact, and most importantly, more centralised.”; Wherein the model is trained so that co-occurring face tracks have embeddings farther away from each other, while face tracks with similar centers/embeddings are closer together.).
Wang does not disclose expressly: fine-tuning, using the determined plurality of face tracks and a loss function, a pre-trained face identification model to generate, for image frame crops of a common face track, respective embeddings that have a mutually high degree of similarity as measured by the loss function; 
grouping the plurality of face tracks into common identity clusters based at least in part on similarities, as measured by the same loss function as used to fine-tune the pretrained face identification model.
Zhou discloses: an image processing model framework consisting of student and teacher transformer networks (Zhou: Figure 3), pretrained and fine-tuned (Zhou: Section: 3.2 Implementation: “We pre-train and fine-tune the Transformers with 224-size images, so the total number of patch tokens is 196.”), for the purposes of training the networks to extract similar visual embeddings from images (Zhou: Section: 1 Introduction: “Our online tokenizer naturally resolves two major challenges. On the one hand, our tokenizer captures high-level visual semantics progressively learned by enforcing the similarity of cross-view images on class tokens.”), measured by loss functions (Zhou: Figure 3; Section: 2.2 Self-Distillation: “Given the training set I, an image x ~ I is sampled uniformly, over which two random augmentations are applied, yielding two distorted views u and v. The two distorted views are then put through a teacher-student framework to get the predictive categorical distributions from the [CLS] token: $v_t^{[CLS]} = P_{\theta'}^{[CLS]}(v)$ and $u_s^{[CLS]} = P_{\theta}^{[CLS]}(u)$. The knowledge is distilled from teacher to student by minimizing their cross-entropy”).
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to substitute the CNN and transformer encoder disclosed by Wang with the pretrained transformer framework disclosed by Zhou. The suggestion/motivation for doing so would have been “Our online tokenizer naturally resolves two major challenges. On the one hand, our tokenizer captures high-level visual semantics progressively learned by enforcing the similarity of cross-view images on class tokens. On the other hand, our tokenizer needs no extra stages of training as pre-processing setup since it is jointly optimized with MIM via momentum update.” (Zhou: Section: 1 Introduction). Further, one skilled in the art could have substituted the elements as described above by known methods with no change in their respective functions, and the substitution would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Wang with Zhou to obtain the invention as specified in claim 1.
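The must-link and cannot-link behaviour quoted from Wang Section 3.2 above (attract a clip representation to its own track's video centre, push it away from the centre of a co-occurring track) can be pictured with a small numeric sketch. This is only an illustration under stated assumptions: the squared-distance attraction term, the hinge-style repulsion term, the `margin` parameter, and the function names are hypothetical and are not taken from Wang's actual loss.

```python
import math


def l2(a, b):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centre_loss(clip_emb, own_centre, exclusive_centres, margin=1.0):
    """Hypothetical attract/repel loss (an assumption, not Wang's formula).

    - attract: squared distance to the clip's own video centre (must-link)
    - repel: hinge penalty when the clip is closer than `margin` to the
      centre of a co-occurring track with a different identity (cannot-link)
    """
    attract = l2(clip_emb, own_centre) ** 2
    repel = sum(max(0.0, margin - l2(clip_emb, c)) ** 2 for c in exclusive_centres)
    return attract + repel


c_a = [1.0, 0.0]  # video centre of track A
c_b = [0.0, 1.0]  # video centre of co-occurring track B
loss_near = centre_loss([0.9, 0.1], c_a, [c_b])  # clip of A close to its own centre
loss_far = centre_loss([0.1, 0.9], c_a, [c_b])   # clip of A drifting toward B's centre
```

Under these assumptions the loss is small when the clip sits near its own centre and grows when it drifts toward the centre of the exclusive track.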
Regarding Claim 2, Wang in view of Zhou discloses: The system of claim 1, wherein the fine-tuning comprises: matching a first face track to a second face track based at least in part on a similarity of respective embeddings generated by the face identification model (Zhou: Section: 1 Introduction: “On the one hand, our tokenizer captures high-level visual semantics progressively learned by enforcing the similarity of cross-view images on class tokens.”; Section: 3.2 Implementation: “We pre-train and fine-tune the Transformers with 224-size images, so the total number of patch tokens is 196. The projection head h is a 3-layer MLPs with l2-normalized bottleneck following DINO (Caron et al., 2021).”; Wherein the fine-tuning of the model allows for the images in each face track, disclosed by Wang, to capture similar visual semantics.); and updating the face identification model to increase a degree of similarity between respective embeddings for a first image frame crop from the first face track and a second image frame crop from the second face track (Zhou: Section: 2.2 Self-Distillation: “The two distorted views are then put through a teacher-student framework to get the predictive categorical distributions from the [CLS] token: $v_t^{[CLS]} = P_{\theta'}^{[CLS]}(v)$ and $u_s^{[CLS]} = P_{\theta}^{[CLS]}(u)$. The knowledge is distilled from teacher to student by minimizing their cross-entropy, formulated as 
[Equation image: media_image2.png]
”; Wherein the CLS loss function updating the models in order to correct the class embeddings extracted, also corrects the model towards increasing the similarity of CLS tokens/embeddings of face images of the same person, including those in different tracks.).  
Regarding Claim 4, Wang in view of Zhou discloses: The system of claim 1, wherein the grouping comprises: initialising a respective cluster for each face track of the plurality of face tracks (Wang: Section: 3.2 System Framework: “During the evaluation stage, we discard the sampling step and make use of the whole track for predictions. As shown in Fig. 2 (Bottom), the facial images are first embedded via the pre-trained ConvNet, which are subsequently input into the transformer without any sampling. The output embedding of the transformer encoder without the MLP head is fetched as the video-level representation, and then a HAC is applied to get the final clustering results.”; Wherein each face track’s embeddings are processed by the models aggregating the embeddings of their respective face image embeddings, which constitutes initializing a cluster for each face track.); and iteratively: determining a respective matching threshold for each cluster based on evaluations of the loss function for pairs of image frame crops within that cluster; and merging pairs of clusters based at least in part on evaluations of the loss function between clusters and the respective matching thresholds for the clusters (Wang: 4.2.1 Evaluation Metrics: “we examine the quality of the video representation under two different clustering settings, i.e. the total cluster number is either 1). known, or 2). unknown. This is achieved by applying different stopping criteria to HAC…while for the case of unknown cluster numbers, a pre-defined distance threshold is required as the stopping criterion.”; Wherein the merging of clusters based on an iterative embedding distance calculation process constitutes merging clusters based on loss function evaluations and the matching thresholds for each cluster.).  
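The HAC procedure cited from Wang (start with one cluster per face track, merge iteratively, stop at a pre-defined distance threshold) can be sketched as follows. This is a minimal sketch assuming average linkage and Euclidean distance between track embeddings; the linkage choice, the function name `hac`, and the toy data are assumptions. It uses a single global threshold as in the Wang quote, not the per-cluster matching thresholds recited in claim 4.

```python
import math
from itertools import combinations


def hac(track_embeddings, distance_threshold):
    """Agglomerative clustering with a distance-threshold stopping criterion.

    Returns a list of clusters, each a list of track indices.
    Average linkage is an assumption made for this illustration.
    """
    clusters = [[i] for i in range(len(track_embeddings))]  # one cluster per track

    def linkage(c1, c2):
        total = sum(
            math.dist(track_embeddings[i], track_embeddings[j])
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j is guaranteed by combinations).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > distance_threshold:
            break  # stopping criterion: no pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


tracks = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
result = hac(tracks, distance_threshold=1.0)
```

With the toy embeddings above, the two nearby pairs merge and the two resulting clusters stay apart because their average distance exceeds the threshold.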
	Regarding Claim 5, Wang in view of Zhou discloses: The system of claim 1, wherein the fine-tuning comprises: preparing a first model branch comprising a first copy of the pre-trained face identification model and a first multilayer perceptron head; preparing a second model branch comprising a second copy of the pre-trained face identification model and a second multilayer perceptron head (Zhou: Figure 3: “Given two views u and v of an image x, each view is passed through a teacher network $h_t \circ f_t$ and a student network $h_s \circ f_s$.”; Section: 2.2 Self-Distillation: “The teacher and the student share the same architecture consisting of a backbone f (e.g., ViT) and a projection head $h^{[CLS]}$.”; Section: 3.2 Implementation: “The projection head h is a 3-layer MLPs with l2-normalized bottleneck following DINO (Caron et al., 2021).”);
for a plurality of iterations: passing a first image frame crop from a given face track through the first model branch to generate a first embedding; passing a second image frame crop from a given face track through the second model branch to generate a second embedding (Zhou: Section: 2.2 Self-Distillation: “Given the training set I, an image x ~ I is sampled uniformly, over which two random augmentations are applied, yielding two distorted views u and v. The two distorted views are then put through a teacher-student framework to get the predictive categorical distributions from the [CLS] token: $v_t^{[CLS]} = P_{\theta'}^{[CLS]}(v)$ and $u_s^{[CLS]} = P_{\theta}^{[CLS]}(u)$.”; Wherein $u_s^{[CLS]}$ and $v_t^{[CLS]}$ constitute first and second embeddings, respectively.); 
and updating parameter values of the first model branch so as to increase a degree of similarity between the first embedding and the second embedding, as measured by the loss function (Zhou: Section 2.2 Self-Distillation: “The knowledge is distilled from teacher to student by minimizing their cross-entropy, formulated as 
[Equation image: media_image2.png]
”); 
and for a subset of the plurality of iterations, updating parameter values of the second model branch based on a moving average of parameter values of the first model branch over a set of preceding iterations (Zhou: Section: 2.2 Self-Distillation: “The parameters of the student network θ are Exponentially Moving Averaged (EMA) to the parameters of teacher network θ′.”).
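The exponential moving average relationship quoted from Zhou Section 2.2 (teacher parameters θ′ tracking student parameters θ) can be illustrated with a short sketch. The flat parameter lists, the function name `ema_update`, and the momentum value are assumptions chosen for readability; real models would apply the same rule tensor by tensor.

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    theta_prime <- momentum * theta_prime + (1 - momentum) * theta
    The momentum value is an illustrative assumption.
    """
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]


teacher = [0.0, 0.0]   # theta_prime (second branch / teacher)
student = [1.0, -1.0]  # theta (first branch / student), held fixed here for clarity
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a momentum of 0.5 the teacher moves halfway toward the student on every update, so after three updates each teacher value has covered 87.5% of the gap.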
Regarding Claim 6, Wang in view of Zhou discloses: The system of claim 5, wherein the loss function evaluates a cross-entropy between the first embedding and the second embedding (Zhou: Section 2.2 Self-Distillation: “The knowledge is distilled from teacher to student by minimizing their cross-entropy, formulated as 
[Equation image: media_image2.png]
”).
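The cross-entropy measure relied on for claim 6 can be sketched numerically. This is a minimal illustration assuming the two branch outputs are turned into categorical distributions with a softmax; the logits, the temperature parameter, and the function names are hypothetical and are not values from Zhou.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw outputs into a categorical distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """H(teacher, student) = -sum_k teacher_k * log(student_k)."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))


teacher = softmax([2.0, 0.5, 0.1])
aligned = cross_entropy(teacher, softmax([2.0, 0.5, 0.1]))     # student agrees with teacher
misaligned = cross_entropy(teacher, softmax([0.1, 0.5, 2.0]))  # student disagrees
```

The cross-entropy is lowest when the student distribution matches the teacher distribution, which is why minimizing it pushes the two embeddings toward agreement.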
Regarding Claim 7, Wang in view of Zhou discloses: The system of claim 1, wherein: the face identification comprises a vision transformer (Zhou: Figure 3; Section: 2.2 Self-Distillation: “The teacher and the student share the same architecture consisting of a backbone f (e.g., ViT) and a projection head $h^{[CLS]}$.”); the embeddings each comprise a respective class embedding and a respective patch embedding (Zhou: Figure 3: “iBOT minimizes two losses. The first loss $\mathcal{L}_{[CLS]}$ is self-distillation between cross-view [CLS] tokens. The second loss $\mathcal{L}_{MIM}$ is self-distillation between in-view patch tokens, with some tokens masked and replaced by $e_{[MASK]}$ for the student network.”); and for a given pair of embeddings, the loss function measures a degree of similarity between the respective class embeddings and a degree of similarity between the respective patch embeddings (Zhou: Section: 2.2 Self-Distillation: “The two distorted views are then put through a teacher-student framework to get the predictive categorical distributions from the [CLS] token: $v_t^{[CLS]} = P_{\theta'}^{[CLS]}(v)$ and $u_s^{[CLS]} = P_{\theta}^{[CLS]}(u)$. The knowledge is distilled from teacher to student by minimizing their cross-entropy, formulated as 
[Equation image: media_image2.png]
”;
Section: 3.1 Framework: “the student network outputs for the masked view $\hat{u}$ projections of its patch tokens $\hat{u}_s^{patch} = P_{\theta}^{patch}(\hat{u})$ and the teacher network outputs for the non-masked view $u$ projections of its patch tokens $u_t^{patch} = P_{\theta'}^{patch}(u)$. We here define the training objective of MIM in iBOT as 
[Equation image: media_image3.png]
”; Wherein the cross-entropy loss calculations constitute a measure of similarity.).  
Regarding Claim 8, Wang in view of Zhou discloses: The system of claim 1, wherein: the one or more sequences of image frames is a plurality of image frames each corresponding to a respective scene depicted within a video sequence (Wang: Section: 4.1 Dataset: “We also construct the EasyCom-Clustering dataset as another dataset for evaluation. This dataset contains 22 sessions of egocentric recordings collected in a simulated restaurant setting, and each session lasts for around 30 minutes”; Wherein the recordings each correspond to a recording session, which constitutes a different scene.); the operations further comprise detecting cuts in the video sequence to generate the plurality of sequences of image frames (Wang: Section: 4.1 Dataset: “We also construct the EasyCom-Clustering dataset as another dataset for evaluation…EasyCom-Clustering has significantly different track length distributions when compared with BBT dataset, as shown in Fig. 3. The majority duration of BBT face tracks falls between 10 and 150, while most face tracks in EasyCom-Clustering last for less than 20 frames. The average duration of the face tracks in EasyCom-Clustering is 17, significantly less than the 59 of BBT. This is not surprising, considering that the gaze points from the egocentric view can change much more frequently than a third-person camera, resulting in shorter face tracks.”; Wherein the generation of face tracks based on gaze points constitutes the further detection of video sequence cuts.).  
Regarding Claim 9, Wang in view of Zhou discloses: The system of claim 1, wherein: the operations further comprise sampling a subset of the image frame crops of a given face track, wherein a temporal spacing between image frame crops in the sampled subset is greater than a temporal spacing between image frames in the respective sequence of image frame crops; wherein the fine-tuning selectively uses the image face crops of the sampled subset (Wang: Section 3.2 System Framework: “During the training stage (Fig. 2 Top), a certain temporal augmentation technique, denoted as τ, is applied on Ea to sample a clip Eaτ out of it, i.e. Ea τ = τ(Ea).”; Section: 4.3.1 Ablation studies on different components: “The next ablation is placed on the temporal augmentation techniques, and we explore two different ways of implementing τ, i.e. a). a uniform sampler that samples uniformly from a temporal interval”; Wherein the face image sampling via uniform distribution, constitutes a sampled subset with frames containing a larger temporal spacing than frames within the face track extracted from.).  
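The claim 9 limitation (a sampled subset whose temporal spacing is greater than the spacing of the original crop sequence) can be pictured with a fixed-stride sampler. This is a sketch under an assumption: Wang's uniform sampler is described only as sampling "uniformly from a temporal interval", so the fixed stride and the function name `sample_with_stride` are illustrative choices, not the sampler used in Wang.

```python
def sample_with_stride(track_crops, stride):
    """Keep every `stride`-th crop of a face track.

    With stride > 1 the gap between sampled crops is larger than the gap
    between consecutive crops in the original track.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return track_crops[::stride]


frame_indices = list(range(10))               # stand-in for 10 consecutive crops
sampled = sample_with_stride(frame_indices, 3)
spacing = sampled[1] - sampled[0]              # temporal spacing of the subset
```

Here the original crops are one frame apart while the sampled crops are three frames apart.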
As per claim(s) 11, arguments made in rejecting claim(s) 1 are analogous.
As per claim(s) 12, arguments made in rejecting claim(s) 2 are analogous.
As per claim(s) 14, arguments made in rejecting claim(s) 4 are analogous.
As per claim(s) 15, arguments made in rejecting claim(s) 5 are analogous.
As per claim(s) 16, arguments made in rejecting claim(s) 1 are analogous. In addition, Section 4.2.4 of Wang discloses “We implement our method in the PyTorch framework [82], and all experiments are run on an Amazon AWS server with eight A100 GPUs, each training session deployed on a single GPU,” implying the usage of one or more non-transitory storage media storing machine-readable instructions, executed by a computer.
As per claim(s) 17, arguments made in rejecting claim(s) 2 are analogous.
As per claim(s) 19, arguments made in rejecting claim(s) 4 are analogous.
As per claim(s) 20, arguments made in rejecting claim(s) 5 are analogous.

Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Zhou, and further in view of Terhorst et al. (SER-FIQ: Unsupervised Estimation of Face Image Quality Based on Stochastic Embedding Robustness) hereinafter referenced as Terhorst.
Regarding claim 10, Wang in view of Zhou discloses: The system of claim 1.
Wang in view of Zhou does not disclose expressly: wherein the operations further comprise: determining respective crop quality scores for image frame crops of a given face track, wherein the crop quality score for a given image crop evaluates a consistency of embeddings generated by perturbed versions of the face identification model; and determining, from the respective crop quality scores, a track quality score for the given face track; and omitting the given face track from being used in the fine-tuning based at least in part on the track quality score.  
Terhorst discloses: determining image quality scores, wherein the image quality score for a given image evaluates a consistency of embeddings generated by perturbed versions of a face recognition model (Terhorst: Section: 3. Our Approach: “In this work, we based our face image quality definition on the relative robustness of deeply learned embeddings of that image. Calculating the variations of embeddings coming from random subnetworks of a face recognition model, our solution defines the magnitude of these variations as a robustness measure, and thus, image quality.”); and determining, from the respective image quality scores, a database quality score for the given database; and omitting the databases from being used in either training or testing based at least in part on the database quality score (Terhorst: Section: 4. Experimental setup: “To justify the choices of the used databases, Figure 3 shows the face quality distributions of the databases using quality estimates from four pretrained face quality assessment models. ColorFeret was captured under well-controlled conditions and generally shows very high qualities. However, it contains non-frontal head poses and for COTS and SER-FIQ (on FaceNet) (Figure 3a) this is considered as low image quality. Because of these controlled variations, we choose ColorFeret as the training database.”).  
Before the effective filing date of the claimed invention, it would have been obvious to a person of ordinary skill in the art to implement the known technique of perturbing a facial recognition model in order to assess image quality as taught by Terhorst to filter the low-quality face tracks disclosed by Wang in view of Zhou. The suggestion/motivation for doing so would have been “Face image quality is an important factor to enable high-performance face recognition systems. Face quality assessment aims at estimating the suitability of a face image for recognition.” (Terhorst: Abstract). Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Wang in view of Zhou with Terhorst to obtain the invention as specified in claim 10.
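The claim 10 chain (crop quality from the consistency of embeddings produced by perturbed model versions, then a track quality score, then omission of low-quality tracks) can be sketched as follows. This is a simplified illustration, not the SER-FIQ formula: the mapping `1 / (1 + mean distance)`, the mean aggregation over crops, the 0.5 cut-off, and all function names are assumptions.

```python
import math
from itertools import combinations
from statistics import mean


def crop_quality(stochastic_embeddings):
    """Score one crop from embeddings produced by perturbed model passes.

    Small pairwise distances (consistent embeddings) give a score near 1.
    The monotone map 1 / (1 + d) is an assumption made for this sketch.
    """
    mean_dist = mean(math.dist(a, b) for a, b in combinations(stochastic_embeddings, 2))
    return 1.0 / (1.0 + mean_dist)


def track_quality(crop_scores):
    """Aggregate crop scores into one track score (mean is an assumption)."""
    return mean(crop_scores)


def keep_track(crop_scores, min_quality=0.5):
    """Decide whether a track is used for fine-tuning (threshold is hypothetical)."""
    return track_quality(crop_scores) >= min_quality


stable = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]     # perturbed passes agree
unstable = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # perturbed passes disagree
q_stable = crop_quality(stable)
q_unstable = crop_quality(unstable)
```

A track whose crops mostly score like `unstable` would fall below the cut-off and be left out of fine-tuning under these assumptions.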


Allowable Subject Matter
Claims 3, 13, and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:
Wang in view of Zhou discloses: The system of claim 2, wherein the clustering of face tracks comprises: inputting the face tracks to be clustered into the fine-tuned model, each face track being processed into a face track embedding (Wang: Section: 3.2 System Framework: “During the evaluation stage, we discard the sampling step and make use of the whole track for predictions…The output embedding of the transformer encoder without the MLP head is fetched as the video-level representation, and then a HAC is applied to get the final clustering results”), then iteratively clustering the face track embeddings using Hierarchical Agglomerative Clustering, which performs clustering until a predefined distance threshold is reached (Wang: Section: 4.2.1 Evaluation Metrics: “In this work, we examine the quality of the video representation under two different clustering settings, i.e. the total cluster number is either 1). known, or 2). unknown. This is achieved by applying different stopping criteria to HAC...while for the case of unknown cluster numbers, a pre-defined distance threshold is required as the stopping criterion.”). 
Wang in view of Zhou fails to disclose: wherein the matching comprises: estimating a probability density function for embeddings corresponding to image frame crops of the first face track; determining a probability density threshold based on values of the probability density function for embeddings corresponding to image frame crops of the first face track; determining a mean embedding for image frame crops of the second face track; and matching the first face track to the second face track based at least in part on a comparison between the probability density threshold and a value of the probability density function for the determined mean embedding.  
Therefore, claim 3 is indicated as containing allowable subject matter.
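The matching steps recited in claim 3 can be followed step by step in a small sketch. Everything beyond the claim wording is an assumption made for illustration: the density is modelled as a diagonal Gaussian, the threshold is taken as the lowest density observed on the first track's own embeddings, log-densities are compared instead of raw densities (an order-preserving change), and the function names are hypothetical. The application may use a different density estimator and threshold rule.

```python
import math
from statistics import mean, pvariance


def fit_diag_gaussian(embeddings, var_floor=1e-6):
    """Estimate a per-dimension mean and variance (diagonal Gaussian assumption)."""
    dims = list(zip(*embeddings))
    mu = [mean(d) for d in dims]
    var = [max(pvariance(d), var_floor) for d in dims]
    return mu, var


def log_density(x, mu, var):
    """Log of the diagonal Gaussian density at x."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (xi - m) ** 2 / (2.0 * v)
        for xi, m, v in zip(x, mu, var)
    )


def tracks_match(first_track, second_track):
    # Step 1: estimate a density for the first track's crop embeddings.
    mu, var = fit_diag_gaussian(first_track)
    # Step 2: threshold from the density values on the first track itself
    # (the minimum is an assumption).
    threshold = min(log_density(e, mu, var) for e in first_track)
    # Step 3: mean embedding of the second track.
    second_mean = [mean(d) for d in zip(*second_track)]
    # Step 4: compare the density at that mean with the threshold.
    return log_density(second_mean, mu, var) >= threshold


first = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [-0.1, -0.1]]
same_identity = [[0.05, 0.05], [0.1, 0.0]]
other_identity = [[3.0, 3.0], [3.2, 2.9]]
```

In this toy data, `tracks_match(first, same_identity)` is true because the second track's mean lies inside the region the first track itself occupies, while `other_identity` falls far outside it.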
As per claims 13 and 18, reasons made in indicating that claim 3 is allowable are analogous.


Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
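The reply periods described above can be worked out from the mailing date shown in the prosecution timeline (May 07, 2026). The sketch below only does the calendar arithmetic; it does not model weekend or federal-holiday roll-over, nor the advisory-action rule for replies filed within two months. The helper `add_months` is a hypothetical utility written for this example.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


mailing_date = date(2026, 5, 7)                    # Final Rejection mailed
two_month_mark = add_months(mailing_date, 2)       # first-reply window for the advisory-action rule
shortened_period_end = add_months(mailing_date, 3) # THREE MONTHS shortened statutory period
statutory_limit = add_months(mailing_date, 6)      # SIX MONTHS absolute limit

print(two_month_mark, shortened_period_end, statutory_limit)
```

Under these assumptions the two-month mark falls on July 7, 2026, the shortened period ends on August 7, 2026, and the six-month limit falls on November 7, 2026.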
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANTHONY J RODRIGUEZ whose telephone number is (703)756-5821. The examiner can normally be reached Monday-Friday 10am-7pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/ANTHONY J RODRIGUEZ/Examiner, Art Unit 2672



/SUMATI LEFKOWITZ/Supervisory Patent Examiner, Art Unit 2672
Prosecution Timeline

Sep 27, 2023: Application Filed
Sep 08, 2025: Non-Final Rejection mailed (§103)
Feb 06, 2026: Response Filed
May 07, 2026: Final Rejection mailed (§103, current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/972,931 (Patent 12499701): DOCUMENT CLASSIFICATION METHOD AND DOCUMENT CLASSIFICATION DEVICE. 3y 1m to grant, granted Dec 16, 2025.
17/897,121 (Patent 12488563): Hub Image Retrieval Method and Device. 3y 3m to grant, granted Dec 02, 2025.
17/847,222 (Patent 12444019): IMAGE PROCESSING APPARATUS, IMAGE PROCESSING METHOD, AND MEDIUM. 3y 3m to grant, granted Oct 14, 2025.
Study what changed to get past this examiner. Based on 3 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 19%
With Interview: -4% (-23.5%)
Median Time to Grant: 3y 0m (~4m remaining)
PTA Risk: Moderate
Based on 21 resolved cases by this examiner. Grant probability derived from career allowance rate.