DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
1st Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1, 2, 6, 8, and 9 are rejected under 35 U.S.C. 103 as obvious over US Patent Publication 2023/0410483 A1 (Chen et al.) in view of US Patent Publication 2025/0299506 A1 (Miri et al.).
Claim 1
[Reproduced figure: Figure 2A shows the system layout of the self-supervised MIM.]
Regarding Claim 1, Chen et al. teach training a first network comprising a Masked Image Modeling (MIM) network ("masked image modeling (MIM) training process," par. 4) wherein the encoder components comprise a portion of the MIM network to encode features of the input image for decoding; ("Some extant MIM models employ an encoder-decoder design followed by a projection head. The encoder aids in the modeling of latent feature representations," par. 33) and training a second network comprising the encoder components, as trained, in series with the decoder components, ("two-layer convolutional transpose can be used as a projection head 260 (FIG. 2A) during the self-supervised MIM training process 200 for pre-training the image encoder 150 and the UPerNet decoder," par. 44) the decoder components trained to determine local correspondences comprising respective relationships between the features of the input image provided by the encoder components ("a corresponding encoded feature representation 225 (also referred to as an encoded hidden representation 225) and a decoder 250 decodes the corresponding encoded feature representation 225 to predict a corresponding predicted token 275 as output from the projection head," par. 53).
Chen et al. do not explicitly teach all of a network that processes non-overlapping patches determined from the input image with a SSL objective.
However, Miri et al. teach a network that processes non-overlapping patches determined from the input image ("image 305 may be passed into the patching and embedding block 230 that may convert the image 305 into non-overlapping equal sized 2D grid of patches," par. 51) with a SSL objective ("the key to success of an SSL model may lie in wisely making use of the information derived from the image," par. 39).
Therefore, taking the teachings of Chen et al. and Miri et al. as a whole, it would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the MIM system as taught by Chen et al. to use the image clustering and tokenization as taught by Miri et al. The suggestion/motivation for doing so would have been that, “ instead of a single class token multiple class tokens 330 may be used, which will be responsible for learning representations for different object classes. By doing so, the model can learn to attend to the regions of the image that belong to each class and generate class-discriminative object localization maps from the class-to-patch attentions. This technique can be useful for weakly supervised semantic segmentation, which is the task of assigning a class label to each pixel in an image using only image-level labels as supervision” as noted by the Miri et al. disclosure in paragraph [0052], which also motivates combination because the combination would predictably have a higher accuracy as there is a reasonable expectation that multiple object classes can be localized more accurately within an image by learning distinct class-to-patch attention maps, leading to enhanced segmentation performance; and/or because doing so merely combines prior art elements according to known methods to yield predictable results.
Claim 2
Regarding claim 2, Chen et al. and Miri et al. teach the method of claim 1 as noted above.
Chen et al. teach wherein the MIM network comprises a MAE network ("the self-supervised MIM training process 200 pre-trains an image encoder 150 having either a masked autoencoder (MAE) architecture," par. 50).
Chen et al. and Miri et al. are combined as per claim 1.
Claim 6
Regarding claim 6, Chen et al. and Miri et al. teach the method of claim 2 as noted above.
Chen et al. teach the MAE network ("the self-supervised MIM training process 200 pre-trains an image encoder 150 having either a masked autoencoder (MAE) architecture," par. 50).
Chen et al. do not explicitly teach all of a network is configured as a vision transformer (ViT) using self-attention mechanisms to process images.
However, Miri et al. teach a network is configured as a vision transformer (ViT) using self-attention mechanisms to process images ("A vision transformer (ViT) may be a type of neural network based on transformer architecture. The augmented embedded vectors from patching and embedding 230 are fed into the ViTs that may further include transformer encoders (Ees) 235a and a multi-layer perceptron (MLP) 235c in student network 210, a transformer encoder 237a (Ee,) and an MLP 237c in teacher network 215 with learnable weights θ. These transformer encoders may represent a stack of multiple self-attention layers," par. 45).
Chen et al. and Miri et al. are combined as per claim 1.
Claim 8
Regarding claim 8, Chen et al. and Miri et al. teach the method of claim 1 as noted above.
Chen et al. teach wherein the second network comprises a projector network, the decoder components comprising a portion of the projector network ("decoder 152 may include a UPerNet to perform image segmentation tasks based on the encoded features 225 output from the image encoder 150. That is, a two-layer convolutional transpose can be used as a projection head 260 (FIG. 2A)," par. 44).
Chen et al. and Miri et al. are combined as per claim 1.
Claim 9
Regarding claim 9, Chen et al. and Miri et al. teach the method of claim 8 as noted above.
Chen et al. teach wherein training the decoder components trains the projector network using a locality constrained repellence (LCR) loss ("The training loss may be based on a distance in a voxel space between the recovered/estimated raw voxel values 270 and the original voxels from the corresponding sets of raw voxel values that represent the masked image patches. The training loss may include either an l.sub.1 or l.sub.2 loss function. Notably, the training loss may only be computed for the masked matches 210M to prevent the encoder 150 from engaging in self-reconstruction and potentially dominate the learning process and ultimately impeded knowledge learning. Thereafter, the training process 200 updates parameters of the image encoder 150 (and optionally the decoder 250) based on the training loss," par. 58).
Chen et al. and Miri et al. are combined as per claim 1.
2nd Claim Rejections - 35 USC § 103
Claims 3, 4, 5, 7, 10, 12, 13, 14, 15, 16, 17, and 20 are rejected under 35 U.S.C. 103 as obvious over US Patent Publication 2023/0410483 A1 (Chen et al.) and US Patent Publication 2025/0299506 A1 (Miri et al.) in view of US Patent Publication 2022/0075994 A1 (Shapira et al.).
Claim 3
Regarding claim 3, Chen et al. and Miri et al. teach the method of claim 2 as noted above.
Chen et al. teach wherein the MIM network, as trained, is configured to ("masked image modeling (MIM) training process," par. 4).
Miri et al. teach provide respective tokens for the patches for processing by the decoder components; ("Patching and embedding block 230 may convert an image into equal sized patch tokens and perform a set of operations to acquire the corresponding embedding vectors for each patch," par. 43) and combine information from tokens of the input image to define approximated tokens, reducing the number of tokens for processing by the decoder components ("The position embeddings 325 are added to the patch embeddings 340 to retain the spatial arrangement of the patches," par. 51).
Chen et al. and Miri et al. do not explicitly teach all of related to non-landmark regions of the input image.
However, Shapira et al. teach related to non-landmark regions of the input image ("neural network 140 predicts a set of facial landmarks from the image patches," par. 40 wherein the areas where no landmarks exist could be considered non-landmark regions).
Therefore, taking the teachings of Chen et al., Miri et al., and Shapira et al. as a whole, it would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the MIM system as taught by Chen et al. and the image clustering and tokenization as taught by Miri et al. to use facial landmark estimation techniques as taught by Shapira et al. The suggestion/motivation for doing so would have been that, “Facial landmark detection is a preprocessing step for many applications such as face recognition, face beautification, facial expression detection, and avatar rendering, among other examples” as noted by the Shapira et al. disclosure in paragraph [0018], which also motivates combination because the combination would predictably have an additional utility as there is a reasonable expectation that incorporating precise geometric facial landmarks into the MIM system's tokenization process would provide it with the ability to estimate facial landmarks; and/or because doing so merely combines prior art elements according to known methods to yield predictable results.
Claim 4
Regarding claim 4, Chen et al., Miri et al., and Shapira et al. teach the method of claim 3 as noted above.
Chen et al. teach wherein the MIM network, as trained, is configured to ("masked image modeling (MIM) training process," par. 4).
Chen et al. do not explicitly teach all of define respective patch tokens for each patch and a class (CLS) token representing the image as an aggregation of information from the respective patch tokens; identify each patch token as an attentive token or an inattentive token in accordance with a respective similarity to the CLS token determined for each patch token; and combine information from inattentive tokens to provide approximated inattentive tokens, reducing the number of inattentive tokens for processing by the decoder components.
However, Miri et al. teach define respective patch tokens for each patch and a class (CLS) token representing the image as an aggregation of information from the respective patch tokens; ("the input image is first split into non-overlapping patches, which are then transformed into a sequence of patch tokens 340 along with positional embeddings 325. These class tokens are concatenated with patch tokens 340, embedding position information 325, to form the input tokens," par. 54) identify each patch token as an attentive token or an inattentive token in accordance with a respective similarity to the CLS token determined for each patch token; and ("semantic clustering 520 may aim to reduce the computational complexity of self-attention in vision transformers. It may work by grouping the visual tokens that have similar semantic information into clusters, and then aggregating the key and value tokens within each cluster," par. 60) combine information from inattentive tokens to provide approximated inattentive tokens, reducing the number of inattentive tokens for processing by the decoder components ("The semantic image layout can be discovered from the attention maps of the class tokens. These attention maps may lead to promising results in unsupervised segmentation tasks. In some embodiments, unlike regular transformers, multiple class tokens 330 are used. Using a single class token may be challenging for accurate localization of different objects on a single image. Therefore, instead of a single class token multiple class tokens 330 may be used, which will be responsible for learning representations for different object classes. By doing so, the model can learn to attend to the regions of the image that belong to each class and generate class-discriminative object localization maps from the class-to-patch attentions," par. 52).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 5
Regarding claim 5, Chen et al., Miri et al., and Shapira et al. teach the method of claim 4 as noted above.
Chen et al. teach wherein the MIM network, as trained, is configured to ("masked image modeling (MIM) training process," par. 4).
Chen et al. do not explicitly teach all of perform inattentive token clustering to combine the information from the inattentive tokens, defining cluster centers to represent the information.
However, Miri et al. teach perform inattentive token clustering to combine the information from the inattentive tokens, ("The semantic clustering module 520 may take these patches as input and perform clustering. The output of the semantic clustering module 520 is a sequence of clustered tokens," par. 59) defining cluster centers to represent the information ("semantic clustering block 520 may apply a clustering algorithm (e.g., K-means, hierarchical clustering or DBSCAN etc.) to group the pixels in the feature map into different clusters based on their similarity. Each cluster may represent a potential object category in the image," par. 60).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 7
Regarding claim 7, Chen et al. and Miri et al. teach the method of claim 1 as noted above.
Chen et al. teach the decoder components in series with the encoder components as trained ("two-layer convolutional transpose can be used as a projection head 260 (FIG. 2A) during the self-supervised MIM training process 200 for pre-training the image encoder 150 and the UPerNet decoder," par. 44).
Chen et al. and Miri et al. do not explicitly teach all of training a final network for landmark detection, the final network comprising regressor components configured to determine the landmark estimations from features processed by the decoder components as trained, the regressor components configured in series with the decoder components as trained.
[Reproduced figures: Figures 2 and 3 show the transfer of data from the decoder in Figure 2 to an example landmark regression in Figure 3.]
However, Shapira et al. teach training a final network for landmark detection, the final network comprising regressor components configured to determine the landmark estimations from features processed by the decoder components as trained, the regressor components configured in series with the decoder components as trained.
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 10
Regarding claim 10, Chen et al. and Miri et al. teach the method of claim 9 as noted above.
Chen et al. teach the LCR ("The training loss may be based on a distance in a voxel space between the recovered/estimated raw voxel values 270 and the original voxels from the corresponding sets of raw voxel values that represent the masked image patches. The training loss may include either an l.sub.1 or l.sub.2 loss function. Notably, the training loss may only be computed for the masked matches 210M to prevent the encoder 150 from engaging in self-reconstruction and potentially dominate the learning process and ultimately impeded knowledge learning. Thereafter, the training process 200 updates parameters of the image encoder 150 (and optionally the decoder 250) based on the training loss," par. 58).
Chen et al. and Miri et al. do not explicitly teach all of operates on features of landmark regions and combined information from non-landmark regions that reduces processing to achieve selective correspondence processing for the local correspondences.
However, Shapira et al. teach operates on features of landmark regions and combined information from non-landmark regions that reduces processing to achieve selective correspondence processing for the local correspondences ("The loss function is a weighted average of the normalized Euclidean distance between the regressed landmarks and the ground truth and the absolute difference between the predicted error and the actual error," par. 79).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 12
Regarding claim 12, Chen et al. teach a system comprising at least one processor, a non-transient storage device coupled to the at least one processor, the storage device storing instructions executable by the at least one processor to cause the system to: ("computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 1520, the storage device 1530, or memory on processor," par. 76) wherein the network comprises: encoder components configured for encoding features, the encoder components comprising trained components of a Masked Image Modeling (MIM) network, ("Some extant MIM models employ an encoder-decoder design followed by a projection head. The encoder aids in the modeling of latent feature representations," par. 33), the MIM network; ("masked image modeling (MIM) training process," par. 4) and the decoder components trained to determine local correspondences comprising respective relationships between the features of the input image ("a corresponding encoded feature representation 225 (also referred to as an encoded hidden representation 225) and a decoder 250 decodes the corresponding encoded feature representation 225 to predict a corresponding predicted token 275 as output from the projection head," par. 53).
Miri et al. teach processes non-overlapping patches determined from the input image ("image 305 may be passed into the patching and embedding block 230 that may convert the image 305 into non-overlapping equal sized 2D grid of patches," par. 51) and trained with a SSL objective ("the key to success of an SSL model may lie in wisely making use of the information derived from the image," par. 39).
Chen et al. and Miri et al. do not explicitly teach all of to provide a network for facial landmark detection for faces in input images; and process, using the network, an input image comprising a face to determine and provide facial landmarks therefor; features of the face, and the decoder components configured for determining local correspondences between the features for determining estimates for the facial landmarks.
However, Shapira et al. teach to provide a network for facial landmark detection for faces ("the image comprising at least a portion of a face," par. 5) in input images; ("A neural network 200 may generate (e.g., output) facial landmark information based on input image," par. 46) and process, using the network, an input image comprising a face to determine and provide facial landmarks therefor; ("the image comprising at least a portion of a face, a neural network configured to detect a plurality of facial landmarks," par. 6) features of the face, ("facial landmark detection networks identify points (i.e., landmarks) on a face corresponding specific characteristics (e.g., or facial features) such as the tip of the nose or around the eyes and mouth," par. 18) and the decoder components configured for determining local correspondences between the features for determining estimates for the facial landmark ("decoder 220 decodes the aggregated patch features to determine (e.g., generate) facial landmark information (e.g., such as facial landmark position coordinates). The facial landmark information (e.g., the facial landmark position coordinates) may be used for various image processing operations (e.g., as further described herein, for example, with reference to operation 320 of FIG. 3). For instance, facial landmark information may include absolute or relative coordinates," par. 52).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 13
Regarding claim 13, Chen et al., Miri et al., and Shapira et al. teach the system of claim 12 as noted above.
Chen et al. teach wherein the MIM network comprises a MAE network ("the self-supervised MIM training process 200 pre-trains an image encoder 150 having either a masked autoencoder (MAE) architecture," par. 50).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 14
Regarding claim 14, Chen et al., Miri et al., and Shapira et al. teach the system of claim 13 as noted above.
Chen et al. teach the MAE network ("the self-supervised MIM training process 200 pre-trains an image encoder 150 having either a masked autoencoder (MAE) architecture," par. 50).
Chen et al. and Shapira et al. do not explicitly teach all of wherein the MAE network is configured as a vision transformer (ViT) using self-attention mechanisms to process images.
However, Miri et al. teach wherein the MAE network is configured as a vision transformer (ViT) using self-attention mechanisms to process images ("A vision transformer (ViT) may be a type of neural network based on transformer architecture. The augmented embedded vectors from patching and embedding 230 are fed into the ViTs that may further include transformer encoders (Ees) 235a and a multi-layer perceptron (MLP) 235c in student network 210, a transformer encoder 237a (Ee,) and an MLP 237c in teacher network 215 with learnable weights θ. These transformer encoders may represent a stack of multiple self-attention layers," par. 45).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 15
Regarding claim 15, Chen et al., Miri et al., and Shapira et al. teach the system of claim 12 as noted above.
Chen et al. teach the decoder components in series with the encoder components as trained ("two-layer convolutional transpose can be used as a projection head 260 (FIG. 2A) during the self-supervised MIM training process 200 for pre-training the image encoder 150 and the UPerNet decoder," par. 44).
Chen et al. and Miri et al. do not explicitly teach all of wherein the network comprises regressor components configured to determine the landmark estimations from features processed by the decoder components as trained, the regressor components configured in series with the decoder components as trained.
However, Shapira et al. teach wherein the network comprises regressor components configured to determine the landmark estimations from features processed by the decoder components as trained, the regressor components configured in series with the decoder components as trained ("decoder 220 decodes the aggregated patch features to determine (e.g., generate) facial landmark information (e.g., such as facial landmark position coordinates). The facial landmark information (e.g., the facial landmark position coordinates) may be used for various image processing operations (e.g., as further described herein, for example, with reference to operation 320 of FIG. 3) … FIG. 3 shows an example of a process for real-time facial landmark regression," par. 52-53).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 16
Regarding claim 16, Chen et al., Miri et al., and Shapira et al. teach the system of claim 12 as noted above.
Chen et al. teach wherein the decoder components are trained as components of a projector network, ("decoder 152 may include a UPerNet to perform image segmentation tasks based on the encoded features 225 output from the image encoder 150. That is, a two-layer convolutional transpose can be used as a projection head 260 (FIG. 2A)," par. 44) the decoder components in series with the encoder components as trained ("two-layer convolutional transpose can be used as a projection head 260 (FIG. 2A) during the self-supervised MIM training process 200 for pre-training the image encoder 150 and the UPerNet decoder," par. 44).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 17
Regarding claim 17, Chen et al., Miri et al., and Shapira et al. teach the system of claim 12 as noted above.
Chen et al. and Miri et al. do not explicitly teach all of wherein the instructions are executable to further cause the system to apply an effect to the input image using the facial landmarks.
However, Shapira et al. teach wherein the instructions are executable to further cause the system to apply an effect to the input image using the facial landmarks ("the system may then process the image based on the detected set of facial landmarks in a variety of ways (e.g., coordinates of facial landmarks may be used for image processing applications such as facial recognition, avatar rendering, facial expression detection, etc.)," par. 58).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
Claim 20
Regarding claim 20, Chen et al., Miri et al., and Shapira et al. teach the system of claim 12 as noted above.
Chen et al. and Miri et al. do not explicitly teach all of wherein the network is a component of or communicates with an application and the facial landmarks are provided for further use by the application, wherein the application comprises any of a VTO application; a teleconsultation application, a video chat application, a video conference application, or a facial recognition application.
However, Shapira et al. teach wherein the network is a component of or communicates with an application and the facial landmarks are provided for further use by the application, wherein the application comprises any of a VTO application; a teleconsultation application, a video chat application, a video conference application, or a facial recognition application ("the system may then process the image based on the detected set of facial landmarks in a variety of ways (e.g., coordinates of facial landmarks may be used for image processing applications such as facial recognition, avatar rendering, facial expression detection, etc.)," par. 58).
Chen et al., Miri et al., and Shapira et al. are combined as per claim 3.
3rd Claim Rejections - 35 USC § 103
Claims 18 and 19 are rejected under 35 U.S.C. 103 as obvious over US Patent Publication 2023/0410483 A1 (Chen et al.) and US Patent Publication 2025/0299506 A1 (Miri et al.) in view of US Patent Publication 2022/0075994 A1 (Shapira et al.) and US Patent Publication 2019/0289986 A1 (Fu et al.).
Claim 18
Regarding claim 18, Chen et al., Miri et al., and Shapira et al. teach the system of claim 17 as noted above.
Chen et al., Miri et al., and Shapira et al. do not explicitly teach all of wherein the effect simulates a product or service applied to the face to provide a virtual try on experience.
However, Fu et al. teach wherein the effect simulates a product or service applied to the face to provide a virtual try on experience ("virtual makeup try-on," par. 33).
Therefore, taking the teachings of Chen et al., Miri et al., Shapira et al., and Fu et al. as a whole, it would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the MIM system as taught by Chen et al., the image clustering and tokenization as taught by Miri et al., and facial landmark estimation techniques as taught by Shapira et al. to use a virtual makeup try-on application as taught by Fu et al. The suggestion/motivation for doing so would have been that, “The more accurate and realistic the end results achieved by such a system, the more useful they are to be viable alternatives for consumers. Further, while facial landmarks detection presents many potential attractive applications in augmented reality, virtual reality, human-computer interaction, and so on, and there are now applications that let people wear virtual make-up and recognize the faces using certain end points as facial landmarks, there are still issues with such developing technology from an accuracy standpoint” as noted by the Fu et al. disclosure in paragraph [0009], which also motivates combination because the combination would predictably have additional utility as there is a reasonable expectation that a virtual makeup try-on experience using facial landmark input would make a very useful application for consumers; and/or because doing so merely combines prior art elements according to known methods to yield predictable results.
Claim 19
Regarding claim 19, Chen et al., Miri et al., and Shapira et al. teach the system of claim 18 as noted above.
Chen et al., Miri et al., and Shapira et al. do not explicitly teach all of wherein the product comprises a makeup product or an appliance product; and the service comprises a cosmetic procedure or a surgical procedure or other face altering procedure.
However, Fu et al. teach wherein the product comprises a makeup product or an appliance product; and the service comprises a cosmetic procedure or a surgical procedure or other face altering procedure ("virtual makeup try-on," par. 33).
Chen et al., Miri et al., Shapira et al., and Fu et al. are combined as per claim 18.
Allowable Subject Matter
Claim 11 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Reference Cited
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
US Patent Publication 2024/0096072 A1 to He et al. discloses a masked autoencoder system wherein the method consists of dividing the input image into a set of patches, selecting a first subset of the patches to be visible and a second subset of the patches to be masked during the pre-training, processing, using the encoder, the first subset of patches to generate corresponding first latent representations, and processing, using the decoder, the first latent representations corresponding to the first subset of patches.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KARSTEN F LANTZ whose telephone number is (571) 272-4564. The examiner can normally be reached Monday-Friday 8:00-4:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ms. Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Karsten F. Lantz/Examiner, Art Unit 2664
Date: 3/18/2026
/JENNIFER MEHMOOD/Supervisory Patent Examiner, Art Unit 2664