Last updated: May 29, 2026
Application No. 18/298,128
ESTIMATION MODEL FOR INTERACTION DETECTION BY A DEVICE

Non-Final OA §103
Filed: Apr 10, 2023
Priority: May 03, 2022 (provisional 63/337,918)
Examiner: KOETH, MICHELLE M
Art Unit: 2671
Tech Center: 2600 (Communications)
Assignee: Samsung Electronics Co., Ltd.
OA Round: 3 (Non-Final)
Interview Optional

Interview already conducted in this application's prosecution history. This examiner has a 77% grant rate with a +16.3% interview lift. Because an interview has already been tried, a written response with narrowed claims, based on precedent claim evolution patterns, is recommended.
Based on 434 resolved cases, 2023–2026
Examiner Intelligence

KOETH, MICHELLE M
Grants 77% (above average)
Career Allowance Rate: 335 granted / 434 resolved (+15.2% vs TC avg)
Interview Lift: +16.3% (strong; resolved cases with interview vs. without)
Avg Prosecution: 2y 2m (fast prosecutor)
22 currently pending
Career history: 465 total applications across all art units
Statute-Specific Performance

§101: 1.5% (-38.5% vs TC avg)
§103: 91.2% (+51.2% vs TC avg)
§102: 1.2% (-38.8% vs TC avg)
§112: 4.8% (-35.2% vs TC avg)
Tech Center averages are estimates. Based on career data from 434 resolved cases.
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's Request for Continuation (herein “RCE”) filed on March 30, 2026 has been entered.

Response to Arguments
Applicant's arguments and amendments filed in the Amendment with RCE filed March 30, 2026 (herein “Amendment”) regarding the rejection of claims 1–6 and 8 under 35 U.S.C. 103  have been fully considered but they are not persuasive. Specifically, Applicant amended claim 1 to now additionally recite “and calculating the output token from the inputs,” and argues that Fu does not teach the weighted input tokens as inputs to the encoder, and that secondary reference Lin does not teach differently weighted tokens or calculating an output token from differently weighted tokens. Applicant then argues impermissible hindsight analysis was employed in the combination of Fu and Lin for the rejection of claim 1.
First, in addressing the combined teachings of Fu and Lin, it is noted that the rejection set forth in the Final Action issued December 21, 2025 (herein “Final rejection”) on page 5, originally relied upon Lin on page 12921, fig. 3, section 3.1, for teaching the claimed “receiving weighted input tokens as inputs to the first encoder layer,” yet Applicant has not traversed these cited passages in Lin relied upon for rejecting the limitation at issue. Upon further review and consideration, the Examiner finds that Lin’s teachings of input tokens weighted by Q query values, K key values and V vertex values into a multi-layer graphormer encoder that outputs “Coarse Mesh Output Tokens,” as shown in fig. 3 replicated below for convenience, do teach the claimed “receiving the first-weighted input token and the second-weighted input token as inputs to the first encoder layer and calculating the output token from the inputs.”

[Figure: Lin, fig. 3 (media_image1.png, greyscale)]

Second, Applicant’s arguments regarding the combination of Fu and Lin employing impermissible hindsight analysis, while having been fully considered, are not persuasive. Impermissible hindsight analysis would be found when the motivation to combine two references comes from the application under examination itself. In this case, the motivation to combine Fu and Lin comes directly from Lin. In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning.  But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper.  See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
Further, Applicant argues that a combination of Fu with Lin would “do away with major features of the point-cloud processing of Fu,” yet, the combination of record is simply modifying Fu to the extent that Fu does not explicitly teach receiving weighted tokens as inputs to its encoder and calculating output tokens from the inputs. That is, the combination of Fu and Lin simply weights input tokens to the encoder. To the extent the motivational statement in Lin references “existing graph” convolution networks, this does not preclude the same benefit from applying to other, future, not yet discovered convolution networks, such as a hypothetical PHOSITA in reviewing the prior art before the effective filing date of the claimed invention would be doing in finding a motivation from having both the teachings of Fu and Lin before them in considering obviousness. In any event, for clarity of the record, the motivational statement has been updated and clarified to strengthen the motivations to combine.
Applicant's arguments and amendments filed in the Amendment regarding the rejection of claims 9 and 17, and claims depending therefrom, under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of Zhai et al., "PGMANet: Pose-Guided Mixed Attention Network for Occluded Person Re-Identification," 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 2021, pp. 1-8, doi: 10.1109/IJCNN52387.2021.9534442.
The following remarks are made for clarity of the record regarding the previously cited Chen reference. Applicant argues on pages 14–15 of the Amendment in response to the rejection against claims 9 and 17, and claims depending therefrom, that the cited Chen reference does not teach or suggest the newly amended “updating, by the estimation-model encoder, the attention mask by removing the occluded portion of the object based on an unoccluded portion of the object,” because Chen in previously cited sections 3.0 and 3.1 discusses augmenting training data with occlusions. However, Chen further discloses in section 3.2 that an attention-guided mask module generates spatial weight maps for input features (attention mask), and that Chen’s attention-guided mask module seeks to capture human features as completely as possible, even when given occluded data, such that the attention mask is learned to focus on the same area whether it is occluded or in a holistic image. Nonetheless, upon further search and consideration, Chen is not the closest/best prior art that teaches the new limitation, and therefore a new ground of rejection in view of Zhai is set forth herein.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1–3, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Fu et al., “POS-BERT: Point Cloud One-Stage BERT Pre-Training,” arXiv:2204.00989v1 [cs.CV], https://doi.org/10.48550/arXiv.2204.00989 (herein “Fu”) in view of Lin et al., "Mesh Graphormer," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 12919-12928, doi: 10.1109/ICCV48922.2021.01270 (herein “Lin”).
Regarding claim 1, with deficiencies of Fu noted in square brackets [], Fu teaches a method of estimating an [interaction with a device], the method comprising (Fu page 1, Introduction, learning point cloud representation promotes applications such as augmented reality): 
configuring a first token and a second token of an estimation model according to one or more first features of a 3-dimensional (3D) object (Fu pages 4–5, fig. 2, section 3, POS-BERT model (estimation model) taking as input a raw point cloud, corresponding to a 3D data representation (see page 1, Introduction), is split into two point clouds, Pg and Pl, which respectively are processed by a PGE module and thereafter with a transformer based encoder to embed (configure) patches from the Pg, and patches from the Pl into respective patch tokens (first and second tokens)); 
applying a first weight to the first token to produce a first-weighted input token and applying a second weight that is different from the first weight to the second token to produce a second-weighted input token (Fu page 6, section 3.3, each encoder respective to the first and second patch tokens has a weight that it applies in the encoding according to equation (1), where θm is the weight of the momentum encoder, and θe is the weight of the encoder, thus different weights); 
generating, by a first encoder layer of an estimation-model encoder of the estimation model, an output token based on [receiving] the first-weighted input token and the second-weighted input token [as inputs to the first encoder layer and calculating the output token from the inputs] (Fu page 6, section 3.4, the Momentum encoder and the Encoder respectively output values $O_m^i$ and $O_e^i$ from the weighted first and second input patch tokens, where page 4, section 3 teaches that a modeling loss is based on the Encoder outputs’ patch tokens and the Momentum Encoder outputs’ patch tokens, thus the output of the Momentum encoder and Encoder being tokens);
[generating, by the estimation model, an estimated output based on the output token]; and
performing an operation based on the estimated output (Fu pages 9–10, section 5.2, various “downstream tasks” which are tasks (operations) performed based on the estimated output, including 3D object classification).
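For orientation only, the two distinct weights cited from Fu section 3.3 (θm for the momentum encoder and θe for the encoder) can be pictured with the following minimal sketch. It assumes the exponential-moving-average update commonly used in momentum-encoder methods; the exact form of Fu's equation (1) is not reproduced in this Office action, and the function name `ema_update` and the `momentum` coefficient are hypothetical.

```python
def ema_update(theta_m, theta_e, momentum=0.99):
    """Return updated momentum-encoder weights.

    Assumed form (not quoted from Fu):
        theta_m <- momentum * theta_m + (1 - momentum) * theta_e
    """
    if len(theta_m) != len(theta_e):
        raise ValueError("weight vectors must have the same length")
    return [momentum * m + (1.0 - momentum) * e
            for m, e in zip(theta_m, theta_e)]


# Hypothetical toy weights for the two encoders.
theta_e = [1.0, 2.0, 3.0]  # encoder weights (trained directly)
theta_m = [0.0, 0.0, 0.0]  # momentum-encoder weights (updated by averaging only)

theta_m = ema_update(theta_m, theta_e, momentum=0.9)

# After one step the two weight sets still differ, which is the point
# the rejection relies on when it calls them "different weights".
weights_differ = theta_m != theta_e
```

Under this assumed update, θm trails θe and equals it only once the averaging has converged, so the two encoders generally apply different weights to their respective patch tokens.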
While Fu teaches that its point-cloud pre-training method is useful for augmented reality applications, and therefore would be a method for augmented reality which would be obvious to a person having ordinary skill in the art (herein “PHOSITA”) to include interpreting user gestures as interactions with an augmented reality app or headset with predictable results, nonetheless, Fu does not explicitly teach/anticipate that the augmented reality includes interactions with an augmented reality device.
Lin teaches interaction with a device (Lin page 12919, Introduction, 3D human pose and mesh reconstruction for applying to human-computer interactions).
Lin further teaches receiving weighted input tokens as inputs to the first encoder layer and calculating the output token from the inputs (Lin page 12921, fig. 3, section 3.1, input tokens into the multi-layer graphormer encoder as a whole, thus the first layer of that encoder, where the input tokens are weighted by Q query values, K key values and V vertex values and fed into a multi-layer graphormer encoder that outputs “Coarse Mesh Output Tokens” from these input Q, K, V weighted tokens).
Lin still further teaches generating, by the estimation model, an estimated output based on the output token (Lin pages 12921–12922, fig. 3, coarse mesh output tokens output from the multi-layer graphormer encoder are then upsampled using MLP to produce the estimate output full mesh).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the feature processing and mesh producing disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
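As a rough illustration of what "input tokens weighted by Q, K and V" means in a transformer encoder layer, the sketch below implements standard scaled dot-product self-attention in plain Python. It is a generic textbook form under stated assumptions, not Lin's graphormer encoder: the projection matrices `w_q`, `w_k`, `w_v`, the helper names, and the toy tokens are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(matrix, token):
    """Apply a weight matrix (list of rows) to one token vector."""
    return [sum(w * x for w, x in zip(row, token)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens, w_q, w_k, w_v):
    """Compute one output token per input token from weighted inputs."""
    queries = [project(w_q, t) for t in tokens]
    keys = [project(w_k, t) for t in tokens]
    values = [project(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Attention weights of this query over every key.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Output token = attention-weighted sum of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy data: two 2-dimensional tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
output_tokens = self_attention(tokens, identity, identity, identity)
```

Each output token here depends on every input token, which is the sense in which an encoder layer receives several weighted tokens as inputs and calculates an output token from them.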
Regarding claim 2, Fu teaches further comprising: receiving, at a backbone of the estimation model, input data corresponding to the interaction with the device (Fu page 5, standard transformer used as encoder backbone, shown in fig. 2 as receiving local point cloud set data, where page 1, section 1 teaches point clouds as being 3D data representation for example with augmented reality and thus would be data corresponding to interaction with an augmented reality device). Fu does not teach the remainder of the limitations of claim 2.
Lin teaches extracting, by the backbone, the one or more first features from the input data (Lin pages 12921–12922, fig. 3, image of size 224x224 used as input and image grid features are extracted from the last convolution block in the CNN (convolutional neural network which is pre-trained in other task of feature extraction)); 
receiving, at a two-dimensional (2D) feature extraction model, the one or more first features from the backbone (Lin pages 12921–12922, fig. 3, a pooling and MLP operation received as input the grid features shown to be 2D); 
extracting, by the 2D feature extraction model, one or more second features associated with the one or more first features, the one or more second features comprising one or more 2D features (Lin pages 12921–12922, fig. 3, the Pooling + MLP outputs a global feature vector from the grid features); 
receiving, at the estimation-model encoder, data generated based on the one or more 2D features (Lin pages 12921–12922, fig. 3, multi-layer graphormer encoder receiving the grid features and the global feature vector); and
generating, by the estimation model, the estimated output based on the data generated based on the one or more 2D features (Lin pages 12921–12922, fig. 3, coarse mesh output tokens output from the multi-layer graphormer encoder are then upsampled using MLP to produce the estimate output full mesh).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the feature processing and mesh producing disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
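The "Pooling + MLP" step cited from Lin fig. 3, which turns 2D grid features into a global feature vector, can be sketched as follows. This is a simplified illustration under assumptions: the Office action does not state the pooling type or layer sizes, so global average pooling and a single linear layer with ReLU are used here, and all names and values are hypothetical.

```python
def global_average_pool(grid):
    """Average an H x W grid of C-dimensional feature cells into one C-vector."""
    cells = [cell for row in grid for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells)
            for c in range(channels)]


def linear_relu(vector, weights, bias):
    """One fully connected layer followed by ReLU (stand-in for the MLP)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


# Hypothetical 2 x 2 grid of 2-channel features from a CNN backbone.
grid_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

pooled = global_average_pool(grid_features)  # [4.0, 5.0]
global_feature = linear_relu(
    pooled,
    weights=[[1.0, -1.0], [0.5, 0.5]],  # hypothetical layer weights
    bias=[0.0, 0.0],
)
```

In this sketch the grid features play the role of the "first features" and the pooled, projected vector plays the role of the "second features" that are passed on to the encoder.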
Regarding claim 3, Fu does not explicitly teach, but Lin teaches wherein the data generated based on the one or more 2D features comprises an attention mask (Lin page 12921, fig. 3, input tokens generated by multi-head self-attention include masked tokens denoted in fig. 3 as [MASK]).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the feature processing and mask from the multi-head self attention disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
Regarding claim 8, Fu does not explicitly teach the limitations of claim 8. Lin teaches generating a 3D scene including a visual representation of the 3D object; and updating the visual representation of the 3D object based on the output token (Lin fig. 3, pages 12921–12922, a coarse mesh is generated, including a visual representation of the human subject shown in the input image, and this coarse mesh is refined into an output mesh, also a representation of the 3D object human subject based on the full mesh output tokens). 
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the coarse and full mesh processing disclosed in Lin at least because doing so would provide for combining self attentions and graph convolutions in a transformer for human mesh reconstruction that outperforms existing graph convolution networks (See Lin page 12919, fig. 1 caption).
Claims 9–10, and 16–18 are rejected under 35 U.S.C. 103 as being unpatentable over Fu in view of Lin, further in view of Zhai et al., "PGMANet: Pose-Guided Mixed Attention Network for Occluded Person Re-Identification," 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 2021, pp. 1-8, doi: 10.1109/IJCNN52387.2021.9534442 (herein “Zhai”).
Regarding claim 9, with deficiencies of Fu noted in square brackets [], Fu teaches a method of estimating [an interaction with a device], the method comprising (Fu page 1, Introduction, learning point cloud representation promotes applications such as augmented reality):
receiving, [at a two-dimensional (2D) feature extraction model of an estimation model,] one or more first features corresponding to input data associated with an interaction with the device (Fu pages 1 and 5, transformer’s input tokens To pass through the h-layer to produce the features of each patch, the tokens corresponding to input point cloud data used in augmented reality device); 
[extracting, by the 2D feature extraction model, one or more second features associated with the one or more first features, the one or more second features comprising one or more 2D features;] 
generating, [by the 2D feature extraction model,] data based on the one or more [2D] features [the data generated based on the one or more 2D features comprising an attention mask indicating an occluded portion of an object associated with the interaction;] (Fu page 5, the feature of each patch is obtained with the global receptive field);
providing the data to an estimation-model encoder of the estimation model (Fu page 5, the features of each patch with the global receptive field are mapped to the loss space with the projector (thus provided to the projector) that is part of the Encoder);
[updating, by the estimation-model encoder, the attention mask by removing the occluded portion of the object based on an unoccluded portion of the object;]
[generating, by the estimation model, an estimated output based on the data generated based on the one or more 2D features]; and
performing an operation based on the estimated output (Fu pages 9–10, section 5.2, various “downstream tasks” which are tasks (operations) performed based on the estimated output, including 3D object classification).
While Fu teaches that its point-cloud pre-training method is useful for augmented reality applications, and therefore would be a method for augmented reality which would be obvious to a PHOSITA to include interpreting user gestures as interactions with an augmented reality app or headset with predictable results, nonetheless, Fu does not explicitly teach/anticipate that the augmented reality includes interactions with an augmented reality device. 
Lin teaches interaction with a device (Lin page 12919, Introduction, 3D human pose and mesh reconstruction for applying to human-computer interactions).
Fu does not explicitly teach, but Lin further teaches: at a two-dimensional (2D) feature extraction model of an estimation model (Lin pages 12921–12922, fig. 3, a pooling and MLP operation received as input the grid features shown to be 2D),
extracting, by the 2D feature extraction model, one or more second features associated with the one or more first features, the one or more second features comprising one or more 2D features (Lin pages 12921–12922, fig. 3, the Pooling + MLP outputs a global feature vector from the grid features);
by the 2D feature extraction model, 2D features (Lin pages 12921–12922, fig. 3, multi-layer graphormer encoder receiving the grid features and the global feature vector);
generating, by the estimation model, an estimated output based on the data generated based on the one or more 2D features (Lin pages 12921–12922, fig. 3, coarse mesh output tokens output from the multi-layer graphormer encoder are then upsampled using MLP to produce the estimate output full mesh).
Fu further does not explicitly teach, but Zhai teaches the data generated based on the one or more 2D features comprising an attention mask indicating an occluded portion of an object associated with the interaction (Zhai pages 3-4, fig. 2, human part attention model which outputs an attention mask that distinguishes the pedestrian in a 2D image, with 2D features, from the background and occlusions, where the distinguishing aspect would thus indicate the occluded part of a pedestrian object associated with various interactions with their environment).
Fu still further does not explicitly teach where Zhai teaches updating, by the estimation-model encoder, the attention mask by removing the occluded portion of the object based on an unoccluded portion of the object (Zhai page 4, fig. 2, as shown, the human part attention model attention mask is updated by the second order information attention model in re-recognition of occluded pedestrians by ignoring/suppressing (removing) the occluded areas of the image (the occluded portion of the pedestrian (object)) while concentrating on the region of the pedestrian body (unoccluded portion of the pedestrian (object))).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the feature processing and mesh producing disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
Further, taking the teachings of Fu as modified by Lin and Zhai together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the mixed attention network for occluded persons operations disclosed in Zhai at least because doing so would increase performance of person re-identification, especially with challenging occlusion images by enriching part-level features and the correlation among them (see Zhai Abstract, and page 2 section II(C), and section III first paragraph).
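To make the disputed limitation concrete, the sketch below shows one generic way an attention mask can suppress occluded positions so that attention is redistributed over the unoccluded ones. It is an illustrative masked softmax only, offered under assumptions: it is neither Zhai's network nor the applicant's claimed encoder, and the function name, scores, and visibility flags are hypothetical.

```python
import math


def masked_attention_weights(scores, visible):
    """Softmax over scores, with occluded positions forced to zero weight.

    scores  -- raw attention scores, one per token or image region
    visible -- True for unoccluded positions, False for occluded ones
    """
    if not any(visible):
        raise ValueError("at least one position must be unoccluded")
    # Occluded positions get -inf so that exp(...) evaluates to 0.0.
    masked = [s if v else float("-inf") for s, v in zip(scores, visible)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: the third region is occluded even though its
# raw score is the highest.
scores = [1.0, 1.0, 5.0]
visible = [True, True, False]
weights = masked_attention_weights(scores, visible)
```

With these inputs the occluded region contributes nothing and the remaining weight is renormalized over the two visible regions, which is the "ignore/suppress the occluded area, concentrate on the body region" behavior the rejection attributes to Zhai.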
Regarding claims 10 and 18, with claim 10 as exemplary, Fu teaches further comprising: receiving, at a backbone of the estimation model, the input data (Fu page 5, standard transformer used as encoder backbone, shown in fig. 2 as receiving local point cloud set data, where page 1, section 1 teaches point clouds as being 3D data representation for example with augmented reality);
associating a first token and a second token of the estimation model with the one or more first features (Fu page 5, series of patch tokens are concatenated and then passed through the h-layer to get a (associating) feature for each patch); 
applying a first weight to the first token to produce a first-weighted input token and applying a second weight that is different from the first weight to the second token to produce a second-weighted input token (Fu page 6, section 3.3, each encoder respective to the first and second patch tokens has a weight that it applies in the encoding according to equation (1), where θm is the weight of the momentum encoder, and θe is the weight of the encoder, thus different weights); 
calculating, by a first encoder layer of the estimation-model encoder, an output token based on receiving the first-weighted input token and the second-weighted input token as inputs (Fu page 6, section 3.4, the Momentum encoder and the Encoder respectively output values $O_m^i$ and $O_e^i$ from the weighted first and second input patch tokens, where page 4, section 3 teaches that a modeling loss is based on the Encoder outputs’ patch tokens and the Momentum Encoder outputs’ patch tokens, thus the output of the Momentum encoder and Encoder being tokens).
Fu does not explicitly teach, but Lin teaches generating, by the backbone, the one or more first features based on the input data (Lin pages 12921–12922, fig. 3, image of size 224x224 used as input and image grid features are extracted (generating) from the last convolution block in the CNN (convolutional neural network which is pre-trained in other task of feature extraction)); generating, by the estimation model, the estimated output based on the output token (Lin pages 12921–12922, fig. 3, coarse mesh output tokens output from the multi-layer graphormer encoder are then upsampled using MLP to produce the estimate output full mesh).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the feature processing and mesh producing disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
Regarding claim 16, Fu teaches calculating, by a first encoder layer of the estimation-model encoder, an output token (Fu page 6, section 3.4, the Momentum encoder and the Encoder respectively output values $O_m^i$ and $O_e^i$ from the weighted first and second input patch tokens, where page 4, section 3 teaches that a modeling loss is based on the Encoder outputs’ patch tokens and the Momentum Encoder outputs’ patch tokens, thus the output of the Momentum encoder and Encoder being tokens).
Fu does not explicitly teach but Lin teaches generating a 3D scene including a visual representation of the interaction with the device; and updating the visual representation of the interaction with the device based on the output token (Lin fig. 3, pages 12921–12922, a coarse mesh is generated, including a visual representation of the human subject shown in the input image, and this coarse mesh is refined into an output mesh, also a representation of the 3D object human subject based on the full mesh output tokens).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the coarse and full mesh processing disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
Regarding claim 17, with deficiencies of Fu noted in square brackets [], Fu teaches a device configured to estimate [an interaction with the device], the device comprising: a memory; and a processor communicably coupled to the memory, wherein the processor is configured to (Fu pages 1 and 7, Introduction and implementation, learning point cloud representation promotes applications such as augmented reality, the training implemented on NVIDIA A100 (a computing processor including a memory)):
receive, [at a two-dimensional (2D) feature extraction model of an estimation model,] one or more first features corresponding to input data associated with an interaction with the device (Fu pages 1 and 5, transformer’s input tokens To pass through the h-layer to produce the features of each patch, the tokens corresponding to input point cloud data used in augmented reality device); 
[generate, by the 2D feature extraction model, one or more second features associated with the one or more first features, the one or more second features comprising one or more 2D features;] 
send, [by the 2D feature extraction model], data generated based on the one or more [2D] features (Fu page 5, the feature of each patch is obtained with the global receptive field) to an estimation-model encoder of the estimation model, [the data generated based on the one or more 2D features comprising an attention mask indicating an occluded portion of an object associated with the interaction] (Fu page 5, the features of each patch with the global receptive field are mapped to the loss space with the projector (thus provided to the projector) that is part of the Encoder);
[updating, by the estimation-model encoder, the attention mask by removing the occluded portion of the object based on an unoccluded portion of the object;]
[generate, by the estimation model, an estimated output based on the data generated based on the one or more 2D features]; and
perform an operation based on the estimated output (Fu pages 9–10, section 5.2, various “downstream tasks” which are tasks (operations) performed based on the estimated output, including 3D object classification).
While Fu teaches that its point-cloud pre-training method is useful for augmented reality applications, and therefore would be a method for augmented reality which would be obvious to a person having ordinary skill in the art (herein “PHOSITA”) to include interpreting user gestures as interactions with an augmented reality app or headset with predictable results, nonetheless, Fu does not explicitly teach/anticipate that the augmented reality includes interactions with an augmented reality device. 
Lin teaches interaction with a device (Lin page 12919, Introduction, 3D human pose and mesh reconstruction for applying to human-computer interactions).
Fu does not explicitly teach, but Lin teaches: at a two-dimensional (2D) feature extraction model of an estimation model (Lin pages 12921–12922, fig. 3, a pooling and MLP operation received as input the grid features shown to be 2D),
generate, by the 2D feature extraction model, one or more second features associated with the one or more first features, the one or more second features comprising one or more 2D features (Lin pages 12921–12922, fig. 3, the Pooling + MLP outputs a global feature vector from the grid features);
by the 2D feature extraction model, 2D features (Lin pages 12921–12922, fig. 3, multi-layer graphormer encoder receiving the grid features and the global feature vector);
generate, by the estimation model, an estimated output based on the data generated based on the one or more 2D features (Lin pages 12921–12922, fig. 3, coarse mesh output tokens output from the multi-layer graphormer encoder are then upsampled using MLP to produce the estimate output full mesh).
Fu further does not explicitly teach, but Zhai teaches the data generated based on the one or more 2D features comprising an attention mask indicating an occluded portion of an object associated with the interaction (Zhai pages 3-4, fig. 2, human part attention model which outputs an attention mask that distinguishes the pedestrian in a 2D image, with 2D features, from the background and occlusions, where the distinguishing aspect would thus indicate the occluded part of a pedestrian object associated with various interactions with their environment).
Fu still further does not explicitly teach, but Zhai teaches update, by the estimation-model encoder, the attention mask by removing the occluded portion of the object based on an unoccluded portion of the object (Zhai page 4, fig. 2, as shown, the human part attention model attention mask is updated by the second order information attention model in re-recognition of occluded pedestrians by ignoring/suppressing (removing) the occluded areas of the image (the occluded portion of the pedestrian (object)) while concentrating on the region of the pedestrian body (unoccluded portion of the pedestrian (object))).
Therefore, taking the teachings of Fu and Lin together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the feature processing and mesh producing disclosed in Lin at least because doing so would improve spatial locality in the features (See Lin page 12921, right column), as well as avoiding redundancies to make training more efficient (See Lin page 12922, right column, ll. 4-5).
Further, taking the teachings of Fu as modified by Lin and Zhai together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the mixed attention network for occluded persons operations disclosed in Zhai at least because doing so would increase performance of person re-identification, especially with challenging occlusion images by enriching part-level features and the correlation among them (see Zhai Abstract, and page 2 section II(C), and section III first paragraph).
Claims 4–5 are rejected under 35 U.S.C. 103 as being unpatentable over Fu and Lin as set forth above regarding claim 1, further in view of Yang et al., "Dynamic Iterative Refinement for Efficient 3D Hand Pose Estimation," 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, January 8, 2022, pp. 2703-2713, doi: 10.1109/WACV51458.2022.00276 (herein “Yang”), further in view of Goyal et al., “PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination,” arXiv:2001.08950v5 [cs.LG], Sept. 8, 2020, https://doi.org/10.48550/arXiv.2001.08950 (herein “Goyal”).
Regarding claim 4, with deficiencies of Fu noted in square brackets [], Fu teaches wherein the first encoder layer of the estimation-model encoder [corresponds to a first BERT encoder of the estimation-model encoder] (Fu fig. 2, pages 5–6, momentum encoder), and the method further comprises: concatenating a token, associated with an output of the first [BERT] encoder, with at least one of [camera intrinsic-parameter data, three-dimensional (3D) hand-wrist data, or bone-length data] to generate concatenated data (Fu fig. 2, pages 4–5, patch tokens output from top processing branch including PGE model processing, are concatenated along the patch dimension with class tokens to get the transformer input for the transformer encoder); and 
receiving the concatenated data at a second BERT encoder (Fu fig. 2, page 5, the concatenated tokens are input to the transformer encoder (BERT encoder)).
Fu does not explicitly teach corresponds to a first BERT encoder of the estimation-model encoder, the first BERT encoder, or camera intrinsic-parameter data, three-dimensional (3D) hand-wrist data, or bone-length data.
Yang teaches camera intrinsic-parameter data, three-dimensional (3D) hand-wrist data, or bone-length data (Yang page 2709, STB dataset including 3D annotations of 21 joints of a hand/wrist).
Goyal teaches corresponds to a first BERT encoder of the estimation-model encoder, the first BERT encoder (Goyal fig. 1, page 2, PoWER-BERT architecture including a chain of 12 encoders in series, thus the output of one BERT encoder being input to the next BERT encoder in the chain).
Therefore, taking the teachings of Fu and Yang together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the hand/wrist data disclosed in Yang at least because doing so would provide for hand pose estimation approaches with improved accuracy and efficiency (See Yang Abstract).
Further, taking the teachings of Fu as modified by Yang and Goyal together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the BERT encoder chain disclosed in Goyal at least because doing so would provide for improving the inference time of a BERT system while maintaining the accuracy (See Goyal Abstract).
Regarding claim 5, with deficiencies of Fu noted in square brackets, Fu teaches [the first BERT encoder] and the second BERT encoder (Fu fig. 2, page 5, the transformer encoder (BERT encoder)) [are included in a chain of BERT encoders, the first BERT encoder and the second BERT encoder being separated by at least three BERT encoders of the chain of BERT encoders; and the chain of BERT encoders] comprises at least one BERT encoder having more than four encoder layers (Fu fig. 2, page 5, within the transformer encoder architecture is an MLP layer, a maxpool layer, and the transformer encoder layer, and multiple MLP layers in the projector, thus at least 5 layers in the encoder).
Goyal teaches the first BERT encoder, are included in a chain of BERT encoders, the first BERT encoder and the second BERT encoder being separated by at least three BERT encoders of the chain of BERT encoders; and the chain of BERT encoders (Goyal fig. 1, page 2, considering encoder 1 as the first BERT encoder, within a chain of 12 encoders, including a second BERT encoder–encoder 5, between encoder 1 and encoder 5 are three BERT encoders, encoders 2, 3, and 4).
Therefore, taking the teachings of Fu as modified by Yang and Goyal together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the BERT encoder chain disclosed in Goyal at least because doing so would provide for improving the inference time of a BERT system while maintaining the accuracy (See Goyal Abstract).
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Fu and Lin as set forth above regarding claim 1, further in view of Shen et al., “Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning,” Nat Biomed Eng., Nov. 2019, 3(11):880-888. doi: 10.1038/s41551-019-0466-4 (herein “Shen”).
Regarding claim 6, with deficiencies of Fu noted in square brackets [], Fu teaches [a data set used to train the estimation model is generated based on two-dimensional (2D) image rotation and rescaling that is projected to three dimensions (3D) in an augmentation process;] and a backbone of the estimation model is trained using two optimizers (Fu fig. 2, pages 6–7, two loss functions are used to train the POS-BERT model, one for mask patch modeling loss (MPM) given in equation 2, and one for the global point clouds (GFC) given in equation 4, where the overall loss function depends on both the MPM and GFC loss functions).
Fu does not explicitly teach, but Shen teaches a data set used to train the estimation model is generated based on two-dimensional (2D) image rotation and rescaling that is projected to three dimensions (3D) in an augmentation process (Shen pages 10 and 5–6, augmented training datasets with rotational transformations, the datasets being of 2D projection images reshaping and resizing the images which are scalable to full-size images (rescaling)).
Therefore, taking the teachings of Fu and Shen together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the data augmentation disclosed in Shen at least because doing so would provide for robustness in the training of a deep neural network (See Shen page 5).
Claims 12–13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Fu in view of Lin in view of Zhai as set forth above regarding claims 9 and 17, further in view of Yang, further in view of Goyal.
Regarding claims 12 and 20, with deficiencies of Fu noted in square brackets [], Fu teaches wherein the estimation-model encoder comprises [a first BERT encoder] comprising a first encoder layer (Fu fig. 2, pages 5–6, momentum encoder), and the method further comprises: concatenating a token, corresponding to an output of the first [BERT] encoder, with at least one of [camera intrinsic-parameter data, three-dimensional (3D) hand-wrist data, or bone-length data] to generate concatenated data (Fu fig. 2, pages 4–5, patch tokens output from top processing branch including PGE model processing, are concatenated along the patch dimension with class tokens to get the transformer input for the transformer encoder); and 
receiving the concatenated data at a second BERT encoder (Fu fig. 2, page 5, the concatenated tokens are input to the transformer encoder (BERT encoder)).
Fu does not explicitly teach comprises a first BERT encoder, the first BERT encoder, or camera intrinsic-parameter data, three-dimensional (3D) hand-wrist data, or bone-length data.
Yang teaches camera intrinsic-parameter data, three-dimensional (3D) hand-wrist data, or bone-length data (Yang page 2709, STB dataset including 3D annotations of 21 joints of a hand/wrist).
Goyal teaches comprises a first BERT encoder, the first BERT encoder (Goyal fig. 1, page 2, PoWER-BERT architecture including a chain of 12 encoders in series, thus the output of one BERT encoder being input to the next BERT encoder in the chain).
Therefore, taking the teachings of Fu and Yang together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the hand/wrist data disclosed in Yang at least because doing so would provide for hand pose estimation approaches with improved accuracy and efficiency (See Yang Abstract).
Further, taking the teachings of Fu as modified by Yang and Goyal together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the BERT encoder chain disclosed in Goyal at least because doing so would provide for improving the inference time of a BERT system while maintaining the accuracy (See Goyal Abstract).
Regarding claim 13, with deficiencies of Fu noted in square brackets, Fu teaches [the first BERT encoder] and the second BERT encoder (Fu fig. 2, page 5, the transformer encoder (BERT encoder)) [are included in a chain of BERT encoders, the first BERT encoder and the second BERT encoder being separated by at least three BERT encoders of the chain of BERT encoders; and the chain of BERT encoders] comprises at least one BERT encoder having more than four encoder layers (Fu fig. 2, page 5, within the transformer encoder architecture is an MLP layer, a maxpool layer, and the transformer encoder layer, and multiple MLP layers in the projector, thus at least 5 layers in the encoder).
Goyal teaches the first BERT encoder, are included in a chain of BERT encoders, the first BERT encoder and the second BERT encoder being separated by at least three BERT encoders of the chain of BERT encoders; and the chain of BERT encoders (Goyal fig. 1, page 2, considering encoder 1 as the first BERT encoder, within a chain of 12 encoders, including a second BERT encoder–encoder 5, between encoder 1 and encoder 5 are three BERT encoders, encoders 2, 3, and 4).
Therefore, taking the teachings of Fu as modified by Yang and Goyal together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the BERT encoder chain disclosed in Goyal at least because doing so would provide for improving the inference time of a BERT system while maintaining the accuracy (See Goyal Abstract).
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Fu in view of Lin in view of Zhai as set forth above regarding claim 9, further in view of Shen et al., “Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning,” Nat Biomed Eng., Nov. 2019, 3(11):880-888. doi: 10.1038/s41551-019-0466-4 (herein “Shen”).
Regarding claim 14, with deficiencies of Fu noted in square brackets [], Fu teaches [a data set used to train the estimation model is generated based on two-dimensional (2D) image rotation and rescaling that is projected to three dimensions (3D) in an augmentation process;] and a backbone of the estimation model is trained using two optimizers (Fu fig. 2, pages 6–7, two loss functions are used to train the POS-BERT model, one for mask patch modeling loss (MPM) given in equation 2, and one for the global point clouds (GFC) given in equation 4, where the overall loss function depends on both the MPM and GFC loss functions).
Fu does not explicitly teach, but Shen teaches a data set used to train the estimation model is generated based on two-dimensional (2D) image rotation and rescaling that is projected to three dimensions (3D) in an augmentation process (Shen pages 10 and 5–6, augmented training datasets with rotational transformations, the datasets being of 2D projection images reshaping and resizing the images which are scalable to full-size images (rescaling)).
Therefore, taking the teachings of Fu and Shen together as a whole, it would have been obvious to a PHOSITA before the effective filing date of the claimed invention to have modified the point cloud data processing of Fu with the data augmentation disclosed in Shen at least because doing so would provide for robustness in the training of a deep neural network (See Shen page 5).

Allowable Subject Matter
Claims 7 and 15 remain objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, for the same reasoning provided on pages 24–25 of the Non-Final Office Action issued August 26, 2025.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Thursday, 09:00-17:00, Friday 09:00-13:00, EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached at 571-272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MICHELLE M. KOETH
Primary Examiner
Art Unit 2671



/MICHELLE M KOETH/Primary Examiner, Art Unit 2671
Prosecution Timeline

Nov 20, 2025
Applicant Interview (Telephonic)
Nov 26, 2025
Response Filed
Dec 31, 2025
Final Rejection mailed — §103
Mar 24, 2026
Applicant Interview (Telephonic)
Mar 24, 2026
Examiner Interview Summary
Mar 30, 2026
Request for Continued Examination
Apr 01, 2026
Response after Non-Final Action
Apr 17, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/242,213
Patent 12626495
MULTIMODAL EMBEDDINGS
2y 8m to grant Granted May 12, 2026
18/297,396
Patent 12586221
METHOD AND APPARATUS FOR ESTIMATING DEPTH INFORMATION OF IMAGES
2y 11m to grant Granted Mar 24, 2026
17/886,027
Patent 12579651
IMPEDED DIFFUSION FRACTION FOR QUANTITATIVE IMAGING DIAGNOSTIC ASSAY
3y 7m to grant Granted Mar 17, 2026
17/988,795
Patent 12567241
Method For Generating Training Data Used To Learn Machine Learning Model, System, And Non-Transitory Computer-Readable Storage Medium Storing Computer Program
3y 3m to grant Granted Mar 03, 2026
18/132,751
Patent 12567177
METHOD, ELECTRONIC DEVICE, AND COMPUTER PROGRAM PRODUCT FOR IMAGE PROCESSING
2y 10m to grant Granted Mar 03, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

3-4
Expected OA Rounds
77%
Grant Probability
94%
With Interview (+16.3%)
2y 2m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 434 resolved cases by this examiner. Grant probability derived from career allowance rate.