Prosecution Insights
Last updated: May 29, 2026
Application No. 18/477,429

MULTIMODAL THREE-DIMENSIONAL ASSET SEARCH TECHNIQUES

Non-Final OA §103
Filed: Sep 28, 2023
Examiner: CHIN, MICHELLE
Art Unit: 2614
Tech Center: 2600 (Communications)
Assignee: Adobe Inc.
OA Round: 3 (Non-Final)
Grant Probability: 85% (Favorable)
OA Rounds: 3-4
Est. Remaining: 0m
Grant Probability With Interview: 97%

Examiner Intelligence

Grants 85%, above average
Career Allowance Rate: 85% (542 granted / 636 resolved, +23.2% vs TC avg)
Interview Lift: +11.5% (moderate lift, based on resolved cases with interview)
Typical timeline
Avg Prosecution: 2y 2m
19 currently pending
Career history
Total Applications: 664 (across all art units)

Statute-Specific Performance

§101: 2.8% (-37.2% vs TC avg)
§103: 88.2% (+48.2% vs TC avg)
§102: 1.5% (-38.5% vs TC avg)
§112: 0.3% (-39.7% vs TC avg)
Based on career data from 636 resolved cases. Tech Center averages are estimates.

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

2. A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 04/27/2026 has been entered.

Response to Amendment

3. Acknowledgement is made of the amendment filed on April 27, 2026, in which claims 1, 11 and 17 are amended, claims 6, 9 and 10 are canceled, and claims 21-24 are new. Claims 1-5, 7, 8 and 11-24 remain pending.

Response to Arguments

4. Applicant's arguments, filed on April 27, 2026, with respect to claims 1-5, 7, 8 and 11-24 have been fully considered but are not persuasive.

5. With regard to the arguments for independent claims 1, 11 and 17, applicants argue that Ponjou Tasse et al. (US 2020/0104318 A1), Liang et al. (US 2020/0005015 A1) and Turkelson et al. (US 2021/0004589 A1) fail to disclose a machine learning model comprising multiple attention layers and multiple residual layers. The examiner respectfully agrees, but the argument is moot in view of the new grounds of rejection for claims 1, 11 and 17, since Bradley et al. (US 2023/01404702 A1) teaches (“training engine 122 trains a machine learning model to learn nonlinear global (e.g., across an entire shape) and local (e.g., in the vicinity of a point or region within a shape) correlations across points or regions in faces, hands, bodies, and/or other three-dimensional (3D) shapes.” [0028] “concatenated tokens 340 and shape token 342 could be processed sequentially by a “stack” of N (where N is an integer greater than or equal to 1) encoder transformer blocks 302, so that the output of a given encoder transformer block is used as input into the next encoder transformer block. Each encoder transformer block includes a cross-covariance image transformer (XCiT) block with a cross-covariance attention (XCA) layer, a transformer block with a self-attention layer, and/or another type of transformer neural network architecture.” [0040] “Within decoder 206, position MLP 310 converts a second set of positions 222 in canonical shape 220 into a corresponding set of position tokens 354. For example, position MLP 310 could include a series of fully connected layers that map each position to a higher-dimensional position token in a latent space. … As with encoder transformer blocks 302, each of decoder transformer blocks 304 includes a cross-covariance image transformer (XCiT) block with a cross-covariance attention (XCA) layer, a transformer block with a self-attention layer, and/or another type of transformer neural network architecture.” [0042-0043]). Bradley teaches a cross-covariance attention layer, a self-attention layer, and/or another type of transformer neural network architecture, which can include multiple attention layers and multiple residual layers. Therefore, Bradley teaches the argued limitations of claims 1, 11 and 17 as recited.

Applicants also argue that Ponjou Tasse et al. (US 2020/0104318 A1), Liang et al. (US 2020/0005015 A1) and Turkelson et al. (US 2021/0004589 A1) fail to disclose minimizing an L2 distance between the encoded input and encoded representations of multiple views. The examiner respectfully agrees, but the argument is moot in view of the new grounds of rejection for claims 1, 11 and 17, since UY et al. (US 2022/0229943 A1) teaches (“a particular source model will be deformed, so rather than representing a single shape (e.g., the shape of the source model) in the retrieval embedding space, each source model is represented by a range of possible deformed shapes in the retrieval embedding space. In an example implementation, this range is represented by a variance that defines an area in the retrieval embedding space, centered around the point where the source gets encoded, and that represents a range of potential deformations of the source model. Accordingly, in some embodiments, a distance function that compares a target to a source model using both the center and variance for a source model serves to define a deformation-aware retrieval. … shared encoder 220 encodes a representation of target shape 210, source selector 230 uses the distance function to calculate the distance between the encoded target and each source model in source database 240, and source selector 230 selects, retrieves, or otherwise identifies a source model with the shortest computed distance from the target (e.g., source shape 250).” [0039-0040]). UY teaches that the shared encoder uses the distance function to calculate the distance between the encoded target and each source model and selects, retrieves, or otherwise identifies a source model with the shortest computed distance from the target. Therefore, UY teaches the argued limitations of claims 1, 11 and 17 as recited.

Applicants further argue that Ponjou Tasse et al. (US 2020/0104318 A1) teaches away from the claimed “mesh” representation.
However, the examiner respectfully disagrees that Ponjou Tasse teaches away from the claimed “mesh” representation, since Ponjou Tasse et al. (US 2020/0104318 A1) teaches (“W is a vector space of words 200 that can map text to a vectorial representation. It is assumed that such a vector space already exists, and remains fixed. An example of such a vector space approach is the “word2vec” approach and neural network as presented in Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality” as published in Advances in neural information processing systems 2013. Semantically close words are mapped to spatially close vectors, as illustrated in the FIG. 2. A similar vector space 300 showing 3D shape descriptors mapped into the above word vector space, in particular showing semantically close shapes being mapped to spatially close vectors, is shown in FIG. 3.” [0047]). Ponjou Tasse teaches that the "word2vec" approach can be adapted to generate 3D mesh objects by training a neural network. Ponjou Tasse teaches the argued limitations of claims 1, 11 and 17 as recited.

Claim Rejections - 35 USC § 103

6. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

7. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

8. The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:

   1. Determining the scope and contents of the prior art.
   2. Ascertaining the differences between the prior art and the claims at issue.
   3. Resolving the level of ordinary skill in the pertinent art.
   4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

9. Claims 1-4, 11-14, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ponjou Tasse et al. (US 2020/0104318 A1) in view of Bradley et al. (US 2023/01404702 A1) and UY et al. (US 2022/0229943 A1).

10. With reference to claim 1, Ponjou Tasse teaches A method comprising: receiving, by a computing system, a query for a three-dimensional representation of a target object, (“The present invention relates to methods for searching for two-dimensional or three-dimensional objects. More particularly, the present invention relates to searching for two-dimensional or three-dimensional objects in a collection by using a multi-modal query of image and/or tag data.” [0001] “descriptors can be computed for each object in a collection in advance of queries being received or processed. 
At query time, a unified descriptor is computed for a multimodal input, and any object with a spatially close descriptor is a relevant result.” [0051]) Ponjou Tasse also teaches the query comprising input in the form of text describing the target object, a two-dimensional image of the target object, or a three-dimensional model of the target object; (“repositories such as Google Images accept image queries in order to identify relevant three-dimensional models within the repository for input images. Specifically, current search functionality on three-dimensional and two-dimensional repositories such as Thingiverse or Google Images accept only one form of query from users, an image or a text.” [0004] “First, a unified descriptor from a combination of image and text is constructed. This is done by first representing each of the image 10 and text descriptors 20a separately using Shape2Vec, followed by using vector calculus (which can comprise vector analysis and/or multivariable calculus and/or vector arithmetic) to combine the descriptors 20b.” [0046]) Ponjou Tasse further teaches encoding, by the computing system using a machine learning model, the input to generate an encoded representation of the input; (“The use of neural networks may be referred to as a use of machine learning. Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. 
A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) Ponjou Tasse teaches searching, by the computing system, identify a three-dimensional representation of the target object, the search space comprising encoded representations of multiple views of a plurality of sample three-dimensional object representations; (“Images and shapes may be embedded in the same word vector space, thus ensuring that all modalities share a common representation. This is extendable to three-dimensional shapes by computing rendered views for a shape from multiple viewpoints and computing a descriptor for each view. The shape descriptor may then comprise an average of its view descriptors.” [0026] “By taking multiple views of an object, it can be more accurately located in a collection of objects as more viewpoints can allow for more efficient or certain identification of similar objects. … there is provided a method for searching for an image or shape based on a query comprising tag and image data, comprising the steps of: creating a word space in which images, three dimensional objects, text and combinations of the same are embedded; determining vector representations for each of the images, three dimensional objects, text and combinations of the same; determining a vector representation for the query; determining which one or more of the images, three dimensional objects, text and combinations have a spatially close vector representation to the vector representation for the query. 
Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0030-0032] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) Ponjou Tasse teaches determining which three dimensional objects have a spatially close vector representation to the vector representation for the query and searching a collection of objects allow for the location of three-dimensional objects. Ponjou Tasse also teaches outputting, by the computing system, the identified three-dimensional representation of the target object. (“Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “After training the above neural network, a classifier has been developed that can identify image labels. This network can then be modified and its parameters fine-tuned to embed 425 images in W as described below. Next, an image encoder is trained to output vectors in W. 
To generate a vector that lies in W, given an image, the classifier is converted into an encoder that returns vectors similar to the vector representation of the image label.” [0062-0063])

[Figure: media_image1.png]

Ponjou Tasse does not explicitly teach a machine learning model comprising multiple attention layers and multiple residual layers, a search space using nearest neighbors; the searching comprising minimizing an L2 distance between one or more of the encoded representations of multiple views of the plurality of sample three-dimensional object representations in the search space and the encoded input, wherein the three-dimensional representation of the target object is a mesh. This is what Bradley teaches. Bradley teaches a machine learning model comprising multiple attention layers and multiple residual layers, wherein the three-dimensional representation of the target object is a mesh. (“training engine 122 trains a machine learning model to learn nonlinear global (e.g., across an entire shape) and local (e.g., in the vicinity of a point or region within a shape) correlations across points or regions in faces, hands, bodies, and/or other three-dimensional (3D) shapes.” [0028] “concatenated tokens 340 and shape token 342 could be processed sequentially by a “stack” of N (where N is an integer greater than or equal to 1) encoder transformer blocks 302, so that the output of a given encoder transformer block is used as input into the next encoder transformer block. Each encoder transformer block includes a cross-covariance image transformer (XCiT) block with a cross-covariance attention (XCA) layer, a transformer block with a self-attention layer, and/or another type of transformer neural network architecture.” [0040] “Within decoder 206, position MLP 310 converts a second set of positions 222 in canonical shape 220 into a corresponding set of position tokens 354. 
For example, position MLP 310 could include a series of fully connected layers that map each position to a higher-dimensional position token in a latent space. … As with encoder transformer blocks 302, each of decoder transformer blocks 304 includes a cross-covariance image transformer (XCiT) block with a cross-covariance attention (XCA) layer, a transformer block with a self-attention layer, and/or another type of transformer neural network architecture.” [0042-0043] “at inference time such a generative neural network 204 (or other machine learning (ML) or artificial intelligence (AI) model) can be used to generate textured 3D meshes 206 for a variety of different objects, of one or more object types, as illustrated in FIG. 2A. In at least one embodiment, different input feature vectors 202 (or latent codes) can be provided as input to this generative network, and each different input vector 202 can result in a different output textured 3D mesh.” [0057]) Bradley teaches a cross-covariance attention layer, a self-attention layer, and/or another type of transformer neural network architecture, which can include multiple attention layers and multiple residual layers. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Bradley into Ponjou Tasse, in order to generate more accurate or realistic shapes. The combination of Ponjou Tasse and Bradley does not explicitly teach a search space using nearest neighbors; the searching comprising minimizing an L2 distance between one or more of the encoded representations of multiple views of the plurality of sample three-dimensional object representations in the search space and the encoded input. 
This is what UY teaches (“in some embodiments in which shared encoder 220 is implemented using a point cloud encoder, for each source model in source database 240, a corresponding 3D point cloud is sampled from the source model, and the 3D point cloud is fed into shared encoder 220 to generate a corresponding mean or center code SR ∈ ℝ^{n_4}. … a particular source model will be deformed, so rather than representing a single shape (e.g., the shape of the source model) in the retrieval embedding space, each source model is represented by a range of possible deformed shapes in the retrieval embedding space. In an example implementation, this range is represented by a variance that defines an area in the retrieval embedding space, centered around the point where the source gets encoded, and that represents a range of potential deformations of the source model. Accordingly, in some embodiments, a distance function that compares a target to a source model using both the center and variance for a source model serves to define a deformation-aware retrieval. … shared encoder 220 encodes a representation of target shape 210, source selector 230 uses the distance function to calculate the distance between the encoded target and each source model in source database 240, and source selector 230 selects, retrieves, or otherwise identifies a source model with the shortest computed distance from the target (e.g., source shape 250).” [0038-0040] “Embodiments described herein support 3D model generation. 
The components described herein refer to integrated components of a 3D model generation system.” [0070] “Although some implementations are described with respect to neural networks, some embodiments are implemented using other types of machine learning model(s), such as those using linear regression, logistic regression, decision trees, support vector machines (SVM), Naive Bayes, k-nearest neighbor (Knn), K means clustering, random forest, dimensionality reduction algorithms, gradient boosting algorithms, neural networks (e.g., auto-encoders, convolutional, recurrent, perceptrons, Long/Short Term Memory (LSTM), Hopfield, Boltzmann, deep belief, deconvolutional, generative adversarial, liquid state machine, etc.), and/or other types of machine learning models.” [0073]) UY teaches shared encoder uses the distance function to calculate the distance between the encoded target and each source model and selects, retrieves, or otherwise identifies a source model with the shortest computed distance from the target. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of UY into the combination of Ponjou Tasse and Bradley, in order to produce the fidelity, level of detail, and overall quality of 3D models generated by professional 3D artists. 11. With reference to claim 2, Ponjou Tasse teaches the input comprises the three-dimensional model of the target object, and wherein the method further comprises: generating multiple views of the three-dimensional model; and encoding each of the multiple views using the machine learning model. (“The use of neural networks may be referred to as a use of machine learning. 
Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “Images and shapes may be embedded in the same word vector space, thus ensuring that all modalities share a common representation. This is extendable to three-dimensional shapes by computing rendered views for a shape from multiple viewpoints and computing a descriptor for each view. The shape descriptor may then comprise an average of its view descriptors.” [0026] “Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) 12. With reference to claim 3, Ponjou Tasse teaches the input comprises the text describing the target object; and the machine learning model comprises a text encoder. (“The use of neural networks may be referred to as a use of machine learning. 
Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) 13. With reference to claim 4, Ponjou Tasse teaches the input comprises the two-dimensional image describing the target object; and the machine learning model comprises an image encoder. (“current search functionality on three-dimensional and two-dimensional repositories such as Thingiverse or Google Images accept only one form of query from users, an image or a text.” [0004] “The use of neural networks may be referred to as a use of machine learning. 
Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) 14. Claim 11 is similar in scope to claim 1, and thus is rejected under similar rationale. Ponjou Tasse does not explicitly teach A system comprising: a memory component; and a processing device coupled to the memory component, the processing device configured to perform operations. This is what Turkelson teaches (“Some aspects include a system, including: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate operations of each the above-mentioned processes.” [0017]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into Ponjou Tasse, in order to enhance a training set for a visual search process. 15. Claims 12-14 are similar in scope to claims 2-4, and they are rejected under similar rationale. 16. Claim 17 is similar in scope to claim 1, and thus is rejected under similar rationale. 
Ponjou Tasse does not explicitly teach A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations. This is what Turkelson teaches (“Some aspects include a tangible, non-transitory, machine-readable medium storing instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations including each of the above-mentioned processes.” [0016]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into Ponjou Tasse, in order to enhance a training set for a visual search process. 17. Claim 18 is similar in scope to claim 2, and thus is rejected under similar rationale. 18. With reference to claim 22, Ponjou Tasse teaches determining a mode of the input of the query, wherein the mode is the text, the two-dimensional image, or the three-dimensional model; and routing the input to a specific encoder of the machine learning model based on the determined mode of the input, wherein the specific encoder is a text encoder when the mode is the text, and the specific encoder is an image encoder when the mode is the two-dimensional image or the three-dimensional model. (“Aspects and/or embodiments seek to provide a method of searching for digital objects using any combination of images, three-dimensional shapes and text by embedding the vector representations for these multiple modes in the same space.” [0010] “The use of neural networks may be referred to as a use of machine learning. 
Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data. … the neural network is generated by an image classifier followed by an image encoder operable to generate embeddings in the vector space of words. Optionally, the classifier is operable to be trained to identify image labels. Optionally, the classifier is converted to an encoder operable to generate semantic-based descriptors.” [0019-0020] “there is provided a further step of receiving a query regarding the image data and/or the tag data; and providing a unified representation in relation to the query.” [0023] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) 19. With reference to claim 23, Ponjou Tasse teaches the input comprises a rough three-dimensional shape sketched by a user, and wherein outputting the identified three-dimensional representation of the target object comprises replacing the rough three-dimensional shape sketched by the user with the outputted three-dimensional representation of the target object. (“By using a unified representation for multi-modal data, it can be possible to search within one vector space for relevant multi-modal data using other multi-modal data. 
3D models may be considered as collections of rendered images, where each image is a rendering from a random viewpoint. Therefore a 3D shape descriptor can comprise a combination of its view descriptors.” [0012] “Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “After training the above neural network, a classifier has been developed that can identify image labels. This network can then be modified and its parameters fine-tuned to embed 425 images in W as described below. Next, an image encoder is trained to output vectors in W. To generate a vector that lies in W, given an image, the classifier is converted into an encoder that returns vectors similar to the vector representation of the image label.” [0062-0063] “This method can be easily extended to a new modality such as depth images and sketches by training an encoder for mapping the modality to W. An example of a multi-modal query 600 comprising a sketch and tag is shown in FIG. 6. There is also shown the corresponding results retrieved comprising databases of 3D models in an upright orientation 605, databases of 3D models in an arbitrary orientation 610, sketches 615 and images 620.” [0075]) 20. Claim(s) 5, 7, 8, 15, 16, 19 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ponjou Tasse et al. (US 2020/0104318 A1), Bradley et al. (US 2023/01404702 A1) and UY et al. (US 2022/0229943 A1), as applied to claims 1, 11 and 17 above, and further in view of Turkelson et al. (US 2021/0004589 A1). 21. 
With reference to claim 5, the combination of Ponjou Tasse, Bradley and UY does not explicitly teach normalizing the encoded representation of the input, wherein the encoded representations of the plurality of sample three-dimensional object representations in the search space are normalized, and wherein the searching comprises comparing the normalized encoded representation of the input to the normalized encoded representations in the search space. This is what Turkelson teaches (“a classifier may be trained using extracted features from an earlier layer of the machine learning model. In some embodiments, preprocessing may be performed to an input image prior to the feature extraction being performed. For example, preprocessing may include resizing, normalizing, cropping, etc., to each image to allow that image to serve as an input to the pre-trained model. Example pre-trained networks may include AlexNet, GoogLeNet, MobileNet-v2, and others. The preprocessing input images may be fed to the pre-trained model, which may extract features, and those features may then be used to train a classifier (e.g., SVM). 
In some embodiments, the input images, the features extracted from each of the input images, an identifier labeling each of the input image, or any other aspect capable of being used to describe each input image, or a combination thereof, may be stored in memory (e.g., within training data database 136A as an update to training data set for training an object recognition model, a context classification model, etc.).” [0084] “Some embodiments may include the trained computer-vision object recognition model having parameters that encode information about a subset of visual features of the object depicted by each image from the training data set.” [0088] “image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “receive a query image, pass the image to a deep neural network that extracts deep features, before computing distances to all images in the index and presenting a nearest neighbor as a search result. Some embodiments may receive a query image (e.g., a URL of a selected online image hosted on a website, a captured image from a mobile device camera, or a sketch drawn by a user in a bitmap editor) and determine the nearest neighbor, computing its distance in vector space.” [0158] “the trained computer-vision object recognition model may extract one or more visual features describing the new image. The visual features may be compared to the visual features extracted from each of the images from the training data set to determine a similarity between the visual features of the new image and the visual features of the images from the training data set. 
In some embodiments, the visual features of the new image and the visual features of the images from the training data set may be represented as feature vectors in an n-dimensional feature space.” [0189]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse, Bradley and UY, in order to enhance a training set for a visual search process. 22. With reference to claim 7, the combination of Ponjou Tasse, Bradley and UY does not explicitly teach the multiple views of each of the plurality of sample three-dimensional object representations comprises at least one hundred views from predetermined viewpoints. This is what Turkelson teaches (“image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “the candidate video may include video of depicting the object from a first perspective (e.g., head-on) for a first amount of time (e.g., four seconds), followed by video depicting the object from a second perspective (e.g., a side view) for a second amount of time (e.g., five seconds). 
Mobile computing device 104 may be configured to continually obtain the video for a predefined amount of time (e.g., 10 seconds, 30 seconds, 1 minute, etc.), until a threshold number of images are obtained (e.g., 10 or more images, 20 or more images, 50 or more images), until images captured by the video satisfy a threshold number of criteria (e.g., a threshold number of perspective views of the object are obtained, a threshold number of lighting conditions are obtained, etc.), or a combination thereof.” [0246]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse, Bradley and UY, in order to enhance a training set for a visual search process. 23. With reference to claim 8, the combination of Ponjou Tasse, Bradley and UY does not explicitly teach the multiple views of each of the plurality of sample three-dimensional object representations comprise views from predetermined viewpoints, and one or more of views with varying lighting or views with varying texture. This is what Turkelson teaches (“image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “the candidate video may include video of depicting the object from a first perspective (e.g., head-on) for a first amount of time (e.g., four seconds), followed by video depicting the object from a second perspective (e.g., a side view) for a second amount of time (e.g., five seconds). 
Mobile computing device 104 may be configured to continually obtain the video for a predefined amount of time (e.g., 10 seconds, 30 seconds, 1 minute, etc.), until a threshold number of images are obtained (e.g., 10 or more images, 20 or more images, 50 or more images), until images captured by the video satisfy a threshold number of criteria (e.g., a threshold number of perspective views of the object are obtained, a threshold number of lighting conditions are obtained, etc.), or a combination thereof.” [0246]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse, Bradley and UY, in order to enhance a training set for a visual search process.

24. Claims 15 and 19 are similar in scope to claim 5, and they are rejected under similar rationale.

25. Claims 16 and 20 are similar in scope to claim 8, and they are rejected under similar rationale.

Allowable Subject Matter

26. Claims 21 and 24 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is an examiner’s statement of reasons for allowance:

Regarding claim 21, the prior art of record fails to either individually or in combination teach the claimed feature of “maintaining a look-up table mapping the encoded representations in the search space to the corresponding plurality of sample three-dimensional object representations and corresponding viewpoints; and traversing the look-up table to identify the three-dimensional representation of the target object based on an identified encoded representation from the search space.”

Regarding claim 24, the prior art of record fails to either individually or in combination teach the claimed feature of “estimating a depth map of the two-dimensional image; and using the outputted three-dimensional representation of the target object to cast one or more shadows on the estimated depth map of the two-dimensional image.”

Conclusion

27. Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michelle Chin whose telephone number is (571)270-3697. The examiner can normally be reached on Monday-Friday 8:00 AM-4:30 PM. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kent Chang can be reached on (571)272-7667. The fax phone number for the organization where this application or proceeding is assigned is (571)273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov.
Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHELLE CHIN/ Primary Examiner, Art Unit 2614

Prosecution Timeline

Jan 02, 2026: Response Filed
Jan 30, 2026: Final Rejection mailed, §103
Mar 13, 2026: Interview Requested
Mar 19, 2026: Examiner Interview Summary
Mar 19, 2026: Applicant Interview (Telephonic)
Apr 27, 2026: Request for Continued Examination
Apr 30, 2026: Response after Non-Final Action
May 06, 2026: Non-Final Rejection mailed, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12633040
METHOD AND SYSTEM FOR TEXTURING AN IMAGE
2y 10m to grant Granted May 19, 2026
Patent 12626453
METHOD AND ARRANGEMENTS FOR GRAPHICALLY VISUALIZING DATA TRANSFER IN A 3D VIRTUAL ENVIRONMENT
2y 6m to grant Granted May 12, 2026
Patent 12608895
ENHANCING MONITORING SYSTEM WITH AUGMENTED REALITY
2y 0m to grant Granted Apr 21, 2026
Patent 12608840
Orientation Based on Vanishing Points
1y 11m to grant Granted Apr 21, 2026
Patent 12602870
COMPUTER-AIDED TECHNIQUES FOR DESIGNING 3D SURFACES BASED ON GRADIENT SPECIFICATIONS
2y 2m to grant Granted Apr 14, 2026
Based on the 5 most recent grants.


Prosecution Projections

3-4
Expected OA Rounds
85%
Grant Probability
97%
With Interview (+11.5%)
2y 2m (~0m remaining)
Median Time to Grant
High
PTA Risk
Based on 636 resolved cases by this examiner. Grant probability derived from career allowance rate.
