Last updated: April 19, 2026
Application No. 18/477,429
MULTIMODAL THREE-DIMENSIONAL ASSET SEARCH TECHNIQUES

Final Rejection §103
Filed: Sep 28, 2023
Examiner: CHIN, MICHELLE
Art Unit: 2614
Tech Center: 2600 (Communications)
Assignee: Adobe Inc.
OA Round: 2 (Final)
Interview Optional

+11.5% interview lift. This examiner has a relatively high allow rate; a written response may suffice.
Based on 634 resolved cases, 2023–2026
Examiner Intelligence

CHIN, MICHELLE
Career Allow Rate: 85% (above average), 540 granted / 634 resolved, +23.2% vs TC avg
Interview Lift: +11.5% (moderate), based on resolved cases with interview
Avg Prosecution (typical timeline): 2y 4m, 29 currently pending
Total Applications (career history): 663 across all art units
Statute-Specific Performance

§101: 8.8% (-31.2% vs TC avg)
§103: 70.6% (+30.6% vs TC avg)
§102: 5.1% (-34.9% vs TC avg)
§112: 1.6% (-38.4% vs TC avg)
Based on career data from 634 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
2.	Acknowledgement is made of the amendment filed on January 2, 2026, in which claims 1, 11 and 17 are amended and claim 10 is canceled. Claims 1-9 and 11-20 remain pending.

Response to Arguments
3.	Applicant's arguments, filed on January 2, 2026, with respect to claims 1-9 and 11-20 have been fully considered but are not persuasive.
4.	With regard to the arguments for independent claims 1, 11 and 17, applicant argues that Ponjou Tasse et al. (US 2020/0104318 A1) and Turkelson et al. (US 2021/0004589 A1) fail to disclose searching a search space using nearest neighbors to identify a three-dimensional representation of the target object, wherein the three-dimensional representation of the target object is a mesh. The examiner respectfully disagrees with respect to Ponjou Tasse, which teaches (“By taking multiple views of an object, it can be more accurately located in a collection of objects as more viewpoints can allow for more efficient or certain identification of similar objects. … there is provided a method for searching for an image or shape based on a query comprising tag and image data, comprising the steps of: creating a word space in which images, three dimensional objects, text and combinations of the same are embedded; determining vector representations for each of the images, three dimensional objects, text and combinations of the same; determining a vector representation for the query; determining which one or more of the images, three dimensional objects, text and combinations have a spatially close vector representation to the vector representation for the query. Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0030-0032]) That is, Ponjou Tasse teaches determining which three-dimensional objects have a vector representation spatially close to the vector representation for the query, and teaches that searching a collection of objects allows three-dimensional objects to be located. The arguments are also moot in view of the new grounds of rejection of claims 1, 11 and 17, which rely on Liang et al. (US 2020/0005015 A1). Liang teaches (“3D image analysis of whole slide image volumes produces large amount of quantifications such as 3D spatial objects and features. In a typical 3D analytical pathology imaging pipeline, selected biopsies are sectioned into thin slices and mounted on physical glasses. These slides are then scanned into digital images to form 3D image volumes. With the image volume, micro-anatomic objects of interest such as blood vessels and cells are reconstructed in 3D models. Finally, the 3D objects as well as their extracted features are managed and queried by a spatial data management system. Models for 3D object representation, such as a mesh based approach can be implemented using polyhedral modeling. In an embodiment of the disclosed system and method, 3D objects are represented for example, in geometry definition file format OFF. The mesh model with OFF specifies both the geometry (shapes, sizes and absolute positions) and topology (relationships among elements).” [0096-0097] “Spatial proximity estimation is another embodiment of the disclosed 3D spatial query system and method, that explores the distribution of target objects in 3D space given a set of basic objects. In 3D digital pathology, for instance, the spatial distribution of different types of vessels in liver organ is useful in providing a quantitative measurement of disease progression. 3D proximity estimation is a complex spatial query involving multiple objects. It is based on nearest neighbor search and demands accurate distance computation for global spatial pattern discovery.” [0200]) That is, Liang teaches mesh-based models for 3D object representation, and 3D proximity estimation that involves multiple objects and is based on nearest neighbor search. While Ponjou Tasse teaches identifying a three-dimensional representation of the target object, Liang teaches searching a search space using nearest neighbors, wherein the three-dimensional representation of the target object is a mesh. Therefore, the combination of Ponjou Tasse and Liang teaches the argued limitations of claims 1, 11 and 17 as recited.

Claim Rejections - 35 USC § 103
5.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
6.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

7.	The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1.	Determining the scope and contents of the prior art.
2.	Ascertaining the differences between the prior art and the claims at issue.
3.	Resolving the level of ordinary skill in the pertinent art.
4.	Considering objective evidence present in the application indicating obviousness or nonobviousness.
 
8.	Claim(s) 1-4, 9, 11-14, 17 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ponjou Tasse et al. (US 2020/0104318 A1) in view of Liang et al. (US 2020/0005015 A1). 
9.	With reference to claim 1, Ponjou Tasse teaches A method comprising: receiving, by a computing system, a query for a three-dimensional representation of a target object, (“The present invention relates to methods for searching for two-dimensional or three-dimensional objects. More particularly, the present invention relates to searching for two-dimensional or three-dimensional objects in a collection by using a multi-modal query of image and/or tag data.” [0001] “descriptors can be computed for each object in a collection in advance of queries being received or processed. At query time, a unified descriptor is computed for a multimodal input, and any object with a spatially close descriptor is a relevant result.” [0051])
Ponjou Tasse also teaches the query comprising input in the form of text describing the target object, a two-dimensional image of the target object, or a three-dimensional model of the target object; (“repositories such as Google Images accept image queries in order to identify relevant three-dimensional models within the repository for input images. Specifically, current search functionality on three-dimensional and two-dimensional repositories such as Thingiverse or Google Images accept only one form of query from users, an image or a text.” [0004] “First, a unified descriptor from a combination of image and text is constructed. This is done by first representing each of the image 10 and text descriptors 20a separately using Shape2Vec, followed by using vector calculus (which can comprise vector analysis and/or multivariable calculus and/or vector arithmetic) to combine the descriptors 20b.” [0046])
Ponjou Tasse further teaches encoding, by the computing system using a machine learning model, the input to generate an encoded representation of the input; (“The use of neural networks may be referred to as a use of machine learning. Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055])
Ponjou Tasse teaches searching, by the computing system, a search space to identify a three-dimensional representation of the target object, the search space comprising encoded representations of multiple views of a plurality of sample three-dimensional object representations; (“Images and shapes may be embedded in the same word vector space, thus ensuring that all modalities share a common representation. This is extendable to three-dimensional shapes by computing rendered views for a shape from multiple viewpoints and computing a descriptor for each view. The shape descriptor may then comprise an average of its view descriptors.” [0026] “By taking multiple views of an object, it can be more accurately located in a collection of objects as more viewpoints can allow for more efficient or certain identification of similar objects. … there is provided a method for searching for an image or shape based on a query comprising tag and image data, comprising the steps of: creating a word space in which images, three dimensional objects, text and combinations of the same are embedded; determining vector representations for each of the images, three dimensional objects, text and combinations of the same; determining a vector representation for the query; determining which one or more of the images, three dimensional objects, text and combinations have a spatially close vector representation to the vector representation for the query. Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0030-0032] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055]) That is, Ponjou Tasse teaches determining which three-dimensional objects have a vector representation spatially close to the vector representation for the query, and teaches that searching a collection of objects allows three-dimensional objects to be located.
Ponjou Tasse also teaches outputting, by the computing system, the identified three-dimensional representation of the target object. (“Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “After training the above neural network, a classifier has been developed that can identify image labels. This network can then be modified and its parameters fine-tuned to embed 425 images in W as described below. Next, an image encoder is trained to output vectors in W. To generate a vector that lies in W, given an image, the classifier is converted into an encoder that returns vectors similar to the vector representation of the image label.” [0062-0063])
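The passages of Ponjou Tasse quoted for the searching step ([0026], [0030-0032]) describe a shape descriptor formed as the average of per-view descriptors, with retrieval by spatial closeness to the query vector. A minimal Python sketch of that idea follows. The function names, the toy three-dimensional vectors, and the choice of Euclidean distance as the closeness measure are illustrative assumptions; none of them is taken from the reference or from the application.

```python
import math

def average_descriptor(view_descriptors):
    """Average per-view vectors into a single shape descriptor."""
    n = len(view_descriptors)
    dim = len(view_descriptors[0])
    return [sum(v[i] for v in view_descriptors) / n for i in range(dim)]

def closest_shape(query, shape_descriptors):
    """Return the name of the shape whose descriptor is closest to the query."""
    return min(
        shape_descriptors,
        key=lambda name: math.dist(query, shape_descriptors[name]),
    )

# Hypothetical toy data: two shapes, two rendered-view descriptors each.
views = {
    "chair": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "lamp": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
}
index = {name: average_descriptor(v) for name, v in views.items()}
result = closest_shape([0.9, 0.1, 0.0], index)
```

Averaging collapses every object to one vector. A search space that keeps each view encoding separately, as the claim language recites, would instead compare the query against every stored view.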


Ponjou Tasse does not explicitly teach searching a search space using nearest neighbors; wherein the three-dimensional representation of the target object is a mesh. This is what Liang teaches (“3D image analysis of whole slide image volumes produces large amount of quantifications such as 3D spatial objects and features. In a typical 3D analytical pathology imaging pipeline, selected biopsies are sectioned into thin slices and mounted on physical glasses. These slides are then scanned into digital images to form 3D image volumes. With the image volume, micro-anatomic objects of interest such as blood vessels and cells are reconstructed in 3D models. Finally, the 3D objects as well as their extracted features are managed and queried by a spatial data management system. Models for 3D object representation, such as a mesh based approach can be implemented using polyhedral modeling. In an embodiment of the disclosed system and method, 3D objects are represented for example, in geometry definition file format OFF. The mesh model with OFF specifies both the geometry (shapes, sizes and absolute positions) and topology (relationships among elements).” [0096-0097] “Spatial proximity estimation is another embodiment of the disclosed 3D spatial query system and method, that explores the distribution of target objects in 3D space given a set of basic objects. In 3D digital pathology, for instance, the spatial distribution of different types of vessels in liver organ is useful in providing a quantitative measurement of disease progression. 3D proximity estimation is a complex spatial query involving multiple objects. It is based on nearest neighbor search and demands accurate distance computation for global spatial pattern discovery.” [0200]) That is, Liang teaches mesh-based models for 3D object representation, and 3D proximity estimation that involves multiple objects and is based on nearest neighbor search.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Liang into Ponjou Tasse, in order to provide options for users to decide and tailor their goals for faster queries or higher accuracy to meet application specific requirements.
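Liang's quoted paragraphs [0096-0097] refer to meshes stored in the OFF geometry definition file format, which lists vertex coordinates (geometry) followed by faces given as vertex indices (topology). As background, a minimal Python sketch of reading such a file is given below. It assumes a plain ASCII OFF file with no comments or color data; the `parse_off` helper and the tetrahedron sample are hypothetical illustrations and appear in neither reference.

```python
def parse_off(text):
    """Parse a simple ASCII OFF mesh into (vertices, faces)."""
    tokens = text.split()
    if tokens[0] != "OFF":
        raise ValueError("not an OFF file")
    # Header counts: vertices, faces, edges (the edge count is not used here).
    n_vertices, n_faces = int(tokens[1]), int(tokens[2])
    pos = 4
    vertices = []
    for _ in range(n_vertices):
        vertices.append(tuple(float(t) for t in tokens[pos:pos + 3]))
        pos += 3
    faces = []
    for _ in range(n_faces):
        k = int(tokens[pos])  # number of vertex indices in this face
        faces.append(tuple(int(t) for t in tokens[pos + 1:pos + 1 + k]))
        pos += 1 + k
    return vertices, faces

# Hypothetical sample: a tetrahedron with 4 vertices and 4 triangular faces.
TETRAHEDRON = """OFF
4 4 6
0 0 0
1 0 0
0 1 0
0 0 1
3 0 1 2
3 0 1 3
3 0 2 3
3 1 2 3
"""
vertices, faces = parse_off(TETRAHEDRON)
```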
10.	With reference to claim 2, Ponjou Tasse teaches the input comprises the three-dimensional model of the target object, and wherein the method further comprises: generating multiple views of the three-dimensional model; and encoding each of the multiple views using the machine learning model. (“The use of neural networks may be referred to as a use of machine learning. Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “Images and shapes may be embedded in the same word vector space, thus ensuring that all modalities share a common representation. This is extendable to three-dimensional shapes by computing rendered views for a shape from multiple viewpoints and computing a descriptor for each view. The shape descriptor may then comprise an average of its view descriptors.” [0026] “Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055])
11.	With reference to claim 3, Ponjou Tasse teaches the input comprises the text describing the target object; and the machine learning model comprises a text encoder. (“The use of neural networks may be referred to as a use of machine learning. Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “Searching a collection of objects based on visual and semantic similarity can allow for the location of three-dimensional objects based on multi-modal search queries using image and/or text data.” [0032] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055])
12.	With reference to claim 4, Ponjou Tasse teaches the input comprises the two-dimensional image describing the target object; and the machine learning model comprises an image encoder. (“current search functionality on three-dimensional and two-dimensional repositories such as Thingiverse or Google Images accept only one form of query from users, an image or a text.” [0004] “The use of neural networks may be referred to as a use of machine learning. Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data” [0019] “The visual-semantic descriptors are computed in two steps: (1) embedding images and shapes in W and then (2) embedding multi-modal input of the image (object) and text in W.” [0048] “To obtain this convolutional network, an image classifier is created, followed by an image encoder operable to generate embeddings in the word vector space. A classifier may be trained to correctly identify image labels. This classifier may then be converted into an encoder and fine-tuned to generated semantic-based descriptors in the next section.” [0055])
13.	With reference to claim 9, Ponjou Tasse teaches the machine learning model comprises a neural network having multiple attention layers. (“The use of neural networks may be referred to as a use of machine learning. Machine learning may further be performed through the use of one or more of: a non-linear hierarchical algorithm; recurrent neural network; long short-term memory network; multi-dimensional convolutional network; a memory network; or a gated recurrent network allows a flexible approach when generating the predicted block of visual data. … the neural network comprises one or more fully-connected layers.” [0019 -0020] “The AlexNet implementation is a multi-layer network consisting of one input layer, a combination of five convolutional and pooling layers and three fully-connected layers. The convolutional layers of the AlexNet implementation capture progressively more complex edge information, while the pooling layers apply subsampling. The fully-connected layers specify how the information captured from previous layers is combined to compute the probability that an image has a given label. These layers are controlled by about 60 million parameters that are initialized with AlexNet and optimised using the training dataset I.” [0056-0057])
14.	Claim 11 is similar in scope to claim 1, and thus is rejected under similar rationale. Ponjou Tasse does not explicitly teach A system comprising: a memory component; and a processing device coupled to the memory component, the processing device configured to perform operations. This is what Turkelson teaches (“Some aspects include a system, including: one or more processors; and memory storing instructions that when executed by the processors cause the processors to effectuate operations of each the above-mentioned processes.” [0017]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into Ponjou Tasse, in order to enhance a training set for a visual search process.
15.	Claims 12-14 are similar in scope to claims 2-4, and they are rejected under similar rationale.
16.	Claim 17 is similar in scope to claim 1, and thus is rejected under similar rationale. Ponjou Tasse does not explicitly teach A non-transitory computer-readable medium storing executable instructions, which when executed by a processing device, cause the processing device to perform operations. This is what Turkelson teaches (“Some aspects include a tangible, non-transitory, machine-readable medium storing instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations including each of the above-mentioned processes.” [0016]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into Ponjou Tasse, in order to enhance a training set for a visual search process.
17.	Claim 18 is similar in scope to claim 2, and thus is rejected under similar rationale.
18.	Claim(s) 5-8, 15, 16, 19 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ponjou Tasse et al. (US 2020/0104318 A1) and Liang et al. (US 2020/0005015 A1), as applied to claims 1, 11 and 17 above, and further in view of Turkelson et al. (US 2021/0004589 A1). 
19.	With reference to claim 5, the combination of Ponjou Tasse and Liang does not explicitly teach normalizing the encoded representation of the input, wherein the encoded representations of the plurality of sample three-dimensional object representations in the search space are normalized, and wherein the searching comprises comparing the normalized encoded representation of the input to the normalized encoded representations in the search space. This is what Turkelson teaches (“a classifier may be trained using extracted features from an earlier layer of the machine learning model. In some embodiments, preprocessing may be performed to an input image prior to the feature extraction being performed. For example, preprocessing may include resizing, normalizing, cropping, etc., to each image to allow that image to serve as an input to the pre-trained model. Example pre-trained networks may include AlexNet, GoogLeNet, MobileNet-v2, and others. The preprocessing input images may be fed to the pre-trained model, which may extract features, and those features may then be used to train a classifier (e.g., SVM). 
In some embodiments, the input images, the features extracted from each of the input images, an identifier labeling each of the input image, or any other aspect capable of being used to describe each input image, or a combination thereof, may be stored in memory (e.g., within training data database 136A as an update to training data set for training an object recognition model, a context classification model, etc.).” [0084] “Some embodiments may include the trained computer-vision object recognition model having parameters that encode information about a subset of visual features of the object depicted by each image from the training data set.” [0088] “image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “receive a query image, pass the image to a deep neural network that extracts deep features, before computing distances to all images in the index and presenting a nearest neighbor as a search result. Some embodiments may receive a query image (e.g., a URL of a selected online image hosted on a website, a captured image from a mobile device camera, or a sketch drawn by a user in a bitmap editor) and determine the nearest neighbor, computing its distance in vector space.” [0158] “the trained computer-vision object recognition model may extract one or more visual features describing the new image. The visual features may be compared to the visual features extracted from each of the images from the training data set to determine a similarity between the visual features of the new image and the visual features of the images from the training data set. 
In some embodiments, the visual features of the new image and the visual features of the images from the training data set may be represented as feature vectors in an n-dimensional feature space.” [0189]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse and Liang, in order to enhance a training set for a visual search process.
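As general background on the claim 5 limitation: when both the query encoding and the indexed encodings are scaled to unit length, squared Euclidean distance is a decreasing function of cosine similarity, so ranking candidates by either measure gives the same order. The identity below is standard linear algebra and is not stated in Turkelson or in the claim; the symbols q (query encoding) and x (indexed encoding) are generic placeholders.

```latex
\hat{q} = \frac{q}{\lVert q \rVert_2}, \qquad
\hat{x} = \frac{x}{\lVert x \rVert_2}
\quad\Longrightarrow\quad
\lVert \hat{q} - \hat{x} \rVert_2^{2}
  = \lVert \hat{q} \rVert_2^{2} + \lVert \hat{x} \rVert_2^{2} - 2\,\hat{q}^{\top}\hat{x}
  = 2 - 2\,\hat{q}^{\top}\hat{x}
```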
20.	With reference to claim 6, Ponjou Tasse does not explicitly teach searching the search space using nearest neighbors comprises minimizing an L2 distance between one or more of the encoded representations of multiple views of the plurality of sample three-dimensional object representations in the search space and the encoded input. Liang teaches searching the search space using nearest neighbors (“Spatial proximity estimation is another embodiment of the disclosed 3D spatial query system and method, that explores the distribution of target objects in 3D space given a set of basic objects. In 3D digital pathology, for instance, the spatial distribution of different types of vessels in liver organ is useful in providing a quantitative measurement of disease progression. 3D proximity estimation is a complex spatial query involving multiple objects. It is based on nearest neighbor search and demands accurate distance computation for global spatial pattern discovery.” [0200]) That is, Liang teaches 3D proximity estimation that involves multiple objects and is based on nearest neighbor search. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Liang into Ponjou Tasse, in order to provide options for users to decide and tailor their goals for faster queries or higher accuracy to meet application specific requirements.
The combination of Ponjou Tasse and Liang does not explicitly teach minimizing an L2 distance between one or more of the encoded representations of multiple views of the plurality of sample three-dimensional object representations in the search space and the encoded input. This is what Turkelson teaches (“a classifier may be trained using extracted features from an earlier layer of the machine learning model. In some embodiments, preprocessing may be performed to an input image prior to the feature extraction being performed. For example, preprocessing may include resizing, normalizing, cropping, etc., to each image to allow that image to serve as an input to the pre-trained model. Example pre-trained networks may include AlexNet, GoogLeNet, MobileNet-v2, and others. The preprocessing input images may be fed to the pre-trained model, which may extract features, and those features may then be used to train a classifier (e.g., SVM). In some embodiments, the input images, the features extracted from each of the input images, an identifier labeling each of the input image, or any other aspect capable of being used to describe each input image, or a combination thereof, may be stored in memory (e.g., within training data database 136A as an update to training data set for training an object recognition model, a context classification model, etc.).” [0084] “Some embodiments may include the trained computer-vision object recognition model having parameters that encode information about a subset of visual features of the object depicted by each image from the training data set.” [0088] “image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “receive a query image, pass the image to a deep neural network that extracts deep features, before computing distances to all images in the index and presenting a nearest 
neighbor as a search result. Some embodiments may receive a query image (e.g., a URL of a selected online image hosted on a website, a captured image from a mobile device camera, or a sketch drawn by a user in a bitmap editor) and determine the nearest neighbor, computing its distance in vector space. Based on the distance (e.g., if the distance is less than 0.05 on a scale of 0-1), embodiments may designate the search was successful with a value indicating relatively high confidence, and embodiments may add the query image to the product catalog as ground truth to the index.” [0158-0159] “the trained computer-vision object recognition model may extract one or more visual features describing the new image. The visual features may be compared to the visual features extracted from each of the images from the training data set to determine a similarity between the visual features of the new image and the visual features of the images from the training data set. In some embodiments, the visual features of the new image and the visual features of the images from the training data set may be represented as feature vectors in an n-dimensional feature space.” [0189] “the candidate video may include video of depicting the object from a first perspective (e.g., head-on) for a first amount of time (e.g., four seconds), followed by video depicting the object from a second perspective (e.g., a side view) for a second amount of time (e.g., five seconds). 
Mobile computing device 104 may be configured to continually obtain the video for a predefined amount of time (e.g., 10 seconds, 30 seconds, 1 minute, etc.), until a threshold number of images are obtained (e.g., 10 or more images, 20 or more images, 50 or more images), until images captured by the video satisfy a threshold number of criteria (e.g., a threshold number of perspective views of the object are obtained, a threshold number of lighting conditions are obtained, etc.), or a combination thereof.” [0246]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse and Liang, in order to enhance a training set for a visual search process.
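The claim 6 limitation and the Turkelson passages quoted above ([0158-0159]) both turn on computing distances from a query vector to every indexed vector and taking the minimum. A brute-force Python sketch of that step follows, assuming an index that stores several view encodings per object. The index layout, the names, and the toy vectors are hypothetical, and the 0.05 cutoff only echoes the example value in Turkelson [0159] without reproducing its distance scale.

```python
import math

def nearest_object(query, view_index, max_distance=None):
    """Brute-force L2 nearest neighbor over (object_id, view_vector) pairs."""
    best_id, best_dist = None, math.inf
    for object_id, vector in view_index:
        d = math.dist(query, vector)  # Euclidean (L2) distance
        if d < best_dist:
            best_id, best_dist = object_id, d
    # Optionally reject matches that are too far away to be trusted.
    if max_distance is not None and best_dist > max_distance:
        return None, best_dist
    return best_id, best_dist

# Hypothetical index: two view encodings for mesh_a, one for mesh_b.
view_index = [
    ("mesh_a", [0.0, 0.0]),
    ("mesh_a", [0.0, 1.0]),
    ("mesh_b", [3.0, 4.0]),
]
match, distance = nearest_object([0.0, 0.9], view_index)
far_match, far_distance = nearest_object([3.0, 0.0], view_index, max_distance=0.05)
```

A linear scan like this is exact but grows with the number of stored views; large collections would typically use an approximate nearest neighbor index instead, which is outside what the quoted passages describe.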
21.	With reference to claim 7, the combination of Ponjou Tasse and Liang does not explicitly teach the multiple views of each of the plurality of sample three-dimensional object representations comprises at least one hundred views from predetermined viewpoints. This is what Turkelson teaches (“image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “the candidate video may include video of depicting the object from a first perspective (e.g., head-on) for a first amount of time (e.g., four seconds), followed by video depicting the object from a second perspective (e.g., a side view) for a second amount of time (e.g., five seconds). Mobile computing device 104 may be configured to continually obtain the video for a predefined amount of time (e.g., 10 seconds, 30 seconds, 1 minute, etc.), until a threshold number of images are obtained (e.g., 10 or more images, 20 or more images, 50 or more images), until images captured by the video satisfy a threshold number of criteria (e.g., a threshold number of perspective views of the object are obtained, a threshold number of lighting conditions are obtained, etc.), or a combination thereof.” [0246]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse and Liang, in order to enhance a training set for a visual search process.
22.	With reference to claim 8, the combination of Ponjou Tasse and Liang does not explicitly teach the multiple views of each of the plurality of sample three-dimensional object representations comprise views from predetermined viewpoints, and one or more of views with varying lighting or views with varying texture. This is what Turkelson teaches (“image capture components 508A may include one or more cameras configured to capture two-dimensional images, three-dimensional images, high definition images, videos, time series images, image bursts, and the like.” [0118] “the candidate video may include video of depicting the object from a first perspective (e.g., head-on) for a first amount of time (e.g., four seconds), followed by video depicting the object from a second perspective (e.g., a side view) for a second amount of time (e.g., five seconds). Mobile computing device 104 may be configured to continually obtain the video for a predefined amount of time (e.g., 10 seconds, 30 seconds, 1 minute, etc.), until a threshold number of images are obtained (e.g., 10 or more images, 20 or more images, 50 or more images), until images captured by the video satisfy a threshold number of criteria (e.g., a threshold number of perspective views of the object are obtained, a threshold number of lighting conditions are obtained, etc.), or a combination thereof.” [0246]) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Turkelson into the combination of Ponjou Tasse and Liang, in order to enhance a training set for a visual search process.
23.	Claims 15 and 19 are similar in scope to claim 5, and they are rejected under similar rationale.
24.	Claims 16 and 20 are similar in scope to claim 8, and they are rejected under similar rationale.
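For illustration only, the nearest-neighbor confidence check described in the Turkelson passage quoted above ([0158-0159]) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any cited reference: the cosine distance rescaled to a 0-1 range, the function names, and the toy index are all hypothetical; only the 0.05 threshold comes from the quoted text.

```python
import math

def cosine_distance_01(a, b):
    """Cosine distance rescaled to [0, 1] (assumed metric; the quote names none)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = dot / (norm_a * norm_b)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return (1.0 - cos) / 2.0

def nearest_neighbor(query, index):
    """Return (item_id, distance) for the indexed vector closest to the query."""
    scored = ((item_id, cosine_distance_01(query, vec)) for item_id, vec in index.items())
    return min(scored, key=lambda pair: pair[1])

def search(query, index, threshold=0.05):
    """Flag the result as high confidence when the distance is below the threshold."""
    item_id, distance = nearest_neighbor(query, index)
    return {"match": item_id, "distance": distance, "high_confidence": distance < threshold}

# Hypothetical toy index of feature vectors keyed by asset name.
index = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
result = search([0.9, 0.1, 0.0], index)
```

In this toy index the query lies close to the "chair" vector, so the result is flagged as high confidence; a query equidistant from both entries would not be.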

Conclusion
25.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
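As a rough illustration of the reply periods stated above, the sketch below computes the two-month mark, the end of the three-month shortened statutory period, and the six-month outer limit from a mailing date. It rests on assumptions: the Jan 27, 2026 Final Rejection date shown in the Prosecution Timeline is taken as the mailing date, the helper names are hypothetical, and the month arithmetic clamps to the last day of shorter months. It does not model weekend or holiday adjustments, extensions under 37 CFR 1.136(a), or the advisory-action rule, so the docketed dates should be confirmed independently.

```python
import calendar
from datetime import date

def add_months(d, months):
    """Return the same day-of-month `months` later, clamped to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def reply_deadlines(mailing_date):
    """Key dates measured from the mailing date of a final action (simplified)."""
    return {
        "two_month_mark": add_months(mailing_date, 2),        # first-reply window
        "shortened_period_end": add_months(mailing_date, 3),  # THREE MONTHS
        "statutory_limit": add_months(mailing_date, 6),       # SIX MONTHS outer limit
    }

# Assumed mailing date, taken from the timeline entry for the Final Rejection.
deadlines = reply_deadlines(date(2026, 1, 27))
```

Under these assumptions the three dates fall on Mar 27, 2026, Apr 27, 2026, and Jul 27, 2026, respectively.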
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michelle Chin whose telephone number is (571)270-3697.  The examiner can normally be reached on Monday-Friday 8:00 AM-4:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kent Chang, can be reached on (571)272-7667.  The fax phone number for the organization where this application or proceeding is assigned is (571)273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MICHELLE CHIN/
Primary Examiner, Art Unit 2614
Prosecution Timeline

Sep 28, 2023: Application Filed
Sep 30, 2025: Non-Final Rejection §103
Dec 11, 2025: Interview Requested
Dec 17, 2025: Applicant Interview (Telephonic)
Dec 18, 2025: Examiner Interview Summary
Jan 02, 2026: Response Filed
Jan 27, 2026: Final Rejection §103
Mar 13, 2026: Interview Requested
Mar 19, 2026: Applicant Interview (Telephonic)
Mar 19, 2026: Examiner Interview Summary
Precedent Cases

Applications granted by this same examiner with similar technology

18/419,443, Patent 12602870: COMPUTER-AIDED TECHNIQUES FOR DESIGNING 3D SURFACES BASED ON GRADIENT SPECIFICATIONS (2y 5m to grant, granted Apr 14, 2026)
18/211,823, Patent 12597205: HYBRID GPU-CPU APPROACH FOR MESH GENERATION AND ADAPTIVE MESH REFINEMENT (2y 5m to grant, granted Apr 07, 2026)
18/568,680, Patent 12592041: MIXED SHEET EXTENSION (2y 5m to grant, granted Mar 31, 2026)
18/540,069, Patent 12586287: Method of Operating Shared GPU Resource and a Shared GPU Device (2y 5m to grant, granted Mar 24, 2026)
18/202,631, Patent 12579700: METHODS OF IMPERSONATION IN STREAMING MEDIA (2y 5m to grant, granted Mar 17, 2026)
Study what changed to get past this examiner. Based on the 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 85%
With Interview (+11.5%): 97%
Median Time to Grant: 2y 4m
PTA Risk: Moderate
Based on 634 resolved cases by this examiner. Grant probability derived from career allow rate.