Prosecution Insights
Last updated: April 19, 2026
Application No. 18/633,828

SYSTEM AND METHOD FOR CONVERSATIONAL SHOPPING BASED ON MACHINE LEARNING

Status: Non-Final OA (§101, §102, §103, §112)
Filed: Apr 12, 2024
Examiner: SORRIN, AARON JOSEPH
Art Unit: 2672
Tech Center: 2600 — Communications
Assignee: Walmart Apollo, LLC
OA Round: 1 (Non-Final)
Grant Probability: 74% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 5m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 74% (46 granted / 62 resolved; +12.2% vs TC avg), above average
Interview Lift: +50.6%, a strong lift, measured on resolved cases with interview
Typical Timeline: 3y 5m average prosecution, with 22 applications currently pending
Career History: 84 total applications across all art units
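The headline figures above reduce to simple arithmetic over the examiner's resolved cases. A minimal sketch, assuming each resolved case carries its outcome and an interview flag; the sample records below are hypothetical, while the page's real figures come from 62 resolved cases:

```python
# Hypothetical resolved-case records: (granted, interviewed).
cases = [
    (True, True), (True, False), (False, False),
    (True, True), (False, False), (True, False),
]

# Career allow rate: grants over resolved cases (e.g. 46/62 = 74.2%).
allow_rate = sum(g for g, _ in cases) / len(cases)

with_iv = [g for g, iv in cases if iv]
without_iv = [g for g, iv in cases if not iv]
# Interview lift: allow rate with an interview minus allow rate without,
# in percentage points (the +50.6% shown above).
lift = sum(with_iv) / len(with_iv) - sum(without_iv) / len(without_iv)
```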

Statute-Specific Performance

§101: 20.4% (-19.6% vs TC avg)
§102: 14.1% (-25.9% vs TC avg)
§103: 35.6% (-4.4% vs TC avg)
§112: 29.3% (-10.7% vs TC avg)
Tech Center averages are estimates. Based on career data from 62 resolved cases.
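The per-statute figures follow the same pattern; as read here, each is presumably the share of the examiner's resolved cases receiving that rejection type that were ultimately allowed, compared against a Tech Center estimate. A sketch of that computation, with hypothetical records and averages:

```python
from collections import defaultdict

# Hypothetical records: each resolved case lists the statutes cited
# against it and whether the case was ultimately granted.
cases = [
    {"statutes": {"101", "103"}, "granted": False},
    {"statutes": {"102"}, "granted": True},
    {"statutes": {"112", "103"}, "granted": True},
]
tc_avg = {"101": 0.40, "102": 0.40, "103": 0.40, "112": 0.40}  # estimates

totals, grants = defaultdict(int), defaultdict(int)
for case in cases:
    for statute in case["statutes"]:
        totals[statute] += 1
        grants[statute] += case["granted"]

for statute in sorted(totals):
    rate = grants[statute] / totals[statute]
    print(f"§{statute}: {rate:.1%} ({rate - tc_avg[statute]:+.1%} vs TC avg)")
```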

Office Action

DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 04/18/2025 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Specification

The abstract of the disclosure is objected to because of improper grammar, particularly for “systems and methods conversational shopping” in the first line, which should recite “systems and methods for conversational shopping”. A corrected abstract of the disclosure is required and must be presented on a separate sheet, apart from any other text. See MPEP § 608.01(b).

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):

(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 3-8 and 12-17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

Claim 3 (and similarly claim 12) recites, “filter the candidate image pairs based on at least one condition over the converted item attributes to generate filtered image pairs;”. The underlined limitation is unclear and indefinite. It is being interpreted such that the candidate image pairs are filtered based on a condition relating to the converted item attributes to generate filtered image pairs. Claims 4-8 and 13-17 are rejected as dependent on claims 3 and 12.

Claim 7 (and similarly claim 16) recites “whose two images” in two instances. There is insufficient antecedent basis for this limitation in the claim. It is being interpreted as a new element.
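For readers less familiar with this kind of limitation, the examiner's stated interpretation reduces to a predicate filter over attribute-annotated image pairs. A minimal sketch of that reading; all names and the example condition are illustrative, not drawn from the application:

```python
# Sketch of the examiner's interpretation of the claim 3 limitation:
# candidate image pairs are kept only when at least one condition over
# the pair's converted item attributes holds. Names are illustrative.

def filter_image_pairs(candidate_pairs, conditions):
    """Keep the pairs whose converted item attributes satisfy at least
    one of the given conditions."""
    return [
        pair for pair in candidate_pairs
        if any(cond(pair["attrs_a"], pair["attrs_b"]) for cond in conditions)
    ]

# Hypothetical condition: same product category but differing color, a
# plausible criterion for building "same item, different look" pairs.
def same_category_different_color(a, b):
    return a.get("category") == b.get("category") and a.get("color") != b.get("color")
```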
Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1, 9, 10, 18, and 19 are rejected under 35 U.S.C. 101.

Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to the abstract idea of identifying a product in response to a search query, without significantly more. The claim recites: “A system, comprising: a non-transitory memory having instructions stored thereon; and at least one processor operatively coupled to the non-transitory memory, and configured to read the instructions to: obtain, from a computing device, a search request identifying a reference image representing a first product and a textual query associated with the reference image, compute a text embedding in an embedding space based on the textual query, compute an image embedding in the embedding space based on the reference image, determine, based on at least one machine learning model, a target image representing a second product based on the text embedding and the image embedding, and transmit, to the computing device, the target image in response to the search request.”

The limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitation in the mind. First, the receiving of the text and image queries amounts to insignificant extra-solution activity (data gathering). A person can mentally convert the queries into embeddings in a shared embedding space. This could, as a non-limiting example, amount to mentally envisioning a blue car in response to the image query of a car and the text query “blue”. As another example, the person could mentally convert the image query and text query into vectors (or a combined vector) in a shared embedding space. This amounts to identifying a numerical string corresponding to the blue car query. The person could then visually search a catalog of products and identify a blue car (a target image based on the embeddings). The person can then present the blue car image to another person (transmitting the target image).

This judicial exception is not integrated into a practical application. In particular, the claim recites the additional elements of a non-transitory memory, a processor, and a computing device. These are recited at a high level of generality such that they amount to generic, standard computer components for the performance of the abstract idea. The claim also recites a machine learning model, which is recited at a high level of generality such that it amounts to a generic learning model for the performance of the abstract idea. Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.

The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration into a practical application, the additional elements are recited at a high level of generality. The claim is therefore directed to a judicial exception that is not integrated into a practical application and does not include additional elements sufficient to amount to significantly more than the judicial exception. This claim is not patent eligible.

Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to the abstract idea of performing a search similar to claim 1, wherein a previously determined image is used as a new query image (the previous search is refined based on the search result). This amounts to a mental process for the same reasons as claim 1. The claim is not patent eligible.
Claims 10 and 18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a method analogous to the abstract idea of claims 1 and 9. The computing device is recited at a high level of generality such that it amounts to no more than a generic computer component for execution of the abstract idea. The machine learning model is recited at a high level of generality such that it amounts to a generic learning model for the performance of the abstract idea. These claims are not patent eligible.

Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to a non-transitory computer readable medium having instructions that are analogous to the abstract idea of claim 1. The non-transitory computer readable medium, processor, device, and computing device are all recited at a high level of generality such that they amount to no more than generic computer components for storage and execution of the abstract idea. The machine learning model is recited at a high level of generality such that it amounts to a generic learning model for the performance of the abstract idea. This claim is not patent eligible.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 2, 10, 11, 19, and 20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Zhu (US 12210516 B1).

Regarding claim 1, Zhu teaches “A system, comprising: a non-transitory memory having instructions stored thereon; and at least one processor operatively coupled to the non-transitory memory,” (Zhu, Figure 11 shows memory in communication with a processor; Column 18 line 56 through Column 19 line 4: “Storage media and other non-transitory computer readable media for containing code, or portions of code, can include any appropriate media known or used in the art, such as but not limited to volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, including RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices or any other medium which can be used to store the desired information and which can be accessed by a system device.”)

Zhu further teaches “and configured to read the instructions to: obtain, from a computing device, a search request identifying a reference image representing a first product and a textual query associated with the reference image,”
(Zhu, Column 2, lines 30-46: “FIG. 1 illustrates components of an example computing environment 100 that can be used to implement aspects of the various embodiments. In this example, inputs are shown as a first input 102 (which in this example may correspond to an image input) and a second input 104 (which in this example may correspond to a text input). It should be appreciated that “first” and “second” are used to label the inputs and not to identify the order in which they are used or processed by the system, as the first input may be provided first or the second input may be provided first. The inputs may be provided over one or more computer networks, for example using a client device (e.g., user device), as will be described herein. The client device can be any appropriate computing device capable of generating such inputs, as may include a smartphone, desktop, set-top box (e.g., Fire TV), voice-enabled device (e.g., Echo), or tablet computer.” Note that these inputs correspond to a product (see element 682 in Figure 6F).)

Zhu also teaches “compute a text embedding in an embedding space based on the textual query, compute an image embedding in the embedding space based on the reference image,” (Zhu, Column 4, lines 38-53: “One or more trained MIM models may be used to generate vector representations of inputs, which may be referred to as feature vectors. The feature vectors may all be associated with a shared space representation where both text and image representations (among other options, such as audio and image representations) are aligned within a common embedding space so that matching can be carried out interchangeably across modalities. For example, during training, the MIM models may be trained using images that include an associated label, such as an image and a title that would appear within an e-commerce environment. Training may be performed simultaneously or substantially simultaneously. Additionally, in embodiments, training may be performed separately, for example on a number of different models, and then an alignment model may be used to align different results within a common vector space.”)

Finally, Zhu teaches “determine, based on at least one machine learning model, a target image representing a second product based on the text embedding and the image embedding, and transmit, to the computing device, the target image in response to the search request.” (Zhu, Figure 6F and Column 4, lines 38-53, quoted above; Figure 6F shows the target images, transmitted to the user device responsive to the search, representing a second product (tie) based on the search embeddings.)
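To make the mapped limitations concrete: encoding a text query and a reference image into one shared embedding space is the core of the MIM-style setup Zhu describes. The sketch below uses toy stand-in encoders (seeded random projections) rather than trained models; everything here is illustrative, not Zhu's code:

```python
import numpy as np

DIM = 512  # dimensionality of the shared embedding space (arbitrary here)

def _unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def embed_text(text: str) -> np.ndarray:
    """Toy stand-in for a trained text encoder: a seeded random projection.
    A real system would use a trained model aligned with the image encoder."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return _unit(rng.standard_normal(DIM))

def embed_image(image_bytes: bytes) -> np.ndarray:
    """Toy stand-in for a trained image encoder producing vectors in the
    same shared space as embed_text."""
    rng = np.random.default_rng(int.from_bytes(image_bytes[:8], "big") or 1)
    return _unit(rng.standard_normal(DIM))

text_vec = embed_text("blue")               # text feature vector
image_vec = embed_image(b"reference-img")   # image feature vector, same space
```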
Regarding claim 2, Zhu teaches “The system of claim 1,” “wherein the target image is determined based on: combining, using a machine learning model, the text embedding and the image embedding with weights pre-determined based on a training process of the machine learning model, to generate a combined embedding representing an aggregate of the textual query and the reference image in the embedding space;” (Zhu, Figure 5 shows the text embedding and image embedding (feature vectors) combined at the combiner using pre-determined weights to generate the combined vector (an aggregate of the queries in the embedding space). Zhu, Column 9, lines 9-26, further describes Figure 5 and the use of a machine learning model for the weighting and combining: “These inputs 502, 504 may then be provided to the MIM models 108A, 108B, as noted above, to generate respective image and text feature vectors 506, 508. The respective feature vectors 506, 508 are then provided to the vector generator 110 to be combined 510 according to one or more weights 512. For example, the weights 512 may be predetermined based, at least in part, on a category of the search. In other examples, the weights 512 may be adjusted dynamically by a user. It should be appreciated that weights 512 may be updated and adjusted over time, for example as information regarding search accuracy is collected. As previously noted, the use of weights 512 with the vector generator 110 is provided by way of example and is not intended to limit the scope of the present disclosure. For example, the weights 512 may correspond to neural network information and the combiner may be a trained network that receives the input vectors 506, 508 to produce a combined feature vector output.” Note that one skilled in the art would understand that the weights of a neural network are established through training processes.)

Zhu further teaches “and determining the target image based on the combined embedding using the machine learning model, wherein the target image is an image whose embedding is closest to the combined embedding in the embedding space.” (Zhu, Figure 5 element 514, Figure 8, and Column 4, lines 9-19: “This combined vector may be representative of one or more features of both the first image 102 and the second text 104. The combined feature vector may then be evaluated against a MIM index 112, which may have various items associated with a certain environment pre-mapped and indexed. As noted, the items within the MIM index 112 may be mapped to a common vector space, and as a result, the combined vector, which includes components of both the first input 102 and the second input 104, may provide an improved result 114 responsive to an initial input query.” Figure 5 shows target images determined in search result 514, which are obtained via matching of the query against the MIM index (i.e., determining a closest match or a set of closest matches). Figure 8 further shows that the items of the catalogue, from which the target image is selected, are converted into embeddings in “a common vector space” (the embedding space) for comparison against the query.)
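The claim 2 mapping, weighted combination followed by closest-embedding retrieval, is then a few lines of vector arithmetic. A sketch continuing the toy helpers above; the 0.5/0.5 weights stand in for whatever a trained combiner would learn:

```python
def combine(text_vec, image_vec, w_text=0.5, w_image=0.5):
    """Weighted combination of the two feature vectors into one query
    vector (the role of Zhu's combiner; these weights are illustrative)."""
    return _unit(w_text * text_vec + w_image * image_vec)

def nearest(query_vec, index):
    """Return the catalog entry whose pre-computed embedding is closest
    to the query (cosine similarity; vectors are unit-normalized)."""
    return max(index, key=lambda item: float(item["vec"] @ query_vec))

# Catalog items pre-mapped into the shared space, akin to Zhu's MIM index.
catalog = [
    {"sku": "tie-001", "vec": embed_image(b"tie-red")},
    {"sku": "tie-002", "vec": embed_image(b"tie-blue")},
]
target = nearest(combine(text_vec, image_vec), catalog)  # the "target image"
```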
Regarding claims 10 and 11, these claims recite a method with steps corresponding to the elements of the system recited in claims 1 and 2. The recited steps of these claims are therefore mapped to the analogous elements in the corresponding system claims.

Regarding claims 19 and 20, these claims recite a non-transitory computer readable medium having instructions corresponding to the steps recited in claims 10 and 11. The recited programming instructions of these claims are therefore mapped to the analogous steps in the corresponding method claims. Zhu discloses a non-transitory computer readable medium having instructions (Zhu, Column 18 line 56 through Column 19 line 4, quoted in the rejection of claim 1 above).

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu in view of Bursztyn (US 20230161808 A1).

Regarding claim 9, Zhu teaches “The system of claim 1,” “wherein the at least one processor is configured to: obtain, from the computing device, a new textual query after the target image is transmitted; compute a new text embedding in the embedding space based on the new textual query;” (The invention of Zhu as recited in the rejection of claim 1 is implicitly reusable for a plurality of searches after a first search is performed and the first search result is transmitted. This enables users to input a second (new) textual query, such as element 504 in Figure 5, which is then converted to a text embedding (text feature vector), such as element 508 in Figure 5.)

While Zhu teaches the use of a new reference (query) image associated with the new textual query, determining an associated new image embedding, generating a new target image representing a third product (search result) based on the newly searched image and textual embeddings, and transferring the new target image to the computing device (as described above, the invention of Zhu is implicitly reusable for a plurality of searches, including a second search image input along with the new text query, wherein new embeddings are generated for each query, new matching products are identified based on the new embeddings, and their images (target images) are transmitted; for example, the search process of Figure 5 includes these elements and is implicitly repeatable), Zhu does not expressly disclose that the target image (the prior search output) can be used as the new reference image (query image) for the additional product searching.
Bursztyn teaches the use of a prior search output as a new reference image (query image) for an additional search that includes a new textual query (Bursztyn, Paragraph 30: “In the example of FIG. 1, one or more users 100 can provide an initial query to image search apparatus 110 via user device 105 and cloud 115. Image search apparatus 110 can retrieve results from database 120 via cloud 115 based on the query. A user can select a reference image from the results and input a critique of the reference image. Image search apparatus 110 can generate a preference statement based on the critique, retrieve one or more second images from database 120 based on a combination of the reference image, critique, and preference statement, and return the one or more second images to the one or more users 100.”)

It would have been obvious to a person having ordinary skill in the art, before the effective filing date of the claimed invention of the instant application, to replace the reference image of Zhu with an image from a prior search result, as disclosed by Bursztyn. The motivation for doing so would have been to improve upon search results already generated rather than starting from scratch with a new query, especially when the search results are close to the user’s desired product but require only a slight modification. Further, one skilled in the art could have combined the elements as described above by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Zhu with the above teaching of Bursztyn to fully disclose: “determine the target image as a new reference image associated with the new textual query; determine a new image embedding in the embedding space based on the target image; determine, based on the at least one machine learning model, a new target image representing a third product based on the new text embedding and the new image embedding; and transmit the new target image to the computing device.”

Regarding claim 18, claim 18 recites a method with steps corresponding to the elements of the system recited in claim 9. The recited steps of this claim are therefore mapped to the analogous elements in the corresponding system claim. The rationale and motivation for combining the Zhu and Bursztyn references apply here.
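The combination the rejection proposes is, in effect, a feedback loop: the returned target image becomes the next reference image. A sketch of that loop, reusing the helpers above; the flow paraphrases the rejection, not either reference's actual code:

```python
def search(reference_image: bytes, textual_query: str, catalog):
    """One round of multi-modal search: embed both inputs, combine them,
    and return the closest catalog entry (the target image)."""
    query = combine(embed_text(textual_query), embed_image(reference_image))
    return nearest(query, catalog)

# Round 1: the user supplies a reference image and a textual query.
result = search(b"user-photo", "blue tie", catalog)

# Round 2 (claim 9, per Bursztyn): the prior target image is reused as the
# new reference image, paired with a new textual query (the "critique").
new_reference = b"tie-blue"  # stand-in for the image bytes behind `result`
refined = search(new_reference, "darker shade", catalog)
```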
Allowable Subject Matter

Claims 3-8 and 12-17 are rejected under 35 USC 112(b) and objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims, and amended to overcome the 35 USC 112(b) rejection.

The following is a statement of reasons for the indication of allowable subject matter. With respect to claims 3 and 12 (and their respective dependent claims), in addition to other limitations in the claims, the Prior Art of Record fails to teach, disclose, or render obvious the applicant's invention as claimed. In particular, claim 3 (and similarly claim 12) recites:

“The system of claim 2, wherein, during the training process of the machine learning model, the at least one processor is configured to: extract item attributes from a training dataset; convert jargons in the item attributes to generate converted item attributes; generate candidate image pairs for training the machine learning model; filter the candidate image pairs based on at least one condition over the converted item attributes to generate filtered image pairs; generate captions for the filtered image pairs based on the converted item attributes; and train, using the captions as labels, the machine learning model based on the filtered image pairs.”

Zhu teaches multi-modal product searching in which a text and image query are transformed into embeddings in a common embedding space for comparison to product catalogue embeddings. Bursztyn teaches multimodal searching in which a user preference statement (text query) is used in combination with a reference image for image searching. Zhang (US 20230245418 A1) teaches an image search method wherein a text query embedding is compared against embeddings generated based on product text and image information to determine if products are relevant to the user’s search. Baltescu (US 20230252550 A1) teaches a machine learning model that generates embeddings from product information comprising image and textual information, wherein the products are included in a product catalogue. Wang (US 20230368268 A1) teaches the generation of a multi-modal embedding based on user query, selected items, and conversation history, wherein a recommender system is configured to recommend items to a user based on the embedding. Forsyth (US 20200311798 A1) teaches an image search method wherein a user inputs an image of an item, and the system returns a set of images that are similar to the query image along with text describing the relevance of each returned image to the query image. However, none of these references discloses the claim recited above.
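For orientation, the indicated-allowable training pipeline of claim 3 chains the quoted steps: extract attributes, convert jargon, pair, filter, caption, train. A skeletal sketch under the same hypothetical naming as the §112 filter example above; the jargon table, caption template, and helper stubs are invented for illustration:

```python
JARGON = {"slvs": "sleeveless", "nwt": "new with tags"}  # invented examples

def extract_attributes(item: dict) -> dict:
    """Extract item attributes from a training record (stub)."""
    return dict(item.get("attrs", {}))

def convert_jargon(attrs: dict) -> dict:
    """Convert jargon in the item attributes to plain terms."""
    return {k: JARGON.get(v, v) for k, v in attrs.items()}

def generate_candidate_pairs(items):
    """Naive candidate generation: pair every item with every other item."""
    return [
        {"images": (a["image"], b["image"]),
         "attrs_a": a["attrs"], "attrs_b": b["attrs"]}
        for i, a in enumerate(items) for b in items[i + 1:]
    ]

def build_training_pairs(dataset, conditions):
    # Extract and convert attributes for each training record.
    items = [{"image": it["image"],
              "attrs": convert_jargon(extract_attributes(it))}
             for it in dataset]
    # Generate candidates, then filter (see the §112 sketch above).
    pairs = filter_image_pairs(generate_candidate_pairs(items), conditions)
    # Generate captions from the converted attributes; captions then
    # serve as labels when training the model on the filtered pairs.
    for pair in pairs:
        a, b = pair["attrs_a"], pair["attrs_b"]
        pair["caption"] = f"{a.get('color')} and {b.get('color')} {a.get('category')}"
    return pairs
```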
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AARON JOSEPH SORRIN, whose telephone number is (703) 756-1565. The examiner can normally be reached Monday - Friday, 9am - 5pm.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sumati Lefkowitz, can be reached at (571) 272-3638. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/AARON JOSEPH SORRIN/
Examiner, Art Unit 2672

/SUMATI LEFKOWITZ/
Supervisory Patent Examiner, Art Unit 2672

Prosecution Timeline

Apr 12, 2024
Application Filed
Mar 10, 2026
Non-Final Rejection — §101, §102, §103, §112 (current)

Precedent Cases

Applications granted by the same examiner involving similar technology

Patent 12592054: LOW-LIGHT VIDEO PROCESSING METHOD, DEVICE AND STORAGE MEDIUM
Granted Mar 31, 2026 (2y 5m to grant)
Patent 12586245: ROBUST LIDAR-TO-CAMERA SENSOR ALIGNMENT
Granted Mar 24, 2026 (2y 5m to grant)
Patent 12566954: SOLVING MULTIPLE TASKS SIMULTANEOUSLY USING CAPSULE NEURAL NETWORKS
Granted Mar 03, 2026 (2y 5m to grant)
Patent 12555394: IMAGE PROCESSING APPARATUS, METHOD, AND STORAGE MEDIUM FOR GENERATING DATA BASED ON A CAPTURED IMAGE
Granted Feb 17, 2026 (2y 5m to grant)
Patent 12547658: RETRIEVING DIGITAL IMAGES IN RESPONSE TO SEARCH QUERIES FOR SEARCH-DRIVEN IMAGE EDITING
Granted Feb 10, 2026 (2y 5m to grant)
Study what changed to get past this examiner, based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 74% (99% with interview, a +50.6% lift)
Median Time to Grant: 3y 5m
PTA (Patent Term Adjustment) Risk: Low
Based on 62 resolved cases by this examiner. Grant probability is derived from the career allow rate.
