Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant's submission filed on 11/17/2025 has been entered.
Response to Arguments
Applicant’s arguments, see Remarks Pgs. 2-3, filed 11/17/2025, with respect to the rejection of claim 1 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Guo et al. (NPL, “Multi-View 3D Object Retrieval With Deep Embedding Network”, previously cited, pdf attached).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1-2, 7, 9-12, 17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wu (WO 2019019525 A1, previously cited) in view of Guo et al. (NPL, “Multi-View 3D Object Retrieval With Deep Embedding Network”, previously cited), further in view of Yoshii (US Patent Pub. No. 2012/0008830 A1).
Regarding claim 1, Wu teaches an item recognition system comprising: a receiving surface (Pg. 2, “an acquisition step of collecting M to be classified placed on a settlement counter”); a top camera coupled to a top portion of the item recognition system (Pg. 5, “It can accurately identify goods and facilitate self-checkout.”, Pg. 3, “at least one of the N shooting angles is at least one of the first shooting angles, and the first shooting angle is from the top of the M products to the M products”), wherein the top camera is configured to capture images of the receiving surface from a top-down view (Pg. 3, “at least one of the N shooting angles is at least one of the first shooting angles, and the first shooting angle is from the top of the M products to the M products”); one or more peripheral cameras coupled to one or more side portions of the item recognition system, wherein the one or more peripheral cameras are configured to capture images of the receiving surface from different peripheral views (Pg. 6, “When the number N of cameras is five, the other four cameras may be evenly arranged around the M items to be classified, and all of the M items are photographed from obliquely downward”); a processor (Pg. 11, “network image recognition technology, including: a processor...”); and a non-transitory, computer-readable medium storing instructions that (Pg. 11, “The memory is used to store instructions executable by the processor”), when executed by the processor, cause the processor to: access a top image comprising an image captured by the top camera (Pg. 6, “At least one shooting angle that photographs M items from directly above the M items”); access one or more peripheral images, each comprising an image captured by a peripheral camera of the one or more peripheral cameras (Pg. 6, “When the number N of cameras is five, the other four cameras may be evenly arranged around the M items to be classified, and all of the M items are photographed from obliquely downward”); identify a region of the top image and a region of each of the one or more peripheral images that depicts an item on the receiving surface (Pg. 7, “Perform target detection on the picture acquired at the first shooting angle to obtain M first rectangular area images corresponding to M products one by one, and then rest on the N pictures according to the number of the first rectangular area images. The pictures are respectively subjected to target detection to acquire M remaining rectangular area images corresponding to M items one by one in each picture”).
Wu discloses performing data fusion on the classification results but does not teach generating an image embedding for each region, concatenating the embeddings to form a concatenated embedding, or comparing the concatenated embedding to a reference item embedding by computing a distance between them.
Guo teaches to generate an image embedding for each of the identified regions of the top image and the one or more peripheral images by applying an image embedding model to each of the identified regions, wherein the image embedding model is a machine-learning model trained to generate image embeddings for identified regions of images (Pg. 5531, Col. 1, “In terms of convolutional layers, we extract deep features by aggregating local activations in the feature map stack. Each activation corresponds to a local region of the input image and captures some local patterns of this region.”); and wherein comparing the embedding to a reference item embedding of the one or more reference item embeddings comprises: computing a distance between the embedding and the reference item embedding (Pg. 5531, Col. 2, “With the deep embedding network, each 3D object can be represented by a set of deep features. Thus the original 3D object retrieval task is converted to a set-to-set matching problem, which can be well solved via set-to-set distances... The third distance is a modified version of Hausdorff distance... where C represents the number of images each 3D object has. Different from the above two distances, it takes the whole view set into account.”).
Guo does not explicitly disclose concatenating the image embeddings to form a concatenated embedding.
Yoshii teaches to generate a feature vector for each of the identified regions of the top image and the one or more peripheral images by applying a machine-learning model to each of the identified regions, wherein the machine-learning model is a model trained to generate feature vectors for identified regions of images (Para. 34, “In an image matching step 111, matching is performed on the clipped normalized input image group 105 (clipped normalized input captured image group)”; Para. 36, “In the case of using machine learning technology, features may be extracted from each image”); concatenate the feature vectors based on a pre-determined ordering of the top camera and the one or more peripheral cameras to form a concatenated embedding (Fig. 10, Para. 63, “the technique shown in FIG. 10A involves simply concatenating the image from the camera 1 and the image from the camera 2 in the x axis direction.” explains that there is an order to the concatenation; Para. 36, “In the case of using machine learning technology, features may be extracted from each image, and the multiple feature amount vectors obtained as a result may be simply connected to create a single feature vector (feature information).”); and identify the item by comparing the concatenated feature vector to one or more reference item feature vectors (Para. 69, “Specifically, a recognition result for each input image is acquired by performing matching between the dictionary and the feature vector.”), wherein each reference item feature vector is associated with an item identifier (Para. 33, “The dictionary 110 stores the information of captured images of target objects from multiple viewpoints, as well as information such as the types and the positions and orientations of the target objects”), and comparing the concatenated feature vector to a reference item feature vector of the one or more reference item feature vectors (Para. 35, “The image matching step 111 may be realized by performing simple checking between images, or may be realized using machine learning technology.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wu to incorporate the teachings of Guo and Yoshii to generate an image embedding for each identified region, concatenate the image embeddings to get a concatenated image embedding, and compare the concatenated image embedding to a reference embedding by computing a distance between the concatenated image embedding and the reference item embedding. Guo teaches that creating an image embedding is a powerful way to represent the relevant features in an image (Pg. 5535, Col. 2), and embedding multiple views allows for the creation of 3D object embeddings to represent the objects (Pg. 5531, Col. 2). Concatenation is a well-known and conventional method for fusing multiple data vectors into a single combined representation. Wu discloses performing data fusion on classification results but does not specify the particular technique employed. Yoshii explicitly teaches generating feature vectors for each image captured from multiple views and then combining these vectors into a single feature vector for matching to a reference database (concatenation). Implementing the image embedding method of Guo and the concatenation method disclosed by Yoshii into the system of Wu would have been a routine substitution of known image recognition and data fusion techniques. One of ordinary skill in the art would recognize that utilizing image embeddings for concatenation rather than the feature vectors disclosed by Yoshii would provide the predictable improvement of more accurately preserving and aggregating all relevant image features from each individual view, thereby enabling more robust matching against reference data.
Regarding claim 2, Wu as modified teaches all of the elements of claim 1, as stated above, as well as the top camera and the one or more peripheral cameras are configured to capture 2D images of the receiving surface (Wu Pg. 4, “the collection device is a camera, and one camera is disposed directly above the M items to be classified to take a picture of the M items from directly above to collect Picture; four cameras are arranged around the M item to be classified to take pictures of the M items from obliquely downwards to collect pictures.”).
Regarding claim 7, Wu as modified teaches all of the elements of claim 1, as stated above, including wherein the instructions for identifying the item comprise instructions that cause the processor to: receive a set of candidate reference embeddings from a remote server (Yoshii, Para. 33, “The dictionary 110 stores the information of captured images of target objects from multiple viewpoints, as well as information such as the types and the positions and orientations of the target objects”; Para. 39, “The external storage apparatus 201 serving as the storage unit stores and holds, for example, programs for realizing the present embodiment, registered images captured by cameras, and a dictionary created using the registered images”).
Regarding claim 9, Wu as modified teaches all of the elements of claim 1, as stated above, including the instructions that cause the processor to: detect that an item was placed on the receiving surface; and access the top image and the one or more peripheral images responsive to detecting an item was placed on the receiving surface (Wu Pg. 6, “The collection action (or camera action) can be triggered by a scale arranged on the settlement counter”).
Regarding claim 10, Wu as modified teaches all of the elements of claim 9, as stated above, including instructions that cause the processor to: detect that an item was placed on the receiving surface based on sensor data from one or more weight sensors coupled to the receiving surface (Wu Pg. 6, “The collection action (or camera action) can be triggered by a scale arranged on the settlement counter. For example, the scale is a scale with a pressure sensor, and the change of the weight sensed by the scale determines whether to trigger the shooting”).
Regarding claims 11-12, 17, and 19-20, the system of claims 1-2, 7, and 9 stores the program of claims 11-12, 17, and 19 and performs the method of claim 20. They are rejected for the same reasons as claims 1-2, 7, and 9.
Claim(s) 3-5, and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Yoshii and Guo as applied to Claim 1 above, and further in view of He (NPL “Mask R-CNN”, previously cited).
Regarding claim 3, Wu as modified teaches all of the elements of claim 1, as stated above. The combination does not explicitly disclose instructions that cause the processor to: generate a pixel-wise mask for the top image and a pixel-wise mask for each of the one or more peripheral images, wherein the pixel-wise masks identify pixels of the top image and the one or more peripheral images that include an item.
He teaches a CNN that generates a pixel-wise mask for the top image and a pixel-wise mask for each of the one or more peripheral images, wherein the pixel-wise masks identify pixels of the top image and the one or more peripheral images that include an item (Pg. 3, Col. 2, “A mask encodes an input object’s spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions… our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wu in combination with Yoshii and Guo to incorporate the teachings of He to allow for the generation of a pixel-wise mask for the top image and a pixel-wise mask for each of the one or more peripheral images, wherein the pixel-wise masks identify pixels of the top image and the one or more peripheral images that include an item. Utilizing a pixel-wise mask allows for more accurate results, surpassing other state of the art models in instance segmentation, as disclosed by He (Pg. 2, Col. 1, “Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task”).
Regarding claim 4, Wu as modified teaches all of the elements of claim 3, as stated above, including the generation of a bounding box for the item (Wu, Pg. 6, “The images are respectively subjected to target detection to obtain the same number of remaining rectangular area images in each picture”). The combination does not teach generating a bounding box based on the pixel-wise mask of the top image and the one or more peripheral images.
He teaches a method to generate a pixel-wise mask of the top image and the one or more peripheral images (Pg. 3, Col. 2, “Specifically, we predict an m × m mask from each RoI using an FCN [30]. This allows each layer in the mask branch to maintain the explicit m × m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wu in combination with Yoshii and Guo to incorporate the teachings of He to allow for the processor to generate a bounding box for the item for the top image and a bounding box for the item for each of the one or more peripheral images based on the pixel-wise mask of the top image and the one or more peripheral images. Incorporating the pixel-wise mask taught by He into the existing bounding box generation disclosed by Wu allows for the usage of fewer parameters and is more accurate, as acknowledged by He.
Regarding claim 5, Wu as modified and He teach all of the elements of claim 4, as stated above, as well as the identified regions of the top image and the one or more peripheral images comprising a cropped image based on the bounding boxes of the top image and the one or more peripheral images (Wu, Pg. 7, “and when the target is detected, M rectangles (or rectangular areas) containing the products are pulled out on the picture, and each rectangle box contains an item”).
Regarding claims 13-15, the system of claims 1 and 3-5 stores the program of claims 11 and 13-15. They are rejected for the same reasons as claims 1 and 3-5.
Claim(s) 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wu in view of Yoshii and Guo as applied to Claim 1 above, and further in view of Colachis (US Patent Pub. No. 2021/0082564, previously cited).
Regarding claim 8, Wu as modified teaches all of the elements of claim 1, as stated above, as well as instructions that cause the processor to generate an image embedding for each of the identified regions of the top image and the one or more peripheral images (Yoshii, Para. 34, “In an image matching step 111, matching is performed on the clipped normalized input image group 105 (clipped normalized input captured image group)”; Para. 36, “In the case of using machine learning technology, features may be extracted from each image”; Guo, Pg. 5531, Col. 1, “In terms of convolutional layers, we extract deep features by aggregating local activations in the feature map stack. Each activation corresponds to a local region of the input image and captures some local patterns of this region.”). The combination does not teach generating the image embedding responsive to determining that the item does not overlap with another item on the receiving surface.
Colachis teaches a method to determine that the item does not overlap with another item on the receiving surface (Para. 0021, “an object overlap detection function 32 (detecting whether two objects overlap in space from the vantage of the video camera)”, Para. 0041, “detection of one of the first or second objects overlapping some other object may be taken as a trigger to prompt the person to correct the error”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Wu in combination with Yoshii and Guo to incorporate the teachings of Colachis to include image embeddings responsive to determining that the item does not overlap with another item on the receiving surface. An overlap detection function provides the ability to detect a wide range of errors in manipulating objects during activities of daily living, as recognized by Colachis (Para. 0021, “These object-oriented image analysis functions 30, 32, 34 provide the ability to detect a wide range of errors in manipulating objects during performance of a typical ADL”). This is analogous to manipulating objects on a receiving surface in a self-checkout system.
Regarding claim 18, the system of claims 1 and 8 stores the program of claims 11 and 18. They are rejected for the same reasons as claims 1 and 8.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DAVID A WAMBST whose telephone number is (703)756-1750. The examiner can normally be reached M-F 9-6:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Gregory Morse can be reached at (571)272-3838. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DAVID ALEXANDER WAMBST/ Examiner, Art Unit 2663
/GREGORY A MORSE/ Supervisory Patent Examiner, Art Unit 2698