DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see page 8, filed 12/29/2025, with respect to the claim objection have been fully considered and are persuasive. The objection to claim 9 has been withdrawn.
Applicant’s arguments, see page 8, filed 12/29/2025, with respect to the specification objection have been fully considered and are persuasive. The objection to the specification has been withdrawn.
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on all references applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. In the arguments, the Applicant states that the applied reference does not teach the features of the last four generating steps of the independent claims. This deficiency of the primary reference is cured by the addition of Bengio, as explained below.
Regarding the first limitation, "generating, in parallel with the text feature data, image feature data …", the primary reference discloses in ¶ [135] that operations can be performed in parallel, which includes generating the text and image feature data from a packaged item as described in ¶ [58]-[62]. The primary reference also discloses generating a ranked list of query results from a first database that can include text data; databases containing text data are taught in ¶ [51] and [59]. As stated earlier, a second ranked list can be generated in parallel with the first ranked set of results by querying a second database. This is taught in ¶ [65]. This database may include both visual embeddings and text data. However, the primary reference does not clearly teach a second database of visual embeddings describing images of packaged products, or an intersection of query results that are in both the first and second ranked sets of query results. This deficiency is cured by the Bengio reference.
The Bengio reference teaches separate databases, one of which contains visual embeddings that describe images of scanned objects using image features, as seen in figure 1 and ¶ [75]-[77]. Moreover, the secondary reference discloses taking a list of terms associated with a scanned object and ranking the terms to identify the top N terms. The system further identifies images that are associated with the top N terms to find the top N images for the top term. The system can identify the top image for the term, which is a form of taking a list of terms and a list of images and selecting an intersection between the two lists. This is taught in ¶ [84]-[93] and [99]-[102].
Thus, based on the above, the features of the claims below are disclosed.
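For illustration only, the arrangement discussed above (two ranked result sets generated in parallel, followed by an intersection of the results that appear in both sets) can be sketched as follows. This is a minimal sketch and is not taken from the application, Yim, or Bengio: the function names, the in-memory stand-ins for the two databases, the product identifiers, and the choice to order the intersection by summed rank positions are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two databases; the identifiers are invented.
TEXT_DB_RESULTS = ["sku_3", "sku_1", "sku_7", "sku_4"]   # ranked by text match
IMAGE_DB_RESULTS = ["sku_1", "sku_9", "sku_4", "sku_3"]  # ranked by visual match


def query_text_db(text_features):
    """Hypothetical text query: returns product ids, best match first."""
    return list(TEXT_DB_RESULTS)


def query_image_db(image_features):
    """Hypothetical visual-embedding query: returns product ids, best match first."""
    return list(IMAGE_DB_RESULTS)


def intersect_ranked(first, second):
    """Keep only ids present in both lists, ordered by summed rank position.

    Summing rank positions is an assumed ordering rule; the source documents
    do not specify how the intersection is ordered.
    """
    pos_first = {pid: i for i, pid in enumerate(first)}
    pos_second = {pid: i for i, pid in enumerate(second)}
    common = [pid for pid in first if pid in pos_second]
    return sorted(common, key=lambda pid: pos_first[pid] + pos_second[pid])


def final_ranked_results(text_features, image_features):
    """Run both queries concurrently, then intersect the two ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(query_text_db, text_features)
        second_future = pool.submit(query_image_db, image_features)
        return intersect_ranked(first_future.result(), second_future.result())
```

In this sketch, a product identifier that appears in only one of the two ranked lists (for example `sku_7` or `sku_9`) is excluded from the final list, which is the sense of "query results that are in both" used in the claim language.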
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1, 10 and 15 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. The specification refers to an intersection of query results in the first ranked set and the second ranked set. However, the specification does not clearly describe the “query results that are in both” the first and second ranked sets of query results recited in claim 1. Thus, this limitation is considered new matter. The same issue is present in claims 10 and 15. Claims 2-9, 11-14 and 16-20 are rejected based on their dependency.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 8-10 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Yim (US Pub 2021/0303614) in view of Bengio (US Pub 2014/0046935).
Re claim 1: Yim discloses a system comprising:
at least one processor; at least one memory component storing instructions that, when executed by the at least one processor (e.g. a processor is connected to a memory with instructions to perform the invention, which is taught in ¶ [26].), cause the at least one processor to perform operations comprising:
[0026] The processor 120 may execute, for example, software (e.g., a program 140) to control at least one other component (e.g., a hardware or software component) of the electronic device 101 coupled with the processor 120, and may perform various data processing or computation. According to one embodiment, as at least part of the data processing or computation, the processor 120 may load a command or data received from another component (e.g., the sensor module 176 or the communication module 190) in volatile memory 132, process the command or the data stored in the volatile memory 132, and store resulting data in non-volatile memory 134. According to an embodiment, the processor 120 may include a main processor 121 (e.g., a central processing unit (CPU) or an application processor (AP)), and an auxiliary processor 123 (e.g., a graphics processing unit (GPU), an image signal processor (ISP), a sensor hub processor, or a communication processor (CP)) that is operable independently from, or in conjunction with, the main processor 121. Additionally or alternatively, the auxiliary processor 123 may be adapted to consume less power than the main processor 121, or to be specific to a specified function. The auxiliary processor 123 may be implemented as separate from, or as part of the main processor 121.
accessing a set of image frames, from a computing device (e.g. the plurality of query images can be accessed, as explained in ¶ [56]-[60].);
[0056] The electronic device 101 may transmit an image to be queried (hereinafter, referred to as a ‘query image’) to the server 108, and may display information on products received from the server. The server 108 may search products by using the query image received from the electronic device 101, and may generate and provide information on products to be provided to a user.
[0057] The image acquiring module 302 may acquire the query image through a shooting device (e.g., the camera module 180) or a module (e.g., the display device 160) for capturing a screen of an image output device. For example, the query image may be a camera shooting image, a screen capture image, or the like. Alternatively, the image acquiring module 302 may acquire an image stored in a memory (e.g., the memory 130) as the query image. The image stored in the memory may include an image captured previously, an image transmitted from a web or another device (e.g., the electronic device 102, the electronic device 104, etc.), or the like. The display device 160 may display information on products, received from the server 108. Although not shown in FIG. 3, the electronic device 101 may further include the processor 120, and the processor 120 may control operations of the image acquiring module 302 and display device 106.
[0058] The processor 312 may control overall operations of the server 108. The processor 312 may include an image feature extracting module 314, an image matching module 316, an image recognizer 318, a text recognizer 320, a word vector converting module 322, a similarity measuring module 324, and a priority determining module 326. According to an embodiment, at least one of the image feature extracting module 314, the image matching module 316, the image recognizer 318, the text recognizer 320, the word vector converting module 322, the similarity measuring module 324, and the priority determining module 326 may be configured in hardware as part of a circuit of the processor 312. Alternatively, at least one of the image feature extracting module 314, the image matching module 316, the image recognizer 318, the text recognizer 320, the word vector converting module 322, the similarity measuring module 324, and the priority determining module 326 may be, as a software module, an instruction/code temporarily resided in the processor 312 or a storage space in which the instruction/code is stored. Operations of the image feature extracting module 314, the image matching module 316, the image matching module 316, the image recognizer 318, the text recognizer 320, the word vector converting module 322, similarity measuring module 324, and the priority determining module 326 may be understood as the operation of the processor 312.
[0059] The data storage 332 may store information on candidate products to be searched. The data storage 332 may store product images for a plurality of candidate products that can be provided as a query result and text information related to the candidate products. A set of the product images for the candidate products and the text information related to the candidate products may be referred to as ‘product information’. The text information related to the candidate products may include a brand, a manufacturer, a name, a type, a category, or the like. The product information stored in the data storage 332 may be provided from an external database (e.g., a product DB1 310a, a product DB2 310b). Although not shown in FIG. 3, the server 108 may further include a processor and a communication module, and the processor may collect product information from the external database periodically or in an event-driven manner through the communication module, and may store the product information in the data storage 332.
[0060] The image feature extracting module 314 may extract at least one feature point of a query image. For example, the image feature extracting module 314 may convert the query image into a state of being easily analyzed, and may extract at least one feature point from the query image in the converted state. The image matching module 316 may search products by using at least one feature point extracted by the image feature extracting module 314. For example, based on the at least one feature point extracted from the query image, the image matching module 316 may determine an image-based similarity between the query image and the product image in the product information stored in the data storage 332, and may select a product image having a similarity greater than or equal to a specific level. The image matching module 316 may provide the product information storing module 342 with product information corresponding to at least one product image having the similarity greater than or equal to the specific level.
[0062] The image recognizer 318 may recognize an external object included in the query image. The image recognizer 318 may analyze the query image to determine a type, name, or the like of an object included in the query image. The text recognizer 320 may recognize texts included in the query image. The text recognizer 320 may use an Optical Character Reader/Recognition (OCR) function to recognize characters, numeric symbols, or the like printed or engraved on the external object included in the query image. Text information derived from a shape of the external object included in the query image may be generated by the image recognizer 318 and the text recognizer 320. The text information may include one or more words.
detecting a packaged item in at least one frame of the set of image frames (e.g. a packaged item is detected in a query image from a set of stored images, which is taught in ¶ [56]-[60] above.);
generating text feature data by extracting text features from the packaged item from the at least one frame (e.g. text data is generated by extracting text features from the query image, which is taught in ¶ [58]-[60] and [62] above.);
generating, in parallel with the text feature data, image feature data by extracting image features from the packaged item from the at least one frame (e.g. the system generates feature points to be analyzed by extracting image features from the query image, which is taught in ¶ [58]-[60] and [62] above. As stated in ¶ [135], various operations can occur in parallel.);
[0135] According to various embodiments, each component (e.g., a module or a program) of the above-described components may include a single entity or multiple entities. According to various embodiments, one or more of the above-described components may be omitted, or one or more other components may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into a single component. In such a case, according to various embodiments, the integrated component may still perform one or more functions of each of the plurality of components in the same or similar manner as they are performed by a corresponding one of the plurality of components before the integration. According to various embodiments, operations performed by the module, the program, or another component may be carried out sequentially, in parallel, repeatedly, or heuristically, or one or more of the operations may be executed in a different order or omitted, or one or more other operations may be added.
generating a first ranked set of query results by querying a first database of text data associated with packaged products using the generated text feature data (e.g. the text data extracted is used to determine a first rank of the product when it is compared to other text information, which is taught in ¶ [65]. The text information can be provided from a plurality of databases, which is taught in ¶ [59] above and [51].);
[0051] A product Database (DB) may include not only an image of each of products but also text information indicating various attributes of the respective products. However, when an image is searched for in an image-based manner in which only an image is provided as query information, it may be difficult to utilize the text information of the product.
[0064] The similarity measuring module 324 may determine a similarity based on text information between the query image and the stored product information. For example, the similarity measuring module 324 may determine the similarity by using a vector corresponding to the query image and a vector corresponding to the product information stored in the product information storing module 342. The similarity may be evaluated according to an angle between two vectors to be compared.
[0065] The priority determining module 326 may determine a priority among products searched according to the similarity based on the text information. The priority determining module 326 may re-order products indicated by the product information stored in the product information storing module 342 according to the similarity based on the text information. According to the similarity based on the text information, the priority determining module 326 may adjust or re-order the priority among the products, determined according to a similarity based on an image. For example, a high priority may be assigned to a product having a high similarity. The high priority allows a corresponding product to be displayed at an upper end of a list when displayed on the electronic device 101. For example, if the similarity based on the image is high and thus a similarity based on text information is low for a product belonging to a high rank, then the priority determining module 326 may adjust a priority of the product to a lower priority. Otherwise, if the similarity based on the text information is high even if the similarity based on the image is low, the priority determining module 326 may re-rank the product to have a high rank.
generating, in parallel with the first ranked set of query results, a second ranked set of query results by querying a second database using the generated image feature data (e.g. a query search result can be changed based on the comparison of image features with image features stored within the data storage, which are acquired from external databases, as taught in ¶ [59] and [65] above.);
generating a final ranked set of query results, the final ranked set of query results comprising an intersection of the first ranked set of query results and the second ranked set of query results (e.g. a final ranked position of the query result is shown based on the two factors of the text information and the image features acquired, which is taught in ¶ [65] above.); and
causing presentation of a subset of the final ranked set of query results on a graphical user interface of the computing device (e.g. Yim discloses displaying, on the display, some of the products that result from the ranking of the query results, which is taught in ¶ [85] and [110].).
[0085] In operation 407, the electronic device 101 may display information on at least some of products. The electronic device 101 displays information on products on the display device 160. In this case, information on all products indicated by a search result or information on some of the products may be displayed. When the information on the products included in the search result indicates address values, the electronic device 101 may acquire information regarding purchasing of products by using the address values (e.g., request or receive the information by using a URL), and then may display the information regarding purchasing of the products.
[0110] In operation 1003, the electronic device 101 may display at least some of the products on a display (e.g., the display device 160) according to the priority. According to a screen size of the display provided in the electronic device 101, only information on some of the products may be displayed if information on the received products cannot be displayed at the same time. Thereafter, information on the remaining products may be displayed according to a drag input of a user. For example, as shown in FIG. 11A, the electronic device 101 may display a search result screen 1110 including a query image 1112 at an upper end and sequentially including an N-th product 1114a, a second product 1114b, and a first product 1114c at a lower end according to a priority.
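For illustration only, the angle-based similarity of Yim ¶ [0064] and the text-based re-ordering of ¶ [0065] quoted above can be sketched as follows. This is a simplified sketch under stated assumptions, not Yim's implementation: the function names, the example vectors, and the decision to re-order purely by text-based similarity (rather than adjusting an existing image-based priority) are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_u * norm_v)


def rerank_by_text(query_vector, products):
    """Re-order image-matched products by text-based similarity, highest first.

    `products` is a list of (name, word_vector) pairs in image-based order.
    """
    return sorted(
        products,
        key=lambda product: cosine_similarity(query_vector, product[1]),
        reverse=True,
    )


# Hypothetical data: three products already selected by image matching.
image_ordered = [("A", [0.0, 1.0]), ("B", [1.0, 1.0]), ("C", [2.0, 0.0])]
text_ordered = rerank_by_text([1.0, 0.0], image_ordered)
```

Here product "C" moves to the top of the list because its word vector points in the same direction as the query vector, even though it was last in the image-based order.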
However, Yim fails to specifically teach the features of querying a second database of visual embeddings describing images of the packaged products; generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results.
These features are well known in the art, as evidenced by Bengio. Similar to the primary reference, Bengio discloses different databases associated with different query types (same field of endeavor or reasonably pertinent to the problem).
Bengio discloses querying a second database of visual embeddings describing images of the packaged products (e.g. the primary reference includes visual embeddings, determined from an image, that describe a product. The secondary reference provides a separate database associated with either text or an image, where the image is considered an embedding that is converted into a term and processed by a model in the form of a vector. This is taught in ¶ [75]-[77].);
[0075] For each representative image, relevant image feature values are extracted (508). For example, the image features identifier 419 may extract image feature values for a respective representative image. In some embodiments, an image feature value is a visual characteristic of a portion of the image. Examples of image feature values include color histogram values, intensity values, an edge statistic, texture values, and so forth. Further details on extracting image feature values are disclosed in U.S. patent application Ser. No. ______, titled "Image Relevance Model," filed Jul. 17, 2009, Attorney Docket No. 16113-1606001, which is incorporated by reference herein in its entirety.
[0076] Machine learning is applied to generate an image relevance model for each of the top N query terms (510). In some embodiments, the image relevance model is a vector of weights representing the relative importance of corresponding image features to a query term (512). For a respective query term, machine learning is applied to the extracted image feature values of the representative images for the respective query term to train (and generate) an image relevance model for the respective query term. In some embodiments, the image relevance model is implemented as a passive-aggressive model for image retrieval (PAMIR), an example of which is disclosed in D. Grangier and S. Bengio, "A Discriminative Kernel-Based Model to Rank Images from Text Queries," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30 (2008), pp. 1371-1384, which is incorporated by reference herein in its entirety as background information. Further details on training and generating the image relevance model is described in U.S. patent application Ser. No. ______, titled "Image Relevance Model," filed Jul. 17, 2009, Attorney Docket No. 16113-1606001, which is incorporated by reference above.
[0077] Image relevance models for the top N query terms are combined to produce a matrix for mapping a visual query's image feature vector to N (query term, score) pairs (514). Each image relevance model vector for a respective query term becomes a row in a matrix of N rows for mapping a visual query's image feature vector to N (query term, score) pairs.
generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results (e.g. Bengio discloses a final list of query results that can include an image containing a text term, an image, or both. A result could also be text only. The system determines the image features that are relevant to a specific text term. The system can rank a number of textual query terms. A number of images can be considered related to a textual term. The system can then take the best image and associate this image with the text term, which can be related to a ranked list of text terms. This is an example of taking the intersection of terms in one list with images in another list to create a final list shown to a user. This is taught in ¶ [84]-[93] and [99]-[102].).
[0084] The visual query is an image document of any suitable format. For example, the visual query can be a photograph, a screen shot, a scanned image, or a frame or a sequence of multiple frames of a video. In some embodiments, the visual query is a drawing produced by a content authoring program (236, FIG. 2). As such, in some embodiments, the user "draws" the visual query, while in other embodiments the user scans or photographs the visual query. Some visual queries are created using an image generation application such as ADOBE ACROBAT, a photograph editing program, a drawing program, or an image editing program. For example, a visual query could come from a user taking a photograph of his friend on his mobile phone and then submitting the photograph as the visual query to the server system. The visual query could also come from a user scanning a page of a magazine, or taking a screen shot of a webpage on a desktop computer and then submitting the scan or screen shot as the visual query to the server system. In some embodiments, the visual query is submitted to the server system 106 through a search engine extension of a browser application, through a plug-in for a browser application, or by a search application executed by the client system 102. Visual queries may also be submitted by other application programs (executed by a client system) that support or generate images which can be transmitted to a remotely located server by the client system.
[0085] The visual query can be a combination of text and non-text elements. For example, a query could be a scan of a magazine page containing images and text, such as a person standing next to a road sign. A visual query can include an image of a person's face, whether taken by a camera embedded in the client system or a document scanned by or otherwise received by the client system. A visual query can also be a scan of a document containing only text. The visual query can also be an image of numerous distinct subjects, such as several birds in a forest, a person and an object (e.g., car, park bench, etc.), a person and an animal (e.g., pet, farm animal, butterfly, etc.). Visual queries may have two or more distinct elements. For example, a visual query could include a barcode and an image of a product or product name on a product package. For example, the visual query could be a picture of a book cover that includes the title of the book, cover art, and a bar code. In some instances, one visual query will produce two or more distinct search results corresponding to different portions of the visual query, as discussed in more detail below.
[0086] The visual query server system responds to the visual query by generating a set of image feature values for the visual query (704). The visual query server system identifies a set of image features in the visual query and generates a set of values for the image features in the visual query. Each image feature value represents a distinct image characteristic of the visual query. Examples of the generation image feature values are described in U.S. patent application Ser. No. ______, titled "Image Relevance Model," filed Jul. 17, 2009, Attorney Docket No. 16113-1606001, which is incorporated by reference above. In some embodiments, the set of image feature values includes color histogram values, intensity values, and an edge statistic (706). Other examples of image feature values include texture and other characteristics of a portion of an image. In some embodiments, the set of image feature values includes more feature values or less feature values than as described above.
[0087] The visual query server system maps the set of image feature values to a plurality of textual terms, including a weight for each of the textual terms in the plurality of textual terms (708). In some embodiments, the plurality of textual terms is the top N query terms or top N image queries described above with reference to FIG. 5. A respective textual term is a phrase, multiple words, or a single word. The mapping yields a weight or score for each of the plurality of textual terms with respect to the visual query. The weight or score is a relevance measure of the visual query to a respective textual term.
[0088] In some embodiments, the mapping utilizes a set of image relevance models, each model corresponding to a predefined textual term (710). The image relevance model for a textual term is a vector of weights representing the relative importance of a corresponding image feature used in determining whether an image is relevant to the textual term. In some embodiments, the predefined textual terms are the top N query terms, and each model in the set of image relevance models correspond to a respective top N query term.
[0089] In some embodiments, the set of image feature values for the visual query comprises an image features vector of the image feature values; and the mapping includes multiplying the image features vector by a matrix of image relevance models, each row of the matrix corresponding to a predefined textual term (712). Stated another way, the set of image feature values is represented by a vector of the values, and the image feature values vector is multiplied with a matrix of image relevance models, where each row in the matrix is a image relevance model vector corresponding to a query term, an example of which is described above with reference to FIGS. 5-6. The resulting product is a set of weights or scores for each of the plurality of textual terms with respect to the visual query.
[0090] The visual query server system ranks the textual terms in accordance with the weights of the textual terms (714). For example, the textual terms are ordered by their weights.
[0091] The visual query server system sends one or more of the ranked textual terms to the client system in accordance with the ranking the textual terms (716). In some embodiments, the textual terms that are weighted or scored the highest with respect to the visual query, in accordance with the weights or scores calculated from the mapping described above, are sent to the client system for display to the user, an example of which is described below.
[0092] In some embodiments, the visual query server system sends to the client system one or more images associated with the ranked textual terms (718) that are sent to the client system. Stated another way, the visual query server system sends, along with the ranked terms, images associated with the ranked terms to the client system. In some implementations, at the client system, a textual term is displayed with an associated image received from the visual query server system. An example of the resulting display at the client system is described below with reference to FIG. 10.
[0093] In some cases, one or more of the images associated with the ranked textual terms have image feature values similar to the image feature values identified for the visual query (720). For example, images associated with a ranked textual term are identified from a search for images using the ranked textual term (e.g., using terms-to-image search application 425). A set of best images associated with the ranked textual terms are selected by the visual query server system in accordance with a metric of similarity between their image feature values and the image feature values of the visual query. One example of such a metric of similarity is a dot product of the image feature values of candidate images with the image feature values of the visual query. For each top ranked textual term, one or more images having the highest metric of similarity (e.g., dot product) is selected.
[0099] FIG. 9 illustrates a screen shot of an interactive results document and visual query displayed concurrently with a list of textual terms, in accordance with some embodiments. The screen shot in FIG. 9 shows an interactive results document 900 and the original visual query 802 displayed concurrently with a visual query results list 902. In some embodiments, the interactive results document 900 is displayed by itself. In some other embodiments, the interactive results document 900 is displayed concurrently with the original visual query as shown in FIG. 9. In some embodiments, the list of visual query results 902 is concurrently displayed along with the original visual query 802 and/or the interactive results document 900. The type of client system and the amount of room on the display 206 may determine whether the list of results 902 is displayed concurrently with the interactive results document 900. In some embodiments, the client system 102 receives (in response to a visual query submitted to the visual query server system) both the list of results 902 and the interactive results document 900, but only displays the list of results 902 when the user scrolls below the interactive results document 900.
[0100] In FIG. 9, the list of results 902 includes a list of textual terms 903. The list of textual terms 903 includes one or more textual term results 905. The textual terms 905 are terms that were identified for the visual query 802 in accordance with the process described above with reference to FIGS. 7A-7B. Selection of a textual term 905 by the user (e.g., by clicking on the term) activates a textual search using the selected textual term 905 as the query.
[0101] In some embodiments, the list of results 902 also includes other search results found in response to the visual query. Examples of search results displayed in response to a visual query are disclosed in U.S. patent application Ser. No. 12/852,189, filed Aug. 6, 2010, entitled "Identifying Matching Canonical Documents in Response to a Visual Query," which is incorporated by reference in its entirety.
[0102] In some embodiments, one or more of the textual terms 905 in textual terms list 903 are displayed with one or more accompanying images 1002, as shown in FIG. 10. In some implementations, image 1002 is the most relevant image corresponding to textual term 905, based on an image search using the textual term as the query. The images 1002 are images associated with the visual query 802 as a whole or with sub-portions of the visual query 802. The pairing of textual terms 905 and accompanying images 1002 provide further context to the user as to how the textual terms 905 relate to the visual query 802 and sub-portions of the visual query 802.
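For illustration only, the term ranking and best-image selection described in Bengio ¶ [0090]-[0093] above can be sketched as follows. This is a minimal sketch and not a characterization of how Bengio is actually implemented; the function names, term weights, image identifiers and feature vectors are all hypothetical.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_terms(weights: Dict[str, float], top_n: int) -> List[str]:
    """Order textual terms by descending weight and keep the top N."""
    ordered = sorted(weights, key=lambda term: weights[term], reverse=True)
    return ordered[:top_n]


def best_image(query_features: Sequence[float],
               candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate image whose features score highest against the query."""
    return max(candidates, key=lambda image_id: dot(candidates[image_id], query_features))


# Hypothetical data: weights of terms mapped from a visual query, and
# candidate images found by a term-to-image search for each term.
term_weights = {"shampoo": 0.9, "bottle": 0.6, "label": 0.2}
query = [1.0, 0.0, 0.5]
images_by_term = {
    "shampoo": {"img_a": [0.9, 0.1, 0.4], "img_b": [0.1, 0.9, 0.0]},
    "bottle": {"img_c": [0.2, 0.2, 0.2], "img_d": [0.8, 0.0, 0.6]},
}

top_terms = rank_terms(term_weights, top_n=2)
pairs = [(term, best_image(query, images_by_term[term])) for term in top_terms]
print(pairs)
```

In this sketch the term "label" falls outside the top N and is dropped, and each remaining term is paired with the single candidate image having the highest dot product.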
Therefore, in view of Bengio, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the features of querying a second database of visual embeddings describing images of the packaged products; and generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results, incorporated in the device of Yim, in order to create a list from multiple groups of information from visual queries, which aids in helping the user determine items received in a visual query (as stated in Bengio ¶ [03]).
Re claim 6: However, Yim fails to specifically teach the features of the system of claim 1, wherein the intersection of query results excludes at least one first query result from the first ranked set of query results that is not in the second ranked set of query results and at least one second query result from the second ranked set of query results that is not in the first ranked set of query results.
However, this is well known in the art as evidenced by Bengio. Similar to the primary reference, Bengio discloses different databases associated with different query types (same field of endeavor or reasonably pertinent to the problem).
Bengio discloses wherein the intersection of query results excludes at least one first query result from the first ranked set of query results that is not in the second ranked set of query results and at least one second query result from the second ranked set of query results that is not in the first ranked set of query results (e.g. the invention discloses the Top N number of terms that are associated with a text term. Any term not included in the Top N number is not included. In addition, there is a list of images that may be associated with the term. Out of the associated images, images that are not associated with the term are not included. The best images selected to be associated with the term are discussed in ¶ [84]-[93] above.).
Therefore, in view of Bengio, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the feature of wherein the intersection of query results excludes at least one first query result from the first ranked set of query results that is not in the second ranked set of query results and at least one second query result from the second ranked set of query results that is not in the first ranked set of query results, incorporated in the device of Yim, in order to create a list from multiple groups of information from visual queries, which aids in helping the user determine items received in a visual query (as stated in Bengio ¶ [03]).
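For illustration only, the exclusion recited in claim 6 can be sketched as a set intersection over two ranked lists. This is a minimal sketch under stated assumptions: the product identifiers are hypothetical, each list is assumed to contain no duplicates, and ordering the surviving results by summed rank position is an assumption made for the example, not a rule stated in the claim or in either reference.

```python
from typing import List


def intersect_ranked(first: List[str], second: List[str]) -> List[str]:
    """Keep only results present in both ranked lists.

    Survivors are ordered by the sum of their positions in the two lists
    (assumed tie-break: position in the first list).
    """
    second_rank = {item: rank for rank, item in enumerate(second)}
    common = [
        (rank + second_rank[item], rank, item)
        for rank, item in enumerate(first)
        if item in second_rank
    ]
    return [item for _, _, item in sorted(common)]


# Hypothetical ranked results from a text query and an image query.
text_results = ["P1", "P2", "P3", "P4"]
image_results = ["P3", "P5", "P1", "P2"]

final = intersect_ranked(text_results, image_results)
print(final)
```

Here "P4" appears only in the first list and "P5" only in the second, so both are excluded from the final list, which is the behavior the claim 6 limitation describes.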
Re claim 8: Yim discloses the system of claim 1, wherein each result in the subset of the final ranked set of query results is displayed as a selectable user interface element, the selectable user interface element comprising purchase information of an item similar to the packaged item (e.g. the search results are displayed on the screen in a manner to be able to access a product for purchase through a URL, which is taught in ¶ [84] and [85].).
[0084] In operation 405, the electronic device 101 may receive, from the server 108, information on a search result determined from the transmitted image. The information on the search result may include information on products and information on a priority among the products. The information on the products may include an address value (e.g., Uniform Resource Location (URL)) capable of accessing information required to purchase a product. The information on the priority may be explicitly or implicitly indicated. For example, a priority of products may be implicitly represented through an order by which address values are listed.
[0085] In operation 407, the electronic device 101 may display information on at least some of products. The electronic device 101 displays information on products on the display device 160. In this case, information on all products indicated by a search result or information on some of the products may be displayed. When the information on the products included in the search result indicates address values, the electronic device 101 may acquire information regarding purchasing of products by using the address values (e.g., request or receive the information by using a URL), and then may display the information regarding purchasing of the products.
Re claim 9: Yim discloses the system of claim 8, further comprising:
receiving a selection of the selectable user interface element (e.g. the user can select a product through the input or selection of a URL that is associated with a product, which is taught in ¶ [84] and [85] above.); and
in response to receiving the selection, causing presentation of a packaged item for purchase that is similar to the packaged item (e.g. a selection can be made to products similar to the query result that can be used to display this product or purchase the selected product, which is taught in ¶ [109]-[111].).
[0109] Referring to FIG. 10, in operation 1001, the electronic device 101 (e.g., the processor 120) may identify a priority for at least some of products. Herein, the products may include products indicated by information on a plurality of products received from the server 108. Herein, the priority may be indicated explicitly or implicitly. For example, when the priority conforms to an order of address values capable of accessing information required to purchase a product, the electronic device 101 may identify the priority according to the order of address values.
[0110] In operation 1003, the electronic device 101 may display at least some of the products on a display (e.g., the display device 160) according to the priority. According to a screen size of the display provided in the electronic device 101, only information on some of the products may be displayed if information on the received products cannot be displayed at the same time. Thereafter, information on the remaining products may be displayed according to a drag input of a user. For example, as shown in FIG. 11A, the electronic device 101 may display a search result screen 1110 including a query image 1112 at an upper end and sequentially including an N-th product 1114a, a second product 1114b, and a first product 1114c at a lower end according to a priority.
[0111] According to another embodiment, the electronic device 101 may display additional information utilizing a keyword related to a product having a high similarity to text information extracted from the query image. For example, as shown in FIG. 11B, the electronic device 101 may display a search result screen 1120 including the query image 1112 at an upper end, sequentially including the N-th product 1114a, the second product 1114b, and the first product 1114c at a lower end according to a priority, and further including a keyword related to a specific product at a middle portion. Herein, the additional information 1122 may include a link capable of additionally searching products related to a corresponding keyword. To this end, the server 108 may transmit an address value for additional search or information on a keyword related to a product having a high score in terms of a similarity.
[0112] According to another embodiment, the electronic device 101 may further display an item (e.g., a button) to additionally search products which have a common feature of products having a high similarity to the text information extracted from the query image. For example, as shown in FIG. 11C, the electronic device 101 may display a search result screen 1130 including the query image 1112 at an upper end, sequentially including the N-th product 1114a, the second product 1114b, and the first product 1114c at a lower end according to a priority, and further including a first button 1134a, a second button 1134b, and a third button 1134c as items indicating common features of products at a middle portion. To this end, the server 108 may transmit, to the electronic device 101, address values for additional search or information on features to be represented through buttons.
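For illustration only, the implicit priority of Yim ¶ [0084] and [0109] (priority conveyed by the order of address values) and the partial display of ¶ [0110] can be sketched as follows. This is a minimal sketch; the URLs, the function names and the notion of a fixed screen "capacity" are hypothetical.

```python
from typing import List, Tuple


def identify_priority(address_values: List[str]) -> List[Tuple[int, str]]:
    """Treat list order as an implicit priority: the first address value gets priority 1."""
    return list(enumerate(address_values, start=1))


def visible_products(ranked: List[Tuple[int, str]], capacity: int,
                     offset: int = 0) -> List[Tuple[int, str]]:
    """Return the products that fit on the screen at the given scroll offset."""
    return ranked[offset:offset + capacity]


# Hypothetical address values received from the server, in priority order.
urls = [
    "https://example.com/p/a",
    "https://example.com/p/b",
    "https://example.com/p/c",
]

ranked = identify_priority(urls)
first_screen = visible_products(ranked, capacity=2)
after_drag = visible_products(ranked, capacity=2, offset=2)
print(first_screen, after_drag)
```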
Re claim 10: Yim discloses a method comprising:
accessing, using one or more processors, a set of image frames, from a computing device (e.g. the plurality of query images can be accessed, which can be explained in ¶ [56]-[60] above.);
detecting a packaged item in at least one frame of the set of image frames (e.g. a packaged item is detected in a query image from a set of stored images, which is taught in ¶ [56]-[60] above.);
generating text feature data by extracting text features from the packaged item from the at least one frame (e.g. text data is generated by extracting text features from the query image, which is taught in ¶ [58]-[60] and [62] above.);
generating, in parallel with text feature data, image feature data by extracting image features from the packaged item from the at least one frame (e.g. the system generates feature points to be analyzed by extracting image features from the query image, which is taught in ¶ [58]-[60] and [62] above. As stated in ¶ [135], various operations can occur in parallel.);
generating a first ranked set of query results by querying a first database of text data associated with packaged products using the generated text feature data (e.g. the text data extracted is used to determine a first rank of the product when it is compared to other text information, which is taught in ¶ [65]. The text information can be provided from a plurality of databases, which is taught in ¶ [51] and [59] above.);
generating, in parallel with the first ranked set of query results, a second ranked set of query results by querying a second database using the generated image feature data (e.g. a query search result can be changed based on the comparison of image features with image features stored within the data storage acquired from external databases, which is taught in ¶ [59] and [65] above.);
generating a final ranked set of query results, the final ranked set of query results comprising an intersection of the first ranked set of query results and the second ranked set of query results (e.g. a final ranked position of the query result is shown based on the two factors of the text information and the image features acquired, which is taught in ¶ [65] above.); and
causing presentation of a subset of the final ranked set of query results on a graphical user interface of the computing device (e.g. the invention discloses displaying some of the products on the display that results from the ranking of the query results, which is taught in ¶ [85] and [110] above.).
However, Yim fails to specifically teach the features of querying a second database of visual embeddings describing images of the packaged products; generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results.
However, this is well known in the art as evidenced by Bengio. Similar to the primary reference, Bengio discloses different databases associated with different query types (same field of endeavor or reasonably pertinent to the problem).
Bengio discloses querying a second database of visual embeddings describing images of the packaged products (e.g. the primary reference includes visual embeddings describing a product that are determined from an image. This secondary reference provides a separate database associated with either text or an image that is considered an embedding that is converted into a term that is processed by a model in the form of a vector. This is taught in ¶ [75]-[77] above.);
generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results (e.g. the invention discloses a final list of query results that includes an image containing a text term, an image or both. The term could be only text. The system determines the image features that are relevant to a specific text term. The system can rank the number of query textual terms. A number of images can be considered related to the textual term. The system can then take the best image and associate this image with the text term that can be related to a ranked text term list. This is an example of taking the intersection of terms in one list with images within another list to create a final list shown to a user. This is taught in ¶ [84]-[93] and [99]-[102] above.).
Therefore, in view of Bengio, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the features of querying a second database of visual embeddings describing images of the packaged products; and generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results, incorporated in the device of Yim, in order to create a list from multiple groups of information from visual queries, which aids in helping the user determine items received in a visual query (as stated in Bengio ¶ [03]).
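For illustration only, the sequence of steps recited in claim 10 (parallel feature extraction, parallel database queries, then an intersection) can be sketched as follows. This is a minimal sketch and not a characterization of Yim, Bengio or the application: the in-memory "databases", the scoring functions, the frame format and all names are hypothetical, and a thread pool is used simply as one way to run two operations concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical stand-ins for a text database and a visual-embedding database.
TEXT_DB: Dict[str, set] = {
    "P1": {"acme", "shampoo"},
    "P2": {"acme", "soap"},
    "P3": {"other", "lotion"},
}
IMAGE_DB: Dict[str, List[float]] = {
    "P1": [1.0, 0.0],
    "P2": [0.0, 1.0],
    "P3": [0.9, 0.1],
}


def extract_text_features(frame: dict) -> List[str]:
    """Stand-in for text recognition: lower-cased tokens of the frame's text."""
    return frame["text"].lower().split()


def extract_image_features(frame: dict) -> List[float]:
    """Stand-in for an image encoder: the frame already carries a vector."""
    return frame["pixels"]


def query_text_db(tokens: List[str]) -> List[str]:
    """Rank products by the number of shared tokens, dropping zero scores."""
    scores = {pid: len(words & set(tokens)) for pid, words in TEXT_DB.items()}
    return [pid for pid, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]


def query_image_db(vector: List[float]) -> List[str]:
    """Rank products by dot product with the query vector, dropping zero scores."""
    scores = {pid: sum(a * b for a, b in zip(vector, emb)) for pid, emb in IMAGE_DB.items()}
    return [pid for pid, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]


def search(frame: dict) -> List[str]:
    """Run both extractions, then both queries, concurrently; then intersect."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text_features, frame)
        image_future = pool.submit(extract_image_features, frame)
        first_future = pool.submit(query_text_db, text_future.result())
        second_future = pool.submit(query_image_db, image_future.result())
        first, second = first_future.result(), second_future.result()
    second_set = set(second)
    # Assumed ordering: keep the order of the first (text) ranked list.
    return [pid for pid in first if pid in second_set]


frame = {"text": "ACME Shampoo", "pixels": [1.0, 0.0]}
final_results = search(frame)
print(final_results)
```

With this hypothetical data, the text query returns P1 and P2, the image query returns P1 and P3, and only P1 survives the intersection.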
Re claim 15: Yim discloses a non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor (e.g. a processor is connected to a memory with instructions to perform the invention, which is taught in ¶ [26] above.), cause the at least one processor to perform operations comprising:
accessing a set of image frames, from a computing device (e.g. the plurality of query images can be accessed, which can be explained in ¶ [56]-[60] above.);
detecting a packaged item in at least one frame of the set of image frames (e.g. a packaged item is detected in a query image from a set of stored images, which is taught in ¶ [56]-[60] above.);
generating text feature data by extracting text features from the packaged item from the at least one frame (e.g. text data is generated by extracting text features from the query image, which is taught in ¶ [58]-[60] and [62] above.);
generating, in parallel with the text feature data, image feature data by extracting image features from the packaged item from the at least one frame (e.g. the system generates feature points to be analyzed by extracting image features from the query image, which is taught in ¶ [58]-[60] and [62] above. As stated in ¶ [135], various operations can occur in parallel.);
generating a first ranked set of query results by querying a first database of text data associated with packaged products using the generated text feature data (e.g. the text data extracted is used to determine a first rank of the product when it is compared to other text information, which is taught in ¶ [65] above. The text information can be provided from a plurality of databases, which is taught in ¶ [59] above.);
generating, in parallel with the first ranked set of query results, a second ranked set of query results by querying a second database using the generated image feature data (e.g. a query search result can be changed based on the comparison of image features with image features stored within the data storage acquired from external databases, which is taught in ¶ [59] and [65] above.);
generating a final ranked set of query results, the final ranked set of query results comprising an intersection of the first ranked set of query results and the second ranked set of query results (e.g. a final ranked position of the query result is shown based on the two factors of the text information and the image features acquired, which is taught in ¶ [65] above.); and
causing presentation of a subset of the final ranked set of query results on a graphical user interface of the computing device (e.g. the invention discloses displaying some of the products on the display that results from the ranking of the query results, which is taught in ¶ [85] and [110] above.).
However, Yim fails to specifically teach the features of querying a second database of visual embeddings describing images of the packaged products; generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results.
However, this is well known in the art as evidenced by Bengio. Similar to the primary reference, Bengio discloses different databases associated with different query types (same field of endeavor or reasonably pertinent to the problem).
Bengio discloses querying a second database of visual embeddings describing images of the packaged products (e.g. the primary reference includes visual embeddings describing a product that are determined from an image. This secondary reference provides a separate database associated with either text or an image that is considered an embedding that is converted into a term that is processed by a model in the form of a vector. This is taught in ¶ [75]-[77] above.);
generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results (e.g. the invention discloses a final list of query results that includes an image containing a text term, an image or both. The term could be only text. The system determines the image features that are relevant to a specific text term. The system can rank the number of query textual terms. A number of images can be considered related to the textual term. The system can then take the best image and associate this image with the text term that can be related to a ranked text term list. This is an example of taking the intersection of terms in one list with images within another list to create a final list shown to a user. This is taught in ¶ [84]-[93] and [99]-[102] above.).
Therefore, in view of Bengio, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the features of querying a second database of visual embeddings describing images of the packaged products; and generating a final ranked set of query results, the final ranked set of query results comprising an intersection of query results that are in both the first ranked set of query results and the second ranked set of query results, incorporated in the device of Yim, in order to create a list from multiple groups of information from visual queries, which aids in helping the user determine items received in a visual query (as stated in Bengio ¶ [03]).
Claim(s) 2, 3, 11, 12, 16 and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio, as applied to claims 1, 10 and 15 above, and further in view of Jindal (US Pub 2022/0261579).
Re claim 2: However, Yim fails to specifically teach the features of the system of claim 1, wherein the packaged item is detected using an object detector neural network.
However, this is well known in the art as evidenced by Jindal. Similar to the primary reference, Jindal discloses detecting an object with a neural network (same field of endeavor or reasonably pertinent to the problem).
Jindal discloses wherein the packaged item is detected using an object detector neural network (e.g. a neural network is used to detect an object, which is taught in ¶ [28].).
[0028] Providing a user the ability to identify objects from a live captured scene requires a solution that is both fast and adaptable to be able to identify newly added objects. Thus, and in accordance with some embodiments, an object locating technique discussed herein leverages a neural network for quick object detection, along with a combination of salient feature detection and text identification to quickly produce highly robust matches. Additionally, the feature comparison operations use pre-stored and labelled or otherwise pre-classified reference images from an object image database, which allows for new object images to be easily added to the database, according to some embodiments. Furthermore, the neural network can be trained in a supervised manner which allows for identification of new objects and can even be trained in an unsupervised manner to recognize new objects that are observed over and over again, as will be appreciated.
Therefore, in view of Jindal, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the feature of wherein the packaged item is detected using an object detector neural network, incorporated in the device of Yim, in order to use a neural network to detect an object, which aids in identifying new objects when trained to recognize re-observed objects (as stated in Jindal ¶ [28]).
Re claim 3: Yim discloses the system of claim 2, wherein the object detector neural network generates a confidence level indicating that the packaged item is an object of interest based on a position and a prominence of the packaged item in the at least one image frame (e.g. the system determines if the product is in the center position in a prominent location within the image frame in order to easily extract the product from the image, which is taught in ¶ [116]. Based on the ease of extracting the image, the system can extract an image in order to determine a similarity level. The similarity level is compared to a threshold to determine if a product has a certain similarity level, which is taught in ¶ [60] above and [97].).
[0097] In operation 605, the server 108 may select products having at least a specific matching rate, based on a matching result. The server 108 may identify at least one product image having a similarity greater than or equal to a specific level, and may identify a product corresponding to the identified product image.
[0116] In operation 1205, the electronic device 101 may display a guidance phrase for shooting. The guidance phrase may be an interface which describes a matter required to a user in order to capture a query image so that information on an external object can be easily extracted. For example, as shown in FIG. 13, the electronic device 101 may display a screen 1310 including a preview image 1302 at an upper end and a guidance phrase 1304 at a lower end. In the example of FIG. 13, although the guidance phrase 1304 includes a sentence “Place the product at the center of the screen, please”, this is only an example, and thus it is possible to include another sentence. In addition, in the example of FIG. 13, although the guidance phrase 1304 is displayed in a region separated from the preview image 1302, this is only an example, and thus at least part of the guidance phrase 1304 may be displayed to overlap with the preview image 1302.
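For illustration only, the threshold comparison described in Yim ¶ [0097] above (selecting products whose image similarity is greater than or equal to a specific level) can be sketched as follows. This is a minimal sketch; the similarity scores, the threshold value and the function name are hypothetical.

```python
from typing import Dict, List


def select_matches(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Return products whose similarity meets the threshold, best match first."""
    matched = [pid for pid, score in similarities.items() if score >= threshold]
    return sorted(matched, key=lambda pid: similarities[pid], reverse=True)


# Hypothetical matching results between a query image and stored product images.
scores = {"P1": 0.92, "P2": 0.40, "P3": 0.75}
selected = select_matches(scores, threshold=0.7)
print(selected)
```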
Re claim 11: However, Yim fails to specifically teach the features of the method of claim 10, wherein the packaged item is detected using an object detector neural network.
However, this is well known in the art as evidenced by Jindal. Similar to the primary reference, Jindal discloses detecting an object with a neural network (same field of endeavor or reasonably pertinent to the problem).
Jindal discloses wherein the packaged item is detected using an object detector neural network (e.g. a neural network is used to detect an object, which is taught in ¶ [28] above.).
Therefore, in view of Jindal, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the feature of wherein the packaged item is detected using an object detector neural network, incorporated in the device of Yim, in order to use a neural network to detect an object, which aids in identifying new objects when trained to recognize re-observed objects (as stated in Jindal ¶ [28]).
Re claim 12: Yim discloses the method of claim 11, wherein the object detector neural network generates a confidence level indicating that the packaged item is an object of interest based on a position and a prominence of the packaged item in the at least one image frame (e.g. the system determines if the product is in the center position in a prominent location within the image frame in order to easily extract the product from the image, which is taught in ¶ [116] above. Based on the ease of extracting the image, the system can extract an image in order to determine a similarity level. The similarity level is compared to a threshold to determine if a product has a certain similarity level, which is taught in ¶ [60] and [97] above.).
Re claim 16: However, Yim fails to specifically teach the features of the non-transitory computer-readable storage medium of claim 15, wherein the packaged item is detected using an object detector neural network.
However, this is well known in the art as evidenced by Jindal. Similar to the primary reference, Jindal discloses detecting an object with a neural network (same field of endeavor or reasonably pertinent to the problem).
Jindal discloses wherein the packaged item is detected using an object detector neural network (e.g. a neural network is used to detect an object, which is taught in ¶ [28] above.).
Therefore, in view of Jindal, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the feature of wherein the packaged item is detected using an object detector neural network, incorporated in the device of Yim, in order to use a neural network to detect an object, which aids in identifying new objects when trained to recognize re-observed objects (as stated in Jindal ¶ [28]).
Re claim 17: Yim discloses the non-transitory computer-readable storage medium of claim 16, wherein the object detector neural network generates a confidence level indicating that the packaged item is an object of interest based on a position and a prominence of the packaged item in the at least one frame (e.g. the system determines if the product is in the center position in a prominent location within the image frame in order to easily extract the product from the image, which is taught in ¶ [116] above. Based on the ease of extracting the image, the system can extract an image in order to determine a similarity level. The similarity level is compared to a threshold to determine if a product has a certain similarity level, which is taught in ¶ [60] and [97] above.).
Claim(s) 4 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio and Jindal, as applied to claims 2 and 16 above, and further in view of Ravichandran (USP 9830534).
Re claim 4: However, Yim fails to specifically teach the features of the system of claim 2, further comprising: receiving, from the object detector neural network, a category associated with the packaged item; and based on a determination that the category is a beauty product category.
However, this is well known in the art as evidenced by Jindal. Similar to the primary reference, Jindal discloses detecting an object with a neural network (same field of endeavor or reasonably pertinent to the problem).
Jindal discloses further comprising: receiving, from the object detector neural network, a category associated with the packaged item; and based on a determination that the category is a product category (e.g. a neural network is used to receive captured data and categorize the item within the captured scene, which is taught in ¶ [49]. The classification of the product can yield a label that is associated with the identified category.).
[0049] According to some embodiments, object determination module 222 is programmed or otherwise configured to identify the presence of various objects within the captured scene generated via camera 216. Object identification is performed by feeding the captured scene to a neural network trained to identify and categorize various objects. According to some embodiments, the neural network is trained using many (e.g., hundreds or thousands, or more) images of particular products offered for sale at a particular store where the user would be looking for one of the products. For example, a grocery store may train a neural network using hundreds or thousands of images of food products sold in that grocery store to identify any one of the store's products. In another example, an electronics store may train a neural network using hundreds or thousands of images of the various electronic or office products sold in that store to identify any one of the store's products. According to some embodiments, the neural network produces a bounding box output around each identified object within the received captured scene along with a confidence score for that bounding box and a one or more classification labels. The classification labels may be, for example, in the form of a vector that includes one or more classification labels for a given identified object to be used for categorizing the identified object. Such a vector is referred to herein as a label vector. For example, the label vector for a jar Welch's grape jelly may include the classification labels “Welch's”, “grape”, “jelly”, “jar” based on what is identified by the neural network.
Therefore, in view of Jindal, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of further comprising: receiving, from the object detector neural network, a category associated with the packaged item; and based on a determination that the category is a product category, incorporated in the device of Yim, in order to categorize an identified product when the image is input into a neural network, which can aid in rapidly locating the object (as stated in Jindal ¶ [18]).
However, the combination above fails to specifically teach the features of initiating generation of the text feature data and generation of the image feature data based on a determination that the category is a beauty product category.
However, this is well known in the art as evidenced by Ravichandran. Similar to the primary reference, Ravichandran discloses using a classifier algorithm to classify a product (same field of endeavor or reasonably pertinent to the problem).
Ravichandran discloses initiating generation of the text feature data and generation of the image feature data based on a determination that the category is a beauty product category (e.g. the system discloses determining a classification of Beauty for a scanned object, which is taught in col. 3, ll. 50-col. 4, ll. 41. Based on a category being determined, a string of characters that are considered descriptors are referenced as well as the image associated with the string of characters. However, if the descriptors and images are not associated with the category, these can be generated, which is taught in col. 5, ll. 19-47 and col. 6, ll. 26-39. When creating a categorization tree, the character string and images associated with the category can be used to generate a category ID, or item ID, and an image ID for a category. This occurs after identifying categories, which is taught in col. 9, ll. 1-65. Applying these steps after determining a Beauty product to the combination above would result in more accurate searching or recommendations to a user.).
(15) As a first step, a neural network-based approach can be used to train a classifier algorithm on one or more categories (e.g., apparel, shoes, etc.) An example neural network is a convolutional neural network (CNN). Convolutional neural networks are a family of statistical learning models using in machine learning applications to estimate or approximate functions that depend on a large number of inputs. The various inputs are interconnected with the connections having numeric weights that can be tuned over time, enabling the networks to be capable of “learning” based on additional information. The adaptive numeric weights can be thought of as connection strengths between various inputs of the network, although the networks can include both adaptive and non-adaptive components. CNNs exploit spatially-local correlation by enforcing a local connectivity pattern between nodes of adjacent layers of the network. Different layers of the network can be composed for different purposes, such as convolution and sub-sampling. CNN is trained on a similar data set (which includes dress, pants, watches etc.), so it learns the best feature representation for this type of image. Trained CNN is used as a feature extractor: an input image is passed through the network and intermediate outputs of layers can be used as feature descriptor of the input image. Similarity scores can be calculated based on the distance between the one or more feature descriptors and the one or more candidate content feature descriptors and used in a categorization tree as described herein.
(16) A content provider can thus analyze a set of images to determine a probability that a respective image includes an instance of a particular category. For example, for an image, rotated versions of the image can be generated. The increments of rotation between the multiple rotated versions can include, for example, one degree, five degrees, forty-five degrees, ninety-degrees, or some other increment. The classifier algorithm can be configured to analyze at least a portion of the rotated versions of the image. The classifier can generate, for each analyzed image of the rotated images, a classification vector, categorization value, weighting, or other score that indicates a probability that a respective image includes an instance of a certain category of a categorization tree. A category can refer to, for example, a class or division of items regarded as having particular shared characteristics. An example category can be Sports and Outdoors, Beauty, Health and Grocery, Books, Movies, Music and Games, Clothing, Shoes, and Jewelry, among others.
(17) The classification vector can include an entry (i.e., a probability) for each of the categories the classification algorithm is trained to recognize. The probabilities can be utilized to generate a probability distribution of output category data. Using an entropy algorithm or other such selection algorithm or approach, the probability distribution of output category data is analyzed to select an image of the rotated versions of the image. Thus, the classification result of the classifier includes a classification of the image at a particular viewpoint. As will be described further herein, a categorization tree can then be utilized, whereby for an item of interest represented in a query image, the categorization tree can be consulted to determine a category of the item.
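For illustration only, the selection step described in paragraph (17) can be sketched as follows. This is a minimal sketch and not Ravichandran's implementation: it assumes that the "entropy algorithm" selects the rotated version whose classification vector has the lowest Shannon entropy (the most confident distribution). The function names, the three-category list, and the probability values are hypothetical.

```python
import math

def entropy(probs):
    # Shannon entropy of one classification vector (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_rotation(vectors):
    """vectors maps a rotation angle to its classification vector.

    Assumption: the rotated version with the lowest entropy is selected.
    Returns (angle, index of the most probable category).
    """
    angle = min(vectors, key=lambda a: entropy(vectors[a]))
    best = vectors[angle]
    return angle, max(range(len(best)), key=best.__getitem__)

# Hypothetical categories and probabilities for three rotated versions.
CATEGORIES = ["Sports and Outdoors", "Beauty", "Books"]
vectors = {
    0: [0.40, 0.35, 0.25],
    45: [0.05, 0.90, 0.05],
    90: [0.30, 0.50, 0.20],
}
angle, idx = select_rotation(vectors)
print(angle, CATEGORIES[idx])
```

In this sketch the 45-degree version is selected because its distribution is the most peaked, and its top category is Beauty.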
(20) Categories in the categorization tree may be referenced and/or defined by category data. The example category data includes multiple data objects each corresponding to one of a category data object, a parent item data object, a child item data object, and an image data object. The category data object may reference and/or define a particular category of the categorization tree with a category identifier (ID) corresponding to the category. For example, each category in the categorization tree may be associated with a uniquely identifying string of alphanumeric characters, and the category ID may be a copy of the uniquely identifying string of the category. The category data object may further reference an item set of content in the collection of content corresponding to items that are categorized by the category having the category ID. For example, each item referenced by the collection of content may be associated with a uniquely identifying string of alphanumeric characters (an “item ID”), and the item set may include copies corresponding to the categorized items. The category data object may yet further reference an image set of images corresponding to items referenced by the item set. For example, each image corresponding to content in the collection of content corresponding to one or more items may be associated with a uniquely identifying string of alphanumeric characters (an “image ID”), and the image set may include copies corresponding to the referenced images. The category data object may still further include a similarity descriptor set including copies of similarity descriptors (e.g., histogram descriptors) corresponding to the images referenced by the image set.
(23) For the determined level, a category can be selected and one or more images associated with the selected category can be determined. For example, a set of images associated with content in the collection of content corresponding to items that are categorized by the selected category can be determined. Local-texture, global-shape, local-shape descriptors, or other features are obtained or determined for the query image. If the query image is not part of the collection and does not already have associated descriptors, a search module or other module may generate local-texture, global-shape, and/or local-shape descriptors for the query image. If the query image is part of the collection, the descriptors for the query image can be obtained an appropriate location storing the descriptors for the query image.
(31) FIG. 4 illustrates an example process 400 for generating a categorization tree that can be utilized in accordance with various embodiments. It should be understood that there can be additional, fewer, or alternative steps performed in similar or alternative orders, or in parallel, within the scope of the various embodiments unless otherwise stated. In this example, a set of images is obtained 402, where the set of images includes images of items of a collection of items. At least a portion of the set of images includes label information that describes a category of an item represented in a respective image. The set can include subsets from different sources and/or received at different times. The images may also include metadata regarding that which is represented in the images, such as may include item descriptions or identifiers, location information, collection data, category information, and the like. The images can be stored to a data store or in memory for subsequent analysis.
(32) From the set of images, an image can be selected 404 for processing. This can include any pre-processing, such as noise removal, color or intensity adjustment, and the like. The image can be segmented 406 into item portions using an appropriate process, such as by using connected contours or background removal to identify a potential item of interest, using an object recognition or image matching process on one or more portions of the image, etc. A determination 407 can be made whether the image includes label information. In the situation where it is determined that no label information is available, an object recognition or similar process can then attempt to identify 408 each item portion from an object catalog or other such repository or image library. As discussed, this can include an image matching process that can attempt to match the portion against a library of images in an attempt to determine 410 visually similar items to find a match with sufficient confidence or certainty that the item can be considered to be identified as the product represented in the matching image. Each item in the item catalog can be associated with a product category and the association can be used to generate a categorization tree. In the situation where it is determined that label information is available, each image can be analyzed using at least one classifier to identify 413 label information for a respective image. Approaches to analyzing data to label such data are well known and/or otherwise described herein and will not be discussed in regard to this step.
(33) The labeled images from both paths can be combined to determine 412 an initial set of categories. The initial set of categories can be based on metadata, historical action data, label information, or other data associated with the query image and/or matched image determined in the matching process, for example. The initial set of categories can include a set of higher-level categories and subcategories of a respective higher-level category. The historical interaction data can include, for example, at least one of view data, product data, consumption data, search data, purchase data, or interaction data. A categorization tree can be generated 414 that includes at least the set of higher-level categories and subcategories of a respective higher-level category. A classification threshold can be assigned 416 to each level of the categorization tree. The threshold can be based on, for example, historical interaction data, category types of a level of the categorization tree, view data, product data, consumption data, search data, purchase data, interaction data. Any number of algorithms can be used to determine a threshold, as may include probabilistic algorithms, predictive algorithms, machine learning algorithms, etc.
Therefore, in view of Ravichandran, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of initiating generation of the text feature data and generation of the image feature data based on a determination that the category is a beauty product category, incorporated in the device of Yim, as modified by Bengio and Jindal, in order to present more accurate recommendations of items that are of interest to the user, which reduces the time a user spends searching for items (as stated in Ravichandran col. 10, ll. 53-67).
Re claim 18: However, Yim fails to specifically teach the features of the non-transitory computer-readable storage medium of claim 16, further comprising: receiving, from the object detector neural network, a category associated with the packaged item; and based on a determination that the category is a beauty product category.
However, this is well known in the art as evidenced by Jindal. Similar to the primary reference, Jindal discloses detecting an object with a neural network (same field of endeavor or reasonably pertinent to the problem).
Jindal discloses further comprising: receiving, from the object detector neural network, a category associated with the packaged item; and based on a determination that the category is a product category (e.g. a neural network is used to receive captured data and categorize the item within the captured scene, which is taught in ¶ [49]. The classification of the product can yield a label that is associated with the identified category.).
Therefore, in view of Jindal, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of further comprising: receiving, from the object detector neural network, a category associated with the packaged item; and based on a determination that the category is a product category, incorporated in the device of Yim, in order to categorize an identified product when the image is input into a neural network, which can aid in rapidly locating the object (as stated in Jindal ¶ [18]).
However, the combination above fails to specifically teach the features of initiating generation of the text feature data and generation of the image feature data based on a determination that the category is a beauty product category.
However, this is well known in the art as evidenced by Ravichandran. Similar to the primary reference, Ravichandran discloses using a classifier algorithm to classify a product (same field of endeavor or reasonably pertinent to the problem).
Ravichandran discloses initiating generation of the text feature data and generation of the image feature data based on a determination that the category is a beauty product category (e.g. the system discloses determining a classification of Beauty for a scanned object, which is taught in col. 3, ll. 50-col. 4, ll. 41. Based on a category being determined, a string of characters that are considered descriptors are referenced as well as the image associated with the string of characters. However, if the descriptors and images are not associated with the category, these can be generated, which is taught in col. 5, ll. 19-47 and col. 6, ll. 26-39. When creating a categorization tree, the character string and images associated with the category can be used to generate a category ID, or item ID, and an image ID for a category. This occurs after identifying categories, which is taught in col. 9, ll. 1-65. Applying these steps after determining a Beauty product to the combination above would result in more accurate searching or recommendations to a user.).
Therefore, in view of Ravichandran, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of initiating generation of the text feature data and generation of the image feature data based on a determination that the category is a beauty product category, incorporated in the device of Yim, as modified by Bengio and Jindal, in order to present more accurate recommendations of items that are of interest to the user, which reduces the time a user spends searching for items (as stated in Ravichandran col. 10, ll. 53-67).
Claim(s) 5 and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio, as applied to claim 1 above, and further in view of Yang (US Pub 2023/0087587) and Kim (US Pub 2023/0206298).
Re claim 5: However, Yim fails to specifically teach the features of the system of claim 1, wherein the text feature data is generated using an optical character recognition (OCR) neural network.
However, this is well known in the art as evidenced by Yang. Similar to the primary reference, Yang discloses OCR using a CNN (same field of endeavor or reasonably pertinent to the problem).
Yang discloses wherein the text feature data is generated using an optical character recognition (OCR) neural network (e.g. the invention discloses a CNN used to perform the operation of determining text features, which is taught in ¶ [72]-[74].).
End-to-End OCR Dataflow
[0072] In end-to-end OCR dataflow, identifier images are fed into a text detector to localize text in the identifier image. In some embodiments, the text detector not only detects text bounding boxes, but also orients the detected bounding boxes. Detected text bounding boxes are then fed into a convolutional neural network (CNN) followed by a connectionist temporal classification (CTC) decoder according to some embodiments. After the texts is decoded, they are queried to an identifier 12 database to determine the correct category of the merchandise item 10. FIG. 13 shows end-to-end OCR dataflow in the identifier OCR module in accordance with some embodiments. In some embodiments the system is configured to execute one or more steps including: localizing, using a text detector, text in the localized portion of the image containing the identifier 12; rotating the localized text to a predetermined orientation; extracting one or more features of the text using a convolutional neural network (CNN); generating, using a connectionist temporal classification (CTC), an output distribution over all possible text outputs. The steps may also include inferring, from the output distribution, a likely output and identifying the text defining the identifier 12 by: collapsing, in the likely output, any repeats; and removing, in the likely output, any blank symbols.
Text Detection
[0073] From detected text bounding boxes a CNN+CTC Decoder is designed to predict the exact text content (e.g., numbers). For a visual recognition task, a CNN is used due to its good tolerance on distortions and noises which are often cases in identifier images. FIG. 14 shows a CNN backbone used by the system to extract text image features according to some embodiments. CNN also provides the ability to extract high level features: those bounding boxed text images will go through several convolutional layers with pooling padding and some activations to get final calculated feature maps with shape of [b, h′ , w′, c], where b is batch size of inputs, h′, w′ denotes the height and width after some striding and pooling operations in previous convolutional neural network, and c is the number of output channels. In some embodiments, the CNN is a regular backbone (e.g. resnet50, shufflenet v2, mobilenet v3) which turns a batch of images into a set of high level feature maps. In some embodiments, the extracted feature maps are sent to CTC Decoder as high level semantic features for further prediction.
[0074] Since text area size in image varies and identifier text length varies the system is configured to implement a connectionist temporal classification (CTC) decoder to avoid the problem of lacking an accurate alignment of image and text content in some embodiments. FIG. 15 illustrates CTC implementation by the system according to some embodiments. For a given bounding boxed text image, the CTC algorithm gives an output distribution over all possible text outputs instead of directly predicting the numbers since traditional decoders can only predict a fixed length output. In some embodiments, the system is configured to use this distribution either to infer a likely output or to assess the probability of a given output. FIG. 16 depicts CTC in inference according to some embodiments. The feature maps generated by the previous convolutional neural network are reshaped and fed into a dense layer. An output of dense layer is then be reshaped back to [b, t, C], where b is the batch size of inputs, t is pre-defined CTC timestamp (e.g., 8 in FIG. 16), and C is the number of categories (possible digits and a blank symbol, e.g., 11 in FIG. 16). The dense layer output which gives the probabilistic distribution of each character at each CTC timestamp is then decoded by a CTC decoder.
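For illustration only, the inference step described in Yang ¶ [0072] and [0074] (inferring a likely output, collapsing repeats, and removing blank symbols) can be sketched as follows. This is a minimal sketch and not Yang's implementation: it assumes greedy (best-path) decoding, assumes that classes 0-9 are digits and class 10 is the blank symbol (C = 11), and uses t = 8 timestamps as in the FIG. 16 example. The function names and the sample input are hypothetical.

```python
from itertools import groupby

BLANK = 10  # assumption: digits occupy indices 0-9, the blank symbol is index 10

def ctc_greedy_decode(probs):
    """probs is a list of t rows, each holding C class probabilities."""
    # Take the most probable class at each CTC timestamp.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in probs]
    # Collapse consecutive repeats, then remove blank symbols.
    collapsed = [k for k, _ in groupby(best_path)]
    return "".join(str(k) for k in collapsed if k != BLANK)

def one_hot(index, size=11):
    # Helper that builds a hypothetical, fully confident distribution.
    return [1.0 if i == index else 0.0 for i in range(size)]

# Best path "1 1 _ 1 2 2 _ _" over t = 8 timestamps decodes to "112".
example = [one_hot(i) for i in [1, 1, 10, 1, 2, 2, 10, 10]]
print(ctc_greedy_decode(example))
```

The blank between the two runs of 1 is what allows a repeated digit to survive the collapse step, which is why variable-length identifiers can be decoded from a fixed number of timestamps.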
Therefore, in view of Yang, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of wherein the text feature data is generated using an optical character recognition (OCR) neural network, incorporated in the device of Yim, in order to utilize a CNN to perform text feature extraction, which improves scalability over traditional approaches (as stated in Yang ¶ [71]).
However, the combination above fails to specifically teach the features of and image feature data is generated using an image encoder neural network.
However, this is well known in the art as evidenced by Kim. Similar to the primary reference, Kim discloses neural networks to extract text and image features (same field of endeavor or reasonably pertinent to the problem).
Kim discloses and image feature data is generated using an image encoder neural network (e.g. the system discloses an encoder neural network that is used to extract image features from an image, which is taught in ¶ [68]-[70].).
[0068] The first and second sub-encoder models receive the image related to the product as an input, but output feature information of different modalities. The image related to the product may be divided into a first element and a second element, based on types of elements included in the image. For example, the first element may be an image element, and the second element may be a text element. The image related to the product may be divided into the image element and the text element, considering properties of an image area and a text area.
[0069] The first sub-encoder model may extract a first feature from the first element divided from the image, and encode the extracted first feature as first feature information. The first sub-encoder model may have a form of a neural network including a plurality of layers. For example, the first feature information may be image feature information. The image feature information may be information indicating features about the appearance of shapes, patterns, colors, and the like, identified from an image.
[0070] The second sub-encoder model may extract a second feature from the second element divided from the image, and encode the extracted second feature as second feature information. The second sub-encoder model may have a form of a neural network including a plurality of layers. For example, the second feature information may be text feature information. The text feature information may be information indicating features about the meaning of characters, numbers, symbols, and the like, identified from text.
Therefore, in view of Kim, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of and image feature data is generated using an image encoder neural network, incorporated in the device of Yim, as modified by Bengio, in order to use an encoder to identify image features of a product, which can aid in product search and identification on a device (as stated in Kim ¶ [04]).
Re claim 13: However, Yim fails to specifically teach the features of the method of claim 10, wherein the text feature data is generated using an optical character recognition (OCR) neural network.
However, this is well known in the art as evidenced by Yang. Similar to the primary reference, Yang discloses OCR using a CNN (same field of endeavor or reasonably pertinent to the problem).
Yang discloses wherein the text feature data is generated using an optical character recognition (OCR) neural network (e.g. the invention discloses a CNN used to perform the operation of determining text features, which is taught in ¶ [72]-[74] above.).
Therefore, in view of Yang, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of wherein the text feature data is generated using an optical character recognition (OCR) neural network, incorporated in the device of Yim, in order to utilize a CNN to perform text feature extraction, which improves scalability over traditional approaches (as stated in Yang ¶ [71]).
However, the combination above fails to specifically teach the features of and image feature data is generated using an image encoder neural network.
However, this is well known in the art as evidenced by Kim. Similar to the primary reference, Kim discloses neural networks to extract text and image features (same field of endeavor or reasonably pertinent to the problem).
Kim discloses and image feature data is generated using an image encoder neural network (e.g. the system discloses an encoder neural network that is used to extract image features from an image, which is taught in ¶ [68]-[70] above.).
Therefore, in view of Kim, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of and image feature data is generated using an image encoder neural network, incorporated in the device of Yim, as modified by Bengio, in order to use an encoder to identify image features of a product, which can aid in product search and identification on a device (as stated in Kim ¶ [04]).
Claim(s) 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio, as applied to claim 15 above, and further in view of Yang (US Pub 2023/0087587).
Re claim 19: However, Yim fails to specifically teach the features of the non-transitory computer-readable storage medium of claim 15, wherein the text feature data is generated using an optical character recognition (OCR) neural network.
However, this is well known in the art as evidenced by Yang. Similar to the primary reference, Yang discloses OCR using a CNN (same field of endeavor or reasonably pertinent to the problem).
Yang discloses wherein the text feature data is generated using an optical character recognition (OCR) neural network (e.g. the invention discloses a CNN used to perform the operation of determining text features, which is taught in ¶ [72]-[74] above.).
Therefore, in view of Yang, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of wherein the text feature data is generated using an optical character recognition (OCR) neural network, incorporated in the device of Yim, in order to utilize a CNN to perform text feature extraction, which improves scalability over traditional approaches (as stated in Yang ¶ [71]).
Claim(s) 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio, as applied to claim 10 above, and further in view of Tendulkar.
Re claim 14: However, Yim fails to specifically teach the features of the method of claim 10, wherein the computing device is a head-wearable apparatus.
However, this is well known in the art as evidenced by Tendulkar. Similar to the primary reference, Tendulkar discloses using a head mounted device for capturing data (same field of endeavor or reasonably pertinent to the problem).
Tendulkar discloses wherein the computing device is a head-wearable apparatus (e.g. the invention discloses capturing a product using smart glasses, which is taught in ¶ [231].).
[0231] In some other embodiments of these embodiments, a lookup service may be performed by a mobile app installed on a user’s mobile computing device (e.g., a client’s or prospective client’s mobile phone or tablet, a beauty advisor’s mobile phone or tablet, a wearable artificial intelligence hardware such as a virtual/augmented/extended reality or VR/AR/XR goggles, smart glasses, and others. having an image capturing device) Each of the aforementioned store digital app or the mobile app, or a combination, may function individually and independently in some embodiments or may be connected (e.g., via a cellular or wired network) to a server owned and operated by a cosmetic product manufacturer in some other embodiments.
Therefore, in view of Tendulkar, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of wherein the computing device is a head-wearable apparatus, incorporated in the device of Yim, as modified by Bengio, in order to utilize smart glasses to capture a product, which can aid a user in selecting a product that is suitable for their personal care (as stated in Tendulkar ¶ [07]-[09]).
Claim(s) 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio, as applied to claim 15 above, and further in view of Dong (US Pub 2024/0362267).
Re claim 20: However, Yim fails to specifically teach the features of the non-transitory computer-readable storage medium of claim 15, wherein the image feature data is generated using a text and image encoder neural network.
However, this is well known in the art as evidenced by Dong. Similar to the primary reference, Dong discloses generating an image with a neural network (same field of endeavor or reasonably pertinent to the problem).
Dong discloses wherein the image feature data is generated using a text and image encoder neural network (e.g. image feature data is generated using a variational autoencoder that can generate an image from text, which is taught in ¶ [38] above.).
Therefore, in view of Dong, it would have been obvious to one of ordinary skill before the effective filing date of the claimed invention to have the feature of wherein the image feature data is generated using a text and image encoder neural network, incorporated in the device of Yim, in order to generate image feature data using an encoder, which can enhance search accuracy with multi-modality (as stated in Dong ¶ [21]).
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yim, as modified by Bengio, as applied to claim 1 above, and further in view of Delgado (US Pub 2025/0005947).
Re claim 7: However, Yim fails to specifically teach the features of the system of claim 1, wherein generating the first ranked set comprises applying a term frequency - inverse document frequency (TFIDF) calculation on the generated text feature data.
However, this is well known in the art as evidenced by Delgado. Similar to the primary reference, Delgado discloses ranking a set of products (same field of endeavor or reasonably pertinent to the problem).
Delgado discloses wherein generating the first ranked set comprises applying a term frequency - inverse document frequency (TFIDF) calculation on the generated text feature data (e.g. for generating a ranked list, the TFIDF is used for the list, which is taught in ¶ [37] and [102].).
[0037] The grid facilitates improved recognition accuracy by enabling example product recognition disclosed herein to take advantage of similarities between the product text (e.g., extracted word(s)) and the reference text (e.g., reference word(s)). In particular, disclosed examples evaluate text similarity between the product text depicted in the product frame and reference text of each product candidate predicted for the product frame. The product text is compared with respective reference text of each product candidate to generate a respective text-based confidence score for each product candidate. The generating of the text-based confidence scores can be performed using, for example, a text similarity metric(s) (e.g., a Levenshtein distance and/or another string similarity metric) and/or a sorting algorithm based on weight (e.g., Term Frequency-Inverse Document Frequency (TF-IDF) and/or another numerical statistic). However, it is understood that examples disclosed are not limited thereto. Rather, different techniques can be used in additional or alternative examples, such that a degree of closeness of between pieces of text can be determined (e.g., Word2Vec, smooth inverse frequency, cosine similarity, etc.).
[0102] The rank adjuster circuitry 222 determines a respective text similarity score for each reference BOW based on respective string distances of its matched reference words. For example, upon identifying one or more matched words in the reference BOWs of the given product frame, the rank adjuster circuitry 222 applies a respective first (e.g., frequency) weight to each matched reference word. In this example, the rank adjuster circuitry 222 determines a first weight for a given matched reference word based on a respective term frequency-inverse document frequency (TF-IDF) score (e.g., value, weight, etc.), which is a statistical value indicative of how “important” a word is to a collection or corpus. The TD-IDF is useful for reducing a weight of a word that is common within the reference BOWs for the product frame. In other words, the TF-IDF is used to adjust for the fact that some words appear more frequently in general, but may not be relevant or meaningful. For example, a brand name appearing a product is likely to appear in each product description, thus lowering its relevance relative to other words. On the other hand, a particular flavor may appear in one product description, giving it more weight relative to other words.
Therefore, in view of Delgado, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have incorporated, in the device of Yim, the feature wherein generating the first ranked set comprises applying a term frequency - inverse document frequency (TFIDF) calculation on the generated text feature data, in order to use term frequency-inverse document frequency scores to weight words when adjusting a rank, which aids in making items in a list more relevant to the search (as stated in Delgado ¶ [102]).
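For illustration only, the TF-IDF weighting described in Delgado ¶ [102] can be sketched as follows. This sketch is not taken from Delgado, Yim, or the claims. The function names, the smoothed IDF formula, and the sample product words are all assumptions chosen to show how a word common to every reference bag of words (e.g., a brand name) receives less weight than a word unique to one product (e.g., a flavor).

```python
import math
from collections import Counter


def idf_weights(reference_bows):
    """Compute an inverse document frequency weight for every reference word.

    reference_bows maps a candidate name to its list of reference words.
    The smoothed formula below is an assumption of this sketch.
    """
    n_docs = len(reference_bows)
    doc_freq = Counter()
    for bow in reference_bows.values():
        doc_freq.update(set(bow))  # count each word once per candidate
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def rank_candidates(product_words, reference_bows):
    """Rank candidates by the summed TF-IDF weight of their matched words."""
    idf = idf_weights(reference_bows)
    query = set(product_words)
    scores = {}
    for name, bow in reference_bows.items():
        tf = Counter(bow)
        total = len(bow)
        # Only words present in both the product text and the reference text count.
        scores[name] = sum((tf[w] / total) * idf[w] for w in query if w in tf)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: the brand words appear in every candidate, the flavor in one.
reference_bows = {
    "acme_cola_cherry": ["acme", "cola", "cherry"],
    "acme_cola_vanilla": ["acme", "cola", "vanilla"],
    "acme_cola_classic": ["acme", "cola", "classic"],
}
idf = idf_weights(reference_bows)
ranked = rank_candidates(["acme", "cola", "cherry"], reference_bows)
print(ranked)
```

In this hypothetical data set, the shared words "acme" and "cola" contribute equally to every candidate, so the ordering is decided by the flavor word, which is the behavior the quoted paragraph attributes to TF-IDF.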
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Hsiao discloses a category of beauty to set for content items.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHAD S DICKERSON whose telephone number is (571)270-1351. The examiner can normally be reached Monday-Friday 10AM-6PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abderrahim Merouan, can be reached at 571-270-5254. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CHAD DICKERSON/ Primary Examiner, Art Unit 2682