DETAILED ACTION
Notice of Pre-AIA or AIA Status
Claims 1-18 are pending in this application. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhang et al. (US PGPub 2009/0116698 A1, published May 7, 2009), hereinafter referred to as “Zhang”, in view of Chen et al. (US PGPub 2020/0320769 A1, published October 8, 2020), hereinafter referred to as “Chen”.
Consider Claims 1, 7 and 13.
Zhang teaches:
1. A processor implemented method, comprising: / 7. A system, comprising: one or more hardware processors; a communication interface; and a memory storing a plurality of instructions, wherein the plurality of instructions when executed, cause the one or more hardware processors to: / 13. One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause: (Zhang: abstract, One embodiment of the present invention provides a system for recognizing and classifying clothes. During operation, the system captures at least one image of a clothing item. The system further determines a region on the captured image which corresponds to a torso and/or limbs. The system also determines at least one color composition, texture composition, collar configuration, and sleeve configuration of the clothing item. Additionally, the system classifies the clothing item into at least one category based on the determined color composition, texture composition, collar configuration, and sleeve configuration. The system then produces a result which indicates the classification. [0032]-[0038], Figure 1, [0064]-[0065], Figure 8)
1. receiving, via one or more hardware processors, a plurality of product images as input data; / 7. receive a plurality of product images as input data; / 13. receiving a plurality of product images as input data; (Zhang: [0064] FIG. 8 illustrates an exemplary computer system that facilitates a clothes-recognition system in accordance with one embodiment of the present invention. Computer system 802 includes a processor 804, a memory 806, and a storage device 808. Computer system 802 is coupled to a display 801 and a camera 803. [0065] Storage device 808 stores code for an operating system 816, as well as applications 820 and 822. Also included in storage device 808 are clothes-recognition applications 818. During operation, clothes-recognition applications are loaded into memory 806. When processor 804 executes the corresponding code stored in memory 806, processor 804 performs the aforementioned analysis to the images captured by camera 803 and displays the matching clothing items on display 801. [0038] FIG. 1 illustrates exemplary modules of a clothes-recognition system in accordance with one embodiment of the present invention. The system can perform a number of image-processing and pattern-recognition functions as part of the color, texture, and pattern analysis. These functions include collar recognition, sleeve recognition, trousers-length recognition, belt recognition, button detection, and demographic recognition. Clothes Detection [0039] In order to recognize the clothes, such as shirts, which the customer is wearing, the system first detects the location of the clothes within an image. When the system is given an image of a person wearing clothes, the detection of shirts is equivalent to the detection of the torso part of the human body. In the fitting room of a clothing retail store, shoppers, wearing clothes, typically stand upright in front of a mirror. 
The present clothes-detection system captures a relatively large torso region from the person's image.)
1.generating, via the one or more hardware processors, a plurality of positive images for each image determined as a query image from the plurality of product images, wherein each of the plurality of positive images for the query image is similar to the query image in terms of a plurality of characteristics with at least one of a plurality of characteristics being different for the query image and the plurality of positive images; / 7. generate a plurality of positive images for each image determined as a query image from the plurality of product images, wherein each of the plurality of positive images for the query image is similar to the query image in terms of a plurality of characteristics with at least one of a plurality of characteristics being different for the query image and the plurality of positive images; / 13. generating a plurality of positive images for each image determined as a query image from the plurality of product images, wherein each of the plurality of positive images for the query image is similar to the query image in terms of a plurality of characteristics with at least one of a plurality of characteristics being different for the query image and the plurality of positive images; (Zhang: [0041] In one embodiment, the system separates the background from the foreground so that a contour of the person's body can be identified. FIG. 2A illustrates an exemplary cleaned foreground map in accordance with one embodiment of the present invention. Given the cleaned foreground map, the system then applies a bounding box, which is represented by the blue box in FIG. 2B, to the person's body. The system then extracts the approximate torso part, which is represented by the green box in FIG. 2B, using heuristic ratios within the bounding box. This clothes-detection mechanism is sufficiently robust to different clothes localizations and the recognition results have proven satisfactory using this segmentation method. 
Clothes Matching Based on Color and Texture [0042] In one embodiment, the system uses color information for clothes matching. During operation, the system computes a color histogram in Red, Green, and Blue (RGB) channels from the segmented torso part. The system then compares the histogram with the histograms of other clothing items. The system further measures the similarity between two pieces of clothes by applying the χ2 test between two histograms. The details of χ2 tests can be found in Chernoff H, Lehmann E. L., “The use of maximum likelihood estimates in χ2 tests for goodness-of-fit,” The Annals of Mathematical Statistics 1954; 25:579-586, which is incorporated by reference herein. [0043] The system then retrieves the most similar and/or the most dissimilar clothes from the same category and display their images to the person for comparison. FIG. 3A illustrates a set of exemplary clothes-retrieval results based on color matching in accordance with one embodiment of the present invention. [0044] Besides color, clothing texture is also identified as a significant cue for clothes recognition due to its connection with fabric and pattern. In order to explore color and texture information simultaneously for clothes recognition, the system employs an “Eigen-Patch” approach. [0045] In the Eigen-Patch approach, instead of building histograms on the RGB values on each pixel, the system crops overlapping small image patches within the torso region and represents each patch by a multi-dimensional vector. In one embodiment, all the patches from all the clothes are stacked. The system then performs a Principal Component Analysis (PCA) to the feature stack to reduce the feature dimension and extract the most significant features from the clothes. PCA is a mathematical tool for statistical pattern recognition and its details are described in Fukunaga, K, “Introduction to Statistical Pattern Recognition,” Elsevier 1990, which is incorporated by reference herein.
[0046] The system then projects the small patches to the first k principal components (referred to as “eigen patches”) which are obtained from the PCA.)
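For illustration only (not code from either reference), the color-histogram matching Zhang describes in [0042] can be sketched as follows; the bin count, the histogram normalization, and the toy image are assumptions of this sketch:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel RGB histogram of an image region, normalized to sum to 1."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-squared (χ2) statistic between two normalized histograms; smaller
    values indicate more similar color compositions."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

# Toy segmented torso region: identical regions score (near) zero distance.
rng = np.random.default_rng(0)
torso = rng.integers(0, 256, size=(32, 32, 3))
```

Retrieval as in Zhang's FIG. 3A would then amount to sorting a gallery of clothing items by this distance to the query region's histogram.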
1.and generating a plurality of triplets, via the one or more hardware processors, wherein each of the plurality of triplets wherein the plurality of triplets form a training data. / 7. and generate a plurality of triplets, wherein each of the plurality of triplets wherein the plurality of triplets form a training data. / 13. and generating a plurality of triplets, wherein each of the plurality of triplets wherein the plurality of triplets form a training data. (Zhang: [0045]-[0046], The system then compares the histogram with all the histograms of other clothing items based on χ2 test to find similar and dissimilar clothes. FIG. 3B illustrates a set of exemplary clothes-retrieval results based on eigen-patch analysis in accordance with one embodiment of the present invention. Collar Recognition [0047] In one embodiment, the system uses a supervised learning algorithm to classify the clothes into different categories. In general, a collar on a shirt is an important cue to discriminate between formal shirts (e.g., dress shirts and polo shirts) and casual shirts (e.g., t-shirts and sweaters). Although it is very easy for human eyes to determine the existence of collar, recognizing it automatically from a camera is not a trivial problem. [0048]-[0053], Sleeve Recognition [0054] Sleeve length is another important factor for clothes recognition. It is also mentioned in the Wikipedia definition for “shirt” as a significant cue to discriminate between polo-shirts, T-shirts, sweat shirts (short-sleeved or none-sleeve) from dress shirts or jackets (long-sleeved). In order to recognize these two categories, it is assumed that long-sleeved clothes usually expose less skin area on arms than short-sleeved or none-sleeved clothes do. 
In one embodiment, the sleeve-recognition is divided into two sub-problems: skin detection and sleeve classification.[0055]-[0057] Next, for every pixel x in the rough arm area (right and left side of the upper body), a small patch p(x) of size 5×5 centered at x is extracted. x is identified as a skin pixel only if the following two conditions are true: [0058] 1. Patch p(x) is coherent in color. That is, the variance of RGB values within p(x) is smaller than a threshold. This is to prevent false detections from skin-like colors in sleeves. [0059] 2. The minimal Mahalanobis distance from the mean of the RGB values within p(x) to the two face pixel clusters is smaller than threshold ts. The skin detection results using ts=5 is shown in light blue areas in FIGS. 5A and 5B. [0060] After skin detection, the sleeve length is approximated by the number of skin pixels detected in the arms. A Decision Stump is learned on these features to recognize the sleeve lengths.)
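Zhang's two-condition skin-pixel test ([0058]-[0059]) can be sketched as below; this is an illustrative sketch in which the variance threshold and the use of a single face-pixel cluster (Zhang compares against two face pixel clusters) are simplifying assumptions:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance from vector x to a Gaussian cluster (mean, cov)."""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def is_skin_pixel(patch, face_mean, face_cov, var_thresh=200.0, ts=5.0):
    """Two-condition skin test on a 5x5 RGB patch p(x):
    1) the patch is coherent in color (per-channel RGB variance below a
       threshold, to reject skin-like colors printed on sleeves), and
    2) the patch's mean RGB lies within Mahalanobis distance ts of the
       face-pixel cluster (Zhang reports ts=5)."""
    pixels = patch.reshape(-1, 3).astype(float)
    if pixels.var(axis=0).max() > var_thresh:
        return False
    return mahalanobis(pixels.mean(axis=0), face_mean, face_cov) < ts
```

Counting the pixels that pass this test in the arm regions then approximates sleeve length, as in Zhang's [0060].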
Zhang does not teach:
1. generating, via the one or more hardware processors, a plurality of negative images for the query image; / 7. generate a plurality of negative images for the query image; / 13. generating a plurality of negative images for the query image;
1/7/13. triplets comprises a query image, a selected positive image from the plurality of positive images, a selected negative image from the plurality of negative images,
Chen teaches:
1. A processor implemented method, comprising: / 7. A system, comprising: one or more hardware processors; a communication interface; and a memory storing a plurality of instructions, wherein the plurality of instructions when executed, cause the one or more hardware processors to: / 13. One or more non-transitory machine-readable information storage mediums comprising one or more instructions which when executed by one or more hardware processors cause: (Chen: abstract, There is provided a computer implemented method for predicting garment or accessory attributes using deep learning techniques, comprising the steps of: (i) receiving and storing one or more digital image datasets including images of garments or accessories; (ii) training a deep model for garment or accessory attribute identification, using the stored one or more digital image datasets, by configuring a deep neural network model to predict (a) multiple-class discrete attributes; (b) binary discrete attributes, and (c) continuous attributes, (iii) receiving one or more digital images of a garment or an accessory, and (iv) extracting attributes of the garment or the accessory from the one or more received digital images using the trained deep model for garment or accessory attribute identification. A related system is also provided. [0011]-[0026])
1. receiving, via one or more hardware processors, a plurality of product images as input data; / 7. receive a plurality of product images as input data; / 13. receiving a plurality of product images as input data; (Chen: [0011] According to a first aspect of the invention, there is provided a computer implemented method for predicting garment or accessory attributes using deep learning techniques, comprising the steps of: [0012] (i) receiving and storing one or more digital image datasets including images of garments or accessories; [0013] (ii) training a deep model for garment or accessory attribute identification, using the stored one or more digital image datasets, by configuring a deep neural network model to predict (a) multiple-class discrete attributes; (b) binary discrete attributes, and (c) continuous attributes, [0017]
(iii) receiving one or more digital images of a garment or an accessory, and [0018] (iv) extracting attributes of the garment or the accessory from the one or more received digital images using the trained deep model for garment or accessory attribute identification.)
1.generating, via the one or more hardware processors, a plurality of positive images for each image determined as a query image from the plurality of product images, wherein each of the plurality of positive images for the query image is similar to the query image in terms of a plurality of characteristics with at least one of a plurality of characteristics being different for the query image and the plurality of positive images; / 7. generate a plurality of positive images for each image determined as a query image from the plurality of product images, wherein each of the plurality of positive images for the query image is similar to the query image in terms of a plurality of characteristics with at least one of a plurality of characteristics being different for the query image and the plurality of positive images; / 13. generating a plurality of positive images for each image determined as a query image from the plurality of product images, wherein each of the plurality of positive images for the query image is similar to the query image in terms of a plurality of characteristics with at least one of a plurality of characteristics being different for the query image and the plurality of positive images; (Chen: [0067] The method may be one further including a computer-implemented method to evaluate the level of photo-realism of synthetic renders of body images against real photos. [0068] The method may be one which includes the steps of: [0069] i) collecting one or more real photos and one or more synthetic rendered images as positive and negative samples, [0070] ii) training a machine learning model to generate a difference image, [0071] iii) using the machine learning model to generate a difference image, [0072] iv) superposing the difference images onto the input synthetic rendered image to generate a more photo-realistic synthetic image. [0073] The method may be one wherein the machine learning model is a deep neural network. 
[0081] The method may be one in which the triplet-loss objective function is a cost function of an optimization problem that can enforce distance constraints among positive and negative sample pairs. [0298] To learn a similarity embedding with the aforementioned desired behavior we adopt the triplet loss (J. Huang, R. S. Feris, Q. Chen, and S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, In Proceedings of the IEEE International Conference on Computer Vision, pages 1062-1070, 2015) as the cost function of the optimization problem that can enforce distance constraints among positive and negative sample pairs. For a training sample i, we denote its feature (i.e. the output from the convolutional layers) as xi. Then, from the same training set, we select a different image of the same item as the positive sample (here denoting its corresponding feature vector as xi +), and an image of a randomly-selected different item as a negative sample (denoting its corresponding feature vector as xi −). This forms a sample triplet (xi, xi +, xi −).)
1. generating, via the one or more hardware processors, a plurality of negative images for the query image; / 7. generate a plurality of negative images for the query image; / 13. generating a plurality of negative images for the query image; (Chen: [0081] The method may be one in which the triplet-loss objective function is a cost function of an optimization problem that can enforce distance constraints among positive and negative sample pairs. [0274], [0298] To learn a similarity embedding with the aforementioned desired behavior we adopt the triplet loss (J. Huang, R. S. Feris, Q. Chen, and S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, In Proceedings of the IEEE International Conference on Computer Vision, pages 1062-1070, 2015) as the cost function of the optimization problem that can enforce distance constraints among positive and negative sample pairs.).
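The triplet-loss objective Chen relies on can be sketched as a hinge loss over squared L2 distances between feature vectors; the margin value below is an assumption of this sketch, not taken from Chen:

```python
import numpy as np

def triplet_loss(x, x_pos, x_neg, margin=0.2):
    """Hinge triplet loss on feature vectors: the triplet incurs a penalty
    unless the query-to-negative distance exceeds the query-to-positive
    distance by at least `margin` (squared L2 distances)."""
    d_pos = float(np.sum((x - x_pos) ** 2))
    d_neg = float(np.sum((x - x_neg) ** 2))
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this loss over many triplets (x_i, x_i+, x_i-) enforces exactly the distance constraint among positive and negative sample pairs that Chen's [0298] describes.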
1.and generating a plurality of triplets, via the one or more hardware processors, wherein each of the plurality of triplets comprises a query image, a selected positive image from the plurality of positive images, a selected negative image from the plurality of negative images, wherein the plurality of triplets form a training data. / 7. and generate a plurality of triplets, wherein each of the plurality of triplets comprises a query image, a selected positive image from the plurality of positive images, a selected negative image from the plurality of negative images, wherein the plurality of triplets form a training data. / 13. and generating a plurality of triplets, wherein each of the plurality of triplets comprises a query image, a selected positive image from the plurality of positive images, a selected negative image from the plurality of negative images, wherein the plurality of triplets form a training data. (Chen: [0081] The method may be one in which the triplet-loss objective function is a cost function of an optimization problem that can enforce distance constraints among positive and negative sample pairs. [0298] To learn a similarity embedding with the aforementioned desired behavior we adopt the triplet loss (J. Huang, R. S. Feris, Q. Chen, and S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, In Proceedings of the IEEE International Conference on Computer Vision, pages 1062-1070, 2015) as the cost function of the optimization problem that can enforce distance constraints among positive and negative sample pairs. For a training sample i, we denote its feature (i.e. the output from the convolutional layers) as xi. Then, from the same training set, we select a different image of the same item as the positive sample (here denoting its corresponding feature vector as xi +), and an image of a randomly-selected different item as a negative sample (denoting its corresponding feature vector as xi −). 
This forms a sample triplet (xi, xi +, xi −).
[0299] FIG. 12 shows an illustration of an example of a deep network architecture usable for triplet similarity learning. The convolutional and pool layers in the diagram can accommodate an arbitrary recent architecture for image classification, e.g. VGG11/16/19, GoogLeNet. [0300] To train the deep neural network model for learning a triplet similarity metric, we adopt a three-way Siamese architecture to handle the 3-way parallel image inputs, as illustrated in FIG. 12, in which we first initialise the model weights with those of a pre-trained attributed classification model (as described in Section 2) and apply weight sharing for all the convolutional layers, and we then retrain the last fully-connected layer while fine-tuning the earlier convolutional layers at a lower learning rate for the similarity learning. By doing so, the query image, the positive sample image, and the negative sample image in a triplet all pass through the same network for visual feature evaluation. For the training data, we rearrange the training data for attribute classification (as described in Section 2.1) into triplet groups and then perform data augmentation. For each possible pair of positive samples (xi,xi +) of sample i, we generate M=20 randomly selected negative sample pairs (xi,xi,m −), m=1,2, . . . ,M. [0301] In the prediction stage we simply evaluate the feature vectors of each image, by feeding it through the convolutional and fully-connected layers of the trained network. A “Feature Comparison & Ranking Module” (see FIG. 11) then models the similarity between the query and each gallery item I. The similarity score S of the query image and each gallery image can be defined by e.g. 1) computing the distance of their corresponding feature vectors in the visual feature space; or 2) counting the number of overlapping attributes or keywords predicted from the attribute classifier. In the implementation, we adopt the L2-distance metric (i.e. 
Euclidean distance) in the visual feature space to evaluate the similarity between samples as follows:)
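Chen's prediction stage (ranking gallery items by L2 distance in visual feature space, [0301]) and the triplet-group data augmentation (M=20 randomly selected negatives per positive pair, [0300]) can be sketched as follows; the feature vectors in the test are toy placeholders:

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery items by Euclidean (L2) distance between feature vectors,
    as in Chen's "Feature Comparison & Ranking Module"; a smaller distance
    means higher similarity to the query."""
    dists = np.linalg.norm(np.asarray(gallery_feats) - query_feat, axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists[order]

def make_triplets(query, positives, negatives, m=20, seed=0):
    """Rearrange training data into triplet groups: for each positive pair
    (query, positive), draw m randomly selected negatives (Chen uses m=20)."""
    rng = np.random.default_rng(seed)
    return [(query, pos, negatives[int(j)])
            for pos in positives
            for j in rng.integers(0, len(negatives), size=m)]
```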
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Zhang's method and system for intelligent fashion exploration using image analysis to leverage Chen's deep machine learning model that can be trained for garment prediction. The conclusion of obviousness is predicated upon the following findings. Zhang and Chen are directed to the same field of endeavor, namely image analysis of clothing articles and garments, and one skilled in the art would have been motivated to modify Zhang in order to use an improved deep machine learning model that achieves greater accuracy in the overall identification and prediction process by training on both negative and positive similarity measures. Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a fundamental operating principle of Zhang; the teaching of Chen continues to perform the same function as originally taught prior to being combined, producing the repeatable and predictable result of an automated trainable deep model that extracts properties and attributes for enhanced accuracy and customer satisfaction. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
Consider Claims 2, 8 and 14.
The combination of Zhang and Chen teaches:
2. The method of claim 1, wherein generating the plurality of positive images comprises: generating a first set of the plurality of positive images, comprising: identifying a plurality of key points in the query image; generating a plurality of segments by performing a segmentation of the query image based on a body part associated with each of the plurality of key points; and generating one or more variants of the query image by exchanging color channels of two or more of the plurality of segments, wherein the generated one or more variants of the query image forms the first set of the plurality of positive images; generating a second set of the plurality of positive images, comprising: extracting a plurality of color channels in the query image; dividing each of the plurality of color channels to a plurality of slices; altering color channel of one or more of the plurality of slices; and generating a plurality of synthetic positive images from the plurality of slices for which the color channels have been altered, wherein the plurality of synthetic positive images forms the second set of the plurality of positive images; and augmenting the first set of the plurality of positive images and the second set of the plurality of positive images to generate a final set of positive images. / 8. 
The system of claim 7, wherein the one or more hardware processors are configured to generate the plurality of positive images by: generating a first set of the plurality of positive images, comprising: identifying a plurality of key points in the query image; generating a plurality of segments by performing a segmentation of the query image based on a body part associated with each of the plurality of key points; and generating one or more variants of the query image by exchanging color channels of two or more of the plurality of segments; and generating a second set of the plurality of positive images, comprising: extracting a plurality of color channels in the query image; dividing each of the plurality of color channels to a plurality of slices; altering color channel of one or more of the plurality of slices; and generating a plurality of synthetic positive images from the plurality of slices for which the color channels have been altered, wherein the plurality of synthetic positive images forms the second set of the plurality of positive images. / 14. 
The one or more non-transitory machine-readable information storage mediums of claim 13, wherein generating the plurality of positive images comprises: generating a first set of the plurality of positive images, comprising: identifying a plurality of key points in the query image; generating a plurality of segments by performing a segmentation of the query image based on a body part associated with each of the plurality of key points; and generating one or more variants of the query image by exchanging color channels of two or more of the plurality of segments, wherein the generated one or more variants of the query image forms the first set of the plurality of positive images; generating a second set of the plurality of positive images, comprising: extracting a plurality of color channels in the query image; dividing each of the plurality of color channels to a plurality of slices; altering color channel of one or more of the plurality of slices; and generating a plurality of synthetic positive images from the plurality of slices for which the color channels have been altered, wherein the plurality of synthetic positive images forms the second set of the plurality of positive images; and augmenting the first set of the plurality of positive images and the second set of the plurality of positive images to generate a final set of positive images. (Zhang: [0024]-[0025], [0042] In one embodiment, the system uses color information for clothes matching. During operation, the system computes a color histogram in Red, Green, and Blue (RGB) channels from the segmented torso part. The system then compares the histogram with the histograms of other clothing items. The system further measures the similarity between two pieces of clothes by applying the χ2 test between two histograms. The details of χ2 tests can be found in Chernoff H, Lehmann E. L., “The use of maximum likelihood estimates in χ2 tests for goodness-of-fit,” The Annals of Mathematical Statistics 1954; 25:579-586, which is incorporated by reference herein. [0043] The system then retrieves the most similar and/or the most dissimilar clothes from the same category and display their images to the person for comparison. FIG. 3A illustrates a set of exemplary clothes-retrieval results based on color matching in accordance with one embodiment of the present invention. [0051] The Harris measure is an indicator of the “strength of cornerness” at point x, that is, how distinctive the corner is. After the system computes the Harris measure at each pixel within the neck region, the peak points are detected using non-maximal suppression with a radius r (in one embodiment, r=9). If the Harris measure at a peak point x is higher than a threshold tc, x is identified as a Harris corner point. The Harris corner detector is applied to each of the RGB channels. FIG. 4A illustrates a set of exemplary Harris corner points detected in the Red channel on non-collar clothes with tc=500 (left) and tc=2000 (right) in accordance with one embodiment of the present invention. Similarly, FIG. 4B illustrates a set of exemplary Harris corner points detected in the Red channel on clothes with collar with tc=500 (left) and tc=2000 (right) in accordance with one embodiment of the present invention. [0052] Similar to clothes detection, the neck part of the human body can be detected by segmenting within the bounding box of the human body (the green boxes shown in FIGS. 4A and 4B).
Then, based on our assumption, the system can determine the presence of collar based on the number of Harris corner points detected from all the channels within the neck part. Chen: [0259] The “Physics Analysis Module” is a deep neural network (DNN) model for fabric attribute prediction or regression, as described in Section 2, which analyzes the captured garment images in different phases of the motion and predicts the garment fabric properties and/or model parameters for garment physics simulation. Two network architecture options can be adopted to implement the module; in the first the captured images are merged into one single multi-channel image (assuming RGB images are used it will be of 3×(K+1) channels) and fed as the input of the “Physics Analysis Module”; the second is to use an attribute prediction network based on multiple images input, as illustrated in FIGS. 2 and 3 in Section 2.2.3. [0260] The output of the model can be 1) a multi-class label of fabric types of the garment (e.g. “cotton”, “silk”, “polyester”) and/or associated class probabilities, or 2) an array of decimal values of fabric parameters (e.g. Young's modulus, stress and strain, or model parameters of the garment physics engine used in the virtual fitting system). These predicted physics parameters are stored into a garment database together, as shown in FIG. 6, with all the original garment photos digitised from the garment samples.)
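The claimed generation of synthetic positive images by exchanging color channels (claims 2, 8 and 14) can be illustrated with the sketch below; it permutes whole-image channels and omits the claimed key-point segmentation, so it is a simplification for illustration rather than the claimed method itself:

```python
import numpy as np

def channel_swap_variants(image):
    """Generate synthetic 'positive' variants of a query image by permuting
    its RGB channels: garment shape and texture are preserved while at least
    one characteristic (color) differs, matching the claimed property of a
    positive image."""
    perms = [(0, 2, 1), (1, 0, 2), (2, 1, 0), (1, 2, 0), (2, 0, 1)]
    return [image[..., list(p)] for p in perms]
```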
Consider Claims 3, 9 and 15.
The combination of Zhang and Chen teaches:
3. The method of claim 1, wherein generating the plurality of negative images comprises: classifying the plurality of product images into a plurality of coarse class buckets; and processing the plurality of product images in the plurality of coarse class buckets using a pre-trained deep learning model, comprising: forming a long vector for each of the plurality of product images, wherein each of a plurality of bits in the long vector belongs to a particular filter of a particular layer among a plurality of layers of the deep learning model; generating a binary vector representation of each of the plurality of product images, based on value of the plurality of bits; identifying an exclusive bit in the binary vector representation of each of the plurality of product images; generating a combined binary vector for each of the plurality of coarse class buckets, by merging the identified exclusive bit in the binary vector representation of each of the plurality of product images in each of the plurality of coarse class buckets; generating a layer-wise filter index dictionary using the combined binary vector for the plurality of coarse class buckets, wherein the layer-wise filter index dictionary holds information on a subset of the plurality of layers of the deep learning model, that have a significant role in distinguishing the plurality of images belonging to the plurality of coarse class buckets; generating a distinguisher vector for each of the plurality of product images, based on the layer-wise filter index dictionary; and determining a similarity score representing similarity of the plurality of images, in each of the plurality of coarse class buckets, based on the distinguisher vector for each of the plurality of product images, wherein, based on the determined similarity score, a plurality of the images from the plurality of coarse class buckets are identified as the plurality of negative images. / 9. 
The system of claim 7, wherein the one or more hardware processors are configured to generate the plurality of negative images, by: classifying the plurality of product images into a plurality of coarse class buckets; and processing the plurality of product images in the plurality of coarse class buckets using a pre-trained deep learning model, comprising: forming a long vector for each of the plurality of product images, wherein each of a plurality of bits in the long vector belongs to a particular filter of a particular layer among a plurality of layers of the deep learning model; generating a binary vector representation of each of the plurality of product images, based on value of the plurality of bits; identifying an exclusive bit in the binary vector representation of each of the plurality of product images; generating a combined binary vector for each of the plurality of coarse class buckets, by merging the identified exclusive bit in the binary vector representation of each of the plurality of product images in each of the plurality of coarse class buckets; generating a layer-wise filter index dictionary using the combined binary vector for the plurality of coarse class buckets, wherein the layer-wise filter index dictionary holds information on a subset of the plurality of layers of the deep learning model, that have a significant role in distinguishing the plurality of images belonging to the plurality of coarse class buckets; generating a distinguisher vector for each of the plurality of product images, based on the layer-wise filter index dictionary; and determining a similarity score representing similarity of the plurality of images, in each of the plurality of coarse class buckets, based on the distinguisher vector for each of the plurality of product images, wherein, based on the determined similarity score, a plurality of the images from the plurality of coarse class buckets are identified as the plurality of negative images. / 15. 
The one or more non-transitory machine-readable information storage mediums of claim 13, wherein generating the plurality of negative images comprises: classifying the plurality of product images into a plurality of coarse class buckets; and processing the plurality of product images in the plurality of coarse class buckets using a pre-trained deep learning model, comprising: forming a long vector for each of the plurality of product images, wherein each of a plurality of bits in the long vector belongs to a particular filter of a particular layer among a plurality of layers of the deep learning model; generating a binary vector representation of each of the plurality of product images, based on value of the plurality of bits; identifying an exclusive bit in the binary vector representation of each of the plurality of product images; generating a combined binary vector for each of the plurality of coarse class buckets, by merging the identified exclusive bit in the binary vector representation of each of the plurality of product images in each of the plurality of coarse class buckets; generating a layer-wise filter index dictionary using the combined binary vector for the plurality of coarse class buckets, wherein the layer-wise filter index dictionary holds information on a subset of the plurality of layers of the deep learning model, that have a significant role in distinguishing the plurality of images belonging to the plurality of coarse class buckets; generating a distinguisher vector for each of the plurality of product images, based on the layer-wise filter index dictionary; and determining a similarity score representing similarity of the plurality of images, in each of the plurality of coarse class buckets, based on the distinguisher vector for each of the plurality of product images, wherein, based on the determined similarity score, a plurality of the images from the plurality of coarse class buckets are identified as the plurality of negative images. 
(Chen: [0051] The Harris measure is an indicator of the “strength of corneness” at point x, that is, how distinctive the corner is. After the system computes the Harris measure at each pixel within the neck region, the peak points are detected using non-maximal suppression with a radius r (in one embodiment, r=9). If the Harris measure at a peak point x is higher than a threshold tc, x is identified as a Harris corner point. The Harris corner detector is applied to each of the RGB channels. FIG. 4A illustrates a set of exemplary Harris corner points detected in the Red channel on non-collar clothes with tc=500 (left) and tc=2000 (right) in accordance with one embodiment of the present invention. Similarly, FIG. 4B illustrates a set of exemplary Harris corner points detected in the Red channel on clothes with collar with tc=500 (left) and tc=2000 (right) in accordance with one embodiment of the present invention. [0052] Similar to clothes detection, the neck part of the human body can be detected by segmenting within the bounding box of the human body (the green boxes shown in FIGS. 4A and 4B). Then, based on our assumption, the system can determine the presence of collar based on the number of Harris corner points detected from all the channels within the neck part. [0219] [0259] The “Physics Analysis Module” is a deep neural network (DNN) model for fabric attribute prediction or regression, as described in Section 2, which analyzes the captured garment images in different phases of the motion and predicts the garment fabric properties and/or model parameters for garment physics simulation. Two network architecture options can be adopted to implement the module; in the first the captured images are merged into one single multi-channel image (assuming RGB images are used it will be of 3×(K+1) channels) and fed as the input of the “Physics Analysis Module”; the second is to use an attribute prediction network based on multiple images input, as illustrated in FIGS. 
2 and 3 in Section 2.2.3. [0301] In the prediction stage we simply evaluate the feature vectors of each image, by feeding it through the convolutional and fully-connected layers of the trained network. A “Feature Comparison & Ranking Module” (see FIG. 11) then models the similarity between the query and each gallery item I. The similarity score S of the query image and each gallery image can be defined by e.g. 1) computing the distance of their corresponding feature vectors in the visual feature space; or 2) counting the number of overlapping attributes or keywords predicted from the attribute classifier. In the implementation, we adopt the L2-distance metric (i.e. Euclidean distance) in the visual feature space to evaluate the similarity between samples as follows: S(x_i, q) = ∥x_i − q∥_2 (13), where q and x_i stand for the feature vectors of the query item and the gallery item i, respectively. Other similarity metrics (e.g. L1 distance, or cosine-similarity (J. Huang, R. S. Feris, Q. Chen, and S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, In Proceedings of the IEEE International Conference on Computer Vision, pages 1062-1070, 2015)) are also applicable here. Once the similarity scores are evaluated over all the gallery items, the results of visual search or retrieval can be then presented based on a ranking of similarity scores of the candidate gallery garments to the query garment in a descending order. Zhang: [0042] In one embodiment, the system uses color information for clothes matching. During operation, the system computes a color histogram in Red, Green, and Blue (RGB) channels from the segmented torso part. The system then compares the histogram with the histograms of other clothing items. The system further measures the similarity between two pieces of clothes by applying the χ2 test between two histograms. The details of χ2 tests can be found in Chernoff H, Lehmann E.
L., “The use of maximum likelihood estimates in χ2 tests for goodness-of-fit,” The Annals of Mathematical Statistics 1954; 25:579-586, which is incorporated by reference herein. [0043] The system then retrieves the most similar and/or the most dissimilar clothes from the same category and display their images to the person for comparison. FIG. 3A illustrates a set of exemplary clothes-retrieval results based on color matching in accordance with one embodiment of the present invention. [0044] Besides color, clothing texture is also identified as a significant cue for clothes recognition due to its connection with fabric and pattern. In order to explore color and texture information simultaneously for clothes recognition, the system employs an “Eigen-Patch” approach. [0045] In the Eigen-Patch approach, instead of building histograms on the RGB values on each pixel, the system crops overlapping small image patches within the torso region and represents each patch by a multi-dimensional vector. In one embodiment, all the patches from all the clothes are stacked. The system then performs a Principal Component Analysis (PCA) to the feature stack to reduce the feature dimension and extract the most significant features from the clothes. PCA is a mathematical tool for statistical pattern recognition and its details are described in Fukunaga, K, “Introduction to Statistical Pattern Recognition,” Elsevier 1990, which is incorporated by reference herein. [0046] The system then projects the small patches to the first k principal components (referred to as “eigen patches”) which are obtained from the PCA.)
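For illustration, the negative-image generation recited in claims 3, 9 and 15 can be sketched as follows. Everything below (the random stand-in activations, the 0.5 binarization threshold, the two-bucket split, and the cosine scoring) is hypothetical, chosen only to make the claimed steps concrete; it is not the applicant's implementation and the claims do not fix these values.

```python
# Illustrative sketch of the claim 3/9/15 workflow: binarize per-filter
# activations into a "long vector", OR-merge the bits exclusive to each
# coarse class bucket, and score images on the resulting distinguisher
# vectors. All data and thresholds are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in activations for 6 product images x 16 filters, split into
# two coarse class buckets of 3 images each.
acts = rng.random((6, 16))
buckets = {0: [0, 1, 2], 1: [3, 4, 5]}

# Binary vector representation: a bit is set when its filter activation
# exceeds a (hypothetical) threshold.
binary = (acts > 0.5).astype(int)

# Combined binary vector per bucket: merge of bits set only inside the
# bucket ("exclusive" bits).
combined = {}
for b, members in buckets.items():
    others = [i for i in range(len(binary)) if i not in members]
    outside = binary[others].max(axis=0)         # bits seen outside the bucket
    exclusive = binary[members] & (1 - outside)  # bucket-exclusive bits
    combined[b] = exclusive.max(axis=0)

# Filter index dictionary: indices of filters that distinguish any bucket
# (a single flat "layer" here for brevity).
filter_index = sorted(np.flatnonzero(combined[0] | combined[1]))

# Distinguisher vectors: activations restricted to those filters.
disting = acts[:, filter_index]

def cosine(u, v):
    """Cosine similarity, one plausible choice of similarity score."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Under the claim language, images whose score falls below a threshold
# would be identified as negatives.
score = cosine(disting[0], disting[1])
```

The bucket-exclusive OR-merge is one plausible reading of "merging the identified exclusive bit"; the claim itself does not mandate a particular merge operator.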
Consider Claims 4, 10 and 16.
The combination of Zhang and Chen teaches:
4. The method of claim 3, wherein identifying the plurality of images as the plurality of negative images, based on the determined similarity score, comprises: comparing the similarity score of each of the plurality of images with a threshold of similarity; and identifying all images from the plurality of images, having value of the similarity score below the threshold of similarity, as the plurality of negative images. / 10. The system of claim 9, wherein the one or more hardware processors are configured to identify the plurality of images as the plurality of negative images, based on the determined similarity score, by: comparing the similarity score of each of the plurality of images with a threshold of similarity; and identifying all images from the plurality of images, having value of the similarity score below the threshold of similarity, as the plurality of negative images. / 16. The one or more non-transitory machine-readable information storage mediums of claim 15, wherein identifying the plurality of images as the plurality of negative images, based on the determined similarity score, comprises: comparing the similarity score of each of the plurality of images with a threshold of similarity; and identifying all images from the plurality of images, having value of the similarity score below the threshold of similarity, as the plurality of negative images. (Chen: [0051] The Harris measure is an indicator of the “strength of corneness” at point x, that is, how distinctive the corner is. After the system computes the Harris measure at each pixel within the neck region, the peak points are detected using non-maximal suppression with a radius r (in one embodiment, r=9). If the Harris measure at a peak point x is higher than a threshold tc, x is identified as a Harris corner point. The Harris corner detector is applied to each of the RGB channels. FIG. 
4A illustrates a set of exemplary Harris corner points detected in the Red channel on non-collar clothes with tc=500 (left) and tc=2000 (right) in accordance with one embodiment of the present invention. Similarly, FIG. 4B illustrates a set of exemplary Harris corner points detected in the Red channel on clothes with collar with tc=500 (left) and tc=2000 (right) in accordance with one embodiment of the present invention. [0052] Similar to clothes detection, the neck part of the human body can be detected by segmenting within the bounding box of the human body (the green boxes shown in FIGS. 4A and 4B). Then, based on our assumption, the system can determine the presence of collar based on the number of Harris corner points detected from all the channels within the neck part. [0219] [0259] The “Physics Analysis Module” is a deep neural network (DNN) model for fabric attribute prediction or regression, as described in Section 2, which analyzes the captured garment images in different phases of the motion and predicts the garment fabric properties and/or model parameters for garment physics simulation. Two network architecture options can be adopted to implement the module; in the first the captured images are merged into one single multi-channel image (assuming RGB images are used it will be of 3×(K+1) channels) and fed as the input of the “Physics Analysis Module”; the second is to use an attribute prediction network based on multiple images input, as illustrated in FIGS. 2 and 3 in Section 2.2.3. [0301] In the prediction stage we simply evaluate the feature vectors of each image, by feeding it through the convolutional and fully-connected layers of the trained network. A “Feature Comparison & Ranking Module” (see FIG. 11) then models the similarity between the query and each gallery item I. The similarity score S of the query image and each gallery image can be defined by e.g. 
1) computing the distance of their corresponding feature vectors in the visual feature space; or 2) counting the number of overlapping attributes or keywords predicted from the attribute classifier. In the implementation, we adopt the L2-distance metric (i.e. Euclidean distance) in the visual feature space to evaluate the similarity between samples as follows: S(x_i, q) = ∥x_i − q∥_2 (13), where q and x_i stand for the feature vectors of the query item and the gallery item i, respectively. Other similarity metrics (e.g. L1 distance, or cosine-similarity (J. Huang, R. S. Feris, Q. Chen, and S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, In Proceedings of the IEEE International Conference on Computer Vision, pages 1062-1070, 2015)) are also applicable here. Once the similarity scores are evaluated over all the gallery items, the results of visual search or retrieval can be then presented based on a ranking of similarity scores of the candidate gallery garments to the query garment in a descending order. Zhang: [0042] In one embodiment, the system uses color information for clothes matching. During operation, the system computes a color histogram in Red, Green, and Blue (RGB) channels from the segmented torso part. The system then compares the histogram with the histograms of other clothing items. The system further measures the similarity between two pieces of clothes by applying the χ2 test between two histograms. The details of χ2 tests can be found in Chernoff H, Lehmann E. L., “The use of maximum likelihood estimates in χ2 tests for goodness-of-fit,” The Annals of Mathematical Statistics 1954; 25:579-586, which is incorporated by reference herein. [0043] The system then retrieves the most similar and/or the most dissimilar clothes from the same category and display their images to the person for comparison. FIG.
3A illustrates a set of exemplary clothes-retrieval results based on color matching in accordance with one embodiment of the present invention. [0044] Besides color, clothing texture is also identified as a significant cue for clothes recognition due to its connection with fabric and pattern. In order to explore color and texture information simultaneously for clothes recognition, the system employs an “Eigen-Patch” approach. [0045] In the Eigen-Patch approach, instead of building histograms on the RGB values on each pixel, the system crops overlapping small image patches within the torso region and represents each patch by a multi-dimensional vector. In one embodiment, all the patches from all the clothes are stacked. The system then performs a Principal Component Analysis (PCA) to the feature stack to reduce the feature dimension and extract the most significant features from the clothes. PCA is a mathematical tool for statistical pattern recognition and its details are described in Fukunaga, K, “Introduction to Statistical Pattern Recognition,” Elsevier 1990, which is incorporated by reference herein. [0046] The system then projects the small patches to the first k principal components (referred to as “eigen patches”) which are obtained from the PCA.)
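The thresholding recited in claims 4, 10 and 16 reduces to a simple partition of scored images. The scores and the 0.7 threshold below are hypothetical stand-ins used only to illustrate the claimed comparison:

```python
# Minimal sketch of the claim 4/10/16 step: images whose similarity score
# falls below a similarity threshold are identified as negatives, and
# images exceeding it as positives. All values here are hypothetical.
threshold = 0.7
scores = {"img_a": 0.91, "img_b": 0.42, "img_c": 0.65, "img_d": 0.88}

negatives = [name for name, s in scores.items() if s < threshold]
positives = [name for name, s in scores.items() if s > threshold]

print(negatives)  # → ['img_b', 'img_c']
print(positives)  # → ['img_a', 'img_d']
```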
Consider Claims 5, 11 and 17.
The combination of Zhang and Chen teaches:
5. The method of claim 4, wherein all of the plurality of images having the similarity score exceeding the threshold of similarity are identified as the plurality of positive images. / 11. The system of claim 10, wherein the one or more hardware processors are configured to identify all of the plurality of images having the similarity score exceeding the threshold of similarity as the plurality of positive images. / 17. The one or more non-transitory machine-readable information storage mediums of claim 16, wherein all of the plurality of images having the similarity score exceeding the threshold of similarity are identified as the plurality of positive images. (Zhang: [0045]-[0046], The system then compares the histogram with all the histograms of other clothing items based on χ2 test to find similar and dissimilar clothes. FIG. 3B illustrates a set of exemplary clothes-retrieval results based on eigen-patch analysis in accordance with one embodiment of the present invention. Collar Recognition [0047] In one embodiment, the system uses a supervised learning algorithm to classify the clothes into different categories. In general, a collar on a shirt is an important cue to discriminate between formal shirts (e.g., dress shirts and polo shirts) and casual shirts (e.g., t-shirts and sweaters). Although it is very easy for human eyes to determine the existence of collar, recognizing it automatically from a camera is not a trivial problem. [0048]-[0053], Sleeve Recognition [0054] Sleeve length is another important factor for clothes recognition. It is also mentioned in the Wikipedia definition for “shirt” as a significant cue to discriminate between polo-shirts, T-shirts, sweat shirts (short-sleeved or none-sleeve) from dress shirts or jackets (long-sleeved). In order to recognize these two categories, it is assumed that long-sleeved clothes usually expose less skin area on arms than short-sleeved or none-sleeved clothes do. 
In one embodiment, the sleeve-recognition is divided into two sub-problems: skin detection and sleeve classification.[0055]-[0057] Next, for every pixel x in the rough arm area (right and left side of the upper body), a small patch p(x) of size 5×5 centered at x is extracted. x is identified as a skin pixel only if the following two conditions are true: [0058] 1. Patch p(x) is coherent in color. That is, the variance of RGB values within p(x) is smaller than a threshold. This is to prevent false detections from skin-like colors in sleeves. [0059] 2. The minimal Mahalanobis distance from the mean of the RGB values within p(x) to the two face pixel clusters is smaller than threshold ts. The skin detection results using ts=5 is shown in light blue areas in FIGS. 5A and 5B. [0060] After skin detection, the sleeve length is approximated by the number of skin pixels detected in the arms. A Decision Stump is learned on these features to recognize the sleeve lengths. Chen: [0082] The method may be one in which to train the deep neural network model for learning a triplet similarity metric, a three-way Siamese architecture is adopted to handle the 3-way parallel image inputs, in which the model weights are initialized with those of a pre-trained attributed classification model, and weight sharing is applied for all the convolutional layers, and the last fully-connected layer is retrained while fine-tuning the earlier convolutional layers at a lower learning rate for the similarity learning.) [0293]-[0294] Section 4.1. Image-Based Search or Retrieval, [0295] The standard approach for image-based search and image retrieval is: 1) performing feature extraction on both the query and the gallery images, 2) computing the feature distances between the query image and each gallery image using a distance metric (e.g. Euclidean distance or L1 distance); 3) presenting the search or retrieval results by ranking the similarity scores. 
[0296] To achieve good retrieval and search performance, step 1) is most critical. The goal is to learn an invariant feature transform and similarity embedding such that images of the same item but in different photography styles (e.g. shop images vs. mannequin images), or images of visually similar items should stay together in the feature space whilst those of visually dissimilar items should stay apart. In our system, we solve this problem in a unified framework by adopting a deep learning approach. For feature extraction, instead of using hand-crafted visual features (e.g. histogram of oriented gradient (HoG), SIFT) we take the outputs of the deep neural network model used for attribute classification (described in Section 2) as the visual features. To learn an effective similarity embedding we fine-tune the deep model and retrain the fully connected layers against a triplet-loss objective function as detailed in the following Section 4.1.1. [0414]-[0421], Section 6.2.1 Outfit Search and Section 6.2.2 Recommending Complementary Items)
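The Zhang passages cited above rest on a χ2 comparison of color histograms. A compact sketch of one common symmetric form of that distance follows; the four-bin histograms are hypothetical stand-ins, and Zhang does not specify this exact normalization:

```python
# Symmetric chi-square distance between two histograms, a standard form of
# the chi-square comparison Zhang applies to RGB color histograms:
# 0.5 * sum((h1 - h2)^2 / (h1 + h2)). Zero means identical histograms.
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    # eps guards against empty bins in the denominator
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

# Hypothetical normalized histograms (4 bins each for brevity).
query = [0.4, 0.3, 0.2, 0.1]
same = [0.4, 0.3, 0.2, 0.1]
different = [0.1, 0.1, 0.2, 0.6]

print(chi2_distance(query, same))       # → 0.0
print(chi2_distance(query, different))  # larger, i.e. less similar
```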
Consider Claims 6, 12 and 18.
The combination of Zhang and Chen teaches:
6. The method of claim 1, wherein a deep learning model trained using the generated training data is used to perform a similarity search, comprising: collecting a query image as input; processing the query image using the deep learning model to obtain a query embedding; determining similarity of the query image with a plurality of reference images, based on the query embedding; and generating at least one recommendation of at least one reference that is most similar to the real-time query image, based on the determined similarity. / 12. The system of claim 7, wherein the one or more hardware processors are configured to use a deep learning model trained using the generated training data, to perform a similarity search, by: collecting a query image as input; processing the query image using the deep learning model to obtain a query embedding; determining similarity of the query image with a plurality of reference images, based on the query embedding; and generating at least one recommendation of at least one reference that is most similar to the real-time query image, based on the determined similarity. / 18. The one or more non-transitory machine-readable information storage mediums of claim 13, wherein the one or more instructions which when executed by the one or more hardware processors cause: collecting a query image as input; processing the query image using the deep learning model to obtain a query embedding; determining similarity of the query image with a plurality of reference images, based on the query embedding; and generating at least one recommendation of at least one reference that is most similar to the real-time query image, based on the determined similarity. (Zhang: [0045]-[0046], The system then compares the histogram with all the histograms of other clothing items based on χ2 test to find similar and dissimilar clothes. FIG. 
3B illustrates a set of exemplary clothes-retrieval results based on eigen-patch analysis in accordance with one embodiment of the present invention. Collar Recognition [0047] In one embodiment, the system uses a supervised learning algorithm to classify the clothes into different categories. In general, a collar on a shirt is an important cue to discriminate between formal shirts (e.g., dress shirts and polo shirts) and casual shirts (e.g., t-shirts and sweaters). Although it is very easy for human eyes to determine the existence of collar, recognizing it automatically from a camera is not a trivial problem. [0048]-[0053], Sleeve Recognition [0054] Sleeve length is another important factor for clothes recognition. It is also mentioned in the Wikipedia definition for “shirt” as a significant cue to discriminate between polo-shirts, T-shirts, sweat shirts (short-sleeved or none-sleeve) from dress shirts or jackets (long-sleeved). In order to recognize these two categories, it is assumed that long-sleeved clothes usually expose less skin area on arms than short-sleeved or none-sleeved clothes do. In one embodiment, the sleeve-recognition is divided into two sub-problems: skin detection and sleeve classification.[0055]-[0057] Next, for every pixel x in the rough arm area (right and left side of the upper body), a small patch p(x) of size 5×5 centered at x is extracted. x is identified as a skin pixel only if the following two conditions are true: [0058] 1. Patch p(x) is coherent in color. That is, the variance of RGB values within p(x) is smaller than a threshold. This is to prevent false detections from skin-like colors in sleeves. [0059] 2. The minimal Mahalanobis distance from the mean of the RGB values within p(x) to the two face pixel clusters is smaller than threshold ts. The skin detection results using ts=5 is shown in light blue areas in FIGS. 5A and 5B. 
[0060] After skin detection, the sleeve length is approximated by the number of skin pixels detected in the arms. A Decision Stump is learned on these features to recognize the sleeve lengths. Chen: [0082] The method may be one in which to train the deep neural network model for learning a triplet similarity metric, a three-way Siamese architecture is adopted to handle the 3-way parallel image inputs, in which the model weights are initialized with those of a pre-trained attributed classification model, and weight sharing is applied for all the convolutional layers, and the last fully-connected layer is retrained while fine-tuning the earlier convolutional layers at a lower learning rate for the similarity learning.) [0293]-[0294] Section 4.1. Image-Based Search or Retrieval, [0295]
The standard approach for image-based search and image retrieval is: 1) performing feature extraction on both the query and the gallery images, 2) computing the feature distances between the query image and each gallery image using a distance metric (e.g. Euclidean distance or L1 distance); 3) presenting the search or retrieval results by ranking the similarity scores. [0296] To achieve good retrieval and search performance, step 1) is most critical. The goal is to learn an invariant feature transform and similarity embedding such that images of the same item but in different photography styles (e.g. shop images vs. mannequin images), or images of visually similar items should stay together in the feature space whilst those of visually dissimilar items should stay apart. In our system, we solve this problem in a unified framework by adopting a deep learning approach. For feature extraction, instead of using hand-crafted visual features (e.g. histogram of oriented gradient (HoG), SIFT) we take the outputs of the deep neural network model used for attribute classification (described in Section 2) as the visual features. To learn an effective similarity embedding we fine-tune the deep model and retrain the fully connected layers against a triplet-loss objective function as detailed in the following Section 4.1.1. [0414]-[0421], Section 6.2.1 Outfit Search and Section 6.2.2 Recommending Complementary Items)
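The retrieval flow quoted from Chen ([0295]-[0301]) can be sketched as follows: embed the query, compute S(x_i, q) = ∥x_i − q∥_2 against each gallery embedding, and recommend by rank. The three-dimensional embeddings below are synthetic stand-ins, not outputs of any real model:

```python
# Sketch of L2-based retrieval ranking as described in Chen. With the L2
# metric a smaller distance means a more similar item, so the top
# recommendation is the gallery item at minimum distance from the query.
import numpy as np

query = np.array([1.0, 0.0, 0.0])  # query embedding (synthetic stand-in)
gallery = {
    "item_1": np.array([0.9, 0.1, 0.0]),
    "item_2": np.array([0.0, 1.0, 0.0]),
    "item_3": np.array([0.7, 0.3, 0.1]),
}

# S(x_i, q) = ||x_i - q||_2 for each gallery embedding.
dists = {name: float(np.linalg.norm(vec - query)) for name, vec in gallery.items()}

# Rank ascending by distance; the first entry is the recommendation.
ranking = sorted(dists, key=dists.get)
print(ranking[0])  # → item_1
```

In a trained system the embeddings would come from the fine-tuned triplet network Chen describes; only the distance-and-rank step is shown here.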
Conclusion
The prior art made of record on form PTO-892 and not relied upon is considered pertinent to applicant's disclosure.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAHMINA ANSARI whose telephone number is 571-270-3379. The examiner can normally be reached on IFP Flex - Monday through Friday 9 to 5.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, O’NEAL MISTRY can be reached on 313-446-4912. The fax phone numbers for the organization where this application or proceeding is assigned are 571-273-8300 for regular communications and 571-273-8300 for After Final communications. TC 2600’s customer service number is 571-272-2600.
Any inquiry of a general nature or relating to the status of this application or proceeding should be directed to the receptionist whose telephone number is 571-272-2600.
/Tahmina Ansari/
January 10, 2026
/TAHMINA N ANSARI/Primary Examiner, Art Unit 2674