DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
Applicant’s Amendments filed on 12/03/2025 have been entered and made of record.
Currently pending Claim(s)
2–7 and 10–13
Independent Claim(s)
6, 12, and 13
Amended Claim(s)
6, 12, and 13
Canceled Claim(s)
8 and 9
Withdrawn Claim(s)
1
Response to Amendments
This office action is responsive to Applicant’s Arguments/Remarks Made in an Amendment received on 12/03/2025.
The claim amendments and Remarks filed on 12/03/2025 with respect to the 35 U.S.C. 112(b) rejections of claims 3–7 and 10–13 have been carefully considered, and the rejections of those claims under 35 U.S.C. 112(b) are withdrawn.
Applicant’s Reply (December 3, 2025) includes substantive amendments to the claims. This Office action has been updated with new grounds of rejection addressing those amendments. Further, Applicant’s Arguments/Remarks with respect to independent claims 6, 12, and 13, on pages 13–18, have been considered but are moot because the arguments do not apply to the combination of references used in the current rejection; the argued limitations are now addressed by the newly cited art, Melekhov et al. (“Digging Into Self-Supervised Learning of Feature Descriptors”), as explained in the body of the rejection below. The dependent claims are likewise rejected under 35 U.S.C. 103 based on the new grounds of rejection applied to the independent claims from which they depend.
Furthermore, Applicant argues, at the bottom of page 15, as follows:
[Applicant’s argument from page 15 of the Remarks is reproduced as a greyscale image (media_image1.png) in the original Office action.]
The Applicant argues that Florence does not describe “known transformation mappings of pixel coordinates within two images based on a predefined transformation by which those images relate to each other due to a performed augmentation” or “single-image deterministic augmentation mappings.” However, the Examiner respectfully disagrees.
Florence discloses all of the subject matter as described above except for specifically teaching “generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image [...] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image [...] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated” as described in the rejection below. However, Melekhov, in the same field of endeavor, teaches “generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image [...] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image [...] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated” (Melekhov, Figure 1; pg. 1146, Viewpoint Invariance). In summary, Florence discloses all the limitations in the independent claims, including generation of training images for training the machine learning model with contrastive loss for descriptor vectors, except for the augmentation of the camera images, which is disclosed by Melekhov. Therefore, it would have been obvious to one of ordinary skill in the art to combine Florence and Melekhov before the effective filing date of the claimed invention. The motivation for this combination of references would have been to enhance the discriminativeness of feature descriptors and make descriptors invariant to variables such as viewpoint and illumination changes (Melekhov, Abstract). This motivation for the combination of Florence and Melekhov is supported by KSR exemplary rationale (G): some teaching, suggestion, or motivation in the prior art that would have led one of ordinary skill to modify the prior art reference or to combine prior art reference teachings to arrive at the claimed invention. See MPEP § 2141(III). In addition, in response to Applicant’s arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.
In response to Applicant’s argument, at the bottom of page 17, that the Examiner’s conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning. But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from Applicant’s disclosure, such a reconstruction is proper. In addition, in response to Applicant’s statement at the middle of page 18, the newly cited Melekhov reference suggests “a system where model training can occur using just a single source image and by direct pixel descriptor correspondence analysis based on a transform applied to such a single source image,” which cures the deficiency of Florence.
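To further clarify the point, the principle relied upon above—that a predefined transformation applied to a single source image deterministically fixes which pixel coordinate of the source image corresponds to which pixel coordinate of the augmented image—can be illustrated by the following minimal sketch. The sketch is provided for explanatory purposes only; it is not reproduced from Florence, Melekhov, or Applicant’s disclosure, and the function names and the use of Python/NumPy are assumptions made for the example.

```python
import numpy as np

def rotation_homography(angle_deg, cx, cy):
    """Predefined transformation: a rotation about the image center (cx, cy),
    expressed as a 3x3 homogeneous matrix."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1.0]])
    rotate = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1.0]])
    back = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1.0]])
    return back @ rotate @ to_origin

def map_pixel(H, u):
    """Map a pixel coordinate u = (x, y) of the source image to the augmented image
    using the known transformation H (no search, depth data, or 3D model required)."""
    x, y, w = H @ np.array([u[0], u[1], 1.0])
    return np.array([x / w, y / w])

def is_match(H, u_a, u_b, tol=0.5):
    """u_a (source image) and u_b (augmented image) form a positive pair exactly when
    u_b equals the deterministically transformed coordinate of u_a."""
    return np.linalg.norm(map_pixel(H, u_a) - u_b) <= tol

H = rotation_homography(30.0, cx=320.0, cy=240.0)
u_a = np.array([100.0, 150.0])
print(is_match(H, u_a, map_pixel(H, u_a)))       # True: positive (matching) pair
print(is_match(H, u_a, np.array([10.0, 10.0])))  # False: negative (non-matching) pair
```

Because the transformation is known in advance, a matching pixel pair can be declared without any depth measurement or 3D reconstruction, which is the aspect for which Melekhov is relied upon.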
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claim 2 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 2 recites “wherein the augmentation includes a rotation, and/or a perspective transformation and/or an affine transformation.” The phrase “and/or” renders the claim indefinite because it is unclear whether the limitations following the phrase are part of the claimed invention. See MPEP § 2173.05(d).
Clarification of this indefiniteness is required. For purposes of examination, in light of the specification, the Examiner interprets the limitation of dependent claim 2 as meaning that the augmentation can include a rotation only, a perspective transformation only, an affine transformation only, or any combination of these augmentations.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 2–4, 6, 10, and 12–13 are rejected under 35 U.S.C. 103 as being unpatentable over Florence et al. (Florence, Peter R., Lucas Manuelli, and Russ Tedrake. "Dense object nets: Learning dense visual object descriptors by and for robotic manipulation." arXiv preprint arXiv:1806.08756 (2018)) (hereafter, “Florence”) in view of Melekhov et al. (Melekhov, Iaroslav, et al. "Digging into self-supervised learning of feature descriptors." 2021 International Conference on 3D Vision (3DV). IEEE, 2021) (hereafter, “Melekhov”).
Regarding claim 6, Florence discloses a method for a robot, the method comprising the following steps [we present Dense Object Nets, which build on recent developments in self-supervised dense descriptor learning, as a consistent object representation for visual understanding and manipulation ... we demonstrate the novel application of learned dense descriptors to robotic manipulation, pg. 1, Abstract]: training a machine learning model for generating descriptor images for images of one or more objects [we additionally present novel contributions to enable multi-object descriptor learning, and show that by modifying our training procedure, we can either acquire descriptors which generalize across classes of objects, or descriptors that are distinct for each object instance ... we use self-supervised pixelwise contrastive loss ... this learns a dense visual descriptor mapping which maps a full-resolution RGB image, R^(W×H×3), to a dense descriptor space, R^(W×H×D), where for each pixel we have a D-dimensional descriptor vector ... the images are mapped to corresponding descriptor images, pg. 1, Abstract; pg. 2-3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph], the descriptor images having, for each pixel coordinate of the images, a respective descriptor vector [learns a dense visual descriptor mapping which maps a full-resolution RGB image, R^(W×H×3), to a dense descriptor space, R^(W×H×D), where for each pixel we have a D-dimensional descriptor vector ... the images are mapped to corresponding descriptor images, pg. 2-3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph], the training including: recording multiple camera images, each of the camera images showing one or more objects [a pair of RGB images, I_a and I_b, are sampled from one RGBD video ... the minimum requirement for raw data is to collect an RGBD video of an object or objects, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph; pg. 5, 4. Experimental, first paragraph]; generating one or more pairs of training images [which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image] [training is performed in a Siamese fashion where a pair of RGB images, I_a and I_b, are sampled from one RGBD video, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other, [wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated] [many pixel matches and non-matches are generated from the pair of images. A pixel u_a ∈ R^2 from image I_a is a match with pixel u_b from image I_b if they correspond to the same vertex, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; and using the pair of training images for training the machine learning model with contrastive loss [training is performed in a Siamese fashion, where a pair of RGB images, I_a and I_b, are sampled from one RGBD video ... the dense descriptor mapping is trained via pixelwise contrastive loss, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph] by which: descriptor vectors which are generated by the machine learning model for the pixel pairs whose pixels which correspond to each other are used as positive pairs, modifying the machine learning model to reduce a descriptor distance between the descriptor vectors of the pixel pair [the dense descriptor mapping is trained via pixelwise contrastive loss. The loss function aims to minimize the distance between descriptors corresponding to a match, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; and descriptor vectors which are generated by the machine learning model for the pixel pairs whose pixels which do not correspond to each other are used as negative pairs, modifying the machine learning model to increase a descriptor distance between the descriptor vectors of the pixel pair [while descriptors corresponding to a non-match should be at least a distance M apart, where M is a margin parameter, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; using the trained machine learning model to obtain a descriptor image for a current control scenario [Figure 2 & 6; the dense descriptor mapping f(·) is used to map an image I ∈ R^(W×H×3) to descriptor space f(I) ∈ R^(W×H×D). Given a pixel u we use f(I)(u) to denote the descriptor corresponding to pixel u in image I, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph; shown for each (a) and (b-d) are RGB frames (top) and corresponding descriptor images (bottom) that are the direct output of a feed-forward pass through a trained network, pg. 5, Figure 2 citation of 5 Results]; and controlling the robot according to the obtained descriptor image [Figure 6; we set a simple threshold to determine whether a valid match exists. If a match is identified in the new scene we can instruct the robot to autonomously grasp this point by looking up the corresponding location in the point cloud and using simple geometric grasping techniques (details in Appendix C), pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph].
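To aid understanding of the teaching relied upon above, the pixelwise contrastive loss for which Florence is cited (matching descriptors are pulled together; non-matching descriptors are pushed at least a margin M apart) can be summarized by the following minimal sketch. This sketch is provided for illustration only and is not code reproduced from Florence; the function name, array shapes, and margin value are assumptions made for the example.

```python
import numpy as np

def pixelwise_contrastive_loss(desc_a, desc_b, matches, non_matches, margin=0.5):
    """desc_a, desc_b: descriptor images flattened to shape (H*W, D), one row per pixel.
    matches / non_matches: integer arrays of shape (N, 2) listing pixel-index pairs
    (i, j) whose pixels do / do not correspond between the two training images."""
    # Matching pixel pairs: their descriptor distance is driven toward zero.
    d_match = np.linalg.norm(desc_a[matches[:, 0]] - desc_b[matches[:, 1]], axis=1)
    loss_match = np.mean(d_match ** 2)
    # Non-matching pixel pairs: penalized only when closer than the margin M.
    d_non = np.linalg.norm(desc_a[non_matches[:, 0]] - desc_b[non_matches[:, 1]], axis=1)
    loss_non = np.mean(np.maximum(0.0, margin - d_non) ** 2)
    return loss_match + loss_non
```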
Florence fails to explicitly disclose for each camera image of the camera images: generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image; [generating one or more pairs of training images] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image; and [for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated.
However, Melekhov teaches for each camera image of the camera images: generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image [Figure 1; given an image and a set of interest points (top-left), we create a stylized copy of the image with a forward-pass through Style CNN. A pair of homography related crops (c_i) is randomly sampled from the original and stylized image in H-crops generator ... we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator). In addition to stylization, we apply synthetic CA independently on the anchor and positive crops by adding blur, random per-pixel Gaussian noise, adaptive histogram equalization (CLAHE) along with augmentation in brightness, contrast, hue, and saturation, pg. 1145, Figure 1 citation; pg. 1146, right column, Viewpoint Invariance, first paragraph]; [generating one or more pairs of training images] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image [Figure 1; we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator) ... a pair of homography related crops (c_i) is randomly sampled from the original and stylized image in H-crops generator, pg. 1146, right column, Viewpoint Invariance, first paragraph; pg. 1145, Figure 1 citation]; and [for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated [we pass the image crops through Descriptor CNN to obtain representation of corresponding interest points in the feature space (embeddings). For each a_i and p_i descriptor from positive pair (anchors-positives), we create a list of hard-negative descriptors (shown with color gradient), pg. 1145, Figure 1 citation].
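For illustration only, the manner in which a pair of training images related by a known homography yields match and non-match pixel pairs without any 3D reconstruction can be sketched as follows. This is not Melekhov's H-crops generator; the function names and the non-match sampling strategy are assumptions made for the example, written in Python/NumPy.

```python
import numpy as np

def warp_coordinates(H, coords):
    """Apply the predefined 3x3 homography H to pixel coordinates of shape (N, 2)."""
    homog = np.hstack([coords, np.ones((coords.shape[0], 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def build_pixel_pairs(H, coords, image_shape, seed=0):
    """Derive match and non-match pixel pairs between a source image and its augmented
    (warped) copy purely from the known transformation H; no depth data is needed."""
    h, w = image_shape
    warped = warp_coordinates(H, coords)
    # Keep only source pixels whose transformed coordinates land inside the augmented image.
    ok = (warped[:, 0] >= 0) & (warped[:, 0] < w) & (warped[:, 1] >= 0) & (warped[:, 1] < h)
    src, dst = coords[ok], warped[ok]
    matches = list(zip(src, dst))  # (u_a, H(u_a)) pairs correspond by construction
    rng = np.random.default_rng(seed)
    shuffled = dst[rng.permutation(len(dst))]
    non_matches = [(u, v) for u, v, true_v in zip(src, shuffled, dst)
                   if np.linalg.norm(v - true_v) > 1.0]  # discard accidental matches
    return matches, non_matches
```

Match and non-match lists produced in this manner are the kind of inputs consumed by a pixelwise contrastive loss of the form sketched above.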
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Florence and incorporate the image-augmentation teachings of Melekhov in order to enhance the discriminativeness of feature descriptors and make descriptors invariant to viewpoint and illumination changes, as recognized by Melekhov.
Further, one skilled in the art could have combined the elements as described above with known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Melekhov with Florence to obtain the invention as specified in claim 6.
Regarding claim 2, which incorporates claim 6, Florence fails to explicitly disclose wherein the augmentation includes a rotation, and/or a perspective transformation and/or an affine transformation.
However, Melekhov teaches wherein the augmentation includes a rotation, and/or a perspective transformation and/or an affine transformation [we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator), pg. 1146, right column, Viewpoint Invariance, first paragraph].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Florence and incorporate the teachings of Melekhov to enhance the discriminativeness of feature descriptors and make descriptors invariant to viewpoint changes, as recognized by Melekhov.
Further, one skilled in the art could have combined the elements as described above with known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Melekhov with Florence to obtain the invention as specified in claim 2.
Regarding claim 3, which incorporates claim 6, Florence discloses, prior to generating the augmented versions and the pairs of training images, cropping the camera images taking into account object masks of the one or more objects [while we collect only a modest number of scenes (4-10) per object or class, we ensure they are diverse. We physically acquire a diverse set of orientations, crops, and lighting conditions ... since we are trying to learn descriptors of objects that take up only a fraction of a full image, we observe significant improvements if the representational power of the models are focused on the objects rather than the backgrounds ... initial testing with human-labeled object masks showed that if matches for data associations were sampled only on the object (while non-matches were sampled from the full image) then correspondence performance was significantly improved, pg. 4, 3.2 Training Procedures for Object-Centric Descriptors, sixth paragraph, third paragraph].
Regarding claim 4, which incorporates claim 6, Florence discloses wherein the machine learning model is a neural network [our visual representations, Dense Object Nets, which are deep neural networks trained to provide dense (pixelwise) description of objects, pg. 2, 1. Introduction, second paragraph].
Regarding claim 10, which incorporates claim 6, Florence discloses recording a camera image which shows an object in the current control scenario [Figure 2 & 6; we demonstrate a variety of manipulation applications in grasping specific points on objects ... the robot moves to a handful of pre-specified poses and records those RGBD images, pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph; pg. 12, Appendix C Grasping Pipeline, first paragraph], wherein the using of the trained machine learning model to obtain the descriptor image for the current control scenario includes feeding the camera image to the trained machine learning model with the trained machine learning model generating the descriptor image for the current control scenario based on the fed camera image [we set a simple threshold to determine whether a valid match exists ... shown for each (a) and (b-d) are RGB frames (top) and corresponding descriptor images (bottom) that are the direct output of a feed-forward pass through a trained network ... the RGB images are used to look up the best descriptor match and determine a pixel space location to grasp in one image, pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph; pg. 5, Figure 2 citation of 5 Results; pg. 12, Appendix C Grasping Pipeline, first paragraph]; and determining, based on the descriptor image for the current control scenario, a position of a location for picking up or processing the object in the current control scenario, wherein the controlling of the robot is performed according to the determined position [Figure 6; if a match is identified in the new scene we can instruct the robot to autonomously grasp this point by looking up the corresponding location in the point cloud and using simple geometric grasping techniques, pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph].
Regarding claim 12, Florence discloses a control unit system for a robot, the control unit system including processors that are configured to [Figure 7; we utilize a 7-DOF robot arm (Kuka IIWA LBR) with an RGBD sensor (Primesense Carmine 1.09) ... we demonstrate the novel application of learned dense descriptors to robotic manipulation, pg. 5, 4 Experimental; pg. 1, Abstract]: train a machine learning model for generating descriptor images for images of one or more objects [we additionally present novel contributions to enable multi-object descriptor learning, and show that by modifying our training procedure, we can either acquire descriptors which generalize across classes of objects, or descriptors that are distinct for each object instance ... we use self-supervised pixelwise contrastive loss ... this learns a dense visual descriptor mapping which maps a full-resolution RGB image, R^(W×H×3), to a dense descriptor space, R^(W×H×D), where for each pixel we have a D-dimensional descriptor vector ... the images are mapped to corresponding descriptor images, pg. 1, Abstract; pg. 2-3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph], the descriptor images having, for each pixel coordinate of the images, a respective descriptor vector [learns a dense visual descriptor mapping which maps a full-resolution RGB image, R^(W×H×3), to a dense descriptor space, R^(W×H×D), where for each pixel we have a D-dimensional descriptor vector ... the images are mapped to corresponding descriptor images, pg. 2-3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph], the training including: recording multiple camera images, each of the camera images showing one or more objects [a pair of RGB images, I_a and I_b, are sampled from one RGBD video ... the minimum requirement for raw data is to collect an RGBD video of an object or objects, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph; pg. 5, 4. Experimental, first paragraph]; generating one or more pairs of training images [which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image] [training is performed in a Siamese fashion where a pair of RGB images, I_a and I_b, are sampled from one RGBD video, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other, [wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated] [many pixel matches and non-matches are generated from the pair of images. A pixel u_a ∈ R^2 from image I_a is a match with pixel u_b from image I_b if they correspond to the same vertex, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; and using the pair of training images for training the machine learning model with contrastive loss [training is performed in a Siamese fashion, where a pair of RGB images, I_a and I_b, are sampled from one RGBD video ... the dense descriptor mapping is trained via pixelwise contrastive loss, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph] by which: descriptor vectors which are generated by the machine learning model for the pixel pairs whose pixels which correspond to each other are used as positive pairs, modifying the machine learning model to reduce a descriptor distance between the descriptor vectors of the pixel pair [the dense descriptor mapping is trained via pixelwise contrastive loss. The loss function aims to minimize the distance between descriptors corresponding to a match, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; and descriptor vectors which are generated by the machine learning model for the pixel pairs whose pixels which do not correspond to each other are used as negative pairs, modifying the machine learning model to increase a descriptor distance between the descriptor vectors of the pixel pair [while descriptors corresponding to a non-match should be at least a distance M apart, where M is a margin parameter, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; use the trained machine learning model to obtain a descriptor image for a current control scenario [Figure 2 & 6; the dense descriptor mapping f(·) is used to map an image I ∈ R^(W×H×3) to descriptor space f(I) ∈ R^(W×H×D). Given a pixel u we use f(I)(u) to denote the descriptor corresponding to pixel u in image I, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph; shown for each (a) and (b-d) are RGB frames (top) and corresponding descriptor images (bottom) that are the direct output of a feed-forward pass through a trained network, pg. 5, Figure 2 citation of 5 Results]; and control the robot according to the obtained descriptor image [Figure 6; we set a simple threshold to determine whether a valid match exists. If a match is identified in the new scene we can instruct the robot to autonomously grasp this point by looking up the corresponding location in the point cloud and using simple geometric grasping techniques (details in Appendix C), pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph].
Florence fails to explicitly disclose for each camera image of the camera images: generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image; [generating one or more pairs of training images] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image; and [for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated.
However, Melekhov teaches for each camera image of the camera images: generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image [Figure 1; given an image and a set of interest points (top-left), we create a stylized copy of the image with a forward-pass through Style CNN. A pair of homography related crops (c_i) is randomly sampled from the original and stylized image in H-crops generator ... we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator). In addition to stylization, we apply synthetic CA independently on the anchor and positive crops by adding blur, random per-pixel Gaussian noise, adaptive histogram equalization (CLAHE) along with augmentation in brightness, contrast, hue, and saturation, pg. 1145, Figure 1 citation; pg. 1146, right column, Viewpoint Invariance, first paragraph]; [generating one or more pairs of training images] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image [Figure 1; we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator) ... a pair of homography related crops (c_i) is randomly sampled from the original and stylized image in H-crops generator, pg. 1146, right column, Viewpoint Invariance, first paragraph; pg. 1145, Figure 1 citation]; and [for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated [we pass the image crops through Descriptor CNN to obtain representation of corresponding interest points in the feature space (embeddings). For each a_i and p_i descriptor from positive pair (anchors-positives), we create a list of hard-negative descriptors (shown with color gradient), pg. 1145, Figure 1 citation].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Florence and incorporate the image-augmentation teachings of Melekhov in order to enhance the discriminativeness of feature descriptors and make descriptors invariant to viewpoint and illumination changes, as recognized by Melekhov.
Further, one skilled in the art could have combined the elements as described above with known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Melekhov with Florence to obtain the invention as specified in claim 12.
Regarding claim 13, Florence discloses when executed by the processors, cause the processors to perform the following for a robot [Figure 7; we utilize a 7-DOF robot arm (Kuka IIWA LBR) with an RGBD sensor (Primesense Carmine 1.09) ... we demonstrate the novel application of learned dense descriptors to robotic manipulation, pg. 5, 4 Experimental; pg. 1, Abstract]: training a machine learning model for generating descriptor images for images of one or more objects [we additionally present novel contributions to enable multi-object descriptor learning, and show that by modifying our training procedure, we can either acquire descriptors which generalize across classes of objects, or descriptors that are distinct for each object instance, pg. 1, Abstract], the descriptor images having, for each pixel coordinate of the images, a respective descriptor vector [learns a dense visual descriptor mapping which maps a full-resolution RGB image, R^(W×H×3), to a dense descriptor space, R^(W×H×D), where for each pixel we have a D-dimensional descriptor vector ... the images are mapped to corresponding descriptor images, pg. 2-3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph], the training including: recording multiple camera images, each of the camera images showing one or more objects [a pair of RGB images, I_a and I_b, are sampled from one RGBD video ... the minimum requirement for raw data is to collect an RGBD video of an object or objects, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph; pg. 5, 4. Experimental, first paragraph]; generating one or more pairs of training images [which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image] [training is performed in a Siamese fashion where a pair of RGB images, I_a and I_b, are sampled from one RGBD video, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other, [wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated] [many pixel matches and non-matches are generated from the pair of images. A pixel u_a ∈ R^2 from image I_a is a match with pixel u_b from image I_b if they correspond to the same vertex, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; and using the pair of training images for training the machine learning model with contrastive loss [training is performed in a Siamese fashion, where a pair of RGB images, I_a and I_b, are sampled from one RGBD video ... the dense descriptor mapping is trained via pixelwise contrastive loss, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph] by which: descriptor vectors which are generated by the machine learning model for the pixel pairs whose pixels which correspond to each other are used as positive pairs, modifying the machine learning model to reduce a descriptor distance between the descriptor vectors of the pixel pair [the dense descriptor mapping is trained via pixelwise contrastive loss. The loss function aims to minimize the distance between descriptors corresponding to a match, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; and descriptor vectors which are generated by the machine learning model for the pixel pairs whose pixels which do not correspond to each other are used as negative pairs, modifying the machine learning model to increase a descriptor distance between the descriptor vectors of the pixel pair [while descriptors corresponding to a non-match should be at least a distance M apart, where M is a margin parameter, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph]; using the trained machine learning model to obtain a descriptor image for a current control scenario [Figure 2 & 6; the dense descriptor mapping f(·) is used to map an image I ∈ R^(W×H×3) to descriptor space f(I) ∈ R^(W×H×D). Given a pixel u we use f(I)(u) to denote the descriptor corresponding to pixel u in image I, pg. 3, 3.1 Preliminary: Self-Supervised Pixelwise Contrastive Loss, first paragraph; shown for each (a) and (b-d) are RGB frames (top) and corresponding descriptor images (bottom) that are the direct output of a feed-forward pass through a trained network, pg. 5, Figure 2 citation of 5 Results]; and controlling the robot according to the obtained descriptor image [Figure 6; we set a simple threshold to determine whether a valid match exists. If a match is identified in the new scene we can instruct the robot to autonomously grasp this point by looking up the corresponding location in the point cloud and using simple geometric grasping techniques (details in Appendix C), pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph].
Florence fails to explicitly disclose non-transitory computer-readable media on which are stored instructions that are executable by processors; for each camera image of the camera images: generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image; [generating one or more pairs of training images] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image; and [for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated.
However, Melekhov teaches non-transitory computer-readable media on which are stored instructions that are executable by processors [the training was performed on a single GeForce RTX 2080Ti GPU, pg. 1152, left column, B. Training details, first paragraph]; for each camera image of the camera images: generating one or more augmented versions of the camera image by applying respective predefined transformations to the camera image for each augmented version of the camera image, wherein the respective predefined transformation changes respective coordinates of pixels of the camera image [Figure 1; given an image and a set of interest points (top-left), we create a stylized copy of the image with a forward-pass through Style CNN. A pair of homography related crops (c_i) is randomly sampled from the original and stylized image in H-crops generator ... we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator). In addition to stylization, we apply synthetic CA independently on the anchor and positive crops by adding blur, random per-pixel Gaussian noise, adaptive histogram equalization (CLAHE) along with augmentation in brightness, contrast, hue, and saturation, pg. 1145, Figure 1 citation; pg. 1146, right column, Viewpoint Invariance, first paragraph]; [generating one or more pairs of training images] which each includes the camera image and one of the augmented versions of the camera image or two of the augmented versions of the camera image [Figure 1; we use similar types of the homographic adaptation operations: translation, rotation, perspective transform, and scale whose parameters are sampled from a uniform distribution during training (H-crops generator) ... a pair of homography related crops (c_i) is randomly sampled from the original and stylized image in H-crops generator, pg. 1146, right column, Viewpoint Invariance, first paragraph; pg. 1145, Figure 1 citation]; and [for each pair of training images, analyzing a plurality of pixel pairs that each consists of one respective pixel of a first image of the respective pair of training images and of one respective pixel of a second image of the respective pair of training images, wherein the analyzing includes determining, for each of the pixel pairs, whether or not the pixels of the respective pixel pair correspond to each other] wherein the determinations are made based on the changes to the respective coordinates of the pixels as deterministically set by the respective predefined transformation or transformations by which the augmented version or versions of the pair of training images were generated [we pass the image crops through Descriptor CNN to obtain representation of corresponding interest points in the feature space (embeddings). For each a_i and p_i descriptor from positive pair (anchors-positives), we create a list of hard-negative descriptors (shown with color gradient), pg. 1145, Figure 1 citation].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Florence and incorporate the image-augmentation teachings of Melekhov in order to enhance the discriminativeness of feature descriptors and make descriptors invariant to viewpoint and illumination changes, as recognized by Melekhov.
Further, one skilled in the art could have combined the elements as described above with known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Melekhov with Florence to obtain the invention as specified in claim 13.
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Florence ("Dense object nets: Learning dense visual object descriptors by and for robotic manipulation.") in view of Melekhov ("Digging into self-supervised learning of feature descriptors."), as applied above, and further in view of Jang et al. (US 2019/0385022 A1) (hereafter, “Jang”).
Regarding claim 5, which incorporates claim 6, neither Florence nor Melekhov appears to explicitly disclose wherein the plurality of camera images are recorded from the same perspective.
However, Jang teaches wherein the plurality of camera images are recorded from the same perspective [the system 100 has access to one or more cameras that capture images of the environment and of objects in the environment ... the one or more cameras may be sensors of the robot (e.g., the camera(s) that are used to capture the state representation) or stationary cameras external to the robot in the environment. The system uses these cameras to capture images of objects in the environment and provides the images as input to the object representation neural network 112, para 0044].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Florence in view of Melekhov and incorporate the teachings of Jang to allow the robot to complete the task with information about the state representation of the environment, as recognized by Jang.
Further, one skilled in the art could have combined the elements as described above with known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Jang with Florence and Melekhov to obtain the invention as specified in claim 5.
Claims 7 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Florence ("Dense object nets: Learning dense visual object descriptors by and for robotic manipulation.") in view of Melekhov ("Digging into self-supervised learning of feature descriptors."), as applied above, and further in view of Ma et al. (US 2021/0027058 A1) (hereafter, “Ma”).
Regarding claim 11, which incorporates claim 10, neither Florence nor Melekhov appears to explicitly disclose wherein the controlling of the robot causes the robot to pick up the object.
However, Ma teaches wherein the controlling of the robot causes the robot to pick up the object [navigation behavior may be used to move the robot from one point to another. This may allow the robot to locate an object such as a bottle that may be anywhere within an environment and then perform a task such as “picking up” the bottle, para 0050].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Florence in view of Melekhov and incorporate the teachings of Ma so that the robot can locate an object and perform a task on it regardless of where it is located within the environment, as recognized by Ma.
Further, one skilled in the art could have combined the elements as described above with known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results. Therefore, it would have been obvious to combine Ma with Florence and Melekhov to obtain the invention as specified in claim 11.
Regarding claim 7, which incorporates claim 11, Florence discloses identifying a reference location in a reference image [Figure 6; we demonstrate a variety of manipulation applications in grasping specific points on objects, where the point of interest is specified in a reference image, pg. 7, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph]; determining a descriptor of the identified reference location by feeding the reference image to the machine learning model [Figure 6; we set a simple threshold to determine whether a valid match exists ... the RGB images are used to look up the best descriptor match and determine a pixel space location to grasp in one image, pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph; pg. 12, Appendix C Grasping Pipeline, first paragraph]; determining a position of the reference location in the current control scenario by searching for the determined descriptor in the descriptor image generated from the camera image which shows the object [the robot has the ability to autonomously identify the corresponding point ... for each the user specifies a pixel in a single reference image, and the robot automatically grasps the best match in test configurations, pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph; Figure 6 citation of 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points]; and determining the position of the location for picking up or processing the object in the current control scenario from the determined position of the reference location [if a match is identified in the new scene we can instruct the robot to autonomously grasp this point by looking up the corresponding location in the point cloud, pg. 8, 5.4 Example Applications to Robotic Manipulation: Grasping Specific Points, first paragraph].
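For illustration only, the reference-descriptor lookup relied upon above (identifying the pixel in a new descriptor image whose descriptor is closest to a reference descriptor, subject to a validity threshold) can be sketched as follows. This is not code reproduced from Florence; the function name and the threshold value are assumptions made for the example, written in Python/NumPy.

```python
import numpy as np

def locate_reference_point(ref_descriptor, descriptor_image, threshold=0.2):
    """Find the pixel in descriptor_image (H x W x D) whose descriptor is closest to
    ref_descriptor (D,); return its (row, col) only if the match passes the threshold."""
    distances = np.linalg.norm(descriptor_image - ref_descriptor, axis=-1)  # H x W
    row, col = np.unravel_index(np.argmin(distances), distances.shape)
    if distances[row, col] > threshold:
        return None  # no valid match in the current scene
    return int(row), int(col)
```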
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure:
Efficient and Robust Training of Dense Object Nets for Multi-Object Robot Manipulation to Adrian et al. discloses training of Dense Object Nets with augmentation and pixelwise loss formulation for a robotic grasping task.
Relative Distribution Entropy Loss Function in CNN Image Retrieval to Liu et al. discloses relative distribution entropy (RDE) to describe image descriptors that uses Euclidean distance and contrastive loss with a convolutional neural network (CNN).
US 20230274532 A1 to Shlens et al. discloses systems and methods for learning data augmentation for object detection model performance where augmentation operations are applied to training image to generate augmented images.
Image Patch Matching Using Convolutional Descriptors with Euclidean Distance to Melekhov, Iaroslav et al. discloses a neural network based image descriptor for image patch matching that uses a deep convolutional neural network (CNN).
Finger vein verification using a Siamese CNN to Tang et al. discloses a Siamese CNN with a contrastive loss function to extract discriminative features from finger vein images.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TOLUWANI MARY-JANE IJASEUN whose telephone number is (571)270-1877. The examiner can normally be reached Monday - Friday 7:30AM-4PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Henok Shiferaw can be reached at (571) 272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/TOLUWANI MARY-JANE IJASEUN/Examiner, Art Unit 2676
/Henok Shiferaw/Supervisory Patent Examiner, Art Unit 2676