DETAILED ACTION
This action is in response to the application filed 29 December 2021.
Claim 1 is pending. Claim 1 is independent.
Claim 1 is rejected.
Notice of Pre-AIA or AIA Status
The present application, filed on or after 16 March 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA 35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Drawings
The drawings are objected to because they are not black and white line drawings; see 37 C.F.R. § 1.84(a). Furthermore, due to the use of grayscale, they do not have satisfactory reproduction characteristics (e.g., when converted to black and white, the text becomes illegible); see 37 C.F.R. § 1.84(l). Corrected drawing sheets in compliance with 37 C.F.R. § 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended”. If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 C.F.R. § 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Claim Objections
Claim 1 is objected to for the following informalities:
Claim 1 is missing spaces between several words or other pieces of text; for example, there is no space in the phrase “…the sizes of the convolution kernel are9x9…”.
Claim 1 does not consistently use the serial comma; for example, the phrase “the number of output channels is 32, 16 and 8 respectively” does not use the serial comma, but the phrase “P, W, S, and C represent four separate deconvolution layers” does use the serial comma.
Claim 1 contains words that begin with capital letters beyond the first word of the claim. Each claim should begin with a capital letter and end with a period; see MPEP § 608.01(m). Furthermore, the use of capital letters is inconsistent; for example, the limitation under “(1.1) module input” does not begin with a capital letter, but the limitation under “(1.2) module structure” does begin with a capital letter.
Claim Rejections—35 U.S.C. § 112
The following is a quotation of 35 U.S.C. § 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
Claim 1 is rejected under 35 U.S.C. § 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim 1 is rejected as failing to define the invention in the manner required by 35 U.S.C. § 112(b). The claim(s) are narrative in form and replete with indefinite language. The structure which goes to make up the device must be clearly and positively specified. The structure must be organized and correlated in such a manner as to present a complete operative device. The claim(s) must be in one sentence form only. Note the format of the claims in the patent(s) cited.
Claim Rejections—35 U.S.C. § 103
The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 C.F.R. § 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention.
Claim 1 is rejected under 35 U.S.C. § 103 as being unpatentable over Wang et al. (“Efficient Fully Convolution Neural Network for Generating Pixel Wise Robotic Grasps With High Resolution Images”) [hereinafter Wang] in view of Yingying Yu et al. (“A Two-Stream CNN With Simultaneous Detection and Segmentation for Robotic Grasping”) [hereinafter Y. Yu] and Sheng Yu et al. (“Object recognition and robot grasping technology based on RGB-D data”) [hereinafter S. Yu].
Regarding independent claim 1, Wang teaches [a]n active data learning selection method for robot grasp, which is mainly divided into two branches, an object grasp method detection branch and a data selection strategy branch, which specifically comprises the following three modules: A neural network model for generating robotic grasps, comprising down-sampling to obtain features, and up-sampling to generate output robotic grasps (Wang, p. 474, abstract).

(1) data feature extraction module The data feature extraction module is a convolutional neural network feature extraction layer; after the input data is processed by the data feature extraction module, the input data is called feature data and provided to other modules for use; The neural network architecture includes a plurality of convolutional neural network layers (Wang, pp. 476–477, § IV(C), fig. 4).

(1.1) module input: the input of this module can be freely selected between RGB image and a depth image; there are three input schemes: a single RGB image, a single depth image and a combination of RGB and the depth image; the corresponding input channels are 3 channels, 1 channel and 4 channels respectively; the length and width of the input image are both 300 pixels; The input comprises RGB-D images comprising RGB images and depth images of unknown objects [one channel for each of red, green, blue, and depth] (Wang, p. 475, § III).

(1.2) module structure: This module uses a three-layer convolutional neural network structure; the sizes of the convolution kernel are[ ]9x9, 5x5 and 3x3; the number of output channels is 32, 16 and 8 respectively; each layer of the data feature extraction module is composed of convolutional layers and activation functions, and the whole process is expressed as the following formulas: The kernel sizes are 9-5-3-3-3-3 [the kernel sizes for the first three layers are 9x9, 5x5, and 3x3] (Wang, p. 477, § IV(C)). The input and output channels are powers of 2, e.g., 16, 32, 64, etc. (Wang, p. 477, fig. 4).
Out1 = F(RGBD) (1)
Out2 = F(Out1) (2)
Out3 = F(Out2) (3)
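For illustration only, formulas (1)-(3) and the kernel and channel sizes recited above describe a three-layer convolutional encoder. The following is a minimal sketch assuming a PyTorch implementation; the strides of 3, 2, and 2 and the padding values are assumptions chosen so that the feature-map sizes match the 100-, 50-, and 25-pixel sizes discussed in the mapping below, and are not recited in the claim or relied upon from Wang.

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        """Illustrative sketch of the claimed data feature extraction module:
        three conv + ReLU layers with 9x9, 5x5, and 3x3 kernels and 32, 16, and 8
        output channels (formulas (1)-(3)). Strides and padding are assumed."""
        def __init__(self, in_channels=4):  # 4 channels = the RGB + depth input scheme
            super().__init__()
            self.layer1 = nn.Sequential(nn.Conv2d(in_channels, 32, 9, stride=3, padding=4), nn.ReLU())
            self.layer2 = nn.Sequential(nn.Conv2d(32, 16, 5, stride=2, padding=2), nn.ReLU())
            self.layer3 = nn.Sequential(nn.Conv2d(16, 8, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, rgbd):                 # rgbd: (N, 4, 300, 300)
            out1 = self.layer1(rgbd)             # formula (1): Out1 = F(RGBD), 100 x 100
            out2 = self.layer2(out1)             # formula (2): Out2 = F(Out1), 50 x 50
            out3 = self.layer3(out2)             # formula (3): Out3 = F(Out2), 25 x 25
            return out1, out2, out3

Under these assumed strides, a 300 x 300 RGB-D input yields 100 x 100, 50 x 50, and 25 x 25 feature maps with 32, 16, and 8 channels, matching the sizes discussed below.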
RGBD represents the 4-channel input data combining RGB image and the depth image, and F represents the combination of the convolutional layer and the activation functions, Out1, Out2 and Out3[ ]represent the feature maps of the three-layer output; when the length and width of the input image are both 300 pixels, the size of Out1 is 100 pixels x 100 pixels, the size of Out2 is 50 pixels x 50 pixels, and the size of[ ]Out3 is 25 pixels x[ ]25 pixels; The convolutional neural network layers include layers having input sizes of 100 x 100, 50 x 50, and 25 x 25 (Wang, p. 477, fig. 4). [Wang teaches using an image size of 400 x 400, but also teaches that other systems may use varying image sizes, including 300 x 300; see p. 478, table 1.]

(2) grasp method detection module This module uses a final feature map obtained by the data feature extraction module to perform deconvolution operation to restore the feature map to the original input size, which is 300 pixels x 300 pixels, and obtain the final result, namely a grasp value map, a width map and sine and cosine diagrams of the rotation angle; according to these four images, the center point, width and rotation angle of the object grasp method are obtained; Each convolutional neural network layer is paired with a deconvolution layer that up-samples the feature map to generate three grasp maps having the same size as the input images (Wang, pp. 476–477, § IV(C), fig. 4). The grasp maps represent the grasp quality, rotation angle, and gripper width (Wang, p. 475, § III, p. 477, § IV(C)).

(2.1)[ ]module input: The input of this module is the feature map Out3[ ]obtained in formula (3); The data moves through the neural network layers from the input, to the convolutional layers, to the deconvolutional layers (Wang, p. 477, fig. 4).

(2.2) module structure: The grasp method detection module contains three deconvolution layers and four separate convolutional layers; the sizes of the convolution kernels of the three deconvolution layers are set to 3x3, 5x5 and 9x9; the […]; in addition, after the deconvolution operation, each layer also comprises the ReLU activation function to achieve a more effective representation, and the four separate convolutional layers will directly output the result; the process is expressed as: The model includes a plurality of deconvolution layers having 3-3-3-5-9-5 kernel sizes [layers having 3 x 3, 5 x 5, and 9 x 9 size kernels]; the ReLU activation function is used for all layers (Wang, p. 477, § IV(C)).
x = DF(Out3) (4)
p = P(x) (5)
w = W(x) (6)
s = S(x) (7)
c = C(x) (8)
Out3 is the final output of the feature extraction layer, DF is the combination of three deconvolution layers and the corresponding activation function ReLU; P, W, S, and C represent four separate deconvolution layers, and […]; The neural network outputs a grasp map including the grasp quality, rotation, and gripper width (Wang, p. 476, § III).
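For illustration only, the grasp method detection module as recited in formulas (4)-(8) (three deconvolution layers with 3x3, 5x5, and 9x9 kernels, each followed by ReLU, restoring Out3 to the 300 x 300 input size, and four separate output heads P, W, S, and C) could be sketched as follows. The sketch assumes a PyTorch implementation; the strides, padding, decoder channel widths, and head kernel size are assumptions, since neither the claim as quoted above nor the cited portions of Wang recite them.

    import torch
    import torch.nn as nn

    class GraspDetectionModule(nn.Module):
        """Illustrative sketch of the claimed grasp method detection module:
        three deconvolution + ReLU layers (3x3, 5x5, 9x9 kernels) restoring the
        25 x 25 feature map Out3 to 300 x 300, followed by four separate output
        heads P, W, S, C (formulas (4)-(8))."""
        def __init__(self):
            super().__init__()
            self.DF = nn.Sequential(  # DF in formula (4): deconvolutions + ReLU
                nn.ConvTranspose2d(8, 8, 3, stride=2, padding=1, output_padding=1),   # 25 -> 50
                nn.ReLU(),
                nn.ConvTranspose2d(8, 16, 5, stride=2, padding=2, output_padding=1),  # 50 -> 100
                nn.ReLU(),
                nn.ConvTranspose2d(16, 32, 9, stride=3, padding=3),                   # 100 -> 300
                nn.ReLU(),
            )
            self.P = nn.ConvTranspose2d(32, 1, 1)  # grasp value map p, formula (5)
            self.W = nn.ConvTranspose2d(32, 1, 1)  # width map w, formula (6)
            self.S = nn.ConvTranspose2d(32, 1, 1)  # sine map s, formula (7)
            self.C = nn.ConvTranspose2d(32, 1, 1)  # cosine map c, formula (8)

        def forward(self, out3):                   # out3: (N, 8, 25, 25)
            x = self.DF(out3)                      # formula (4): x = DF(Out3)
            return self.P(x), self.W(x), self.S(x), self.C(x)

Under these assumptions, each of the four output maps has the 300 x 300 size recited for the grasp value, width, sine, and cosine maps.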
Wang teaches a model architecture having convolutional and deconvolutional layers, but does not expressly teach the "third module" as claimed. However, Y. Yu teaches: correspondingly p, w, s and c respectively represent the final output capture value map, width map, and the sine and cosine diagram of the rotation angle; the final capture method is expressed by the following formulas:
(i, j) = argmax(p) (9)
width = w(i, j) (10)
sinθ = s(i, j) (11)
cosθ = c(i, j) (12)
θ = arctan(sinθ / cosθ) (13)
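For illustration only, formulas (9)-(13) amount to selecting the pixel with the highest grasp value and reading the width and rotation angle at that pixel. A minimal sketch, assuming the four output maps are NumPy arrays:

    import numpy as np

    def decode_grasp(p, w, s, c):
        """p, w, s, c: 2-D maps (grasp value, width, sine, cosine)."""
        i, j = np.unravel_index(np.argmax(p), p.shape)  # formula (9): (i, j) = argmax(p)
        width = w[i, j]                                 # formula (10)
        sin_theta = s[i, j]                             # formula (11)
        cos_theta = c[i, j]                             # formula (12)
        # formula (13); np.arctan2(sin_theta, cos_theta) would avoid division by
        # zero, but the claim as quoted recites the arctangent of the ratio.
        theta = np.arctan(sin_theta / cos_theta)
        return (i, j), width, theta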
argmax represents the horizontal and vertical coordinates (i, j) of the maximum point in the figure; the width width, the sine value of the rotation angle sinθ and the cosine value of the rotation angle cosθ are respectively obtained from the corresponding output image and the above coordinates, and the final rotation angle θ is obtained by the arctangent function[ ]arctan; The grasp branch outputs a quality and pose of a grasp, including the height H and width W of the grasping rectangle, and the cosine C and sine S orientation of the grasping rectangle (Y. Yu, p. 1172, § III(B), fig. 4). The best grasp is obtained using a formula including the argmax of Q [equivalent to Applicant’s p] and arctan of the sine divided by the cosine of the points (u,v) (Y. Yu, p. 1173, equation 12).

(3)[ ]data selection module The data selection module shares all the feature maps obtained by the data feature extraction module, and uses these feature maps to obtain the final output; the output is between 0 and 1, which represents the probability that the input data is labeled data; the closer the value is to 0, it means the probability that the data has been labeled is smaller, so this labeled data should be selected less likely; A model for robot grasping comprising a first “BlitzNet” branch and a second “TsGNet” branch; the BlitzNet branch labels a pixel by outputting a probability between 0 and 1 that the pixel belongs to a class, where 1 represents that the pixel does belong to the class (Y. Yu, p. 1171, § III(A), final paragraph).

(3.1) module input: The input of this module is the combination of Out1, Out2 and[ ]Out3 obtained by formulas (1), (2) and (3); The input is an image from a Microsoft Kinect camera [which captures RGB and depth data] (Y. Yu, p. 1169, § II).

(3.2) module structure: since the feature maps obtained by the data feature extraction module are of different sizes, this module first uses the average pooling layer to perform dimensionality reduction operations on the feature maps; according to the number of channels of the three feature maps, they are reduced into feature vectors with 32, 16 and 8 channels respectively; […] the process is expressed as the following formulas:
f1 = FC(GAP(Out1)) (14)
f2 = FC(GAP(Out2)) (15)
f3 = FC(GAP(Out3)) (16)
k = F(f1 + f2 + f3) (17)
GAP represents the global average pooling layer, FC represents the fully connected layer, +[ ]represents the connection operation, F represents the combination of the convolutional layer, the activation function ReLU and the fully connected layer, and k is the final output value. The convolutional neural network architecture includes Global Average Pooling layers and concatenation [connection] (Y. Yu, pp. 1170–1171, § III(A)).
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to combine the teachings of Wang with those of Y. Yu. Doing so would have been a matter of applying a known technique [combining object detection with grasp generation] to a known method ready for improvement [the robot grasp generation method of Wang] to yield predictable results [a combined object detection/robot grasp generation method using CNNs configured as in Wang and Y. Yu].
Wang/Y. Yu teaches a model architecture having object detection and grasp generation, but does not expressly teach using a fully connected layer in the object detection branch, a pooling size of 2 x 2, etc. However, S. Yu teaches:

sizes of the convolution kernels of the four separate convolutional layers is 2x2 A convolutional neural network architecture having pooling kernels of size 2 x 2 (S. Yu, p. 3872, § 2.4).

after that, each feature vector goes through a fully connected layer separately, and outputs a vector of length 16; three vectors of length 16 are connected and merged to obtain a vector of length 48; in order to better extract features, a vector with a length of 48 is input to a convolutional layer and an activation function ReLU, and the number of output channels is 24; the vector with a length of 24 finally passes through the fully connected layer to output the final result value; A convolutional neural network architecture for object recognition in robotic grasping, based on an input RGB-D image, with fully connected layers (S. Yu, p. 3870, § 2.2, fig. 1). [S. Yu further teaches using ReLU as the activation function in CNNs; see e.g., p. 3872, § 2.4.]
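For illustration only, the data selection module as recited in formulas (14)-(17) and in the limitations quoted above (global average pooling of Out1, Out2 and Out3, a fully connected layer per scale outputting length-16 vectors, concatenation to a length-48 vector, a convolution + ReLU with 24 output channels, and a final fully connected layer) could be sketched as follows. The sketch assumes a PyTorch implementation; the sigmoid on the final value is an assumption, since the claim only recites that the output lies between 0 and 1.

    import torch
    import torch.nn as nn

    class DataSelectionModule(nn.Module):
        """Illustrative sketch of formulas (14)-(17)."""
        def __init__(self):
            super().__init__()
            self.GAP = nn.AdaptiveAvgPool2d(1)            # global average pooling
            self.fc1 = nn.Linear(32, 16)                  # FC for Out1 (32 channels)
            self.fc2 = nn.Linear(16, 16)                  # FC for Out2 (16 channels)
            self.fc3 = nn.Linear(8, 16)                   # FC for Out3 (8 channels)
            self.conv = nn.Conv1d(1, 24, kernel_size=48)  # convolution over the length-48 vector
            self.fc_out = nn.Linear(24, 1)                # final fully connected layer

        def forward(self, out1, out2, out3):
            f1 = self.fc1(self.GAP(out1).flatten(1))      # formula (14): f1 = FC(GAP(Out1))
            f2 = self.fc2(self.GAP(out2).flatten(1))      # formula (15): f2 = FC(GAP(Out2))
            f3 = self.fc3(self.GAP(out3).flatten(1))      # formula (16): f3 = FC(GAP(Out3))
            merged = torch.cat([f1, f2, f3], dim=1)       # "+" = concatenation, length 48
            k = torch.relu(self.conv(merged.unsqueeze(1)))  # conv + ReLU, 24 output channels
            k = self.fc_out(k.flatten(1))                 # formula (17): k = F(f1 + f2 + f3)
            return torch.sigmoid(k)                       # output between 0 and 1 (sigmoid assumed)

The 1-D convolution over the concatenated vector is only one way to read "convolutional layer" as applied to a length-48 vector; a fully connected layer of matching size would behave equivalently here.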
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to combine the teachings of Wang/Y. Yu with those of S. Yu. Doing so would have been a matter of simple substitution of one known element [the final layers of the neural network architecture of Y. Yu] for another [the fully connected layers of S. Yu] to obtain predictable results [a combined object detection/robot grasp generation method with the neural networks configured as in Wang, Y. Yu, and S. Yu].
Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Tyler Schallhorn whose telephone number is 571-270-3178. The examiner can normally be reached Monday through Friday, 8:30 a.m. to 6 p.m. (ET).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tamara Kyle can be reached at 571-272-4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (in the USA or Canada) or 571-272-1000.
/Tyler Schallhorn/Examiner, Art Unit 2144
/TAMARA T KYLE/Supervisory Patent Examiner, Art Unit 2144