Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
The instant application, Application No. 16/654,041, has a total of 20 claims pending.
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-16 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-7 of U.S. Patent No. 10762894 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because each of the limitations of the instant claims can be met by those of the patent.
Instant Application
10762894 B2
Examiner's Note
A system comprising: data processing hardware; and memory hardware storing instructions that when executed by the data processing hardware cause the data processing hardware to implement a speech recognition model, the speech recognition model comprising:
Claim 7: “using the convolutional neural network for keyword detection by receiving an audio signal”
Here the audio signal is being monitored for keywords, which clearly denotes some form of speech recognition. It would have been obvious to one of ordinary skill in the art at the time of filing that artificial neural networks require some form of hardware to implement their function.
A convolutional neural network comprising
Claim 1: … “a convolutional neural network…”
A first convolution neural network layer configured to receive, as input, at a two-dimensional input layer of the first convolution neural network layer, a two-dimensional matrix of input values, the two-dimensional matrix of input values comprising input values across a first dimension in time and input values across a second dimension in frequency, the input values across the second dimension in frequency representing log-mel filterbank features computed at a particular time
Claim 1: “a convolutional neural network that comprises a first convolutional layer and a second convolutional layer… providing a two dimensional set of input values to the convolutional neural network, the input values including values across a first dimension in time and values across a second dimension in frequency… to generate a first output comprising a feature map”
It would have been obvious to one of ordinary skill in the art at the time of filing to use similar convolutional neural network layer makeups, as doing so would allow the system to treat the layers uniformly and require less design effort on the user's part, with each convolutional layer having the same number of hidden units. The use of a particular type of frequency values would also have been obvious to one of ordinary skill in the art at the time of filing, because using known mathematics to obtain the generically claimed values would be required in order to have values for the neural network to operate on, as would the use of the appropriate number of dimensions to take in and process the input values.
Generate, by processing two-dimensional data of a two-dimensional time frequency area of the two-dimensional matrix of input values input to the two-dimensional input layer together at a same time, a first output from the two-dimensional matrix of input values and the first output comprising a feature map
Claim 1: using (i) a frequency stride greater than one, and (ii) a time stride equal to one, to generate a first output comprising a feature map;
A second convolution neural network layer having a same number of hidden units as the first convolutional neural network layer and a different number of parameters than the first convolution neural network layer, the second convolution neural network layer configured to receive the feature map generated by the first convolution neural network layer, and generate a second output using the feature map
Claim 1: “generating, by the second convolutional layer of the convolutional neural network, using the feature map, a second output”
a linear low rank layer configured to receive the second output generated by the second convolution neural network layer and generate a third output using the second output
Claim 1: “generating, by a linear low rank layer, using the second output, a third output.”
A deep neural network, comprising a single deep neural network layer, the single deep neural network layer configured to receive the third output generated by the linear low rank layer of the convolutional neural network and generate a fourth output using the third output
Claim 1: “generating, by a deep neural network, using the third output, a fourth output”
“Wherein the first convolution neural network layer and the second convolution neural network layer each use a same pooling in time dimension value”
Claim 1: “a convolutional neural network that comprises a first convolutional layer and a second convolutional layer… providing a two dimensional set of input values to the convolutional neural network, the input values including values across a first dimension in time and values across a second dimension in frequency… to generate a first output comprising a feature map”
It would have been obvious to one of ordinary skill in the art at the time of filing to allow the various layers to have parameters set by the user in order to pool time and frequency at the rates the user is interested in. As the ’894 patent has these layers, setting values to the user’s specification would be an obvious variant of the previous invention.
“Wherein the first convolution neural network layer uses a first frequency pooling dimension value and the second convolution neural network layer uses a second frequency pooling dimension value, the first frequency pooling dimension value is greater than the second pooling dimension value”
Claim 1: “a convolutional neural network that comprises a first convolutional layer and a second convolutional layer… providing a two dimensional set of input values to the convolutional neural network, the input values including values across a first dimension in time and values across a second dimension in frequency… to generate a first output comprising a feature map”
It would have been obvious to one of ordinary skill in the art at the time of filing to allow the various layers to have parameters set by the user in order to pool time and frequency at the rates the user is interested in. As the ’894 patent has these layers, setting values to the user’s specification would be an obvious variant of the previous invention.
As shown above, each limitation of the independent claim is met by limitations of U.S. Patent No. 10762894 B2. Accordingly, the claim is rejected on the ground of nonstatutory double patenting.
As per claims 2, 4-11, and 13-20, these claims are rejected over claims 1-7 of U.S. Patent No. 10762894 B2 for reasons similar to those given above.
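For orientation only, the layer stack recited in the instant independent claim (two convolutional layers, a linear low rank layer, a single deep layer) can be summarized with simple shape arithmetic. The sketch below is an editor's illustration: the input size, filter sizes, and strides are hypothetical values, not taken from the claims or the ’894 patent.

```python
# Valid (no-padding) convolution output length along one dimension.
# All concrete numbers below are hypothetical, for illustration only.
def conv_out(size, filt, stride):
    return (size - filt) // stride + 1

t, f = 32, 40                # e.g. 32 time frames x 40 log-mel frequency bins
t1 = conv_out(t, 9, 1)       # first conv layer: time stride equal to one
f1 = conv_out(f, 8, 2)       # frequency stride greater than one (here 2)
print(t1, f1)                # feature-map extent in time and frequency: 24 17
```

A frequency stride greater than one shrinks the frequency axis of the resulting feature map faster than the time axis, consistent with the "(i) a frequency stride greater than one, and (ii) a time stride equal to one" language quoted from the ’894 claims.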
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 11 and 13-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
As per claim 11, the claim recites “… input to the two-dimensional input first layer of the convolution neural network.” However, “a two-dimensional input” was removed from the claim in the preceding receiving step. The limitation is likely intended to read simply “the first layer of the convolutional neural network”; leaving “two-dimensional input” in place deprives the term of sufficient antecedent basis at lines 10-11 of the claim. Claim 11 is therefore rejected under 112(b) as failing to particularly point out and distinctly claim the intended invention.
As per claims 13-20, these claims are rejected as being dependent on a claim rejected under 35 U.S.C. 112(b) for failing to particularly point out and distinctly claim the intended invention.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 5, 8-14, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sainath et al. (“Learning Filter Banks Within a Deep Neural Network Framework”) in view of Simard et al. (US 2007/0086655 A1), Abdel-Hamid et al. (hereinafter Abdel, “Convolutional Neural Networks for Speech Recognition”), Sainath-2 et al. (“Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets”), Soltau et al. (“Joint Training of Convolutional and Non-Convolutional Neural Networks”), and Sainath-3 et al. (“Improvements to Deep Convolutional Neural Networks for LVCSR”).
As per claim 1, Sainath discloses, “A system comprising: … a speech recognition model, the speech recognition model comprising” (Pg.297, particularly the introduction section; EN: this denotes the neural networks being used for speech recognition).
“a convolutional neural network comprising” (pg.298, particularly section 2; EN: this denotes the use of a convolutional neural network).
“A first convolutional neural network layer configured to” (Pg.299, particularly section 3; EN: this denotes two convolutional layers). “receive as input … a two-dimensional … of input values, the two dimensional … input values comprising input values across a first dimension in time and input values across a second dimension in frequency” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency). “The input values across the second dimension in frequency representing log-mel filterbank features computed at a particular time” (Figure 1 and associated paragraphs; EN: this denotes putting log mel filter bank features in as input).
“generate, by processing two-dimensional data of a two-dimensional time-frequency area of the two-dimensional … of input values input to the first layer of the convolution neural network together at a same time… ” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency). “…a first output from the two dimensional … of input values… (Pg.299, particularly section 3; EN: this denotes two convolutional layers with outputs).
“a second convolution neural network layer having a same number of hidden units as the first convolution neural network layer” (Pg.299, particularly section 3; EN: this denotes two convolutional layers. Each convolutional layer has 256 hidden units). “…the second convolution neural network layer configured to receive the … generated by the first convolution neural network layer and generate a second output using the …” (Pg.299, particularly section 3; EN: this denotes two convolutional layers).
“A deep neural network comprising a single deep neural network layer, the single deep neural network layer configured to receive the … output generated by the … layer of the convolutional neural network and generate a … output using the … output” (pg.167, particularly section 3.4; EN: this denotes the use of 3 fully connected layers. Pg.298, C1, second paragraph; EN: this denotes the neural network being a deep convolutional neural network, making these layers of a deep neural network).
However, Sainath fails to explicitly disclose, “data processing hardware; and memory hardware storing instructions that, when executed by the data processing hardware, cause the data processing hardware to implement” “receiving, as input at a two-dimensional input layer of the first convolutional neural network layer, a two-dimensional matrix”, “… the two-dimensional matrix of input values input to the two-dimensional input layer together at a same time”, “and the first output comprising a feature map”, “the feature map”, “… a different number of parameters than the first convolution neural network layer”, “a linear low rank layer configured to receive the second output generated by the second convolution neural network layer and generate a third output using the second output”, “the third output”, “a fourth output”, “wherein the first convolution neural network layer and the second convolution neural network layer each use a same pooling in time dimension value”, and “wherein the first convolution neural network layer uses a first frequency pooling dimension value and the second convolution neural network layer uses a second frequency pooling dimension value, the first frequency pooling dimension value is greater than the second frequency pooling dimension value”
Simard discloses, “data processing hardware; and memory hardware storing instructions that, when executed by the data processing hardware, cause the data processing hardware to implement” (Pg.2, particularly paragraph 0024; EN: this denotes the hardware used to implement the convolutional neural network).
“receiving, as input, a two-dimensional matrix”, “two dimensional matrix”, “and the two-dimensional matrix” (pg.1-2, particularly paragraph 0009; EN: this denotes using matrices for inputs to convolutional neural networks).
Abdel discloses, “and the first output comprising a feature map”, “the feature map” (Pg.1535, particularly Section A; EN: this denotes using feature maps with speech based convolutional neural networks with the two dimensional feature map being based on frequency and time as discussed in the Sainath reference above).
Soltau discloses, “…having a different number of parameters than the first convolution neural network layer” (Pg.5573, particularly Figure 1 and associated paragraphs; EN: this denotes two convolutional layers in the bottom right, each which have a different number of parameters).
Sainath-2 discloses, “a linear low rank layer configured to receive the second output generated by the second convolution neural network layer and generate a third output using the second output”, “the third output”, “a fourth output” (Pg.6656, particularly section 2; EN: this denotes adding a low-rank layer to the neural network).
Sainath-3 discloses, “wherein the first convolution neural network layer and the second convolution neural network layer each use a same pooling in time dimension value” (Pg.3, particularly section 3.5; EN: this denotes not pooling in time (i.e. a pooling value of 1, as seen in the instant specification in Table 1 and associated paragraphs)).
“wherein the first convolution neural network layer uses a first frequency pooling dimension value and the second convolution neural network layer uses a second frequency pooling dimension value, the first frequency pooling dimension value is greater than the second frequency pooling dimension value” (Pg.3, particularly section 3.5; EN: this denotes setting the pooling for frequency at 3 in the first layer, and not doing pooling in the second layer (i.e. 1, as in the instant specification in Table 1 and associated paragraphs)).
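To make the cited pooling values concrete, the following sketch applies non-overlapping max-pooling along the frequency axis with a pooling value of 3 (as in the first layer per Sainath-3) versus a value of 1, i.e. no pooling (second layer). The feature-map numbers are invented solely for illustration and come from no cited reference.

```python
# Non-overlapping max-pooling along the frequency axis of a feature map.
# feature_map is a list of time frames, each a list of frequency bins;
# the values below are made up purely to illustrate the pooling step.
def pool_frequency(feature_map, pool):
    return [[max(row[i:i + pool]) for i in range(0, len(row), pool)]
            for row in feature_map]

fmap = [[1, 5, 2, 9, 3, 4],
        [7, 0, 6, 1, 8, 2]]
print(pool_frequency(fmap, 3))  # pooling value 3: [[5, 9], [7, 8]]
print(pool_frequency(fmap, 1))  # pooling value 1 leaves the map unchanged
```

A pooling value of 1 is thus equivalent to performing no pooling at all, which is how the "no pooling in time" and "no pooling in the second layer" disclosures map onto pooling dimension values of 1.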
Sainath and Simard are analogous art because both involve convolutional neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Simard in order to make use of matrices for inputting data.
The motivation for doing so would be to allow the “input matrix [to] thus be rewritten in a manner such that each row thereof comprises all input feature values required for generation of one element of an output feature” (Simard, Pg.1-2, paragraph 0009).
Therefore before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Simard in order to make use of matrices for inputting data.
Sainath and Abdel are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Abdel in order to make use of feature maps with convolutional neural network layers.
The motivation for doing so would be to “organize speech feature vectors into feature maps that are suitable for CNN processing” (Abdel, Pg.1535, Section A, second paragraph).
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Abdel in order to make use of feature maps with convolutional neural network layers.
Sainath and Sainath-2 are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-2 in order to make use of a linear low rank layer.
The motivation for doing so would be because “a low-rank factorization reduces the number of parameters of the network by 30-50%. This results in roughly an equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation” (Sainath-2, abstract).
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-2 in order to make use of a linear low rank layer.
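The parameter savings quoted from Sainath-2 can be verified with simple counting: factoring a dense weight matrix through a linear low-rank bottleneck cuts the parameter count whenever the rank is small relative to the layer widths. The layer sizes and rank below are hypothetical illustrations chosen by the editor, not values from the cited references.

```python
# Parameter counts for a dense layer versus a linear low-rank factorization.
# All sizes are hypothetical, for illustration only.
def full_rank_params(n_in, n_out):
    return n_in * n_out + n_out          # weights plus biases

def low_rank_params(n_in, n_out, rank):
    # W (n_in x n_out) factored as A (n_in x rank) @ B (rank x n_out);
    # the bottleneck A is linear, with no bias or nonlinearity of its own.
    return n_in * rank + rank * n_out + n_out

full = full_rank_params(1024, 2048)
low = low_rank_params(1024, 2048, 128)
print(full, low, f"{1 - low / full:.0%} fewer parameters")
```

With these (made-up) sizes the factorization removes over 80% of the parameters, comfortably within the direction of the 30-50% reduction Sainath-2 reports for its own configurations.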
Sainath and Soltau are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Soltau in order to make use of a different number of parameters in the convolutional layers.
The motivation for doing so would be because allowing the flexibility in parameters can improve performance, such as a “gain another .5% improvement by switching to a CNN. The Jointly trained MLP/CNN improves the error rate from 11.85% to 11.2%” (Soltau, Pg.5574, C1, second paragraph) or in the case of Sainath, allow the system to be manipulated as needed to improve performance, such as allowing the convolutional neural network layers to have different parameters so they can be changed and updated in order to improve performance such as seen in the Soltau reference.
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Soltau in order to make use of a different number of parameters in the convolutional layers.
Sainath and Sainath-3 are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-3 in order to make use of specific values for pooling in time and frequency.
The motivation for doing so would be because “In [4] it was found that having two convolutional layers and four fully connected layers was optimal for LVCSR tasks. We found that a pooling size of 3 was appropriate for the first convolutional layer, while no pooling was used in the second layer” (Sainath-3, Pg.2, section 2, first paragraph) or in the case of Sainath, allow the system to use the setup found in Sainath’s previous work to be optimal for language processing of the Sainath reference.
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-3 in order to make use of specific values for pooling in time and frequency.
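As general background for the log-mel filterbank features cited throughout the claim 1 mapping above, the mel scale maps linear frequency onto a perceptual scale before log filterbank energies are taken. The sketch below uses the common 2595·log10(1 + f/700) convention; it is textbook background supplied for context, not material drawn from any cited reference.

```python
import math

# Common mel-scale conversion (the 2595 * log10(1 + f/700) convention);
# log-mel filterbank features are log energies of filters spaced on this scale.
def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))   # 1000 Hz sits at roughly 1000 mel
```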
As per claim 2, Sainath discloses, “further comprising a SoftMax layer configured to receive the fourth output from the deep neural network and generate a final output for the speech recognition model” (Pg.299, particularly Section 3; EN: this denotes the neural network ending with a SoftMax layer).
As per claims 4 and 13, Sainath discloses, “wherein an accuracy of the final output is used to update the convolution neural network” (Pg.299, particularly section 3; EN: this denotes using loss during training in order to improve the performance of the neural network).
As per claims 5 and 14, Sainath discloses, “the second output comprises a second matrix” (Pg.299, particularly section 4.2.1; EN: this denotes manipulating the data in matrices made up of rows and columns I and j).
“creating a vector from the second matrix; and generating the … using the vector” (Pg.298, particularly C2; EN: this denotes making use of the vectors found in the matrix to perform the operations of the neural network).
Abdel discloses, “wherein the feature map comprises a first matrix” (Pg.1536, C2; EN: this denotes the feature map being manipulated as matrices when input).
Sainath-2 discloses, “the linear low rank layer is configured to generate the third output by… generating the third vector…” (Pg.6656, particularly section 2; EN: this denotes adding a low-rank layer to the neural network).
As per claim 8, Sainath discloses, “wherein the convolution neural network is configured to: receive an audio signal encoding an utterance” (Abstract; EN: this denotes the use of the neural network for speech recognition).
“analyze the audio signal to identify a command included in the utterance” (Abstract; EN: this denotes performing speech recognition, which would include commands given within the speech).
As per claims 9 and 18, Sainath discloses, “Wherein the convolution neural network further comprises at least one max-pooling layer configured to remove variability in the input values in the first dimension and the input values in the second dimension” (Pg.300, particularly C2, last paragraph; EN: this denotes the use of max-pooling).
As per claims 10 and 19, Sainath discloses, “wherein the first convolution neural network layer comprises a filter size in time that spans … an overall size of the input values across the first dimension in time” (Pg.301, particularly section 4.4; EN: this denotes manipulating the filter size in relation to the inputs).
However, Sainath fails to explicitly disclose, “two thirds of the input values.”
Abdel discloses, “approximately two thirds of the input values” (Pg.1535, particularly C2, last paragraph; EN: this denotes an input range of 9-15 frames; Pg.1538, particularly Fig.4; EN: this denotes a filter size of 5, which would be roughly two thirds of an input range of 9. Pg.1541, particularly Fig.7; EN: this denotes a filter size of 8, which would be two thirds of an input range of 12).
Furthermore, the Examiner takes official notice that selecting a particular filter size is routine experimentation, and it is not inventive to discover the optimum or workable ranges via routine experimentation. One of ordinary skill in the art would be able to pick various filter sizes which meet the needs of their neural network, and selecting a particular filter size for a neural network is nothing more than routine optimization. See In re Aller, 220 F.2d 454, 456, 105 USPQ 233, 235 (CCPA 1955) and MPEP 2144.05(II). The rationale is that multiple references show different filter sizes, and merely choosing a particular filter size is a routine aspect of designing and operating a neural network.
Further, as Applicant failed to traverse or otherwise challenge this official notice, the noticed statement is now considered to be Applicant-admitted prior art.
Sainath and Abdel are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Abdel in order to make use of filter sizes of 2/3 the input values.
The motivation for doing so would be to improve the error rate by setting an appropriate filter size (See Abdel, Pg.1542, Fig.10) or in the case of Sainath, allow the filter size to be adjusted to get the best performance of the neural network.
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Abdel in order to make use of filter sizes of 2/3 the input values.
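The "two thirds" reading applied to the Abdel filter sizes is straightforward arithmetic; the check below simply divides the filter sizes by the input ranges taken from the citations above, combined here purely for illustration.

```python
# Fraction of the input span (in time frames) covered by a time filter.
def filter_fraction(filter_size, input_frames):
    return filter_size / input_frames

print(round(filter_fraction(8, 12), 2))  # 0.67, i.e. roughly two thirds
print(round(filter_fraction(5, 9), 2))   # 0.56, somewhat under two thirds
```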
As per claim 11, Sainath discloses, “A …method for training a speech recognition model, the method comprising” (Pg.297, particularly the introduction section; EN: this denotes the neural networks being used for speech recognition).
“Receiving, as input at a first layer of a convolutional neural network” (pg.298, particularly section 2; EN: this denotes the use of a convolutional neural network). “a two-dimensional … of input values, the two-dimensional … of input values comprising input values across a first dimension of time and input values across a second dimension of frequency” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency). “The input values across the second dimension in frequency representing log-mel filterbank features computed at a particular time” (Figure 1 and associated paragraphs; EN: this denotes putting log mel filter bank features in as input).
“Generating, by processing two-dimensional data of a two-dimensional time-frequency area…” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency). “input to the … first layer of the convolutional neural network together at a same time by the first layer of the convolutional neural network” (Pg.299, particularly section 3; EN: this denotes two convolutional layers). “output from the two-dimensional … of input values the first output” (Pg.299, particularly section 3; EN: this denotes two convolutional layers with outputs).
“Generating, by a second layer of the neural network, a second output…” (Pg.299, particularly section 3; EN: this denotes two convolutional layers). “The second layer of the convolutional neural network having a same number of hidden units as the first layer of the convolutional neural network” (Pg.299, particularly section 3; EN: this denotes two convolutional layers. Each convolutional layer has 256 hidden units).
“Generating, by a deep neural network comprising a single deep neural network layer, a … output, using the … output” (pg.167, particularly section 3.4; EN: this denotes the use of 3 fully connected layers. Pg.298, C1, second paragraph; EN: this denotes the neural network being a deep convolutional neural network, making these hidden layers of a deep neural network).
“generating, by a SoftMax layer, a final output of the speech recognition model using the …output” (Pg.299, particularly Section 3; EN: this denotes the neural network ending with a SoftMax layer).
However, Sainath fails to explicitly disclose, “computer implemented… executed by data processing hardware that causes the data processing hardware to perform operations comprising” “receiving as input at a two dimensional input layer of a convolutional neural network a two-dimensional matrix of input”, “…of the two-dimensional matrix of input values input to the two-dimensional input layer together at the same time”, “the first output comprising a feature map”, “using the feature map”, “…a different number of parameters than the first layer of the convolutional neural network”, “Generating, by a linear low rank layer, a third output, using the second output”, “third output”, “fourth output”, “wherein the first convolution neural network layer and the second convolution neural network layer each use a same pooling in time dimension value”, and “wherein the first convolution neural network layer uses a first frequency pooling dimension value and the second convolution neural network layer uses a second frequency pooling dimension value, the first frequency pooling dimension value is greater than the second frequency pooling dimension value”.
Simard discloses, “computer implemented… executed by data processing hardware that causes the data processing hardware to perform operations comprising” (Pg.2, particularly paragraph 0024; EN: this denotes the hardware used to implement the convolutional neural network).
“receiving as input … a two-dimensional matrix of input”, “of the two-dimensional matrix of input values”, and “two dimensional matrix” (pg.1-2, particularly paragraph 0009; EN: this denotes using matrices for inputs to convolutional neural networks).
Abdel discloses, “receiving as input at a two dimensional input layer of a convolutional neural network” and “input to the two-dimensional input layer together at the same time” (pg.1536, particularly Figures 1 and 2; EN : Both of these images show the first layer, the input layer, taking in two dimensional data, thus making them two dimensional input layers).
“the first output comprising a feature map”, “using the feature map” (Pg.1535, particularly Section A; EN: this denotes using feature maps with speech based convolutional neural networks with the two dimensional feature map being based on frequency and time as discussed in the Sainath reference above).
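For illustration only, a minimal sketch (with sizes that are assumptions chosen for the example, not values taken from Abdel or Sainath) of organizing per-frame speech feature vectors into a two-dimensional time-by-frequency feature map suitable for CNN processing:

```python
import numpy as np

# Hypothetical sketch: 11 context frames of 40 log-mel features each
# (illustrative sizes only, not from the cited references).
num_frames, num_mel_bins = 11, 40
frames = [np.arange(num_mel_bins, dtype=float) for _ in range(num_frames)]

# Stack the per-frame feature vectors into one two-dimensional input matrix
# whose rows index time and whose columns index frequency.
feature_map = np.stack(frames, axis=0)
print(feature_map.shape)  # (11, 40)
```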
Sainath-2 discloses, “Generating, by a linear low rank layer, a third output, using the second output”, “third output”, and “fourth output” (Pg.6656, particularly section 2; EN: this denotes adding a low-rank layer to the neural network).
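For illustration only, a minimal sketch of a linear low-rank (bottleneck) layer of the kind Sainath-2 describes, in which one large weight matrix is replaced by two smaller factors with no nonlinearity between them; the dimensions and rank here are assumptions chosen for the example, not values from the reference:

```python
import numpy as np

# Hypothetical dimensions (illustrative only).
d_in, d_out, rank = 512, 2048, 128
full = d_in * d_out                    # parameters in the full-rank matrix
factored = d_in * rank + rank * d_out  # parameters after low-rank factorization

rng = np.random.default_rng(0)
x = rng.standard_normal(d_in)          # the "second output" fed into this layer
A = rng.standard_normal((rank, d_in))  # first factor of the weight matrix
B = rng.standard_normal((d_out, rank)) # second factor; B @ A has rank <= 128
third_output = B @ (A @ x)             # linear low-rank layer output

print(full, factored)  # 1048576 327680
```

The factorization shrinks this layer's parameter count from 1,048,576 to 327,680 while still producing an output of the full dimension.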
Soltau discloses, “…a different number of parameters than the first layer of the convolutional neural network” (Pg.5573, particularly Figure 1 and associated paragraphs; EN: this denotes two convolutional layers in the bottom right, each which have a different number of parameters).
Sainath-3 discloses, “wherein the first convolution neural network layer and the second convolution neural network layer each use a same pooling in dimension value” (Pg.3, particularly section 3.5; EN: this denotes not pooling in time (i.e. a pooling value of 1, as seen in the instant specification in Table 1 and associated paragraphs)).
“wherein the first convolution neural network layer uses a first frequency pooling dimension value and the second convolution neural network layer uses a second frequency pooling dimension value, the first frequency pooling dimension value is greater than the second frequency pooling dimension value” (Pg.3, particularly section 3.5; EN: this denotes setting the pooling for frequency at 3 in the first layer, and not doing pooling in the second layer (i.e. 1, as in the instant specification in Table 1 and associated paragraphs)).
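For illustration only, a minimal sketch (with illustrative data, not from the cited references) of pooling only along the frequency axis: a pooling value of 3 in the first convolutional layer and a pooling value of 1 (i.e., no pooling) in the second, with no pooling in time, mirroring the values discussed above:

```python
import numpy as np

def max_pool_freq(x, pool):
    """Non-overlapping max pooling along the frequency (last) axis only."""
    t, f = x.shape
    f_trim = (f // pool) * pool  # drop any remainder frequency bands
    return x[:, :f_trim].reshape(t, f_trim // pool, pool).max(axis=2)

# Illustrative input: 4 time frames by 9 frequency bands.
x = np.arange(4 * 9, dtype=float).reshape(4, 9)
layer1 = max_pool_freq(x, 3)   # frequency pooling dimension value of 3
layer2 = max_pool_freq(layer1, 1)  # pooling value of 1 leaves data unchanged
print(layer1.shape, layer2.shape)  # (4, 3) (4, 3)
```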
Sainath and Simard are analogous art because both involve convolutional neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Simard in order to make use of matrices for inputting data.
The motivation for doing so would be to allow the “input matrix [to] thus be rewritten in a manner such that each row thereof comprises all input feature values required for generation of one element of an output feature” (Simard, Pg.1-2, paragraph 0009).
Therefore before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Simard in order to make use of matrices for inputting data.
Sainath and Abdel are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Abdel in order to make use of feature maps with convolutional neural network layers.
The motivation for doing so would be to “organize speech feature vectors into feature maps that are suitable for CNN processing” (Abdel, Pg.1535, Section A, second paragraph).
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Abdel in order to make use of feature maps with convolutional neural network layers.
Sainath and Sainath-2 are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-2 in order to make use of a linear low rank layer.
The motivation for doing so would be that “a low-rank factorization reduces the number of parameters of the network by 30-50%. This results in roughly an equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation” (Sainath-2, abstract).
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-2 in order to make use of a linear low rank layer.
Sainath and Soltau are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Soltau in order to make use of a different number of parameters in the convolutional layers.
The motivation for doing so would be that flexibility in the number of parameters can improve performance, such as the ability to “gain another .5% improvement by switching to a CNN. The Jointly trained MLP/CNN improves the error rate from 11.85% to 11.2%” (Soltau, Pg.5574, C1, second paragraph), or, in the case of Sainath, to allow the convolutional neural network layers to have different parameters so they can be changed and updated to improve performance, as seen in the Soltau reference.
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Soltau in order to make use of a different number of parameters in the convolutional layers.
Sainath and Sainath-3 are analogous art because both involve speech recognition neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-3 in order to make use of specific values for pooling in time and frequency.
The motivation for doing so would be that “In [4] it was found that having two convolutional layers and four fully connected layers was optimal for LVCSR tasks. We found that a pooling size of 3 was appropriate for the first convolutional layer, while no pooling was used in the second layer” (Sainath-3, Pg.2, section 2, first paragraph), or, in the case of Sainath, to allow the system to use the setup found to be optimal in Sainath-3’s previous work for the language processing of the Sainath reference.
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech recognition neural networks to combine the work of Sainath and Sainath-3 in order to make use of specific values for pooling in time and frequency.
As per claim 20, Sainath discloses, “Further comprising, after training the speech recognition model, providing the trained speech recognition model to a device for use by the device for keyword detection of one or more key phrases” (Pg.165, particularly the introduction section; EN: this denotes the neural networks being used for speech recognition).
Claim Rejections - 35 USC § 103
Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Sainath et al (“Learning Filter Banks Within a Deep Neural Network Framework”) in view of Simard et al (“US 20070086655 A1”), Abdel-Hamid (Hereinafter Abdel, “Convolutional Neural Networks for Speech Recognition”), Sainath-2 et al (“Low-rank Matrix factorization for Deep Neural networks training with High-Dimensional Output Targets”), Soltau et al (“Joint Training of Convolutional and Non-Convolutional Neural Networks”) and Sainath-3 et al (“Improvements to Deep Convolutional Neural Networks for LVCSR”) and further in view of Gibiansky (“Convolutional Neural Networks”).
As per claim 6, Sainath discloses, “Wherein the first convolution neural network layer is configured to generate the … by performing … on the two-dimensional … of input values for a filter that has a time span that extends over all of the input values in the first dimension” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency. As there is no discussion of span, it is assumed that both of these describe the full range of these values). “and a frequency span that extends over less than all of the input values in the second dimension” (Pg.298, particularly C2, second paragraph; EN: this denotes limiting the span to local frequencies).
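For illustration only (dimensions are assumptions, not from the cited references), a minimal sketch of a filter whose time span extends over all of the input frames while its frequency span extends over only a local band, so that the filter slides only along the frequency axis:

```python
import numpy as np

# Illustrative input: 8 time frames by 40 frequency bands.
num_frames, num_bins = 8, 40
x = np.zeros((num_frames, num_bins))

# Filter time span = all 8 frames; frequency span = 9 bands (< 40).
filt = np.zeros((num_frames, 9))

# Because the filter already covers all of time, it can only shift in frequency.
freq_positions = num_bins - filt.shape[1] + 1
out = np.array([np.sum(x[:, j:j + filt.shape[1]] * filt)
                for j in range(freq_positions)])
print(filt.shape, out.shape)  # (8, 9) (32,)
```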
Simard discloses, “two dimensional matrix” (pg.1-2, particularly paragraph 0009; EN: this denotes using matrices for inputs to convolutional neural networks).
Abdel discloses, “feature map” (Pg.1535, particularly Section A; EN: this denotes using feature maps with speech based convolutional neural networks with the two dimensional feature map being based on frequency and time as discussed in the Sainath reference above).
However, Sainath fails to explicitly disclose, “convolution multiplication.”
Gibiansky discloses, “convolution multiplication” (Pg.2-3, particularly the “convolutional layers” section; EN: this denotes the actual mathematics of performing convolution, which includes multiplication).
Sainath and Gibiansky are analogous art because both involve convolutional neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Gibiansky in order to allow the use of matrix multiplication with convolution.
The motivation for doing so would be to use the mathematics needed to perform the convolution of the Sainath reference by “sum[ming] up the contributions (weighted by the filter components) from the previous layer cells… this is just a convolution, which we can express via matlab…” (Gibiansky, Pg. 2-3, convolution layers section).
Therefore before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Gibiansky to allow the use of matrix multiplication with convolution.
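For illustration only, a minimal sketch of convolution expressed as multiplication: each output cell is a sum of previous-layer cells weighted by the filter components, as the Gibiansky quotation above describes. The “valid” 2D convolution below is written as explicit multiplies and adds; the data and filter are illustrative assumptions:

```python
import numpy as np

def conv2d_valid(x, w):
    """2D 'valid' convolution: each output cell is the sum of an input patch
    multiplied elementwise by the filter components."""
    th = x.shape[0] - w.shape[0] + 1
    tf = x.shape[1] - w.shape[1] + 1
    out = np.zeros((th, tf))
    for i in range(th):
        for j in range(tf):
            out[i, j] = np.sum(x[i:i + w.shape[0], j:j + w.shape[1]] * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # illustrative 4x4 input
w = np.ones((2, 2))                           # a simple summing filter
result = conv2d_valid(x, w)
print(result.shape)  # (3, 3)
```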
As per claim 16, Sainath discloses, “wherein generating, by the first layer of the convolution neural network, the first output comprises performing convolution… on the two-dimensional … of input values for a filter that has a time span that extends over all of the input values in the first dimension” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency. As there is no discussion of span for time, it is assumed that this covers all of the time). “and a frequency span that extends over less than all of the input values in the second dimension” (Pg.298, particularly C2, second paragraph; EN: this denotes limiting the span to local frequencies).
Simard discloses, “two dimensional matrix” (pg.1-2, particularly paragraph 0009; EN: this denotes using matrices for inputs to convolutional neural networks).
However, Sainath fails to explicitly disclose, “Convolution multiplication.” Gibiansky discloses, “convolution multiplication” (Pg.2-3, particularly the “convolutional layers” section; EN: this denotes the actual mathematics of performing convolution, which includes multiplication).
Sainath and Gibiansky are analogous art because both involve convolutional neural networks.
Before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Gibiansky in order to allow the use of matrix multiplication with convolution.
The motivation for doing so would be to use the mathematics needed to perform the convolution of the Sainath reference by “sum[ming] up the contributions (weighted by the filter components) from the previous layer cells… this is just a convolution, which we can express via matlab…” (Gibiansky, Pg. 2-3, convolution layers section).
Therefore before the effective filing date it would have been obvious to one skilled in the art of convolutional neural networks to combine the work of Sainath and Gibiansky to allow the use of matrix multiplication with convolution.
Claim Rejections - 35 USC § 103
Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Sainath et al (“Learning Filter Banks Within a Deep Neural Network Framework”) in view of Simard et al (“US 20070086655 A1”), Abdel-Hamid (Hereinafter Abdel, “Convolutional Neural Networks for Speech Recognition”), Sainath-2 et al (“Low-rank Matrix factorization for Deep Neural networks training with High-Dimensional Output Targets”), Soltau et al (“Joint Training of Convolutional and Non-Convolutional Neural Networks”), Sainath-3 et al (“Improvements to Deep Convolutional Neural Networks for LVCSR”) and Gibiansky (“Convolutional Neural Networks”) and further in view of Wulfing et al (“Unsupervised Learning of Local Features for Music Classification”).
As per claims 7 and 17, Sainath discloses, “Wherein performing the convolution… on the two-dimensional … of inputs comprises performing the convolution ... on the two dimensional set of input values for the filter…” (Pg.298, particularly C1, last paragraph; EN: this denotes the data making use of both time and frequency).
Simard discloses, “two-dimensional matrix” (pg.1-2, particularly paragraph 0009; EN: this denotes using matrices for inputs to convolutional neural networks).
Gibiansky discloses, “convolution multiplication” (Pg.2-3, particularly the “convolutional layers” section; EN: this denotes the actual mathematics of performing convolution, which includes multiplication).
However, Sainath fails to explicitly disclose, “using a frequency stride greater than one and a time stride equal to one”
Wulfing discloses, “using a frequency stride greater than one and a time stride equal to one” (pg.142-143, particularly section 6; EN: this denotes using various strides to improve accuracy).
Furthermore, the Examiner is taking official notice that selecting a particular stride is routine experimentation and it is not inventive to discover the optimum or workable ranges via routine experimentation. Someone of ordinary skill in the art would be able to pick various strides which meet the needs of their neural network and selecting a particular stride number for a neural network is nothing more than routine optimization. See In re Aller, 220 F.2d 454, 456, 105 USPQ 233, 235 (CCPA 1955) and MPEP 2144.05(II). The rationale is that multiple references show different strides, and merely choosing a particular stride is a routine aspect of designing and operating a neural network.
Note: Since the Applicant failed to respond to this official notice given in the previous office action, this is now applicant admitted prior art.
Sainath and Wulfing are analogous art because both involve convolutional machine learning.
Before the effective filing date it would have been obvious to one skilled in the art of convolutional machine learning to combine the work of Sainath and Wulfing in order to use various strides.
The motivation for doing so would be that “another way of speeding up the extraction is to increase the stride s. This however, has a stronger effect on accuracy…” (Wulfing, Pg.143), or, in the case of Sainath, to allow the system to select strides as needed to balance speed against accuracy for the neural network.
Therefore before the effective filing date it would have been obvious to one skilled in the art of convolutional machine learning to combine the work of Sainath and Wulfing in order to use various strides.
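For illustration only, a minimal sketch of applying a filter with a time stride equal to one and a frequency stride greater than one (here 2), as recited; a larger frequency stride reduces the number of frequency positions at which the filter is evaluated. All sizes are assumptions chosen for the example:

```python
def strided_positions(t_len, f_len, fh, fw, t_stride=1, f_stride=2):
    """Output grid size for an fh x fw filter applied to a t_len x f_len
    input with the given time and frequency strides ('valid' placement)."""
    t_out = (t_len - fh) // t_stride + 1
    f_out = (f_len - fw) // f_stride + 1
    return t_out, f_out

# Illustrative: 10 frames x 40 bands, 3x8 filter, time stride 1, freq stride 2.
print(strided_positions(10, 40, 3, 8))  # (8, 17)
```

With a frequency stride of 1 the same filter would be evaluated at 33 frequency positions instead of 17, so the stride roughly halves the work along frequency while leaving the time resolution unchanged.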
Claim Rejections - 35 USC § 103
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Sainath et al (“Learning Filter Banks Within a Deep Neural Network Framework”) in view of Simard et al (“US 20070086655 A1”), Abdel-Hamid (Hereinafter Abdel, “Convolutional Neural Networks for Speech Recognition”), Sainath-2 et al (“Low-rank Matrix factorization for Deep Neural networks training with High-Dimensional Output Targets”), Soltau et al (“Joint Training of Convolutional and Non-Convolutional Neural Networks”) and Sainath-3 et al (“Improvements to Deep Convolutional Neural Networks for LVCSR”) and further in view of Wheeler et al (“Voice recognition will always be stupid”).
As per claim 15, Sainath discloses, “Further comprising using the convolution neural network for keyword detection by: receiving an audio signal encoding an utterance” (Abstract; EN: this denotes the use of the neural network for speech recognition).
“analyzing the audio signal to identify a command included in the utterance” (Abstract; EN: this denotes performing speech recognition, which would include commands given within the speech).
However, Sainath fails to explicitly disclose, “performing an action that corresponds to the command.”
Wheeler discloses, “performing an action that corresponds to the command” (pg.1; EN: this denotes speech commands being used for customer support).
Wheeler and Sainath are analogous art because both involve speech recognition.
Before the effective filing date it would have been obvious to one skilled in the art of speech detection to combine the work of Sainath and Wheeler in order to make use of speech detection in a device.
The motivation for doing so would be to provide “non-human customer service” (Wheeler, Pg.1), or, in the case of Sainath, to allow the system’s speech recognition to be used for customer service or other machine-based responses.
Therefore before the effective filing date it would have been obvious to one skilled in the art of speech detection to combine the work of Sainath and Wheeler in order to make use of speech detection in a device.
Response to Arguments
Applicant's arguments with respect to claims 1-2, 4-11, and 13-20 have been considered but are moot in view of the new ground(s) of rejection.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.