DETAILED ACTION
This Office action is in response to the amendment to Application No. 17/476,345 filed on 11/26/2025. Claims 1-20 are presented for examination and are currently pending. Applicant’s arguments have been carefully and respectfully considered.
Response to Arguments
The claim amendments of 11/26/2025 overcome the 35 U.S.C. 112(b) rejection set forth in the Office action of 08/27/2025. As a result, the 112(b) rejection has been withdrawn.
On page 15 of the remarks, the Applicant argued that “It is noted that Arora uses a UK English phoneme set that is not associated with a particular speaker for defining the "expected target value of that phonological feature within the phrase". According to Arora ([0012]), "Phonological features are a finite set which should characterise all segmental contrasts across the world's languages." Note that the phonological features are not related to any specific speaker, but are associated with a particular language. By contrast, the claims of the pending claims are directed to a system that is related to a specific speaker, and not a particular language”.
The above argument is not persuasive because Arora discloses that each phoneme uttered by the speaker is classified into correct or incorrect pronunciation [0146]. Since Arora classifies phonemes uttered by the speaker, Arora discloses a system that is related to the speaker. Also, Arora teaches that “The input to the detector NN 116 is prepared as follows. The average values of phonological feature probabilities are obtained for each phoneme segment” [0130]. This shows that the phonological features are derived from each phoneme, and each phoneme is uttered by a speaker. As a result, Arora relates to the claimed invention.
In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
It is noted that the mixed speech input signal is taught by Lee, the secondary reference, and that it is the combination of Arora in view of Lee and further in view of Li as a whole that discloses the current claim.
The argument continued on page 15 that “Moreover, as noted by the Examiner, the label in the Arora reference is a binary value that indicates whether a phoneme is correct or incorrect. By contrast, the plurality of labeled phonemes of the pending claims define a mapping relationship between different sound elements of the mixed speech and the corresponding phonemes of a target speaker, which is different from the binary expression in the Arora reference”.
The above argument has been considered but is moot in light of a new secondary reference that has been applied to the amended limitations.
Furthermore, on page 15, the Applicant argued that “Lastly, Arora does not teach using the intermediate transition representation as being amended for (i) recognizing a second set of estimate candidate phonemes associated with the target speaker from the mixed speech and then (ii) optimizing (i.e., training) the neural networks by reducing the differences between the two sets of phonemes above. Arora discloses generating a probability of phonological feature, comparing it against that of the target, and providing feedback for the features with the largest discrepancy. But neither the phonological features nor the probability values of phonological features is equivalent to a second set of estimated candidate phonemes, as disclosed in the application at paragraphs 57, 80, and 130”.
The above argument is not persuasive because the amended limitation of “intermediate transition representation of the target speech spectra of the plurality of time-frequency windows of the mixed speech spectrum” of claim 1 has now been mapped to a new secondary reference, Lee.
It is noted that it is the combination of Arora in view of Lee and further in view of Li as a whole that discloses the current claim limitation argued above.
Furthermore, in response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
On page 16 of the remarks, the Applicant argued that “Nor does any of the other references, Hijazi, Li, Chen and Neil, teach or suggest all the newly added claim features in claims 1, 11 and 18 as described above. In view of the above, the Applicant respectfully submits that all the pending claims are patentable over the cited references”.
It is noted that the Applicant’s argument has been considered but is moot in light of the newly added secondary reference that has been applied to the amended claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
3. Claims 1, 10, 11, 17, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Arora et al. (US20210134277, PCT filed 04/17/2018) in view of Lee et al. (US20180190268) and further in view of Li et al. (US20190147854).
Regarding claim 1, Arora teaches a method of training a neural network for implementing a speech recognition task (The system 100 for providing automatic speech analysis, and pronunciation training, will typically comprise an Automatic Speech Recognition (ASR) system 102 consisting of an acoustic model 104 and a decoder 106 [0055]; A computer implemented method for automatic speech analysis, abstract) performed by an electronic device (In other embodiments at least a portion of the system 100 may be provided remotely. In such embodiments, it is conceivable that a user may input a speech signal on a local device 500 (such as telephone, tablet (such as an iPad), or the like) [0084]),
the neural network comprising a first subnetwork (DNN 112 [0063], Fig. 1),
a second subnetwork (second DNN 113 [0065], Fig. 1), and
a third subnetwork (NN 116 with one hidden layer of 500 neurons with ReLU non-linearity [0129], Fig. 1), the method comprising:
obtaining sample data, the sample data (the speech signal 108 represents the learner's attempt to utter the target sequence of phonemes [0117], Fig. 1) comprising a mixed speech spectrum ((ii) receiving a speech signal, wherein the speech signal comprises a user's attempt to say the target phrase [0007]. The Examiner notes that the instant specification discloses that a received speech is a mixed speech) and
a plurality of labeled phonemes thereof (each phoneme uttered by the speaker is classified into correct or incorrect pronunciation [0146]; The two phoneme transcriptions are time-synchronised (which may also be referred to as force aligned) and hence, it is easy to label the target transcriptions with binary mispronunciation markers (i.e., correct or not) [0113]. The Examiner notes the phonemes are labeled as correct or incorrect),
adaptively transforming the target speech spectrum by using the second subnetwork (A further deep neural network 113 is used to map the features, for which the probability has been determined by the DNN 112 [0064], Fig. 1),
performing phoneme recognition based on the intermediate transition representation by using the third subnetwork (A further deep neural network 113 is used to map the features, for which the probability has been determined by the DNN 112 [0064], Fig. 1)
to obtain a plurality of estimated candidate phonemes corresponding to the different sound sources of the mixed speech (the desired output of the NN 116 at the p_i-th neuron is set to 0 or 1, for incorrect or correct pronunciation, respectively. In order to estimate the ground truth for training, the speech utterance is force-aligned with the actually uttered phoneme sequence. If the aligned segment of the target phoneme overlaps more than 50% with the same phoneme in the actually uttered alignments, it is marked as correctly pronounced. However, if the overlap is less than 10%, it is marked as incorrect pronunciation [0135]);
updating parameters of the first subnetwork, the second subnetwork, and the third subnetwork according to the plurality of estimated candidate phonemes corresponding to the different sound sources of the mixed speech (Stochastic gradient descent is used to update the weights of all the layers by minimising the squared error objective function. While computing the weight update for the phoneme p_i, an important issue for the shared classifier is to determine the desired output for other phonemes. In the embodiment being described, the error feedback from output neurons corresponding to other phonemes [0135]) and
the mapping relationship between different sound sources of the mixed speech and corresponding phonemes in the mixed speech defined by the labeled phonemes by (In this architecture, all the phonemes have a shared hidden layer. This allows for improvement in the performance in the face of scanty training data as different phonemes benefit by mutual sharing of statistical properties [0132]):
Arora does not explicitly teach wherein the mixed speech is a speech signal by a target speaker mixed with simultaneous interferences from sound sources different from the target speaker and the plurality of labeled phonemes of the mixed speech define a mapping relationship between the different sound elements of the mixed speech and corresponding phonemes of the target speaker in the mixed speech; extracting a target speech spectrum corresponding to the target speaker from the mixed speech spectrum by using the first subnetwork, wherein the target speech spectrum includes target speech spectra of a plurality of time-frequency windows of the mixed speech spectrum; to obtain an intermediate transition representation of the target speech spectra of the plurality of time-frequency windows of the mixed speech spectrum; and improving an accuracy of the speech recognition task performed by the neural network for the target speaker in the mixed speech via updating parameters of the first subnetwork.
Lee teaches wherein the mixed speech is a speech signal by a target speaker mixed with simultaneous interferences (A person has an ability of concentrating on a signal of a particular spectrum area based on a speech to be input and adaptively removing a noise included in the speech signal [0070]; a plurality of speech frames may be simultaneously input to the speech recognizing model [0076]) from sound sources different from the target speaker (based on information on a speaker to be recognized for emphasizing the speaker to be recognized from among noise [0010]; based on information on a speaker to be recognized for emphasizing the speaker to be recognized from among other speakers [0028]. The Examiner notes the speaker to be recognized is the target speaker, and the simultaneous interferences are from among other speakers) and the plurality of labeled phonemes of the mixed speech (training of the neural network with respect to labeled input training data through a supervised training operation [0121]; The speech recognizing apparatus 110 may estimate words indicated by the speech signal based on the speech recognition result in the phoneme unit obtained by the acoustic model 120 [0067]; … at plural times to estimate plural phonemes for the estimated word [0018]) define a mapping relationship between the different sound elements of the mixed speech and corresponding phonemes of the target speaker in the mixed speech (In this example, the layer 425 may indicate a recognition result value, for example, a probability value or a probability vector of a phoneme, of a previous speech frame. Thus, in an example, an output value of at least one layer among the output values h_{t−1}^1, h_{t−1}^2, h_{t−1}^3, . . . , s_{t−1} may be used to determine the attention weight. In Equation 1, C_t denotes the context value including information on a target speaker to be recognized and a parameter for performing speech recognition by concentrating on a speech of the target speaker [0093]. The Examiner notes concentrating on a speech of the target speaker of a mixed speech is mapped to a probability vector of a phoneme);
extracting a target speech spectrum corresponding to the target speaker from the mixed speech spectrum (recognize a speech of a person other than noise from a captured speech signal and/or to intensively recognize a speech of a particular or select speaker to be recognized when plural speeches of a plurality of speakers are present in the captured speech signal [0069]; The speech recognizing apparatus 110 extracts a feature from a speech signal and estimates a speech recognition result based on the extracted feature [0065]) by using the first subnetwork (In an example, the speech recognizing model implemented by the speech recognizing apparatus 110 configured as the neural network [0069]),
wherein the target speech spectrum includes target speech spectra of a plurality of time-frequency windows of the mixed speech spectrum (The speech recognizing apparatus 110 may obtain or generate the spectrogram by representing a result of analyzing a spectrum of the speech signal in a time-frequency domain using a Fourier transform [0065]; For example, in the training, in response to feature values of different frequency components of the speech signal being input to the speech recognizing model being trained, the speech recognizing model being trained may be trained to select a feature value of a frequency component which is to be more intensively considered from the feature values of other frequency components at a current time based on information of a previous time during the training [0129]);
to obtain an intermediate transition representation of the target speech spectra of the plurality of time-frequency windows of the mixed speech spectrum (For example, in the training, in response to feature values of different frequency components of the speech signal being input to the speech recognizing model being trained, the speech recognizing model being trained may be trained to select a feature value of a frequency component which is to be more intensively considered from the feature values of other frequency components at a current time based on information of a previous time during the training [0129]);
and improving an accuracy of the speech recognition task performed by the neural network for the target speaker in the mixed speech (in an example, a speech recognizing model according to one or more embodiments may well recognize speech in a noisy environment more accurately and/or recognize different speech by modeling such a descending path and provide a selective attention ability for improved speech recognition [0070]) via updating parameters of the first subnetwork (For this, an example neural network that forms or is configured to implement the speech recognizing model may adjust the speech signal …, based on a determined attention weighting [0070]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora to incorporate the teachings of Lee for the benefit that recognition performance may be enhanced by reducing an influence of a noise component on a result of recognizing the speech signal and/or concentrating on a speech signal of a particular speaker (Lee [0072]).
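For illustration of the attention-based emphasis described in Lee ([0070], [0093]), the following is a minimal Python sketch of weighting a mixed-speech spectral frame using a previous hidden state and a target-speaker context vector. All names, dimensions, and the sigmoid scoring are illustrative assumptions, not Lee's actual implementation.

```python
import numpy as np

def attention_weights(prev_hidden, speaker_context, W_h, W_c):
    """Score each frequency bin from the previous hidden state h_{t-1} and
    the target-speaker context vector C_t, then squash to (0, 1)."""
    scores = W_h @ prev_hidden + W_c @ speaker_context
    return 1.0 / (1.0 + np.exp(-scores))  # sigmoid attention weights

rng = np.random.default_rng(0)
F, H, C = 64, 32, 16                      # freq bins, hidden dim, context dim
frame = rng.standard_normal(F)            # one spectral frame of mixed speech
weights = attention_weights(rng.standard_normal(H),
                            rng.standard_normal(C),
                            rng.standard_normal((F, H)),
                            rng.standard_normal((F, C)))
emphasized = weights * frame              # attenuate non-target components
```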
Arora and Lee do not explicitly teach determining a joint loss function of the first subnetwork, the second subnetwork, and the third subnetwork; calculating a value of the joint loss function according to the plurality of estimated candidate phonemes corresponding to the mixed speech; and iteratively updating the parameters of the first subnetwork, the second subnetwork, and the third subnetwork through stochastic gradient descent (SGD) or back propagation (BP) until the value of the joint loss function satisfies a predefined convergence condition.
Li teaches determining a joint loss function of the first subnetwork, the second subnetwork, and the third subnetwork (All the parameters of DSN are jointly optimized through backpropagation with stochastic gradient descent (SGD) as follows [0056]; FIG. 1 is a block diagram illustrating a DSN system 100 architecture [0031], which comprises a first subnetwork extractor (125 DNN Mc), a second subnetwork (senone classifier DNN Mc 160), and a third subnetwork (domain classifier DNN Md 165));
calculating a value of the joint loss function according to the plurality of candidate phonemes corresponding to the mixed speech, and the joint loss function (The DSN learns an intermediate deep representation that is both senone or phoneme-discriminative and domain-invariant through jointly optimizing the primary task of speech unit classification and the secondary task of domain classification with adversarial objective functions [0019]. The Examiner notes that with neural networks, where the target is to minimize the error, the objective function is often referred to as a loss function, and a loss function or an objective function is a way to quantitatively represent how close the predicted value is to the target value); and
iteratively updating the parameters of the first subnetwork, the second subnetwork, and the third subnetwork through stochastic gradient descent (SGD) or back propagation (BP) until the value of the joint loss function satisfies a predefined convergence condition (In one embodiment, all the sub-networks may be jointly optimized using stochastic gradient descent (SGD) [0044]; The total loss of DSN is formulated as follows and is jointly optimized with respect to the parameters [0055]; All the parameters of DSN are jointly optimized through backpropagation with stochastic gradient descent (SGD) [0056]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora and Lee to incorporate the teachings of Li for the benefit of using domain separation networks (DSN) that successfully adapt a clean acoustic model to the unlabeled noisy data and achieve remarkable WER (word error rate) improvements on robust ASR (automatic speech recognition) (Li [0070]).
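To make the claimed training scheme concrete, the following is a minimal PyTorch sketch of a joint loss spanning three subnetworks, iteratively updated by backpropagation with SGD until a predefined convergence condition is met. The module shapes, learning rate, and convergence threshold are hypothetical and are not drawn from Arora, Lee, or Li.

```python
import torch
import torch.nn as nn

extractor  = nn.Linear(257, 257)  # first subnetwork: target-spectrum extraction
adapter    = nn.Linear(257, 128)  # second subnetwork: adaptive transformation
recognizer = nn.Linear(128, 42)   # third subnetwork: phoneme classification

params = (list(extractor.parameters()) + list(adapter.parameters())
          + list(recognizer.parameters()))
opt = torch.optim.SGD(params, lr=0.1)
ce = nn.CrossEntropyLoss()

mixed = torch.randn(8, 257)           # mixed speech spectra (batch of frames)
labels = torch.randint(0, 42, (8,))   # labeled phonemes of the target speaker

prev_loss = float("inf")
for step in range(1000):
    logits = recognizer(adapter(extractor(mixed)))
    loss = ce(logits, labels)         # joint loss spans all three subnetworks
    opt.zero_grad()
    loss.backward()                   # backpropagation through the whole stack
    opt.step()
    if abs(prev_loss - loss.item()) < 1e-6:  # predefined convergence condition
        break
    prev_loss = loss.item()
```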
Regarding claim 10, Arora, Lee, and Li teach the neural network training method according to claim 1; Li further teaches: obtaining a to-be-recognized mixed speech spectrum (Method 400 may begin at operation 410 by obtaining a source speech domain having labels for source speech domain input features [0073]; In a baseline system, a DNN-HMM acoustic model with clean speech may be trained and then adapted to noisy data using GRL unsupervised adaptation [0061]);
extracting a target speech spectrum from the mixed speech spectrum (Operations at 430 are performed to extract private components from each of the source and target speech domain input features [0073])
by using the first subnetwork (The speech frames 110 from the source domain are provided to both a source private component extractor M_p^s 120 and a shared component extractor M_c 125. The speech frames 115 from the target domain are provided to the shared component extractor M_c 125 and to a target private component extractor M_p^t 130 [0032]; The Examiner notes the extractor 125 (DNN M_c) [0041] as the first subnetwork);
adaptively transforming the target speech spectrum by using the second subnetwork (The shared component extractor M_c 125 and senone predictor or classifier 160 of the adapted acoustic model 200 are initialized from a DNN-HMM acoustic model [0045]; The Examiner notes the senone classifier DNN M_c 160 [0041] as the second subnetwork),
to obtain an intermediate transition representation (The unsupervised adaptation was achieved by learning deep intermediate representations that are both discriminative for the main task (image classification) on the source domain and invariant with respect to the shift between source and target domains [0028]);
performing phoneme recognition based on the intermediate transition representation by using the third subnetwork (A domain classifier M_d 165 identifies the proper domains d_s and d_t at 185 using both shared components f_c^s 140 and f_c^t 145 [0034]. The Examiner notes the domain classifier DNN M_d 165 [0042] as the third subnetwork).
The same motivation to combine as applied to independent claim 1 applies here.
Regarding claim 11, claim 11 is similar to claim 1 and is rejected in the same manner, with the same reasoning applying. Arora further teaches an electronic device, comprising: a processor; and a memory (According to a second aspect of the invention, there is provided a system for automatic speech analysis comprising one or more processors [0033]; the memory may comprise a plurality of components under the control of the processor [0082]; In some embodiments, some of the components of the system 100 may be provided … on the local device 500 [0085]), configured
to store executable instructions of the processor, the processor being configured to, when executing the executable instructions, perform a plurality of operations including (However, in the embodiment being described elements of the system are software-based and executed upon a processor based platform including at least one processor [0080]):
Regarding claim 17, claim 17 is similar to claim 10 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 18, claim 18 is similar to claim 1 and is rejected in the same manner, with the same reasoning applying. Arora further teaches a non-transitory computer-readable storage medium, storing executable instructions, the executable instructions, when executed by a processor of an electronic device, causing the electronic device to perform a plurality of operations including (there is provided a computer-readable medium containing instructions which when read by a machine cause that machine to perform as at least one of the following: [0040] (i) the method of the first aspect of the invention; [0041] and (ii) the system of the second aspect of the invention [0042]):
Regarding claim 20, claim 20 is similar to claim 10 and is rejected in the same manner, with the same reasoning applying.
4. Claims 2, 3, 12, 13, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Arora et al. (US20210134277, PCT filed 04/17/2018) in view of Lee et al. (US20180190268), in view of Li et al. (US20190147854), and further in view of Chen et al. (US20190139563).
Regarding claim 2, Arora, Lee, and Li teach the neural network training method according to claim 1; Li teaches wherein the extracting a target speech spectrum from the mixed speech spectrum (Operations at 430 are performed to extract private components from each of the source and target speech domain input features [0073])
by using the first subnetwork (The speech frames 110 from the source domain are provided to both a source private component extractor M_p^s 120 and a shared component extractor M_c 125. The speech frames 115 from the target domain are provided to the shared component extractor M_c 125 and to a target private component extractor M_p^t 130 [0032]; The Examiner notes the extractor 125 (DNN M_c) [0041] as the first subnetwork) comprises:
The same motivation to combine as applied to independent claim 1 applies here.
Arora, Lee, and Li do not explicitly teach embedding the mixed speech spectrum into a multi-dimensional vector space, to obtain embedding vectors corresponding to time-frequency windows of the mixed speech spectrum; weighting and regularizing the embedding vectors of the mixed speech spectrum by using an ideal ratio mask (IRM), to obtain an attractor corresponding to the target speech spectrum; obtaining a target masking matrix corresponding to the target speech spectrum by calculating similarities between the embedding vectors of the mixed speech spectrum and the attractor; and extracting the target speech spectrum from the mixed speech spectrum based on the target masking matrix.
Chen teaches embedding the mixed speech spectrum into a multi-dimensional vector space, to obtain embedding vectors corresponding to time-frequency windows of the mixed speech spectrum (In operation, an embedding matrix V (1108) is created by the neural network (1104, 1106) by projecting the time-frequency bins to a high-dimensional embedding space as given by: V_{tf,k} = Φ(X_{t,f}) [0079]; where T, F, K denote the time, frequency, and embedding axes, and Φ(·) refers to the neural network transformation [0080]);
weighting and regularizing the embedding vectors of the mixed speech spectrum by using an ideal ratio mask (IRM) (We also included the single and multi-channel ideal-ratio-mask (IRM) system for comparison [0109]),
to obtain an attractor corresponding to the target speech spectrum (During the training phase, the attractors are formed using true or estimated source assignments [0078]);
obtaining a target masking matrix corresponding to the target speech spectrum by calculating similarities between the embedding vectors of the mixed speech spectrum and the attractor (Once the set of attractors have been selected, a mask is formed for each source based on the picked attractor; the masks are given by the equation at [0089] [equation image omitted]); and
extracting the target speech spectrum from the mixed speech spectrum based on the target masking matrix (where a mixture spectrogram was masked by oracle IRMs for each target speaker, and converted to a time domain signal with noisy phase information [0109]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora, Lee, and Li to incorporate the teachings of Chen for the benefit of reducing the training process complexity (Chen [0103]) and runtime cost (Chen [0104]).
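For illustration of the attractor-style extraction described in Chen ([0079]-[0089]), the following rough NumPy sketch forms an attractor as the IRM-weighted (normalized) mean of time-frequency embeddings and derives a target mask from embedding-attractor similarity. The shapes, random inputs, and sigmoid similarity are simplifying assumptions rather than Chen's exact equations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, K = 100, 129, 20                     # time frames, freq bins, embedding dim

mix = np.abs(rng.standard_normal((T, F)))  # mixed speech magnitude spectrum
V = rng.standard_normal((T * F, K))        # embedding vector for each T-F bin
irm = rng.uniform(size=(T * F, 1))         # ideal ratio mask for the target

# Attractor: IRM-weighted, normalized (regularized) mean of the embeddings.
attractor = (irm * V).sum(axis=0) / irm.sum()

# Target masking matrix from embedding-attractor similarity (sigmoid here).
mask = 1.0 / (1.0 + np.exp(-(V @ attractor)))        # shape (T*F,)

target_spectrum = mix * mask.reshape(T, F)           # extract target speech
```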
Regarding claim 3, Arora, Lee, Li, and Chen teach the neural network training method according to claim 2; Chen further teaches: obtaining attractors corresponding to the sample data, and calculating a mean value of the attractors, to obtain a global attractor (Based on the soft pre-segmentation, W_{p,c,t,f}, for each combination P, the attractor A_{p,c,k} is then calculated as the weighted average of embedding points: A_{p,c,k} = Σ_{t,f} W_{p,c,t,f} V_{t,f,k} / Σ_{t,f} W_{p,c,t,f} [0085]).
The same motivation to combine as applied to dependent claim 2 applies here.
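A minimal sketch of the claim 3 computation, assuming per-utterance attractors have been computed as in the sketch above: the global attractor is simply their mean. The attractor count and dimension are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 20
utterance_attractors = rng.standard_normal((500, K))  # one attractor per sample
global_attractor = utterance_attractors.mean(axis=0)  # mean value of attractors
```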
Regarding claim 12, claim 12 is similar to claim 2 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 13, claim 13 is similar to claim 3 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 19, claim 19 is similar to claim 2 and is rejected in the same manner, with the same reasoning applying.
5. Claims 4-9, 14, 15, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Arora et al. (US20210134277, PCT filed 04/17/2018) in view of Lee et al. (US20180190268), in view of Li et al. (US20190147854), and further in view of Neil et al. (US20180005107).
Regarding claim 4, Arora, Lee, and Li teach the neural network training method according to claim 1; Li teaches wherein the adaptively transforming the target speech spectrum by using the second subnetwork (The shared component extractor M_c 125 and senone predictor or classifier 160 of the adapted acoustic model 200 are initialized from a DNN-HMM acoustic model [0045]; The Examiner notes the senone classifier DNN M_c 160 [0041] as the second subnetwork) comprises:
The same motivation to combine as applied to independent claim 1 applies here.
Arora, Lee, and Li do not explicitly teach adaptively transforming target speech spectra of time-frequency windows in sequence according to a sequence of the time-frequency windows of the target speech spectrum, a process of transforming one of the time-frequency windows comprising: generating hidden state information of a current transformation process according to a target speech spectrum of a time-frequency window targeted by the current transformation process and hidden state information of a previous transformation process; and obtaining, based on the hidden state information, an intermediate transition representation of the time-frequency window targeted by the current transformation process.
Neil teaches adaptively transforming target speech spectra of time-frequency windows in sequence according to a sequence of the time-frequency windows of the target speech spectrum (The PLSTM cell may achieve faster convergence than the LSTM cell on tasks that perform learning of long sequences, with an update imposed by an oscillation during a fraction of the oscillation period [0122]),
a process of transforming one of the time-frequency windows comprising: generating hidden state information of a current transformation process according to a target speech spectrum of a time-frequency window targeted by the current transformation process and hidden state information of a previous transformation process (When the time gate operates in a closed phase, a previous state may be maintained. When the time gate is partially open, a balance between the previous state and a proposed update may be formed. When the time gate operates in a fully open phase, the time gate may function as an LSTM cell that does not include a time gate [0118]); and
obtaining, based on the hidden state information, an intermediate transition representation of the time-frequency window targeted by the current transformation process (The first cell state value c_t and the hidden state output value h_t of the memory cell 200 may be updated while the first time gate 280 and the second time gate 290 operate in the open phase. When the first time gate 280 and the second time gate 290 operate in the open phase, the cell state value may be updated based on the input value of the memory cell 200 [0102]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora, Lee, and Li to incorporate the teachings of Neil for the benefit of using a PLSTM model with learned timing parameters, which sparse implementations lacked (Neil [0310]).
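The claim 4 recurrence can be illustrated with a minimal Python sketch in which the target speech spectra of the time-frequency windows are transformed in sequence, each step deriving new hidden state information from the current window's spectrum and the previous step's hidden state. The toy tanh recurrence below stands in for the (P)LSTM update in Neil and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, H = 100, 129, 64
target_spectra = np.abs(rng.standard_normal((T, F)))   # one spectrum per window

W_x = rng.standard_normal((H, F)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1

h = np.zeros(H)
intermediate = []                       # intermediate transition representations
for x_t in target_spectra:              # process the windows in sequence
    h = np.tanh(W_x @ x_t + W_h @ h)    # new hidden state from x_t and h_{t-1}
    intermediate.append(h)
intermediate = np.stack(intermediate)   # shape (T, H)
```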
Regarding claim 5, Arora, Lee, Li, and Neil teach the neural network training method according to claim 4; Neil teaches wherein the generating hidden state information of a current transformation process comprises: calculating candidate state information, an input weight of the candidate state information (The internal memory 140 may store the cell state value c_t. The internal memory 140 may generate a candidate group of current cell state values that are to be added to previous cell state values, that is, generate a vector of candidate state values [0069]),
a forget weight of target state information of the previous transformation process, and an output weight of target state information of the current transformation process according to a target speech spectrum of a current time-frequency window and the hidden state information of the previous transformation process (The internal memory 140 may add a product of a previously stored value of a memory (for example, a previous cell state value) and the output value of the forget gate 130, to a product of a newly calculated hidden state value and the output value of the input gate 110 [0069]);
retaining the target state information of the previous transformation process according to the forget weight, to obtain first intermediate state information (For example, when the forget gate 130 has a value of “0,” all previous values of the internal memory 140 may be ignored. When the input gate 110 has a value of “0,” all new input values may be ignored [0069]);
retaining the candidate state information according to the input weight of the candidate state information, to obtain second intermediate state information (The first sigmoid unit 120 may be represented by y = s(Σ w_i x_i). In y = s(Σ w_i x_i), s denotes a squashing function, for example, a logistic function, x_i denotes an input value, and w_i denotes a weight for the input value [0064]);
obtaining the target state information of the current transformation process according to the first intermediate state information and the second intermediate state information (The input gate 110 may determine a degree to which an input vector value is used to calculate a new hidden state value [0061]); and
retaining the target state information of the current transformation process according to the output weight of the target state information of the current transformation process, to obtain the hidden state information of the current transformation process (The output gate 160 may receive the cell state value c_t from the internal memory 140 and may determine a degree to which the cell state value c_t is to be output, that is, a degree to which a current cell state value is to be output from the LSTM cell 100 [0074]).
The same motivation to combine as applied to dependent claim 4 applies here.
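For reference, a bare NumPy LSTM cell illustrating the claim 5 quantities: an input weight for the candidate state, a forget weight for the previous state, and an output weight gating the updated state into the hidden output. Dimensions and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    i = sigmoid(z[0:H])          # input weight of the candidate state
    f = sigmoid(z[H:2*H])        # forget weight of the previous target state
    o = sigmoid(z[2*H:3*H])      # output weight of the current target state
    g = np.tanh(z[3*H:4*H])      # candidate state information
    c = f * c_prev + i * g       # first + second intermediate state information
    h = o * np.tanh(c)           # hidden state of the current process
    return h, c

rng = np.random.default_rng(0)
D, H = 129, 64
W = rng.standard_normal((4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
```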
Regarding claim 6, Arora, Lee, Li, and Neil teach the neural network training method according to claim 4; Neil teaches wherein the obtaining, based on the hidden state information (The output value h_t may be referred to as a “hidden state output vector” or a “hidden output vector.” [0075]),
an intermediate transition representation of the time- frequency window targeted by the current transformation process comprises (may perform intermediate neuron activations in response to inputs as well as a frequency decomposition of the inputs [0321]):
performing one or more of the following processing on the hidden state information (The input gate 110 may determine a degree to which an input vector value is used to calculate a new hidden state value [0061]),
to obtain the intermediate transition representation of the time-frequency window targeted by the current transformation process (may perform intermediate neuron activations in response to inputs as well as a frequency decomposition of the inputs [0321]):
non-negative mapping, element-wise logarithm finding, calculation of a first-order difference, calculation of a second-order difference, global mean variance normalization, and addition of features of previous and next time-frequency windows (In particular, the above hardware implementation may replicate a tonotopy that emerges from spatial filtering of a basilar membrane in a cochlea through a 64-stage cascaded second-order filter bank, spanning 100 Hz to 20 kHz on a log frequency scale [0305]).
The same motivation to combine as applied to dependent claim 4 applies here.
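The claim 6 post-processing options can be illustrated as follows; the composition below (non-negative mapping, element-wise logarithm, first- and second-order differences, global mean-variance normalization, and splicing of the previous and next windows) is an assumed arrangement, not taken from any cited reference.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((100, 64))                   # hidden states, (T, dim)

nonneg = np.maximum(H, 1e-8)                         # non-negative mapping
logfeat = np.log(nonneg)                             # element-wise logarithm
delta1 = np.diff(logfeat, n=1, axis=0, prepend=logfeat[:1])   # 1st-order diff
delta2 = np.diff(delta1, n=1, axis=0, prepend=delta1[:1])     # 2nd-order diff
stacked = np.concatenate([logfeat, delta1, delta2], axis=1)

# Global mean-variance normalization over all time steps.
norm = (stacked - stacked.mean(axis=0)) / (stacked.std(axis=0) + 1e-8)

# Add features of the previous and next windows (wrap-around for simplicity).
spliced = np.concatenate([np.roll(norm, 1, axis=0), norm,
                          np.roll(norm, -1, axis=0)], axis=1)
```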
Regarding claim 7, Arora, Lee, and Li teach the neural network training method according to claim 1; Li teaches wherein the performing phoneme recognition based on the intermediate transition representation by using the third subnetwork comprises: applying a multi-dimensional filter to the intermediate transition representation by using at least one convolutional layer (A 29-dimensional log Mel filterbank features together with 1st and 2nd order delta features (totally 87-dimensional) for both the clean and noisy utterances may be extracted by an HTK Toolkit [0062]),
to generate an output of the convolutional layer (To generate the simulated data, the clean speech is first convoluted with the estimated impulse response of the environment and then mixed with the background noise separately recorded in that environment [0059]);
and providing the output of the recursive layer to at least one fully connected layer, applying a nonlinear function to an output of the fully connected layer (The output layer of the M_r has 957 output units with no non-linear activation functions to reconstruct the spliced input features [0065]),
to obtain a posterior probability of a phoneme comprised in the intermediate transition representation (Each output unit of the DNN adapted acoustic model 200 corresponds to one of the senones q in a set Q. The output unit for senone q ∈ Q is the posterior probability p(q|x_n^s) obtained by a softmax function [0046]).
Arora, Lee, and Li do not explicitly teach using the output of the convolutional layer in at least one recursive layer, to generate an output of the recursive layer.
Neil teaches using the output of the convolutional layer in at least one recursive layer, to generate an output of the recursive layer (The CNN may include three alternating layers of 8 kernels of 5×5 convolution with a leaky ReLU nonlinearity and 2×2 max-pooling. The three alternating layers may be fully connected to 256 neurons, and finally fully connected to 10 output classes [0197]; A video stream may use three alternating layers including 16 kernels of 5×5 convolution and 2×2 subsampling to reduce an input of 1×48×48 to 16×2×2, which may be used as an input to 110 recurrent units [0210]);
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora, Lee, and Li to incorporate the teachings of Neil for the benefit of using a PLSTM model with learned timing parameters, which sparse implementations lacked (Neil [0310]).
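A compact PyTorch sketch of the claim 7 pipeline, with assumed layer sizes: a convolutional layer filters the intermediate transition representation, a recurrent (LSTM) layer consumes the convolution output, and a fully connected layer followed by a softmax yields phoneme posteriors.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    def __init__(self, feat_dim=128, n_phonemes=42):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, 64, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        self.fc = nn.Linear(64, n_phonemes)

    def forward(self, x):                     # x: (batch, time, feat_dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)  # multi-dim filter
        y, _ = self.lstm(y)                   # recursive (recurrent) layer
        return torch.softmax(self.fc(y), dim=-1)  # phoneme posteriors

posteriors = PhonemeRecognizer()(torch.randn(2, 100, 128))  # (2, 100, 42)
```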
Regarding claim 8, Arora, Lee, Li, and Neil teach the neural network training method according to claim 7; Neil further teaches wherein the recursive layer comprises a long short-term memory (LSTM) network (FIG. 1 is a diagram illustrating an architecture of a standard long short-term memory (LSTM) cell 100. In a recurrent neural network (RNN), an LSTM cell may retain inputs in a memory for a very long period of time in comparison to other memory elements [0058]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora, Lee, and Li to incorporate the teachings of Neil for the benefit of using a PLSTM model with learned timing parameters, which sparse implementations lacked (Neil [0310]).
Regarding claim 9, Arora, Lee, and Li teach the neural network training method according to claim 1; Neil teaches wherein the first subnetwork comprises a plurality of layers of LSTM networks of a peephole connection (Optional peephole connection weights w_ci, w_cf, and w_co may have a further influence on an operation of the input gate 110, the forget gate 130 and the output gate 160 [0082]; Time gates included in different layers may be controlled by different oscillation frequencies or the same oscillation frequency [0137]), and
the second subnetwork comprises a plurality of layers of LSTM networks of a peephole connection (Also, FIG. 6D shows results obtained by training … a batch-normalized (BN)-LSTM cell and a standard LSTM cell based on the sampled inputs [0151]. The Examiner notes the (BN)-LSTM as the first subnetwork and the standard LSTM cell as the second subnetwork).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Arora, Lee, and Li to incorporate the teachings of Neil for the benefit of using a PLSTM model with learned timing parameters, which sparse implementations lacked (Neil [0310]).
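Claim 9's peephole connections can be sketched as a small modification of a plain LSTM cell: the gates additionally see the cell state through weights w_ci, w_cf, and w_co (cf. Neil [0082]). The function below is purely illustrative; the gate pre-activations and weights are assumed scalars for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_gates(z_i, z_f, z_o, c_prev, c_new, w_ci, w_cf, w_co):
    i = sigmoid(z_i + w_ci * c_prev)  # input gate peeks at previous cell state
    f = sigmoid(z_f + w_cf * c_prev)  # forget gate peeks at previous cell state
    o = sigmoid(z_o + w_co * c_new)   # output gate peeks at the updated state
    return i, f, o

i, f, o = peephole_gates(0.0, 0.0, 0.0, 1.0, -1.0, 0.5, 0.5, 0.5)
```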
Regarding claim 14, claim 14 is similar to claim 4 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 15, claim 15 is similar to claim 7 and is rejected in the same manner, with the same reasoning applying.
Regarding claim 16, claim 16 is similar to claim 9 and is rejected in the same manner, with the same reasoning applying.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 8am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.G./Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/ Supervisory Patent Examiner, Art Unit 2148