DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-12 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.
The claims are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. The claims are directed to the abstract idea of phonetic labeling, as explained in detail below.
The limitations, as drafted, recite a process that, under its broadest reasonable interpretation, covers performance of the limitations in the mind but for the recitation of generic computer components. That is, other than reciting “various elements,” nothing in the claim precludes the steps from practically being performed as mental processes. For example, the language “generating first label information by labeling time information in a forward direction according to a plurality of phoneme boundaries set in speech information for learning” (can be done by a user using timestamps associated with the phonemes presented in a forward direction and labeling accordingly), “generating second label information by labeling time information in a direction opposite to the forward direction according to the plurality of phoneme boundaries set in the speech information for learning” (can be done by a user using timestamps associated with the phonemes presented in a backward direction and labeling accordingly), “inverting an order of the time information that has been labeled” (can be done by the user reversing the data), and “learning a model that detects whether the phoneme boundaries are appropriate based on a difference between time information of a plurality of phoneme boundaries included in the first label information and time information of a plurality of phoneme boundaries included in the second label information” (can be done by a user making a determination regarding the boundaries based on time and creating a model), under its broadest reasonable interpretation, covers performance of mental processes aside from the recited generic computer components, which falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
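For illustration only (not part of the claims or the record), the recited labeling steps could be sketched as follows in Python; all data values and function names are hypothetical:

```python
# Hypothetical sketch of the recited labeling steps. "boundaries" stands in
# for the phoneme boundary times set in the speech information for learning.

def label_forward(boundaries):
    # Generate first label information: timestamps in the forward direction.
    return [round(t, 3) for t in boundaries]

def label_backward(boundaries, total_duration):
    # Generate second label information: timestamps measured from the end,
    # i.e., in the direction opposite to the forward direction.
    return [round(total_duration - t, 3) for t in reversed(boundaries)]

def invert(labels):
    # Invert the order of the time information that has been labeled.
    return list(reversed(labels))

boundaries = [0.12, 0.45, 0.80]            # phoneme boundary times (seconds)
first = label_forward(boundaries)           # [0.12, 0.45, 0.8]
second = invert(label_backward(boundaries, 1.0))
# After inversion, second is aligned with first; per-boundary differences
# between the two labelings would feed the recited model.
diffs = [abs(a - (1.0 - b)) for a, b in zip(first, second)]
```

When the forward and inverted backward labels describe the same boundaries, the per-boundary differences are near zero; larger differences would suggest an inappropriate boundary.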
This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements which are recited at a high level of generality (i.e., as a generic processor performing a generic computer function) such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements amount to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
The dependent claims recite similar language, such as generating different labels, making determinations, calculating data, and determining a probability, all of which are mental processes and likewise non-statutory.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
Claims 1-12 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Graves et al., “Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition,” hereinafter referenced as Graves.
Regarding claims 1 and 5-6, Graves discloses a labeling processing method, device, and medium (hereinafter referenced as a method) comprising a processor configured to execute operations comprising:
a forward labeling step comprising: generating first label (classifying) information by labeling time information in a forward direction (bi-directional/forward and backward) according to a plurality of phoneme boundaries (phoneme boundary) set in speech information for learning (Fig. 1 with p. 799; Introduction- An elegant solution to the first problem is provided by bidirectional networks [11,1]. In this model, the input is presented forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (see Section 2). In this paper, we extend our previous work on bidirectional LSTM (BLSTM) [7] with experiments on both framewise phoneme classification and phoneme recognition. For phoneme recognition we use the hybrid approach, combining Hidden Markov Models (HMMs) and RNNs in an iterative training procedure (see Section 3). This gives us an insight into the likely impact of bidirectional training on speech recognition, and also allows us to compare our results directly with a traditional HMM system.);
a backward labeling step comprising: generating second label (classifying) information by labeling time information in a direction opposite to the forward direction (bi-directional/forward and backward) according to the plurality of phoneme boundaries (phoneme boundary) set in the speech information for learning (fig. 1 with p. 799; Introduction- An elegant solution to the first problem is provided by bidirectional networks [11,1]. In this model, the input is presented forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. For the second problem, an alternative RNN architecture, LSTM, has been shown to be capable of learning long time-dependencies (see Section 2). In this paper, we extend our previous work on bidirectional LSTM (BLSTM) [7] with experiments on both framewise phoneme classification and phoneme recognition. For phoneme recognition we use the hybrid approach, combining Hidden Markov Models (HMMs) and RNNs in an iterative training procedure (see Section 3). This gives us an insight into the likely impact of bidirectional training on speech recognition, and also allows us to compare our results directly with a traditional HMM system.); and
inverting an order of the time information that has been labeled (fig. 1, section 4.1; A bidirectional LSTM net classifying the utterance ”one oh five” from the Numbers95 corpus. The different lines represent the activations (or targets) of different output nodes. The bidirectional output combines the predictions of the forward and reverse subnets; it closely matches the target, indicating accurate classification. To see how the subnets work together, their contributions to the output are plotted separately (“Forward Net Only” and “Reverse Net Only”). As we would expect, the forward net is more accurate. However there are places where its substitutions (‘w’), insertions (at the start of ‘ow’) and deletions (‘f’) are corrected by the reverse net. In addition, both are needed to accurately locate phoneme boundaries, with the reverse net tending to find the starts and the forward net tending to find the ends (‘ay’ is a good example of this)); and
a learning step comprising: learning a model that detects whether the phoneme boundaries are appropriate based on a difference between time information of a plurality of phoneme boundaries included in the first label information and time information of a plurality of phoneme boundaries included in the second label information (LSTM that handles time dependencies; pp. 799-800, Fig. 1, section 4.1).
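As a sketch of the recited learning and detection concept only, a learned threshold stands in below for a trained network such as Graves's BLSTM; all data values and names are hypothetical:

```python
# Illustrative sketch: learn a simple model that flags a phoneme boundary as
# appropriate when the forward/backward time difference is small.

def fit_threshold(diffs, labels):
    # diffs: per-boundary |forward - backward| time differences (seconds)
    # labels: True where the boundary is known to be appropriate
    # Pick the cutoff that best separates appropriate from inappropriate.
    best_cutoff, best_correct = 0.0, -1
    for c in sorted(set(diffs)):
        correct = sum((d <= c) == y for d, y in zip(diffs, labels))
        if correct > best_correct:
            best_cutoff, best_correct = c, correct
    return best_cutoff

def detect(diff, threshold):
    # Detection: a boundary is appropriate if its difference is within threshold.
    return diff <= threshold

diffs = [0.01, 0.02, 0.15, 0.30]
labels = [True, True, False, False]
thr = fit_threshold(diffs, labels)          # 0.02 on this toy data
assert detect(0.015, thr) and not detect(0.25, thr)
```

This stand-in captures only the claimed idea that small forward/backward disagreement indicates an appropriate boundary; the reference itself trains recurrent networks rather than a threshold.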
Regarding claims 2, 7 and 10, Graves discloses the method comprising:
wherein the forward labeling step further comprises:
generating third label information by labeling time information in the forward (time-dependency) direction according to a plurality of phoneme boundaries set in speech information as a detection target, and the backward labeling step further (bidirectional phoneme classification; Introduction with fig. 1) comprises:
generating fourth label information by labeling time information in the direction opposite according to the plurality of phoneme boundaries set in the speech information as the detection target, and inverting an order of the time information that has been labeled (bidirectional phoneme classification; Introduction with fig. 1), the labeling processing method further comprising:
a detection step comprising:
when a difference between time information of a plurality of phoneme boundaries in the fourth label information and the plurality of phoneme boundaries set in the speech information as the detection target are input to the model, detecting whether the phoneme boundaries set in the speech information as the detection target are appropriate based on an output result (Introduction with Fig. 1 and section 4.1; Our first experimental task was the classification of frames of speech data into phonemes. The targets were the hand labelled transcriptions provided with the data, and the recorded scores were the percentage of frames in the training and test sets for which the output classification coincided with the target. We evaluated the following architectures on this task: bidirectional LSTM (BLSTM), unidirectional LSTM (LSTM), bidirectional standard RNN (BRNN), and unidirectional RNN (RNN). For some of the unidirectional nets a delay of 4 timesteps was introduced between the target and the current input — i.e. the net always tried to predict the phoneme of 4 timesteps ago. For BLSTM we also experimented with duration weighted error, where the error injected on each frame is scaled by the duration of the current phoneme. We used standard RNN topologies for all experiments, with one recurrently connected hidden layer and no direct connections between the input and output layers. The LSTM (BLSTM) hidden layers contained 140 (93) blocks of one cell in each, and the RNN (BRNN) hidden layers contained 275 (185) units. This gave approximately 100,000 weights for each network).
Regarding claims 3, 8 and 11, Graves discloses the method wherein the learning step further comprises, when the difference is smaller than a threshold,
determining that the plurality of phoneme boundaries set in the speech information for learning are appropriate, the labeling processing method further comprises:
a calculation step comprising calculating, based on a determination result of the learning step, a prior probability (section 4.2; Initial estimation of transition and prior probabilities was done using the correct transcription for the training set. Network output probabilities were divided by prior probabilities to obtain likelihoods for the HMM. The system was trained until no improvement was observed or the segmentation of the signal did not change. Due to time limitations, the networks were not re-trained to convergence. Since the output of both HMM-b), wherein the prior probability indicates:
a probability that the phoneme boundary is determined to be appropriate, and a probability that the phoneme boundary is determined to be not appropriate, and the detection step further comprises adjusting the output result based on the prior probability (Introduction with Fig. 1 and section 4.1; Our first experimental task was the classification of frames of speech data into phonemes. The targets were the hand labelled transcriptions provided with the data, and the recorded scores were the percentage of frames in the training and test sets for which the output classification coincided with the target. We evaluated the following architectures on this task: bidirectional LSTM (BLSTM), unidirectional LSTM (LSTM), bidirectional standard RNN (BRNN), and unidirectional RNN (RNN). For some of the unidirectional nets a delay of 4 timesteps was introduced between the target and the current input — i.e. the net always tried to predict the phoneme of 4 timesteps ago. For BLSTM we also experimented with duration weighted error, where the error injected on each frame is scaled by the duration of the current phoneme. We used standard RNN topologies for all experiments, with one recurrently connected hidden layer and no direct connections between the input and output layers. The LSTM (BLSTM) hidden layers contained 140 (93) blocks of one cell in each, and the RNN (BRNN) hidden layers contained 275 (185) units. This gave approximately 100,000 weights for each network).
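The recited prior-probability calculation and output adjustment can likewise be sketched (illustrative only; the division of output probabilities by priors follows the section 4.2 passage of Graves, while the data values and names are hypothetical):

```python
# Illustrative sketch: compute prior probabilities from learning-step
# determinations and use them to adjust a model's output, in the spirit of
# Graves's division of network output probabilities by priors (section 4.2).

def priors_from_determinations(determinations):
    # determinations: True where a boundary was judged appropriate.
    p_ok = sum(determinations) / len(determinations)
    return {"appropriate": p_ok, "not_appropriate": 1 - p_ok}

def adjust(output_prob, prior):
    # Divide the model's output probability by the prior to obtain a likelihood.
    return output_prob / prior

dets = [True, True, True, False]
priors = priors_from_determinations(dets)        # appropriate: 0.75
likelihood = adjust(0.6, priors["appropriate"])  # 0.6 / 0.75, approx. 0.8
```

Here the two recited probabilities (appropriate and not appropriate) sum to one, and the adjustment rescales the raw output by the learned prior.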
Regarding claims 4, 9 and 12, Graves discloses the method wherein, in the learning step, the model is learned by further using the prior probability (section 4.2; Initial estimation of transition and prior probabilities was done using the correct transcription for the training set. Network output probabilities were divided by prior probabilities to obtain likelihoods for the HMM. The system was trained until no improvement was observed or the segmentation of the signal did not change. Due to time limitations, the networks were not re-trained to convergence. Since the output of both HMM-b).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. This information has been detailed in the attached PTO-892 (Notice of References Cited).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAKIEDA R JACKSON whose telephone number is (571)272-7619. The examiner can normally be reached Mon - Fri 6:30a-2:30p.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JAKIEDA R JACKSON/Primary Examiner, Art Unit 2657