Prosecution Insights
Last updated: April 19, 2026
Application No. 17/951,298

Sound Signal Generation Method, Estimation Model Training Method, and Sound Signal Generation System

Non-Final OA — §102, §103
Filed: Sep 23, 2022
Examiner: SCOLES, PHILIP GRANT
Art Unit: 2837
Tech Center: 2800 — Semiconductors & Electrical Systems
Assignee: Yamaha Corporation
OA Round: 1 (Non-Final)
Grant Probability: 56% (Moderate)
Expected OA Rounds: 1-2
Time to Grant: 3y 10m
With Interview: 77%

Examiner Intelligence

Career Allow Rate: 56% (30 granted / 54 resolved; -12.4% vs TC avg)
Interview Lift: +21.3% for resolved cases with interview
Avg Prosecution: 3y 10m (36 currently pending)
Total Applications: 90, across all art units

Statute-Specific Performance

§101: 1.6% (-38.4% vs TC avg)
§103: 53.3% (+13.3% vs TC avg)
§102: 22.0% (-18.0% vs TC avg)
§112: 20.2% (-19.8% vs TC avg)
Tech Center averages are estimates • Based on career data from 54 resolved cases
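As a quick sanity check, the Tech Center baseline implied by each row above can be backed out from the examiner rate and its delta. This is illustrative arithmetic on the displayed figures only; the dict below is a hypothetical representation, not the tool's data model.

```python
# Statute-specific figures shown above, as (examiner_rate, delta_vs_tc_avg),
# both in percent.
examiner = {
    "101": (1.6, -38.4),
    "103": (53.3, +13.3),
    "102": (22.0, -18.0),
    "112": (20.2, -19.8),
}

# The implied TC average is the examiner rate minus the delta.
tc_avg = {s: round(rate - delta, 1) for s, (rate, delta) in examiner.items()}

for statute, avg in tc_avg.items():
    print(f"§{statute}: implied TC average = {avg}%")
```

Notably, all four deltas resolve to the same 40.0% baseline, which suggests the dashboard compares every statute against a single flat Tech Center estimate rather than per-statute averages.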

Office Action

Rejections: §102, §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement

The information disclosure statements (IDSs) submitted on 8/11/2022, 12/12/2024, and 7/3/2025 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 102

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 6-7 are rejected under 35 U.S.C. 102 as anticipated by Jeong et al. ("Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance," 2019, retrieved November 24, 2025 from https://proceedings.mlr.press/v97/jeong19a/jeong19a.pdf), hereinafter Jeong. 
Regarding claim 6, Jeong discloses a computer-implemented estimation model (Jeong abstract: "we design the model using note-level gated graph neural network") training method comprising: obtaining a plurality of training data, each including condition data and a corresponding shortening rate (Jeong § 6.1: "Since there was no available public data set for our task, we created our own data set by collecting pairs of score in MusicXML and corresponding performances in MIDI. We collected the score files from MuseScore and Musicalion, and the performance MIDI files from Yamaha e-competition data. The performance data consists of human piano performances of classical music from the Baroque to contemporary music, which were recorded by a computer-controlled piano (Yamaha Disklavier) during the competitions. This dataset has been widely used in music generation tasks (Huang et al., 2019; Hawthorne et al., 2019). To make note-level score and performance data pairs, XML-to-MIDI matching is required. We utilized an automatic alignment algorithm (Nakamura et al., 2017) to synchronize the score XML to performance MIDI in note-level."), wherein: the condition data represents a sounding condition specified for a specific note by score data representing: (i) respective durations of a plurality of notes (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."), and (ii) a shortening indication for shortening a duration of the specific note, which is one of the plurality of notes (Jeong § 3.1: "We define six edge types in musical score: next, rest, onset, sustain, voice, and slur as shown in Figure 2. A next edge connects a note to its following note, i.e., the following note that begins exactly when the note ends. A rest edge links a note with the rest following to other notes that begin when the rest ends. If there are consecutive rests, they are combined as a single rest. An onset edge is to connect notes that begin together, i.e., on the same onset. Notes that appear between a note start and its end are connected by sustain edges. voice edges are a subset of next edges which connect notes in the same voice only. Among voice edges, we add slur edges between notes under the same slur."), and the shortening rate represents an amount of shortening of the duration of the specific note (Jeong § 4.1: "The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."); and training an estimation model to learn a relationship between the condition data and the shortening rate by machine learning using the plurality of training data (Jeong § 6.3: "We trained all models using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.0003, weight decay of 1e-5, and dropout of 0.1. For the training batch, the notes were sliced at the end of measure that include the 400-th note. We tried both classification and regression for the output, but the output quality was better in regression. Therefore, we used mean square error for the loss, and the output features were all normalized with µ = 0, σ = 1. The loss was calculated note-wise except the tempo loss, which was calculated in beat-wise.").

Regarding claim 7, Jeong discloses a computer-implemented estimation model training method comprising the features of claim 6 as discussed above. Jeong further discloses that the sounding condition represented by the condition data includes a pitch and a duration of the specific note (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling.") and information about at least one of a note before the specific note or a note after the specific note (Jeong § 3.1: "We define six edge types in musical score: next, rest, onset, sustain, voice, and slur as shown in Figure 2. A next edge connects a note to its following note, i.e., the following note that begins exactly when the note ends. A rest edge links a note with the rest following to other notes that begin when the rest ends. If there are consecutive rests, they are combined as a single rest. An onset edge is to connect notes that begin together, i.e., on the same onset. Notes that appear between a note start and its end are connected by sustain edges. voice edges are a subset of next edges which connect notes in the same voice only. Among voice edges, we add slur edges between notes under the same slur."). 
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-5 and 8-12 are rejected under 35 U.S.C. 103 as unpatentable over Jeong in view of Oura et al. ("Recent Development of the HMM-based Singing Voice Synthesis System — Sinsy," 2010, retrieved November 24, 2025 from https://www.isca-archive.org/ssw_2010/oura10_ssw.pdf), hereinafter Oura. 
Regarding claim 1, Jeong teaches generating a shortening rate representative of an amount of shortening of the duration of the specific note (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."), by inputting, to a first estimation model, condition data representative of a sounding condition specified by the score data for the specific note (Jeong § 4.2: "Our system consists of three modules: a score encoder Es, a performance encoder, Ep and a performance decoder Dp. For a given score input X, the score module Es infers a score condition C. The score module employs ISGN with two GGNN layers and a single layer of measure-level RNN with a long short term memory (LSTM). The initial input X passes through the three-layer dense network. The first layer of GGNN updates note-level features only, and measure level features remain fixed. The second layer of GGNN updates the whole hidden state. The outputs of both layers are concatenated to compose C as a skip connection. For the encoded score condition C and its corresponding performance features Y, the performance encoder module encodes the performance style, Y for given C, as a latent vector z. The input of the style encoder is the score condition C concatenated with its corresponding performance features Y. The dimension of concatenated data is reduced with a dense layer and used for input of the performance encoder. The performance encoder consists of a GGNN and two-layers measure-level LSTM. 
The higher-level of performance encoder employs HAN instead of ISGN since it focuses on the summarization of total sequence rather than characteristics of individual notes. During the inference, the encoder can be bypassed by sampling z from the normal distribution or using a pre-encoded z."); generating a series of control data, each representing a control condition of the sound signal corresponding to the score data (Jeong § 4.2: "The last part of the system is the performance decoder D : C, z, Ŷ0 ↦ Ŷ, which generates the performance features Ŷ. We employ the ISGN structure in the decoder module as well. It infers performance parameters for each note in the input score by the iterative method. We employed the concept of hierarchical decoding of latent vector in VAE introduced in (Roberts et al., 2018). Instead of directly using the encoded z, we decode z into measure-level by concatenating z × Lm with measure-level representation Hm in the encoded score C and feeding it to another LSTM module. The LSTM module returns measure-level performance style vector zm ∈ R^(Lm×De), where Lm and De represent number of measures and dimension of encoded vector z."), the series of control data reflecting a shortened duration of the specific note shortened in accordance with the generated shortening rate (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output… The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."). Jeong does not explicitly disclose generating the sound signal in accordance with the series of control data. However, Oura teaches generating the sound signal in accordance with the series of control data (Oura § 2.1: "Second, according to the label sequence, an HMM corresponding to the song is constructed by concatenating the context-dependent HMMs. Third, the state durations of the song HMM are determined with respect to the state duration models. Fourth, the spectrum and excitation parameters are generated by the speech parameter generation algorithm [14]. Finally, a singing voice is synthesized directly from the generated spectrum and excitation parameters by using a Mel Log Spectrum Approximation (MLSA) filter [15]."). It would have been prima facie obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified the computer-implemented sound signal generation method of Jeong by adding the sound signal generation of Oura to obtain natural singing voices (Oura § 6).

Regarding claim 2, Jeong (in view of Oura) teaches a computer-implemented sound signal generation method comprising the features of claim 1 as discussed above. Jeong further teaches that the first estimation model is a machine learning model (Jeong § 4.2: "Our system consists of three modules: a score encoder Es, a performance encoder, Ep and a performance decoder Dp… The performance encoder consists of a GGNN and two-layers measure-level LSTM.") that learns a relationship between a sounding condition specified for a specific note in a piece of music and a shortening rate of the specific note (Jeong § 4.1: "The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level.").

Regarding claim 3, Jeong (in view of Oura) teaches a computer-implemented sound signal generation method comprising the features of claim 2 as discussed above. Jeong further teaches that the sounding condition represented by the condition data includes a pitch and a duration of the specific note (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). 
The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level.") and information about at least one of a note before the specific note or a note after the specific note (Jeong § 3.1: "We define six edge types in musical score: next, rest, onset, sustain, voice, and slur as shown in Figure 2. A next edge connects a note to its following note, i.e., the following note that begins exactly when the note ends.").

Regarding claim 4, Jeong (in view of Oura) teaches a computer-implemented sound signal generation method comprising the features of claim 1 as discussed above. Oura further teaches the sound signal is generated by inputting the series of control data into a second estimation model separate from the first estimation model (Oura § 2.1: "An arbitrarily given musical score including the lyrics to be synthesized is first converted in the synthesis part to a context-dependent label sequence. Second, according to the label sequence, an HMM corresponding to the song is constructed by concatenating the context-dependent HMMs. Third, the state durations of the song HMM are determined with respect to the state duration models. Fourth, the spectrum and excitation parameters are generated by the speech parameter generation algorithm [14]. Finally, a singing voice is synthesized directly from the generated spectrum and excitation parameters by using a Mel Log Spectrum Approximation (MLSA) filter [15].").

Regarding claim 5, Jeong (in view of Oura) teaches a computer-implemented sound signal generation method comprising the features of claim 1 as discussed above. Jeong further teaches that the generating of the series of control data includes: generating intermediate data in which the duration of the specific note has been shortened by the shortening rate (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."); and generating the series of control data that corresponds to the intermediate data (Jeong § 4.2: "The last part of the system is the performance decoder D : C, z, Ŷ0 ↦ Ŷ, which generates the performance features Ŷ. We employ the ISGN structure in the decoder module as well. It infers performance parameters for each note in the input score by the iterative method.").

Regarding claim 8, Jeong teaches one or more memories for storing instructions (Jeong abstract: "we design the model using note-level gated graph neural network"); and one or more processors communicatively connected to the one or more memories (Jeong abstract: "we design the model using note-level gated graph neural network") and that execute instructions to: generate a shortening rate representative of an amount of shortening of the duration of the specific note (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. 
The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."), by inputting, to a first estimation model, condition data representative of a sounding condition specified by the score data for the specific note (Jeong § 4.2: "Our system consists of three modules: a score encoder Es, a performance encoder, Ep and a performance decoder Dp. For a given score input X, the score module Es infers a score condition C. The score module employs ISGN with two GGNN layers and a single layer of measure-level RNN with a long short term memory (LSTM). The initial input X passes through the three-layer dense network. The first layer of GGNN updates note-level features only, and measure level features remain fixed. The second layer of GGNN updates the whole hidden state. The outputs of both layers are concatenated to compose C as a skip connection. For the encoded score condition C and its corresponding performance features Y, the performance encoder module encodes the performance style, Y for given C, as a latent vector z. The input of the style encoder is the score condition C concatenated with its corresponding performance features Y. The dimension of concatenated data is reduced with a dense layer and used for input of the performance encoder. The performance encoder consists of a GGNN and two-layers measure-level LSTM. The higher-level of performance encoder employs HAN instead of ISGN since it focuses on the summarization of total sequence rather than characteristics of individual notes. 
During the inference, the encoder can be bypassed by sampling z from the normal distribution or using a pre-encoded z."); generate a series of control data, each representing a control condition of the sound signal corresponding to the score data (Jeong § 4.2: "The last part of the system is the performance decoder D : C, z, Ŷ0 ↦ Ŷ, which generates the performance features Ŷ. We employ the ISGN structure in the decoder module as well. It infers performance parameters for each note in the input score by the iterative method. We employed the concept of hierarchical decoding of latent vector in VAE introduced in (Roberts et al., 2018). Instead of directly using the encoded z, we decode z into measure-level by concatenating z × Lm with measure-level representation Hm in the encoded score C and feeding it to another LSTM module. The LSTM module returns measure-level performance style vector zm ∈ R^(Lm×De), where Lm and De represent number of measures and dimension of encoded vector z."), the series of control data reflecting a shortened duration of the specific note shortened in accordance with the generated shortening rate (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output… The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."). Jeong does not explicitly disclose generating the sound signal in accordance with the series of control data. However, Oura teaches generating the sound signal in accordance with the series of control data (Oura § 2.1: "Second, according to the label sequence, an HMM corresponding to the song is constructed by concatenating the context-dependent HMMs. Third, the state durations of the song HMM are determined with respect to the state duration models. Fourth, the spectrum and excitation parameters are generated by the speech parameter generation algorithm [14]. 
Finally, a singing voice is synthesized directly from the generated spectrum and excitation parameters by using a Mel Log Spectrum Approximation (MLSA) filter [15]."). It would have been prima facie obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to have modified the sound signal generation system of Jeong by adding the sound signal generation of Oura to obtain natural singing voices (Oura § 6).

Regarding claim 9, Jeong (in view of Oura) teaches a sound signal generation system comprising the features of claim 8 as discussed above. Jeong further teaches that the first estimation model is a machine learning model (Jeong § 4.2: "Our system consists of three modules: a score encoder Es, a performance encoder, Ep and a performance decoder Dp… The performance encoder consists of a GGNN and two-layers measure-level LSTM.") that learns a relationship between a sounding condition specified for a specific note in a piece of music and a shortening rate of the specific note (Jeong § 4.1: "The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level.").

Regarding claim 10, Jeong (in view of Oura) teaches a sound signal generation system comprising the features of claim 9 as discussed above. Jeong further teaches that the sounding condition represented by the condition data includes a pitch and a duration of the specific note (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level.") and information about at least one of a note before the specific note or a note after the specific note (Jeong § 3.1: "We define six edge types in musical score: next, rest, onset, sustain, voice, and slur as shown in Figure 2. A next edge connects a note to its following note, i.e., the following note that begins exactly when the note ends.").

Regarding claim 11, Jeong (in view of Oura) teaches a sound signal generation system comprising the features of claim 8 as discussed above. Oura further teaches the sound signal is generated by inputting the series of control data into a second estimation model separate from the first estimation model (Oura § 2.1: "An arbitrarily given musical score including the lyrics to be synthesized is first converted in the synthesis part to a context-dependent label sequence. Second, according to the label sequence, an HMM corresponding to the song is constructed by concatenating the context-dependent HMMs. Third, the state durations of the song HMM are determined with respect to the state duration models. Fourth, the spectrum and excitation parameters are generated by the speech parameter generation algorithm [14]. Finally, a singing voice is synthesized directly from the generated spectrum and excitation parameters by using a Mel Log Spectrum Approximation (MLSA) filter [15].").

Regarding claim 12, Jeong (in view of Oura) teaches a sound signal generation system comprising the features of claim 8 as discussed above. Jeong further teaches that in the generation of the series of control data, the one or more processors execute the instructions to: generate intermediate data in which the duration of the specific note has been shortened by the shortening rate (Jeong § 4.1: "Our model exploits pre-defined score and performance features for input and output. The feature extraction scheme is detailed in (Jeong et al., 2019). 
The input features include various type of musical information such as pitch and duration of note, tempo and dynamic markings. The input features are all embedded in note-level. The output features consist of tempo, MIDI velocity (loudness), onset deviation (micro-timing of each note), articulation (duration ratio), and seven features to handle piano pedaling."); and generate the series of control data that corresponds to the intermediate data (Jeong § 4.2: "The last part of the system is the performance decoder D : C, z, Ŷ0 ↦ Ŷ, which generates the performance features Ŷ. We employ the ISGN structure in the decoder module as well. It infers performance parameters for each note in the input score by the iterative method.").

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PHILIP SCOLES whose telephone number is (703)756-1831. The examiner can normally be reached Monday-Friday 8:30-4:30 ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Dedei Hammond, can be reached on 571-270-7938. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP G SCOLES/
Examiner, Art Unit 2837

/JEFFREY DONELS/
Primary Examiner, Art Unit 2837

Prosecution Timeline

Sep 23, 2022
Application Filed
Nov 25, 2025
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12603073
ELECTRONIC PERCUSSION INSTRUMENT, CONTROL DEVICE FOR ELECTRONIC PERCUSSION INSTRUMENT, AND CONTROL METHOD THEREFOR
2y 5m to grant • Granted Apr 14, 2026
Patent 12597405
AUTO-RECORDING FOR MUSICAL INSTRUMENT
2y 5m to grant • Granted Apr 07, 2026
Patent 12597406
ELECTRONIC CYMBAL AND STRIKING DETECTION METHOD
2y 5m to grant • Granted Apr 07, 2026
Patent 12586552
MULTI-LEVEL AUDIO SEGMENTATION USING DEEP EMBEDDINGS
2y 5m to grant • Granted Mar 24, 2026
Patent 12579962
DEVICE AND ELECTRONIC MUSICAL INSTRUMENT
2y 5m to grant • Granted Mar 17, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 56%
With Interview: 77% (+21.3%)
Median Time to Grant: 3y 10m
PTA Risk: Low
Based on 54 resolved cases by this examiner. Grant probability derived from career allow rate.
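The headline projections are reproducible from the career stats shown earlier. A minimal sketch, assuming the tool simply rounds the career allow rate and adds the interview lift in percentage points (variable names are mine, not the tool's actual model):

```python
granted, resolved = 30, 54    # examiner's career record: 30 granted / 54 resolved
interview_lift_pp = 21.3      # "Interview Lift", in percentage points

allow_rate = 100 * granted / resolved            # 55.55... -> displayed as 56%
with_interview = allow_rate + interview_lift_pp  # 76.85... -> displayed as 77%

print(f"Grant Probability: {allow_rate:.0f}%")
print(f"With Interview: {with_interview:.0f}%")
```

Both rounded values match the displayed 56% and 77%, which supports the footnote that grant probability is derived directly from the career allow rate.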
