DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Claims 1 and 11-12 are amended. Claims 2-10 are cancelled. Claims 13-27 are newly added. As such, claims 1 and 11-27 are presented for examination.
Response to Arguments
Rejection under 35 U.S.C. 101
Applicant’s arguments have been fully considered and are persuasive. The amended independent claims recite altering first time-series data based on user instructions, using an estimation model to process first input data containing a pronunciation style, target sound, and sound characteristic to generate the first time-series data, and generating second time-series data with a specified pronunciation style different from a first pronunciation style in the first time-series data according to the user’s instructions. These operations cannot practically be performed in the human mind; thus, the claims do not recite a mental process. Accordingly, the rejection under 35 U.S.C. 101 is withdrawn.
Rejection under 35 U.S.C. 102/103
Applicant's arguments have been fully considered but they are not persuasive. Applicant argues, “Tachibana is not shown to describe changing of a position of an end point of each of the plurality of notes (phonemes) on the time axis by the user's instruction.” However, Tachibana teaches a note containing arranged phonemes whose end point position is set according to a pronunciation period. Specifically, paragraph [0031] of Tachibana states, “The musical note figure N is an essentially rectangular figure (so-called note bar) in which phonemes are arranged. The position of the musical note figure N in the pitch axis direction is set in accordance with the pitch designated by the musical score data X2. The end points of the musical note figure N in the time axis direction are set in accordance with the pronunciation period designated by the musical score data X2.” Tachibana further teaches that a user can edit the range containing the notes and the pronunciation style of the specific range, changing the end points of the notes. In particular, paragraph [0034] of Tachibana states, “the user can instruct the editing (for example, adding, changing, or deleting) of a note inside the specific range R. The note processing module 23 changes the musical score data X2 in accordance with the user's instruction. In addition, the display control module 21 causes the display device 13 to display a musical note figure N corresponding to each note designated by the musical score data X2 after the change.” Therefore, Tachibana discloses changing of a position of an end point of each of the plurality of notes on the time axis by the user’s instruction.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 11-15, 17-19, 21-23, and 25-27 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana et al. (US 20210097975 A1; hereinafter referred to as Tachibana) in view of Blaauw (US 20210256960 A1; hereinafter referred to as Blaauw).
Regarding claim 1, Tachibana teaches: a non-transitory computer-readable recording medium storing computer-executable instructions ([0083] The program as exemplified above can be stored on a computer-readable storage medium and installed in a computer. The storage medium, for example, is a non-transitory storage medium) that, when executed by a computer system, causes the computer system to execute operations, the operations comprising: altering a first portion of first time-series data in accordance with an instruction from a user ([0022] the user can instruct the transition of acoustic characteristics, for example, the pitch of the synthesized voice (hereinafter referred to as “first characteristic transition”)), the first time-series data indicating a time series of a sound characteristic corresponding to a first pronunciation style of a target sound to be synthesized… ([0025] The musical score data X2 is a music file specifying a time series of a plurality of notes constituting the synthesized musical piece. The musical score data X2 specify a pitch, a phoneme (pronunciation character), and a pronunciation period for each of a plurality of notes constituting the synthesized musical piece),
the target sound is a voice including a plurality of sound units on a time axis ([0031] The musical score area C is a coordinate plane (piano roll screen) in which a horizontal time axis and a vertical pitch axis are set, and in which a time series of a plurality of notes of the synthesized musical piece is displayed), the sound characteristic includes positions of respective end points of the plurality of sound units, and the first portion of the first time-series data is an end point ([0031] The end points of the musical note figure N in the time axis direction are set in accordance with the pronunciation period designated by the musical score data X2), whose position has been changed by the instruction from the user, of a plurality of end points specified by the first time-series data ([0034] The note processing module 23 arranges notes within the specific range R for which the pronunciation style Q has been set, in accordance with the user's instruction. By appropriately operating the input device 14, the user can instruct the editing (for example, adding, changing, or deleting) of a note inside the specific range R. The note processing module 23 changes the musical score data X2 in accordance with the user's instruction);
and generating second time-series data when a second pronunciation style different from the first pronunciation style is specified for the target sound ([0046] FIG. 8, on the other hand, illustrates the combined characteristic transition V in a case in which the pronunciation style Q of the specific range R has been changed from the pronunciation style Q1 to a pronunciation style Q2 (an example of a second pronunciation style)), the second time-series data indicating a sound characteristic with the alteration made to the first portion in accordance with the instruction from the user ([0033] the user can instruct an addition or change of the specific range R, and the pronunciation style Q of the specific range R. The range setting module 22 adds or changes the specific range R and sets the pronunciation style Q of the specific range R in accordance with the user's instruction, and changes the range data X1 in accordance with said setting), and indicating a sound characteristic with a second portion different from the first portion corresponding to the second pronunciation style ([0080] the pronunciation style Q is set for each of the specific range R, but the range for which the pronunciation style Q is set in the synthesized musical piece is not limited to the specific range R. For example, one pronunciation style Q can be set over the entire synthesized musical piece, or the pronunciation style Q can be set for each note).
Tachibana does not explicitly disclose, but Blaauw teaches: wherein first input data is processed using a first estimation model to generate the first time-series data ([0007] generate, using the synthesis model, feature data representative of acoustic features of a target sound of the sound source to be generated in the performance style and according to the sounding conditions; and generate an audio signal corresponding to the target sound using the generated feature data), the first input data includes first style data indicating the first pronunciation style… ([0006] includes inputting a first piece of sound source data representative of a first sound source, a first piece of style data representative of a first performance style, and first synthesis data representative of first sounding conditions into a synthesis model).
Tachibana and Blaauw are considered analogous art in the field of audio synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Tachibana with the teachings of Blaauw because doing so would provide greater flexibility and improved synthesis of different types of audio in accordance with user instructions by using a statistical prediction model for voice synthesis (Blaauw [0034] the feature data Q are generated by inputting a piece of singer data Xa, a piece of style data Xb, and the synthesis data Xc of the tune into the synthesis model M. This allows the target sound to be generated without voice units. In addition to a piece of singer data Xa and synthesis data Xc, a piece of style data Xb is input to the synthesis model M).
Regarding claim 11, it recites similar limitations as claim 1 and therefore is rejected similarly.
Regarding claim 12, Tachibana teaches: a sound processing system comprising: a sound processing circuit configured to ([0029] the functions of the electronic controller 11 can be realized by a plurality of devices configured separately from each other, or, some or all of the functions of the electronic controller 11 can be realized by a dedicated electronic circuit). The rest of the claim recites similar limitations as claim 1 and therefore is rejected similarly.
Regarding claim 13, the combination of Tachibana and Blaauw teaches: the method according to claim 11. Tachibana further teaches: the first input data is processed using the first estimation model to generate the second portion of the second time-series data ([0028] recursive probability model that estimates the current transition from the history of past transitions is utilized as the transition estimation model M. By applying the transition estimation model M of an arbitrary pronunciation style Q to the musical score data X2, the relative transition of a voice pronouncing the note specified by the musical score data X2 in the pronunciation style Q is generated. In a relative transition generated by the transition estimation model M of each pronunciation style Q, changes in pitch peculiar to said pronunciation style Q can be observed), and the first input data includes the control data and second style data indicating the second pronunciation style ([0036] the voice synthesis module 24 generates the voice signal Z of the synthesized voice whose pitch changes along the combined characteristic transition V generated by the transition processing module 25. That is, the pitch of the voice element selected in accordance with the phoneme of each note is adjusted to follow the combined characteristic transition V).
Blaauw further teaches: wherein the first input data includes control data specifying a synthesis condition of the target sound ([0006] includes inputting a first piece of sound source data representative of a first sound source, a first piece of style data representative of a first performance style, and first synthesis data representative of first sounding conditions into a synthesis model), the first estimation model is created by machine learning using a relationship between the first input data and time-series data ([0031] The synthesis model M is a statistical prediction model having learned relations between the input data Z and the feature data Q. The synthesis model M in the first embodiment is constituted by a deep neural network (DNN)) in each of a plurality of pieces of training data… ([0039] The feature analyzer 24 extracts a series of pieces of feature data Q from the audio signal V of each piece of training data L. In one example, the extracted feature data Q includes a fundamental frequency Qa and a spectral envelope Qb of the audio signal V. The generation of a piece of feature data Q is repeated for each time unit (e.g., 5 milliseconds)).
Regarding claim 14, the combination of Tachibana and Blaauw teaches: the method according to claim 13. Tachibana further teaches: wherein a portion of the sound characteristic in the time-series data generated by the first estimation model is changed with the alteration made to the first portion, to generate the second time-series data ([0033] the user can instruct an addition or change of the specific range R, and the pronunciation style Q of the specific range R. The range setting module 22 adds or changes the specific range R and sets the pronunciation style Q of the specific range R in accordance with the user's instruction, and changes the range data X1 in accordance with said setting).
Regarding claim 15, the combination of Tachibana and Blaauw teaches: the method according to claim 14. Blaauw further teaches: wherein second input data is processed using a second estimation model, the second input data including the control data and one of the first time-series data or the second time-series data ([0091] inputting a second piece of sound source data representative of a second sound source, a second piece of style data representative of a second performance style corresponding to the second sound source, and second synthesis data representative of second sounding conditions into the synthesis model), the second estimation model is created by the machine learning using a relationship between the second input data and pitch data in each of the plurality of pieces of the training data ([0061] The first well-trained model M1 generates intermediate data Y in accordance with input data Z including singer data Xa, style data Xb, and synthesis data Xc. The intermediate data Y represent a series of respective elements related to singing of a tune. Specifically, the intermediate data Y represent a series of pitches (e.g., note names), a series of volumes during the singing, and a series of phonemes), to generate pitch data indicating a time series of a pitch of the target sound ([0077] the piece of singer data Xa corresponds to an example of a piece of sound source data representative of a sound source, the sound sources including speaking persons or musical instruments and the like, in addition to singers. Style data Xb comprehensively represent performance styles that includes speech styles or styles of playing musical instruments, in addition to vocal styles. Synthesis data Xc comprehensively represent sounding conditions including speech conditions (e.g., phonetic identifiers) or performance conditions (e.g., a pitch and a volume for each note) in addition to singing conditions), and a sound signal representing the target sound is generated using one of the first time-series data or the second time-series data, and the generated pitch data ([0091] generating a second audio signal corresponding to the second target sound using the generated second feature data).
Tachibana and Blaauw are considered analogous art in the field of audio synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify the combined teachings of Tachibana and Blaauw with the additional teachings of Blaauw because doing so would improve audio synthesis by training a synthesis model using training data containing different sound characteristics (Blaauw [0044] the learning processor 23 selects the next piece of training data L from the pieces of training data in the memory 12 (Sb1), and performs the update processing (Sb2 to Sb5) with the selected piece of training data L. In other words, the update processing is repeated for each piece of training data L).
Regarding claim 17, the combination of Tachibana and Blaauw teaches: the method according to claim 11. Tachibana further teaches: wherein the sound characteristic is a pitch of the target sound ([0039] the first transition generation module 251 generates the first characteristic transition V1 (that is, a time series of pitch bend) corresponding to a line drawing provided as an instruction in the instruction area D by the user by using the input device 14), and the first portion is a portion of a pitch time series indicated in the first time-series data, which the user instructed to be altered ([0022] the user can instruct the transition of acoustic characteristics, for example, the pitch of the synthesized voice (hereinafter referred to as "first characteristic transition")).
Regarding claim 18, the combination of Tachibana and Blaauw teaches: the method according to claim 11. Tachibana further teaches: wherein the sound characteristic includes an amplitude and a tone of the target sound ([0020] generates synthesized voice that is virtually pronounced in a pronunciation style selected from a plurality of pronunciation styles. A pronunciation style means, for example, a characteristic manner of pronunciation. Specifically, a characteristic related to a temporal change of a feature amount, such as pitch or volume (that is, the pattern of change of the feature amount), is one example of a pronunciation style. Volume corresponds to amplitude.), and the first portion is a portion of a time series of amplitude and tone indicated in the first time-series data, which the user instructed to be altered ([0073] the voice signal Z is generated so as to represent the synthesized voice having the tone which, among a plurality of tones, corresponds to an instruction from the user).
Regarding claim 19, the combination of Tachibana and Blaauw teaches: the method according to claim 11. Tachibana further teaches: wherein the first pronunciation style and the second pronunciation style are each a pronunciation style selected from a plurality of different pronunciation styles ([0020] information processing device 100 according to the first embodiment generates synthesized voice that is virtually pronounced in a pronunciation style selected from a plurality of pronunciation styles) in accordance with the instruction from the user ([0033] The range setting module 22 of FIG. 2 sets the pronunciation style Q for the specific range R within the synthesized musical piece. By appropriately operating the input device 14, the user can instruct an addition or change of the specific range R, and the pronunciation style Q of the specific range R).
Regarding claim 21, it recites similar limitations as claim 13 and therefore is rejected similarly.
Regarding claim 22, it recites similar limitations as claim 14 and therefore is rejected similarly.
Regarding claim 23, it recites similar limitations as claim 15 and therefore is rejected similarly.
Regarding claim 25, it recites similar limitations as claim 17 and therefore is rejected similarly.
Regarding claim 26, it recites similar limitations as claim 18 and therefore is rejected similarly.
Regarding claim 27, it recites similar limitations as claim 19 and therefore is rejected similarly.
Claims 16 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Blaauw as applied to claims 1, 11-15, 17-19, 21-23, and 25-27 above, and further in view of Daido (US 20210256959 A1).
Regarding claim 16, the combination of Tachibana and Blaauw teaches: the method according to claim 15. The combination of Tachibana and Blaauw does not explicitly disclose, but Daido teaches: wherein third input data is processed using a third estimation model, the third input data including either the first time-series data or the second time-series data ([0033] The synthesis model M is a statistical prediction model having learned relations between the input data Z and the feature data Q), and the generated pitch data ([0022] Feature data Q represents features of sound represented by the audio signal V1. A piece of feature data Q in the first embodiment includes a fundamental frequency (a pitch) Qa and a spectral envelope Qb), and wherein the third estimation model is created by machine learning ([0034] The learning processor 26 trains the synthesis model M by machine learning. The machine learning carried out by the learning processor 26 is classified into pre-training and additional training) using a relationship between the third input data and sound signal in each of the plurality of pieces of the training data ([0050] The learning processor 26 trains the synthesis model M by additional training with using training data L2 (Sb2 to Sb4). The training data L2 include the condition data Xb and the feature data Q that are generated by the signal analyzer 21 from the audio signal V1. Pieces of training data L2 stored in the memory 12 can be used for the additional training), to generate the sound signal ([0066] an audio signal V2 generated by the signal generator 25 from the feature data Q generated by the synthesis model M, and (ii) the audio signal V3 generated by the adjustment processor 31 shown in FIG. 7. The audio signal V4 generated by the signal synthesizer 32 is supplied to the sound output device 15).
Tachibana, Blaauw, and Daido are considered analogous art in the field of audio synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Tachibana and Blaauw to incorporate the teachings of Daido because doing so would improve audio synthesis by generating new synthesis data through updating a synthesis model using previously generated sound characteristic data and training data (Daido [0075] includes establishing a re-trained synthesis model by additionally training a pre-trained synthesis model for generating feature data representative of acoustic features of an audio signal according to condition data representative of sounding conditions).
Regarding claim 24, it recites similar limitations as claim 16 and therefore is rejected similarly.
Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Blaauw as applied to claims 1, 11-15, 17-19, 21-23, and 25-27 above, and further in view of Walker et al. (US 20170358303 A1; hereinafter referred to as Walker).
Regarding claim 20, the combination of Tachibana and Blaauw teaches: the method according to claim 11. The combination of Tachibana and Blaauw does not explicitly disclose, but Walker teaches: wherein whether the instruction from the user is adequate is determined ([0277] if a response provided by the software application does not indicate that a parameter is valid, the response may indicate that clarification of the parameter is required. For instance, a parameter may be improper (i.e., invalid) and the software application may request additional input from the user), and when the instruction is determined to be inadequate, the first portion is not altered ([0277] in the event the parameter is improper (e.g., the user has specified an invalid type of car), the application may request that a proper (e.g., valid) value for the parameter (e.g., type of car) be provided).
Tachibana, Blaauw, and Walker are considered analogous art in the field of audio processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combined teachings of Tachibana and Blaauw to incorporate the teachings of Walker because doing so would allow invalid user instructions to be handled, leading to fewer errors in audio synthesis (Walker [0277] based on the response from the software application indicating the parameter is improper, the user device may provide a natural-language query 1002 prompting the user to select a valid parameter).
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Nathan Tengbumroong whose telephone number is (703)756-1725. The examiner can normally be reached Monday - Friday, 11:30 am - 8:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NATHAN TENGBUMROONG/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654