DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 10/30/2025 has been entered.
This communication is in response to the Amendments and Arguments filed on 10/30/2025.
Claims 1-5, 7-10, and 12-15 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 10/30/2025 have been fully considered but they are not persuasive and/or are moot.
Applicant asserts on pages 11-14 that Yamanashi in view of Pandey and Archibald fails to teach obtaining a repeated audio section according to a sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio, because the sub-frames of Archibald are used to detect the peak of the next “sub-frame” and the overlapping audio sections of Pandey are added together rather than extracted. The Examiner respectfully disagrees with this assertion. Pandey teaches that the input signal is segmented using frames with 75% overlap, and that the output signal is synthesized using an overlap-add method applied to the analyzed overlapped frames [0027-8],[0031-2]. With a 75% overlap at each shift, the last quarter of the first window, the third quarter of the second window, the second quarter of the third window, and the first quarter of the fourth window cover the same samples and are therefore commonly repeated. This reads on the BRI of “obtain a repeated audio section according to a sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio”. This interpretation is further supported by the as-filed specification, which states in [0027] that “In some examples, the repeated audio section R is extracted by an overlap-add method.” Therefore, the use of an overlap-add method in Pandey is not inconsistent with the BRI of the claims as recited. Archibald is cited to teach that the window is specifically divided into 4 subframes (Fig. 4, [0042-3]).
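For illustration only, and forming no part of the evidentiary record, the overlap arithmetic described above can be verified numerically in a short Python sketch; the window length L = 16 and shift S = L/4 = 4 are hypothetical values chosen for readability:

```python
import numpy as np

L, S = 16, 4                      # hypothetical window length and shift (S = L/4, i.e. 75% overlap)
signal = np.arange(40)            # toy signal; sample values double as sample indices

# Four successive analysis windows, each shifted by S samples.
windows = [signal[i * S : i * S + L] for i in range(4)]

# Split each window into four quarters (sub-audios) of S samples each.
quarters = [[w[j * S : (j + 1) * S] for j in range(4)] for w in windows]

# The last quarter of window 1, the third quarter of window 2, the second
# quarter of window 3, and the first quarter of window 4 cover the same samples.
common = [quarters[0][3], quarters[1][2], quarters[2][1], quarters[3][0]]
assert all(np.array_equal(common[0], q) for q in common)
print(common[0])                  # -> [12 13 14 15], the commonly repeated sub-audio
```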
Hence, Applicant’s arguments regarding Yamanashi, Pandey, and Archibald are not persuasive.
Applicant’s arguments with respect to amended claim features related to magnitude, phase, masks, and using information obtained from a previous analysis as an input for subsequent analysis have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Please see the updated mappings below citing Casper for further detail.
Claim Objections
Claims 4 and 9 are objected to because of the following informalities: both claims recite “mask information”. The Examiner suggests amending the claims to recite --the mask information-- in order to maintain clear antecedent basis. Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 2-5, 7-10, 12, and 13 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
With respect to dependent claims 2 and 3, the claims have been amended to recite “the processor obtains the repeated audio section according to the sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio, and discards the remaining sub-frames.” Applicant refers to [0027] to describe how it can be inferred that remaining sub-audios are discarded.
“[0027] In the first operation, the processor 50 performs the operation on the first original sub-audio group V11 by using the speech analysis model 40 and the separator 60. The operation manner is as described above and will not be repeated here. After the operation, a first analyzed audio T10 and hidden layer state information are obtained. Next, in the second operation, the processor 50 uses the hidden layer state information obtained by the first operation and the second original sub-audio group V12 as the input, and performs analysis by using the speech analysis model 40 to obtain a second analyzed audio T20. The operation is repeated in this way to obtain a third analyzed audio T30, a fourth analyzed audio T40, ... , and then, the overlapping part of the analyzed audios T10-T40 is extracted and output as the repeated audio section R. As shown in the figure, after 4 times of analysis, the overlapping part is the sub-audio t3, so the sub-audio t3 is output as the repeated audio section. In some examples, the repeated audio section R is extracted by an overlap-add method. FIG. 2 is a schematic diagram showing operations according to the disclosure. The working principle of the part not mentioned in the figure is the same as above, and will not be repeated here.”
However, while the specification describes the extraction of repeated audio sections, there is no disclosure regarding what happens to audio sections not chosen as a repeated audio section, or that non-repeated sections are discarded.
Applicant’s arguments on page 10 further state:
“Paragraph [0027] describes how the sub-audio t3 is used as the repeated audio section. Accordingly, it can be inferred that after the sub-audio t3 is extracted, the remaining sub-audios t4, t5, and t6 are discarded. Furthermore, after extracting the sub-audio t3, the processor subsequently performs analysis on the second analyzed audio T20, the third analyzed audio T30, the fourth analyzed audio T40, and the fifth analyzed audio, extracts the repeated sub-audio t4 as the repeated audio section, and discards the remaining sub-audios t4, t5, and t6.”
And as-filed Fig. 2 shows:
[as-filed Fig. 2, reproduced as media_image1.png (greyscale)]
As disclosed in [0027] and shown in Fig. 2, there is no indication that discarding audio is part of the process. For instance, Applicant states in the arguments that after t3 is extracted, the remaining sub-audios t4-t6 are discarded; however, Fig. 2 clearly shows t4-t6 as part of the repeated sub-audio sections that are output. This output of t4-t6, combined with the lack of any description of discarding data, does not support discarding sub-audio that is not commonly repeated.
The unsupported claim scope is as follows: based on the changes made to the claims, there is no description that supports discarding the remaining sub-audios after obtaining the repeated audio section according to the sub-audio that is commonly repeated.
Hence, Applicant is advised to amend the limitations to remove functionality that is not specifically described in the as-filed disclosure.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 2-5, 7-10, 12, and 13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 2 and 3 recite “the remaining sub-frames”. There is insufficient antecedent basis for this limitation in the claims. In the interest of compact prosecution, the Examiner will interpret the term as --remaining sub-audios--.
Claims 4, 5, 7-10, 12, and 13 are rejected as being dependent upon a rejected base claim.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 14, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Yamanashi (U.S. PG Pub No. 2016/0148623), hereinafter Yamanashi, in view of Casper et al. (U.S. PG Pub No. 2023/0232171), hereinafter Casper, in view of Pandey et al. (U.S. PG Pub No. 2016/0336015), hereinafter Pandey, and further in view of Archibald (U.S. PG Pub No. 2016/0336015), hereinafter Archibald.
Regarding claim 1, Yamanashi teaches
A television (a television apparatus [0013]), comprising:
a remote control, configured to send a volume adjustment command (a remote controller, i.e. remote control, able to send a signal to control the television apparatus, i.e. configured to send a…command, where the user can select a volume adjustment mode through the television apparatus, i.e. volume adjustment command [0018],[0035]);
a receiving element, configured to receive the volume adjustment command (the television apparatus has a remote controller receiver, i.e. receiving element, that receives a signal from the remote controller to control the television apparatus, i.e. configured to receive the…command, where the user can select a volume adjustment mode through the television apparatus, i.e. volume adjustment command [0018],[0035]);
a speaker (the television apparatus has a speaker [0017]);
a speech analysis model, configured to obtain an analysis result and hidden layer state information according to a video sound (content may include both video and audio, i.e. video sound, where the audio source separation module uses a voice model and background sound model, i.e. a speech analysis model, extracts an acoustic feature representing features of the voice signal and background sound of the audio signal, i.e. configured to obtain an analysis result…according to a video sound, and uses the acoustic feature to calculate a likelihood of the voice signal and a likelihood of the background signal, i.e. hidden layer state information…according to a video sound [0013],[0022-3],[0027]); and
a processor (the television apparatus has an audio processor, and can be provided with the audio source separation function to serve as the server apparatus [0017],[0044]), configured to:
using a window, a shifting length, a separator and the speech analysis model to perform a Fourier transform on the video sound…, and then an inverse Fourier transform is performed… (the audio source separation module, i.e. separator, uses a voice model and background sound model, i.e. speech analysis model, acquires an audio signal that is part of a content with video, extracts an acoustic feature, calculates a likelihood of voice and background sound in the signal, and extracts an estimated spectrogram of the voice signal and features of the background sound, where the audio signal is divided into frames having a length of 25 ms using a Hamming window, i.e. a window, where the interval is 8 ms, and there are multiple frames each of the same length and 50 frames correspond to 400 ms, i.e. shifting length, and the calculations are performed on sets of 50 frames, where features are extracted using a Fourier transform of the audio signal, i.e. perform a Fourier transform on the video sound, and the estimated spectrogram of the voice signal is converted back into a time signal by inverse Fourier transform, i.e. an inverse Fourier transform is performed [0013],[0016],[0020-4],[0026-8]); and
after performing a plurality of operations correspondingly obtain a first analyzed audio, a second analyzed audio, a third analyzed audio, a fourth analyzed audio, and the hidden layer state information…(the audio source separation module acquires an audio signal that is part of a content with video, extracts an acoustic feature, calculates a likelihood of voice and background sound in the signal, i.e. after performing a plurality of operations…correspondingly obtain…the hidden layer state information, and extracts an estimated spectrogram of the voice signal and features of the background sound, i.e. correspondingly obtain…analyzed audio, where the audio signal is divided into frames having a length of 25 ms using a Hamming window, where the interval is 8 ms, and there are multiple frames each of the same length and 50 frames correspond to 400 ms, and the calculations are performed on sets of 50 frames, i.e. a first analyzed audio, a second analyzed audio, a third analyzed audio, a fourth analyzed audio [0013],[0016],[0020-4],[0026-8]);
adjust the volume of the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio according to the volume adjustment command (when the user provides an instruction to change the volume adjustment mode, the volume of the separated audio is changed to the ratio according to the mode, such as volume of voice and background equal, background completely suppressed, or voice completely suppressed [0016],[0035],[0039], and where the audio is separated based on evaluation of zones of 50 frames, i.e. first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio [0026-8],[0030],[0034-5]);
obtain a repeated audio section according to … the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio (stream data is created and sent to the television apparatus, with the volume based on the volume ratio of the selected volume adjustment mode, i.e. obtain a repeated audio section, where the volume of the separated audio is changed to the ratio according to the mode, where the separation is based on an evaluation of zones of 50 frames, i.e. according to … the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio [0016],[0026-8],[0030],[0034-5],[0039-40]); and
control the speaker to output the repeated audio section (the television apparatus plays back and outputs the stream data, i.e. output the repeated audio section, where audio is output by a speaker of the television apparatus, i.e. control the speaker to output [0017],[0039-40]).
While Yamanashi provides using a Fourier transform and inverse Fourier transform to extract spectrogram features and convert a signal back into a time domain, Yamanashi does not specifically teach the use of a Fourier transform to determine magnitude and phase information or further obtaining mask information, and thus does not teach
perform a Fourier transform on the video sound to obtain a magnitude information and a phase information, wherein the speech analysis model analyzes the magnitude information to obtain a mask information, and the separator performs masking on the magnitude information by using the mask information to obtain a target magnitude information, and then an inverse Fourier transform is performed according to the target magnitude information and the phase information to obtain an analyzed audio and hidden layer state information;
wherein, in each analysis process, the speech analysis model uses the hidden layer state information obtained from a previous analysis as an input for a subsequent analysis.
Casper, however, teaches to perform a Fourier transform on the video sound to obtain a magnitude information and a phase information, wherein the speech analysis model analyzes the magnitude information to obtain a mask information, and the separator performs masking on the magnitude information by using the mask information to obtain a target magnitude information, and then an inverse Fourier transform is performed according to the target magnitude information and the phase information to obtain an analyzed audio and hidden layer state information (the neural network, i.e. speech analysis model, takes input audio data and transforms them with a short-time Fourier transform, i.e. to perform a Fourier transform on the video sound, to obtain a time-frequency information where each frequency has a magnitude and phase, i.e. obtain a magnitude information and a phase information, and the neural network processes the audio to determine one or more complex masks, i.e. obtain mask information, that can be applied to the original signal to modify the phase and magnitude of each frequency, i.e. analyzes the magnitude information to obtain a mask information, such as to isolate one or more sound sources by blocking out a particular sound, i.e. separator performs masking on the magnitude information by using the mask information to obtain a target magnitude information, where the network then outputs the clean time-domain signal after having the phase and magnitude of each frequency modified, and where the RNN shares parameters across different parts of the neural network such that a present value of a variable is used at a future time by using updated activations from the current sample, and outputs can be from hidden layers, i.e. hidden layer state information Fig. 10,[0021],[0040],[0054],[0056],[0058],[0062],[0064-6],[0092],[0095],[0098],[0137], [0139]);
wherein, in each analysis process, the speech analysis model uses the hidden layer state information obtained from a previous analysis as an input for a subsequent analysis (the RNN shares parameters across different parts of the neural network such that a present value of a variable is used at a future time by using updated activations from the current sample, and outputs can be from hidden layers, i.e. in each analysis process the speech analysis model uses the hidden layer state information obtained from a previous analysis, and the updated activations from the current sample are used by the computation for subsequent samples, i.e. as an input for a subsequent analysis Fig. 10,[0058],[0137],[0139]).
Where Yamanashi specifically teaches that an inverse Fourier transform turns the time-frequency signal into a time domain signal [0024].
Yamanashi and Casper are analogous art because they are from a similar field of endeavor in performing audio separation to adjust the volume of different parts of the audio. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yamanashi of using a Fourier transform and inverse Fourier transform to extract spectrogram features and convert a signal back into the time domain with the use of complex masks to modify magnitude and phase information as taught by Casper. It would have been obvious to combine the references to improve a user’s understanding of speech in real-time conversations by processing the audio through a neural network (Casper [0001]).
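As a purely illustrative, non-limiting sketch of the masking pipeline mapped above from Casper (Fourier transform to magnitude and phase information, mask estimation, masking, and inverse transform), where the toy_mask function is a hypothetical stand-in for the trained neural mask estimator and the 400-sample frame is assumed for example’s sake:

```python
import numpy as np

def analyze_frame(frame, mask_model):
    # Fourier transform of the windowed frame yields magnitude and phase information.
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    magnitude, phase = np.abs(spectrum), np.angle(spectrum)

    mask = mask_model(magnitude)             # mask information per frequency bin
    target_magnitude = mask * magnitude      # masking yields the target magnitude information

    # Inverse Fourier transform according to the target magnitude and the original phase.
    return np.fft.irfft(target_magnitude * np.exp(1j * phase), n=len(frame))

# Hypothetical stand-in mask: keep only the lower half of the bins. A real model
# (e.g. an RNN) would also carry hidden layer state information between analyses.
toy_mask = lambda mag: (np.arange(mag.size) < mag.size // 2).astype(float)
analyzed = analyze_frame(np.random.randn(400), toy_mask)  # e.g. one 25 ms frame at 16 kHz
```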
While Yamanashi in view of Casper provides dividing the audio into frames using a Hamming window, Yamanashi in view of Casper does not specifically teach that the window length is 4 times the shifting length or that the analyzed audios overlap, and thus does not teach
after performing a plurality of operations correspondingly obtain a first analyzed audio, a second analyzed audio, a third analyzed audio, a fourth analyzed audio…wherein a window length is 4 times the shifting length…;
obtain a repeated audio section according to a sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio.
Pandey, however, teaches after performing a plurality of operations correspondingly obtain a first analyzed audio, a second analyzed audio, a third analyzed audio, a fourth analyzed audio …wherein a window length is 4 times the shifting length… (the input signal is segmented using L-sample frames with 75% overlap, where the window length is L and the window shift is S=L/4, i.e. a window length is 4 times the shifting length, and the windowed input audio is processed for spectral modification using various blocks, i.e. after performing a plurality of operations correspondingly obtain a first analyzed audio, a second analyzed audio, a third analyzed audio, a fourth analyzed audio [0027-8],[0031-2]);
obtain a repeated audio section according to a sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio (an output signal is resynthesized by processing windows of the input signal, i.e. the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio, and overlap-add, where the frames have L samples with a 75% overlap, i.e. obtain a repeated audio section according to a sub-audio that is commonly repeated [0027-8]).
Yamanashi, Casper, and Pandey are analogous art because they are from a similar field of endeavor in modifying audio according to user needs. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yamanashi, as modified by Casper, of dividing the audio into frames using a Hamming window with the use of a modified Hamming window and a window shift ¼ the length of the window as taught by Pandey. It would have been obvious to combine the references to use a compression function that compensates for the abnormal loudness growth function of a hearing impaired user (Pandey [0017]).
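For illustration only, a minimal overlap-add resynthesis consistent with the Pandey mapping above (window length L equal to 4 times the shift S); the toy signal and parameter values are hypothetical:

```python
import numpy as np

def overlap_add(frames, shift):
    # Sum processed frames back together at their original offsets; with a
    # 75% overlap, each output sample receives contributions from up to 4 frames.
    length = shift * (len(frames) - 1) + len(frames[0])
    out = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * shift : i * shift + len(frame)] += frame
    return out

L, S = 16, 4                                   # window length is 4 times the shifting length
x = np.random.randn(64)                        # toy input signal
frames = [x[i * S : i * S + L] for i in range((len(x) - L) // S + 1)]
y = overlap_add(frames, S)                     # a real system also applies synthesis windows
                                               # and normalization so that y reconstructs x
```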
While Yamanashi in view of Casper and Pandey provides a window length 4 times the shifting length and a 75% overlap, Yamanashi in view of Casper and Pandey does not specifically teach that the analyzed audio includes four sub-audios, and thus does not teach
the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio respectively include four sub-audios.
Archibald, however, teaches the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio respectively include four sub-audios (audio samples are divided into a sequence of sub-frames, where the sub-frames are grouped as frames, i.e. the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio, where there are four sub-frames per frame, i.e. analyzed audio respectively include four sub-audios, and the three last sub-frames of the earlier frame overlap with the three first sub-frames of the next frame Fig. 4,[0042-3]).
Yamanashi, Casper, Pandey, and Archibald are analogous art because they are from a similar field of endeavor in modifying audio volume to improve user experience. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the teachings of Yamanashi, as modified by Casper and Pandey, of a window length 4 times the shifting length and a 75% overlap with a frame divided into four sub-frames where two successive frames have three overlapping sub-frames between them as taught by Archibald. It would have been obvious to combine the references to enable better discrimination of speech and non-speech in a noise detector, as well as gain control at the sub-frame level (Archibald [0015-6],[0058-9]).
Regarding claim 14, Yamanashi in view of Casper, Pandey, and Archibald teaches claim 1, and Yamanashi further teaches
the volume adjustment command comprises a plurality of mode commands, and the mode commands respectively have different volume adjustment ratios (the user can select a volume adjustment mode from a set of modes, i.e. the volume adjustment command comprises a plurality of mode commands, where each mode has a different ratio of volumes of the voice to the background sound, i.e. mode commands respectively have different volume adjustment ratios [0035],[0039]).
Regarding claim 15, Yamanashi in view of Casper, Pandey, and Archibald teaches claim 14, and Casper further teaches
the remote control has a plurality of mode buttons corresponding to the mode commands (the user may have the ability to use one of several dials or slides on an application GUI of a smartphone connected wirelessly to the hearing device, i.e. the remote control has a plurality of mode buttons, to dial up or down the volume of a specific signal compared to the other signals, i.e. corresponding to the mode commands [0065],[0089-91]).
Where Yamanashi teaches that the different modes are volume ratios [0035],[0039].
And where the motivation to combine is the same as previously presented.
Claims 2-5, 7-10, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Yamanashi, in view of Casper, in view of Pandey, in view of Archibald, and further in view of Laroche (“Synthesis of Sinusoids via Non-Overlapping Inverse Fourier Transform”, IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 4, JULY 2000), hereinafter Laroche.
Regarding claim 2, Yamanashi in view of Casper, Pandey, and Archibald teaches claim 1, and Yamanashi further teaches
obtains a plurality of target analyzed sub-audios and corresponding non-target analyzed sub-audios, performs volume adjustment on each of the target analyzed sub-audios according to the volume adjustment command and mixes the volume-adjusted target analyzed sub-audio with the corresponding non-target analyzed sub-audio to obtain the analyzed audios (the audio source separation module, with a voice model and background sound model, i.e. using the speech analysis model and the separator, divides the audio signal into frames having a length of 25ms, and predetermined zones of 400ms/50 frames, then separates the audio source into voice and background sound for the predetermined zones, i.e. obtains a plurality of target analyzed sub-audios and corresponding non-target analyzed sub-audios, where the volume of the separated audio is changed to the ratio according to the mode, such as volume of voice and background equal, background completely suppressed, or voice completely suppressed, i.e. performs volume adjustment on each of the target analyzed sub-audios according to the volume adjustment command, and stream data is created with the volume based on the volume ratio of the selected volume adjustment mode, i.e. mixes the volume-adjusted target analyzed sub-audio with the corresponding non-target analyzed sub-audio to obtain the analyzed audios [0016],[0027-8],[0035],[0039-40]).
Where Pandey teaches obtains the repeated audio section according to the sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio…(an output signal is resynthesized by processing windows of the input signal, i.e. the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio, and overlap-add, where the frames have L samples with a 75% overlap, i.e. obtains the repeated audio section according to the sub-audio that is commonly repeated [0027-8]).
While Yamanashi in view of Casper, Pandey, and Archibald provides obtaining overlapped sub-frames, Yamanashi in view of Casper, Pandey, and Archibald does not specifically teach discarding remaining sub-frames, and thus does not teach
discards --remaining sub-audios--.
Laroche, however, teaches discards --remaining sub-audios-- (the boundary samples at the beginning and end of a frame, i.e. remaining sub-audios, are discarded (Sec. 3C and 3D)). Where the beginning boundary would be, for example, t0-t2 as per Fig. 2 of the instant application.
Yamanashi, Casper, Pandey, Archibald, and Laroche are analogous art because they are from a similar field of endeavor in analysis and modification of audio or speech signals. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the overlapped sub-frame teachings of Yamanashi, as modified by Casper, Pandey, and Archibald, with the discarding of boundary samples of frames as taught by Laroche. It would have been obvious to combine the references to remove samples that are more likely to have higher levels of distortion than those in the middle of a frame (Laroche, Sec. 3C and 3D).
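A purely illustrative sketch of the boundary-sample discarding relied upon from Laroche; the trim length of one sub-audio per boundary is a hypothetical choice for this example:

```python
import numpy as np

def trim_boundaries(frame, n_discard):
    # Keep only the middle of a synthesized frame; samples near the frame
    # boundaries, which are more prone to distortion, are discarded.
    return frame[n_discard : len(frame) - n_discard]

frame = np.random.randn(16)        # one analyzed frame of four sub-audios of 4 samples each
kept = trim_boundaries(frame, 4)   # discard one sub-audio at each boundary; keep the middle 8
```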
Regarding claim 3, Yamanashi in view of Casper, Pandey, and Archibald teaches claim 1, and Yamanashi further teaches
using the speech analysis model and a separator, obtains a plurality of target analyzed sub-audios, performs volume adjustment on each of the target analyzed sub-audios according to the volume adjustment command and mixes the volume-adjusted target analyzed sub-audio with the … sound to obtain the analyzed audios (the audio source separation module, with a voice model and background sound model, i.e. using the speech analysis model and a separator, divides the audio signal into frames having a length of 25ms, and predetermined zones of 400ms/50 frames, then separates the audio source into voice and background sound for the predetermined zones, i.e. obtains a plurality of target analyzed sub-audios, where the volume of the separated audio is changed to the ratio according to the mode, such as volume of voice and background equal, background completely suppressed, or voice completely suppressed, i.e. performs volume adjustment on each of the target analyzed sub-audios according to the volume adjustment command, and stream data is created with the volume based on the volume ratio of the selected volume adjustment mode, i.e. mixes the volume-adjusted target analyzed sub-audio…to obtain the analyzed audios [0016],[0027-8],[0035],[0039-40]).
Where Casper teaches mixes the volume-adjusted target analyzed sub-audio with the video sound… (the model may mix the estimate of the target, i.e. mixes the volume-adjusted target analyzed sub-audio, with the original signal, i.e. with the video sound [0021],[0066-8],[0126]).
Where Pandey teaches obtains the repeated audio section according to the sub-audio that is commonly repeated among the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio… (an output signal is resynthesized by processing windows of the input signal, i.e. the first analyzed audio, the second analyzed audio, the third analyzed audio, and the fourth analyzed audio, and overlap-add, where the frames have L samples with a 75% overlap, i.e. obtains the repeated audio section according to the sub-audio that is commonly repeated [0027-8]).
And where the motivation to combine is the same as previously presented.
While Yamanashi in view of Casper, Pandey, and Archibald provides obtaining overlapped sub-frames, Yamanashi in view of Casper, Pandey, and Archibald does not specifically teach discarding remaining sub-frames, and thus does not teach
discards --remaining sub-audios--.
Laroche, however, teaches discards --remaining sub-audios-- (the boundary samples at the beginning and end of a frame, i.e. remaining sub-audios, are discarded (Sec. 3C and 3D)). Where the beginning boundary would be, for example, t0-t2 as per Fig. 2 of the instant application.
Yamanashi, Casper, Pandey, Archibald, and Laroche are analogous art because they are from a similar field of endeavor in analysis and modification of audio or speech signals. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the overlapped sub-frame teachings of Yamanashi, as modified by Casper, Pandey, and Archibald, with the discarding of boundary samples of frames as taught by Laroche. It would have been obvious to combine the references to remove samples that are more likely to have higher levels of distortion than those in the middle of a frame (Laroche, Sec. 3C and 3D).
Regarding claims 4 and 9, Yamanashi in view of Casper, Pandey, Archibald, and Laroche teaches claims 3 and 2, and Casper further teaches
using the speech analysis model to obtain a plurality of pieces of mask information, and the separator obtains the target analyzed sub-audios according to each of pieces of the mask information and the video sound (a neural network is trained to isolate one or more sound sources by learning to generate a complex mask for an audio clip, such as a mask that isolates a target clip of speech from background noise, and operating on the model in 4ms sample lengths, i.e. using the speech analysis model to obtain a plurality of pieces of mask information, and when the mask is applied to data comprising both speech and background noise, i.e. according to each of pieces of the mask information and the video sound, the model can estimate a signal containing only the target content, such as speech, i.e. separator obtains the target analyzed sub-audios [0054],[0056-8]).
Where the motivation to combine is the same as previously presented.
Regarding claims 5 and 10, Yamanashi in view of Casper, Pandey, Archibald, and Laroche teaches claims 4 and 9, and Casper further teaches
the operation is performed according to the analyzed audio, the speech analysis model and the hidden layer state information generated by the previous operation (the RNN shares parameters across different parts of the neural network such that a present value of a variable is used at a future time by using updated activations from the current sample, and outputs can be from hidden layers, i.e. performed according to the analyzed audio, the speech analysis model and the hidden layer state information, and the updated activations from the current sample are used by the computation for subsequent samples, i.e. generated by the previous operation Fig. 10,[0058],[0137],[0139]).
Where the motivation to combine is the same as previously presented.
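An illustrative sketch (a hypothetical minimal recurrent cell, not Casper’s actual network) of performing each operation according to the hidden layer state information generated by the previous operation, with toy dimensions and random weights assumed:

```python
import numpy as np

def rnn_step(x, h_prev, W, U):
    # One analysis operation: the output depends on the current input and on the
    # hidden layer state information produced by the previous operation.
    return np.tanh(W @ x + U @ h_prev)

rng = np.random.default_rng(0)
W, U = rng.standard_normal((8, 4)), rng.standard_normal((8, 8))
h = np.zeros(8)                          # initial hidden state for the first operation
for x in rng.standard_normal((4, 4)):    # four successive inputs (e.g. sub-audio features)
    h = rnn_step(x, h, W, U)             # state from the previous analysis feeds the next
```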
Regarding claims 7 and 12, Yamanashi in view of Casper, Pandey, Archibald, and Laroche teaches claims 5 and 10, and Yamanashi further teaches
the volume adjustment command comprises a target volume adjustment command (a remote controller able to send a signal to control the television apparatus, i.e. command, where the user can select a volume adjustment mode through the television apparatus, i.e. volume adjustment command, which can include whether voice and background sound are output at the same volume, voice is completely suppressed, or background sound is completely suppressed, i.e. target volume adjustment command [0018],[0035]); and
Casper further teaches the remote control has a target volume adjustment button for sending the target volume adjustment command (the user may have the ability to use a dial or slide on an application GUI of a smartphone connected wirelessly to the hearing device, i.e. the remote control has a…button, to dial up or down the volume of different signals, i.e. a target volume adjustment button for sending the target volume adjustment command [0065],[0089-91]).
Where the motivation to combine is the same as previously presented.
Regarding claims 8 and 13, Yamanashi in view of Casper, Pandey, Archibald, and Laroche teaches claims 7 and 12, and Yamanashi further teaches
divides the video sound into a plurality of continuous original sub-audio groups, each of the original sub-audio groups comprises continuous sub-audios…(the audio source separation module divides the audio signal into frames having a length of 25ms, and predetermined zones of 400ms/50 frames, then separates the audio source into voice and background sound for the predetermined zones, i.e. divides the video sound into a plurality of continuous original sub-audio groups, each of the original sub-audio groups comprises continuous sub-audios [0016],[0027-8]); and
…sequentially obtains the original sub-audio groups and performs the plurality of operations by using the speech analysis model (the audio source separation module, with a voice model and background sound model, i.e. using the speech analysis model, divides the audio signal into frames having a length of 25ms, and predetermined zones of 400ms/50 frames, i.e. sequentially obtains the original sub-audio groups, then separates the audio source into voice and background sound for the predetermined zones and adjusts the volume ratios [0016],[0027-8],[0035],[0039-40]).
Archibald further teaches divides the…sound into a plurality of continuous original sub-audio groups, each of the original sub-audio groups comprises continuous sub-audios, and a tail sub-audio in the original sub-audio group is the same as a head sub-audio in the next original sub-audio group (audio samples are divided into a sequence of sub-frames, where the sub-frames are grouped as frames, i.e. divides the sound into a plurality of continuous original sub-audio groups, where there are four sub-frames per frame, i.e. each of the original sub-audio groups comprises continuous sub-audios, and the three last sub-frames of the earlier frame overlap with the three first sub-frames of the next frame, i.e. a tail sub-audio in the original sub-audio group is the same as a head sub-audio in the next original sub-audio group Fig. 4,[0042-3]).
Where the motivation to combine is the same as previously presented.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NICOLE A K SCHMIEDER/Primary Examiner, Art Unit 2659