Prosecution Insights
Last updated: April 19, 2026
Application No. 18/068,494

Speech Enhancement Based on Metadata Associated with Audio Content

Final Rejection §103
Filed: Dec 19, 2022
Examiner: GAY, SONIA L
Art Unit: 2657
Tech Center: 2600 (Communications)
Assignee: Sonos Inc.
OA Round: 2 (Final)

Grant Probability: 82% (Favorable)
OA Rounds: 3-4
To Grant: 3y 0m
With Interview: 93%

Examiner Intelligence

Career Allow Rate: 82%, above average (+20.0% vs TC avg); 701 granted / 855 resolved
Interview Lift: +11.4% (moderate), among resolved cases with interview
Typical Timeline: 3y 0m avg prosecution; 33 currently pending
Career History: 888 total applications across all art units

Statute-Specific Performance

§101: 10.2% (-29.8% vs TC avg)
§103: 50.6% (+10.6% vs TC avg)
§102: 11.9% (-28.1% vs TC avg)
§112: 13.9% (-26.1% vs TC avg)

Tech Center averages are estimates; based on career data from 855 resolved cases.

Office Action

§103
DETAILED ACTION

This action is in response to the amendment filed on 10/08/2025.

Response to Amendment

Applicant’s amendment filed on 10/08/2025 has been entered. Claims 1, 3 – 5, 10, 13, 16, 18 – 22 and 29 have been amended. Claims 9 and 25 have been canceled. Claim 32 has been added. Claims 1 – 8, 10 – 24 and 26 – 32 are still pending in this application, with claims 1 and 16 being independent.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 3, 5, 10 – 12, 16, 18, 26 – 28 and 32 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464) and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”).

For claim 1, Hirsch discloses a playback device (audio output device) (Abstract; [0030]) comprising: for audio content, determine at least one portion of the audio content comprising speech based at least in part on metadata associated with the audio content (“when an audio stream is playing 410, the audio content type may be identified through metadata associated with the audio stream.
For example, the metadata may be contained within the audio file itself or may be ascertained from the operating system of the audio output device”, Fig.4, 411; [0063]); for each of one or more portions of the audio content determined to comprise speech, identify one or more audio playback parameters (parameters for multiband compression including ratio, threshold and gain) for application to the at least one portion of the audio content determined to comprise speech (Fig.4,412; [0058] [0061 – 0063] [0065] [0068] [0069]); and play back the audio content, wherein playing back the audio content comprises applying the identified one or more audio playback parameters to the at least one portion of the audio content determined to comprise speech (Fig.4,413 and 414; [0063] [0065]).

Yet, Hirsch fails to teach the following: the playback device comprises at least one network interface, one or more processors, and a tangible, non-transitory computer-readable media; and program instructions stored in the tangible, non-transitory computer-readable media that are executable by the one or more processors; for each of one or more portions of the at least one portion of the audio content determined to comprise speech, apply a speech recognition algorithm to the respective portion of the audio content determined to comprise speech to identify (i) one or more first sub-portions of the audio content comprising speech and (ii) one or more second sub-portions of the audio content lacking speech, wherein a first sub-portion of the audio content identified as comprising speech has a shorter duration than the respective portion of the audio content determined to comprise speech; and the one or more playback parameters are applied to the one or more first sub-portions of the audio content identified as comprising speech.
However, Curtis discloses a system and method for audio content recognition (Abstract), wherein an audio output device (A/V playback device, Fig.1, 114 and Fig.5, 500; [0051] [0106]) comprises the following, wherein the audio output device performs audio processing based on the type of audio content ([0080] [0082] [0085]): at least one network interface (transceiver, Fig.1,116 and Fig.5, 524; [0051] [0054] [0114]); one or more processors (Fig.1, 142 and Fig.5, 504; [0051] [0107]); a tangible, non-transitory computer-readable media (Fig.1, 140 and Fig.5, 508; [0051] [0110] [0118]); and program instructions stored in the tangible, non-transitory computer-readable media that are executable by the one or more processors ([0118]).

Additionally, Kosaka discloses a method for detecting speech sounds in movie sequences, comprising the following: applying a speech recognition algorithm (VAD) to movie content comprising speech (Figure 1) to identify (i) one or more first sub-portions of the audio content comprising speech and (ii) one or more second sub-portions of the audio content lacking speech (instrumental sound/music, singing, silence, noise) (III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions), wherein a first sub-portion of the audio content identified as comprising speech has a shorter duration than the respective portion of the audio content determined to comprise speech (Figure 1).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify Hirsch’s teachings with Curtis’ teachings so that the playback device comprises the following, for the purpose of configuring the playback device to accept, process and output received audio data (Curtis, [0003 – 0006]): at least one network interface; one or more processors; a tangible, non-transitory computer-readable media; and program instructions stored in the tangible, non-transitory computer-readable media that are executable by the one or more processors.

Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch and Curtis in the same way that Kosaka’s invention has been improved to achieve the following predictable results, for the purpose of improving the provision of content-specific, personalized audio replay on consumer devices (Hirsch, [0002 – 0007]): the device further comprises a VAD speech recognition algorithm; the VAD speech recognition algorithm is further applied to the movie content (Hirsch, [0065]) to identify (i) one or more first sub-portions of the audio content comprising speech and (ii) one or more second sub-portions of the audio content lacking speech, wherein a first sub-portion of the audio content identified as comprising speech has a shorter duration than the respective portion of the audio content determined to comprise speech; and the one or more playback parameters are further applied to the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, Figure 6; [0063] [0065]).
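For context, the two-stage detection the rejection attributes to the Hirsch/Kosaka combination (metadata flags coarse speech portions; a VAD then isolates shorter speech sub-portions, and playback parameters apply only to those) can be sketched roughly as below. This is an illustrative sketch only: the metadata layout, the toy energy-based VAD, and every name are hypothetical, not drawn from Hirsch, Curtis, or Kosaka.

```python
def speech_portions_from_metadata(metadata):
    """Return (start, end) sample ranges that metadata flags as speech."""
    return [(seg["start"], seg["end"])
            for seg in metadata["segments"] if seg["type"] == "speech"]

def vad_sub_portions(samples, start, end, frame=160, threshold=0.01):
    """Split one metadata-flagged portion into shorter speech and non-speech
    sub-portions using a toy energy-based VAD."""
    speech, silence = [], []
    for i in range(start, end, frame):
        j = min(i + frame, end)
        energy = sum(s * s for s in samples[i:j]) / max(j - i, 1)
        (speech if energy >= threshold else silence).append((i, j))
    return speech, silence

def apply_parameters(samples, sub_portions, gain=2.0):
    """Apply a playback parameter (here, a simple gain) only to the
    sub-portions identified as speech; other samples pass through."""
    out = list(samples)
    for i, j in sub_portions:
        for k in range(i, j):
            out[k] *= gain
    return out
```

A real device would use a trained VAD and multiband compression parameters rather than a single gain; the point is only the flow from metadata portion to shorter VAD sub-portion to selective processing.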
For claim 3, Hirsch and Kosaka further disclose, wherein the one or more audio playback parameters comprise an amplitude of one or more frequency ranges of the audio content (Hirsch, dynamic range compression is performed on the audio content, wherein dynamic range compression modifies the amplitude of a signal in a subband … the output sound level of a compression system is different than an input sound level, Fig.5; [0017] [0018] [0058] [0062] [0064] [0076]) and wherein the program instructions that are executable by the one or more processors such that the playback device is configured to play back the audio content, wherein playing back the audio content comprises applying the identified one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech, comprise program instructions that are executable by the one or more processors such that the playback device is configured to: adjust an amplitude of one or more frequency ranges of the audio content during playback of the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, dynamic range compression is performed on the audio content, wherein dynamic range compression modifies the amplitude of a signal in a subband… the output sound level of a compression system is different than an input sound level, Fig.5; [0017] [0018] [0058] [0062 – 0064] [0076]) (Kosaka, Figure 1; III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions). 
For claim 5, Hirsch and Kosaka further disclose, wherein: applying the identified one or more playback parameters to the one or more first sub-portions of the audio content identified as comprising speech comprises applying one or more filters to the one or more first sub-portions of the audio content identified as comprising speech during playback, wherein the one or more filters are configured to attenuate frequencies outside of a defined frequency range (Hirsch, Fig.5; [0064]) (Kosaka, Figure 1; III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions).

For claims 10 and 26, Hirsch further discloses wherein the program instructions that are executable by the one or more processors such that the playback device is configured to determine at least one portion of the audio content comprising speech based at least in part on metadata associated with the audio content comprise program instructions that are executable by the one or more processors such that the playback device is configured to: apply a speech recognition algorithm (Hirsch, speech detection algorithm) to the audio content to identify at least one segment of the audio content comprising speech (Hirsch, [0063]); and for the at least one segment of the audio content identified as comprising speech, check whether the at least one segment includes speech based at least in part on metadata associated with the audio content (Hirsch, [0063]).
For claims 11 and 27, Hirsch further discloses, wherein the program instructions comprise program instructions that are executable by the one or more processors such that the playback device is configured to: for the audio content, (i) determine at least one portion of the audio content comprising scene-specific audio (Hirsch, audio associated with video) based at least in part on the metadata associated with the audio content (Hirsch, Fig.4,411; [0029] [0063]) and (ii) for the at least one portion of the audio content determined to comprise scene-specific audio, (a) identify at least one audio playback parameter for application to the at least one portion of the audio content comprising scene-specific audio (Hirsch, Fig.4,412; [0029] [0058] [0061 – 0063] [0065] [0068] [0069]), and (b) apply the identified at least one audio playback parameter to the at least one portion of the audio content determined to comprise scene-specific audio during playback of the audio content (Hirsch, Fig.4,413 and 414; [0063] [0065]).

For claims 12 and 28, Hirsch further discloses, wherein the program instructions comprise program instructions that are executable by the one or more processors such that the playback device is configured to: for the audio content, (i) determine at least one portion of the audio content lacking speech based at least in part on the metadata associated with the audio content (music, Fig.4, 410; [0063]), and (ii) for the at least one portion of the audio content determined to lack speech, (a) identify at least one audio playback parameter for application to the at least one portion of the audio content determined to lack speech (Fig.4,411; [0063] [0065]), and (b) apply the identified at least one audio playback parameter to the at least one portion of the audio content determined to lack speech during playback of the audio content (Fig.4, 413 and 414; [0063] [0065]).
For claim 16, Hirsch discloses a computing system (Abstract; [0030]) comprising: for audio content, determine at least one portion of the audio content comprising speech based at least in part on metadata associated with the audio content (“when an audio stream is playing 410, the audio content type may be identified through metadata associated with the audio stream. For example, the metadata may be contained within the audio file itself or may be ascertained from the operating system of the audio output device”, Fig.4, 411; [0063]) for each of one or more portions of the audio content determined to comprise speech, identify one or more audio playback parameters (parameters for multiband compression including ratio, threshold and gain) for application to the at least one portion of the audio content determined to comprise speech (Fig.4,412; [0058] [0061 – 0063] [0065] [0068] [0069]); and play back the audio content, wherein playing back the audio content comprises applying the identified one or more audio playback parameters to the at least one portion of the audio content determined to comprise speech (Fig.4,413 and 414; [0063] [0065]). 
Yet, Hirsch fails to teach the following: the computing system comprises the following, wherein the computing system causes a playback device to play back the audio content; at least one network interface; one or more processors; a tangible, non-transitory computer-readable media; and program instructions stored in the tangible, non-transitory computer-readable media that are executable by the one or more processors; for each of one or more portions of the at least one portion of the audio content determined to comprise speech, apply a speech recognition algorithm to the respective portion of the audio content determined to comprise speech to identify (i) one or more first sub-portions of the audio content comprising speech and (ii) one or more second sub-portions of the audio content lacking speech, wherein a first sub-portion of the audio content identified as comprising speech has a shorter duration than the respective portion of the audio content determined to comprise speech; and the one or more playback parameters are applied to the one or more first sub-portions of the audio content identified as comprising speech.

However, Curtis discloses a system and method for audio content recognition (Abstract), wherein a computing system (A/V playback device, Fig.1, 114 and Fig.5, 500; [0051] [0106]) comprises the following, to cause a playback device (external speaker, Fig.1, 118) to play back audio content ([0053]), wherein the audio output device performs audio processing based on the type of audio content ([0080] [0082] [0085]): at least one network interface (transceiver, Fig.1,116 and Fig.5, 524; [0051] [0054] [0114]); one or more processors (Fig.1, 142 and Fig.5, 504; [0051] [0107]); a tangible, non-transitory computer-readable media (Fig.1, 140 and Fig.5, 508; [0051] [0110] [0118]); and program instructions stored in the tangible, non-transitory computer-readable media that are executable by the one or more processors ([0118]).
Additionally, Kosaka discloses a method for detecting speech sounds in movie sequences, comprising the following: applying a speech recognition algorithm (VAD) to movie content comprising speech (Figure 1) to identify (i) one or more first sub-portions of the audio content comprising speech and (ii) one or more second sub-portions of the audio content lacking speech (instrumental sound/music, singing, silence, noise) (III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions), wherein a first sub-portion of the audio content identified as comprising speech has a shorter duration than the respective portion of the audio content determined to comprise speech (Figure 1).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify Hirsch’s teachings with Curtis’ teachings so that the computing system comprises the following, for the purpose of configuring the computing system to accept, process and output received audio data, wherein the received audio data is output using a playback device (Curtis, [0003 – 0006] [0053]): at least one network interface; one or more processors; a tangible, non-transitory computer-readable media; and program instructions stored in the tangible, non-transitory computer-readable media that are executable by the one or more processors.
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch and Curtis in the same way that Kosaka’s invention has been improved to achieve the following predictable results, for the purpose of improving the provision of content-specific, personalized audio replay on consumer devices (Hirsch, [0002 – 0007]): the device further comprises a VAD speech recognition algorithm; the VAD speech recognition algorithm is further applied to the movie content (Hirsch, [0065]) to identify (i) one or more first sub-portions of the audio content comprising speech and (ii) one or more second sub-portions of the audio content lacking speech, wherein a first sub-portion of the audio content identified as comprising speech has a shorter duration than the respective portion of the audio content determined to comprise speech; and the one or more playback parameters are further applied to the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, Figure 6; [0063] [0065]).

For claim 18, Hirsch, Curtis and Kosaka further disclose wherein the program instructions that are executable by the one or more processors such that the computing system is configured to cause a playback device to play back the audio content with the identified one or more audio playback parameters applied to the one or more first sub-portions of the audio content identified as comprising speech comprise program instructions that are executable by the one or more processors such that the computing system is configured to: apply the one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, [0063] [0065]) (Curtis, [0053] [0080 – 0082] [0085]) (Kosaka, Figure 1; III Proposed methods, IV. VAD Algorithm V.
Experimental Conditions and VI Results and Discussions); and after applying the one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech, transmit the audio content to the playback device (Hirsch, [0063] [0065]) (Curtis, [0053] [0064 – 0069]) (Kosaka, Figure 1; III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions).

For claim 32, Hirsch and Kosaka further disclose, wherein playing back the audio content further comprises playing back the one or more second sub-portions of the audio content identified as lacking speech without applying the identified one or more audio playback parameters to all or a portion of each of the one or more second sub-portions of the audio content identified as lacking speech (Hirsch, The parameters are applied to the speech/voice, Fig.4,413 and 414, Figure 6; [0063] [0065]) (Kosaka, The movie content comprises singing, speech and instrumental music, Figure 1; III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions).

Claim(s) 2, 6-8, 17, 23 and 24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464), and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”) and further in view of Scheirer et al. (US 2015/0237454) (“Scheirer”).

For claims 2 and 17, the combination of Hirsch, Curtis and Kosaka fails to teach, wherein the metadata associated with the audio content comprises closed caption data associated with the audio content.
However, Scheirer discloses an audio system and method for manipulating audio material for playback (Abstract), comprising the following: determining a type of audio content using metadata, wherein the metadata associated with audio content comprises closed caption data associated with the audio content ([0003 – 0005] [0023]); and selecting a set of audio processing instructions based on the content type ([0006]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Scheirer’s invention has been improved to achieve the following predictable results, for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]) (Scheirer, [0001] [0002]): the metadata associated with the audio content comprises closed caption data associated with the audio content.

For claims 6 and 23, the combination of Hirsch, Curtis and Kosaka further discloses, wherein the program instructions that are executable by the one or more processors such that the playback device is configured to determine at least one portion of the audio content comprising speech based at least in part on metadata associated with the audio content comprise program instructions that are executable by the one or more processors such that the playback device is configured to: determine at least one portion of the audio content comprising speech based on a datafile received via the at least one network interface (Hirsch, [0063]) (Curtis, Fig.2, 220 and 225; [0065] [0066]). Yet, the combination of Hirsch, Curtis and Kosaka fails to teach, wherein the datafile comprises closed caption data that is time-aligned to the audio content.
However, Scheirer discloses an audio system and method for manipulating audio material for playback (Abstract), comprising the following: determining a type of audio content using metadata, wherein the metadata associated with audio content comprises closed caption data which is time-aligned with audio content ([0003 – 0005] [0023]); and selecting a set of audio processing instructions based on the content type ([0006]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Scheirer’s invention has been improved to achieve the following predictable results, for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]) (Scheirer, [0001] [0002]): the audio content is determined based on metadata received with the audio content, wherein the metadata comprises closed caption data which is time-aligned with the audio content.

For claim 7, the combination of Hirsch, Curtis and Kosaka further discloses, wherein the program instructions that are executable by the one or more processors such that the playback device is configured to determine at least one portion of the audio content comprising speech based at least in part on metadata associated with the audio content comprise program instructions that are executable by the one or more processors such that the playback device is configured to: determine at least one portion of the audio content comprising speech based on data received via a High-Definition Multimedia Interface (HDMI) Audio Return Channel (ARC) connection between the playback device and a video display device (Hirsch, [0063]) (Curtis, Fig.2, 220 and 225; [0065] [0066]).
Yet, the combination of Hirsch, Curtis and Kosaka fails to teach that audio content is determined based on received closed caption data. However, Scheirer discloses an audio system and method for manipulating audio material for playback (Abstract), comprising the following: determining a type of audio content using metadata, wherein the metadata associated with audio content comprises closed caption data associated with the audio content ([0003 – 0005] [0023]); and selecting a set of audio processing instructions based on the content type ([0006]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Scheirer’s invention has been improved to achieve the following predictable results, for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]) (Scheirer, [0001] [0002]): the audio content is determined based on metadata received with the audio content, wherein the metadata comprises closed caption data.
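The closed-caption limitations at issue in claims 2, 6, 7, 17, 23 and 24 turn on caption cues being time-aligned to the audio. One way such cues could mark speech portions is sketched below; the cue tuple format and the `merge_gap` heuristic are assumptions for illustration, not details from Scheirer.

```python
def speech_intervals_from_captions(cues, merge_gap=0.5):
    """Derive speech intervals from time-aligned caption cues.

    Each cue is a (start_seconds, end_seconds, text) tuple. Cues separated
    by less than `merge_gap` seconds are merged into one interval, since
    brief inter-cue gaps usually still contain dialogue.
    """
    intervals = []
    for start, end, _text in sorted(cues):
        if intervals and start - intervals[-1][1] < merge_gap:
            # Extend the previous interval instead of starting a new one.
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
        else:
            intervals.append((start, end))
    return intervals
```

The resulting intervals would then serve as the metadata-derived speech portions to which dialogue-enhancing playback parameters are applied.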
For claims 8 and 24, the combination of Hirsch, Curtis and Kosaka further discloses, wherein the program instructions that are executable by the one or more processors such that the playback device is configured to determine at least one portion of the audio content comprising speech based at least in part on metadata associated with the audio content comprise program instructions that are executable by the one or more processors such that the playback device is configured to: determine at least one portion of the audio content comprising speech based on data contained within the metadata associated with the audio content (Hirsch, [0063]), wherein the metadata and the audio content are received via the at least one network interface (High-Definition Multimedia Interface (HDMI) Audio Return Channel (ARC) connection between the playback device and a video display device) (Hirsch, [0063]) (Curtis, Fig.2, 220 and 225; [0065] [0066]).

Yet, the combination of Hirsch, Curtis and Kosaka fails to teach that audio content is determined based on closed caption data contained within the metadata. However, Scheirer discloses an audio system and method for manipulating audio material for playback (Abstract), comprising the following: determining a type of audio content using metadata, wherein the metadata associated with audio content comprises closed caption data associated with the audio content ([0003 – 0005] [0023]); and selecting a set of audio processing instructions based on the content type ([0006]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Scheirer’s invention has been improved to achieve the following predictable results, for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]) (Scheirer, [0001] [0002]): the audio content is determined based on metadata received with the audio content, wherein the metadata comprises closed caption data.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464), and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”) and further in view of Kuivalainen et al. (US 2021/0382679) (“Kuivalainen”).

For claim 4, the combination of Hirsch, Curtis and Kosaka further discloses, wherein the program instructions that are executable by the one or more processors such that the playback device is configured to adjust an amplitude of one or more frequency ranges of the audio content during playback of the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, dynamic range compression is performed on the audio content, wherein dynamic range compression modifies the amplitude of a signal in a subband… the output sound level of a compression system is different than an input sound level, Fig.5; [0017] [0018] [0058] [0062 – 0064] [0076]) (Kosaka, Figure 1; III Proposed methods, IV. VAD Algorithm V. Experimental Conditions and VI Results and Discussions).
Yet the combination of Hirsch, Curtis and Kosaka fails to teach that the playback device is configured to at least one of (i) increase the amplitude of the audio content within a first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech or (ii) decrease the amplitude of the audio content within a second frequency range different than the first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech.

However, Kuivalainen discloses a system and method for adaptively modulating audio content (Abstract), comprising the following: (i) increase the amplitude of the audio content within a first frequency range during playback of the at least one portion of the audio content (dynamic range compression involving upward compression, [0004]) or (ii) decrease the amplitude of the audio content within a second frequency range different than the first frequency range during playback of the at least one portion of the audio content (dynamic range compression involving downward compression, [0004]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Kuivalainen’s invention has been improved to achieve the following predictable results, for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]): (i) increase the amplitude of the audio content within a first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, perform dynamic range compression on speech), or (ii) decrease the amplitude of the audio content within a second frequency range different than the first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech.

Claim(s) 13, 14, 29 and 30 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464), and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”) and further in view of Asayama et al. (US 2009/0262256) (“Asayama”).
For claims 13 and 29, the combination of Hirsch, Curtis and Kosaka fails to teach, wherein the program instructions comprise program instructions that are executable by the one or more processors such that the playback device is configured to: after receiving a playback adjustment command after applying the identified one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech during playback of the audio content, generate one or more modified audio playback parameters based on the playback adjustment command; and for an individual first sub-portion of the audio content identified as comprising speech played after receiving the playback adjustment command, apply the one or more modified audio playback parameters to the individual first sub-portion of the audio content identified as comprising speech during playback of the audio content.

However, Asayama discloses a video/sound output device and external speaker control device (Abstract), comprising the following: a playback adjustment command is received during playback of audio content ([0105 – 0107] [0111 – 0113] [0123] [0136 – 0139] [0153 – 0156]); one or more modified audio playback parameters are generated based on the playback adjustment command ([0157] [0158]); and apply the one or more modified playback parameters to the audio content ([0157] [0158]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Asayama’s invention has been improved to achieve the following predictable results for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]): after receiving a playback adjustment command after applying the identified one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech during playback of the audio content, generate one or more modified audio playback parameters based on the playback adjustment command; and for an individual first sub-portion of the audio content identified as comprising speech played after receiving the playback adjustment command, apply the one or more modified audio playback parameters to the individual first sub-portion of the audio content identified as comprising speech during playback of the audio content.

For claims 14 and 30, Asayama further discloses, wherein the playback adjustment command comprises at least one of (i) a volume change (Asayama, [0153 - 0158]), or (ii) an equalization change.

Claim(s) 15 and 31 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464), and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”) and further in view of Yang et al. (US 2021/0241759) (“Yang”).
For claims 15 and 31, the combination of Hirsch, Curtis and Kosaka fails to teach, wherein the program instructions comprise program instructions that are executable by the one or more processors such that the playback device is configured to: for a specific portion of the audio content comprising speech that includes a wake word associated with a voice assistant service, informing a wake word detection algorithm to disregard the wake word within the speech contained in that specific portion of the audio content. However, Yang discloses a system and method for ignoring a wakeword received at a speech-enabled listening device (Abstract), comprising the following: receiving a specific portion of audio content comprising speech that includes a wake word associated with a voice assistant service ([0002-0004] [0035 - 0038]); and informing a wake word detection algorithm (wakeword spotter, Fig.4, 118; [0031]) to disregard the wakeword within the speech contained in the specific portion of the audio content (setting an ignore flag to true as informing the wake word spotter to ignore the wake word, [0037] [0039 - 0041]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Yang’s invention has been improved to achieve the following predictable results for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]), wherein the playback system further comprises a speech enabled device (Yang, [0002 – 0004] [0035]): for a specific portion of the audio content comprising speech that includes a wake word associated with a voice assistant service, informing a wake word detection algorithm to disregard the wake word within the speech contained in that specific portion of the audio content.
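The ignore-flag mechanism attributed to Yang above can be sketched as a minimal wake-word spotter. The class name, flag name, and wake phrase below are hypothetical stand-ins, and the sketch assumes a transcript-level detector rather than Yang's actual acoustic implementation.

```python
class WakeWordSpotter:
    """Minimal wake-word detector with an ignore flag (illustrative sketch).

    When a portion of the audio content being played back is known to
    contain the wake word, the playback device can set `ignore` to True
    so the spotter disregards the wake word within that portion instead
    of triggering on the device's own output.
    """
    def __init__(self, wake_word: str):
        self.wake_word = wake_word
        self.ignore = False

    def process(self, transcript: str) -> bool:
        if self.ignore:
            return False  # flagged portion: disregard the wake word
        return self.wake_word in transcript.lower()

spotter = WakeWordSpotter("hey sonos")  # hypothetical wake phrase
spotter.ignore = True                   # portion known to contain the wake word
detected_during_playback = spotter.process("Hey Sonos, play music")
spotter.ignore = False
detected_after = spotter.process("hey sonos turn it up")
```

Here `detected_during_playback` is False while `detected_after` is True, mirroring the flag-gated behavior described in the rejection.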
Claims 19, 20 and 22 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464), and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”) and further in view of Porter et al. (US 2017/0207762) (“Porter”).

For claim 19, the combination of Hirsch, Curtis and Kosaka fails to teach, wherein the program instructions that are executable by the one or more processors such that the computing system is configured to cause a playback device to play back the audio content with the identified one or more audio playback parameters applied to the one or more first sub-portions of the audio content identified as comprising speech comprise program instructions that are executable by the one or more processors such that the computing system is configured to: generate a set of audio playback control instructions that are time-aligned to the audio content, wherein the playback control instructions comprise instructions to apply the identified one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech; and transmit, to the playback device, the set of audio playback control instructions that are time-aligned to the audio content.
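The time-aligned arrangement recited in claim 19 can be sketched as a lookup of playback parameters by playback position. The instruction tuples and parameter names below are hypothetical, assuming each control instruction carries a (start, end) window covering a speech sub-portion.

```python
from bisect import bisect_right

# Hypothetical time-aligned control instructions: (start_sec, end_sec, params),
# sorted by start time, one window per speech sub-portion.
instructions = [
    (0.0, 12.5, {"speech_gain_db": 4.0}),
    (30.0, 41.0, {"speech_gain_db": 4.0}),
]

def params_at(t: float):
    """Return the playback parameters time-aligned to position t, if any.

    Binary-search for the last window starting at or before t, then check
    that t falls inside it. Outside every window (a non-speech portion),
    default playback parameters apply.
    """
    starts = [start for start, _, _ in instructions]
    i = bisect_right(starts, t) - 1
    if i >= 0 and instructions[i][1] > t:
        return instructions[i][2]
    return None  # non-speech portion: no speech-specific parameters
```

During playback, the device would call `params_at` with the current stream position and apply the returned parameters to that sub-portion.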
However, Porter discloses a system and method for the purpose of correcting an unknown content audio signal that is input to an audio power amplifier (Abstract), comprising the following: generating a set of audio playback control instructions (correction values of parameters of audio signal processing blocks including dynamic range compression and gain) that are time-aligned to audio content (the correction values are time-aligned to the user content signal that is played back, [0014] [0016 – 0020] [0022] [0030 – 0034]), wherein the playback control instructions comprise instructions to apply the identified one or more audio playback parameters to the at least one portion of the audio content during playback of the audio content ([0016] [0017]); and transmit, to playback devices (audio signal processing blocks), the set of playback control instructions that are time-aligned to the audio content (Fig.1; [0016] [0017] [0022] [0024] [0031]).

Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis and Kosaka in the same way that Porter’s invention has been improved to achieve the following predictable results for the purpose of improving user experience by efficiently (preventing overcompensating) adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]) (Porter, [0001] [0003]): further generate a set of audio playback control instructions (correction values) that are time-aligned to the audio content, wherein the playback control instructions comprise instructions to apply the identified one or more audio playback parameters to the one or more first sub-portions of the audio content identified as comprising speech; and transmit, to the playback device (Hirsch, [0063]) (Curtis, external speaker, wherein the external speaker may comprise the audio processing blocks including DRC, Fig.1,
118; [0053]), the set of audio playback control instructions that are time-aligned to the audio content.

For claim 20, Hirsch, Curtis, Kosaka and Porter further disclose, wherein the one or more audio playback parameters comprise an amplitude of one or more frequency ranges of the audio content, and wherein the set of audio playback control instructions that are time-aligned to the audio content comprise playback control instructions that are executable by the playback device such that the playback device is configured to: adjust an amplitude of one or more frequency ranges of the audio content during playback of the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, dynamic range compression is performed on the audio content, wherein dynamic range compression modifies the amplitude of a signal in a subband … the output sound level of a compression system is different than an input sound level, Fig.5; [0017] [0018] [0058] [0062] [0064] [0076]) (Curtis, [0053]) (Kosaka, Figure 1; III. Proposed Methods, IV. VAD Algorithm, V. Experimental Conditions, and VI. Results and Discussions) (Porter, instructions comprise correction values of parameters of audio signal processing blocks including dynamic range compression and gain, [0014] [0016 – 0020] [0022] [0030 – 0034]).

For claim 22, Hirsch, Curtis, Kosaka and Porter further disclose, wherein playback parameters comprise a set of one or more equalization settings (Hirsch, [0063] [0065]) (Porter, [0026]) and wherein the set of audio playback control instructions that are time-aligned to the audio content comprise playback control instructions that are executable by the playback device such that the playback device is configured to apply one or more filters to the one or more first sub-portions of the audio content identified as comprising speech during playback (Hirsch, [0063] [0065]) (Curtis, [0053]) (Kosaka, Figure 1; III. Proposed Methods, IV. VAD Algorithm, V.
Experimental Conditions, and VI. Results and Discussions) (Porter, [0014] [0016 – 0020] [0022] [0030 – 0034]), wherein one or more filters are configured to attenuate frequencies outside of a defined frequency range (Hirsch, Fig.5; [0064]) (Curtis, [0016]).

Claim 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hirsch et al. (US 2021/0326099) (“Hirsch”) in view of Curtis (US 2020/0089464), and further in view of Kosaka et al. (“Improving Voice Activity Detection for Multimodal Movie Dialogue Corpus”) (“Kosaka”), and further in view of Porter et al. (US 2017/0207762) (“Porter”) and further in view of Kuivalainen et al. (US 2021/0382679) (“Kuivalainen”).

For claim 21, Hirsch, Curtis, Kosaka and Porter further disclose, wherein the one or more audio playback parameters comprise an amplitude of one or more frequency ranges of the audio content during playback of the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, dynamic range compression is performed on the audio content, wherein dynamic range compression modifies the amplitude of a signal in a subband… the output sound level of a compression system is different than an input sound level, Fig.5; [0017] [0018] [0058] [0062 – 0064] [0076]) (Porter, dynamic range compression and gain, [0014] [0016 – 0020] [0022] [0030 – 0034]).
Yet, the combination of Hirsch, Curtis, Kosaka and Porter fails to teach that the set of audio playback control instructions that are time-aligned to the audio content comprise playback control instructions that are executable by the playback device such that the playback device is configured to at least one of (i) increase the amplitude of the audio content within a first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech or (ii) decrease the amplitude of the audio content within a second frequency range different than the first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech. However, Kuivalainen discloses a system and method for adaptively modulating audio content (Abstract), comprising the following: (i) increase the amplitude of the audio content within a first frequency range during playback of the at least one portion of the audio content (dynamic range compression involving upward compression, [0004]) or (ii) decrease the amplitude of the audio content within a second frequency range different than the first frequency range during playback of the at least one portion of the audio content (dynamic range compression involving downward compression, [0004]).
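The upward and downward compression cited from Kuivalainen can be illustrated with a static per-band gain computation. This is a textbook threshold/ratio compression curve sketched for illustration, not code from any of the cited references; the function name and parameters are hypothetical.

```python
def compress_band_gain_db(level_db: float, threshold_db: float, ratio: float,
                          upward: bool) -> float:
    """Static dynamic-range-compression gain for one frequency band, in dB.

    Downward compression attenuates levels above the threshold; upward
    compression boosts levels below it. Applying upward compression to a
    speech band increases its amplitude, while downward compression on a
    different band decreases that band's amplitude.
    """
    if upward:
        if level_db < threshold_db:
            # Boost quiet content toward the threshold at the given ratio.
            return (threshold_db - level_db) * (1 - 1 / ratio)
        return 0.0
    if level_db > threshold_db:
        # Attenuate loud content toward the threshold at the given ratio.
        return -(level_db - threshold_db) * (1 - 1 / ratio)
    return 0.0
```

With a 2:1 ratio and a -20 dB threshold, a -30 dB speech band gets a +5 dB boost (upward compression), while a -10 dB band in another frequency range gets a -5 dB cut (downward compression).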
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the invention disclosed by the combination of Hirsch, Curtis, Kosaka and Porter in the same way that Kuivalainen’s invention has been improved to achieve the following predictable results for the purpose of improving user experience by adapting the processing of audio to be output by a playback system based on content type (Hirsch, [0001] [0007]): (i) increase the amplitude of the audio content within a first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech (Hirsch, perform dynamic range compression on speech), or (ii) decrease the amplitude of the audio content within a second frequency range different than the first frequency range during playback of the one or more first sub-portions of the audio content identified as comprising speech.

Response to Arguments

Applicant’s arguments filed on 10/08/2025 with respect to claims 1 – 8, 10 – 24 and 26 – 32 have been considered but are moot in view of the new ground(s) of rejection.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.
In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SONIA L GAY/
Primary Examiner, Art Unit 2657

Prosecution Timeline

Dec 19, 2022
Application Filed
Apr 05, 2025
Non-Final Rejection — §103
Oct 08, 2025
Response Filed
Dec 31, 2025
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602617
DATA MANUFACTURING FRAMEWORKS FOR SYNTHESIZING SYNTHETIC TRAINING DATA TO FACILITATE TRAINING A NATURAL LANGUAGE TO LOGICAL FORM MODEL
2y 5m to grant Granted Apr 14, 2026
Patent 12602408
STREAMING OF NATURAL LANGUAGE (NL) BASED OUTPUT GENERATED USING A LARGE LANGUAGE MODEL (LLM) TO REDUCE LATENCY IN RENDERING THEREOF
2y 5m to grant Granted Apr 14, 2026
Patent 12602539
PROACTIVE ASSISTANCE VIA A CASCADE OF LLMS
2y 5m to grant Granted Apr 14, 2026
Patent 12596708
SYSTEMS AND METHODS FOR AUTOMATED CODE GENERATION FOR CALCULATION BASED ON ASSOCIATED FORMAL SPECIFICATIONS
2y 5m to grant Granted Apr 07, 2026
Patent 12591604
INTELLIGENT ASSISTANT
2y 5m to grant Granted Mar 31, 2026


Prosecution Projections

3-4
Expected OA Rounds
82%
Grant Probability
93%
With Interview (+11.4%)
3y 0m
Median Time to Grant
Moderate
PTA Risk
Based on 855 resolved cases by this examiner. Grant probability derived from career allow rate.
